\(R^2\) for normal linear regression
\(R^2\), also called the coefficient of determination or multiple correlation coefficient, is defined for normal linear regression as the proportion of variance “explained” by the regression model \[\begin{equation}\label{eq:R2} R^2 = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2} \end{equation}\]
Note that under the MLE, where \(\hat{\sigma}^2 = \sum_i \left( y_i - \hat{y}_i \right)^2 / n\), the deviance (i.e., negative two times log likelihood) is \[\begin{align*} -2 l\left(\hat{\beta}\right) & = -2 \log L(\hat{\beta})\\ & = n \log(2\pi\hat{\sigma}^2) + \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\hat{\sigma}^2}\\ & = n \left[ \log\left( \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{n} \right) + \log(2\pi) + 1 \right] \end{align*}\]
- I list this derivation here to make clear that the following generalized \(R^2\) contains the classical \(R^2\) for normal linear regression as a special case
Generalized \(R^2\) by Cox and Snell
Generalized \(R^2\), proposed by Cox and Snell [1989] (and also by Magee [1990] and Maddala [1983])
The generalized \(R^2\) for more general models, where
- the concept of residual variance cannot be easily defined, and
- maximum likelihood is the criterion of fit, is \[\begin{equation} \label{eq:generalized_R2_v1} R^2 = 1 - \exp\left\{ -\frac{2}{n}\left[l\left(\hat{\beta}\right) - l(0) \right] \right\} = 1 - \left[L(0)/L\left(\hat{\beta}\right)\right]^{2/n} \end{equation}\]
Here, \(L\left(\hat{\beta}\right)\) and \(L(0)\) are the likelihoods of the fitted and null models, respectively.
For normal linear regression, this generalized \(R^2\) becomes the classical \(R^2\)
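To see this, note that the null model fits \(\hat{y}_i = \bar{y}\), so the deviance expression above gives \[\begin{align*} \frac{2}{n}\left[ l\left(\hat{\beta}\right) - l(0) \right] &= \log\left( \frac{\sum_i \left( y_i - \bar{y} \right)^2}{n} \right) - \log\left( \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{n} \right)\\ &= -\log \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}, \end{align*}\] and plugging this into Eq. \(\eqref{eq:generalized_R2_v1}\) recovers Eq. \(\eqref{eq:R2}\)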
Desirable properties of the generalized \(R^2\), as in Eq. \(\eqref{eq:generalized_R2_v1}\)
Consistent with classical \(R^2\)
Consistent with maximum likelihood as an estimation method
Asymptotically independent of the sample size \(n\)
\(1-R^2\) has an interpretation as the proportion of unexplained “variation”
- For example, if we have three nested models, from smallest to largest, \(M_1, M_2\), and \(M_3\), and \(R^2_{j, i}\) denotes the \(R^2\) of model \(M_j\) with \(M_i\) as the null model, then we have \[ (1 - R^2_{3, 1}) = (1 - R^2_{3, 2})(1 - R^2_{2, 1}) \] since each \(1 - R^2_{j, i} = \left[ L(M_i)/L(M_j) \right]^{2/n}\) and the likelihood ratios telescope (see the numerical check after this list)
- For more desirable properties (7 in total), please check out the Nagelkerke [1991] paper
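As a minimal numerical check of the telescoping identity above, here is a sketch on synthetic data using plain numpy; the data-generating coefficients and the helper names (`fit_ols`, `gaussian_loglik`, `r2_gen`) are illustrative assumptions, not anything from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(size=n)

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns fitted values."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return Xd @ beta

def gaussian_loglik(y, yhat):
    """Normal log likelihood at the MLE sigma^2 = RSS / n (cf. the deviance above)."""
    m = len(y)
    rss = np.sum((y - yhat) ** 2)
    return -0.5 * m * (np.log(2 * np.pi * rss / m) + 1)

def r2_gen(ll_big, ll_small, n):
    """Cox-Snell generalized R^2 of the bigger model against the smaller one."""
    return 1 - np.exp(-(2 / n) * (ll_big - ll_small))

# Three nested models: M1 = intercept only, M2 = first predictor, M3 = all three
ll1 = gaussian_loglik(y, fit_ols(X[:, :0], y))
ll2 = gaussian_loglik(y, fit_ols(X[:, :1], y))
ll3 = gaussian_loglik(y, fit_ols(X, y))

# (1 - R^2_{3,1}) = (1 - R^2_{3,2}) * (1 - R^2_{2,1})
lhs = 1 - r2_gen(ll3, ll1, n)
rhs = (1 - r2_gen(ll3, ll2, n)) * (1 - r2_gen(ll2, ll1, n))
print(np.isclose(lhs, rhs))  # True

# The generalized R^2 also matches the classical R^2 for this normal model
yhat3 = fit_ols(X, y)
r2_classical = 1 - np.sum((y - yhat3) ** 2) / np.sum((y - y.mean()) ** 2)
print(np.isclose(r2_gen(ll3, ll1, n), r2_classical))  # True
```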
Generalized \(R^2\) by Nagelkerke
Generalized \(R^2\), proposed by Nagelkerke [1991]
An undesirable property: for discrete models, the maximum \(R^2\) is always less than 1 \[ \max(R^2) = 1 - L(0)^{2/n} \]
- This is because the likelihood of a discrete target variable is a product of pmf values, each at most 1 (rather than pdf values, as for continuous targets), so even a perfectly fitted model attains at most \(L\left(\hat{\beta}\right) = 1\)
A new definition of the generalized \(R^2\) \[\begin{equation}\label{eq:generalized_R2_v2} \bar{R}^2 = \frac{R^2}{\max(R^2)} = \frac{1 - \left[L(0)/L\left(\hat{\beta}\right)\right]^{2/n}}{1 - L(0)^{2/n}} \end{equation}\]
The majority of the desirable properties of the generalized \(R^2\), including the ones listed on the previous page, are still satisfied by \(\bar{R}^2\) in Eq. \(\eqref{eq:generalized_R2_v2}\)
Nagelkerke’s generalized \(R^2\) seems to be a popular version. For example, the biostatistics textbook by Steyerberg [2019] uses this version
Generalized \(R^2\) for binary data
Denote the estimated binary probabilities as \(\hat{p}_i\) for the fitted model, and \(\bar{p}\) (the sample proportion of \(y_i = 1\)) for the null model
Cox and Snell \(R^2\) \[ R^2 = 1 - \left[L(0)/L\left(\hat{\beta}\right)\right]^{2/n} = 1 - \left[ \prod_i \left(\frac{\bar{p}}{\hat{p}_i} \right)^{y_i} \left(\frac{1-\bar{p}}{1-\hat{p}_i} \right)^{1-y_i}\right]^{2/n} \]
Nagelkerke \(R^2\) \[ \bar{R}^2 = \frac{1 - \left[L(0)/L\left(\hat{\beta}\right)\right]^{2/n}} {1 - L(0)^{2/n}} = \frac{1 - \left[\prod_i \left(\frac{\bar{p}}{\hat{p}_i} \right)^{y_i} \left(\frac{1-\bar{p}}{1-\hat{p}_i} \right)^{1-y_i}\right]^{2/n}} {1 - \left[\prod_i \bar{p}^{y_i} \left(1-\bar{p}\right)^{1-y_i}\right]^{2/n}} \]
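As a minimal sketch of these formulas in code, assuming synthetic data and scikit-learn’s `LogisticRegression` with a very large `C` to approximate the unpenalized MLE (the data-generating model and variable names below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * x[:, 0]))))

# Fitted probabilities p_hat_i; a huge C makes the fit effectively unpenalized
p_hat = LogisticRegression(C=1e10).fit(x, y).predict_proba(x)[:, 1]
p_bar = y.mean()  # null-model probability: the sample proportion of ones

# Log likelihoods of the fitted and null models
ll_fit = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
ll_null = np.sum(y * np.log(p_bar) + (1 - y) * np.log(1 - p_bar))

# Cox and Snell: R^2 = 1 - [L(0) / L(beta_hat)]^(2/n)
r2_cs = 1 - np.exp(-(2 / n) * (ll_fit - ll_null))

# Nagelkerke: rescale by max(R^2) = 1 - L(0)^(2/n)
r2_nag = r2_cs / (1 - np.exp((2 / n) * ll_null))

print(f"Cox-Snell R^2:  {r2_cs:.3f}")
print(f"Nagelkerke R^2: {r2_nag:.3f}")
```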
References
Nagelkerke, N. J. D. (1991). A Note on a General Definition of the Coefficient of Determination. Biometrika, 78(3), 691-692.
A nice comparison of different versions of generalized \(R^2\): https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
Steyerberg, E. W. (2019). Clinical prediction models. Springer International Publishing.