
### Notations: in binary classification

We are interested in fitting a model \(q(\mathbf{x})\) for the true conditional class 1 probability \[ \eta(\mathbf{x}) = P(Y = 1 \mid \mathbf{X} = \mathbf{x}) \]

Two types of problems

- Classification: estimating a region of the form \(\{\eta(\mathbf{x}) > c\}\)
- Class probability estimation: approximate \(\eta(\mathbf{x})\) by fitting a model \(q(\mathbf{x}, \beta)\), where \(\beta\) are parameters to be estimated

Surrogate criteria for estimation, e.g.,

- Log-loss: \(L(y \mid q) = -y \log(q) - (1-y) \log (1-q)\)
- Squared error loss: \(L(y \mid q) = (y-q)^2 = y (1-q)^2 + (1-y) q^2\)
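As a quick illustration (not from the slides; the function names are ad hoc), both surrogate losses can be computed directly from their definitions:

```python
import numpy as np

def log_loss(y, q):
    """Average log-loss: L(y | q) = -y log(q) - (1 - y) log(1 - q)."""
    y, q = np.asarray(y, float), np.asarray(q, float)
    return np.mean(-y * np.log(q) - (1 - y) * np.log(1 - q))

def squared_error_loss(y, q):
    """Average squared error (Brier) loss: L(y | q) = (y - q)^2."""
    y, q = np.asarray(y, float), np.asarray(q, float)
    return np.mean((y - q) ** 2)

y = [1, 0, 1, 1]
q = [0.9, 0.2, 0.7, 0.6]
print(log_loss(y, q), squared_error_loss(y, q))
```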

Surrogate criteria of classification are exactly the primary criteria of class probability estimation

# Proper Scoring Rules

### Proper scoring rule

Fitting a binary model amounts to minimizing a loss function \[ \mathcal{L}\left(q(\cdot)\right) = \frac{1}{N}\sum_{n=1}^N L(y_n \mid q_n) \]

In game theory, the agent’s goal is to maximize expected score (or minimize expected loss)

- A scoring rule is proper if truthfulness maximizes expected score
- It is strictly proper if truthfulness uniquely maximizes expected score

In the context of binary response data, Fisher consistency holds pointwise if \[ \arg\min_{q\in [0, 1]} E_{Y\sim \text{Bernoulli}(\eta)}L(Y\mid q) = \eta, \quad \forall \eta \in [0, 1] \]

Fisher consistency is the defining property of proper scoring rules
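Fisher consistency is easy to check numerically: for each \(\eta\), the expected log-loss over a fine grid of \(q\) should be minimized at \(q \approx \eta\). A small sketch (grid resolution chosen arbitrarily):

```python
import numpy as np

q = np.linspace(0.001, 0.999, 999)  # grid of candidate predictions
for eta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # Expected log-loss R(eta | q) under Y ~ Bernoulli(eta)
    risk = -eta * np.log(q) - (1 - eta) * np.log(1 - q)
    # The grid argmin recovers eta, as Fisher consistency demands
    assert abs(q[np.argmin(risk)] - eta) < 1e-2
```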

### Visualization of two proper scoring rules

- Left: log-loss, or Beta loss with \(\alpha=\beta=0\)
- Right: Beta loss with \(\alpha=1, \beta=3\)

- Tailored for classification with false positive cost \(c = \frac{\alpha}{\alpha+\beta} = 0.25\) and false negative cost \(1-c = 0.75\)

# Commonly Used Proper Scoring Rules

### How to check whether a scoring rule is proper for binary response data

Suppose the partial losses \(L_1(1-q)\) and \(L_0(q)\) are smooth, and write the expected loss as \(R(\eta \mid q) = \eta L_1(1-q) + (1-\eta) L_0(q)\). Then the proper scoring rule property implies \[\begin{align*} 0 & = \left.\frac{\partial}{\partial q}\right|_{q=\eta} R(\eta \mid q)\\ & = -\eta L_1'(1-\eta) + (1-\eta) L_0'(\eta) \end{align*}\]

Therefore, a scoring rule is proper if \[ \eta L_1'(1-\eta) = (1-\eta) L_0'(\eta) \]

A scoring rule is strictly proper if \[ \left.\frac{\partial^2}{\partial q^2}\right|_{q=\eta} R(\eta \mid q) > 0 \]
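The first-order condition can be checked numerically with finite differences; a sketch (the helper name `check_propriety` is made up), writing each partial loss as a function of its own argument:

```python
import numpy as np

def check_propriety(L1, L0, h=1e-6):
    """Numerically verify eta * L1'(1 - eta) == (1 - eta) * L0'(eta) on a grid,
    using central finite differences for the derivatives."""
    eta = np.linspace(0.05, 0.95, 19)
    d_L1 = (L1(1 - eta + h) - L1(1 - eta - h)) / (2 * h)  # L1'(1 - eta)
    d_L0 = (L0(eta + h) - L0(eta - h)) / (2 * h)          # L0'(eta)
    return np.allclose(eta * d_L1, (1 - eta) * d_L0, atol=1e-4)

# Log-loss: L1(1-q) = -log q, so as a function of its argument s, L1(s) = -log(1-s)
assert check_propriety(lambda s: -np.log(1 - s), lambda s: -np.log(1 - s))
# Squared error: L1(s) = L0(s) = s^2
assert check_propriety(lambda s: s**2, lambda s: s**2)
# Absolute loss: L1(s) = L0(s) = s fails the condition -- it is not proper
assert not check_propriety(lambda s: s, lambda s: s)
```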

### Log-loss

Log-loss is the negative log likelihood of the Bernoulli distribution \[ \mathcal{L} = \frac{1}{N}\sum_{n=1}^N\left[-y_n \log(q_n) - (1-y_n) \log(1 - q_n) \right] \]

Partial losses for log-loss \[ L_1(1-q) = -\log(q), \quad L_0(q) = -\log(1-q) \]

Expected loss for log-loss \[ R(\eta \mid q) = -\eta \log(q) - (1-\eta)\log(1-q) \]

Log-loss is a strictly proper scoring rule

### Squared error loss

Squared error loss is also known as Brier score \[ \mathcal{L} = \frac{1}{N}\sum_{n=1}^N\left[y_n (1 - q_n)^2 + (1-y_n)q_n^2 \right] \]

Partial losses for squared error loss \[ L_1(1-q) = (1 - q)^2, \quad L_0(q) = q^2 \]

Expected loss for squared error loss \[ R(\eta \mid q) = \eta (1 - q)^2 + (1-\eta)q^2 \]

Squared error loss is a strictly proper scoring rule

### Misclassification loss

Usually, misclassification loss uses \(c=0.5\) as the cutoff \[ \mathcal{L} = \frac{1}{N}\sum_{n=1}^N\left[y_n \mathbf{1}_{\{q_n \leq 0.5\}} + (1-y_n)\mathbf{1}_{\{q_n > 0.5\}} \right] \]

Partial losses for misclassification loss \[ L_1(1-q) = \mathbf{1}_{\{q \leq 0.5\}}, \quad L_0(q) = \mathbf{1}_{\{q > 0.5\}} \]

Expected loss for misclassification loss \[ R(\eta \mid q) = \eta \mathbf{1}_{\{q \leq 0.5\}} + (1-\eta) \mathbf{1}_{\{q > 0.5\}} \]

Since any \(q>0.5\) for events and any \(q \leq 0.5\) for non-events minimizes the misclassification loss, misclassification loss is a proper scoring rule, but it is not strictly proper
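A tiny numerical illustration of the non-strictness (with \(\eta = 0.8\) chosen arbitrarily): every \(q\) on the correct side of the cutoff attains the same expected loss, so the minimizer is not unique.

```python
eta = 0.8  # true class-1 probability (arbitrary, > 0.5)

def risk(q):
    # R(eta | q) = eta * 1{q <= 0.5} + (1 - eta) * 1{q > 0.5}
    return eta * (q <= 0.5) + (1 - eta) * (q > 0.5)

# Any q above the cutoff is optimal -- the minimum is attained on a whole interval
assert risk(0.51) == risk(0.8) == risk(0.99)
# Predicting on the wrong side of the cutoff costs more
assert risk(0.4) > risk(0.8)
```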

### A counter-example of proper scoring rule: absolute loss

Because \(y\in \{0, 1\}\), the absolute deviation \(L(y\mid q) = |y-q|\) becomes \[\begin{align*} L(y\mid q) & = y(1-q) + (1-y) q\\ R(\eta \mid q)& =\eta(1-q) + (1-\eta) q \end{align*}\]

Absolute deviation is not a proper scoring rule, because \(R(\eta \mid q)\) is minimized by \(q=1\) for \(\eta > 1/2\), and \(q=0\) for \(\eta < 1/2\)
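The failure is easy to see numerically: the expected absolute loss is linear in \(q\), so its minimizer jumps to an endpoint instead of tracking \(\eta\):

```python
import numpy as np

q = np.linspace(0, 1, 101)
for eta in [0.3, 0.7]:
    risk = eta * (1 - q) + (1 - eta) * q  # linear in q
    q_star = q[np.argmin(risk)]
    print(eta, q_star)  # 0.0 when eta < 1/2, 1.0 when eta > 1/2 -- never eta itself
```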

# Structure of Proper Scoring Rules

### Structure of proper scoring rules

**Theorem**: Let \(\omega(dt)\) be a positive measure on \((0, 1)\) that is finite on the intervals \((\epsilon, 1-\epsilon), \forall \epsilon > 0\). Then the following defines a proper scoring rule: \[ L_1(1-q) = \int_q^{f_1} (1-t)\omega(dt), \quad L_0(q) = \int_{f_0}^q t\omega(dt) \]

The scoring rule is strictly proper iff \(\omega(dt)\) has non-zero mass on every open interval of \((0, 1)\)

The fixed limits \(f_0\geq 0\) and \(f_1 \leq 1\) are somewhat arbitrary

Note that for log-loss, \(L_1(1-q)\) is unbounded (tends to infinity) as \(q \to 0\), and \(L_0(q)\) is unbounded as \(q \to 1\)

Except for log-loss, all other common proper scoring rules have bounded partial losses, which requires a finite weight measure \[ \int_0^1 \omega(dt) < \infty \] Log-loss corresponds to \(\omega(dt) = dt / \{t(1-t)\}\), which has infinite mass near both endpoints
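The representation can be sanity-checked numerically for log-loss, whose weight is \(\omega(dt) = dt/\{t(1-t)\}\) (taking \(f_0 = 0, f_1 = 1\)); a sketch using `scipy.integrate.quad`:

```python
import numpy as np
from scipy.integrate import quad

omega = lambda t: 1.0 / (t * (1.0 - t))  # log-loss weight function
for q in [0.2, 0.5, 0.9]:
    # L1(1 - q) = int_q^1 (1 - t) omega(t) dt should recover -log(q)
    L1, _ = quad(lambda t: (1 - t) * omega(t), q, 1)
    # L0(q) = int_0^q t omega(t) dt should recover -log(1 - q)
    L0, _ = quad(lambda t: t * omega(t), 0, q)
    assert abs(L1 - (-np.log(q))) < 1e-8
    assert abs(L0 - (-np.log(1 - q))) < 1e-8
```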

# Proper scoring rules are mixtures of cost-weighted misclassification losses

### Connection between the false positive (FP) / false negative (FN) costs and the classification cutoff

Suppose the costs of FP and FN sum up to 1:

- FP: has a cost \(c\), and expected cost \(c P(Y=0) = c(1-\eta)\)
- FN: has a cost \(1 - c\), and expected cost \((1-c) P(Y=1) = (1-c)\eta\)

The optimal classification is therefore class 1 iff \[ (1-c)\eta \geq c(1-\eta) \Longleftrightarrow \eta \geq c \]

- Since we don’t know the truth \(\eta\), we classify as class 1 when \(q \geq c\)

Therefore, the classification cutoff equals \[ \frac{\text{cost of FP}}{\text{cost of FP} + \text{cost of FN}} \]

- Standard classification assumes costs of FP and FN are the same, so the classification cutoff is \(0.5\)
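A one-line helper (hypothetical, simply restating the formula above) makes the cost-to-cutoff mapping concrete:

```python
def cutoff(cost_fp, cost_fn):
    """Classification cutoff = cost(FP) / (cost(FP) + cost(FN))."""
    return cost_fp / (cost_fp + cost_fn)

print(cutoff(1, 1))  # 0.5: equal costs give the standard cutoff
print(cutoff(3, 1))  # 0.75: costly false positives demand more confidence for class 1
print(cutoff(1, 3))  # 0.25: costly false negatives lower the bar for class 1
```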

### Cost-weighted misclassification errors

Cost-weighted misclassification errors: \[\begin{align*} L_c(y\mid q) & = y (1-c) \cdot \mathbf{1}_{\{q \leq c\}} + (1-y) c \cdot \mathbf{1}_{\{q > c\}} \\ L_{1, c}(1-q) &= (1-c) \cdot \mathbf{1}_{\{q \leq c\}}, \quad L_{0, c}(q) = c \cdot \mathbf{1}_{\{q > c\}} \end{align*}\]

**Shuford-Albert-Massengill-Savage-Schervish theorem**: an integral representation of proper scoring rules \[ L(y\mid q) = \int_0^1 L_c(y\mid q) \omega(dc) = \int_0^1 L_c(y\mid q) \omega(c) dc \]

- The second equality holds if \(\omega(dc)\) is absolutely continuous wrt Lebesgue measure
- This can be used to tailor losses to specific classification problems with cutoffs on \(\eta(\mathbf{x})\) other than \(1/2\), by designing suitable weight functions \(\omega(\cdot)\)
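The mixture representation can be verified numerically for the Beta-family weight from the earlier slides (\(\alpha = 1, \beta = 3\) as an example): integrating the cost-weighted misclassification losses against \(\omega\) reproduces the Beta-family partial losses.

```python
import numpy as np
from scipy.integrate import quad

a, b = 1.0, 3.0  # alpha, beta
omega = lambda c: c**(a - 1) * (1 - c)**(b - 1)  # Beta-family weight

def L_mixture(y, q):
    """Integrate the cost-weighted misclassification loss L_c over cutoffs c."""
    Lc = lambda c: (y * (1 - c) * (q <= c) + (1 - y) * c * (q > c)) * omega(c)
    val, _ = quad(Lc, 0, 1, points=[q])  # integrand jumps at c = q
    return val

def L_beta(y, q):
    """Beta-family loss evaluated directly from its integral definition."""
    L1, _ = quad(lambda t: t**(a - 1) * (1 - t)**b, q, 1)
    L0, _ = quad(lambda t: t**a * (1 - t)**(b - 1), 0, q)
    return y * L1 + (1 - y) * L0

for y in (0, 1):
    for q in (0.2, 0.6, 0.9):
        assert abs(L_mixture(y, q) - L_beta(y, q)) < 1e-8
```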

The paper proposes to use Iterative Reweighted Least Squares (IRLS) to fit linear models with proper scoring rules

# Beta Family of Proper Scoring Rules

### Beta family of proper scoring rules

The paper introduces a flexible two-parameter family of proper scoring rules via the weight function \[ \omega(t) = t^{\alpha-1} (1-t)^{\beta-1}, \quad \text{where } \alpha > -1, \beta > -1 \]

**Loss function of the Beta family proper scoring rules**\[\begin{align*}L(y \mid q) =~& y \int_q^1 t^{\alpha-1} (1-t)^{\beta} dt + (1-y)\int_0^q t^{\alpha} (1-t)^{\beta-1} dt\\ =~& y B(\alpha, \beta+1) \left[1 - I_q(\alpha, \beta+1)\right]\\ & + (1-y) B(\alpha+1, \beta) I_q(\alpha+1, \beta) \end{align*}\]

- See the definitions of \(B(a, b)\) and \(I_x(a, b)\) on the next page

Log-loss and squared error loss are special cases

- Log-loss: \(\alpha=\beta=0\)
- Squared error loss: \(\alpha=\beta=1\)
- Misclassification loss: \(\alpha=\beta \rightarrow \infty\)
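A direct implementation of this loss via SciPy's regularized incomplete Beta function (`scipy.special.betainc`); as a sanity check, \(\alpha = \beta = 1\) reproduces squared error loss up to the constant factor \(1/2\):

```python
import numpy as np
from scipy.special import beta as B, betainc

def beta_loss(y, q, a, b):
    """Beta-family loss L(y | q) using B(a, b) and the regularized I_q(a, b)."""
    L1 = B(a, b + 1) * (1 - betainc(a, b + 1, q))  # partial loss when y = 1
    L0 = B(a + 1, b) * betainc(a + 1, b, q)        # partial loss when y = 0
    return y * L1 + (1 - y) * L0

q = np.linspace(0.01, 0.99, 50)
# alpha = beta = 1 gives squared error loss scaled by 1/2
assert np.allclose(beta_loss(1, q, 1, 1), (1 - q)**2 / 2)
assert np.allclose(beta_loss(0, q, 1, 1), q**2 / 2)
```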

### Special functions and Python / R implementations

Beta function \[ B(a, b) = \int_0^1 t^{a-1} (1-t)^{b-1} dt %= \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} \]

- Python implementation:
`scipy.special.beta(a,b)`

- R implementation:
`beta(a, b)`

Regularized incomplete Beta function \[\begin{align*} I_x(a, b) %& = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)}\int_0^x t^{a-1} (1-t)^{b-1} dt \\ & = \frac{1}{B(a, b)} \int_0^x t^{a-1} (1-t)^{b-1} dt \end{align*}\]

- Python implementation:
`scipy.special.betainc(a, b, x)`

- R implementation:
`pbeta(x, a, b)`

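A small sanity check tying the two functions to their definitions; `scipy.stats.beta.cdf` plays the role of R's `pbeta`:

```python
from math import gamma
from scipy.special import beta, betainc
from scipy.stats import beta as beta_dist

a, b, x = 2.5, 4.0, 0.3
# B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
assert abs(beta(a, b) - gamma(a) * gamma(b) / gamma(a + b)) < 1e-10
# scipy's betainc is already regularized: it equals the Beta(a, b) CDF,
# which is the quantity R's pbeta(x, a, b) returns
assert abs(betainc(a, b, x) - beta_dist.cdf(x, a, b)) < 1e-10
```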

### Tailor proper scoring rules for cost-weighted misclassification

We can use \(\alpha\neq \beta\) when the FP and FN costs are not viewed as equal

Since the Beta family proper scoring rule amounts to placing a Beta\((\alpha, \beta)\) weight on the FP cost \(c\), we can use mean/variance matching to elicit \(\alpha\) and \(\beta\) \[\begin{align*} \mu & = \frac{\alpha}{\alpha + \beta} = c\\ \sigma^2 &= \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta+1)} = \frac{c(1-c)}{\alpha + \beta+1} \end{align*}\]

Alternatively, we can match the mode (valid when \(\alpha, \beta > 1\)) \[ c = q_{\text{mode}} = \frac{\alpha-1}{\alpha+\beta -2} \]
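A sketch of both elicitations (helper names are made up; `nu` denotes the concentration \(\alpha + \beta\)):

```python
def elicit_mean(c, nu):
    """Mean matching: alpha / (alpha + beta) = c, with concentration nu = alpha + beta."""
    return c * nu, (1 - c) * nu

def elicit_mode(c, nu):
    """Mode matching: (alpha - 1) / (alpha + beta - 2) = c, requires nu > 2."""
    alpha = 1 + c * (nu - 2)
    return alpha, nu - alpha

print(elicit_mean(0.25, 4))  # (1.0, 3.0): the alpha = 1, beta = 3 rule seen earlier
print(elicit_mode(0.25, 4))  # (1.5, 2.5): the mode of Beta(1.5, 2.5) sits at 0.25
```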

# Examples

### A simulation example

In simulated data with bivariate \(\mathbf{x}\), the decision boundaries for different levels of \(\eta\) are not parallel (grey lines)

The logit-link Beta family linear model with \(\alpha = 6, \beta = 14\) estimates the \(c=0.3\) classification boundary better than logistic regression does
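The slides do not include code, but a minimal sketch of the idea is easy to write: simulate a hypothetical non-linear \(\eta(\mathbf{x})\), then fit a logit-link linear model by directly minimizing the Beta-family loss with a generic optimizer. Everything below, including the data-generating process, is invented for illustration (the paper fits such models with IRLS instead):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import beta as B, betainc, expit

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
# Hypothetical truth whose level curves of eta are not parallel (interaction term)
eta = expit(1.0 + 2.0 * X[:, 0] - X[:, 1] + X[:, 0] * X[:, 1])
y = rng.binomial(1, eta)

def beta_loss(y, q, a, b):
    """Mean Beta-family loss via the regularized incomplete Beta function."""
    L1 = B(a, b + 1) * (1 - betainc(a, b + 1, q))
    L0 = B(a + 1, b) * betainc(a + 1, b, q)
    return np.mean(y * L1 + (1 - y) * L0)

def fit_linear(a, b):
    """Minimize the Beta-family loss over logit-linear models q = expit(Z w)."""
    Z = np.column_stack([np.ones(n), X])
    obj = lambda w: beta_loss(y, np.clip(expit(Z @ w), 1e-9, 1 - 1e-9), a, b)
    return minimize(obj, np.zeros(3), method="Nelder-Mead").x

# alpha = 6, beta = 14 concentrates the weight near c = 6 / 20 = 0.3
w = fit_linear(6, 14)
print(w)
```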

### On the Pima Indians diabetes data

Comparing logistic regression with a proper scoring rule tailored for high class 1 probabilities: \(\alpha = 9, \beta = 1\).

Black lines: empirical QQ curves of cost-weighted misclassification losses computed on 200 randomly selected test sets

### References

Buja, A., Stuetzle, W., & Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications. Working draft, November, 3. http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf

For a game theory definition of proper scoring rule, see https://www.cis.upenn.edu/~aaroth/courses/slides/agt17/lect23.pdf

Fitting linear models with custom loss functions in Python: https://alex.miller.im/posts/linear-model-custom-loss-function-regularization-python/

Fitting XGBoost with custom loss functions in Python: https://xgboost.readthedocs.io/en/latest/tutorials/custom_metric_obj.html