Proper Scoring Rules
Commonly Used Proper Scoring Rules
Structure of Proper Scoring Rules
Proper scoring rules are mixtures of cost-weighted misclassification losses
Beta Family of Proper Scoring Rules
Examples


Notations: in binary classification

We are interested in fitting a model $q (x)$ for the true conditional class 1 probability $η (x) = P (Y = 1 ∣ X = x)$
Two types of problems
- Classification: estimating a region of the form $\{η (x) > c\}$
- Class probability estimation: approximate $η (x)$ , by fitting a model $q (x, β)$ , where $β$ are parameters to be estimated
Surrogate criteria for estimation, e.g.,
- Log-loss: $L (y ∣ q) = - y \log (q) - (1 - y) \log (1 - q)$
- Squared error loss: $L (y ∣ q) = (y - q)^{2} = y (1 - q)^{2} + (1 - y) q^{2}$
Surrogate criteria of classification are exactly the primary criteria of class probability estimation

Proper Scoring Rules

Proper scoring rule

Fitting a binary model amounts to minimizing a loss function $L (q (\cdot)) = \frac{1}{N} \sum_{n = 1}^{N} L (y_{n} ∣ q_{n})$
In game theory, the agent’s goal is to maximize expected score (or minimize expected loss)
- A scoring rule is proper if truthfulness maximizes expected score
- It is strictly proper if truthfulness uniquely maximizes expected score
In the context of binary response data, Fisher consistency holds pointwise if $\arg\min_{q \in [0, 1]} E_{Y \sim \text{Bernoulli} (η)} L (Y ∣ q) = η, \forall η \in [0, 1]$
Fisher consistency is the defining property of proper scoring rules
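Fisher consistency can be checked numerically. The sketch below is a minimal illustration, not code from the paper: the function names and the grid resolution are assumptions made here. It minimizes the expected loss over a grid of forecasts $q$ and compares the minimizer with $η$ for log-loss and squared error loss.

```python
import numpy as np

def expected_loss(eta, q, loss):
    """R(eta | q) = eta * L(1 | q) + (1 - eta) * L(0 | q)."""
    return eta * loss(1, q) + (1 - eta) * loss(0, q)

def log_loss(y, q):
    return -y * np.log(q) - (1 - y) * np.log(1 - q)

def squared_error(y, q):
    return (y - q) ** 2

def best_forecast(eta, loss, grid=np.linspace(0.001, 0.999, 999)):
    """Grid point q that minimizes the expected loss for a given eta."""
    return grid[np.argmin(expected_loss(eta, grid, loss))]

# For a proper scoring rule the minimizing forecast should sit at eta
# (up to the grid spacing of 0.001).
for eta in (0.1, 0.25, 0.5, 0.8):
    print(eta, best_forecast(eta, log_loss), best_forecast(eta, squared_error))
```

Both columns should track $η$ to within the grid spacing, which is what Fisher consistency requires.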

Visualization of two proper scoring rules

Left: log-loss, or Beta loss with $α = β = 0$
Right: Beta loss with $α = 1, β = 3$
- Tailored for classification with false positive cost $c = \frac{α}{α + β} = 0.25$ and false negative cost $1 - c = 0.75$

Commonly Used Proper Scoring Rules

How to check property of a scoring rule for binary response data

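Write the loss as $L (y ∣ q) = y L_{1} (1 - q) + (1 - y) L_{0} (q)$ , so the expected loss under $Y \sim \text{Bernoulli} (η)$ is $R (η ∣ q) = η L_{1} (1 - q) + (1 - η) L_{0} (q)$
Suppose the partial losses $L_{1} (1 - q), L_{0} (q)$ are smooth, then the proper scoring rule property implies $\begin{aligned} 0 & = {\frac{\partial}{\partial q} \Big|}_{q = η} R (η ∣ q) \\ & = - η L_{1}^{'} (1 - η) + (1 - η) L_{0}^{'} (η) \end{aligned}$
Therefore, a smooth proper scoring rule must satisfy $η L_{1}^{'} (1 - η) = (1 - η) L_{0}^{'} (η)$ for all $η \in (0, 1)$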
A scoring rule is strictly proper if ${\frac{\partial^{2}}{\partial q^{2}} |}_{q = η} R (η ∣ q) > 0$

Log-loss

Log-loss is the negative log likelihood of the Bernoulli distribution $L = \frac{1}{N} \sum_{n = 1}^{N} [- y_{n} \log (q_{n}) - (1 - y_{n}) \log (1 - q_{n})]$
Partial losses for log-loss $L_{1} (1 - q) = - \log (q), L_{0} (q) = - \log (1 - q)$
Expected loss for log-loss $R (η ∣ q) = - η \log (q) - (1 - η) \log (1 - q)$
Log-loss is a strictly proper scoring rule
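As a worked check of the two derivative conditions (a sketch that differentiates the expected loss above directly): $\begin{aligned} \frac{\partial}{\partial q} R (η ∣ q) & = - \frac{η}{q} + \frac{1 - η}{1 - q} \\ \frac{\partial^{2}}{\partial q^{2}} R (η ∣ q) & = \frac{η}{q^{2}} + \frac{1 - η}{(1 - q)^{2}} \end{aligned}$
At $q = η$ the first derivative equals $- 1 + 1 = 0$ , and the second derivative equals $\frac{1}{η} + \frac{1}{1 - η} = \frac{1}{η (1 - η)} > 0$ for $η \in (0, 1)$ , so $q = η$ is the unique minimizer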

Squared error loss

Squared error loss is also known as the Brier score $L = \frac{1}{N} \sum_{n = 1}^{N} [y_{n} (1 - q_{n})^{2} + (1 - y_{n}) q_{n}^{2}]$
Partial losses for squared error loss $L_{1} (1 - q) = (1 - q)^{2}, L_{0} (q) = q^{2}$
Expected loss for squared error loss $R (η ∣ q) = η (1 - q)^{2} + (1 - η) q^{2}$
Squared error loss is a strictly proper scoring rule

Misclassification loss

Usually, misclassification loss uses $c = 0.5$ as the cutoff $L = \frac{1}{N} \sum_{n = 1}^{N} [y_{n} 1_{\{q_{n} \leq 0.5\}} + (1 - y_{n}) 1_{\{q_{n} > 0.5\}}]$
Partial losses for misclassification loss $L_{1} (1 - q) = 1_{\{q \leq 0.5\}}, L_{0} (q) = 1_{\{q > 0.5\}}$
Expected loss for misclassification loss $R (η ∣ q) = η 1_{\{q \leq 0.5\}} + (1 - η) 1_{\{q > 0.5\}}$
When $η > 0.5$ , every $q > 0.5$ minimizes the expected loss, and when $η < 0.5$ , every $q \leq 0.5$ does. The truthful forecast $q = η$ is therefore a minimizer but not the unique one, so misclassification loss is a proper scoring rule, but it is not strictly proper

A counter-example of proper scoring rule: absolute loss

Because $y \in \{0, 1\}$ , the absolute deviation $L (y ∣ q) = | y - q |$ becomes $\begin{aligned} L (y ∣ q) & = y (1 - q) + (1 - y) q \\ R (η ∣ q) & = η (1 - q) + (1 - η) q \end{aligned}$
Absolute deviation is not a proper scoring rule: $R (η ∣ q) = η + (1 - 2 η) q$ is linear in $q$ , so it is minimized by $q = 1$ for $η > 1 / 2$ , and by $q = 0$ for $η < 1 / 2$ , rather than by $q = η$
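The failure of absolute loss is easy to see numerically. In this minimal sketch (function names and grid are assumptions made for illustration), the minimizing forecast jumps to an endpoint of $[0, 1]$ instead of tracking $η$.

```python
import numpy as np

def expected_abs_loss(eta, q):
    """R(eta | q) = eta * (1 - q) + (1 - eta) * q for L(y | q) = |y - q|."""
    return eta * (1 - q) + (1 - eta) * q

grid = np.linspace(0.0, 1.0, 101)

def best_forecast_abs(eta):
    """Grid point q that minimizes the expected absolute loss."""
    return grid[np.argmin(expected_abs_loss(eta, grid))]

# The minimizer is an endpoint, not eta itself.
for eta in (0.2, 0.4, 0.6, 0.9):
    print(eta, best_forecast_abs(eta))
```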

Structure of Proper Scoring Rules

Structure of proper scoring rules

Theorem: Let $ω (d t)$ be a positive measure on $(0, 1)$ that is finite on intervals $(ϵ, 1 - ϵ), \forall ϵ > 0$ . Then the following defines a proper scoring rule: $L_{1} (1 - q) = \int_{q}^{f_{1}} (1 - t) ω (d t), L_{0} (q) = \int_{f_{0}}^{q} t ω (d t)$
The proper scoring rule is strict iff $ω (d t)$ has non-zero mass on every open interval of (0, 1)
The fixed limits $f_{0} \geq 0$ and $f_{1} \leq 1$ are somewhat arbitrary
Note that for log-loss, $L_{1} (1 - q) = - \log (q)$ is unbounded (goes to infinity) as $q \to 0$ , and $L_{0} (q) = - \log (1 - q)$ is unbounded as $q \to 1$
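Log-loss has unbounded partial losses, yet like the other common proper scoring rules it satisfies $\int_{0}^{1} t (1 - t) ω (d t) < \infty$ (for log-loss, $ω (d t) = \frac{d t}{t (1 - t)}$ , so the integral equals 1)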

Proper scoring rules are mixtures of cost-weighted misclassification losses

Connection between the false positive (FP) / false negative (FN) costs and the classification cutoff

Suppose the costs of FP and FN sum up to 1:
- FP: has a cost $c$ , and expected cost $c P (Y = 0) = c (1 - η)$
- FN: has a cost $1 - c$ , and expected cost $(1 - c) P (Y = 1) = (1 - c) η$
The optimal classification is therefore class 1 iff $(1 - c) η \geq c (1 - η) ⟺ η \geq c$
- Since we don’t know the truth $η$ , we classify as class 1 when $q \geq c$
Therefore, the classification cutoff equals $\frac{\text{cost of FP}}{\text{cost of FP} + \text{cost of FN}}$
- Standard classification assumes costs of FP and FN are the same, so the classification cutoff is $0.5$
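The cutoff rule above can be sketched in a few lines. The function names are hypothetical, and the costs do not need to sum to 1 because only their ratio matters.

```python
def cutoff_from_costs(cost_fp, cost_fn):
    """Classification cutoff c = cost_fp / (cost_fp + cost_fn)."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)

def classify(q, cost_fp, cost_fn):
    """Predict class 1 when the estimated probability q reaches the cutoff."""
    return int(q >= cutoff_from_costs(cost_fp, cost_fn))

print(cutoff_from_costs(1.0, 1.0))  # equal costs: 0.5
print(cutoff_from_costs(1.0, 3.0))  # a false negative costs three times as much: 0.25
print(classify(0.3, 1.0, 3.0))      # 0.3 >= 0.25, so class 1
```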

Cost-weighted misclassification errors

Cost-weighted misclassification errors: $\begin{aligned} L_{c} (y ∣ q) & = y (1 - c) \cdot 1_{\{q \leq c\}} + (1 - y) c \cdot 1_{\{q > c\}} \\ L_{1, c} (1 - q) & = (1 - c) \cdot 1_{\{q \leq c\}}, \quad L_{0, c} (q) = c \cdot 1_{\{q > c\}} \end{aligned}$
Shuford-Albert-Massengill-Savage-Schervish theorem: an integral representation of proper scoring rules $L (y ∣ q) = \int_{0}^{1} L_{c} (y ∣ q) ω (d c) = \int_{0}^{1} L_{c} (y ∣ q) ω (c) d c$
- The second equality holds if $ω (d c)$ is absolutely continuous with respect to Lebesgue measure
- This can be used to tailor losses to specific classification problems whose cutoffs on $η (x)$ differ from $1 / 2$ , by designing suitable weight functions $ω ()$
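The integral representation can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the constant weight $ω (c) = 2$ , chosen here because the mixture then works out to squared error loss, and it approximates the integral with a midpoint rule. The function names and the number of grid points are not from the paper.

```python
import numpy as np

def cost_weighted_loss(y, q, c):
    """L_c(y | q) = y (1 - c) 1{q <= c} + (1 - y) c 1{q > c}."""
    return y * (1 - c) * (q <= c) + (1 - y) * c * (q > c)

def mixture_loss(y, q, weight, n=200_000):
    """Midpoint-rule approximation of the integral of L_c(y | q) * weight(c) over (0, 1)."""
    c = (np.arange(n) + 0.5) / n
    return np.mean(cost_weighted_loss(y, q, c) * weight(c))

def constant_weight(c):
    """Assumed weight function omega(c) = 2 for all c."""
    return 2.0 * np.ones_like(c)

# Each mixture value should be close to the squared error (y - q)**2.
for y, q in [(1, 0.3), (0, 0.3), (1, 0.8), (0, 0.8)]:
    print(y, q, mixture_loss(y, q, constant_weight), (y - q) ** 2)
```

Replacing `constant_weight` with a weight that concentrates mass near a chosen cutoff is the tailoring idea described above.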
The paper proposes Iteratively Reweighted Least Squares (IRLS) to fit linear models with proper scoring rules

Beta Family of Proper Scoring Rules

Beta family of proper scoring rules

The paper introduces a flexible 2-parameter family of proper scoring rules $ω (t) = t^{α - 1} (1 - t)^{β - 1}, \text{ where } α > - 1, β > - 1$
Loss function of the Beta family proper scoring rules $\begin{aligned} L (y ∣ q) & = y \int_{q}^{1} t^{α - 1} (1 - t)^{β} d t + (1 - y) \int_{0}^{q} t^{α} (1 - t)^{β - 1} d t \\ & = y B (α, β + 1) [1 - I_{q} (α, β + 1)] + (1 - y) B (α + 1, β) I_{q} (α + 1, β) \end{aligned}$
- See the definitions of $B (a, b)$ and $I_{x} (a, b)$ in the next section
Special and limiting cases
- Log-loss: $α = β = 0$
- Squared error loss: $α = β = 1$ (up to a constant factor of $1 / 2$ )
- Misclassification loss: $α = β \to \infty$

Special functions and Python / R implementations

Beta function $B (a, b) = \int_{0}^{1} t^{a - 1} (1 - t)^{b - 1} d t$
- Python implementation: scipy.special.beta(a,b)
- R implementation: beta(a, b)
Regularized incomplete Beta function $I_{x} (a, b) = \frac{1}{B (a, b)} \int_{0}^{x} t^{a - 1} (1 - t)^{b - 1} d t$
- Python implementation: scipy.special.betainc(a, b, x)
- R implementation: pbeta(x, a, b)
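Putting the two special functions together, a minimal sketch of the Beta family loss might look like the following. The function name `beta_family_loss` is hypothetical, and the sketch assumes $α > 0$ and $β > 0$ so that both Beta functions are finite (the log-loss case $α = β = 0$ is not covered by this closed form).

```python
import numpy as np
from scipy.special import beta as beta_fn, betainc

def beta_family_loss(y, q, alpha, beta):
    """Beta family loss for alpha > 0 and beta > 0.

    L_1(1 - q) = B(alpha, beta + 1) * (1 - I_q(alpha, beta + 1))
    L_0(q)     = B(alpha + 1, beta) * I_q(alpha + 1, beta)
    """
    partial_1 = beta_fn(alpha, beta + 1) * (1 - betainc(alpha, beta + 1, q))
    partial_0 = beta_fn(alpha + 1, beta) * betainc(alpha + 1, beta, q)
    return y * partial_1 + (1 - y) * partial_0

q = np.array([0.1, 0.5, 0.9])
# With alpha = beta = 1 the loss should equal half the squared error.
print(beta_family_loss(1, q, 1.0, 1.0), (1 - q) ** 2 / 2)
print(beta_family_loss(0, q, 1.0, 1.0), q ** 2 / 2)
```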

Tailor proper scoring rules for cost-weighted misclassification

We can use $α \neq β$ when the FP and FN costs are not equal
Since the weight function of the Beta family is an unnormalized Beta density on the FP cost $c$ , we can use mean/variance matching to elicit $α$ and $β$ $\begin{aligned} μ & = \frac{α}{α + β} = c \\ σ^{2} & = \frac{α β}{(α + β)^{2} (α + β + 1)} = \frac{c (1 - c)}{α + β + 1} \end{aligned}$
Alternatively, we can match the mode $c = q_{\text{mode}} = \frac{α - 1}{α + β - 2}$ , which requires $α, β > 1$
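Both elicitation recipes can be solved in closed form. The sketch below uses hypothetical function names; for mode matching it additionally assumes that the user supplies the total $α + β$ , since the mode equation alone fixes only one degree of freedom.

```python
def elicit_by_moments(c, variance):
    """Solve mean = c and variance = c (1 - c) / (alpha + beta + 1) for (alpha, beta)."""
    if not 0 < c < 1:
        raise ValueError("c must lie in (0, 1)")
    if not 0 < variance < c * (1 - c):
        raise ValueError("variance must lie in (0, c * (1 - c))")
    total = c * (1 - c) / variance - 1  # alpha + beta
    return c * total, (1 - c) * total

def elicit_by_mode(c, total):
    """Solve (alpha - 1) / (alpha + beta - 2) = c given alpha + beta = total > 2."""
    if total <= 2:
        raise ValueError("total must exceed 2 for an interior mode")
    alpha = c * (total - 2) + 1
    return alpha, total - alpha

print(elicit_by_moments(0.3, 0.01))  # approximately (6, 14)
print(elicit_by_mode(0.3, 20))
```

A smaller variance (or a larger total) concentrates the weight more tightly around the cutoff $c$ .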

Examples

A simulation example

In the simulated data with bivariate $x$ , the level curves of $η$ for different cutoffs (grey lines) are not parallel
The logit link Beta family linear model with $α = 6, β = 14$ estimates the $c = 0.3$ classification boundary better than logistic regression
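A rough sketch of how such a logit link model could be fit is given below. It is not the paper's IRLS procedure: it uses plain gradient descent, the synthetic data are made up for illustration, and the rescaling constant is a convenience introduced here to keep step sizes reasonable (dividing the loss by a positive constant does not change its minimizer). It relies on the fact that, for $y \in \{0, 1\}$ , the Beta family loss has derivative $(q - y) ω (q)$ with respect to $q$ .

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def weight_scale(alpha, beta):
    """Maximum of q**alpha * (1 - q)**beta on [0, 1], attained at q = alpha / (alpha + beta)."""
    if alpha + beta == 0:
        return 1.0
    c = alpha / (alpha + beta)
    return c**alpha * (1 - c)**beta

def beta_loss_gradient(w, X, y, alpha, beta):
    """Average gradient in w of the rescaled Beta family loss under a logit link.

    dL/dq = (q - y) * q**(alpha - 1) * (1 - q)**(beta - 1) and dq/df = q * (1 - q),
    so dL/df = (q - y) * q**alpha * (1 - q)**beta.
    """
    q = sigmoid(X @ w)
    per_point = (q - y) * q**alpha * (1 - q)**beta / weight_scale(alpha, beta)
    return X.T @ per_point / len(y)

def fit(X, y, alpha, beta, lr=0.5, steps=5000):
    """Plain gradient descent from w = 0 (a stand-in for IRLS)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * beta_loss_gradient(w, X, y, alpha, beta)
    return w

# Hypothetical synthetic data: an intercept plus two Gaussian features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = rng.binomial(1, sigmoid(X @ np.array([-0.5, 2.0, -1.0])))

w_logistic = fit(X, y, alpha=0, beta=0)   # alpha = beta = 0 gives the logistic regression gradient
w_tailored = fit(X, y, alpha=6, beta=14)  # weight concentrated near q = 0.3
print(w_logistic, w_tailored)
```

With $α = β = 0$ the per-point gradient reduces to $q - y$ , the familiar logistic regression score, while larger $α, β$ down-weight observations whose fitted probability is far from the cutoff.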

On the Pima Indians diabetes data

Comparing logistic regression with a proper scoring rule tailored for high class 1 probabilities: $α = 9, β = 1$ .
Black lines: empirical QQ curves of 200 cost-weighted misclassification costs computed on randomly selected test sets

References

Buja, A., Stuetzle, W., & Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications. Working draft, November, 3. http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf
For a game theory definition of proper scoring rule, see https://www.cis.upenn.edu/~aaroth/courses/slides/agt17/lect23.pdf
Fitting linear models with custom loss functions in Python: https://alex.miller.im/posts/linear-model-custom-loss-function-regularization-python/
Fitting XGBoost with custom loss functions in Python: https://xgboost.readthedocs.io/en/latest/tutorials/custom_metric_obj.html

Paper Notes: Proper Scoring Rules and Cost Weighted Loss Functions for Binary Classification
