Paper Notes: Proper Scoring Rules and Cost Weighted Loss Functions for Binary Classification

For the pdf slides, click here

Notations: in binary classification

  • We are interested in fitting a model \(q(\mathbf{x})\) for the true conditional class 1 probability \[ \eta(\mathbf{x}) = P(Y = 1 \mid \mathbf{X} = \mathbf{x}) \]

  • Two types of problems

    • Classification: estimating a region of the form \(\{\eta(\mathbf{x}) > c\}\)
    • Class probability estimation: approximate \(\eta(\mathbf{x})\), by fitting a model \(q(\mathbf{x}, \beta)\), where \(\beta\) are parameters to be estimated
  • Surrogate criteria for estimation, e.g.,

    • Log-loss: \(L(y \mid q) = -y \log(q) - (1-y) \log (1-q)\)
    • Squared error loss: \(L(y \mid q) = (y-q)^2 = y (1-q)^2 + (1-y) q^2\)
  • Surrogate criteria of classification are exactly the primary criteria of class probability estimation

Proper Scoring Rules

Proper scoring rule

  • Fitting a binary model is to minimize a loss function \[ \mathcal{L}\left(q()\right) = \frac{1}{N}\sum_{n=1}^N L(y_n \mid q_n) \]

  • In game theory, the agent’s goal is to maximize expected score (or minimize expected loss)

    • A scoring rule is proper if truthfulness maximizes expected score
    • It is strictly proper if truthfulness uniquely maximizes expected score
  • In the context of binary response data, Fisher consistency holds pointwise if \[ \arg\min_{q\in [0, 1]} E_{Y\sim \text{Bernoulli}(\eta)}L(Y\mid q) = \eta, \quad \forall \eta \in [0, 1] \]

  • Fisher consistency is the defining property of proper scoring rules

Visualization of two proper scoring rules

  • Left: log-loss, or Beta loss with \(\alpha=\beta=0\)
  • Right: Beta loss with \(\alpha=1, \beta=3\)

    • Tailored for classification with false positive cost \(c = \frac{\alpha}{\alpha+\beta} = 0.25\) and false negative cost \(1-c = 0.75\)

Commonly Used Proper Scoring Rules

How to check property of a scoring rule for binary response data

  • Suppose the partial losses \(L_1(1-q), L_0(q)\) are smooth, then the proper scoring rule property implies \[\begin{align*} 0 & = \left.\frac{\partial}{\partial q}\right|_{q=\eta} R(\eta \mid q)\\ & = -\eta L_1'(1-\eta) + (1-\eta) L_0'(\eta) \end{align*}\]

  • Therefore, a scoring rule is proper if \[ \eta L_1'(1-\eta) = (1-\eta) L_0'(\eta) \]

  • A scoring rule is strictly proper if \[ \left.\frac{\partial^2}{\partial q^2}\right|_{q=\eta} R(\eta \mid q) > 0 \]


  • Log-loss is the negative log likelihood of the Bernoulli distribution \[ \mathcal{L} = \frac{1}{N}\sum_{n=1}^N\left[-y_n \log(q_n) - (1-y_n) \log(1 - q_n) \right] \]

  • Partial losses for log-loss \[ L_1(1-q) = -\log(q), \quad L_0(q) = -\log(1-q) \]

  • Expected loss for log-loss \[ R(\eta \mid q) = -\eta \log(q) - (1-\eta)\log(1-q) \]

  • Log-loss is a strictly proper scoring rule

Squared error loss

  • Squared error loss is also known as Brier score \[ \mathcal{L} = \frac{1}{N}\sum_{n=1}^N\left[y_n (1 - q_n)^2 - (1-y_n)q_n^2 \right] \]

  • Partial losses for squared error loss \[ L_1(1-q) = (1 - q)^2, \quad L_0(q) = q^2 \]

  • Expected loss for squared error loss \[ R(\eta \mid q) = \eta (1 - q)^2 + (1-\eta)q^2 \]

  • Squared error loss is a strictly proper scoring rule

Misclassification loss

  • Usually, misclassification loss uses \(c=0.5\) as the cutoff \[ \mathcal{L} = \frac{1}{N}\sum_{n=1}^N\left[y_n \mathbf{1}_{\{q_n \leq 0.5\}} + (1-y_n)\mathbf{1}_{\{q_n > 0.5\}} \right] \]

  • Partial losses for misclassification loss \[ L_1(1-q) = \mathbf{1}_{\{q_n \leq 0.5\}}, \quad L_0(q) = \mathbf{1}_{\{q_n > 0.5\}} \]

  • Expected loss for misclassification loss \[ R(\eta \mid q) = \eta \mathbf{1}_{\{q \leq 0.5\}} + (1-\eta) \mathbf{1}_{\{q > 0.5\}} \]

  • Since any \(q>0.5\) for events and any \(q \leq 0.5\) for non-events minimize the misclassification loss, misclassification loss is a proper score rule, but it is not strictly proper

A counter-example of proper scoring rule: absolute loss

  • Because \(y\in \{0, 1\}\), the absolute deviation \(L(y\mid q) = |y-q|\) becomes \[\begin{align*} L(y\mid q) & = y(1-q) + (1-y) q\\ R(\eta \mid q)& =\eta(1-q) + (1-\eta) q \end{align*}\]

  • Absolute deviation is not a proper scoring rule, because \(R(\eta \mid q)\) is minimized by \(q=1\) for \(\eta > 1/2\), and \(q=0\) for \(\eta < 1/2\)

Structure of Proper Scoring Rules

Structure of proper scoring rules

  • Theorem: Let \(\omega(dt)\) be a positive measure on \((0, 1)\) that is finite on intervals \((\epsilon, 1-\epsilon), \forall \epsilon > 0\). Then the following defines a proper scoring rule: \[ L_1(1-q) = \int_q^{f_1} (1-t)\omega(dt), \quad L_0(q) = \int_{f_0}^q t\omega(dt) \]

  • The proper scoring rule is strict iff \(\omega(dt)\) has non-zero mass on every open interval of (0, 1)

  • The fixed limits \(f_0\geq 0\) and \(f_1 \leq 1\) are somewhat arbitrary

  • Note that for log-loss, \(L_1(1-q)\) is unbounded (goes to infinity) below near \(q=1\), and \(L_0(q)\) is unbounded below near \(q=0\)

  • Except for log-loss, all other common proper scoring rules seem to satisfy \[ \int_0^1 t (1-t) \omega(dt) < \infty \]

Proper scoring rules are mixtures of cost-weighted misclassification losses

Connection between the false positive (FP) / false negative (FN) costs and the classification cutoff

  • Suppose the costs of FP and FN sum up to 1:

    • FP: has a cost \(c\), and expected cost \(c P(Y=0) = c(1-\eta)\)
    • FN: has a cost \(1 - c\), and expected cost \((1-c) P(Y=1) = (1-c)\eta\)
  • The optimal classification is therefore class 1 iff \[ (1-c)\eta \geq c(1-\eta) \Longleftrightarrow \eta \geq c \]

    • Since we don’t know the truth \(\eta\), we classify as class 1 when \(q \geq c\)
  • Therefore, the classification cutoff equals \[ \frac{\text{cost of FP}}{\text{cost of FP} + \text{cost of FN}} \]

    • Standard classification assumes costs of FP and FN are the same, so the classification cutoff is \(0.5\)

Cost-weighted misclassification errors

  • Cost-weighted misclassification errors: \[\begin{align*} L_c(y\mid q) & = y (1-c) \cdot \mathbf{1}_{\{q \leq c\}} + (1-y) c \cdot \mathbf{1}_{\{q > c\}} \\ L_{1, c}(1-q) &= (1-c) \cdot \mathbf{1}_{\{q \leq c\}}, \quad L_{0, c}(q) = c \cdot \mathbf{1}_{\{q > c\}} \end{align*}\]

  • Shuford-Albert-Massengil-Savage-Schervish theorem: an intergral representation of proper scoring rules \[ L(y\mid q) = \int_0^1 L_c(y\mid q) \omega(dc) = \int_0^1 L_c(y\mid q) \omega(c) dc \]

    • The second equality holds if \(w(dc)\) is absolutely continuous wrt Lebesgue measure
    • This can be used to tailor losses to specific classification problems with cutoffs other than \(1/2\) of \(\eta(x)\), by designing suitable weight functions \(\omega()\)
  • The paper proposes to use Iterative Reweighted Least Squares (IRLS) to fit linear models with proper scoring rules

Beta Family of Proper Scoring Rules

Beta family of proper scoring rules

  • This paper introduced a flexible 2-parameter family of proper scoring rules \[ \omega(t) = t^{\alpha-1} (1-t)^{\beta-1}, \quad \text{where } \alpha > -1, \beta > -1 \]

  • Loss function of the Beta family proper scoring rules \[\begin{align*}L(y \mid q) =~& y \int_q^1 t^{\alpha-1} (1-t)^{\beta} dt + (1-y)\int_0^q t^{\alpha} (1-t)^{\beta-1} dt\\ =~& y B(\alpha, \beta+1) \left[1 - I_q(\alpha, \beta+1)\right]\\ & + (1-y) B(\alpha+1, \beta) I_q(\alpha+1, \beta) \end{align*}\]

    • See the definitions of \(B(a, b)\) and \(I_x(a, b)\) in the next page
  • Log-loss and squared error loss are special cases

    • Log-loss: \(\alpha=\beta=0\)
    • Squared error loss: \(\alpha=\beta=1\)
    • Misclassification loss: \(\alpha=\beta \rightarrow \infty\)

Special functions and Python / R implementations

  • Beta function \[ B(a, b) = \int_0^1 t^{a-1} (1-t)^{b-1} dt %= \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} \]

  • Incomplete Beta function \[\begin{align*} I_x(a, b) %& = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)}\int_0^x t^{a-1} (1-t)^{b-1} dt \\ & = \frac{1}{B(a, b)} \int_0^x t^{a-1} (1-t)^{b-1} dt \end{align*}\]

Tailor proper scoring rules for cost-weighted misclassification

  • We can use \(\alpha\neq \beta\) when FP and FN costs are not viewed equal

  • Since Beta family proper scoring rule is like adding a Beta distribution on the FP cost \(c\), we can use mean/variance matching to elicit \(\alpha\) and \(\beta\) \[\begin{align*} \mu & = \frac{\alpha}{\alpha + \beta} = c\\ \sigma^2 &= \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta+1)} = \frac{c(1-c)}{\alpha + \beta+1} \end{align*}\]

  • Alternatively, we can match the mode \[ c = q_{\text{mode}} = \frac{\alpha-1}{\alpha+\beta -2} \]


A simulation example

  • In the simulation data with bivariate \(x\), where decision boundaries of different \(\eta\) are not in parallel (grey lines)

  • The logit link Beta family linear model with \(\alpha = 6, \beta = 14\) estimates the \(c=0.3\) classification boundary better than the logistic regression

On the Pima Indians diabetes data

  • Comparing logistic regression with a proper scoring rule tailored for high class 1 probabilities: \(\alpha = 9, \beta = 1\).

  • Black lines: empirical QQ curves of 200 cost-weighted misclassification costs computed on randomly selected test sets