Paper Notes: Proper Scoring Rules and Cost Weighted Loss Functions for Binary Classification

For the PDF slides, click here

Notation in binary classification

  • We are interested in fitting a model $q(x)$ for the true conditional class-1 probability $\eta(x) = P(Y = 1 \mid X = x)$

  • Two types of problems

    • Classification: estimating a region of the form $\{\eta(x) > c\}$
    • Class probability estimation: approximating $\eta(x)$ by fitting a model $q(x, \beta)$, where $\beta$ are parameters to be estimated
  • Surrogate criteria for estimation, e.g.,

    • Log-loss: $L(y \mid q) = -y\log(q) - (1-y)\log(1-q)$
    • Squared error loss: $L(y \mid q) = (y-q)^2 = y(1-q)^2 + (1-y)q^2$ (both losses are implemented in the sketch after this list)
  • Surrogate criteria of classification are exactly the primary criteria of class probability estimation
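
A minimal sketch of the two surrogate losses above in NumPy (the clipping constant `eps` is an arbitrary numerical safeguard, not from the paper):

```python
import numpy as np

def log_loss(y, q, eps=1e-12):
    """Log-loss, averaged over samples; eps guards against log(0)."""
    q = np.clip(q, eps, 1 - eps)
    return np.mean(-y * np.log(q) - (1 - y) * np.log(1 - q))

def squared_error_loss(y, q):
    """Squared error loss (Brier score), averaged over samples."""
    return np.mean((y - q) ** 2)

y = np.array([1, 0, 1, 1])
q = np.array([0.9, 0.2, 0.6, 0.4])
print(log_loss(y, q), squared_error_loss(y, q))
```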

Proper Scoring Rules

Proper scoring rule

  • Fitting a binary model amounts to minimizing a loss function $L(q(\cdot)) = \frac{1}{N}\sum_{n=1}^{N} L(y_n \mid q_n)$

  • In game theory, the agent’s goal is to maximize expected score (or minimize expected loss)

    • A scoring rule is proper if truthfulness maximizes expected score
    • It is strictly proper if truthfulness uniquely maximizes expected score
  • In the context of binary response data, Fisher consistency holds pointwise if $\operatorname{argmin}_{q \in [0,1]} \mathrm{E}_{Y \sim \mathrm{Bernoulli}(\eta)} L(Y \mid q) = \eta$ for all $\eta \in [0,1]$ (illustrated numerically in the sketch after this list)

  • Fisher consistency is the defining property of proper scoring rules
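
A quick numerical illustration of pointwise Fisher consistency for log-loss: for each $\eta$, minimize the expected loss over a grid of $q$ and check that the minimizer is approximately $\eta$ itself (the grid resolution is an arbitrary choice here):

```python
import numpy as np

q_grid = np.linspace(0.001, 0.999, 999)  # grid over (0, 1)

for eta in [0.1, 0.3, 0.5, 0.8]:
    # Expected log-loss: R(eta | q) = -eta*log(q) - (1 - eta)*log(1 - q)
    risk = -eta * np.log(q_grid) - (1 - eta) * np.log(1 - q_grid)
    print(f"eta = {eta:.1f}  ->  argmin_q R = {q_grid[np.argmin(risk)]:.3f}")
# each argmin matches eta, as Fisher consistency requires
```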

Visualization of two proper scoring rules

  • Left: log-loss, or Beta loss with $\alpha = \beta = 0$
  • Right: Beta loss with $\alpha = 1, \beta = 3$

    • Tailored for classification with false positive cost $c = \frac{\alpha}{\alpha+\beta} = 0.25$ and false negative cost $1 - c = 0.75$

Commonly Used Proper Scoring Rules

How to check the properness of a scoring rule for binary response data

  • Suppose the partial losses $L_1(1-q)$ and $L_0(q)$ are smooth, and write the expected loss as $R(\eta \mid q) = \eta L_1(1-q) + (1-\eta) L_0(q)$; then the proper scoring rule property implies $0 = \frac{\partial}{\partial q}\Big|_{q=\eta} R(\eta \mid q) = -\eta L_1'(1-\eta) + (1-\eta) L_0'(\eta)$

  • Therefore, a scoring rule is proper if $\eta L_1'(1-\eta) = (1-\eta) L_0'(\eta)$ for all $\eta \in (0,1)$

  • A scoring rule is strictly proper if, in addition, $\frac{\partial^2}{\partial q^2}\Big|_{q=\eta} R(\eta \mid q) > 0$ (both conditions are checked symbolically in the sketch after this list)
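
A sketch that checks both conditions symbolically for log-loss with SymPy (the same pattern works for squared error loss by swapping in its partial losses):

```python
import sympy as sp

eta, q = sp.symbols('eta q', positive=True)

# Partial losses for log-loss (swap these for other scoring rules)
L1 = -sp.log(q)        # L1(1 - q), loss when y = 1
L0 = -sp.log(1 - q)    # L0(q), loss when y = 0

R = eta * L1 + (1 - eta) * L0   # expected loss R(eta | q)

# First-order condition: derivative vanishes at q = eta
print(sp.simplify(sp.diff(R, q).subs(q, eta)))      # 0
# Second-order condition: second derivative at q = eta is positive
print(sp.simplify(sp.diff(R, q, 2).subs(q, eta)))   # 1/(eta*(1 - eta)), positive on (0, 1)
```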

Log-loss

  • Log-loss is the negative log-likelihood of the Bernoulli distribution: $L = \frac{1}{N}\sum_{n=1}^{N}\left[-y_n \log(q_n) - (1-y_n)\log(1-q_n)\right]$

  • Partial losses for log-loss: $L_1(1-q) = -\log(q), \quad L_0(q) = -\log(1-q)$

  • Expected loss for log-loss: $R(\eta \mid q) = -\eta\log(q) - (1-\eta)\log(1-q)$

  • Log-loss is a strictly proper scoring rule

Squared error loss

  • Squared error loss is also known as the Brier score: $L = \frac{1}{N}\sum_{n=1}^{N}\left[y_n(1-q_n)^2 + (1-y_n)q_n^2\right]$

  • Partial losses for squared error loss: $L_1(1-q) = (1-q)^2, \quad L_0(q) = q^2$

  • Expected loss for squared error loss: $R(\eta \mid q) = \eta(1-q)^2 + (1-\eta)q^2$

  • Squared error loss is a strictly proper scoring rule

Misclassification loss

  • Usually, misclassification loss uses $c = 0.5$ as the cutoff: $L = \frac{1}{N}\sum_{n=1}^{N}\left[y_n\, 1\{q_n \le 0.5\} + (1-y_n)\, 1\{q_n > 0.5\}\right]$

  • Partial losses for misclassification loss: $L_1(1-q) = 1\{q \le 0.5\}, \quad L_0(q) = 1\{q > 0.5\}$

  • Expected loss for misclassification loss: $R(\eta \mid q) = \eta\, 1\{q \le 0.5\} + (1-\eta)\, 1\{q > 0.5\}$

  • Since any $q > 0.5$ minimizes the expected loss when $\eta > 0.5$, and any $q \le 0.5$ does when $\eta < 0.5$, misclassification loss is a proper scoring rule, but it is not strictly proper (demonstrated in the sketch below)
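
A quick numerical check, taking $\eta = 0.7$ for illustration: every $q > 0.5$ attains the same minimal expected loss, so $q = \eta$ is a minimizer but not the unique one:

```python
eta = 0.7  # assumed true class-1 probability
for q in [0.2, 0.4, 0.55, 0.7, 0.9]:
    # R(eta | q) = eta * 1{q <= 0.5} + (1 - eta) * 1{q > 0.5}
    risk = eta * (q <= 0.5) + (1 - eta) * (q > 0.5)
    print(f"q = {q:.2f}  ->  R = {risk:.2f}")
# every q > 0.5 gives R = 0.30, so the minimizer is not unique
```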

A counter-example: absolute loss is not a proper scoring rule

  • Because $y \in \{0, 1\}$, the absolute deviation $L(y \mid q) = |y - q|$ becomes $L(y \mid q) = y(1-q) + (1-y)q$, with expected loss $R(\eta \mid q) = \eta(1-q) + (1-\eta)q$

  • Absolute deviation is not a proper scoring rule, because $R(\eta \mid q)$ is linear in $q$ and hence minimized by $q = 1$ for $\eta > 1/2$ and by $q = 0$ for $\eta < 1/2$, never by $q = \eta$ itself (see the sketch below)
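
The same grid experiment as for log-loss makes the failure visible: the minimizer of the expected absolute loss jumps to 0 or 1 instead of tracking $\eta$ (a sketch; the grid values are arbitrary):

```python
import numpy as np

q_grid = np.linspace(0, 1, 1001)

for eta in [0.2, 0.4, 0.6, 0.8]:
    # R(eta | q) = eta*(1 - q) + (1 - eta)*q, linear in q
    risk = eta * (1 - q_grid) + (1 - eta) * q_grid
    print(f"eta = {eta:.1f}  ->  argmin_q = {q_grid[np.argmin(risk)]:.1f}")
# prints 0.0 for eta < 1/2 and 1.0 for eta > 1/2, never eta itself
```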

Structure of Proper Scoring Rules

Structure of proper scoring rules

  • Theorem: Let $\omega(dt)$ be a positive measure on $(0,1)$ that is finite on intervals $(\epsilon, 1-\epsilon)$, $\forall\, \epsilon > 0$. Then the following defines a proper scoring rule: $L_1(1-q) = \int_q^{f_1} (1-t)\, \omega(dt), \quad L_0(q) = \int_{f_0}^{q} t\, \omega(dt)$

  • The proper scoring rule is strict iff ω(dt) has non-zero mass on every open interval of (0, 1)

  • The fixed limits $f_0 = 0$ and $f_1 = 1$ are somewhat arbitrary

  • Note that for log-loss, $L_1(1-q) = -\log(q)$ is unbounded (goes to infinity) as $q \to 0$, and $L_0(q) = -\log(1-q)$ is unbounded as $q \to 1$; its weight is $\omega(t) = \frac{1}{t(1-t)}$ (recovered numerically in the sketch after this list)

  • Except for log-loss, all other common proper scoring rules seem to satisfy $\int_0^1 t(1-t)\, \omega(dt) < \infty$
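
A sketch recovering the log-loss partial losses from the weight $\omega(t) = \frac{1}{t(1-t)}$ by numerical quadrature, taking $f_0 = 0$ and $f_1 = 1$:

```python
import numpy as np
from scipy.integrate import quad

omega = lambda t: 1.0 / (t * (1.0 - t))  # log-loss weight (Beta family with alpha = beta = 0)

def L1(q):
    # L1(1 - q) = int_q^1 (1 - t) omega(t) dt, which should equal -log(q)
    return quad(lambda t: (1 - t) * omega(t), q, 1)[0]

def L0(q):
    # L0(q) = int_0^q t omega(t) dt, which should equal -log(1 - q)
    return quad(lambda t: t * omega(t), 0, q)[0]

q = 0.3
print(L1(q), -np.log(q))       # both approx 1.20397
print(L0(q), -np.log(1 - q))   # both approx 0.35667
```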

Proper scoring rules are mixtures of cost-weighted misclassification losses

Connection between the false positive (FP) / false negative (FN) costs and the classification cutoff

  • Suppose the costs of FP and FN sum up to 1:

    • FP: has cost $c$, with expected cost $c\, P(Y=0) = c(1-\eta)$
    • FN: has cost $1-c$, with expected cost $(1-c)\, P(Y=1) = (1-c)\eta$
  • The optimal classification is therefore class 1 iff $(1-c)\eta \ge c(1-\eta) \iff \eta \ge c$

    • Since we don’t know the truth $\eta$, we classify as class 1 when $q \ge c$
  • Therefore, the classification cutoff equals $\frac{\text{cost of FP}}{\text{cost of FP} + \text{cost of FN}}$

    • Standard classification assumes the costs of FP and FN are equal, so the classification cutoff is 0.5 (a worked example with unequal costs follows this list)
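
For example, if a false negative is judged three times as costly as a false positive, then $c = 0.25$ and $1 - c = 0.75$, so the optimal cutoff drops to $0.25$. A minimal check of the expected-cost comparison (the numbers are illustrative):

```python
# FP cost c, FN cost 1 - c; here a false negative is assumed 3x as costly
c = 0.25
for eta in [0.1, 0.2, 0.3, 0.5]:
    cost_pred_0 = (1 - c) * eta      # expected FN cost if we predict class 0
    cost_pred_1 = c * (1 - eta)      # expected FP cost if we predict class 1
    decision = 1 if cost_pred_1 <= cost_pred_0 else 0
    print(f"eta = {eta:.2f}  ->  predict class {decision}")
# predicts class 1 exactly when eta >= c = 0.25
```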

Cost-weighted misclassification errors

  • Cost-weighted misclassification errors: $L_c(y \mid q) = y(1-c)\,1\{q \le c\} + (1-y)\,c\,1\{q > c\}$, with partial losses $L_{1,c}(1-q) = (1-c)\,1\{q \le c\}, \quad L_{0,c}(q) = c\,1\{q > c\}$

  • Shuford–Albert–Massengill–Savage–Schervish theorem: an integral representation of proper scoring rules, $L(y \mid q) = \int_0^1 L_c(y \mid q)\, \omega(dc) = \int_0^1 L_c(y \mid q)\, \omega(c)\, dc$

    • The second equality holds if $\omega(dc)$ is absolutely continuous w.r.t. the Lebesgue measure
    • This can be used to tailor losses to classification problems where the cutoff on $\eta(x)$ differs from $1/2$, by designing suitable weight functions $\omega(\cdot)$ (the sketch after this list verifies the representation numerically for squared error loss)
  • The paper proposes to use Iterative Reweighted Least Squares (IRLS) to fit linear models with proper scoring rules
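
A sketch verifying the integral representation numerically for squared error loss, whose weight works out to the constant $\omega(c) = 2$ (the helper names are mine):

```python
from scipy.integrate import quad

def L_c(y, q, c):
    """Cost-weighted misclassification loss with cutoff c."""
    return y * (1 - c) * (q <= c) + (1 - y) * c * (q > c)

def mixture_loss(y, q, omega=lambda c: 2.0):
    """Integrate L_c(y | q) against the weight omega over c in (0, 1).
    points=[q] tells quad about the discontinuity of the indicator at c = q."""
    return quad(lambda c: L_c(y, q, c) * omega(c), 0, 1, points=[q])[0]

q = 0.3
print(mixture_loss(1, q), (1 - q) ** 2)  # both approx 0.49
print(mixture_loss(0, q), q ** 2)        # both approx 0.09
```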

Beta Family of Proper Scoring Rules

Beta family of proper scoring rules

  • This paper introduces a flexible two-parameter family of proper scoring rules with weight function $\omega(t) = t^{\alpha-1}(1-t)^{\beta-1}, \quad \text{where } \alpha > -1,\ \beta > -1$

  • Loss function of the Beta family of proper scoring rules: $L(y \mid q) = y \int_q^1 t^{\alpha-1}(1-t)^{\beta}\, dt + (1-y)\int_0^q t^{\alpha}(1-t)^{\beta-1}\, dt = y\, B(\alpha, \beta+1)\left[1 - I_q(\alpha, \beta+1)\right] + (1-y)\, B(\alpha+1, \beta)\, I_q(\alpha+1, \beta)$

    • See the definitions of $B(a,b)$ and $I_x(a,b)$ in the next section
  • Log-loss and squared error loss are special cases; misclassification loss is a limiting case

    • Log-loss: $\alpha = \beta = 0$
    • Squared error loss: $\alpha = \beta = 1$
    • Misclassification loss: the limit $\alpha = \beta \to \infty$, where the weight concentrates at $t = 1/2$

Special functions and Python / R implementations
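
The two special functions in the Beta family loss are the Beta function $B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt$ and the regularized incomplete Beta function $I_x(a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$. In Python these are `scipy.special.beta` and `scipy.special.betainc`; in R, `beta(a, b)` and `pbeta(x, a, b)` serve the same purpose. A minimal sketch of the Beta family loss built on them (valid for $\alpha, \beta > 0$, where both Beta functions are finite; the function name is mine):

```python
import numpy as np
from scipy.special import beta, betainc  # B(a, b) and I_x(a, b)

def beta_family_loss(y, q, a, b):
    """Beta family proper scoring rule for alpha, beta > 0:
    L(y|q) = y B(a, b+1) [1 - I_q(a, b+1)] + (1-y) B(a+1, b) I_q(a+1, b)
    """
    term1 = beta(a, b + 1) * (1 - betainc(a, b + 1, q))  # partial loss for y = 1
    term0 = beta(a + 1, b) * betainc(a + 1, b, q)        # partial loss for y = 0
    return y * term1 + (1 - y) * term0

# Sanity check: alpha = beta = 1 recovers squared error loss up to a factor of 1/2
# (proper scoring rules are only defined up to positive scaling)
q = np.linspace(0.1, 0.9, 5)
print(beta_family_loss(1, q, 1, 1), 0.5 * (1 - q) ** 2)
print(beta_family_loss(0, q, 1, 1), 0.5 * q ** 2)
```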

Tailoring proper scoring rules for cost-weighted misclassification

  • We can use $\alpha \ne \beta$ when the FP and FN costs are not viewed as equal

  • Since the Beta family proper scoring rule is like placing a Beta distribution on the FP cost $c$, we can use mean/variance matching to elicit $\alpha$ and $\beta$: $\mu = \frac{\alpha}{\alpha+\beta} = c, \quad \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{c(1-c)}{\alpha+\beta+1}$

  • Alternatively, we can match the mode: $c = q_{\mathrm{mode}} = \frac{\alpha-1}{\alpha+\beta-2}$ (both strategies are sketched in code after this list)
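
A small sketch of both elicitation strategies (the helper names `elicit_by_moments` and `elicit_by_mode` are mine, not the paper's):

```python
def elicit_by_moments(c, sigma2):
    """Match the Beta weight's mean to c and its variance to sigma2.
    With s = alpha + beta: alpha = c*s, beta = (1 - c)*s, and
    sigma2 = c*(1 - c)/(s + 1) gives s = c*(1 - c)/sigma2 - 1.
    """
    s = c * (1 - c) / sigma2 - 1
    return c * s, (1 - c) * s

def elicit_by_mode(c, s):
    """Match the Beta weight's mode to c at a chosen concentration s = alpha + beta.
    Mode condition: c = (alpha - 1)/(s - 2), so alpha = 1 + c*(s - 2).
    """
    alpha = 1 + c * (s - 2)
    return alpha, s - alpha

print(elicit_by_moments(0.3, 0.01))  # approx (6.0, 14.0): the alpha, beta of the simulation below
print(elicit_by_mode(0.3, 20))       # (6.4, 13.6): mode at c = 0.3
```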

Examples

A simulation example

  • In the simulated data with bivariate $x$, the decision boundaries $\{\eta(x) = c\}$ for different $c$ are not parallel (grey lines)

  • A linear model with logit link, fit with the Beta family loss with $\alpha = 6, \beta = 14$, estimates the $c = 0.3$ classification boundary better than logistic regression does (a fitting sketch follows)
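
The paper proposes IRLS for fitting; the sketch below instead uses `scipy.optimize.minimize` on synthetic data (all of the data-generating choices here are mine, meant only to mimic the non-parallel-boundary setup, and Nelder–Mead stands in for the paper's IRLS):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import beta, betainc, expit

rng = np.random.default_rng(0)

# Synthetic data: bivariate x; the interaction term makes the eta-level
# boundaries non-parallel, loosely mimicking the paper's simulation
X = rng.normal(size=(500, 2))
eta = expit(1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + X[:, 0] * X[:, 1])
y = rng.binomial(1, eta)

def beta_family_loss(y, q, a, b):
    term1 = beta(a, b + 1) * (1 - betainc(a, b + 1, q))
    term0 = beta(a + 1, b) * betainc(a + 1, b, q)
    return y * term1 + (1 - y) * term0

def objective(w, a=6, b=14):
    q = expit(w[0] + X @ w[1:])          # logit link on a linear predictor
    q = np.clip(q, 1e-10, 1 - 1e-10)
    return beta_family_loss(y, q, a, b).mean()

fit = minimize(objective, x0=np.zeros(3), method="Nelder-Mead")
print(fit.x)  # coefficients tailored to the c = 6/20 = 0.3 cutoff
```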

On the Pima Indians diabetes data

  • Comparing logistic regression with a proper scoring rule tailored for high class-1 probabilities: $\alpha = 9, \beta = 1$ (implied cutoff $c = 0.9$)

  • Black lines: empirical QQ curves of cost-weighted misclassification losses computed on 200 randomly selected test sets

References

  • Buja, A., Stuetzle, W., and Shen, Y. (2005). Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications. Working draft, University of Pennsylvania.