Survival Analysis

Life Table and Kaplan-Meier Estimate

An insurance company’s life table summarizes its clients by age. For each age $i$, it contains
- $n_{i}$ : number of clients
- $y_{i}$ : number of deaths
- ${\hat{h}}_{i} = y_{i} / n_{i}$ : hazard rate
- ${\hat{S}}_{i}$ : survival probability estimate
An example life table

Age	$n$	$y$	$\hat{h}$	$\hat{S}$
34	120	0	0.000	1.000
35	71	1	0.014	0.986
36	125	0	0.000	0.986
…	…	…	…	…

A client’s lifetime (time until event): random variable $X$
- Also called failure time, survival time, or event time
Probability of dying at age $i$: $f_{i} = P (X = i)$
Probability of surviving past age $i$: $S_{i} = \sum_{j \geq i + 1} f_{j} = P (X > i)$
Hazard rate at age $i$ (a conditional probability): $h_{i} = \frac{f_{i}}{S_{i - 1}} = P (X = i ∣ X \geq i)$

Hazard rate estimation: binomial proportions ${\hat{h}}_{i} = \frac{y_{i}}{n_{i}}$
- Typical frequentist inference: a probabilistic result stated in terms of the unknown $h_{i}$ is estimated by the plug-in principle, i.e., by substituting ${\hat{h}}_{i}$ for $h_{i}$
Probability of surviving past age $j$ given survival past age $i$ : $P (X > j ∣ X > i) = \prod_{k = i + 1}^{j} P (X > k ∣ X \geq k) = \prod_{k = i + 1}^{j} (1 - h_{k})$
Survival probability estimate: ${\hat{S}}_{j} = \prod_{k = i_{0}}^{j} (1 - {\hat{h}}_{k}),$ where $i_{0}$ is the starting age
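The plug-in estimates above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `life_table` and the `(age, n, y)` tuple layout are illustrative assumptions, and the input is the three fully listed rows of the example table, with age 34 taken as the starting age $i_{0}$.

```python
def life_table(rows):
    """rows: list of (age, n, y) tuples in increasing age order.

    Returns a list of (age, h_hat, S_hat) tuples.
    """
    table = []
    survival = 1.0
    for age, n, y in rows:
        hazard = y / n              # binomial proportion h_hat_i = y_i / n_i
        survival *= 1.0 - hazard    # running product of (1 - h_hat_k)
        table.append((age, hazard, survival))
    return table


# The three fully listed rows of the example life table.
rows = [(34, 120, 0), (35, 71, 1), (36, 125, 0)]
for age, hazard, survival in life_table(rows):
    print(f"{age}  {hazard:.3f}  {survival:.3f}")
# 34  0.000  1.000
# 35  0.014  0.986
# 36  0.000  0.986
```

An age with no deaths leaves ${\hat{S}}$ unchanged, which is why ages 35 and 36 share the value 0.986.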

Time until event $T$ : a continuous positive random variable, with pdf $f (t)$ and cdf $F (t)$
Survival function (i.e., reverse cdf): $S (t) = \int_{t}^{\infty} f (x) d x = P (T > t) = 1 - F (t)$
Hazard rate, also called hazard function: $h (t) = \frac{f (t)}{S (t)} = \lim_{Δ t \to 0} \frac{P (t < T \leq t + Δ t ∣ T > t)}{Δ t}$
- In some other books, hazard rate is denoted as $λ (t)$

Connection between hazard rate $h (t)$ and survival function $S (t)$: $h (t) = - \frac{\partial \log S (t)}{\partial t} ⟺ S (t) = \exp \left\{ - \int_{0}^{t} h (x) d x \right\}$
Cumulative hazard function: $Λ (t) = \int_{0}^{t} h (x) d x = - \log S (t)$
Knowing any of $S (t)$ , $h (t)$ , $Λ (t)$ allows one to derive the other two
Example: exponentially distributed $T$, $f (t) = λ e^{- λ t} ⟹ S (t) = e^{- λ t}, h (t) = λ$
- Constant hazard rate: memoryless
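The relation $S (t) = \exp \{ - Λ (t) \}$ can be checked numerically for the exponential example. The sketch below integrates a hazard function with the trapezoidal rule and compares the result with the closed form $e^{- λ t}$. The helper name `survival_from_hazard`, the step count, and the rate value 0.5 are assumptions made only for illustration.

```python
import math


def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral_0^t h(x) dx) with the trapezoidal rule."""
    if t == 0:
        return 1.0
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * width)
    cumulative_hazard = total * width   # Lambda(t)
    return math.exp(-cumulative_hazard)


rate = 0.5  # hypothetical value of lambda


def constant_hazard(x):
    """Hazard of the exponential distribution: h(t) = lambda for all t."""
    return rate


s_numeric = survival_from_hazard(constant_hazard, 3.0)
s_exact = math.exp(-rate * 3.0)
print(s_numeric, s_exact)   # the two values agree up to rounding error

# Memoryless property: P(T > s + t | T > s) = S(s + t) / S(s) = S(t)
ratio = survival_from_hazard(constant_hazard, 5.0) / survival_from_hazard(constant_hazard, 2.0)
print(ratio, s_exact)       # again equal up to rounding error
```

Passing a different function as `hazard` gives the survival function of a non-constant hazard in the same way.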

Censored data: survival times known only to exceed the reported value
- E.g., lost to follow-up, or the experiment ended with some patients still alive
- Usually denoted as “number+”
Observation $z_{i}$ for censored data: $z_{i} = (t_{i}, d_{i}),$ where $t_{i}$ is the observed survival time (the censoring time if the death was not observed), and $d_{i}$ is the indicator $d_{i} = \begin{cases} 1 & \text{if death observed} \\ 0 & \text{if death not observed} \end{cases}$

Among the censored data $z_{1}, \dots, z_{n}$ , we denote the ordered survival times as $t_{(1)} < t_{(2)} < \dots < t_{(n)},$ assuming no ties.
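The Kaplan-Meier estimate for survival probability $S_{(j)} = P (X > t_{(j)})$ is the life table estimate ${\hat{S}}_{(j)} = \prod_{k \leq j} \left( \frac{n - k}{n - k + 1} \right)^{d_{(k)}},$ where $d_{(k)}$ is the death indicator attached to $t_{(k)}$ and $n - k + 1$ is the number of individuals still at risk just before $t_{(k)}$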
Life table curves are nonparametric: no relationship is assumed between the hazard rates $h_{i}$
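The Kaplan-Meier product can be sketched directly from the formula, assuming no tied times. The function name `kaplan_meier` and the four observations are hypothetical and chosen only to make the arithmetic easy to follow. The pair `(5, 0)` is a censored time, i.e., “5+”.

```python
def kaplan_meier(observations):
    """observations: list of (t, d) pairs, d = 1 if death observed, 0 if censored.

    Assumes no tied times. Returns a list of (t_(k), S_hat_(k)) in time order.
    """
    ordered = sorted(observations)
    n = len(ordered)
    survival = 1.0
    curve = []
    for k, (t, d) in enumerate(ordered, start=1):
        if d == 1:
            # n - k + 1 individuals are at risk just before t_(k); one of them dies
            survival *= (n - k) / (n - k + 1)
        curve.append((t, survival))
    return curve


data = [(8, 1), (3, 1), (12, 1), (5, 0)]   # hypothetical data
print(kaplan_meier(data))
# [(3, 0.75), (5, 0.75), (8, 0.375), (12, 0.0)]
```

The censored observation at time 5 does not change the estimate there, but it still reduces the number at risk for the later deaths.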

Death counts $y_{k}$ are independent Binomials: $y_{k} \overset{\text{ind}}{\sim} B (n_{k}, h_{k})$
Logistic regression: $\log \left( \frac{h_{k}}{1 - h_{k}} \right) = α^{'} x_{k}$
- E.g., cubic regression: $x_{k} = (1, k, k^{2}, k^{3})^{'}$
- E.g., cubic-linear spline: $x_{k} = (1, k, (k - k_{0})_{-}^{2}, (k - k_{0})_{-}^{3})^{'}$ where $x_{-} = x \cdot 1_{x \leq 0}$
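To make the spline basis concrete, the sketch below builds the vector $x_{k}$ for the cubic-linear spline and inverts the logit to recover $h_{k}$ from a given $α$. The function names, the knot $k_{0} = 11$, and the coefficient values are hypothetical. Fitting $α$ itself, for example by maximum likelihood for the Binomial model, is not shown.

```python
import math


def cubic_linear_basis(k, k0):
    """Return x_k = (1, k, (k - k0)_-^2, (k - k0)_-^3), with x_- = x if x <= 0 else 0."""
    neg = min(k - k0, 0)
    return [1.0, float(k), float(neg ** 2), float(neg ** 3)]


def hazard(alpha, x):
    """Invert the logit: h = 1 / (1 + exp(-alpha' x))."""
    eta = sum(a * xi for a, xi in zip(alpha, x))
    return 1.0 / (1.0 + math.exp(-eta))


k0 = 11                                  # hypothetical knot
print(cubic_linear_basis(9, k0))         # [1.0, 9.0, 4.0, -8.0]
print(cubic_linear_basis(13, k0))        # [1.0, 13.0, 0.0, 0.0]

alpha = [-3.0, 0.1, 0.0, 0.0]            # hypothetical coefficients
print(hazard(alpha, cubic_linear_basis(13, k0)))
```

Because the quadratic and cubic terms vanish for $k > k_{0}$, the logit is cubic in $k$ up to the knot and linear after it.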

Proportional hazards model assumes $h_{i} (t) = h_{0} (t) \cdot e^{x_{i}^{'} β},$ where $h_{0} (t)$ is a baseline hazard, which we don’t need to specify
Denote $θ_{i} = e^{x_{i}^{'} β}$, then $S_{i} (t) = \exp \left\{ - \int_{0}^{t} h_{0} (x) θ_{i} d x \right\} = S_{0} (t)^{θ_{i}},$ where $S_{0} (t)$ is the baseline survival function
- A larger value of $θ_{i}$ indicates a more quickly declining (i.e., worse) survival curve
- A positive coefficient $β_{j}$ indicates that an increase in the corresponding covariate $x_{j}$ is associated with worse survival

Let $J$ be the number of observed deaths, occurring at times $T_{(1)} < T_{(2)} < \dots < T_{(J)}$ assuming no ties
Just before time $T_{(j)}$ there is a risk set of individuals still under observation: $R_{j} = \{ i : t_{i} \geq T_{(j)} \}$
Key result of the proportional hazards model: given that one person dies at time $T_{(j)}$, the probability that it is person $i$, among the people at risk, is $P (i_{j} = i ∣ R_{j}) = \frac{e^{x_{i}^{'} β}}{\sum_{k \in R_{j}} e^{x_{k}^{'} β}} = \frac{θ_{i}}{\sum_{k \in R_{j}} θ_{k}}$

Estimation of $β$ maximizes the partial likelihood $L (β) = \prod_{j = 1}^{J} \frac{e^{x_{i_{j}}^{'} β}}{\sum_{k \in R_{j}} e^{x_{k}^{'} β}},$ where individual $i_{j}$ dies at time $T_{(j)}$
Semi-parametric: we do not need to specify the baseline $h_{0} (t)$ , since it is not contained in the objective function
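As an illustration of why the baseline hazard drops out, the sketch below evaluates the log of the partial likelihood for a given $β$ using only the observed times, the death indicators, and the covariates. It assumes no tied death times. The function name `log_partial_likelihood` and the three-person data set are hypothetical, and the maximization over $β$ is left to a numerical optimizer of the reader's choice.

```python
import math


def log_partial_likelihood(beta, times, deaths, covariates):
    """Log partial likelihood of the proportional hazards model (no tied death times).

    times[i]      observed time t_i
    deaths[i]     d_i (1 = death observed, 0 = censored)
    covariates[i] covariate vector x_i (list of floats)
    """
    scores = [sum(b * x for b, x in zip(beta, xi)) for xi in covariates]  # x_i' beta
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, deaths)):
        if d_i != 1:
            continue  # censored individuals enter only through the risk sets
        risk_set = [k for k, t_k in enumerate(times) if t_k >= t_i]
        denom = sum(math.exp(scores[k]) for k in risk_set)
        total += scores[i] - math.log(denom)
    return total


# Hypothetical data: three individuals, one covariate, the second one censored.
times = [2.0, 4.0, 6.0]
deaths = [1, 0, 1]
covariates = [[1.0], [0.0], [2.0]]

# With beta = 0 every person in a risk set is equally likely to be the one who dies:
# the first death has probability 1/3, the second has probability 1.
print(log_partial_likelihood([0.0], times, deaths, covariates))   # equals -log(3)
```

No term involving $h_{0} (t)$ appears anywhere in the function, which is the sense in which the model is semi-parametric.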

Efron, Bradley and Hastie, Trevor (2016), Computer Age Statistical Inference. Cambridge University Press