A Complete Guide To Survival Analysis In Python, part 1
This three-part series reviews, with step-by-step explanations and code, how to perform statistical survival analysis, which is used to investigate the time some event takes to occur, such as patient survival during the COVID-19 pandemic, the time to failure of engineering products, or even the time to closing a sale after an initial customer contact.
By Pratik Shukla, Aspiring machine learning engineer.
Survival Analysis Basics
Survival analysis is a set of statistical approaches used to study the time until some event of interest (often referred to as death) occurs. Time could be measured in years, months, weeks, days, etc. The event of interest could be almost anything: an actual death, a birth, a retirement, etc.
How can it be useful for analyzing data from the ongoing COVID-19 pandemic?
(1) We can find the number of days until patients showed COVID-19 symptoms.
(2) We can find which age group it is deadliest for.
(3) We can find which treatment has the highest survival probability.
(4) We can find whether a person’s sex has a significant effect on their survival time.
(5) We can also find the median number of days of survival for patients.
We are going to perform a thorough analysis of patients with lung cancer. Don’t worry: once you understand the logic behind it, you’ll be able to apply it to any data set. Exciting, isn’t it?
Survival analysis is used in a variety of fields, such as:
- Cancer studies for analyses of patients’ survival time.
- Sociology for “event-history analysis”.
- Engineering for “failure-time analysis”.
- Time until product failure.
- Time until a warranty claim.
- Time until a process reaches a critical level.
- Time from initial sales contact to a sale.
- Time from employee hire to either termination or quit.
- Time from a salesperson hire to their first sale.
In cancer studies, typical research questions include:
(1) What is the impact of certain clinical characteristics on a patient’s survival? For example, is there any difference in survival between people who have higher blood sugar and those who don’t?
(2) What is the probability that an individual survives a specific period (years, months, days)? For example, given a set of cancer patients, we will be able to say that the probability of a patient still being alive 300 days after diagnosis is 0.7 (both numbers are arbitrary).
(3) Are there differences in survival between groups of patients? For example, let’s say there are 2 groups of people diagnosed with cancer. Those 2 groups were given 2 different kinds of treatments. Now our goal here will be to find out if there is a significant difference between the survival time for those 2 different groups based on the treatment they were given.
In cancer studies, most of the survival analyses use the following methods.
(1) Kaplan-Meier plots to visualize survival curves.
(2) Nelson-Aalen plots to visualize the cumulative hazard.
(3) Log-rank test to compare the survival curves of two or more groups.
(4) Cox proportional hazards regression to find out the effect of different variables like age, sex, weight on survival.
Here, we start by defining fundamental terms of survival analysis, including:
- Survival time and event.
- Censoring of data.
- Survival function and hazard function.
Survival time and type of events in cancer studies
Survival Time: the length of time for which a subject is alive or actively participating in the study, that is, the time until the event of interest occurs or follow-up ends.
There are mainly three types of events, including:
(1) Relapse: a deterioration in someone’s state of health after a temporary improvement.
(2) Progression: the process of developing or moving gradually towards a more advanced state. In cancer studies, this means the disease getting worse, not an improvement in health.
(3) Death: the destruction or permanent end of something.
As mentioned above, survival analysis focuses on the occurrence of an event of interest (e.g., birth, death, retirement). But there is still a possibility that the event may not be observed for various reasons. Such observations are known as censored observations.
Censoring may arise in the following ways:
- A patient has not (yet) experienced the event of interest (death or relapse in our case) within the study period.
- A patient is not followed anymore.
- If a patient moves to another city, then follow-up might not be possible for the hospital staff.
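This type of censoring, called right censoring, is the kind most commonly handled in survival analysis.
There are three general types of censoring: right censoring, left censoring, and interval censoring.
Right Censoring: The event has not been observed by the time follow-up ends, so we only know that the true survival time is longer than the observed time. (e.g., a patient who is still alive at the end of the study, or who moved away during it.)
Left Censoring: The event occurred before observation started, so we only know that the true time is shorter than the observed time. (e.g., number of days from birth when the kid started walking, if the kid was already walking when the experiment started.)
Interval Censoring: We only know that the event occurred somewhere within an interval, for example between two follow-up visits, but not the exact time.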
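To make right censoring concrete, here is a minimal sketch of how such observations are usually encoded: each subject gets a duration and a flag saying whether the event was actually observed. The patients, the dates, the study end date, and the helper name to_observation are all hypothetical and exist only for illustration.

```python
from datetime import date

STUDY_END = date(2020, 12, 31)  # hypothetical end of follow-up

# Hypothetical patients: "died" is None if no death was observed,
# "lost" is the last contact date for patients lost to follow-up.
patients = [
    {"id": "A", "enrolled": date(2020, 1, 10), "died": date(2020, 6, 8), "lost": None},
    {"id": "B", "enrolled": date(2020, 2, 1), "died": None, "lost": None},
    {"id": "C", "enrolled": date(2020, 3, 15), "died": None, "lost": date(2020, 7, 13)},
]


def to_observation(patient, study_end=STUDY_END):
    """Return (duration_in_days, event_observed) for one patient."""
    if patient["died"] is not None:
        # Event observed: duration is the true survival time.
        return (patient["died"] - patient["enrolled"]).days, True
    # Right-censored: we only know the patient survived at least this long.
    last_seen = patient["lost"] if patient["lost"] is not None else study_end
    return (last_seen - patient["enrolled"]).days, False


observations = {p["id"]: to_observation(p) for p in patients}
for pid, (duration, observed) in observations.items():
    print(pid, duration, "event" if observed else "censored")
```

Patient A contributes an observed survival time, while B (alive at the end of the study) and C (lost to follow-up) contribute only a lower bound on their survival time.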
Survival and hazard functions
We generally use two related probabilities to analyse survival data.
(1) The survival probability
(2) The hazard probability
To find the survival probability, we’ll be using the survivor function S(t), which we will estimate with the Kaplan-Meier estimator. The survival probability is the probability that an individual (e.g., a patient) survives from the time origin (e.g., diagnosis of cancer) to a specified future time t. For example, S(200) = 0.7 means that the probability of a patient still being alive 200 days after the diagnosis of cancer is 0.7. If the person is still alive at the end of the experiment, then that person’s data will be censored.
The hazard probability, denoted by h(t), is the probability that an individual (e.g., a patient) who is still under observation at time t has the event (e.g., death) at that time. For example, h(200) = 0.7 means that, for a patient who has survived up to day 200, the probability of dying at t = 200 days is 0.7.
Note that, in contrast to the survivor function, which focuses on not having an event, the hazard function focuses on the event occurring. A higher survival probability and a lower hazard probability are both good for the patient.
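The link between the two probabilities can be sketched with a small discrete-time example. The counts below are made up, and the example assumes no censoring, so everyone who has not died stays at risk. The hazard in each period is the number of deaths divided by the number at risk, and the survival probability is the running product of (1 - hazard).

```python
# Hypothetical counts per period (no censoring assumed):
# patients at risk at the start of each period and deaths during it.
at_risk = [100, 90, 72]
deaths = [10, 18, 18]

# Hazard: probability of dying in a period, given survival up to that period.
hazard = [d / n for d, n in zip(deaths, at_risk)]

# Survival: probability of surviving every period so far.
survival = []
s = 1.0
for h in hazard:
    s *= 1.0 - h
    survival.append(s)

print([round(h, 2) for h in hazard])
print([round(x, 2) for x in survival])
```

With these numbers, the survival probability after the third period is 0.54, which matches the 54 of 100 patients still alive, even though the hazard in that period is only 0.25.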
Let’s move forward to the cool coding part!
You can download the dataset from here.
The Kaplan–Meier estimator is a non-parametric statistic used to estimate the survival function (probability of a person surviving) from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. For example, calculating the amount of time (years, months, days) a patient lived after being diagnosed with cancer or after starting treatment. The estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the Journal of the American Statistical Association.
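The formula for Kaplan-Meier is as follows:
The probability at time t_i, S(t_i), is calculated as
$$S(t_i) = S(t_{i-1}) \left(1 - \frac{d_i}{n_i}\right)$$
where S(t_{i-1}) is the probability of being alive at t_{i-1}, n_i is the number of patients alive just before t_i, d_i is the number of events at t_i, and S(t_0) = 1 at t_0 = 0.
We can also write it as
$$S(t_i) = S(t_{i-1}) \cdot \frac{n_i - d_i}{n_i}$$
In a more generalized way, we can say that,
$$S(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$
[Figure: Survival function simplified.]
For example, we can say that, if 10 patients are at risk just before the first event time t_1 and 2 of them die at t_1 (illustrative numbers), then S(t_1) = 1 × (1 − 2/10) = 0.8.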
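The product-limit calculation can be sketched in plain Python before we move to a library. At each distinct event time, the running survival estimate is multiplied by (1 - d/n), where d is the number of events at that time and n is the number of subjects at risk just before it. Censored subjects leave the risk set without causing a drop in the curve. The data and the function names kaplan_meier and median_survival are hypothetical, and this is a minimal illustration rather than a replacement for a tested library.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_estimate)] at each distinct event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # deaths and censorings both leave the risk set
    n_at_risk = len(durations)
    s = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Subjects censored at t still count as at risk at t.
            s *= 1.0 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= leaving[t]
    return curve


def median_survival(curve):
    """First time at which the estimate is 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical data: durations in days; True = death observed, False = censored.
durations = [5, 8, 8, 12, 15, 20]
events = [True, True, False, True, False, True]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 4))
print("median survival:", median_survival(curve))
```

In this data the curve drops only at days 5, 8, 12, and 20. The censored subjects at days 8 and 15 shrink the risk set, which makes the later drops larger than they would be if those subjects were still being followed.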
In the next article, we’ll implement the Kaplan-Meier fitter and the Nelson-Aalen fitter using Python.
At the end of this three-part series, you’ll be able to plot survival curves from which we can draw conclusions about the survival of a patient. Hang tight!
Original. Reposted with permission.
Bio: Pratik Shukla is an aspiring machine learning engineer who loves to put complex theories in simple ways. Pratik pursued his undergraduate in computer science and is going for a master's program in computer science at University of Southern California. “Shoot for the moon. Even if you miss it you will land among the stars. -- Les Brown”