*Contributed by: Patrick*

- What is the Kaplan-Meier curve?
- Survival Analysis
- Goals of Survival Analysis
- Basics of Kaplan Meier curve
- Censored data
- Understanding the KM Analysis
- KM Analysis using R
- Decoding the KM curve and Analysis
- Conclusion

In everyday life, we come across many time-to-event examples. What does time-to-event mean? It is a duration variable recorded for each case/subject of interest, with a beginning and an end anywhere along the timeline of the complete study. Some common examples are a clinical study for a drug, wickets falling in an innings of a cricket match, and overhauling a machine before it is decommissioned. Did you notice something in common among the examples? Yes, each one is a study of survival. One effective way to estimate the survival function is KM analysis. The Kaplan-Meier estimator is a statistic used to estimate the survival function, and the Kaplan-Meier curve is its visual representation: it shows the probability of surviving (that is, of not yet having experienced the event) beyond each point in time. The curve should approach the true survival function for the population under investigation, provided the sample size is large enough.

In this article, let’s see in detail what KM analysis is, how the Kaplan-Meier curve is built, and the math behind calculating the probabilities of survival. Before diving into the KM analysis, we will take a brief look at what survival analysis is and the basic notation used in the analysis. Learning about Statistical Methods for Decision Making can help you understand the concepts more clearly.

**What is the Kaplan-Meier curve?**

The Kaplan-Meier curve is a graphical representation of the survival function. The curve is named after Edward L. Kaplan and Paul Meier, who published the technique in 1958. It is a non-parametric estimate of the survival function, meaning that it does not make any assumptions about the underlying distribution of the survival times. The Kaplan-Meier curve is used to estimate the survival function from data that include censored observations, i.e. subjects whose survival time is only partly known. It shows the probability that a subject will survive beyond time t. The curve is constructed by plotting the estimated survival probability against time.
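For reference, the estimator behind the curve is usually written as a product over the event times. The notation below is the common textbook convention rather than something defined in this article: $t_i$ are the distinct times at which at least one event occurs, $d_i$ is the number of events at $t_i$, and $n_i$ is the number of subjects still at risk just before $t_i$.

$$
\hat{S}(t) \;=\; \prod_{i:\; t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
$$

Each factor is the probability of surviving one interval given survival up to its start, so the product is the cumulative probability of surviving beyond $t$.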

**Survival Analysis**

Survival analysis is a statistical procedure for data analysis in which the outcome variable of interest is the time until an event occurs. The time can be any calendar time such as years, months, weeks or days from the beginning of follow-up until an event occurs. By event, we mean recovery, death, breakdown of a machine, wickets in an innings or any designated experience of interest that may happen to the case/subject.

**Goals of Survival Analysis**

Survival analysis has three goals to be addressed:

- To estimate and interpret survivor and/or hazard functions from survival data
- To compare survivor and/or hazard functions
- To assess the relationship of explanatory variables to survival time

We hope you now have a picture of what survival analysis is and what its goals are. Next, we shall look at the notation used in the analysis and a basic interpretation of the KM curve (a detailed explanation follows later).

**Basics of Kaplan Meier curve**

When using Kaplan Meier analysis, we should concentrate on three variables:

- Serial time of the subject
- Their status at the end of their serial time (event occurrence or censored)
- The group of study they belong to

The serial times of the individual subjects should be arranged from the shortest to the longest, regardless of when the subjects entered the study. A serial time of known survival is terminated by the event of interest, and the span between one event and the next is known as an interval. Only an occurrence of the event defines a known survival time interval; censored subjects do not terminate an interval. At the end of a subject's serial time, one of two things can have happened:

1. A subject can have the event of interest.

2. They are censored. We defined what an event is just above; next, we define what censored data is.


**Censored data**

Put simply, data is censored when the information about a subject’s survival time is incomplete. This is a problem that most survival analyses suffer from. It can happen when something negative for the study happens, such as:

- A person does not experience the event before the study ends
- A person is lost to follow up during the study period
- A person withdraws from the study because of some reason
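As a minimal sketch of how these cases translate into the variables described earlier, the base R snippet below turns a few follow-up records into a serial time and a status code. The records, column names and exit reasons are all hypothetical and exist only for illustration.

```r
# Hypothetical follow-up records (illustrative only, not the article's data).
# enroll_month and exit_month are months since the study opened.
records <- data.frame(
  id           = 1:5,
  enroll_month = c(0, 2, 5, 1, 8),
  exit_month   = c(14, 11, 30, 36, 20),
  exit_reason  = c("event", "lost_to_follow_up", "event",
                   "study_ended", "withdrew"),
  stringsAsFactors = FALSE
)

# Serial time is measured from each subject's own entry, not from the calendar.
records$serial_time <- records$exit_month - records$enroll_month

# Status: 1 = event of interest observed, 0 = censored for any other reason.
records$status <- as.integer(records$exit_reason == "event")

# Arrange from the shortest to the longest serial time.
records <- records[order(records$serial_time), ]
print(records[, c("id", "serial_time", "status")])
```

All three censoring reasons end up with the same status code of 0; the analysis only needs to know that the event was not observed within the recorded serial time.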

**Understanding the KM Analysis**

With the theory behind KM analysis covered, we can move on to the creation and interpretation of the KM curve.

For this, let’s consider an example where a drug is being tested on two groups of people (male and female). There are six subjects in each group (for ease of understanding). The serial time and the status at the serial time are given in the table below. A status of 1 at the serial time means the event occurred, and 0 means the subject was censored. The objective is to find the cumulative probability of survival and to find out whether there is any significant difference in the effect of the drug between the groups.

As discussed earlier, the basic elements required for the analysis are the serial time, the status at the serial time, and the group to which the subject belongs. The data are entered in a table and sorted by ascending serial time, beginning with the shortest time in each group. Notice that each group has one censored subject. In the male group, the subject is censored at the end of the trial; in the female group, the subject was censored within the study timeline.

After constructing the table, we can use any statistical tool, such as SPSS, SigmaPlot, R or Excel, to plot the KM curve. First, let us see how to plot the KM curve and analyse the results with R, and then take a quick look at the statistics and calculations behind the survival probabilities.

**KM Analysis using R**

**Step 1:** The packages used for the analysis are **survival** and **survminer**. Use install.packages( ) to install these packages if they are not already installed in your R environment.

**Step 2:** The next step is to load the dataset and examine its structure. The data we will use for this analysis is the same as shown above. It is saved as a CSV file, which is then imported into R for the analysis.
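A minimal sketch of this step is shown below. The numbers are stand-in values invented for illustration (they are not the article's table), chosen only to match its shape: six subjects per group and one censored subject in each. The column names and the temporary file are likewise assumptions.

```r
# Stand-in data (hypothetical, NOT the article's table): six subjects per
# group, with one censored subject (status 0) in each group.
km_data <- data.frame(
  time   = c(14, 18, 22, 27, 31, 36,  5,  9, 12, 16, 20, 25),
  status = c( 1,  1,  1,  1,  1,  0,  1,  1,  0,  1,  1,  1),
  group  = rep(c("male", "female"), each = 6)
)

# Write to a temporary CSV so the import step can be run end to end.
csv_path <- tempfile(fileext = ".csv")
write.csv(km_data, csv_path, row.names = FALSE)

# Import the file and examine its structure.
km_data <- read.csv(csv_path)
str(km_data)

# Count events (1) and censored subjects (0) per group.
print(table(km_data$group, km_data$status))
```

With a real study, `csv_path` would simply point at the saved CSV file instead of a temporary one.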

**Step 3:** After this, we are ready to create the survival object using the function **Surv** of the survival package. The result is stored in surv_object. A survival object is basically a combined representation of the serial time and the status. A + sign after a survival time indicates a censored data point.
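The sketch below builds such an object from a small hypothetical data set (invented values, not the article's table). The name surv_object follows the text; the data frame and column names are assumptions.

```r
library(survival)

# Hypothetical stand-in data: 1 = event, 0 = censored.
km_data <- data.frame(
  time   = c(14, 18, 22, 27, 31, 36,  5,  9, 12, 16, 20, 25),
  status = c( 1,  1,  1,  1,  1,  0,  1,  1,  0,  1,  1,  1),
  group  = rep(c("male", "female"), each = 6)
)

# Combine serial time and status into one survival object.
surv_object <- Surv(time = km_data$time, event = km_data$status)

# Censored observations are printed with a trailing "+".
print(surv_object)
```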

**Step 4:** The next step is to fit the Kaplan-Meier curves. To do this, we fit the survival function using the survival object and the group of interest. This fitting can be done using the **survfit** function of the survival package. The survival object created in the previous step is modelled as a function of the group we have considered for the analysis.

The summary of the resulting fit_1 object shows, among other things, the survival times and the proportion of surviving patients at every time point.

The table below is the output of the survival analysis. For both groups, it shows the time at which each event took place, the number of subjects at risk at each event time, the cumulative survival probability, the standard error associated with each probability, and its lower and upper 95% confidence limits (the calculations behind the table are discussed later in this article).
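A sketch of the fit and its summary, again on hypothetical stand-in data rather than the article's table, could look like this. The name fit_1 follows the text; the data frame and its column names are assumptions.

```r
library(survival)

# Hypothetical stand-in data: 1 = event, 0 = censored.
km_data <- data.frame(
  time   = c(14, 18, 22, 27, 31, 36,  5,  9, 12, 16, 20, 25),
  status = c( 1,  1,  1,  1,  1,  0,  1,  1,  0,  1,  1,  1),
  group  = rep(c("male", "female"), each = 6)
)

# Fit one Kaplan-Meier curve per group.
fit_1 <- survfit(Surv(time, status) ~ group, data = km_data)

# One row per event time and group: number at risk, number of events,
# cumulative survival, standard error and 95% confidence limits.
fit_summary <- summary(fit_1)
print(fit_summary)
```

Censored times do not get a row of their own in this summary; they only reduce the number at risk in the rows that follow.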

**Step 5:** After the above step, it is time to plot the KM curve. The corresponding survival curves can be examined by passing the fitted object to the **ggsurvplot()** function with **pval = TRUE**. This argument is very useful because it also plots the p-value of a log-rank test, which helps us judge whether the groups are significantly different or not.
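The sketch below runs the log-rank test explicitly with survdiff() and then draws the curves. It uses hypothetical stand-in data, so its p-value will not match the article's figure. The fallback to the base plot() method when survminer is not installed is an addition of this sketch, not part of the original workflow.

```r
library(survival)

# Hypothetical stand-in data: 1 = event, 0 = censored.
km_data <- data.frame(
  time   = c(14, 18, 22, 27, 31, 36,  5,  9, 12, 16, 20, 25),
  status = c( 1,  1,  1,  1,  1,  0,  1,  1,  0,  1,  1,  1),
  group  = rep(c("male", "female"), each = 6)
)

fit_1 <- survfit(Surv(time, status) ~ group, data = km_data)

# Log-rank test comparing the two groups.
lr <- survdiff(Surv(time, status) ~ group, data = km_data)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(lr)

if (requireNamespace("survminer", quietly = TRUE)) {
  # ggsurvplot() needs the data frame used for the fit when pval = TRUE.
  km_plot <- survminer::ggsurvplot(fit_1, data = km_data, pval = TRUE)
  print(km_plot)
} else {
  # Fallback: base graphics method provided by the survival package.
  plot(fit_1, lty = 1:2, xlab = "Serial time",
       ylab = "Cumulative survival probability")
  legend("topright", legend = names(fit_1$strata), lty = 1:2)
}
```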

In table 2, it can be seen that the last subject of the female group has no cumulative probability of survival assigned to it, and there is a sudden drop in the probability for the third subject. In the other group, the last subject has a probability associated with it, and the fall in probability is slightly smaller than in the female group. This is because one subject in the female group was censored in the middle of the study (after the second event). Censoring removed that subject from the number at risk, so each later event caused a proportionally larger drop; that is why the probability fell steeply after the second event and why no subject was left at the end of the curve. In the male group, the only censored subject is at the end, and hence the curve does not drop to zero.

I know this is a little confusing, but worry not: the next section clears it up.

**Decoding the KM curve and Analysis**

Look at the KM curve in the figure. The survival duration of a subject is represented by the length of the horizontal lines along the X-axis of serial times. The occurrence of the event terminates the interval. The vertical lines mark the event of interest happening, and the vertical distances between the horizontal lines are important because they show the change in the cumulative probability of surviving a given time, as read on the Y-axis. For example, if you belong to the male group, your probability of surviving 11 months is 100% (x-axis in years); conversely, if you are in the other group, your probability of surviving the same time is slightly more than 66%. The steepness of the curve is determined by the survival durations.

Looking at the censored subjects, the one subject that was censored in the female group materially reduced the cumulative survival in the intervals that followed. In contrast, the terminally censored subject in the male group did not change the survival probability, and the final interval was not terminated by an event.

The table above shows what happens behind the production of the KM curve. When the above table is cross-referenced with the KM curve, it is evident that intervals and the attendant probabilities are only constructed for events of interest and not for censored subjects. Because an event ends one interval and begins another interval, there should be more intervals than events.

The table explains the way the curves end. In the male group, the curve ends without creating another interval below it. The cumulative probability of surviving this long is given by the last horizontal line, the sixth interval, and is 0.166. In the other group, the curve drops to zero after the fifth interval, so the horizontal line of the sixth interval lies on the X-axis.

Looking at the probabilities of survival, it can be a little confusing that there are two of them: the cumulative probability and the interval probability. The cumulative probability is the probability of survival at the beginning of, and throughout, an interval. This is what is graphed along the Y-axis of the curve. The interval survival rate is the probability of surviving past the interval, i.e. of still surviving after the interval and beginning the next one. The cumulative probability for an interval is the product of the interval survival rates of all the intervals before it.
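To make the two probabilities concrete, here is a hand calculation in base R on one hypothetical group of six subjects (invented values, not the article's table). The helper name km_table and its column names are made up for this sketch.

```r
# Hand-rolled Kaplan-Meier table; a sketch, not a replacement for survfit().
km_table <- function(time, status) {
  # Rows are created only for times at which an event occurred.
  event_times <- sort(unique(time[status == 1]))
  # Subjects still at risk just before each event time (the denominator).
  n_risk  <- sapply(event_times, function(t) sum(time >= t))
  # Events observed at each event time.
  n_event <- sapply(event_times, function(t) sum(time == t & status == 1))
  # Probability of surviving the event time, given being at risk at it.
  interval_survival <- 1 - n_event / n_risk
  data.frame(
    time = event_times,
    n_risk = n_risk,
    n_event = n_event,
    interval_survival = interval_survival,
    # Running product: cumulative survival just after each event.
    cumulative_survival = cumprod(interval_survival)
  )
}

# Hypothetical group: the subject with serial time 12 is censored.
female <- km_table(time   = c(5, 9, 12, 16, 20, 25),
                   status = c(1, 1,  0,  1,  1,  1))
print(round(female, 3))
```

In this made-up group, the number at risk falls from 5 to 3 between the second and third events because one subject was censored in between, so the third event multiplies the cumulative probability by 2/3 rather than 3/4.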

Censoring affects survival rates. Censored observations that coincide with an event are usually considered to fall immediately after the event. Censoring removes the subject from the denominator, i.e., the individuals still at risk. For example, in Group 2, three subjects survived interval four and would have been available to be at risk in interval five. However, one of them was censored during interval four; therefore, only two were left to be at risk in interval five. As seen in Table II, the denominator went from four in interval four to two in interval five.
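The tie convention can be illustrated with a few lines of base R. The serial times below are hypothetical and were chosen so that one event and one censoring share the same time.

```r
# Hypothetical serial times: an event and a censoring both occur at time 9.
time   <- c(5, 9, 9, 16, 20)
status <- c(1, 1, 0,  1,  1)   # 1 = event, 0 = censored

event_times <- sort(unique(time[status == 1]))

# Using time >= t keeps the subject censored at t in the risk set for the
# event at t, i.e. the censoring is treated as falling just after the event.
n_risk <- sapply(event_times, function(t) sum(time >= t))

print(data.frame(event_time = event_times, n_risk = n_risk))
```

The subject censored at time 9 still counts in the denominator for the event at time 9 (four at risk) and is dropped afterwards, so the next denominator is two rather than three.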


**Conclusion**

Thus, we calculated the cumulative survival probabilities for two different groups. Although the male group appears to have a greater probability of survival than the female group, the log-rank test’s p-value of 0.19 tells us that the difference between the groups is not statistically significant. The null hypothesis is that there is no difference in survival between the groups, and the alternative hypothesis is that the survival of the groups differs. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This brings us to the end of the blog on the Kaplan Meier Curve. We hope you enjoyed it.