*Contributed by: Netali Agrawal. LinkedIn Profile: https://www.linkedin.com/in/netali-agrawal-31192a71/*

The Chi-Square test is also known as the “goodness of fit” test. But the questions that first come to mind on hearing this name are:

- What is the Chi-Square test?
- When to use the Chi-Square test?
- What should be the data format for the Chi-Square test?
- How to perform the Chi-square test?

These are just a handful of questions; the full list is much longer. This article answers all of the above and covers several other aspects of the Chi-Square test. Before that, let’s cover a few basics of hypothesis testing. Why do we need hypothesis testing in an article about the Chi-Square test? Because the Chi-Square test is itself a hypothesis test.

**Basics of Hypothesis Testing**

**Hypothesis Testing** is used to draw conclusions about a population from sample data. It helps decide which of two mutually exclusive statements about the population is better supported by the sample data.

**Null Hypothesis (H0)** – It is a statement that is commonly accepted or considered to be the status quo. It assumes that the observed result is due to chance alone. It is denoted by H0.

**Alternate Hypothesis (H1 or Ha)** – This is the statement that contradicts the Null Hypothesis; the two are mutually exclusive. If the Null Hypothesis is the commonly accepted position, the Alternate Hypothesis is the claim that the sample data shows a real effect rather than chance. It is denoted by H1 or Ha.

There are various types of hypothesis tests. To name a few, there are the z-test, one-sample t-test, paired t-test, two-sample t-test, ANOVA, and many more. All of these are parametric tests of means and variances. Alongside them, there is one more test, which we are going to understand in detail: the Chi-Square test. Unlike the tests above, it works on counts of categorical data.

**What is a Chi-Square Test?**

The Chi-Square test checks how well the observed values fit the distribution we would expect if the variables were independent. In other words, it measures how good the fit is between the observed counts and the counts expected under independence for the same data. This is why it is also known as the **“goodness of fit”** test. Strictly speaking, “goodness of fit” also names a related Chi-Square test that compares a single categorical variable against a hypothesized distribution; this article works through the test of independence between two categorical variables.

From the above statement, we can see why this is a hypothesis test: the data will support one of two mutually exclusive statements. The null hypothesis says that the data fits the distribution expected under independence, which means the observed data is not biased and the variables are independent. The alternate hypothesis says that the observed data deviates from that distribution, and thus the data is biased or the variables are dependent. We will discuss this in detail in a later section.

**When to use the Chi-Square test?**

The Chi-Square test is designed for a specific type of data: categorical variables. This means the test cannot be applied directly to continuous data. To apply it to a continuous variable, the data must first be divided into buckets, and the frequency (count) for each bucket must be provided. Let’s understand the difference between categorical and continuous data types.

- Continuous Data Type – Continuous data can take infinitely many numeric values between any two values. For example, salary, time.
- Categorical Data Type – Categorical data contains a finite set of distinct categories or groups. For example, gender, marital status.

If continuous data needs to be segregated into buckets/categories, create the categories with great care. If the categories are not chosen carefully, the test results might not make any sense. The Chi-Square test will tell you whether the data is consistent with independent variables, but it will not tell you whether the categories you created or chose are appropriate.

Let’s consider a scenario. Assume an app rates every restaurant under one of 3 categories: good, okay, and not recommended. We also want to group the restaurants by size, and the size categories can be derived from the seating capacity of each restaurant. This is how the table would look:

| Rating | Small | Medium | Large |
|---|---|---|---|
| Good | 30 | 10 | 20 |
| Okay | 8 | 10 | 12 |
| Not Recommended | 3 | 5 | 2 |
| Total | 41 | 25 | 34 |

Small is for a restaurant with a seating capacity of up to 20 people, medium is for a seating capacity of up to 100 people, and large is for a seating capacity of more than 100 people.

Here we changed continuous data into categorical data. Be very careful when doing so; otherwise, the conclusions from the test may be misleading.
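The bucketing step described above can be sketched in a few lines of Python. This is only an illustration: the function name `size_category`, the exact boundaries (up to 20, up to 100, above 100), and the sample capacities are assumptions, not data from the restaurant example.

```python
from collections import Counter


def size_category(seating_capacity: int) -> str:
    """Map a seating capacity to a size bucket.

    The cut-offs (<= 20, <= 100, > 100) are an assumed reading of the
    size definitions; adjust them to your own data.
    """
    if seating_capacity <= 20:
        return "Small"
    if seating_capacity <= 100:
        return "Medium"
    return "Large"


# Made-up capacities, used only to demonstrate the bucketing.
capacities = [12, 18, 45, 100, 101, 250, 20, 60]

# Count how many restaurants fall into each bucket.
bucket_counts = Counter(size_category(c) for c in capacities)
print(dict(bucket_counts))
```

Once every restaurant has a size bucket, counting restaurants per (rating, size) pair gives the kind of frequency table the test needs.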

For the case above, the null and alternate hypotheses can be formulated as follows:

Null Hypothesis: Ratings for restaurants are independent of the size of a restaurant. In simple terms, the 2 variables, rating and size, are independent.

Alternate Hypothesis: The rating and size of a restaurant depend on one another; that is, one variable influences the other.

The test compares the observed data to a model that distributes the data under the assumption that the variables are independent. The worse the observed data fits this model, the stronger the evidence that the variables are dependent. If the fit is poor enough, we reject the null hypothesis.


**What should be the data format for the Chi-Square test?**

We know the data type should be categorical, but in what format should the data be fed as input to this test? The answer has already been shown above: the data should be in tabular format, with a count in each cell. Examples are often given as a 2×2 grid, but the grid can be any size (3×2, 4×4, 8×3, and so on) as long as it meets these 2 criteria: it is tabular and the data is properly categorized.

| Rating | Small | Medium | Large |
|---|---|---|---|
| Good | 30 | 10 | 20 |
| Okay | 8 | 10 | 12 |
| Not Recommended | 3 | 5 | 2 |
| Total | 41 | 25 | 34 |

Although this data is in tabular format, it is incomplete for the test. Along with the counts for each category, the total count of each row and column should also be provided, as well as the grand total for the whole dataset:

| Rating | Small | Medium | Large | Total |
|---|---|---|---|---|
| Good | 30 | 10 | 20 | 60 |
| Okay | 8 | 10 | 12 | 30 |
| Not Recommended | 3 | 5 | 2 | 10 |
| Total | 41 | 25 | 34 | 100 |

We now have a complete dataset on the distribution of 100 restaurants based on category for rating (good/okay/not recommended) and restaurant size category (small/medium/large). A Chi-Square test could be performed on this data to check if the rating and size of the restaurant are completely independent or they are influencing one another.
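The row, column, and grand totals can be derived from the raw counts rather than typed in by hand. The following is a minimal sketch in plain Python; the variable names are illustrative choices.

```python
# Observed counts: rows are ratings, columns are sizes.
ratings = ["Good", "Okay", "Not Recommended"]
sizes = ["Small", "Medium", "Large"]
observed = [
    [30, 10, 20],  # Good
    [8, 10, 12],   # Okay
    [3, 5, 2],     # Not Recommended
]

# Sum across each row, down each column, and over the whole table.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

print(dict(zip(ratings, row_totals)))  # {'Good': 60, 'Okay': 30, 'Not Recommended': 10}
print(dict(zip(sizes, col_totals)))    # {'Small': 41, 'Medium': 25, 'Large': 34}
print(grand_total)                     # 100
```

A quick sanity check is that the row totals and the column totals both add up to the same grand total.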

**How to perform a Chi-Square test?**

Finally, we are here. So far, we have understood what the Chi-Square test is and what its input should be. Let’s now shift our focus to performing the test. Here is what is required:

- Observed values
- Estimated values

These are the components required for the first part of the test, which is to calculate the Chi-Square statistic. We have been talking about observed values for a while; the table below shows them again for the problem at hand. Observed values are denoted by **O**.

| Rating | Small | Medium | Large | Total |
|---|---|---|---|---|
| Good | 30 | 10 | 20 | 60 |
| Okay | 8 | 10 | 12 | 30 |
| Not Recommended | 3 | 5 | 2 | 10 |
| Total | 41 | 25 | 34 | 100 |

The question now is how to get the estimated values. Fortunately, the formula is simple and straightforward. The estimated value, more commonly called the expected value, is denoted by **E**.

The estimated value for each cell is the total for its row multiplied by the total for its column, then divided by the total for the table, or simply:

*Estimated value in each cell = (Row total × Column total) / Table total*

So, for the above table, the estimated value for cell (1,1) is (60 × 41)/100, or 24.6. Because this is an estimate, it does not need to be a whole number.

The estimated value for every other cell is calculated the same way. This is what the estimated table looks like:

| Rating | Small | Medium | Large | Total |
|---|---|---|---|---|
| Good | 24.6 | 15 | 20.4 | 60 |
| Okay | 12.3 | 7.5 | 10.2 | 30 |
| Not Recommended | 4.1 | 2.5 | 3.4 | 10 |
| Total | 41 | 25 | 34 | 100 |
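The estimated table above can be reproduced with a short Python sketch that applies the row total × column total / table total rule to every cell. The variable names are illustrative.

```python
# Observed counts: rows are ratings (Good, Okay, Not Recommended),
# columns are sizes (Small, Medium, Large).
observed = [
    [30, 10, 20],
    [8, 10, 12],
    [3, 5, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Estimated (expected) value per cell: row total * column total / table total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for row in expected:
    print([round(e, 1) for e in row])
```

Note that each row and column of the estimated table adds up to the same totals as the observed table, which is a useful check on the arithmetic.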

Let’s put the observed and estimated values in one table, with the estimated value in parentheses. This gives a more compact view and makes the two easier to compare.

| Rating | Small | Medium | Large | Total |
|---|---|---|---|---|
| Good | 30 (24.6) | 10 (15) | 20 (20.4) | 60 |
| Okay | 8 (12.3) | 10 (7.5) | 12 (10.2) | 30 |
| Not Recommended | 3 (4.1) | 5 (2.5) | 2 (3.4) | 10 |
| Total | 41 | 25 | 34 | 100 |

In the above table, the observed and estimated values look fairly close, so can we straight away say that the null hypothesis holds and the variables are independent of each other? As a Data Scientist, I won’t conclude that so easily without calculating further.

So, let’s put our data to the test. Do we have a tool for this? Yes, we can do it in Excel or any other tool we use for statistical modeling. Before jumping to tools, we will understand the formula and do a pen-and-paper calculation. The same formula is applied behind the scenes in any tool:

*χ² = Σ (O − E)² / E*

Let’s understand this formula first. We have already seen O and E, but to reiterate: O stands for the observed value and E stands for the estimated value (the one we calculated above). We subtract the estimated value from the observed value to get the residual, or error; it measures how far the observed value deviates from the estimated one. The residual is squared so that positive and negative deviations do not cancel out. Squaring inflates the scale of the values, so to normalize them we divide by the estimated value. Why the summation? It means you do this for every cell and then add up the results to get the Chi-Square statistic. The Greek letter χ is “chi”, and since the test is named Chi-Squared, the statistic calculated with this formula is written χ².

Using this formula, we calculate the Chi-Square value for the example above as ((30-24.6)^2/24.6) + ((10-15)^2/15) + ((20-20.4)^2/20.4) + ((8-12.3)^2/12.3) + ((10-7.5)^2/7.5) + ((12-10.2)^2/10.2) + ((3-4.1)^2/4.1) + ((5-2.5)^2/2.5) + ((2-3.4)^2/3.4), which comes out to approximately 8.8857; we use 8.88 in what follows.
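The same pen-and-paper sum can be checked with a few lines of Python. This sketch takes the observed and estimated tables as given and adds up (O − E)² / E over every cell; the variable names are illustrative.

```python
# Observed (O) and estimated (E) values, in the same cell order.
observed = [
    [30, 10, 20],
    [8, 10, 12],
    [3, 5, 2],
]
expected = [
    [24.6, 15.0, 20.4],
    [12.3, 7.5, 10.2],
    [4.1, 2.5, 3.4],
]

# Sum (O - E)^2 / E over every cell of the table.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(round(chi_square, 4))  # 8.8857
```

The largest single contribution comes from the Not Recommended / Medium cell, (5 − 2.5)² / 2.5 = 2.5, which shows how a small estimated value can dominate the statistic.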

Now, what do we do with this value? How do we conclude the result of the test? There is one more step involved. Recall the basic concept behind any hypothesis test: the p-value. It is the benchmark for drawing a conclusion from any hypothesis test, and we use it here too. To calculate the p-value, we need the following data points:

- The Chi-Square value
- The degrees of freedom

We already know the Chi-Square value, but what are the degrees of freedom?


**Degrees of Freedom**

The *degrees of freedom* are generally denoted by df or d. They tell you how many cells in a grid are independent. For the Chi-Square grid, think of it as the number of cells you need to fill in before, given the totals, the rest can be filled in by arithmetic. If the row and column totals are given, you have limited freedom in filling the cells: only some cells can take arbitrary values, and the rest are determined so that the row and column totals are met. For direct calculation, it is the number of rows minus one times the number of columns minus one, (R-1) × (C-1). In our example, the degrees of freedom are (3-1) × (3-1) = 2 × 2 = 4. We can fill in 4 values freely, and the rest are calculated from the totals.
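To make the idea concrete, the sketch below fixes the totals from our example, fills in only 4 cells, and derives the remaining 5 from the totals. Reusing the observed values for the 4 free cells is just a convenient choice for illustration; the variable names are assumptions.

```python
row_totals = [60, 30, 10]
col_totals = [41, 25, 34]

# Degrees of freedom for an R x C table: (R - 1) * (C - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# The df = 4 freely chosen cells (top-left 2 x 2 block of the grid).
free_cells = [
    [30, 10],
    [8, 10],
]

# The last column is forced by the row totals.
table = [
    row + [row_totals[i] - sum(row)]
    for i, row in enumerate(free_cells)
]

# The last row is forced by the column totals.
last_row = [
    col_totals[j] - sum(row[j] for row in table)
    for j in range(len(col_totals))
]
table.append(last_row)

print(df)     # 4
print(table)  # [[30, 10, 20], [8, 10, 12], [3, 5, 2]]
```

Choosing different values for the 4 free cells would produce a different table, but the other 5 cells would still be fully determined by the totals.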

Now you have all the data points needed to proceed, along with a conceptual understanding of the Chi-Square test. There are Chi-Square tables, like the z-score and F-statistic tables, but let’s stick to an Excel calculation here.

The formula to use in Excel is:

*p-value = CHIDIST(x, degree_of_freedom)*

Put in the values, and this will give you the p-value for the data points mentioned above. In our example, x is 8.88 and df is 4. Substituting these values into the formula gives a p-value of 0.06417. Newer versions of Excel also provide CHISQ.DIST.RT, which returns the same right-tailed probability.
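If Excel is not at hand, the same right-tailed probability can be sketched in plain Python. The function below is a hypothetical helper, not a library API, and it relies on a closed form that holds only when the degrees of freedom are even (as in our example, where df is 4). For general use, a statistics library is the safer choice.

```python
import math


def chi_square_p_value_even_df(x: float, df: int) -> float:
    """Right-tail probability P(X >= x) for a chi-square variable.

    Uses the closed form valid only for even df:
        exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive, even df")
    half = x / 2.0
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series


p_value = chi_square_p_value_even_df(8.88, 4)
print(round(p_value, 5))  # 0.06417
```

Using the unrounded statistic (about 8.8857) instead of 8.88 gives a slightly smaller p-value of roughly 0.0640, which leads to the same conclusion at alpha = 0.05.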

What does the basic hypothesis rule say?

- Reject null hypothesis if p-value < alpha (0.05)
- Fail to reject null hypothesis if p-value >= alpha (0.05)

In the above example, the p-value is greater than alpha, and thus we fail to reject the null hypothesis: the data does not provide sufficient evidence that the ratings depend on the size of the restaurant.
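The decision rule itself is a single comparison. Here is a minimal sketch; the function name `decide` and the default alpha of 0.05 are illustrative choices.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the basic hypothesis rule to a p-value."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# p-value from the restaurant example.
print(decide(0.06417))  # fail to reject the null hypothesis
```

Note that the outcome depends on the chosen alpha: with the same p-value and alpha = 0.10, the rule would reject the null hypothesis.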

Now you know the essentials of the Chi-Square test. Remember that the test is about the relationship between the observed and estimated values. It tells you whether there is evidence that the variables are dependent, but it does not tell you how they are dependent or what kind of relationship exists between them. I hope you had a good read and are now well equipped to use the Chi-Square test.

I would like to conclude by quoting Daniel Keys Moran:

**“You can have data without information, but you cannot have information without data.”**
