Chi-square test (χ²): what it is and how it is used in statistics
In statistics, there are various tests for analyzing the relationship between variables. Nominal variables are those that only allow relations of equality and inequality between their categories, such as gender.
In this article we look at one of the tests used to analyze independence between variables measured at a nominal level or higher: the chi-square test, which works through hypothesis testing and also serves as a goodness-of-fit test.
What is the chi-square test?
The chi-square test (χ²) is usually presented alongside descriptive statistics applied to the study of two variables, although, as a hypothesis test, it is strictly an inferential procedure. Descriptive statistics focuses on extracting information about the sample, whereas inferential statistics draws conclusions about the population.
The test takes its name from the chi-square probability distribution on which it is based. It was developed in 1900 by Karl Pearson.
The chi-square test is one of the best-known and most widely used tests for analyzing nominal or qualitative variables, that is, for determining whether or not two variables are independent. Two variables are independent when they have no relationship: one does not depend on the other, nor vice versa.
Studying independence in this way also gives a method for checking whether the frequencies observed in each category are compatible with independence between the two variables.
How is independence between variables evaluated?
To evaluate independence between the variables, we calculate the values that would indicate complete independence, called “expected frequencies”, and compare them with the frequencies observed in the sample.
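As a minimal sketch of how expected frequencies can be obtained for a contingency table, the following example uses the standard rule (row total times column total, divided by the overall total). The function name and the counts are hypothetical and chosen only for illustration.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts: rows = sex (men, women), columns = anxiety (yes, no)
observed = [[20, 30],
            [30, 20]]
expected = expected_frequencies(observed)
print(expected)  # [[25.0, 25.0], [25.0, 25.0]]
```

Here every cell would contain 25 cases if sex and anxiety were completely independent, so the observed counts (20 and 30) can be compared against that reference.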
As usual, the null hypothesis (H0) indicates that both variables are independent, while the alternative hypothesis (H1) indicates that the variables have some degree of association or relationship.
Correlation between variables
Like other tests with the same purpose, the chi-square test is used to check whether there is an association between two variables measured at a nominal level or higher. For example, we can apply it if we want to know whether there is a relationship between sex (being a man or a woman) and the presence of anxiety (yes or no).
To determine this type of relationship, we consult a table of frequencies (as is also done for other measures, such as Yule's Q coefficient).
If the empirical frequencies and the theoretical or expected frequencies coincide, then there is no relationship between the variables, that is, they are independent. If, on the other hand, they do not coincide, the variables are not independent (there is a relationship between them, for example between X and Y).
Considerations
Unlike other tests, the chi-square test places no restrictions on the number of categories per variable, and the number of rows and the number of columns in the table do not need to match.
However, it must be applied to studies based on independent samples, and only when all the expected values are greater than 5. As already mentioned, the expected values are those that would indicate complete independence between the two variables.
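The condition on expected values can be checked before running the test. The sketch below assumes the usual way of computing expected counts for a contingency table (row total times column total, divided by the overall total); the function name, the `minimum` parameter and the counts are hypothetical.

```python
def expected_counts_ok(observed, minimum=5):
    """Return True if every expected count is greater than `minimum`."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    return all(e > minimum for row in expected for e in row)


# Hypothetical tables: the first is large enough, the second is not
large_enough = expected_counts_ok([[20, 30], [30, 20]])  # expected counts are all 25
too_small = expected_counts_ok([[3, 2], [4, 6]])         # smallest expected count is about 2.3
print(large_enough, too_small)  # True False
```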
Also, to use the chi-square test, the measurement level must be nominal or higher. The statistic has no upper limit: it takes values between 0 and infinity, so on its own it does not tell us the strength of the association.
In addition, if the sample size increases, the chi-square value increases too, but we must be cautious in interpreting this, because it does not mean that the association is stronger.
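The effect of sample size can be illustrated with a small sketch. It assumes the usual Pearson statistic (the sum of squared differences between observed and expected counts, each divided by the expected count) and uses hypothetical counts: the second table has exactly the same proportions as the first, with ten times as many cases.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat


small = [[20, 30], [30, 20]]                      # hypothetical sample of 100
large = [[10 * x for x in row] for row in small]  # same proportions, 1000 cases

chi_small = chi_square(small)
chi_large = chi_square(large)
print(chi_small, chi_large)  # 4.0 40.0
```

The statistic is ten times larger in the second table even though the pattern in the data is identical, which is why the raw value should not be read as a measure of the strength of the relationship.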
Chi-square distribution
The chi-square test uses an approximation to the chi-square distribution to evaluate the probability of a discrepancy equal to or greater than the one found between the data and the frequencies expected under the null hypothesis.
The accuracy of this approximation depends on the expected values not being very small and, to a lesser extent, on the differences between them not being very large.
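As an illustration of how that probability can be obtained, the sketch below covers only the simplest case, a 2x2 table, which has 1 degree of freedom (degrees of freedom are not discussed in this article and are an assumption here). For 1 degree of freedom the upper-tail probability can be written with the complementary error function from Python's standard library; other table sizes need a general chi-square distribution function, which is not shown. The function name is hypothetical.

```python
import math


def chi_square_p_value_df1(statistic):
    """Probability of a value >= `statistic` for a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# 3.841 is the familiar 5% critical value for 1 degree of freedom
p = chi_square_p_value_df1(3.841)
print(round(p, 3))  # 0.05
```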
Yates Correction
Yates' correction is a formula applied to 2x2 tables with small theoretical frequencies (less than 10) to correct possible errors in the chi-square test.
More generally, the Yates correction, or "continuity correction", is applied when a discrete variable is approximated by a continuous distribution.
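A minimal sketch of the correction, assuming its usual form: 0.5 is subtracted from the absolute difference between each observed and expected count before squaring. The function name and the counts are hypothetical, and the guard that stops the adjusted difference from going below zero is a common safeguard rather than part of the classical formula.

```python
def chi_square_2x2(observed, yates=False):
    """Pearson chi-square for a 2x2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            diff = abs(o - e)
            if yates:
                # Continuity correction; max() is an added safeguard (assumption)
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat


observed = [[8, 4], [3, 9]]  # hypothetical small sample
uncorrected = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)
print(round(uncorrected, 3), round(corrected, 3))  # 4.196 2.685
```

The corrected value is smaller, which makes the test more conservative when frequencies are small.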
Hypothesis testing
The chi-square test also belongs to the so-called goodness-of-fit tests, whose aim is to decide whether we can accept the hypothesis that a given sample comes from a population with the probability distribution fully specified in the null hypothesis.
These tests compare the frequencies observed in the sample (empirical frequencies) with those that would be expected (theoretical or expected frequencies) if the null hypothesis were true. Thus, the null hypothesis is rejected if there is a significant difference between the observed and expected frequencies.
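The comparison between observed and expected frequencies is usually summarized with Pearson's statistic. The notation is an assumption made here for illustration: O_i is the observed frequency in category i, E_i is the frequency expected under the null hypothesis, and k is the number of categories.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```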
How it works
As we have seen, the chi-square test is used with data measured on a nominal scale or higher. The starting point is a null hypothesis that postulates a specified probability distribution as the mathematical model of the population that generated the sample.
Once we have the hypothesis, we carry out the test using the data arranged in a frequency table, which gives the absolute observed (empirical) frequency for each value or interval of values. Then, assuming the null hypothesis is true, we calculate for each value or interval the absolute frequency that would be expected (the expected frequency).
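The two steps just described can be sketched as follows for a hypothetical example: 60 throws of a die, with a null hypothesis stating that all six faces are equally likely. The function name and the counts are invented for illustration, and the statistic used is the usual Pearson sum of squared differences divided by the expected frequency.

```python
def goodness_of_fit(observed, probabilities):
    """Expected frequencies under the null distribution and the chi-square statistic."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, statistic


observed = [8, 12, 9, 11, 6, 14]   # hypothetical counts for faces 1 to 6
probabilities = [1 / 6] * 6        # null hypothesis: a fair die

expected, statistic = goodness_of_fit(observed, probabilities)
print([round(e, 1) for e in expected], round(statistic, 2))
```

Each face is expected 10 times, and the statistic summarizes how far the observed counts are from that expectation.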
Interpretation
The chi-square statistic takes the value 0 if the observed and expected frequencies agree perfectly; conversely, it takes a large value if there is a large discrepancy between them, in which case the null hypothesis is rejected.
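Both situations can be seen in a short sketch, assuming the usual Pearson statistic and hypothetical frequencies (the function name is also hypothetical).

```python
def chi_square_statistic(observed, expected):
    """Sum of squared differences, each divided by the expected frequency."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


expected = [25.0, 25.0, 25.0, 25.0]

perfect = chi_square_statistic([25, 25, 25, 25], expected)   # observed equals expected
discrepant = chi_square_statistic([45, 5, 5, 45], expected)  # observed far from expected
print(perfect, discrepant)  # 0.0 64.0
```

Whether a given value is large enough to reject the null hypothesis depends on the chi-square distribution and the significance level chosen, not on the raw value alone.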