# BSCI 1511L Statistics Manual: 2 Joint probability and the Chi Squared Contingency test

Introduction to Biological Sciences lab, second semester

## Learning Objectives

At the end of this section, you should be able to:

• describe the type of data that can be analyzed in a chi squared contingency test, and the appropriate null and alternative hypotheses
• perform a chi squared contingency test to determine if there is an association between two factors
• interpret the results of a chi squared contingency test by relating the P value to whether there is an association or not
• perform a chi squared contingency test using Excel

## 2.1 Joint probability under independence

What happens if one flips a penny and a quarter?  A basic principle of probability is that the probability of two independent outcomes co-occurring is the product of their individual probabilities.  This principle applies in the penny/quarter situation as long as the outcome of one flip doesn't influence the outcome of the other (i.e. they are independent).  Another way of describing the situation is to say that the outcome of the penny flip is not associated with the outcome of the quarter flip.

The probability of obtaining the coin flip combinations under an assumption of independence can be calculated using this table:

Table 5. Joint probabilities of flipping two normal coins

| | quarter: heads (0.5) | quarter: tails (0.5) |
|---|---|---|
| penny: heads (0.5) | 0.25 | 0.25 |
| penny: tails (0.5) | 0.25 | 0.25 |

In Table 5, the probabilities for the four possible outcomes add up to one (meaning that it is certain that one of the four will happen).  Since the four joint outcomes have the same probability, we can say that the four outcomes are equally likely.

Now consider the situation where we have trick coins that are loaded to produce heads more often than tails.  The trick penny has a probability of 0.6 of obtaining heads, while the trick quarter has a probability of 0.7 of obtaining heads.  Under these circumstances, the joint probabilities can be calculated with this table:

Table 6. Joint probabilities of flipping loaded coins

| | trick quarter: heads (0.7) | trick quarter: tails (0.3) |
|---|---|---|
| trick penny: heads (0.6) | 0.42 | 0.18 |
| trick penny: tails (0.4) | 0.28 | 0.12 |

The outcomes are less obvious this time.  A quick check shows that again the joint probabilities add up to one.  However, this time the four joint outcomes are not equally likely.

In summary, if two kinds of events are independent, we can predict the probability of them co-occurring by multiplying the probabilities of the individual events.
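The multiplication rule can be sketched in a few lines of Python.  This is only an illustration using the loaded-coin probabilities from Table 6; the function name `joint_probabilities` is a made-up helper, not part of any library.

```python
# Marginal probabilities for the loaded coins in Table 6.
penny = {"heads": 0.6, "tails": 0.4}
quarter = {"heads": 0.7, "tails": 0.3}


def joint_probabilities(first, second):
    """Multiply marginal probabilities, assuming the two outcomes are independent."""
    return {
        (a, b): prob_a * prob_b
        for a, prob_a in first.items()
        for b, prob_b in second.items()
    }


joint = joint_probabilities(penny, quarter)

for (penny_side, quarter_side), prob in joint.items():
    print(f"penny {penny_side}, quarter {quarter_side}: {prob:.2f}")

# The four joint probabilities should account for every possible outcome.
print(f"total: {sum(joint.values()):.2f}")
```

Replacing the two dictionaries with 0.5/0.5 probabilities reproduces Table 5.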

## 2.2 Association and independence

Consider the case of the sex of children in families who have two children.  Assume that the probabilities of having males and females are each 0.5 (i.e. it is equally likely to have a boy or a girl).  There are various ways that the sexes of children could be distributed among families with two children and still produce an overall relative frequency of 0.5 males and 0.5 females.  Some possible distributions are shown in Tables 7 through 9:

Table 7. Absolute frequencies of sexes of children with extreme negative association

| | second child: male | second child: female |
|---|---|---|
| first child: male | 0 | 250 |
| first child: female | 250 | 0 |

Table 8. Absolute frequencies of sexes of children with extreme positive association

| | second child: male | second child: female |
|---|---|---|
| first child: male | 250 | 0 |
| first child: female | 0 | 250 |

Table 9. Absolute frequencies of sexes of children with no association (complete independence)

| | second child: male | second child: female |
|---|---|---|
| first child: male | 125 | 125 |
| first child: female | 125 | 125 |

The examples in Tables 7 through 9 are extreme, but they demonstrate the range of possible distributions.  You should notice that in all three examples the sex ratios are the same (half males and half females).  The difference is in the way those sexes are distributed within families.  In the case of Table 7, the second child born is always the opposite sex of the first child born.  In Table 8, the second child born is always the same sex as the first child born.  In Table 9, there is no association between the sex of the first child born and the sex of the second child born.  An alternative to the term "associated" is "contingent".  We can say that the second outcome is contingent on the first outcome when the state of the second outcome depends on the state of the first outcome.
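A short sketch can confirm that the three tables share the same row and column totals even though their cells differ.  The helper name `marginal_totals` is hypothetical, chosen only for this illustration.

```python
# Counts from Tables 7 through 9: rows are the first child (male, female),
# columns are the second child (male, female).
tables = {
    "Table 7 (negative association)": [[0, 250], [250, 0]],
    "Table 8 (positive association)": [[250, 0], [0, 250]],
    "Table 9 (no association)": [[125, 125], [125, 125]],
}


def marginal_totals(table):
    """Return (row totals, column totals) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals


for name, table in tables.items():
    rows, cols = marginal_totals(table)
    print(f"{name}: first child totals {rows}, second child totals {cols}")
```

All three tables give 250 males and 250 females for both the first and the second child, so the totals alone cannot reveal whether the two factors are associated.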

## 2.3 The chi squared contingency test (test of independence)

A scientist collects data on the sexes of children in 500 families having two children and records the following data:

Table 10. Actual absolute frequencies of children in some families with two children

| | second child: male | second child: female |
|---|---|---|
| first child: male | 114 | 131 |
| first child: female | 132 | 123 |

From the data in Table 10, it appears that there may be a small negative association between the sexes of first and second children.  However, it is also possible that there is no association and that the deviation from the expected frequencies is due to random variation.  This situation can be tested statistically using a special case of the chi squared goodness of fit test that was described in Sections 1.4 and 1.5.  This test is called a chi squared contingency test.

In this case, the null hypothesis is that there is no association between the sex of the first and second child (i.e. that the two factors, sex of first child and sex of second child, are independent).  So it would seem like we could just compare the cells in Table 10 with those in Table 9 since both were based on 500 families.  However, that would actually be inappropriate because it would be testing two different things: whether the sex of the first and second children were associated AND whether the sex ratios of the children were actually 1:1.  What we really want to know is this: given the sex ratios that exist, are the sexes of the first and second children associated?  So our first task is to determine the actual sex ratios of the first children and actual sex ratios of the second children.  We can do this by expanding the table to provide totals for each category:

Table 11. Calculation of actual relative sex frequencies of children in some two child families

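| | second child: male | second child: female | total | actual relative frequency |
|---|---|---|---|---|
| first child: male | 114 | 131 | 245 | 0.490 |
| first child: female | 132 | 123 | 255 | 0.510 |
| total | 246 | 254 | 500 | 1.000 |
| actual relative frequency | 0.492 | 0.508 | 1.000 | |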

The totals in Table 11 were used to calculate the actual relative frequencies of males and females for first and second children.  These frequencies are near, but not identical to, 0.5.  If we assume that the observed relative frequencies represent the probabilities of achieving these states (as discussed in section 1.6), we can now use these actual relative frequencies to calculate the joint probabilities of the various combinations of sexes for first and second children by multiplying the probabilities of single outcomes, as discussed in section 2.1.  The results of this are in Table 12.

Table 12. Calculation of expected joint probabilities for children in some two child families

| | second child: male | second child: female | actual relative frequency |
|---|---|---|---|
| first child: male | 0.241 | 0.249 | 0.490 |
| first child: female | 0.251 | 0.259 | 0.510 |
| actual relative frequency | 0.492 | 0.508 | |
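The calculation of the relative frequencies and the expected joint probabilities can be sketched as follows.  The variable names are illustrative choices; the counts are the observed values from Table 10.

```python
# Observed counts from Table 10: rows are the first child (male, female),
# columns are the second child (male, female).
observed = [[114, 131], [132, 123]]
n = sum(sum(row) for row in observed)  # 500 families

# Actual relative frequencies of each sex for first and second children.
row_freq = [sum(row) / n for row in observed]
col_freq = [sum(col) / n for col in zip(*observed)]

# Expected joint probabilities if the two factors are independent:
# the product of the two relative frequencies for each cell.
expected_prob = [[r * c for c in col_freq] for r in row_freq]

print("first child frequencies: ", [round(f, 3) for f in row_freq])
print("second child frequencies:", [round(f, 3) for f in col_freq])
for row in expected_prob:
    print([round(p, 3) for p in row])
```

As with the coin tables in section 2.1, the four expected joint probabilities add up to one.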

In order to conduct an actual goodness of fit test, the expected joint probabilities must be converted into expected absolute frequencies of combinations, based on a total sample of 500 (i.e. the test must be performed on counts, not relative frequencies).  This has been done in Table 13 by multiplying each expected joint probability by the total number of families observed (500).

Table 13. Expected absolute frequencies of children in some two child families

| | second child: male | second child: female |
|---|---|---|
| first child: male | 120.5 | 124.5 |
| first child: female | 125.5 | 129.5 |
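Multiplying a joint probability by the sample size is algebraically the same as computing (row total × column total) / grand total for each cell, which is a common shortcut.  The sketch below uses that shortcut on the counts from Table 10; the variable names are illustrative.

```python
# Observed counts from Table 10.
observed = [[114, 131], [132, 123]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected count for each cell under independence:
# (row total / n) * (column total / n) * n = row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(count, 1) for count in row])
```

Rounded to one decimal place, these values match Table 13.  Note that expected counts do not have to be whole numbers.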

We are now in a position to conduct a goodness of fit test to see if the actual (observed) absolute frequencies listed in Table 10 differ significantly from the absolute frequencies we would expect if there were no association (Table 13).

The chi squared term for each combination (cell in Tables 10 and 13) is calculated as described in section 1.4 and the sum of the terms for each combination represents the chi squared value for the test.  The number of degrees of freedom in a chi squared contingency test is reduced when compared to a generic goodness of fit test.  That is because we lose degrees of freedom when we calculate the relative frequencies based on the data itself (as we did in Table 11).  The rule for degrees of freedom in contingency tests is:

df=(rows-1)(columns-1)

In this example, there are two rows and two columns, so (2-1)(2-1)=1 and there is one degree of freedom.  As in the regular goodness of fit test, the value of P depends on the chi squared value and number of degrees of freedom, and can be calculated using Excel as will be shown in the following section.
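For readers who want to check the arithmetic outside of Excel, the whole test can be sketched in Python.  This sketch assumes that the chi squared term from section 1.4 is the standard (observed − expected)² / expected.  The function names are hypothetical, and the P value shortcut used here is valid only when there is one degree of freedom.

```python
import math


def chi_squared_contingency(observed):
    """Return (chi squared value, degrees of freedom) for a table of counts."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence.
            exp = row_totals[i] * col_totals[j] / n
            chi2 += (obs - exp) ** 2 / exp

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


def p_value_one_df(chi2):
    """P value for a chi squared value with exactly ONE degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2, df = chi_squared_contingency([[114, 131], [132, 123]])
p = p_value_one_df(chi2)
print(f"chi squared = {chi2:.3f}, df = {df}, P = {p:.3f}")
```

For the data in Table 10, this gives a chi squared value of roughly 1.37 with one degree of freedom and a P value of roughly 0.24.  Tables with more than one degree of freedom need a general chi squared distribution function rather than the `math.erfc` shortcut.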

## 2.3 Summary

A chi squared contingency test is used to determine whether two factors are associated or independent.  Each of the two factors must be discontinuous and recorded as one of several possible states.  The test is performed on counts of outcomes.  If the factors are not independent, they may simply be associated in some unknown way.  It is also possible to use the test in circumstances where one variable is suspected to be dependent on the other (i.e. that the state of one variable is affected by the state of the other variable).