Section 0 reviews some of the basic ideas about statistics from the first semester of the course. For more detailed information and for information about how to conduct a t-test of means, a paired t-test, or a regression, see the Excel Reference and Statistics Manual from the first semester of the class.
Different or the same?
Imagine that you compare the heights of several college-aged men and several college-aged women and determine that the average height of the women is 171.1 cm and that the average height of the men is 179.8 cm. From these measurements, it would seem straightforward to say that college men are taller than college women. However, it's possible that men aren't really taller than women and we got unlucky and picked men that were taller than average and women that were shorter than average. How can we decide which we think is true: that there is a real difference, or that there is no real difference and we were unlucky in our sampling?
The purpose of statistics is to help us make objective decisions about whether the differences that we measure are real. If the differences are large enough that we believe that they are real, we say that there is a significant difference.
Statistical tests
Statistical tests help us reach conclusions about differences through a calculated number called the "value of the statistic". In all statistical tests, the calculated value of the statistic is larger when the differences are greater. How big must the statistic be in order for us to say that the differences are significant? Statistical tests give us another number that helps us decide: the P-value.
The P-value is a sort of assessment of the likelihood of "bad luck" in sampling. The P-value always goes down when the value of the statistic goes up. If the value of the statistic that we calculate is very large, then the probability that we could have gotten a big difference like that simply through bad luck is very low. We probably got a big difference like that because the things we are comparing are really different! How big must the statistic be before we are willing to say that the difference is significant and not just due to bad luck in sampling? When it's big enough for the P-value to be less than 0.05.
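To see how the P-value falls as the statistic grows, here is a minimal Python sketch for a t statistic. It assumes a two-tailed test with 12 degrees of freedom, and the function names `t_pdf` and `two_tailed_p` are made up for this illustration. The P-value is approximated by numerically adding up the area under the t distribution, which is only one of several ways to get it.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Approximate two-tailed P for a t statistic (Simpson's rule on [0, |t|])."""
    t = abs(t)  # the sign of t does not matter for a two-tailed test
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3   # probability of landing between 0 and |t|
    return 1 - 2 * area    # what is left over in the two tails

# Assumed df = 12; larger t values should print smaller P-values.
for t in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"t = {t:.1f}  P = {two_tailed_p(t, 12):.4f}")
```

With 12 degrees of freedom, P drops below 0.05 once t is a little larger than about 2.18.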
Example 1:
Here are some fake data for heights of men and women, with the mean (average) values at the bottoms of the columns:
women | men
175.4 | 181.5
172.1 | 187.3
181.1 | 175.3
165.2 | 178.3
166.3 | 169
167 | 183.2
170.3 | 184.5
171.0571 | 179.8714
We can use the Analysis ToolPak of Excel to perform a two-sample t-test of means on these data. See section 7.3 in the BSCI 1510L Statistics Guide for directions.
We formally state our choices like this:
Null hypothesis: the mean heights of men and women are the same
(i.e. the differences that we measure are not real and are just caused by bad luck in sampling)
Alternative hypothesis: the mean heights of men and women are different.
(i.e. there are real differences in the heights)
Here are the results:
As you can see, there is more information here than we really want. One of the things that we care about is the value of the statistic. In the t-test of means, the statistic that we calculate is t. In the results table, the calculated value of t is called "t Stat". When Excel does the test, the sign of t is arbitrary and depends on which column was selected first. So we can ignore the sign and summarize by saying "t=2.77". Is a value of 2.77 large enough for us to conclude that the heights of men and women are different? That depends partly on how many people we sampled. The degrees of freedom are related to the sample size, so we care about that, and would report "df=12". The most important value in the table is the P-value. We should use the value listed under "P(T<=t) two-tail". Since 0.0171 is less than 0.05, we can conclude that the heights of men were significantly different from the heights of women based on this sample. Here is the standard way we would report the results of the test:
t=2.77, df=12, P=0.0171
(the value of the statistic, the degrees of freedom, then the P-value).
To describe the results in words, we can say that we reject the null hypothesis and that the mean heights of men and women are significantly different. We should consider, however, that there is some probability that we are mistaken. If the mean heights of men and women were really the same, there is a 0.0171 probability that we would get results this different by bad luck in sampling. That's about 2% or one time in 50.
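If you want to check the reported values outside of Excel, the following Python sketch repeats the calculation for the Example 1 data. It assumes the equal-variance (pooled) version of the two-sample t-test, which is consistent with df=12 but is an assumption about which Excel tool was used. The function names are invented for this example, and the P-value is approximated numerically rather than looked up.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Approximate two-tailed P for a t statistic (Simpson's rule on [0, |t|])."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def t_test_equal_var(a, b):
    """Two-sample t-test of means, assuming equal variances (pooled)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its own degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(a) - mean(b)) / std_err
    return t, df, two_tailed_p(t, df)

# Example 1 data from the table above.
women = [175.4, 172.1, 181.1, 165.2, 166.3, 167, 170.3]
men = [181.5, 187.3, 175.3, 178.3, 169, 183.2, 184.5]

t, df, p = t_test_equal_var(women, men)
# The sign of t depends on which group is listed first, so report |t|.
print(f"t={abs(t):.2f}, df={df}, P={p:.4f}")
```

The printed line should agree with the reported results up to rounding. The same function can be applied to any two columns of measurements.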
Example 2:
Here are some other fake data for heights of men and women, with the mean (average) values at the bottoms of the columns:
women | men
175.4 | 181.5
172.1 | 185.7
181.1 | 175.3
171.5 | 177.4
176.3 | 169
167 | 181.2
171.3 | 175.5
173.5286 | 177.9429
The null and alternative hypotheses are the same and you can run the test yourself with Excel to get the results. Here is a summary of the test results for this sample:
t=1.65, df=12, P=0.124
Here is how we could describe the results if they came out this way: we failed to reject the null hypothesis and did not show that the mean heights of men and women were significantly different.
You should notice that I did NOT say "we proved the null hypothesis" nor "we showed that the heights of men and women are the same". It is possible that the heights of men and women are really the same. But it is also possible that they are different, and that we couldn't detect the difference because we didn't sample enough people to make the differences show up (i.e. our experiment stinks). If the mean heights of men and women were really the same, there is a 0.124 probability that we would get results this different by bad luck in sampling. That's about 12% of the time, or only about 1 time out of 8. To put it another way, if the mean heights of men and women were really the same, it would be pretty uncommon to get differences this big by chance. So it's a bit silly to say that we "proved" that the means were the same. The chances just aren't low enough that we feel confident to say that the differences are significant. We would need to sample a lot more people to be more confident that the means were the same.
A fundamental principle of hypothesis testing is that IT IS NOT POSSIBLE TO DECIDE IF TWO GROUPS ARE REALLY DIFFERENT BASED ON A SINGLE MEASUREMENT FROM EACH GROUP. This is because only multiple measurements allow for an estimate of how precisely we know the values of the group we measured.
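The principle above can be seen directly in a short Python sketch. The two single heights are hypothetical measurements, and the helper name `can_estimate_spread` is made up for this illustration. With one value per group there is no way to estimate the spread within a group, and the degrees of freedom for a two-sample t-test drop to zero.

```python
from statistics import StatisticsError, variance

def can_estimate_spread(sample):
    """Return True if the sample has enough values to estimate a variance."""
    try:
        variance(sample)  # raises StatisticsError with fewer than two values
        return True
    except StatisticsError:
        return False

# Hypothetical case: a single measurement from each group.
one_woman = [171.1]
one_man = [179.8]

df = len(one_woman) + len(one_man) - 2  # degrees of freedom for a two-sample t-test
print(can_estimate_spread(one_woman), can_estimate_spread(one_man), df)
# prints: False False 0
```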
In summary, the following relationships are true for ALL statistical tests:
The null hypothesis represents the situation where there is no difference between two groups or that a factor has no effect.
The alternative hypothesis is that there is a difference or that a factor has an effect.
P is the probability that a difference at least as large as the one observed would occur in samples if there were really no difference between the populations from which the samples were taken. It is closely related to the risk of making a Type I error (i.e. assuming that differences are real when in fact the null hypothesis is true): if we call differences significant whenever P<0.05, we will make a Type I error about 5% of the time when the null hypothesis is really true.
A difference is considered significant if P<0.05.
A significant difference means we reject the null hypothesis and assume the alternative hypothesis is true.
All statistical tests have a P-value and P-values are always used to evaluate significance. Although the statistic used to derive P may be different depending on the test (e.g. t, χ², F, G, etc.), the general relationship between a statistic and P is always the same. Since the size of a statistic provides a measure of the difference, a higher value of the statistic results in a lower P-value, and vice versa.
The exact relationship between a statistic and P is usually complicated and cannot generally be calculated with a simple formula. The relationship includes the number of degrees of freedom (df), an integer number that is often related to the sample size, and is calculated in a specific way for a specific test.
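As an illustration of how the degrees of freedom depend on the test, here are the standard formulas for two tests mentioned in this section. The symbols are not from the original text: n_1 and n_2 are the two sample sizes, \bar{x}_1 and \bar{x}_2 the sample means, s_1^2 and s_2^2 the sample variances, and n the number of pairs. The two-sample formulas assume the equal-variance version of the t-test.

```latex
% Two-sample t-test of means (equal variances assumed)
\[
  t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
  \qquad
  s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
  \qquad
  df = n_1 + n_2 - 2
\]

% Paired t-test with n pairs
\[
  df = n - 1
\]
```

For two samples of 7 people each, df = 7 + 7 - 2 = 12, which matches the value reported in both examples.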