At what point is the probability that our differences were caused by bad luck in sampling (i.e. by an unrepresentative sample) low enough that we feel comfortable concluding that a difference is significant? P=0.20 is not low enough: with that cutoff, one time out of five when there was actually no difference we would mistakenly conclude that the difference was real. P=0.001 is fine because we would only make that kind of mistake one time out of a thousand. The convention in the scientific community is that P must be less than 0.05 in order for us to conclude that the difference we are observing is significant. With that cutoff, one time out of twenty when the null hypothesis is true we would make the mistake of concluding that the difference was real. This type of mistake is called a Type I error. Loosely speaking, the cutoff we choose for P is the probability of making a Type I error when the null hypothesis is true.
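The "one time out of twenty" claim can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: both groups are drawn from the same normal distribution (so the null hypothesis is true by construction), the standard deviation is treated as known so that a simple z-test can be used, and the function names, group size, and number of experiments are hypothetical choices, not anything from the text above.

```python
import math
import random


def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test P-value for a difference in means.

    Assumes (for illustration only) that both groups share a known
    standard deviation `sigma`.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    std_err = sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)
    z = diff / std_err
    # Probability that a standard normal value is at least as extreme as z
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(alpha, n_experiments=5000, n_per_group=20, seed=0):
    """Fraction of experiments with P < alpha when there is NO real difference."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: the null hypothesis is true
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if two_sided_p_value(group_a, group_b) < alpha:
            false_positives += 1
    return false_positives / n_experiments


for alpha in (0.20, 0.05, 0.001):
    print(f"cutoff {alpha}: Type I error rate = {false_positive_rate(alpha):.3f}")
```

Under these assumptions the fraction of false positives should land close to whatever cutoff is chosen, which is why a cutoff of 0.20 is too lenient and 0.05 corresponds to roughly one mistake in twenty.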
P < 0.05 is called the criterion for significance. It is also referred to as the α (alpha) level, written as α = 0.05. If we conduct a statistical test and P < 0.05, we say that we have rejected the null hypothesis and that we have shown that there is a significant difference. If P > 0.05, we say that we have failed to show that there is a significant difference and that we accept the null hypothesis (more precisely, that we have failed to reject it).
If P > 0.05, we do NOT say that we have proven the null hypothesis (i.e. proven that there is no difference). Why? Recall the example where there were few patients in the drug trial. The value of P was high (0.21) because unrepresentative sampling was likely and not necessarily because the drug had no effect. Students are often confused about why we can say that we have shown that there is a difference when P < 0.05, but we cannot say that we have shown that there is no difference when P > 0.05. The reason is that it is easy to cause an experiment to have a high P-value just by having an inadequate sample size. If you want to demonstrate that there is no difference, not only do you have to show that the value of P is high, but you also have to show that your experiment doesn't stink (i.e. that you had a big enough sample size that you could have shown differences if they were there). Thus, we can only accept the null hypothesis or reject the null hypothesis. We can never "prove" it.
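To see how an inadequate sample size alone can produce a high P-value, here is a rough simulation sketch. It does not reproduce the drug trial mentioned above, whose data are not given here. Every number in it is a hypothetical assumption: a real effect of half a standard deviation, normally distributed measurements with a known standard deviation, and group sizes of 5 and 100.

```python
import math
import random


def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test P-value, assuming a known common standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    std_err = sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)
    return math.erfc(abs(diff / std_err) / math.sqrt(2.0))


def detection_rate(n_per_group, true_effect=0.5, alpha=0.05,
                   n_experiments=2000, seed=1):
    """Fraction of experiments reaching P < alpha when the effect IS real."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # The treated group really does differ (hypothetical effect size)
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        if two_sided_p_value(treated, control) < alpha:
            detected += 1
    return detected / n_experiments


for n in (5, 100):
    print(f"{n} per group: real effect detected in "
          f"{detection_rate(n):.0%} of experiments")
```

Under these assumptions the small experiment misses the real effect most of the time, while the large one detects it in the great majority of runs. A high P-value from the small experiment therefore says more about the experiment than about the absence of an effect.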
Why does this matter? A fixed criterion such as α = 0.05 provides enough certainty to decide whether a medication should be approved for public use, or whether a procedure (such as a chemo treatment) is having enough of the desired effect to help a person. There needs to be some objective way to distinguish between "A" and "B" and to decide whether the effect is clearly different enough to warrant using a treatment or stopping it.