Skip to Main Content

BSCI 1511L Statistics Manual: 0 Review

Introduction to Biological Sciences lab, second semester

Introduction

Section 0 reviews some of the basic ideas about statistics from the first semester (1510L) of the course.  For more detailed information and for information about how to conduct a t-test of means, a paired t-test, or a regression, see the Excel Reference and Statistics Manual from the first semester of the class.

On the Review for Excel and Statistics

Students who have had 1510L will notice several Readings that they have read prior or are asked to read again.  Students who have not had 1510L may be unfamiliar with some of the content that we refer back to (Excel and Statistics and Writing Guide Content).  We will present a mini-review below on a few ideas that were covered in 1510L.  This will allow students unfamiliar with Excel/Statistical basics to have a chance to learn a few ideas and for those who are familiar, to become re-acquainted with a few details they may have forgotten.

It would be completely unfair and not practical to expect you to:

-perform an experiment

- collect data

-analyze the data AND ask you to do a T-test of means (or Chi Square goodness of fit next week) without explaining what a T-test of means is (how to do it), what the results mean (statistical analysis), and how to present the results (Writing Guide formats)

Thus, at all places we introduce a feature, it is described in the Reading, and we give examples of how to do without spending time trying to find the information somewhere else on-line.

For those for whom this content is new, it is a longer read to be sure to show how/why.  

For those for whom this content is somewhat familiar, it should be a quick read to refresh your memory.

You are never expected to have memorized the buttons to press to perform a given statistical test.  You ARE expected to know that it exists, how to operate it, and how to express it.

Specifically from now through about Lab 6, you will be constantly referred back to this page or other parts of The 1510L and 1511L Statistical Guide information we have assembled for you.  In several cases, we will go over X test in the Reading and then in the homework perform that X test. And then the next week, we will use the exact same data set but with a different tests to perform OR a different program to use to compare if you obtain the same values or not. If there was some way to link content within Brightspace to itself, we would do so!! 

With apologies, you will need to keep going back to certain Lab weeks OR become really familiar with the entire presented content we have on the "Writing/Excel/R-Studio/Statistics Guide copy" module prior to "Week 0". There are links provided to the Library Guides web interface as well.

Once you learn how to do things, such as a t-test of means on Excel, without going back to the "instructions", it will be easier.  If you did not really learn how to use Excel in 1510L, now is the time.

 

General comment about all things Excel

 


Review – 1510L – Excel


Preface

We will use Microsoft Excel extensively throughout this course. If you have had a lot of practice with EXCEL, you will already have most of the basic skills outlined in this document and it will provide a great review prior to the course. If you are a novice, You NEED to learn all of the techniques explained in the links/text below. The use of Excel as outlined in this document will act as a key foundation for assignments and content taught this semester. It will be assumed that all students have the skills outlined following this reading.

Program versus App

We have some confusion on this aspect.  There are lots of ways to have access to Excel, either as an installed program on your system or as a wed-based application (an App).

There is only ONE way to use Excel in the 1511L course (with the data analysis toolpak), and that is as an installed program.

Example: When you save a file, how do you save it?  IF you save the file on a Cloud or other virtual platform, you have access to that file as long as you have internet access, but it is not physically within your system. However, if you save the file directly onto your system (in a folder or on the desktop, etc,.), you have access to that file at any time since it is within your system physically.

If the version of Excel you use is directly installed onto your system (as part of the Microsoft Office Suite) then you have the program physically within your system.

If the version of Excel you use is web-based, and you need internet access for it, then you do NOT have Excel physically within your system. Similar, if you have a web-address displayed at the top of your "Excel" page and it is part of the office365 WEB-based verison.....it is still web-based and will be missing some embedded components. 

The difference is that with the web-based App, the full range of Excel options is not available. It does NOT have the data analysis toolpak.  "Data miner" or "Xcel miner" or "analyze data" are NOT the same features as data analysis toolpak!!!

 

 

Enabling the Data Analysis for Excel (required for most statistical functions)

Note: to carry out the statistical tests described later on using Excel for Windows, the Data Analysis Tools must be enabled (NOT 'data miner' or 'Xcel miner'). This has already been done on the lab computers in 2122 Stevenson but if you are using a computer elsewhere, you may need to enable it. Usually this works for most systems, go to the File tab (left side) then select Options. Click on Add-Ins  at the bottom of the screen, then "ok". Select Analysis ToolPak, then click OK. 

 

For some Excel versions (your laptop), if the above did not work then look under the Tools drop down, select "Add-ins" and then from that menu that opens, select the "Analysis Toolpak" check box.

 

WHATEVER YOU DO...do NOT just search for "data analysis".  You will end up with Xcel Miner or Analyze Data and that will not work!

 

'DATA MINER/XCEL MINER' is NOT it. 'Data Analysis Tools'.

The Data Analysis Tools pack is only available in Office for Mac 2016 - it can't be enabled in older versions. However, all Vanderbilt students can download a free copy of Office 2016 (or whatever it is called now) for Mac from the software store. If you have Mac Office 2016 Excel, you can enable it as follows:

1. Click the Tools menu, then click Add-Ins.

2. Select the Analysis ToolPak from the Add-Ins available box. Click OK. 

3. Restart Excel.

The Analysis tools should appear on the Data ribbon at the far right.

 

If the above did not work, there may be a different way for your specific system.  Look for it. not 'data miner'...."Data Analysis Tools"

It was evident that multiple students had issues with enabling the ToolPak on their own systems in Fall 2024 1510L. The primary reason determined was that the ‘App’ was being downloaded and not the ‘actual Office package’. Be sure to determine if you are able to enable it during Week 0 as this feature of Excel will be used multiple times this semester for 1511L.

Fall Semester Review of Excel

Everyone has a different background and level of familiarity (or none) with Excel and Word. Regardless of what on-line platforms or computer training you have had, Excel and Word are used in BSCI 1510L and 1511L, as well as several courses beyond this one. 

This link https://researchguides.library.vanderbilt.edu/c.php?g=69346&p=831787 leads to the ‘Library Guides’ website where we placed our ‘Excel Guide’ and other resources several years ago.  The version of the reading on Brightspace that students in Fall 1510L read was much of this. If you need any refresher on the basic operations in Excel, please go to the site and review OR go into your 1510L course. It was deemed to be "not required to make you read it all" again…such as how to copy/paste in Excel, etc.

The link above is open access on the internet without logging into Brightspace.

The web page the above link opens to is the table of contents. You can use the drop-down arrow on ‘Excel Reference and Statistics Manual’ to quickly access a specific part OR simply use the “next page” arrow prompts. For the complete and full review, look through Sections 1 through 12 (much of this was Fall 1510L from Week 1 through Week 5). Very clearly, you do not necessarily NEED to read every line unless you feel it is necessary. If you were 100% comfortable with the Excel work that you did in Fall 2024 for 1510L, you know where to go to refresh later. If you had difficulty with the Excel work in Fall 2024, this is the right time to begin to look again…quite often things start clicking on some ideas simply after a little time passes and you look at it again.

BUT IF YOU WERE NOT IN 1510L FALL 2024, you should go to the site and look through the pages there.  If you already know how to enter data into Excel, great. If you have no idea how to enter data in Excel, then you should read through the relevant sections of content.

 

General review of Excel functions and operation

The interface of Excel is a screen of rows and columns of white blocks, or "cells".  Each cell has a unique identifier based on the column letter (top of column) and row number.  Excel has an annoying feature: it bases what it thinks you want to do based on what you type into a given cell.  If you type a number, it will assume a number or mathematical function will occur.  If you type a letter, it will adjust the "format" of the cell from 'general' or 'number' to 'text'. 

The really great thing about Excel is that it can act as a super calculator.  Hopefully you all are familiar with a calculator, in particular the TI (Texas Instruments) type models that have graphing features plus all those keys that you have to press to do certain mathematical operations or "Functions". Thus, if you type "2+2" or "2+2=" (without the quote marks!!) into a cell..... nothing happens.  Why?  Because Excel is also a program and needs a correct command sequence.  If you type, "=2+2" into a cell, Excel has now been given a command to add 2 and 2 together and it knows to place it within that cell.  Why did 2+2 not work earlier?  Excel did not know where to place the result.  With "=2+2", Excel knew to place it within that specific cell. You can thus use combinations of arithmetic functions and cell references to place results in specific places or to use specific cells in whatever ways you need to do something, such as adding two numbers in two different cells and placing the results into a different cell.  

Excel also has more complex mathematical operations, such as performing an average (mean), a Standard Deviation, and logarithms to name a few plus the ability to set up graphs and charts add various add-ons. And also, statistical tests.  

For a more full review on data entry into Excel with screenshots, view sections 1.1 through 1.5 1 Formulas in Excel - BSCI 1510L Literature and Stats Guide - Research Guides at Vanderbilt University and 2.1 through 2.3.3. 2 Basic operations in Excel - BSCI 1510L Literature and Stats Guide - Research Guides at Vanderbilt University If you have effectively zero experience with Excel, you should do the complete reading suggested and also try to replicate the values shown in the screenshots.

1510L Broad Review of Statistics

 

1510L Broad Review of Statistics:

The following section is a broad overview for statistics, containing a review of t-test of means and a review on reporting relevant values and null/alternative hypotheses. There will be two examples shown that will be required to replicate for the Assignment.

Statistically different or the same?

Imagine that you compare the heights of several college-aged men and several college-aged women and determined that the average height of the women was 171.1 cm and that the average height of the men was 179.8 cm. From these measurements, it would seem straightforward to say that college men are taller than college women. However, it is possible that men are not really significantly taller than women and that we had sampled men that were taller than average and women that were shorter than average. How can we decide which we think is true: that there is a real difference, or that there is no real difference and we were unlucky in our sampling?

The purpose of statistics is to help us make objective decisions about whether the differences that we measure are real. If the differences are large enough that we believe that they are real, we say that there is a significant difference (based on some objective qualifying values).

Thus, we have two possible outcomes for our two sets of data: either the data in one group is effectively the same sort of data as the other group or the two sets of data are not the same. This can be described as that there are two educated guesses, two thoughts, or two hypotheses about how the data behaves.

In the case of the data being the "same", this would be a typical example of a null hypothesis.  A null hypothesis is a general statement of no change, no effect or no difference between two groups. There is not a specific "catch-all" for a stated null hypothesis.  For our college-aged humans, we could say as a possible null hypothesis that, "there is no difference between the heights of the two groups of people sampled" or "the difference in average height between college-aged women and college-aged men will be the same".  There are several possible ways to phrase it.

In the case of the data being "not the same" or "different", this is where we have a slight area of confusion.  First, an alternate hypothesis is a statement of there being a difference, that, "there is a difference between the heights of the two groups of people sampled" or "the difference in average height between college-aged women and college-aged men will be different".  Notice that there was not a qualifier placed there, that it was not stated about one group specifically being taller or shorter than the other, just that there would be a difference between the two groups of data.  It will not be delved into more deeply than that in this course, but there are expected differences or assumed differences to occur in a "direction of data" and also that it is possible to accept one of the hypotheses and NOT reject the other (and also not accept at the same time)....or that you could reject both.  A formal course in Statistics would be advised to clarify those ideas better.

Statistical Tests

Statistical tests help us reach conclusions about differences through a calculated number called the “value of the statistic.” In all statistical tests, the calculated value of the statistic is larger when the differences are greater. How big must the statistic be in order for us to say that the differences are significant? We are not covering that aspect...that is for a true Statistics class. HOWEVER, statistical tests can give us another number that helps us decide: The P-value

The P-value is a sort of assessment of the likelihood of “bad luck” in sampling. The P-value always goes down when the value of the statistic goes up for most statistical tests and only goes from a value range of 0 to 1. If the value of the statistic that we calculate is very large, then the probability that we could have gotten a big difference like that simply through bad luck is very low. We probably got a big difference like that because the things we are comparing are really different! How big must the statistic be before we are willing to say that the difference is significant and not just due to bad luck in sampling? When it is big enough for the P-value to be less than 0.05.

T-test of Means

After reading through the following example you SHOULD attempt the work on your own computer to make sure you can obtain the same results (hint-hint: this means it will be used in an assignment soon).

Get in the habit of saving the Excel file on your computer in a ‘statistics’ folder with its own subfolder, ‘t-test’. We will do lots of statistical tests and you should save the files in one easy to find place for your future reference.  Homework 0 will need Example 1 and 2

Example 1:

Here are some fake data for heights of men and women, with the mean (average) values at the bottoms of the columns:

Table 1: Heights of men and women. Mean(average) values are in bold at the bottom of the columns.

women

men

175.4

181.5

172.1

187.3

181.1

175.3

165.2

178.3

166.3

169

167

183.2

170.3

184.5

171.0571

179.871

We can use the Analysis Tool Pack of Excel to perform a two-sample t-test of means on these data. See section 7.3 in the BSCI 1510L Statistics Guide for directions if you need a refresh on your memory. 

We formally state our choices like this:

Null hypothesis: There IS NO statistical difference between the mean height of the sampled men and the mean height of the sampled women.

(i.e. the differences that we measure are not real and are just caused by bad luck in sampling)

Alternative hypothesis: There IS a statistical difference between the mean height of the sampled men and the mean height of the sampled women.

(i.e. there are real differences in the height between women and men, here, the average height of women is less than of men)

 

 

Here are the full results from Excel:

https://s3.amazonaws.com/libapps/accounts/14039/images/t-test-screenshot.png

Figure 1: Results of (excel generated) Table 1 data using a t-Test of means between the sampled heights of men and women. <<PLEASE note that insertion of the "Excel output" is NOT appropriate in Writing. We need to show you the full example in this case.>>

As you can see, there is more information here than we really want inside the Figure 1. One of the things that we care about is the value of the statistic. In the t-test of means, the statistic that we calculate is called “t.”  In the results table, the calculated value of t is called “t Stat”. When Excel does the test, the sign of t is arbitrary and depends on which column of data was selected first. So, we can ignore the sign and summarize by saying “t=2.77”.

Is a value of 2.77 large enough for us to conclude that the heights of men and women are different?

That depends partly on how many people we sampled. Therefore, it is important to always report the degrees of freedom which is related to the sample size. Degrees of freedom is not quite the same as sample size, but rather the number of values that are free to vary within a data set or put another way the number of independent pieces of information. Degrees of freedom is usually the data set minus one. In the case of a t-test of means, there are two sets of data, thus (7-1) + (7-1) equals a df of 12 here.

The most important value for us in the table is the P-value. We should use the value, which can be found in figure 1 under “P(T<=t) two-tail”.  (FOR ALL STATISTICAL TESTS IN 1511L, use the "two-tail" value.  If you want to know the difference in one-tail/two-tail, ask us or take Statistics.) Since 0.0171 is less than 0.05, we can conclude that the heights of men were significantly different from the heights of women based on this sample. Here is the standard way we would report the results of the test:

t=2.77, df=12, P=0.0171

(the value of the statistics, the degrees of freedom, then the P-value). 

To describe the results in words, we can say that we fail to accept the null hypothesis and that the mean heights of  our sampled men and women are significantly different form each other and specifically here, this means that the college-aged women are shorter than the colleg-aged men. We should consider, however, that there is some probability that we are mistaken. If the mean heights of men and women were really the same, there is a 0.0171 probability that we would get results this different by bad luck in sampling (with the same sample size). That's about 2% or one time in 50 similar experiments. This is also why we "failed to accept" the null hypothesis instead of outright "rejecting the null hypothesis".  A more formal Statistics Course would explain better why.  

Example 2: As you should have done for example 1 you also SHOULD read Example 2, and then try it on your own to compare the results. Save the Excel file to yourself in a ‘statistics’ folder with its’ own subfolder, ‘t-test’.

Here are some other fake data for heights of men and women, with the mean (average) values at the bottoms of the columns:

Table 2: Heights of men and women. Mean (average) values are in bold at the bottom of the columns.

women

men

175.4

181.5

172.1

185.7

181.1

175.3

171.5

177.4

176.3

169

167

181.2

171.3

175.5

173.5286

177.9429

The null and alternative hypotheses are the same as example 1 above. Run the test yourself using Excel. Check your work using the following summary of the test results for this sample: 

t=1.65, df=12, P=0.124

Here is how we could describe the results if they came out this way: we failed to reject the null hypothesis and did not show that the mean heights of men and women were significantly different. 

You should notice that it was NOT said that “we proved the alternative hypothesis,” nor “we showed that the heights of men and women are the same. It is possible that the height of men and women are really the same. But it is also possible that they are different, and that we couldn't detect the difference because we didn't sample enough people to make the differences show up (i.e. our experiment needs revision). If the mean heights of men and women were really the same, there is a 0.124 probability that we would get results this different by bad luck in sampling. That's about 12% of the time or only about 1 time out of 10 similar experiments. To put it another way, if the mean heights of men and women were really the same, it would really be uncommon to get differences this big by chance. So, it's a bit silly to say that we “proved” that the means were the same. The chances just aren't low enough that we feel confident to say that the differences are significant. We really needed to have sampled a lot more people to be more confident that the means were the same. 

Data files

Summary

Summary

A fundamental principle of hypothesis testing is that IT IS NOT POSSIBLE TO DECIDE IF TWO GROUPS ARE REALLY DIFFERENT BASED ON A SINGLE MEASUREMENT FROM EACH GROUP. This is because only multiple measurements allow for an estimate of how precisely we know the values of the group that we measured. Just as an example, what if the college-aged men we sampled in the first Example (where the heights were different) were from college-aged men on the basketball team (which tends to have players taller than average) OR the college-aged women were from the gymnast team (which tends to have participants shorter than average)? In other words, we potentially had a flawed set of data/sampling due to the number of people sampled and did not make sure to account for biases or variances that may affect the data. We will discuss the idea of that later in the semester.

In summary, the following relationships are true for ALL statistical tests:

The null hypothesis represents the situation where there is no difference between two groups or that an independent variable has no effect. The alternative hypothesis is that there is a difference or that a variable had an effectP is the probability that a given difference would occur in samples if there were really no difference between the populations from which the samples were taken. It can also be considered the probability of making a Type I error (i.e. assuming that differences are real when in fact the null hypothesis is true).

A difference is considered significant if P< 0.05 (most always the case, but researchers might alter the 0.05 requirement to 0.01 or something else, if the study requires it).

A significant difference for 1511L generally means we fail to accept the null hypothesis and assume the alternative hypothesis is true (although we cannot say the alternate is 'proved').

All statistical tests have a P value and P values are always used to evaluate significance. Although the statistics used to derive P may be different depending on the test (e.g. t, χ2F, G, etc.), the general relationship between a statistic and P is always the same. A higher value of the statistic (relative to a comparison of another value) results in a lower P value, and vice versa. 

1510L Concept Review (Scientific Literature Writing Guide or Writing in Science)

In addition to Excel and statistics from 1510L, you should also review other key concepts that you learned in 1510L that you will see again in 1511L. To do this, the following link will be invaluable: https://researchguides.library.vanderbilt.edu/c.php?g=69346&p=447282. Through this link you will find material covering key concepts such as: what is in a scientific paper, citations, and IN PARTICULAR how to do Table headings, Figure Legends, and Report Statistical Results. If you feel comfortable with these concepts from 1510L, know that this is available. If you were NOT in 1510L in Fall, then you should look through this.

You can also see from what you have already read thus far some of the ideas being used (Table headings, Figure Legends, reporting statistical results).  Take a look above again to look at the features you may have just glanced at in the Tables and Figures already shown.

There are also other general formats that you should do when you write about the results of your work/experiment "in Science".  Later on, we will look at a primary literature source and break it down into parts: the sections of a paper/journal article, common terminology/procedures in experiments, how to write based on established formats (how to write a scientific name, table headings/figure legends, tone and style). In 1510L, we looked at a few papers and then later on performed an experiment on that topic area and you wrote a science-style paper ("journal article") based on the results and your statistical tests.  In 1511L, we will also have an experiment for everyone to write about.