
BSCI 1511L Statistics Manual: 0.2.1 Running a t-test of means using RStudio

Introduction to Biological Sciences lab, second semester

Running RStudio

Read this section and work through it as a practice run for entering data directly into R.

There are three ways to get data into R: direct entry of data, inclusion of a data file, and loading from a URL. This page starts with direct data entry (typing in the values yourself). We will cover files later. Loading from a URL is something for you to learn about on your own if you go further into R usage.

 

This exercise assumes that you already have both R and RStudio installed on your computer, or that you are using the lab computers where they are already installed. If you need to install R and RStudio, see the Installing R and RStudio page.

To start the exercise, double-click on the RStudio icon.  It is possible to do the exercise using R rather than RStudio, but RStudio has additional capabilities, so using it is recommended.

 

One repeated comment: there are at least three different ways to get a set of data into R to do the statistical test. Try not to get them mixed up.

The Source Editor pane

Run RStudio, then from the File menu, select "New File -> R Script".  This will open a new Source Editor pane.  The Source Editor is where you can work on putting together scripts that will run a series of R commands in the console.  You could just as easily run those commands manually by yourself by typing them in the Console pane.  But by creating the script in the Source Editor, you can easily change things and re-run the script.  You can also save your script to use as a starting point in the future.  

Data format for t-test of means

We will be doing Example 1 from the review page (this is also the same t-test data you did in Week 0, for review), so you should refer to the Example 1 data on the review page before starting this exercise.

The format for doing a t-test of means in R requires placing all of the data in one column, with a second column containing a “grouping variable”. You do not have to use "grouping" as the column heading, but we do here so that the examples are easy to follow. The grouping variable is just a string (series of characters) that identifies the group to which a particular data value belongs. The height data table with grouping variable should look like this:

Notice how this format is different from the format we would typically use for Excel:
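As a rough sketch of the two layouts, the R code below builds the Example 1 heights first in a side-by-side form (one column per group, which is assumed here to be how the data would be typed in Excel) and then converts it to the one-value-column-plus-grouping form that the t-test needs. The column names `men` and `women` in the side-by-side table are assumptions made for illustration; the numbers are the ones used in the script on this page.

```r
# Side-by-side layout: one column per group.
# (Column names `men` and `women` are assumed for illustration.)
wide <- data.frame(
  men   = c(181.5, 187.3, 175.3, 178.3, 169, 183.2, 184.5),
  women = c(175.4, 172.1, 181.1, 165.2, 166.3, 167, 170.3)
)

# Stacked layout: one column of values plus a grouping column.
# stack() returns a data frame with columns named `values` and `ind`.
long <- stack(wide)
names(long) <- c("height", "grouping")
long <- long[, c("grouping", "height")]  # put the grouping column first

print(wide)
print(long)
```

The stacked table has 14 rows (7 men followed by 7 women), which is the form used in the rest of this page.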

Note: when you create a grouping variable, you should NOT use one that is composed only of numeric characters.  This can cause unpredictable behavior in R.  For example: "1, 2, 3" is NOT a good series of grouping variables, while "block1, block2, block3" is fine.  

It is also best when assigning grouping variables to copy the first instance and paste it into the other cells.  This prevents the accidental assignment of grouping variables that look the same, but that are actually different character strings.  For example, "cat" and "cat " ("cat" followed by a space) are not the same even though they look the same.  "group1" and "groupl" are also not the same (the former ends with a one and the latter ends with a lower case "L").
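If you suspect that look-alike labels have crept into a grouping column, a quick check along the following lines may help. This is only a sketch: the vector `grouping` below is made-up example data, not part of the assignment.

```r
# Made-up grouping labels: the second "cat" has a trailing space,
# and "groupl" ends with a lower-case L instead of the digit 1.
grouping <- c("cat", "cat ", "cat", "group1", "groupl")

# unique() lists every distinct label that R sees.
labels <- unique(grouping)
print(labels)

# nchar() shows label lengths, which exposes hidden spaces.
print(nchar(labels))

# trimws() removes leading and trailing whitespace from each label.
cleaned <- trimws(grouping)
print(table(cleaned))
```

Note that trimming only fixes stray spaces. A mistyped character, such as the "l" in "groupl", still shows up as its own group in the table and has to be corrected by hand.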

Running a script

There are basically three main steps to running an analysis in RStudio:

  1. Get data into the software.
  2. Do some kind of manipulations to the data (optional).
  3. Output some results.

The following script does those three things to conduct the t-test of means.  DON'T try to paste this script into RStudio!  Read on for the reason why!

# get the data into the software
Input =(
  "grouping height
men 181.5
men 187.3
men 175.3
men 178.3
men 169
men 183.2
men 184.5
women 175.4
women 172.1
women 181.1
women 165.2
women 166.3
women 167
women 170.3
")

# do some kind of manipulations to the data
Data = read.table(textConnection(Input),header=TRUE)

# output some results
t.test(height ~ grouping, data=Data,
       var.equal=TRUE,
       conf.level=0.95)

 

Here are several things you should notice about the script:

  1. It has comments on lines starting with “#”. These lines are ignored when the script is run. The lines starting with "#" just make it easier for people (like you and me) to understand the script when they look at it.
  2. The data have been entered directly into the script by typing them between the quotation marks. This is fine when you don't have many data points, but would be really annoying if you had a lot of numbers to type. The columns of data are separated by whitespace (space characters or tabs) and the rows of data are separated by linefeeds (i.e. pressing the Enter key). This is intuitive.
  3. The typed-in text gets turned into an actual R data table using the "read.table" function.
  4. The t.test function does the actual test. The arguments of the function specify what data in the table should be used and the exact kind of t-test to do. You can look at reference materials or examples to hack/change functions to make them do what you want.

Unfortunately, copying the script from this page also copies invisible bad characters that break the script. Instead, go to this GitHub Gist raw file page, copy the script from there, and paste the text into the Source Editor pane. To run the script, highlight it in the Source Editor, then click on the Run button. You will see each step of the script in the Console pane in blue, followed by the output of that step (if any) in black.

Compare the values of t, df, and P in the output to the values in the example. If they do not match the ‘Excel result’ (t=2.77, df=12, P=0.0171), try again. Be sure that you copied the script from the GitHub page and NOT from what was shown above. Also, R sometimes ‘remembers’ what you just did, so if the script still is not working, close RStudio and reopen it. Clear the previous script from the Source Editor if it is still there, and copy the GitHub script again.
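If you would rather have R check the three values than read them off the Console output, something like the following sketch could be used. It rebuilds the Example 1 data with data.frame() instead of the typed-in text block (a substitution made here only to keep the example short), and the names `result`, `t_value`, `df_value`, and `p_value` are chosen for illustration.

```r
# Example 1 data, built directly as a data frame.
Data <- data.frame(
  grouping = rep(c("men", "women"), each = 7),
  height = c(181.5, 187.3, 175.3, 178.3, 169, 183.2, 184.5,
             175.4, 172.1, 181.1, 165.2, 166.3, 167, 170.3)
)

# Same test as in the script, but the result is stored in a variable.
result <- t.test(height ~ grouping, data = Data,
                 var.equal = TRUE,
                 conf.level = 0.95)

# Pull the individual numbers out of the stored result.
t_value  <- unname(result$statistic)
df_value <- unname(result$parameter)
p_value  <- result$p.value

cat("t  =", round(t_value, 2), "\n")
cat("df =", df_value, "\n")
cat("P  =", round(p_value, 4), "\n")
```

Storing the result in a variable does not change the test; it just makes the individual numbers available for comparison with the Excel result.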

You should be aware that you can save R scripts as text files from the Source Editor pane. The default file extension for R scripts is “.R”. On a Windows computer, if that file extension is associated with RStudio, you can double-click on a file such as “test.R” and it will automatically be loaded into the Source Editor pane. You may wish to collect useful R scripts somewhere where you can find them for future use. SO, YOU SHOULD SAVE the script above in the place where you are saving all the statistics files you have done so far. Name it something clear, like ‘t-test – example1data.R’.

You now have a script and an example of a t-test using data “physically” inside the script. You could easily open this file later, carefully type over the data in the script, and run it again. But what if there are a lot of numbers: more than 20 in each data set, or 200, or 2,000? That is where getting the data from a file comes in.

Running a script using data from a file

Often you will collect or transfer larger amounts of data in a file. The Excel file format is too complicated for most programs to read, but there is a much simpler format called CSV (for “comma separated values”) that can be read by many programs. Excel can save data sheets in CSV format. If a spreadsheet file contains multiple sheets, each one must be saved as a separate CSV file. To save an Excel sheet in CSV format, go to Save As… and select “CSV (Comma delimited) (*.csv)” from the “Save as type:” dropdown.

BIG NOTICE: there are several kinds of CSV in the dropdown list, just as there are several kinds of Word file types. When you save an Excel file of data that you want to run in R, do not accept the file type that the computer picks for you; look in the dropdown list when you hit ‘Save As’ and choose “CSV (Comma delimited) (*.csv)”. TRYING TO USE .xlsx ‘Excel’ files won’t work with the script on this page. ONLY files saved as ".csv" will work.

Again, you have to save the Excel file as CSV (Comma delimited), NOT CSV UTF-8! The most common RStudio failure is saving the file in the wrong format.
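This page does not say why the UTF-8 variant causes trouble. One plausible explanation, offered here as an assumption, is that such files can begin with an invisible "byte order mark" that may end up attached to the first column name, so that R no longer finds a column called grouping. The sketch below imitates such a file with made-up contents and shows one possible workaround: the fileEncoding = "UTF-8-BOM" setting of read.csv, which asks R to drop the mark. Saving the file as plain “CSV (Comma delimited)” in the first place remains the simpler route.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF).
# The file contents and the temporary path are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("grouping,height\nmen,181.5\nwomen,175.4\n")
writeBin(c(bom, body), csv_path)

# Reading with fileEncoding = "UTF-8-BOM" removes the mark,
# so the first column keeps its intended name.
Data <- read.csv(csv_path, fileEncoding = "UTF-8-BOM")
print(names(Data))
```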

You can save the CSV file any place on your computer that you want (but learn to navigate your computer and keep things in organized folders). R has a function that initiates a “file open” dialog: 

file.choose()

 

When the file open dialog is executed, a popup window lets you navigate to and select the file that you want to open. Occasionally, the popup window is below the RStudio window. If you run the script and it seems to have gotten stuck, minimize the RStudio window and see if the file open dialog window was hiding underneath or look at the bottom task bar for a highlighted box.

The file.choose() function does not read the file itself; it only returns the location (path) of the file that you selected. The function "read.csv" takes that location, reads the CSV-formatted data, and converts it into an actual R data structure. Therefore the file open function and the CSV conversion function can be put together into a combined command that both opens the file and converts the CSV:

read.csv(file.choose())

To load data from a CSV file that uses the format shown in the height data table, I will go to the script that we've been working on and delete the "Input = …" command along with all of the typed-in data, and replace the "Data = read.table …" command with:

Data = read.csv(file.choose())

The new script looks like this:

# get the data into the software
Data = read.csv(file.choose())
 
# output some results
t.test(height ~ grouping, data=Data,
       var.equal=TRUE,
       conf.level=0.95)

In short: file.choose() lets you select the .csv file, read.csv reads that file and stores its contents in R as a data table named Data, and t.test then does a t-test of means comparing height between the groups in the grouping column. Note that the script does not display the data table; to see it, click on Data in the Environment pane or type Data in the Console.

Again, don't copy the script from above; get it from this raw text data file.

All we have done is take the typed-in data out of the original script and replace it with a command to ‘find the .csv file and read the .csv file’. You simply select the file you want. AGAIN, an Excel file (.xlsx or .xls) will NOT work; only a .csv file will work. You should save the ‘raw text’ as a Notepad (.txt) file and also save the R script (.R) file.

This is just a second way to run a test in R: using a file of data.
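To convince yourself that the file route gives the same answer as the typed-in data, you could try a sketch like the one below. Because file.choose() needs someone to click in a dialog, the sketch writes the Example 1 data to a temporary CSV file and passes that path to read.csv instead; the temporary file and the name `csv_path` are stand-ins for the file you would normally select.

```r
# Write the Example 1 data to a temporary CSV file.
# (A temporary path stands in for the file picked via file.choose().)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "grouping,height",
  "men,181.5", "men,187.3", "men,175.3", "men,178.3",
  "men,169", "men,183.2", "men,184.5",
  "women,175.4", "women,172.1", "women,181.1", "women,165.2",
  "women,166.3", "women,167", "women,170.3"
), csv_path)

# A file location is just a character string; read.csv does the reading.
print(class(csv_path))
Data <- read.csv(csv_path)
str(Data)

# Same t-test as before, now on the data that came from the file.
result <- t.test(height ~ grouping, data = Data,
                 var.equal = TRUE,
                 conf.level = 0.95)
print(result)
```

The str() line shows what read.csv produced: a data frame with a grouping column and a height column, which is the same structure the typed-in script creates with read.table.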

Loading the data from a file on the Internet using a URL (optional reading)

The material in this box is informational and not required to complete the assignment.

The THIRD way to get data into R is by way of a URL. This method is not described in detail here; we merely point out that it exists. You may learn about it on your own.

Normally, the file open dialog is going to be the easiest way to get data into an R script. However, sometimes your teacher, a colleague, or a website (of a paper or a research group, for example) might make a file available directly via a URL. In that case, an alternative to getting the file from your computer's drive is to specify a URL that points to the file's location on the Internet. One important consideration is that the URL must deliver the raw data file and not a web page.

 

And realize that you should get the same answer as before, since this is an example you previously did in Excel as well.

Data files

Below is the t-test.csv data file required to load the Example 1 data and run the t-test as described in the previous section.

Below is the Excel spreadsheet with the data from both examples.  Notice that this file has two sheets labeled "example 1" and "example 2"!  To run a t-test of means on the second example, you will have to change the format of the data in the second sheet into the form needed for doing the test in R.  Then save the sheet as a CSV file.  Some general advice about naming files:

  1. Don't use long file names.
  2. Don't use file names with spaces in them.
  3. Pay attention to whether you used upper or lower case letters in the file names.