# BSCI 1511L Statistics Manual: 0.2.1 Running a t-test of means using RStudio

Introduction to Biological Sciences lab, second semester

## Running RStudio

This exercise assumes that you already have both R and RStudio installed on your computer, or that you are using the lab computers where they are already installed.  If you need to install R and RStudio, see the Installing R and RStudio page.

To start the exercise, double-click on the RStudio icon.  It is possible to do the exercise using R rather than RStudio, but RStudio has additional capabilities, so using it is recommended.

## The Source Editor pane

With RStudio running, go to the File menu and select "New File -> R Script".  This will open a new Source Editor pane.  The Source Editor is where you put together scripts that run a series of R commands in the console.  You could just as easily run those commands yourself by typing them one at a time in the Console pane.  But by creating the script in the Source Editor, you can easily change things and re-run the script.  You can also save your script to use as a starting point in the future.

## Data format for t-test of means

We will be doing Example 1 from the review page, so you should refer to that before starting this exercise.

The format for doing a t-test of means in R requires placing all of the data in one column with a second column containing a "grouping variable".  The grouping variable is just a string (series of characters) that identifies the group to which a particular data value belongs.  In the height data table for this example, each row holds one measurement: the "grouping" column contains either "men" or "women" and the "height" column contains that person's height.

Notice how this format is different from the format we would typically use for Excel, where each group's values would be placed in a separate column.
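To make the two layouts concrete, here is a minimal sketch in R that builds the side-by-side layout and rearranges it into the one-value-per-row layout.  The object names `wide` and `long` are illustrative choices, not part of the exercise; the values are the Example 1 heights used in the script later on this page.

```r
# Wide layout, as it might be typed into Excel: one column per group
wide <- data.frame(
  men   = c(181.5, 187.3, 175.3, 178.3, 169, 183.2, 184.5),
  women = c(175.4, 172.1, 181.1, 165.2, 166.3, 167, 170.3)
)

# Long layout for R: one measurement per row plus a grouping column
long <- data.frame(
  grouping = rep(names(wide), each = nrow(wide)),
  height   = unlist(wide, use.names = FALSE)
)

print(long)
```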

Note: when you create a grouping variable, you should NOT use one that is composed only of numeric characters.  R will read such labels as numbers rather than as group names, which can cause unpredictable behavior.  For example: "1, 2, 3" is NOT a good series of grouping variables, while "block1, block2, block3" is fine.
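A quick way to see the difference is to ask R how it has classified each kind of label.  This is a small sketch using two-row tables invented for illustration (the names `numeric_labels` and `text_labels` are not part of the exercise):

```r
numeric_labels <- read.table(text = "grouping height
1 181.5
2 175.4", header = TRUE)

text_labels <- read.table(text = "grouping height
block1 181.5
block2 175.4", header = TRUE)

# All-digit labels are read as numbers, not as group names
is.numeric(numeric_labels$grouping)

# Labels containing letters are kept as text (or as a factor in older versions of R)
is.numeric(text_labels$grouping)
```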

It is also best when assigning grouping variables to copy the first instance and paste it into the other cells.  This prevents the accidental assignment of grouping variables that look the same, but that are actually different character strings.  For example, "cat" and "cat " ("cat" followed by a space) are not the same even though they look the same.  "group1" and "groupl" are also not the same (the former ends with a one and the latter ends with a lower case "L").
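One way to catch look-alike labels after the data are loaded is to list the distinct values in the grouping column.  A sketch, with a deliberately flawed vector named `labels` standing in for a real column:

```r
labels <- c("cat", "cat ", "cat", "group1", "groupl")

# Each distinct string is listed once; the quotes make a trailing space visible
print(unique(labels))

# trimws() removes leading and trailing whitespace, so "cat " becomes "cat"
cleaned <- trimws(labels)
print(table(cleaned))
```

Trimming fixes stray spaces, but it cannot fix a mistyped character such as the lower case "L" in "groupl"; that still has to be spotted in the list of distinct values.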

## Running a script

There are three main steps to running an analysis in RStudio:

1. Get data into the software.
2. Do some kind of manipulations to the data (optional).
3. Output some results.

The following script does those three things to conduct the t-test of means.  DON'T try to paste this script into RStudio!  Read on for the reason why!

    # get the data into the software
    Input =(
    "grouping height
    men 181.5
    men 187.3
    men 175.3
    men 178.3
    men 169
    men 183.2
    men 184.5
    women 175.4
    women 172.1
    women 181.1
    women 165.2
    women 166.3
    women 167
    women 170.3
    ")

    # do some kind of manipulations to the data
    Data = read.table(textConnection(Input),header=TRUE)

    # output some results
    t.test(height ~ grouping, data=Data,
    var.equal=TRUE,
    conf.level=0.95)

Here are several things you should notice about the script:

1. It has comments on lines starting with "#".  They are ignored when the script is run and just make it easier for people to understand the script when they look at it.
2. The data have been entered directly into the script by typing them between the quotation marks.  This is fine when you don't have many data, but would be really annoying if you had a lot of numbers to type.  The columns of data are separated by whitespace (space characters or tabs) and the rows of data are separated by linefeeds (i.e. pressing the Enter key).
3. The typed-in text gets turned into an actual R data table using the read.table function.
4. The t.test function does the actual test.  The arguments of the function specify what data in the table should be used and the exact kind of t-test to do.  You can look at reference materials or examples to hack functions to make them do what you want.
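After the read.table step, it can be worth confirming that the table looks the way you expect before running the test.  The sketch below uses the `text=` argument of read.table to turn the typed-in text into a table, and abbreviates the data to four rows to keep it brief.  The calls to str and aggregate are an optional check added here, not part of the exercise script.

```r
Input = (
"grouping height
men 181.5
men 187.3
women 175.4
women 172.1
")

Data = read.table(text = Input, header = TRUE)

# Show the column names, their types, and the first few values
str(Data)

# Mean height within each group
group_means = aggregate(height ~ grouping, data = Data, FUN = mean)
print(group_means)
```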

Unfortunately, copying this script from here includes invisible bad characters.  Instead, go to this GitHub Gist raw file page, copy the script, and paste it into the Source Editor pane.  To run the script, highlight it in the Source Editor, then click on the Run button.  You will see each step of the script in the Console pane in blue, followed by the output of that step (if any) in black.  Compare the values of t, df, and P that are given to the values in the example.
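If you want to pull out the individual numbers for comparison rather than reading them from the printed output, the object returned by t.test stores them by name.  A sketch, assuming the same Example 1 data and using `result` as an illustrative variable name:

```r
Data = read.table(text = "grouping height
men 181.5
men 187.3
men 175.3
men 178.3
men 169
men 183.2
men 184.5
women 175.4
women 172.1
women 181.1
women 165.2
women 166.3
women 167
women 170.3
", header = TRUE)

result = t.test(height ~ grouping, data = Data,
                var.equal = TRUE,
                conf.level = 0.95)

result$statistic  # the t value
result$parameter  # the degrees of freedom (df)
result$p.value    # the P value
```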

You should be aware that you can save R scripts as text files from the Source Editor pane.  The default file extension for R scripts is ".R" and on a Windows computer, if that file extension is associated with RStudio, you can double-click on a file such as "test.R" and it will automatically be loaded into the Source Editor pane.  You may wish to collect useful R scripts for future use somewhere where you can find them.
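A saved script can also be run in one step with R's source function.  The sketch below writes a one-line script to a temporary file purely so that the example is self-contained; with a real script you would pass its own path, such as "test.R".

```r
# Create a tiny script file in a temporary location (stand-in for a saved script)
script_path = tempfile(fileext = ".R")
writeLines("answer = mean(c(181.5, 187.3, 175.3))", script_path)

# Run every command in the file, as if it had been typed in the Console
source(script_path)

print(answer)
```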

## Running a script using data from a file

Often you will collect or transfer larger amounts of data in a file.  The file format of Excel is too complicated for most programs to read, but there is a much simpler format called CSV (for "comma separated values") that can be read by many programs.  Excel can save data sheets in CSV format.  If a spreadsheet contains multiple sheets, each one must be saved as a separate CSV file.  To save an Excel sheet in CSV format, go to Save As… and select "CSV (Comma delimited) (*.csv)" from the "Save as type:" dropdown.
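To see what a CSV file actually contains, the following sketch writes a few rows of the height data to a temporary CSV file from within R and prints the raw text.  The file is created with write.csv here only for illustration; in the exercise the file comes from Excel.

```r
Data = data.frame(
  grouping = c("men", "men", "women", "women"),
  height   = c(181.5, 187.3, 175.4, 172.1)
)

csv_path = tempfile(fileext = ".csv")
write.csv(Data, csv_path, row.names = FALSE)

# The raw file: a header row, then one row per measurement, values separated by commas
raw_lines = readLines(csv_path)
print(raw_lines)

# Reading the file back gives a table with the same columns
Reloaded = read.csv(csv_path)
print(Reloaded)
```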

You can save the CSV file any place on your computer that you want.  R has a function that initiates a "file open" dialog:

    file.choose()

When the file open dialog is executed, a popup window lets you navigate to and select the file that you want to open.  Occasionally, the popup window is below the RStudio window.  So if you run the script and it seems to have gotten stuck, minimize the RStudio window and see if the file open dialog window was hiding underneath.

The file.choose() function does not read the file itself; it only returns the path to the file that you selected.  The function read.csv reads the CSV-formatted data at that path into an actual R data structure.  So the file open function and the CSV reading function can be put together into a combined command that both opens the file dialog and reads the CSV:

    read.csv(file.choose())

So to load data from a CSV file that uses the format shown in the height data table, I will go to the script that we've been working on and delete the "Input = …" command along with all of the typed-in data, and replace the "Data = read.table …" command with:

    Data = read.csv(file.choose())

The new script looks like this:

    # get the data into the software
    Data = read.csv(file.choose())

    # output some results
    t.test(height ~ grouping, data=Data,
    var.equal=TRUE,
    conf.level=0.95)

Again, don't copy the script from here, but get it from this raw text data file.
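If you expect to re-run the analysis many times, choosing the file by hand each time can become tedious.  One possible variation, sketched below, wraps the loading step in a small helper that opens the dialog only when no path is supplied.  The function name `load_heights` and the temporary file are hypothetical and exist only to make the example runnable on its own.

```r
load_heights = function(path = NULL) {
  # With no path given, fall back to the file open dialog
  if (is.null(path)) {
    path = file.choose()
  }
  read.csv(path)
}

# Stand-in for a CSV file saved from Excel
csv_path = tempfile(fileext = ".csv")
writeLines(c("grouping,height",
             "men,181.5", "men,187.3", "men,175.3",
             "women,175.4", "women,172.1", "women,181.1"),
           csv_path)

Data = load_heights(csv_path)

t.test(height ~ grouping, data = Data,
       var.equal = TRUE,
       conf.level = 0.95)
```

Calling `load_heights()` with no argument opens the file open dialog instead of using a fixed path.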

The material in this box is informational and not required to complete the assignment.

Normally, the file open dialog is going to be the easiest way to get data into an R script.  However, sometimes your teacher, colleague, or a website might make a file available directly via a URL.  So an alternative to getting the file from your computer's drive is to specify a URL that points to the file location at some place on the Internet.  One important consideration is that the URL must deliver the raw data file and not a web page.  You can see the distinction between the two by comparing:

with

In the first case, the URL leads to a web page that displays the content of the CSV file formatted as an HTML table.  In the second case, the browser displays the actual characters that comprise the CSV file.  The second URL could be used to load the file as part of an R script, but the first URL would display an error.

Here is the command that would read data from this file into an R table:
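As a sketch of the general shape of such a command: read.csv accepts a URL in its file= argument.  The URL below is a stand-in built from a temporary local file so that the example runs without network access; it is not the address of the course data file.  In a real script you would type the raw-file URL between quotes after file=.

```r
# Stand-in for a file hosted on the Internet: a temporary local CSV file
csv_path = tempfile(fileext = ".csv")
writeLines(c("grouping,height",
             "men,181.5", "men,187.3",
             "women,175.4", "women,172.1"),
           csv_path)

# Build a file:// URL for it; a real script would use the raw file's https:// URL instead
csv_url = paste0("file://", normalizePath(csv_path, winslash = "/"))

# Read the data from the URL into an R table
Data = read.csv(file = csv_url)
print(Data)
```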

If you have a GitHub account, creating a Gist is an easy way to make raw data available through a URL.  Create a public Gist in the editing environment, then click on the Raw button at the upper right of the screen.  Copy the URL from the browser's address box and paste it into the script between the quotes after the file= argument as shown in the example above.

The following script should run the t-test example using the link to the file in the next section below.  You should be able to substitute the other URLs given above and get the same result as long as the file still exists on the servers.

    # output some results
    t.test(height ~ grouping, data=Data,
    var.equal=TRUE,
    conf.level=0.95)

Click here to get the raw script suitable for copy and paste

## Data files

Below is the t-test.csv data file required to load the Example 1 data and run the t-test as described in the previous section.

Below is the Excel spreadsheet with the data from both examples.  Notice that this file has two sheets labeled "example 1" and "example 2"!  To run a t-test of means on the second example, you will have to change the format of the data in the second sheet into the form needed for doing the test in R.  Then save the sheet as a CSV file.  Some general advice about naming files:

1. Don't use long file names.
2. Don't use file names with spaces in them.
3. Pay attention to whether you used upper or lower case letters in the file names.
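These points are easy to check from within R.  The helper below is a hypothetical sketch: the name `check_file_name` and the 20-character limit are arbitrary choices for illustration, not rules from this manual.

```r
check_file_name = function(name, max_length = 20) {
  problems = character(0)
  if (nchar(name) > max_length) {
    problems = c(problems, "name is long")
  }
  if (grepl(" ", name, fixed = TRUE)) {
    problems = c(problems, "name contains a space")
  }
  if (name != tolower(name)) {
    problems = c(problems, "name contains upper case letters, so watch the case when typing it")
  }
  problems
}

# Returns an empty result when nothing is flagged
check_file_name("t-test.csv")

# Returns one message per issue found
check_file_name("My Height Data example 2.csv")
```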