Intro to R tutorial

Return to SCWG Home

Wifi SSID: workshop

Please have these both installed (come early if you need help):

R: Download from here
rstudio: Download from here

Also, please install here awesome R packages (again, if you don’t know how please come early):

‘rms’
‘ggplot2’
‘dpylr’

Please download the following files

RStudio environment

Welcome to R studio! You should see 4 spaces:

Source

Your script! everything you do to your data should be typed here!! It saves what you are doing in a file ending in ‘.R’

Console

The place where the good stuff happens. When you run a command, you see the output here.

Enviroment/History

Enviroment: a list of everything in your “Workspace”
History: a list of every command you have run

Files/Plots/Packages/Help

Files: a file browser (like windows Explorer)
Plots: your plots will show up here
Packages: a list of all the packages you have installed with checkboxes showing if it is loaded
Help: your help documentation with be shown here.

Loading libraries / installing packages

for this tutorial, we are going to use the packages ggplot, rms and car.

Look at the Packages Tab
Click on the check box to the left of “ggplot2”
Notice that a bunch of lines start running in Console window to show you that the package is loading
Go to the History Tab (top right corner)
Notice that you should see lines starting in library(“rms”..) and library(“ggplot2”…) have just run
Click on these lines to highlight them, then click on To Source at the top of the panel
This moves the lines we just ran into our script so we can rerun this step tomorrow!

library(rms)

## Loading required package: Hmisc

## Loading required package: grid

## Loading required package: lattice

## Loading required package: survival

## Loading required package: splines

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read in data

The two datasets provided are as follows:

messy_demographic.csv:

5 variables:
- age, diagnosis (dx), ethnicity, sex, and subject identifier (subject_ID)

messy_cognitive.csv:

4 variables: +three cognitive scores (cog1, cog2, cog3), and subject identifier (subID)

In order to view and manipulate this data in R, we need to import the data into our R workspace (the same as you would open a file in excel to edit it).

Rstudio trick:

Click on the Environment Tab, then click on Import Dataset –> From text File
Navigate the browser window to the location of messy_cognitive.csv and click Open
This opens a text reader window: You see the raw text on the top and what R will read in (the data frame) at the bottom
In my view, it looks like the R is not going to read in the first line as a header..to change this
switch the Heading option on the right to yes
Click Import
Now, if you look at the Environment tab you should see that data1 has been loaded into R, It has 350 rows (or observations) and 5 variables
So that you do not have to type this again tomorrow - go to History, click on the line “data1 <- read.csv(…)” and then click on To Source
Repeat this whole process for messy_demographic.csv

data1 <- read.csv("~/Downloads/Rtutorial_data1.csv")
data2 <- read.csv("~/Downloads/Rtutorial_data2.csv")

What you actually did was use the read.csv function… to find out more about this option you can type “?read.csv” in the Console
This is the basic syntax of R functions: some.function(“stuff inside to do the function on”)
The help document for the read.csv function is shown in the Help tab

Now we have two “data frames” loaded into our workspace. They are called data1 and data2.

Basic data summaries and visualization ( head, tail, describe() )

Now that we have the data loaded, how do we just look at it? The simplest way is with the “View” function within rstudio.
In Enviroment tab. Click on the little spreadsheet to teh far right of the data1.csv row… this shows you your data in what looks like a spreadsheet - but you cannot edit it!

To look at the top six rows of your data:

head(data1)

##   subject_ID age sex ethnicity dx
## 1      SUB_1  43   0      Cauc  0
## 2      SUB_2  47   1      Cauc  1
## 3      SUB_3  69   1      Cauc  1
## 4      SUB_4  51   0      Cauc  1
## 5      SUB_5  52   1      Cauc  0
## 6      SUB_6  71   0        AA  1

To look at the bottom six rows:

tail(data2)

##          subID             cog1             cog2             cog3
## 345 subject345 10.5491439098318 29.0366805293698 50.0341778674542
## 346 subject346 13.7560798734217  21.047620123942 11.2510017219872
## 347 subject347 16.4949897425522  33.323511618334 2.42614834379569
## 348 subject348 11.1612493587292 30.8723352333704 16.0049438698844
## 349 subject349 15.3654440645612 29.7598065423247 44.3994545119479
## 350 subject350  13.993297479479 28.3229119000634 11.2012255384154

Using the function names() tells us what all the variables in our dataframe are called. the ls() function does the same thing, except it returns the variables in alphabetical order

names(data1)

## [1] "subject_ID" "age"        "sex"        "ethnicity"  "dx"

ls(data1)

## [1] "age"        "dx"         "ethnicity"  "sex"        "subject_ID"

That was all nice, but we want to find out more about this data we can use “summary”

summary(data1)

##    subject_ID       age           sex            ethnicity         dx     
##  SUB_1  :  1   51     : 18   Min.   :   0.00   Cauc   :196          :  1  
##  SUB_10 :  1   49     : 16   1st Qu.:   0.00   AA     : 70   0      :166  
##  SUB_100:  1   44     : 15   Median :   1.00   As     : 42   1      :178  
##  SUB_101:  1   47     : 15   Mean   :  29.44   In     : 19   9999   :  3  
##  SUB_102:  1   41     : 13   3rd Qu.:   1.00   Other  : 14   missing:  2  
##  SUB_103:  1   46     : 13   Max.   :9999.00          :  3                
##  (Other):344   (Other):260   NA's   :3         (Other):  6

summary(data2)

##         subID                   cog1                   cog2    
##  subject1  :  1   missing         :  3   9999            :  3  
##  subject10 :  1                   :  2                   :  1  
##  subject100:  1   10.0013840844221:  1   20.2283036591592:  1  
##  subject101:  1   10.0085681428667:  1   21.047620123942 :  1  
##  subject102:  1   10.0384391843365:  1   21.4284685612335:  1  
##  subject103:  1   10.1405843821872:  1   21.4715446666536:  1  
##  (Other)   :344   (Other)         :341   (Other)         :342  
##                  cog3    
##                    :  4  
##  9999              :  4  
##  0.0579567477989184:  1  
##  0.0910132398598138:  1  
##  0.15520493462606  :  1  
##  0.227352897565002 :  1  
##  (Other)           :338

Another very useful function is describe() - (from the rms package)

describe(data1)

## data1 
## 
##  5  Variables      350  Observations
## ---------------------------------------------------------------------------
## subject_ID 
##       n missing  unique 
##     350       0     350 
## 
## lowest : SUB_1   SUB_10  SUB_100 SUB_101 SUB_102
## highest: SUB_95  SUB_96  SUB_97  SUB_98  SUB_99  
## ---------------------------------------------------------------------------
## age 
##       n missing  unique 
##     350       0      56 
## 
## lowest :         22      25      26      29     
## highest: 76      78      89      9999    missing 
## ---------------------------------------------------------------------------
## sex 
##       n missing  unique    Info    Mean 
##     347       3       3    0.71   29.44 
## 
## 0 (130, 37%), 1 (216, 62%), 9999 (1, 0%) 
## ---------------------------------------------------------------------------
## ethnicity 
##       n missing  unique 
##     350       0       8 
## 
##             9999 AA As Cauc In missing Other
## Frequency 3    3 70 42  196 19       3    14
## %         1    1 20 12   56  5       1     4
## ---------------------------------------------------------------------------
## dx 
##       n missing  unique 
##     350       0       5 
## 
##               0   1 9999 missing
## Frequency 1 166 178    3       2
## %         0  47  51    1       1
## ---------------------------------------------------------------------------

describe(data2)

## data2 
## 
##  4  Variables      350  Observations
## ---------------------------------------------------------------------------
## subID 
##       n missing  unique 
##     350       0     350 
## 
## lowest : subject1   subject10  subject100 subject101 subject102
## highest: subject95  subject96  subject97  subject98  subject99  
## ---------------------------------------------------------------------------
## cog1 
##       n missing  unique 
##     350       0     347 
## 
## lowest :                  10.0013840844221 10.0085681428667 10.0384391843365 10.1405843821872
## highest: 9.87965671775739 9.91112193296954 9.95977995174256 9.98090282668629 missing          
## ---------------------------------------------------------------------------
## cog2 
##       n missing  unique 
##     350       0     348 
## 
## lowest :                  20.2283036591592 21.047620123942  21.4284685612335 21.4715446666536
## highest: 41.6581530711547 41.8981032197182 44.7385384301294 9999             missing          
## ---------------------------------------------------------------------------
## cog3 
##       n missing  unique 
##     350       0     344 
## 
## lowest :                    0.0579567477989184 0.0910132398598138 0.15520493462606   0.227352897565002 
## highest: 9.73188733825491   9.78735321905286   9.94790280238559   9999               missing            
## ---------------------------------------------------------------------------

Data cleaning

Now that you have looked at your data - you might have noticed that there are a couple probems The RA that you have been working with have coded missing values in three different ways (“9999”, “missing”,and “NA”)
We first need to set these all to NA - which R recognizes as missing value:

The following will take all values in data1 that are equal to “”, “missing”, or “9999”, and code them as missing in a way that R understands:

data1[data1==""] <- NA
data1[data1=="missing"] <- NA
data1[data1=="9999"] <- NA

Because R is “smart”, it categorizes data types automatically when data are loaded. Before working with new data, especailly if it is real (i.e. messy), it is important to tell R what kind of data you are dealing with. This will be especially important when we discuss our statistical analyses… after all, R is statistical software.

The following will correctly format our variables for analyses:

age is a numeric variable
ethicity is a discrete factor
sex is a discrete factor
diagnosis is a discrete factor

data1$age <- as.numeric(as.character(data1$age))
data1$ethnicity <- factor(data1$ethnicity,levels=c("Cauc","AA","As","In","Other"))
data1$sex <- factor(data1$sex, levels=c(0,1), labels=c("Male","Female"))
data1$dx <- factor(data1$dx, levels=c(0,1), labels=c("Control","Case"))

By indicating the levels of our factors, we have erased from R the memory that we once had values of “”, “9999”, and “missing” (which up until now R had no reason to assume were not observations).

Let us now apply the same cleanup steps to our second data frame:

Remove missing:

data2[data2==""] <- NA
data2[data2=="missing"] <- NA
data2[data2=="9999"] <- NA

Correctly format variables for analyses:

data2$cog1 <- as.numeric(as.character(data2$cog1))
data2$cog2 <- as.numeric(as.character(data2$cog2))
data2$cog3 <- as.numeric(as.character(data2$cog3))

Merging data frames

In order to analyze the effect of sex on diagnosis, or perform any other comparison across our data frames, we should merge them. If you remember only this and nothing else today, it will still have been worth your time.

Conceptually, merging two data frames assumes that the rows in one correspond to rows in the other, even if they are not in the same order. In order to match up the correct rows between data frames we need to make sure that one column in each spreadsheet can act as a “key” (i.e. each row has a unique value in this key that is the same in both spreadsheets). In our case, we have one subject identifier column in each of our spreadsheets.

First we need to make sure that the values in these columns are the same:

data2$subID <- gsub(data2$subID,pattern="subject",replacement="SUB_")

We can then merge the two datasets by specifying their names (in order x,y) and then specifying which columns are to be used as the key to merging the two data frames (by.x and by.y):

alldata <- merge(data1,data2,by.x="subject_ID",by.y="subID")

Skipping ahead a little - now we can look at histograms of our numeric variables, just to see what we are dealing with:

hist(data2$cog1)

hist(data2$cog2)

hist(data2$cog3)

hist(data1$age)

Now that our data are loaded, cleaned, and merged, it is time to do some basic statistics!

STUDY QUESTION 1: What is the relationship between sex and diagnosis?

For this question, our null hypothesis is that there is no difference in the number of males and females between our case and control diagnosis groups

The ftable() function will give us a 2 x 2 contingency table of the frequency of observations in each category. the formula syntax “y ~ x” is common in R!

ftable(data=alldata,dx~sex)

##        dx Control Case
## sex                   
## Male           37   90
## Female        127   86

We now want to save that table as an object called “dxXsex_table”:

dxXsex_table <- ftable(data=alldata,dx~sex)

Now, in order to test our null hypothesis using a chi-squared test, we simply apply the chisq.test() function to that table:

chisq.test(dxXsex_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dxXsex_table
## X-squared = 28.4149, df = 1, p-value = 9.791e-08

Similarly, we can use the nonparametric Fisher test to get a more exact test statistic:

fisher.test(dxXsex_table)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  dxXsex_table
## p-value = 5.68e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1686647 0.4567064
## sample estimates:
## odds ratio 
##   0.279488

A bit more advanced! This will accoplish the same thing as ftable(), except that here we are indexing our alldata dataframe with the R syntax [,]. the blank value for tells R that we want all rows. The c(“dx”,“sex”) value for means we want to use the columns named “dx” and “sex”. the table() function knows to arrange these as a 2 x 2 contingency table.

table(alldata[ ,c("dx","sex")])

##          sex
## dx        Male Female
##   Control   37    127
##   Case      90     86

STUDY QUESTION 2: What is the relationship between diagnosis and cog1?

for this question, our null hypothesis is that there is no difference in cog1 between our case and control diagnosis groups

t.test(cog1 ~ dx, data=alldata)

## 
##  Welch Two Sample t-test
## 
## data:  cog1 by dx
## t = -6.347, df = 334.916, p-value = 7.133e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.871388 -1.512676
## sample estimates:
## mean in group Control    mean in group Case 
##              9.940047             12.132079

ggplot(alldata, aes(x=dx, y=cog1)) + geom_boxplot()

## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

P.S.* Here is an R script with all of the steps we went over today!! Download Intro R script