Introduction to R: A Computational Workbench for Biological Data Analysis

Brian Capaldo

6/10/2017

Acknowledgements

attributes(Brian)

Why R?

History of R

R as a computational bench top

Data types in R

The base types include double, integer, character and logical. A value can be tested for membership in any of these types.

typeof(1)
## [1] "double"

typeof(1L)
## [1] "integer"

is.numeric(1.1)
## [1] TRUE

is.logical(TRUE)
## [1] TRUE
is.character("Hello")
## [1] TRUE
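
As a small sketch of the difference between asking for a type and testing for one: typeof() reports the storage type, while is.numeric() is TRUE for both integers and doubles. The variable names here are illustrative and not part of the original slides.

```r
# Illustrative values; these names do not come from the slides.
whole <- 1L
real  <- 1.5

type_whole <- typeof(whole)   # "integer"
type_real  <- typeof(real)    # "double"

# is.numeric() accepts both storage types; is.integer() is stricter
both_numeric    <- is.numeric(whole) && is.numeric(real)   # TRUE
real_is_integer <- is.integer(real)                        # FALSE

print(c(type_whole, type_real))
print(c(both_numeric, real_is_integer))
```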

Vectors

R provides several kinds of compound data structures commonly referred to as vectors. Most operations in R are vectorized: they take vectors as arguments and operate on the individual values (basic types) of those vectors.

The operator <- is the assignment operator.

x <- c(1,2,2)   # 'c()' is the concatenate function
y <- c(2,2,1)   
( x + y ) * x
## [1] 3 8 6
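
To make "vectorized" concrete, here is a minimal sketch comparing the one-line vector expression with an explicit loop that does the same work element by element. The names u, v, vectorized and looped are illustrative only.

```r
u <- c(1, 2, 2)
v <- c(2, 2, 1)

# Vectorized form: one expression over whole vectors
vectorized <- (u + v) * u

# Equivalent explicit loop, shown only for comparison
looped <- numeric(length(u))
for (i in seq_along(u)) {
  looped[i] <- (u[i] + v[i]) * u[i]
}

identical(vectorized, looped)   # TRUE
```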

Atomic vectors

Atomic vectors are homogeneous (possibly multi-dimensional) data structures.

c(1,2,3)                # again using 'c()' to concatenate
## [1] 1 2 3
c("one","two","three")
## [1] "one"   "two"   "three"

Subsetting

All vectors can be accessed using the subsetting operators [ and [[.

Elements of a vector can be accessed individually.

x <- c(1L, 2L, 3L) # The `L` lets R know you want an integer
x[1] <- 45
x[1]
## [1] 45
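
The slide names both [ and [[ but only shows [. Here is a hedged sketch of how they differ, using a named vector and a small list (lists are covered later in the deck). All names are illustrative.

```r
scores <- c(a = 10, b = 20, c = 30)   # a named vector

one_kept    <- scores["b"]            # `[` keeps the name
one_bare    <- scores[["b"]]          # `[[` returns the single value, name dropped
several     <- scores[c("a", "c")]    # `[` can return several elements

items <- list(id = 1L, label = "hi")
sub_list <- items["label"]            # `[` on a list returns a shorter list
element  <- items[["label"]]          # `[[` returns the element itself

print(one_kept)
print(one_bare)
print(class(sub_list))   # "list"
print(class(element))    # "character"
```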

A two-dimensional atomic vector is called a matrix, and a higher-dimensional one is called an array.

a <- matrix(1:6, ncol = 3, nrow = 2)
b <- array(1:12, c(2,3,4))

a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

b
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
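
Slices 3 and 4 of b repeat slices 1 and 2 because array() recycles the 12 supplied values to fill all 24 cells. A short sketch, using the illustrative names m and arr, of how to inspect dimensions and index by position:

```r
m   <- matrix(1:6, nrow = 2, ncol = 3)
arr <- array(1:12, dim = c(2, 3, 4))

dim(m)        # 2 3
dim(arr)      # 2 3 4
length(arr)   # 24: the 12 values are recycled to fill the array

m[2, 3]       # row 2, column 3          -> 6
arr[1, 2, 3]  # row 1, column 2, slice 3 -> 3

# Slice 3 is a repeat of slice 1 because of the recycling
identical(arr[, , 1], arr[, , 3])   # TRUE
```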

Dynamic typing and coercions

Matrices and arrays, like other atomic vectors, are homogeneous, but a value of a different basic type can still be assigned into any of them. Doing so coerces the entire data structure to a single common type.

x
## [1] 45  2  3
x[[1]] <- 1.1
x
## [1] 1.1 2.0 3.0

a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
a[[2,1]] <- "one"
a
##      [,1]  [,2] [,3]
## [1,] "1"   "3"  "5" 
## [2,] "one" "4"  "6"
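
The direction of coercion follows a fixed order: logical, then integer, then double, then character. A minimal sketch with c(), using illustrative variable names:

```r
# Each call mixes two types; the result takes the more general one
t1 <- typeof(c(TRUE, 1L))    # "integer"
t2 <- typeof(c(1L, 1.5))     # "double"
t3 <- typeof(c(1.5, "a"))    # "character"

print(c(t1, t2, t3))

# The values are converted as well, not just relabeled
c(TRUE, 2L)    # 1 2
c(1.5, "a")    # "1.5" "a"
```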

R has a powerful set of subsetting operations that apply uniformly to all vectors. They can select individual elements, ranges, rows and columns.

a[1,]       # subset the first row
## [1] "1" "3" "5"
a[1:2,]     # subset a range consisting of the first and second row
##      [,1]  [,2] [,3]
## [1,] "1"   "3"  "5" 
## [2,] "one" "4"  "6"
a[1:2,2:3]  # subset a range consisting of the first and second row and the second and third column
##      [,1] [,2]
## [1,] "3"  "5" 
## [2,] "4"  "6"
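
Two further subsetting idioms, sketched here on a fresh integer matrix m (an illustrative name) so the character matrix above is left alone: drop = FALSE keeps the matrix shape when a single row is selected, and a negative index excludes positions.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

row_vec <- m[1, ]                 # a plain vector: dimensions are dropped
row_mat <- m[1, , drop = FALSE]   # still a 1 x 3 matrix

dim(row_vec)   # NULL
dim(row_mat)   # 1 3

no_first_col <- m[, -1]           # all rows, every column except the first
no_first_col
```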

Coercions

aa <- as.integer(a) # as.___ functions coerce one type into another
## Warning: NAs introduced by coercion
a
##      [,1]  [,2] [,3]
## [1,] "1"   "3"  "5" 
## [2,] "one" "4"  "6"
aa
## [1]  1 NA  3  4  5  6

aa
## [1]  1 NA  3  4  5  6
dim(aa) <- c(2,3) # dim() refers to the dimension of the data structure
aa
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]   NA    4    6
a[1, a[1,] > 1] # values in the first row of a that are > 1 (a is character, so the comparison is made on strings)
## [1] "3" "5"
aa[1, aa[1,] > 1]
## [1] 3 5

When a predicate such as a[1,] > 1 is used for subsetting, R first evaluates it to a vector of logical values. TRUE marks a position that should be extracted. For the first row of aa, that vector is:

aa[1,]>1
## [1] FALSE  TRUE  TRUE

We can also supply a logical vector directly to subset:

aa[1, c(FALSE, TRUE, TRUE)]
## [1] 3 5
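
Because aa contains an NA, note how missing values behave in a logical index. This sketch uses an illustrative vector v: an NA in the index yields an NA in the result, while which() returns only the positions that are TRUE.

```r
v <- c(1, NA, 3)

mask <- v > 1          # FALSE NA TRUE
with_na <- v[mask]     # NA 3: the NA in the mask is carried through
positions <- which(mask)    # 3: only positions where the mask is TRUE
without_na <- v[positions]  # 3

print(mask)
print(with_na)
print(without_na)
```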

Subsetting can be used in assignment operations as well.

a
##      [,1]  [,2] [,3]
## [1,] "1"   "3"  "5" 
## [2,] "one" "4"  "6"
a[1:3] <- a[4:6]
a
##      [,1] [,2] [,3]
## [1,] "4"  "6"  "5" 
## [2,] "5"  "4"  "6"

Lists

Lists are heterogeneous vectors. The elements of a list can be of any type, including other lists and vectors.

list(1, "hi", c(1,2))
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hi"
## 
## [[3]]
## [1] 1 2

The typeof() of a list is "list". You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() will coerce them to a common type.
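
The functions named above can be tried on the same list as before. This is a small sketch; mixed, flat and as_list are illustrative names.

```r
mixed <- list(1, "hi", c(1, 2))

typeof(mixed)    # "list"
is.list(mixed)   # TRUE

# unlist() flattens to an atomic vector; mixed types are coerced to character
flat <- unlist(mixed)
typeof(flat)     # "character"
flat             # "1" "hi" "1" "2"

# as.list() goes the other way: one list element per vector element
as_list <- as.list(c(1, 2, 3))
length(as_list)  # 3
```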

Referential transparency

To facilitate equational reasoning, R attempts to provide referential transparency for function calls. In this context, that means a function does not change the arguments it is called with: it works on its own copy. The following function f therefore does not modify the vector passed in.

x
## [1] 1.1 2.0 3.0
f <- function(x) { x[1] <- 1 }
f(x)
x
## [1] 1.1 2.0 3.0
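
To actually get the modified vector, the function has to return it and the caller has to assign the result. A minimal sketch, with set_first, original and updated as illustrative names:

```r
set_first <- function(v) {
  v[1] <- 1   # changes the function's local copy only
  v           # return the modified copy explicitly
}

original <- c(1.1, 2, 3)
updated  <- set_first(original)

original   # 1.1 2.0 3.0 (unchanged)
updated    # 1 2 3
```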

Questions on the basics of R?

Packages

Installing packages from CRAN

install.packages("tidyverse", verbose = TRUE, repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/d_/gyqsqcq13m7_kcpx9hxdxx540000gn/T//Rtmp8pHiP8/downloaded_packages

Installing packages from Bioconductor

source("https://bioconductor.org/biocLite.R")
## Bioconductor version 3.5 (BiocInstaller 1.26.0), ?biocLite for help
biocLite("flowCore", verbose = TRUE, suppressUpdates = TRUE)
## BioC_mirror: https://bioconductor.org
## Using Bioconductor 3.5 (BiocInstaller 1.26.0), R 3.4.0 (2017-04-21).
## Installing package(s) 'flowCore'
## 
## The downloaded binary packages are in
##  /var/folders/d_/gyqsqcq13m7_kcpx9hxdxx540000gn/T//Rtmp8pHiP8/downloaded_packages
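
One way to avoid reinstalling packages that are already present is to check first with requireNamespace(). This is only a sketch: the second package name below is a deliberately nonexistent placeholder, and the install call is left commented out.

```r
# "aPackageThatDoesNotExist" is a hypothetical placeholder name
wanted <- c("stats", "aPackageThatDoesNotExist")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- wanted[!installed]
missing_pkgs

# Uncomment to install whatever is missing from CRAN:
# install.packages(missing_pkgs)
```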

Loading libraries

library(flowCore)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, flowCore, stats
## lag():    dplyr, stats
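
The conflict messages mean that a bare call to filter() or lag() now resolves to whichever package was attached last. The :: operator names the package explicitly. The sketch below uses only stats, which ships with R; the dplyr line is a comment because it assumes a data frame (df, hypothetical) that is not defined here.

```r
# Explicitly call the moving-average filter from stats,
# regardless of what other packages are attached
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed

# The dplyr version would be selected the same way, for example:
# dplyr::filter(df, some_column > 0)   # df and some_column are hypothetical
```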

Read Files

file.name <- system.file("extdata","0877408774.B08", package="flowCore")
x <- read.FCS(file.name, transformation=FALSE)
summary(x)
##             FSC-H     SSC-H    FL1-H     FL2-H    FL3-H     FL1-A
## Min.      85.0000   11.0000   0.0000    0.0000   0.0000    0.0000
## 1st Qu.  385.0000  141.0000 233.0000  277.0000  90.0000    0.0000
## Median   441.0000  189.0000 545.5000  346.0000 193.0000   26.0000
## Mean     491.9644  277.9105 439.1023  366.1567 179.7122   34.0766
## 3rd Qu.  518.0000  270.0000 610.0000  437.0000 264.0000   51.0000
## Max.    1023.0000 1023.0000 912.0000 1023.0000 900.0000 1023.0000
##             FL4-H   Time
## Min.       0.0000   1.00
## 1st Qu.  210.0000 122.00
## Median   279.0000 288.00
## Mean     323.5306 294.77
## 3rd Qu.  390.0000 457.50
## Max.    1022.0000 626.00

Search Keywords

keyword(x,c("$P1E", "$P2E", "$P3E", "$P4E"))
## $`$P1E`
## [1] "0,0"
## 
## $`$P2E`
## [1] "0,0"
## 
## $`$P3E`
## [1] "4,0"
## 
## $`$P4E`
## [1] "4,0"

One FCS file

x # Operator did not annotate the channels!
## flowFrame object '0877408774.B08'
## with 10000 cells and 8 observables:
##      name              desc range minRange maxRange
## $P1 FSC-H             FSC-H  1024        0     1023
## $P2 SSC-H             SSC-H  1024        0     1023
## $P3 FL1-H              <NA>  1024        0     1023
## $P4 FL2-H              <NA>  1024        0     1023
## $P5 FL3-H              <NA>  1024        0     1023
## $P6 FL1-A              <NA>  1024        0     1023
## $P7 FL4-H              <NA>  1024        0     1023
## $P8  Time Time (51.20 sec.)  1024        0     1023
## 147 keywords are stored in the 'description' slot

Biaxial Plot

library(ggcyto)
## Loading required package: ncdfFlow
## Loading required package: RcppArmadillo
## Loading required package: BH
## Loading required package: flowWorkspace
autoplot(x, "FL1-H", "FL2-H")
## Warning: Removed 32 rows containing missing values (geom_hex).

Histogram Plot

autoplot(x, "FL1-H")

Handling multiple FCS files

frames <- lapply(                                      # member of apply family
                 dir(system.file("extdata",            # object to which function will be "applied"
                                 "compdata",           # system.file() is pointing to where the data is
                                 "data",    
                                  package="flowCore"), # which package is the data stored in
                     full.names=TRUE),                 # returns the full file names
                 read.FCS)                             # function being applied
as(frames, "flowSet")                                  # casting all the frames as a flowSet
## A flowSet with 5 experiments.
## 
##   column names:
##   FSC-H SSC-H FL1-H FL2-H FL3-H FL1-A FL4-H

Storing a Flow Set

names(frames) <- sapply(frames, keyword, "SAMPLE ID") # naming each frame in the list by its "SAMPLE ID" keyword
fs <- as(frames, "flowSet")                           # storing the flowSet as "fs"
fs                                                    # TIMTOWTDI (there is more than one way to do it)
## A flowSet with 5 experiments.
## 
##   column names:
##   FSC-H SSC-H FL1-H FL2-H FL3-H FL1-A FL4-H

Annotating a flowSet

phenoData(fs)$Filename <- fsApply(fs,        # another apply function
                                  keyword,   # applies keyword() on each frame in fs
                                  "$FIL")    # returns the $FIL (file name)
pData(phenoData(fs))                         # fs is now annotated with the file names
##      name   Filename
## NA     NA 060909.001
## fitc fitc 060909.002
## pe     pe 060909.003
## apc   apc 060909.004
## 7AAD 7AAD 060909.005

Much more efficient method

fs <- read.flowSet(path=system.file("extdata",
                                    "compdata",
                                    "data",
                                    package="flowCore"), # pointing to the data
                   name.keyword="SAMPLE ID",             
                   phenoData=list(name="SAMPLE ID", Filename="$FIL")) # annotating the data
fs
## A flowSet with 5 experiments.
## 
## An object of class 'AnnotatedDataFrame'
##   rowNames: NA fitc ... 7AAD (5 total)
##   varLabels: name Filename
##   varMetadata: labelDescription
## 
##   column names:
##   FSC-H SSC-H FL1-H FL2-H FL3-H FL1-A FL4-H
pData(phenoData(fs))
##      name   Filename
## NA     NA 060909.001
## fitc fitc 060909.002
## pe     pe 060909.003
## apc   apc 060909.004
## 7AAD 7AAD 060909.005

Vignettes as Protocols

ggcyto

cytofkit

Tour of RStudio

RStudio

Resources

Next steps while here at CYTO 2017