X Tutup
$\pagebreak$ ## Overview and History of R * **R** = dialect of the **S** language * S was developed by John Chambers @ Bell Labs * initiated in 1976 as internal tool, originally FORTRAN libraries * 1988 rewritten in C (version 3 of language) * 1998 version 4 (what we use today) * **History of S** * Bell labs $\rightarrow$ insightful $\rightarrow$ Lucent $\rightarrow$ Alcatel-Lucent * in 1998, S won the Association for computing machinery’s software system award * **History of R** * 1991 created in New Zealand by Ross Ihaka & RobertGentleman * 1993 first announcement of R to public * 1995 Martin Machler convinces founders to use GNU General Public license to make R free * 1996 public mailing list created R-help and R-devel * 1997 R Core Group formed * 2000 R v1.0.0 released * **R Features** * Syntax similar to S, semantics similar to S, runs on any platforms, frequent releasees * lean software, functionalities in modular packages, sophisticated graphics capabilities * useful for interactive work, powerful programming language * active user community and ***FREE*** (4 freedoms) * freedom to run the program * freedom to study how the program works and adapt it * freedom to redistribute copies * freedom to improve the program * **R Drawbacks** * 40 year-old technology * little built-in support for dynamic/3D graphics * functionality based on consumer demand * objects generally stored in physical memory (limited by hardware) * **Design of the R system** * 2 conceptual parts: base R from CRAN vs. everything else * functionality divided into different packages * **base R** contains core functionality and fundamental functions * other utility packages included in the base install: `util`, `stats`, `datasets`, ... * Recommended packages: `bootclass`, `KernSmooth`, etc * 5000+ packages available $\pagebreak$ ## Coding Standards * Always use text files/editor * Indent code (4 space minimum) * limit the width of code (80 columns) * limit the length of individual functions ## Workspace and Files * `getwd()` = return current working directory * `setwd()` = set current working directory * `?function` = brings up help for that function * `dir.create("path/foldername", recursive = TRUE)` = create directories/subdirectories * `unlink(directory, recursive = TRUE)` = delete directory and subdirectories * `ls()` = list all objects in the local workspace * `list.files(recursive = TRUE)` = list all, including subdirectories * `args(function)` = returns arguments for the function * `file.create("name")` = create file * `.exists("name")` = return true/false exists in working directory * `.info("name")` = return file info * `.info("name")$property` = returns value for the specific attribute * `.rename("name1", "name2")` = rename file * `.copy("name1", "name2")` = copy file * `.path("name1")` = return path of file ## R Console and Evaluation * `<-` = assignment operator * `#` = comment * expression is evaluated after hitting `enter` and result is returned * autoprinting occurs when you call a variable * `print(x)` = explicitly printing * `[1]` at the beginning of the output = which element of the vector is being shown $\pagebreak$ ## R Objects and Data Structures * 5 basic/**atomic classes** of objects: 1. character 2. numeric 3. integer 4. complex 5. logical * **Numbers** * numbers generally treated as `numeric` objects (double precision real numbers - decimals) * Integer objects can be created by adding `L` to the end of a number(ex. `1L`) * `Inf` = infinity, can be used in calculations * `NaN` = not a number/undefined * `sqrt(value)` = square root of value * **Variables** * `variable <- value` = assignment of a value to a variable name ### Vectors and Lists * **atomic vector** = contains one data type, most basic object * `vector <- c(value1, value2, ...)` = creates a vector with specified values * `vector1*vector2` = element by element multiplication (rather than matrix multiplication) * if the vectors are of different lengths, shorter vector will be recycled until the longer runs out * computation on vectors/between vectors (`+`, `-`, `==`, `/`, etc.) are done element by element by default * `%*%` = force matrix multiplication between vectors/matrices * `vector("class", n)` = creates empty vector of length n and specified class - `vector("numeric", 3)` = creates 0 0 0 * `c()` = concatenate * `T, F` = shorthand for `TRUE` and `FALSE` * `1+0i` = complex numbers * **explicit coercion** * `as.numeric(x)`, `as.logical(x)`, `as.character(x)`, `as.complex(x)` = convert object from one class to another * nonsensible coercion will result in NA (ex. `as.numeric(c("a", "b")`) * `as.list(data.frame)` = converts a `data.frame` object into a `list` object * `as.character(list)` = converts list into a character vector * **implicit coercion** * matrix/vector can only contain one data type, so when attempting to create matrix/vector with different classes, forced coercion occurs to make every element to same class * *least common denominator* is the approach used (basically everything is converted to a class that all values can take, numbers $\rightarrow$ characters) and *no errors generated* * coercion occurs to make every element to same class (implicit) - `x <- c(NA, 2, "D")` will create a vector of character class * `list()` = special vector wit different classes of elements - `list` = vector of objects of different classes * elements of list use `[[]]`, elements of other vectors use `[]` * **logical vectors** = contain values `TRUE`, `FALSE`, and `NA`, values are generated as result of logical conditions comparing two objects/values * `paste(characterVector, collapse = " ")` = join together elements of the vector and separating with the `collapse` parameter * `paste(vec1, vec2, sep = " ")` = join together different vectors and separating with the `sep` parameter * ***Note**: vector recycling applies here too * * `LETTERS`, `letters`= predefined vectors for all 26 upper and lower letters * `unique(values)` = returns vector with all duplicates removed ### Matrices and Data Frames * `matrix` can contain **only 1** type of data * `data.frame` can contain **multiple** * `matrix(values, nrow = n, ncol = m)` = creates a n by m matrix * constructed **COLUMN WISE** $\rightarrow$ the elements are placed into the matrix from top to bottom for each column, and by column from left to right * matrices can also be created by adding the dimension attribute to vector * `dim(m) <- c(2, 5)` * matrices can also be created by binding columns and rows * `rbind(x, y)`, `cbind(x, y)` = combine rows/columns; can be used on vectors or matrices * `*` and `/` = element by element computation between two matrices * `%*%` = matrix multiplication * `dim(obj)` = dimensions of an object (returns `NULL` if a vector) * `dim(obj) <- c(4, 5)` = assign `dim` attribute to an object * if object is a vector, R converts the vector to a n by m matrix (i.e. 4 rows by 5 column from the example command) * ***Note**: if n by m is larger than length of vector, then an error is returned* * ***example*** ```{r} # initiate a vector x <-c(NA, 1, "cx", NA, 2, "dsa") class(x) x # convert to matrix dim(x) <- c(3, 2) class(x) x ``` * `data.frame(var = 1:4, var2 = c(….))` = creates a data frame * `nrow()`, `ncol()` = returns row and column numbers * `data.frame(vector, matrix)` = takes any number of arguments and returns a single object of class "data.frame" composed of original objects * `as.data.frame(obj)` = converts object to data frame * data frames store tabular data * special type of list where every list has the same length (can be of different type) * data frames are usually created through `read.table()` and `read.csv()` * `data.matrix()` = converts a data frame to matrix. * `colMeans(matrix)` or `rowMeans(matrix)` = returns means of the columns/rows of a matrix/dataframe in a vector * `as.numeric(rownames(df))` = returns row indices for rows of a data frame with unnamed rows * **attributes** * objects can have attributes: `names`, `dimnames`, `row.names`, `dim` (matrices, arrays), `class`, `length`, or any user-defined ones * `attributes(obj)`, `class(obj)` = return attributes/class for an R object * `attr(object, "attribute") <- "value"` = creates/assigns a value to a new/existing attribute for the object * `names` attribute * all objects can have names * `names(x)` = returns names (`NULL` if no name exists) * `names(x) <- c("a", …)` = can be used to assign names to vectors * `list(a = 1, b = 2, …)` = `a`, `b` are names * `dimnames(matrix) <- list(c("a", "b"), c("c" , "d"))` = assign names to matrices * use list of two vectors: row, column in that order * `colnames(data.frame)` = return column names (can be used to set column names as well, similar to `dim()`) * `row.names` = names of rows in the data frame (attribute) ### Arrays * multi-dimensional collection of data with $k$ dimensions * matrix = 2 dimensional array * `array(data, dim, dimnames)` - `data` = data to be stored in array - `dim` = dimensions of the array + `dim = c(2, 2, 5)` = 3 dimensional array $\rightarrow$ creates 5 2x2 array - `dimnames` = add names to the dimensions + input must be a `list` + every element of the `list` must correspond in length to the dimensions of the array + `dimnames(x) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i"))` = set the names for row, column, and third dimension respectively (2 x 2 x 5 in this case) * `dim()` function can be used to create arrays from vectors or matrices - `x <- rnorm(20); dim(x) <- c(2, 2, 5)` = converts a 20 element vector to a 2x2x5 array ### Factors * factors are used to represent *categorical data* (integer vector where each value has a label) * 2 types: **unordered** vs **ordered** * treated specially by `lm()`, `glm()` * Factors easier to understand because they self describe (vs. 1 and 2) * `factor(c("a", "b"), levels = c("1", "2"))` = creates factor * `levels()` argument can be used to specify baseline levels vs other levels * ***Note**:without explicit specification, R uses alphabetical order* * `table(factorVar)` = how many of each are in the factor $\pagebreak$ ## Missing Values * `NaN` or `NA` = missing values * `NaN` = undefined mathematical operations * `NA` = any value not available or missing in the statistical sense - any operations with `NA` results in `NA` - NA can have different classes potentially (integer, character, etc) * ***Note**: NaN is an NA value, but NA is not NaN* * `is.na()`, `is.nan()` = use to test if each element of the vector is `NA` and `NaN` * ***Note**: cannot compare `NA` (with `==`) as it is not a value but a **placeholder** for a quantity that is not available* * `sum(my_na)` = sum of a logical vector (`TRUE` = 1 and `FALSE` = 0) is effectively the number of `TRUE`s * **Removing `NA` Values** * `is.na()` = creates logical vector where T is where value exists, F is `NA` * subsetting with the above result can return only the non NA elements * `complete.cases(obj1, obj2)` = creates logical vector where `TRUE` is where both values exist, and `FALSE` is where any is `NA` * can be used on data frames as well * `complete.cases(data.frame)` = creates logical vectors indicating which observation/row is good * `data.frame[logicalVector, ]` = returns all observations with complete data * **Imputing Missing Values** = replacing missing values with estimates (can be averages from all other data with the similar conditions) $\pagebreak$ ## Sequence of Numbers * `1:20` = creates a sequence of numbers from first number to second number * works in descending order as well * increment = 1 * `?':'` = enclose help for operators * `seq(1, 20, by=0.5)` = sequence 1 to 20 by increment of .5 * `length=30` argument can be used to specify number of values generated * `length(variable)` = length of vector/sequence * `seq_along(vector)` or `seq(along.with = vector)` = create vector that is same length as another vector * `rep(0, times = 40)` = creates a vector with 40 zeroes * `rep(c(1, 2), times = 10)` = repeats combination of numbers 10 times * `rep(c(1, 2), each = 10)` = repeats first value 10 times followed by second value 10 times ## Subsetting * R uses **one based index** $\rightarrow$ starts counting at $1$ * `x[0]` returns `numeric(0)`, not error * `x[3000]` returns `NA` (not out of bounds/error) * `[]` = always returns object of same class, can select more than one element of an object (ex. `[1:2]`) * `[[]]` = can extract one element from list or data frame, returned object not necessarily list/dataframe * `$` = can extract elements from list/dataframe that have names associated with it, not necessarily same class ### Vectors * `x[1:10]` = first 10 elements of vector x * `x[is.na(x)]` = returns all NA elements * `x[!is.na(x)]` = returns all non NA elements * `x > 0` = would return logical vector comparing all elements to 0 (`TRUE`/`FALSE` for all values except for NA and `NA` for NA elements (NA a placeholder) * `x[x>"a"]` = selects all elements bigger than a (lexicographical order in place) * `x[logicalIndex]` = select all elements where logical index = TRUE * `x[-c(2, 10)]` = returns everything **but** the second and tenth element * `vect <- c(a = 1, b = 2, c = 3)` = names values of a vector with corresponding names * `names(vect)` = returns element names for object * `names(vet) <- c("a", "b", "c")` = assign/change names of vector * `identical(obj1, obj2)` = returns TRUE if two objects are exactly equal * `all.equal(obj1, obj2)` = returns TRUE if two objects are near equal ### Lists * `x <- list(foo = 1:4, bar = 0.6)` * `x[1]` or `x["foo"]` = returns the list object `foo` * `x[[2]]` or `x[["bar"]]` or `x$bar` = returns the content of the second element from the list (in this case vector without name attribute) * ***Note**: `$` can’t extract multiple elements * * `x[c(1, 3)]` = extract multiple elements of list * `x[[name]]` = extract using variable, where as `$` must match name of element * `x[[c(1, 3)]]` or `x[[1]][[3]]` = extracted nested elements of list third element of the first object extracted from the list ### Matrices * `x[1, 2]` = extract the (row, column) element * `x[,2]` or `x[1,]` = extract the entire column/row * `x[ , 11:17]` = subset the x `data.frame` with all rows, but only 11 to 17 columns * when an element from the matrix is retrieved, a vector is returned * behavior can be turned off (force return a matrix) by adding `drop = FALSE` * `x[1, 2, drop = F]` ### Partial Matching * works with `[[]]` and `$` * `$` automatically partial matches the name (`x$a`) * `[[]]` can partial match by adding exact = `FALSE` * `x[["a", exact = false]]` $\pagebreak$ ## Logic * `<`, `>=` = less than, greater or equal to * `==` = exact equality * `!=` = inequality * `A | B` = union * `A & B` = intersection * `!` = negation * `&` or `|` evaluates every instance/element in vector * `&&` or `||` evaluate only first element * ***Note**: All AND operators are evaluated before OR operators * * `isTRUE(condition)` = returns `TRUE` or `FALSE` of the condition * `xor(arg1, arg2)` = exclusive OR, one argument must equal `TRUE` one must equal `FALSE` * `which(condition)` = find the indicies of elements that satisfy the condition (TRUE) * `any(condition)` = `TRUE` if one or more of the elements in logical vector is `TRUE` * `all(condition)` = `TRUE` if all of the elements in logical vector is `TRUE` ## Understanding Data * use `class()`, `dim()`, `nrow()`, `ncol()`, `names()` to understand dataset * `object.size(data.frame)` = returns how much space the dataset is occupying in memory * `head(data.frame, 10)`, `tail(data.frame, 10)` = returns first/last 10 rows of data; default = 6 * `summary()` = provides different output for each variable, depending on class, * for numerical variables, displays min max, mean median, etx. * for categorical (factor) variables, displays number of times each value occurs * `table(data.frame$variable)` = table of all values of the variable, and how many observations there are for each * ***Note**: mean for variables that only have values 1 and 0 = proportion of success* * `str(data.frame)` = structure of data, provides data class, num of observations vs variables, and name of class of each variable and preview of its contents * compactly display the internal structure of an R object * "What’s in this object" * well-suited to compactly display the contents of lists + `view(data.frame)` = opens and view the content of the data frame $\pagebreak$ ## Split-Apply-Combine Funtions * loop functions = convenient ways of implementing the Split-Apply-Combine strategy for data analysis ### `split()` * takes a vector/objects and splits it into group b a factor or list of factors * `split(x, f, drop = FALSE)` * `x` = vector/list/data frame * `f` = factor/list of factors * `drop` = whether empty factor levels should be dropped * `interactions(gl(2, 5), gl(5, 2))` = 1.1, 1.2, … 2.5 - `gl(n, m)` = group level function * `n` = number of levels * `m` = number of repetitions - `split` function can do this by passing in list(f1, f2) in argument * `split(data, list(gl(2, 5), gl(5, 2)))` = splits the data into 1.1, 1.2, … 2.5 levels ### `apply()` * evaluate a function (often anonymous) over the margins of an array * often used to apply a function to the row/columns of a matrix * can be used to average array of matrices (general arrays) * `apply(x, margin = 2, FUN, ...)` * `x` = array * `MARGIN` = 2 (column), 1 (row) * `FUN` = function * `…` = other arguments that need to be passed to other functions * ***examples*** * `apply(x, 1, sum)` or `apply(x, 1, mean)` = find row sums/means * `apply(x, 2, sum)` or `apply(x, 2, mean)` = find column sums/means * `apply(x, 1, quantile, props = c(0.25, 0.75))` = find 25% 75% percentile of each row * `a <- array(rnorm(2*2*10), c(2, 2, 10))` = create 10 2x2 matrix * `apply(a, c(1, 2), mean)` = returns the means of 10 ### `lapply()` * loops over a `list` and evaluate a function on each element and always returns a **`list`** - ***Note**: since input must be a list, it is possible that conversion may be needed* * `lapply(x, FUN, ...)` = takes list/vector as input, applies a function to each element of the list, returns a list of the same length * `x` = list (if not list, will be coerced into list through "as.list", if not possible —> error) * `data.frame` are treated as collections of lists and can be used here * `FUN` = function (without parentheses) * anonymous functions are acceptable here as well - (i.e `function(x) x[,1]`) * `…` = other/additional arguments to be passed for FUN (i.e. `min`, `max` for `runif()`) * ***example*** * `lapply(data.frame, class)` = the data.frame is a list of vectors, the `class` value for each vector is returned in a list (name of function, `class`, is without parentheses) * `lapply(values, function(elem), elem[2])` = example of an anonymous function ### `sapply()` * performs same function as `lapply()` except it simplifies the result * if result is of length 1 in every element, sapply returns vector * if result is vectors of the same length (>1) for each element, sapply returns matrix * if not possible to simplify, `sapply` returns a `list` (same as `lapply()`) ### `vapply()` * safer version of `sapply` in that it allows to you specify the format for the result * `vapply(flags, class, character(1))` = returns the `class` of values in the flags variable in the form of character of length 1 (1 value) ### `tapply()` * split data into groups, and apply the function to data within each subgroup * `tapply(data, INDEX, FUN, ..., simplify = FALSE)` = apply a function over subsets of a vector * `data` = vector * `INDEX` = factor/list of factors * `FUN` = function * `…` = arguments to be passed to function * `simplify` = whether to simplify the result * ***example*** - `x <- c(rnorm(10), runif(10), rnorm(10, 1))` - `f <- gl(3, 10); tapply(x, f, mean)` = returns the mean of each group (f level) of x data ### `mapply()` * multivariate apply, applies a function in parallel over a set of arguments * `mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)` * `FUN` = function * `…` = arguments to apply over * `MoreArgs` = list of other arguments to FUN * `SIMPLIFY` = whether the result should be simplified * ***example*** ```{r} mapply(rep, 1:4, 4:1) ``` ### `aggregate()` * aggregate computes summary statistics of data subsets (similar to multiple `tapply` at the same time) * `aggregate(list(name = dataToCompute), list(name = factorVar1,name = factorVar2), function, na.rm = TRUE)` * `dataToCompute` = this is what the function will be applied on * `factorVar1, factorVar1` = factor variables to split the data by * ***Note**: order matters here in terms of how to break down the data * * `function` = what is applied to the subsets of data, can be sum/mean/median/etc * `na.rm = TRUE` $\rightarrow$ removes NA values $\pagebreak$ ## Simulation * `sample(values, n, replace = FALSE)` = generate random samples * `values` = values to sample from * `n` = number of values generated * `replace` = with or without replacement * `sample(1:6, 4, replace = TRUE, prob=c(.2, .2…))` = choose four values from the range specified with replacing (same numbers can show up twice), with probabilities specified * `sample(vector)` = can be used to permute/rearrange elements of a vector * `sample(c(y, z), 100)` = select 100 random elements from combination of values y and z * `sample(10)` = select positive integer sample of size 10 without repeat * Each probability distribution functions usually have 4 functions associated with them: * `r***` function (for "random") $\rightarrow$ random number generation (ex. `rnorm`) * `d***` function (for "density") $\rightarrow$ calculate density (ex. `dunif`) * `p***` function (for "probability") $\rightarrow$ cumulative distribution (ex. `ppois`) * `q***` function (for "quantile") $\rightarrow$ quantile function (ex. `qbinom`) * If $\Phi$ is the cumulative distribution function for a standard Normal distribution, then `pnorm(q)` = $\Phi(q)$ and qnorm(p) = $\Phi^{-1}(q)$. * `set.seed()` = sets seed for randon number generator to ensure that the same data/analysis can be reproduced ### Simulation Examples * `rbinom(1, size = 100, prob = 0.7)` = returns a binomial random variable that represents the number of successes in a give number of independent trials * `1` = corresponds number of observations * `size = 100` = corresponds with the number of independent trials that culminate to each resultant observation * `prob = 0.7` = probability of success * `rnorm(n, mean = m, sd = s)` = generate n random samples from the standard normal distribution (mean = 0, std deviation = 1 by default) * `rnorm(1000)` = 1000 draws from the standard normal distribution * `n` = number of observation generated * `mean = m` = specified mean of distribution * `sd = s` = specified standard deviation of distribution * `dnorm(x, mean = 0, sd = 1, log = FALSE)` * `log` = evaluate on log scale * `pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)` * `lower.tail` = left side, `FALSE` = right * `qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)` * `lower.tail` = left side, `FALSE` = right * `rpois(n, lambda)` = generate random samples from the poisson distrbution * `n` = number of observations generated * `lambda` = $\lambda$ parameter for the poisson distribution or rate * `rpois(n, r)` = generating Poisson Data * `n` = number of values * `r` = rate * `ppois(n, r)` = cumulative distribution * `ppois(2, 2)` = $Pr(x<=2)$ * `replicate(n, rpois())` = repeat operation n times ### Generate Numbers for a Linear Model * Linear model $$ y = \beta_0 + \beta_1 x + \epsilon \\ ~\mbox{where}~ \epsilon \sim N(0, 2^2), x \sim N(0, 1^2), \beta_0 = 0.5, \beta_1 = 2 $$ ``` {r} set.seed(20) x <- rnorm(100) # normal x <- rbinom(100, 1, 0.5) # binomial e <- rnorm(100, 0, 2) y <- 0.5 + 2* x + e ``` * Poisson model $$ Y ~ Poisson(\mu) \\ log(\mu) = \beta_0 + \beta_1 x \\ ~\mbox{where}~ \beta_0 = 0.5, \beta_1 = 2 $$ ``` {r} x <- rnorm(100) log.mu <- 0.5 + 0.3* x y <- rpois(100, exp(log.mu)) ``` $\pagebreak$ ## Dates and Times * `Date` = date class, stored as number of days since 1970-01-01 * `POSIXct` = time class, stored as number of seconds since 1970-01-01 * `POSIXlt` = time class, stored as list of sec min hours * `Sys.Date()` = today's date * `unclass(obj)` = returns what obj looks like internally * `Sys.time()` = current time in POSIXct class * `t2 <- as.POSIXlt(Sys.time())` = time in POSIXlt class * `t2$min` = return min of time (only works for POSIXlt class) * `weekdays(date)`, `months(date)`, `quarters(date)` = returns weekdays, months, and quarters of time/date inputed * `strptime(string, "%B %d, %Y %H:%M")` = convert string into time format using the format specified * `difftime(time1, time2, units = 'days')` = difference in times by the specified unit ## Base Graphics * `data(set)` = load data * `plot(data)` = R plots the data as best as it can * `x` = variable, x axis * `y` = variable * `xlab, ylab` = corresponding labels * `main, sub` = title, subtitle * `col = 2` or `col = "red"` = color * `pch = 2` = different symbols for points * `xlim,ylim(v1, v2)` = restrict range of plot * `boxplot(x ~ y, data = d)` = creates boxplot for x vs y variables using the data.frame provided * `hist(x, breaks)` = plots histogram of the data - `break = 100` = split data into 100 bins $\pagebreak$ ## Reading Tabular Data * `read.table(), read.csv()` = most common, read text files (rows, col) return data frame * `readLines()` = read lines of text, returns character vector * `source(file)` = read R code * `dget()` = read R code files (R objects that have been reparsed) * `load()`, `unserialize()` = read binary objects * writing data * `write.table()`, `writeLines()`, `dump()`, `put()`, `save()`, `serialize()` * `read.table()` arguments: * `file` = name of file/connection * `header` = indicator if file contains header * `sep` = string indicating how columns are separated * `colClasses` = character vector indicating what each column is in terms of class * `nrows` = number of rows in dataset * `comment.char` = char indicating beginning of comment * `skip` = number of lines to skip in the beginning * `stringsAsFactors` = defaults to TRUE, should characters be coded as Factor * `read.table` can be used without any other argument to create data.frame * telling R what type of variables are in each column is helpful for larger datasets (efficiency) * `read.csv()` = `read.table` except default sep is comma (`read.table` default is `sep = " "` and `header = TRUE`) ### Larger Tables * ***Note**: help page for read.table important* * need to know how much RAM is required $\rightarrow$ calculating memory requirements * `numRow` x `numCol` x 8 bytes/numeric value = size required in bites * double the above results and convert into GB = amount of memory recommended * set `comment.char = ""` to save time if there are no comments in the file * specifying `colClasses` can make reading data much faster * `nrow = n` = number of rows to read in (can help with memory usage) * `initial <- read.table("file", rows = 100)` = read first 100 lines * `classes <- sapply(initial, class)` = determine what classes the columns are * `tabAll <- read.table("file", colClasses = classes)` = load in the entire file with determined classes ### Textual Data Formats * `dump` and `dput` preserve metadata * text formats are editable, not space efficient, and work better with version control system (they can only track changes in text files) * `dput(obj, file = "file.R")` = creates R code to store all data and meta data in "file.R" (ex. data, class, names, row.names) * `dget("file.R")` = loads the file/R code and reconstructs the R object * `dput` can only be used on one object, where as `dump` can be used on multiple objects * `dump(c("obj1", "obj2"), file= "file2.R")` = stores two objects * `source("file2.R")` = loads the objects ### Interfaces to the Outside World * `url()` = function can read from webpages * `file()` = read uncompressed files * `gzfile(), bzfile()` = read compressed files (gzip, bzip2) * `file(description = "", open = "")` = file syntax, creates connection * `description` = description of file * `open` = r -readonly, w - writing, a - appending, rb/wb/ab - reading/writing/appending binary * `close()` = closes connection * `readLines()` = can be used to read lines after connection has been established * `download.file(fileURL, destfile = "fileName", method = "curl")` - `fileURL` = url of the file that needs to be downloaded - `destfile = "fileName"` = specifies where the file is to be saved + `dir/fileName` = directories can be referenced here - `method = "curl"` = necessary for downloading files from "https://" links on Macs + `method = "auto"` = should work on all other machines $\pagebreak$ ## Control Structures * Common structures are * `if`, `else` = testing a condition * `for` = execute a loop a fixed number of times * `while` = execute a loop while a condition is true * `repeat` = execute an infinite loop * `break` = break the execution of a loop * `next` = skip an interation of a loop * `return` = exit a function * ***Note**: Control structures are primarily useful for writing programs; for command-line interactive work, the `apply` functions are more useful * ### `if - else` ```r # basic structure if() { ## do something } else { ## do something else } # if tree if() { ## do something } else if() { ## do something different } else { ## do something different } ``` * `y <- if(x>3){10} else {0}` = slightly different implementation than normal, focus on assigning value ### `for` ```{r} # basic structure for(i in 1:10) { # print(i) } # nested for loops x <- matrix(1:6, 2, 3) for(i in seq_len(nrow(x))) { for(j in seq_len(ncol(x))) { # print(x[i, j]) } } ``` * `for(letter in x)` = loop through letter in character vector * `seq_along(vector)` = create a number sequence from 1 to length of the vector * `seq_len(length)` = create a number sequence that starts at 1 and ends at length specified ### `while` ```{r} count <- 0 while(count < 10) { # print(count) count <- count + 1 } ``` * conditions can be combined with logical operators ### `repeat` and `break` * `repeat` initiates an infinite loop * not commonly used in statistical applications but they do have their uses * The only way to exit a `repeat` loop is to call `break` ```r x0 <- 1 tol <- 1e-8 repeat { x1 <- computeEstimate() if(abs(x1 - x0) < tol) { break } else { x0 <- x1 # requires algorithm to converge } } ``` * ***Note**: The above loop is a bit dangerous because there’s no guarantee it will stop * * Better to set a hard limit on the number of iterations (e.g. using a `for` loop) and then report whether convergence was achieved or not. ### `next` and `return` * `next` = (no parentheses) skips an element, to continue to the next iteration * `return` = signals that a function should exit and return a given value ```{r} for(i in 1:100) { if(i <= 20) { ## Skip the first 20 iterations next } ## Do something here } ``` $\pagebreak$ ## Functions * `name <- function(arg1, arg2, …){ }` * inputs can be specified with default values by `arg1 = 10` * it is possible to define an argument to `NULL` * returns **last expression** of function * many functions have `na.rm`, can be set to `TRUE` to remove `NA` values from calculation * structure ```r f <- function() { ## Do something interesting } ``` * function are first class object and can be **treated like other objects** (pass into other functions) * functions can be nested, so that you can define a function inside of another function * function have named arguments (i.e. `x = mydata`) which can be used to specifiy **default values** * `sd(x = mydata)` (matching by name) * formal arguments = arguments included in the functional definition * `formals()` = returns all formal arguments * not all functional call specifies all arguments, some can be missing and may have default values * `args()` = return all arguments you can specify * multiple arguments inputted in random orders (R performs positional matching) $\rightarrow$ not recommended * argument matching order: exact $\rightarrow$ partial $\rightarrow$ positional - *partial* = instead of typing `data = x`, use `d = x` * **Lazy Evaluation** * R will evaluate as needed, so everything executes until an error occurs * `f <- function (a, b) {a^2}` * if b is not used in the function, calling `f(5)` will not produce an error * `...` argument * used to extend other functions by representing the rest of the arguments * generic functions use `...` to pass extra arguments (i.e. mean = 1, sd = 2) * necessary when the number of arguments passed can not be known in advance * functions that use `...` = `paste()`, `cat()` * ***Note**: arguments coming after `...` must be explicitly matched and cannot be partially matched * $\pagebreak$ ## Scoping * scoping rules determine how a value is associated with a free variable in a function * **free variables** = variables not explicitly defined in the function (not arguments, or local variables - variable defined in the function) * R uses **lexical/static scoping** * common alternative = **dynamic scoping** * **lexical scoping** = values of free vars are searched in the environment in which the function is defined * environment = collection of symbol/value pairs (x = 3.14) * each package has its own environment * only environment **without** parent environment is the *empty environment* * **closure/function closure** = function + associated environment * search order for free variable 1. environment where the function is defined 2. parent environment 3. ... (repeat if multiple parent environments) 4. top level environment: global environment (worspace) or namespace package 5. empty environment $\rightarrow$ produce error * when a function/variable is called, R searches through the following list to match the first result 1. `.GlobalEnv` 2. `package:stats ` 3. `package:graphics` 4. `package:grDeviced` 5. `package:utils` 6. `package:datasets` 7. `package:methods` 8. `Autoloads` 9. `package:base` * **order matters** * `.GlobalEnv` = everything defined in the current workspace * any package that gets loaded with `library()` gets put in position 2 of the above search list * namespaces are separate for functions and non-functions * possible for object c and function c to coexist ### Scoping Example ```{r} make.power <- function(n) { pow <- function(x) { x^n } pow } cube <- make.power(3) # defines a function with only n defined (x^3) square <- make.power(2) # defines a function with only n defined (x^2) cube(3) # defines x = 3 square(3) # defines x = 3 # returns the free variables in the function ls(environment(cube)) # retrieves the value of n in the cube function get("n", environment(cube)) ``` ### Lexical vs Dynamic Scoping Example ```{r} y <- 10 f <- function(x) { y <- 2 y^2 + g(x) } g <- function(x) { x*y } ``` * **Lexical Scoping** 1. `f(3)` $\rightarrow$ calls `g(x)` 2. `y` isn’t defined locally in `g(x)` $\rightarrow$ searches in parent environment (working environment/global workspace) 3. finds `y` $\rightarrow$ `y = 10` * **Dynamic Scoping** 1. `f(3)` $\rightarrow$ calls `g(x)` 2. `y` isn’t defined locally in `g(x)` $\rightarrow$ searches in calling environment (f function) 3. find `y` $\rightarrow$ `y <- 2` * **parent frame** = refers to calling environment in R, environment from which the function was called * ***Note**: when the defining environment and calling environment is the same, lexical and dynamic scoping produces the same result * * **Consequences of Lexical Scoping** * all objects must be carried in memory * all functions carry pointer to their defining environment (memory address) ### Optimization * optimization routines in R (`optim`, `nlm`, `optimize`) require you to pass a function whose argument is a vector of parameters * ***Note**: these functions **minimize**, so use the negative constructs to maximize a normal likelihood * * **constructor functions** = functions to be fed into the optimization routines * ***example*** ```{r} # write constructor function make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) { params <- fixed function(p) { params[!fixed] <- p mu <- params[1] sigma <- params[2] a <- -0.5*length(data)*log(2*pi*sigma^2) b <- -0.5*sum((data-mu)^2) / (sigma^2) -(a + b) } } # initialize seed and print function set.seed(1); normals <- rnorm(100, 1, 2) nLL <- make.NegLogLik(normals); nLL # Estimating Prameters optim(c(mu = 0, sigma = 1), nLL)$par # Fixing sigma = 2 nLL <- make.NegLogLik(normals, c(FALSE, 2)) optimize(nLL, c(-1, 3))$minimum # Fixing mu = 1 nLL <- make.NegLogLik(normals, c(1, FALSE)) optimize(nLL, c(1e-6, 10))$minimum ``` $\pagebreak$ ## Debugging * `message`: generic notification/diagnostic message, execution continues * `message()` = generate message * `warning`: something’s wrong but not fatal, execution continues * `warning()` = generate warning * `error`: fatal problem occurred, execution stops * `stop()` = generate error * `condition`: generic concept for indicating something unexpected can occur * `invisible()` = suppresses auto printing * ***Note**: random number generator must be controlled to reproduce problems (`set.seed` to pinpoint problem) * * `traceback`: prints out function call stack after error occurs * must be called right after error * `debug`: flags function for debug mode, allows to step through function one line at a time - `debug(function)` = enter debug mode * `browser`: suspends the execution of function wherever its placed - embedded in code and when the code is run, the browser comes up * `trace`: allows inserting debugging code into a function at specific places * `recover`: error handler, freezes at point of error * `options(error = recover)` = instead of console, brings up menu (similar to `browser`) ## R Profiler * optimizing code cannot be done without performance analysis and profiling ```{r} # system.time example system.time({ n <- 1000 r <- numeric(n) for (i in 1:n) { x <- rnorm(n) r[i] <- mean(x) } }) ``` * `system.time(expression)` * takes R expression, returns amount of time needed to execute (assuming you know where) * computes time (in sec) $\rightarrow$ gives time until error if error occurs * can wrap multiple lines of code with `{}` * returns object of class `proc_time` * **user time** = time computer experience * **elapsed time** = time user experience * usually close for standard computation * ***elapse > user*** = CPU wait around other processes in the background (read webpage) * ***elapsed < user*** = multiple processor/core (use multi-threaded libraries) * ***Note**: R doesn’t multi-thread (performing multiple calculations at the same time) with basic package * * Basic Linear Algebra Standard [BLAS] libraries do, prediction, regression routines, matrix * i.e. `vecLib`/Accelerate, `ATLAS`, `ACML`, `MKL` * `Rprof()` = useful for complex code only * keeps track of functional call stack at regular intervals and tabulates how much time is spent in each function * default sampling interval = 0.02 second * calling `Rprof()` generates `Rprof.out` file by default * `Rprof("output.out")` = specify the output file * ***Note**: should NOT be used with `system.time()` * * `summaryRprof()` = summarizes `Rprof()` output, 2 methods for normalizing data - loads the `Rprof.out` file by default, can specify output file `summaryRprof("output.out")` * `by.total` = divide time spent in each function by total run time * `by.self` = first subtracts out time spent in functions above in call stack, and calculates ratio to total * `$sample.interval` = 0.02 $\rightarrow$ interval * `$sampling.time` = 7.41 $\rightarrow$ seconds, elapsed time * ***Note**: generally user spends all time at top level function (i.e. `lm()`), but the function simply calls helper functions to do work so it is not useful to know about the top level function times * * ***Note**: `by.self` = more useful as it focuses on each individual call/function * * ***Note**: R must be compiled with profiles support (generally the case) * * good to break code into functions so profilers can give useful information about where time is spent * C/FORTRAN code is not profiled ### Miscellaneous * `unlist(rss)` = converts a list object into data frame/vector * `ls("package:elasticnet")` = list methods in package
X Tutup