[ The R Script associated with this page is available here](`r output`). Download this file and open it (or copy-paste into a new script) with RStudio so you can follow along.
# RStudio Shortcuts
## Running code
* `ctrl-R` (or `command-R`) to run current line
* Highlight `code` in script and run `ctrl-R` (or `command-R`) to run selection
* Buttons:
## Switching windows
* `ctrl-1`: script window
* `ctrl-2`: console window
> Try to run today's script without using your mouse/trackpad
# Data wrangling
## Useful packages: [`tidyverse`](https://www.tidyverse.org/packages/)
[Cheat sheets on website](https://www.rstudio.com/resources/cheatsheets/) for [Data Wrangling](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
```{r,results='hide', message=FALSE, warning=F}
library(tidyverse)
```
Remember use `install.packages("tidyverse")` to install a new package.
### Example operations from [here](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html)
## New York City Flights
Data from [US Bureau of Transportation Statistics](http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0) (see `?nycflights13`)
```{r,results='hide', warning=F}
library(nycflights13)
```
Check out the `flights` object
```{r}
head(flights)
```
### Object _Structure_
Check out data _structure_ with `glimpse()`
```{r}
glimpse(flights)
```
# `dplyr` "verbs"
* `select()` and `rename()`: Extract existing variables
* `filter()` and `slice()`: Extract existing observations
* `arrange()`
* `distinct()`
* `mutate()` and `transmute()`: Derive new variables
* `summarise()`: Change the unit of analysis
* `sample_n()` and `sample_frac()`
## Useful select functions
* "`-`" Select everything but
* "`:`" Select range
* `contains()` Select columns whose name contains a character string
* `ends_with()` Select columns whose name ends with a string
* `everything()` Select every column
* `matches()` Select columns whose name matches a regular expression
* `num_range()` Select columns named x1, x2, x3, x4, x5
* `one_of()` Select columns whose names are in a group of names
* `starts_with()` Select columns whose name starts with a character string
### `select()` examples
Select only the `year`, `month`, and `day` columns:
```{r}
select(flights,year, month, day)
```
### `select()` examples
Select everything _except_ the `tailnum`:
```{r}
select(flights,-tailnum)
```
Select all columns containing the string `"time"`:
```{r}
select(flights,contains("time"))
```
You can also rename columns with `select()`
```{r}
select(flights,year,carrier,destination=dest)
```
## `filter()` observations
Filter all flights that departed on on January 1st:
```{r}
filter(flights, month == 1, day == 1)
```
## _Base_ R method
This is equivalent to the more verbose code in base R:
```{r}
flights[flights$month == 1 & flights$day == 1, ]
```
Compare with `dplyr` method:
```{r,eval=F}
filter(flights, month == 1, day == 1)
```
Filter the `flights` data set to keep only evening flights (`dep_time` after 1600) in June.
## Other _boolean_ expressions
`filter()` is similar to `subset()` except it handles any number of filtering conditions joined together with `&`.
You can also use other boolean operators, such as _OR_ ("|"):
```{r}
filter(flights, month == 1 | month == 2)
```
Filter the `flights` data set to keep only flights where the `distance` is greater than 1000 OR the `air_time` is more than 100
## Select rows with `slice()`:
```{r}
slice(flights, 1:10)
```
## `arrange()` rows
`arrange()` is similar to `filter()` except it reorders instead of filtering.
```{r}
arrange(flights, year, month, day)
```
_Base_ R method:
```{r,eval=F}
flights[order(flights$year, flights$month, flights$day), ]
```
## Descending order: `desc()`
```{r}
arrange(flights, desc(arr_delay))
```
_Base_ R method:
```{r,eval=F}
flights[order(desc(flights$arr_delay)), ]
```
## Distinct: Find distinct rows
```{r}
distinct(
select(flights,carrier)
)
```
## Mutate: Derive new variables
Adds columns with calculations based on other columns.
Average air speed (miles/hour):
```{r}
mutate(flights,ave_speed=distance/(air_time/60))%>%
select(distance, air_time,ave_speed)
```
## Chaining Operations
Learn to performing multiple operations sequentially with a _pipe_ character (`%>%`)
1. Group by a variable
2. Select some columns
3. Summarize observations
4. Filter by results
With temporary objects:
```{r}
a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
a3 <- summarise(a2,
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)
head(a4)
```
If you don’t want to save the intermediate results: wrap the function calls inside each other:
```{r}
filter(
summarise(
select(
group_by(flights, year, month, day),
arr_delay, dep_delay
),
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
),
arr > 30 | dep > 30
)
```
Arguments are distant from function -> difficult to read!
## Chaining Operations
`%>%` (from the dplyr package) allows you to _pipe_ together various commands.
`x %>% f(y)` turns into `f(x, y)`
So you can use it to rewrite multiple operations that you can read left-to-right, top-to-bottom:
```{r}
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
```
## Analyze by group with `group_by()`
Perform operations by _group_: mean departure delay by airport (`origin`)
```{r}
flights %>%
group_by(origin) %>%
summarise(meanDelay = mean(dep_delay,na.rm=T))
```
Perform operations by _group_: mean and sd departure delay by airline (`carrier`)
```{r}
flights %>%
group_by(carrier) %>%
summarise(meanDelay = mean(dep_delay,na.rm=T),
sdDelay = sd(dep_delay,na.rm=T))
```
Flights from which `origin` airport go the farthest (on average)? Hint: Group by airport (`origin`) then calculate the maximum flight distance (`distance`).
# Today's task
Now [complete the task here](CS_03.html) by yourself or in small groups.
## Colophon
This exercise based on code from [here](http://spatial.ly/2012/06/mapping-worlds-biggest-airlines/).