Open woneuy01 opened 4 years ago
Key points The tidyr package includes several functions that are useful for tidying data. The gather() function converts wide data into tidy data. The spread() function converts tidy data to wide data.
Code
library(tidyverse) path <- system.file("extdata", package="dslabs") filename <- file.path(path, "fertility-two-countries-example.csv") wide_data <- read_csv(filename)
library(dslabs) data("gapminder") tidy_data <- gapminder %>% filter(country %in% c("South Korea", "Germany")) %>% select(country, year, fertility)
new_tidy_data <- wide_data %>%
gather(year, fertility, 1960
:2015
)
head(new_tidy_data)
new_tidy_data <- wide_data %>% gather(year, fertility, -country)
class(tidy_data$year) class(new_tidy_data$year)
new_tidy_data <- wide_data %>% gather(year, fertility, -country, convert = TRUE) #gather makes columns as string class(new_tidy_data$year)
new_tidy_data %>% ggplot(aes(year, fertility, color = country)) + geom_point()
new_wide_data <- new_tidy_data %>% spread(year, fertility)
select(new_wide_data, country, 1960
:1967
)
Key points The separate() function splits one column into two or more columns at a specified character that separates the variables. When there is an extra separation in some of the entries, use fill="right" to pad missing values with NAs, or use extra="merge" to keep extra elements together. The unite() function combines two columns and adds a separating character.
path <- system.file("extdata", package = "dslabs") filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv") raw_dat <- read_csv(filename) select(raw_dat, 1:5)
dat <- raw_dat %>% gather(key, value, -country) head(dat) dat$key[1:5]
dat %>% separate(key, c("year", "variablename"), "") dat %>% separate(key, c("year", "variablename")) #default is "" so this code is the same in the above
dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right")
dat %>% separate(key, c("year", "variablename"), sep = "", extra = "merge")
dat %>% separate(key, c("year", "variablename"), sep = "", extra = "merge") %>% spread(variable_name, value)
dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>% unite(variable_name, first_variable_name, second_variablename, sep="")
dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>% unite(variable_name, first_variable_name, second_variablename, sep="") %>% spread(variable_name, value) %>% rename(fertility = fertility_NA)
A collaborator sends you a file containing data for three years of average race finish times.
age_group,2015,2016,2017 20,3:46,3:22,3:50 30,3:50,3:43,4:43 40,4:39,3:49,4:51 50,4:48,4:59,5:01
No. These data are not considered “tidy” because the variable “year” is stored in the header.
state abb region population total Alabama AL South 4779736 135 Alaska AK West 710231 19 Arizona AZ West 6392017 232 Arkansas AR South 2915918 93 California CA West 37253956 1257 Colorado CO West 5029196 65
Explanation
In tidy format, each observation has its own row, and each variable has its own column.
tidy_data <- d %>%
gather(year, time, 2015
:2017
)
This code will gather the years from 2015 to 2017 into a single column and create a single column called “time” that contains the time for each age group and each year.
dat_tidy <- dat_wide %>% gather(key = disease, value = count, HepatitisA:Rubella)
You have successfully formatted marathon finish times into a tidy object called tidy_data. The first few lines are shown below.
age_group year time 20 2015 03:46 30 2015 03:50 40 2015 04:39 50 2015 04:48 20 2016 03:22
Select the code that converts these data back to the wide format, where each year has a separate column.
tidy_data %>% spread(year, time)
Key points The join functions in the dplyr package combine two tables such that matching rows are together.
Code
library(tidyverse) library(ggrepel) library(dslabs) ds_theme_set() data(murders) head(murders)
data(polls_us_election_2016) head(results_us_election_2016) identical(results_us_election_2016$state, murders$state)
tab <- left_join(murders, results_us_election_2016, by = "state") head(tab)
tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) + geom_point() + geom_text_repel() + scale_x_continuous(trans = "log2") + scale_y_continuous(trans = "log2") + geom_smooth(method = "lm", se = FALSE)
tab1 <- slice(murders, 1:6) %>% select(state, population) tab1 tab2 <- slice(results_us_election_2016, c(1:3, 5, 7:8)) %>% select(state, electoral_votes) tab2
left_join(tab1, tab2) tab1 %>% left_join(tab2) tab1 %>% right_join(tab2) inner_join(tab1, tab2) semi_join(tab1, tab2) anti_join(tab1, tab2)
Key points Unlike the join functions, the binding functions do not try to match by a variable, but rather just combine datasets. bind_cols() binds two objects by making them columns in a tibble. The R-base function cbind() binds columns but makes a data frame or matrix instead. The bind_rows() function is similar but binds rows instead of columns. The R-base function rbind() binds rows but makes a data frame or matrix instead.
bind_cols(a = 1:3, b = 4:6)
tab1 <- tab[, 1:3] tab2 <- tab[, 4:6] tab3 <- tab[, 7:9] new_tab <- bind_cols(tab1, tab2, tab3) head(new_tab)
tab1 <- tab[1:2,] tab2 <- tab[3:4,] bind_rows(tab1, tab2)
Key points By default, the set operators in R-base work on vectors. If tidyverse/dplyr are loaded, they also work on data frames. You can take intersections of vectors using intersect(). This returns the elements common to both sets. You can take the union of vectors using union(). This returns the elements that are in either set. The set difference between a first and second argument can be obtained with setdiff(). Note that this function is not symmetric. The function set_equal() tells us if two sets are the same, regardless of the order of elements.
intersect(1:10, 6:15) intersect(c("a","b","c"), c("b","c","d")) tab1 <- tab[1:5,] tab2 <- tab[3:7,] intersect(tab1, tab2)
union(1:10, 6:15) union(c("a","b","c"), c("b","c","d")) tab1 <- tab[1:5,] tab2 <- tab[3:7,] union(tab1, tab2)
setdiff(1:10, 6:15) setdiff(6:15, 1:10) tab1 <- tab[1:5,] tab2 <- tab[3:7,] setdiff(tab1, tab2)
setequal(1:5, 1:6) setequal(1:5, 5:1) setequal(tab1, tab2)
Key points Web scraping is extracting data from a website. The rvest web harvesting package includes functions to extract nodes of an HTML document: html_nodes() extracts all nodes of different types, and html_node() extracts the first node. html_table() converts an HTML table to a data frame.
Code
library(rvest) url <- "https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state" h <- read_html(url) class(h) h
tab <- h %>% html_nodes("table") # select all tables tab <- tab[[2]] # select only second
tab <- tab %>% html_table class(tab)
tab <- tab %>% setNames(c("state", "population", "total", "murders", "gun_murders", "gun_ownership", "total_rate", "murder_rate", "gun_murder_rate")) # rename head(tab)
SelectorGadget: point and click CSS selectors https://selectorgadget.com/
For the guacamole recipe page, we already have done this and determined that we need the following selectors:
h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609") recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text() prep_time <- h %>% html_node(".m-RecipeInfoa-Description--Total") %>% html_text() ingredients <- h %>% html_nodes(".o-Ingredientsa-Ingredient") %>% html_text() You can see how complex the selectors are. In any case we are now ready to extract what we want and create a list:
guacamole <- list(recipe, prep_time, ingredients) guacamole
Key points The most common tasks in string processing include: extracting numbers from strings removing unwanted characters from text finding and replacing characters extracting specific parts of strings converting free form text to more uniform formats splitting strings into multiple values The stringr package in the tidyverse contains string processing functions that follow a similar naming format (str_functionname) and are compatible with the pipe. Code
url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167" murders_raw <- read_html(url) %>% html_nodes("table") %>% html_table() %>% .[[1]] %>% setNames(c("state", "population", "total", "murder_rate"))
head(murders_raw) class(murders_raw$population) class(murders_raw$total)
Key points
Define a string by surrounding text with either single quotes or double quotes.
To include a single quote inside a string, use double quotes on the outside. To include a double quote inside a string, use single quotes on the outside.
The cat() function displays a string as it is represented inside R.
To include a double quote inside of a string surrounded by double quotes, use the backslash () to escape the double quote. Escape a single quote to include it inside of a string defined by single quotes.
We will see additional uses of the escape later.
Code
s <- "Hello!" ###double quotes define a string
s <- 'Hello!' ### single quotes define a string
s <- Hello
### backquotes do not
s <- "10"" ### error - unclosed quotes s <- '10"' ### correct
cat(s)
s <- "5'" cat(s)
s <- '5'10"' ### error s <- "5'10"" ### error s <- '5\'10"' ### correct cat(s) s <- "5'10\"" ### correct cat(s)
What does tidy data mean?
Tidy data is an alternative name for the common statistical form called a model matrix or data matrix. ... Hadley Wickham later defined "Tidy Data" as data sets that are arranged such that each variable is a column and each observation (or case) is a row.
ey points In tidy data, each row represents an observation and each column represents a different variable. In wide data, each row includes several observations and one of the variables is stored in the header. Code library(tidyverse) library(dslabs) data(gapminder)
create and inspect a tidy data frame
tidy_data <- gapminder %>% filter(country %in% c("South Korea", "Germany")) %>% select(country, year, fertility) head(tidy_data)
plotting tidy data is simple
tidy_data %>% ggplot(aes(year, fertility, color = country)) + geom_point()
import and inspect example of original Gapminder data in wide format
path <- system.file("extdata", package="dslabs") filename <- file.path(path, "fertility-two-countries-example.csv") wide_data <- read_csv(filename) select(wide_data, country,
1960
:1967
)