woneuy01 / Data_Wrangling-in-R

0 stars 0 forks source link

Tidy Data #2

Open woneuy01 opened 4 years ago

woneuy01 commented 4 years ago

What does tidy data mean?

Tidy data is an alternative name for the common statistical form called a model matrix or data matrix. ... Hadley Wickham later defined "Tidy Data" as data sets that are arranged such that each variable is a column and each observation (or case) is a row.

ey points In tidy data, each row represents an observation and each column represents a different variable. In wide data, each row includes several observations and one of the variables is stored in the header. Code library(tidyverse) library(dslabs) data(gapminder)

create and inspect a tidy data frame

tidy_data <- gapminder %>% filter(country %in% c("South Korea", "Germany")) %>% select(country, year, fertility) head(tidy_data)

plotting tidy data is simple

tidy_data %>% ggplot(aes(year, fertility, color = country)) + geom_point()

import and inspect example of original Gapminder data in wide format

path <- system.file("extdata", package="dslabs") filename <- file.path(path, "fertility-two-countries-example.csv") wide_data <- read_csv(filename) select(wide_data, country, 1960:1967)

woneuy01 commented 4 years ago

Key points The tidyr package includes several functions that are useful for tidying data. The gather() function converts wide data into tidy data. The spread() function converts tidy data to wide data.

Code

original wide data

library(tidyverse) path <- system.file("extdata", package="dslabs") filename <- file.path(path, "fertility-two-countries-example.csv") wide_data <- read_csv(filename)

tidy data from dslabs

library(dslabs) data("gapminder") tidy_data <- gapminder %>% filter(country %in% c("South Korea", "Germany")) %>% select(country, year, fertility)

gather wide data --> new tidy data

new_tidy_data <- wide_data %>% gather(year, fertility, 1960:2015) head(new_tidy_data) 1

gather all columns except country

new_tidy_data <- wide_data %>% gather(year, fertility, -country)

gather treats column names as characters by default

class(tidy_data$year) class(new_tidy_data$year)

convert gathered column names to numeric

new_tidy_data <- wide_data %>% gather(year, fertility, -country, convert = TRUE) #gather makes columns as string class(new_tidy_data$year)

ggplot works on new tidy data

new_tidy_data %>% ggplot(aes(year, fertility, color = country)) + geom_point() Rplot1

spread tidy data --> wide data

new_wide_data <- new_tidy_data %>% spread(year, fertility) select(new_wide_data, country, 1960:1967) 2

woneuy01 commented 4 years ago

Key points The separate() function splits one column into two or more columns at a specified character that separates the variables. When there is an extra separation in some of the entries, use fill="right" to pad missing values with NAs, or use extra="merge" to keep extra elements together. The unite() function combines two columns and adds a separating character.

import data

path <- system.file("extdata", package = "dslabs") filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv") raw_dat <- read_csv(filename) select(raw_dat, 1:5)

gather all columns except country

dat <- raw_dat %>% gather(key, value, -country) head(dat) dat$key[1:5] 1

separate on underscores

dat %>% separate(key, c("year", "variablename"), "") dat %>% separate(key, c("year", "variablename")) #default is "" so this code is the same in the above 2

split on all underscores, pad empty cells with NA

dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") 3

split on first underscore but keep life_expectancy merged

dat %>% separate(key, c("year", "variablename"), sep = "", extra = "merge") 4

separate then spread

dat %>% separate(key, c("year", "variablename"), sep = "", extra = "merge") %>% spread(variable_name, value) 5

separate then unite

dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>% unite(variable_name, first_variable_name, second_variablename, sep="") 6

full code for tidying data complicated way

dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>% unite(variable_name, first_variable_name, second_variablename, sep="") %>% spread(variable_name, value) %>% rename(fertility = fertility_NA) 7

woneuy01 commented 4 years ago

A collaborator sends you a file containing data for three years of average race finish times.

age_group,2015,2016,2017 20,3:46,3:22,3:50 30,3:50,3:43,4:43 40,4:39,3:49,4:51 50,4:48,4:59,5:01

No. These data are not considered “tidy” because the variable “year” is stored in the header.

woneuy01 commented 4 years ago

state abb region population total Alabama AL South 4779736 135 Alaska AK West 710231 19 Arizona AZ West 6392017 232 Arkansas AR South 2915918 93 California CA West 37253956 1257 Colorado CO West 5029196 65

Explanation

In tidy format, each observation has its own row, and each variable has its own column.

woneuy01 commented 4 years ago

tidy_data <- d %>% gather(year, time, 2015:2017)

This code will gather the years from 2015 to 2017 into a single column and create a single column called “time” that contains the time for each age group and each year.

woneuy01 commented 4 years ago

dat_tidy <- dat_wide %>% gather(key = disease, value = count, HepatitisA:Rubella)

1

woneuy01 commented 4 years ago

You have successfully formatted marathon finish times into a tidy object called tidy_data. The first few lines are shown below.

age_group year time 20 2015 03:46 30 2015 03:50 40 2015 04:39 50 2015 04:48 20 2016 03:22

Select the code that converts these data back to the wide format, where each year has a separate column.

tidy_data %>% spread(year, time)

woneuy01 commented 4 years ago

1

woneuy01 commented 4 years ago

2

woneuy01 commented 4 years ago

3

woneuy01 commented 4 years ago

Key points The join functions in the dplyr package combine two tables such that matching rows are together.

left_join() only keeps rows that have information in the first table.

right_join() only keeps rows that have information in the second table.

inner_join() only keeps rows that have information in both tables.

full_join() keeps all rows from both tables.

semi_join() keeps the part of first table for which we have information in the second.

anti_join() keeps the elements of the first table for which there is no information in the second.

Code

import US murders data

library(tidyverse) library(ggrepel) library(dslabs) ds_theme_set() data(murders) head(murders)

import US election results data

data(polls_us_election_2016) head(results_us_election_2016) identical(results_us_election_2016$state, murders$state)

join the murders table and US election results table

tab <- left_join(murders, results_us_election_2016, by = "state") head(tab)

plot electoral votes versus population

tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) + geom_point() + geom_text_repel() + scale_x_continuous(trans = "log2") + scale_y_continuous(trans = "log2") + geom_smooth(method = "lm", se = FALSE)

make two smaller tables to demonstrate joins

tab1 <- slice(murders, 1:6) %>% select(state, population) tab1 tab2 <- slice(results_us_election_2016, c(1:3, 5, 7:8)) %>% select(state, electoral_votes) tab2

experiment with different joins

left_join(tab1, tab2) tab1 %>% left_join(tab2) tab1 %>% right_join(tab2) inner_join(tab1, tab2) semi_join(tab1, tab2) anti_join(tab1, tab2)

woneuy01 commented 4 years ago

Key points Unlike the join functions, the binding functions do not try to match by a variable, but rather just combine datasets. bind_cols() binds two objects by making them columns in a tibble. The R-base function cbind() binds columns but makes a data frame or matrix instead. The bind_rows() function is similar but binds rows instead of columns. The R-base function rbind() binds rows but makes a data frame or matrix instead.

bind_cols(a = 1:3, b = 4:6)

tab1 <- tab[, 1:3] tab2 <- tab[, 4:6] tab3 <- tab[, 7:9] new_tab <- bind_cols(tab1, tab2, tab3) head(new_tab)

tab1 <- tab[1:2,] tab2 <- tab[3:4,] bind_rows(tab1, tab2)

woneuy01 commented 4 years ago

Key points By default, the set operators in R-base work on vectors. If tidyverse/dplyr are loaded, they also work on data frames. You can take intersections of vectors using intersect(). This returns the elements common to both sets. You can take the union of vectors using union(). This returns the elements that are in either set. The set difference between a first and second argument can be obtained with setdiff(). Note that this function is not symmetric. The function set_equal() tells us if two sets are the same, regardless of the order of elements.

intersect vectors or data frames

intersect(1:10, 6:15) intersect(c("a","b","c"), c("b","c","d")) tab1 <- tab[1:5,] tab2 <- tab[3:7,] intersect(tab1, tab2)

perform a union of vectors or data frames

union(1:10, 6:15) union(c("a","b","c"), c("b","c","d")) tab1 <- tab[1:5,] tab2 <- tab[3:7,] union(tab1, tab2)

set difference of vectors or data frames

setdiff(1:10, 6:15) setdiff(6:15, 1:10) tab1 <- tab[1:5,] tab2 <- tab[3:7,] setdiff(tab1, tab2)

setequal determines whether sets have the same elements, regardless of order

setequal(1:5, 1:6) setequal(1:5, 5:1) setequal(tab1, tab2)

woneuy01 commented 4 years ago

Key points Web scraping is extracting data from a website. The rvest web harvesting package includes functions to extract nodes of an HTML document: html_nodes() extracts all nodes of different types, and html_node() extracts the first node. html_table() converts an HTML table to a data frame.

Code

import a webpage into R

library(rvest) url <- "https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state" h <- read_html(url) class(h) h

tab <- h %>% html_nodes("table") # select all tables tab <- tab[[2]] # select only second

tab <- tab %>% html_table class(tab)

tab <- tab %>% setNames(c("state", "population", "total", "murders", "gun_murders", "gun_ownership", "total_rate", "murder_rate", "gun_murder_rate")) # rename head(tab)

woneuy01 commented 4 years ago

SelectorGadget: point and click CSS selectors https://selectorgadget.com/

For the guacamole recipe page, we already have done this and determined that we need the following selectors:

h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609") recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text() prep_time <- h %>% html_node(".m-RecipeInfoa-Description--Total") %>% html_text() ingredients <- h %>% html_nodes(".o-Ingredientsa-Ingredient") %>% html_text() You can see how complex the selectors are. In any case we are now ready to extract what we want and create a list:

guacamole <- list(recipe, prep_time, ingredients) guacamole

woneuy01 commented 4 years ago

Key points The most common tasks in string processing include: extracting numbers from strings removing unwanted characters from text finding and replacing characters extracting specific parts of strings converting free form text to more uniform formats splitting strings into multiple values The stringr package in the tidyverse contains string processing functions that follow a similar naming format (str_functionname) and are compatible with the pipe. Code

read in raw murders data from Wikipedia

url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167" murders_raw <- read_html(url) %>% html_nodes("table") %>% html_table() %>% .[[1]] %>% setNames(c("state", "population", "total", "murder_rate"))

inspect data and column classes

head(murders_raw) class(murders_raw$population) class(murders_raw$total)

woneuy01 commented 4 years ago

Key points Define a string by surrounding text with either single quotes or double quotes. To include a single quote inside a string, use double quotes on the outside. To include a double quote inside a string, use single quotes on the outside. The cat() function displays a string as it is represented inside R. To include a double quote inside of a string surrounded by double quotes, use the backslash () to escape the double quote. Escape a single quote to include it inside of a string defined by single quotes. We will see additional uses of the escape later. Code s <- "Hello!" ###double quotes define a string s <- 'Hello!' ### single quotes define a string s <- Hello ### backquotes do not

s <- "10"" ### error - unclosed quotes s <- '10"' ### correct

cat shows what the string actually looks like inside R

cat(s)

s <- "5'" cat(s)

to include both single and double quotes in string, escape with \

s <- '5'10"' ### error s <- "5'10"" ### error s <- '5\'10"' ### correct cat(s) s <- "5'10\"" ### correct cat(s)