pipeline fails at analysis

diazrenata commented 5 years ago

I'm struggling to get the pipeline to run on my computer, using the most up-to-date version of master. (I'm working in rmd-debug-pipeline, but it's a copy of master at the moment). Here is the highest-level error:

> make(pipeline, cache = cache)
target analysis
fail analysis
Error: Target `analysis` failed. Call `diagnose(analysis)` for details. Error message:
  could not find function "fun"

and what the analysis part of the pipeline looks like:

> pipeline[13, ]
# A tibble: 1 x 3
  target   command  transform                                            
  <chr>    <chr>    <chr>                                                
1 analysis fun(data) "cross(fun = list(lda), data = list(portal_data, mai…

I've done some digging, but before going into an extended narrative of that, I think it would help a lot if somebody whose pipeline is working (@ha0ye?) could confirm that this is the correct type of content for analysis? Thanks!

ha0ye commented 5 years ago

Can you check your version of drake? I think version 6.2.1 might be required for the new implementation of mapping methods to datasets.

diazrenata commented 5 years ago

hm - that's the one I have?

ha0ye commented 5 years ago

Here's what I see for running the front half of pipeline.R... can you see what might be different?

library(MATSS)
#> Please look at our data formats by running `vignette("data-formats")`
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)

## Read in the maizuru community data from a csv file
get_maizuru_data <- function()
{
    data_path <- system.file("extdata", "Maizuru_dominant_sp.csv",
                             package = "MATSS", mustWork = TRUE)
    raw_data <- read.csv(data_path)

    list(abundance = dplyr::select(raw_data, -date_tag, -surf.t, -bot.t, -Y, -M, -D) %>%
             mutate_all(~round(. + 1e-10)),
         covariates = dplyr::select(raw_data, date_tag, surf.t, bot.t, Y, M, D))
}

## Get raw data
datasets_raw <- drake_plan(
    bbs_data_tables = rdataretriever::fetch("breed-bird-survey"),
    sdl_data_tables = rdataretriever::fetch("veg-plots-sdl"),
    mtquad_data_tables = rdataretriever::fetch("mapped-plant-quads-mt")
)

## Clean and transform the data into the appropriate format
datasets <- drake_plan(
    portal_data = get_portal_rodents(),
    maizuru_data = get_maizuru_data(),
    jornada_data = process_jornada_data(),
    sgs_data = process_sgs_data(),
    bbs_data = get_bbs_data(bbs_data_tables, region = 7),
    sdl_data = get_sdl_data(sdl_data_tables),
    mtquad_data = get_mtquad_data(mtquad_data_tables),
    bad_portal = portal_data[[1]]

)

## Analysis methods
methods <- drake_plan(
    lda = function(dataset) {run_LDA(dataset, max_topics = 6, nseeds = 20)}
)

## Define how results are collected
collect <- function(list_of_results, plan)
{
    names(list_of_results) <- all.vars(match.call()$list_of_results)
    list_of_results
}

## The combination of each method x dataset
analyses <- drake_plan(
    # expand out each `fun(data)`, where
    #   `fun` is each of the values in methods$target
    #   `data` is each of the values in datasets$target
    # note: tidyeval syntax is to get all the values from the previous plans,
    #       but keep them as unevaluated symbols, so that drake_plan handles
    #       them appropriately
    analysis = target(fun(data),
                      transform = cross(fun = !!rlang::syms(methods$target),
                                        data = !!rlang::syms(datasets$target))
    ),
    # create a list of the created `analysis` objects, grouping by the `fun`
    # that made them - this keeps the results from the different methods
    # separated, so that the reports/syntheses can handle the right outputs
    results = target(collect(analysis, ignore(analyses)),
                     transform = combine(analysis, .by = fun)),
    trace = TRUE
)

print(analyses)
#> # A tibble: 9 x 6
#>   target       command                   fun   data   analysis     results 
#>   <chr>        <chr>                     <chr> <chr>  <chr>        <chr>   
#> 1 analysis_ld… lda(portal_data)          lda   porta… analysis_ld… <NA>    
#> 2 analysis_ld… lda(maizuru_data)         lda   maizu… analysis_ld… <NA>    
#> 3 analysis_ld… lda(jornada_data)         lda   jorna… analysis_ld… <NA>    
#> 4 analysis_ld… lda(sgs_data)             lda   sgs_d… analysis_ld… <NA>    
#> 5 analysis_ld… lda(bbs_data)             lda   bbs_d… analysis_ld… <NA>    
#> 6 analysis_ld… lda(sdl_data)             lda   sdl_d… analysis_ld… <NA>    
#> 7 analysis_ld… lda(mtquad_data)          lda   mtqua… analysis_ld… <NA>    
#> 8 analysis_ld… lda(bad_portal)           lda   bad_p… analysis_ld… <NA>    
#> 9 results_lda  "collect(list(analysis_l… lda   <NA>   <NA>         results…

Created on 2019-02-12 by the reprex package (v0.2.0).
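
To make the expected expansion concrete, here is a minimal base-R sketch of what the `cross()` transform amounts to for this plan: every method name is paired with every dataset name, and each pair becomes one command. This is only an illustration, not drake's implementation; the helper name `expand_cross` is hypothetical, and only three of the dataset names are used to keep it short.

```r
# Hypothetical helper, for illustration only (not part of drake):
# pair every method with every dataset and build one call per pair.
expand_cross <- function(funs, datasets) {
    # `data` is listed first so that it varies fastest within each `fun`
    grid <- expand.grid(data = datasets, fun = funs,
                        stringsAsFactors = FALSE)
    # build unevaluated calls such as lda(portal_data)
    Map(function(f, d) as.call(list(as.name(f), as.name(d))),
        grid$fun, grid$data)
}

commands <- expand_cross(funs = c("lda"),
                         datasets = c("portal_data", "maizuru_data", "bad_portal"))

# Show the commands as text, one per method x dataset pair
cmd_text <- unname(vapply(commands, deparse, character(1)))
print(cmd_text)
```

When the plan is expanded properly, the `command` column should hold concrete calls like these rather than the literal `fun(data)`.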

diazrenata commented 5 years ago

Ooh, thanks!

So I see two things: 1) analyses is not correct in mine and 2) I get a message:

Warning message:
Converting double-quotes to single-quotes because the `strings_in_dots` argument is missing. Use the file_in(), file_out(), and knitr_in() functions to work with files in your commands. To remove this warning, either call `drake_plan()` with `strings_in_dots = "literals"` or use `pkgconfig::set_config("drake::strings_in_dots" = "literals")`.

I'm going to try the pkgconfig and see if that fixes it.

> library(MATSS)
Please look at our data formats by running `vignette("data-formats")`
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> library(drake)
> 
> ## Read in the maizuru community data from a csv file
> get_maizuru_data <- function()
+ {
+     data_path <- system.file("extdata", "Maizuru_dominant_sp.csv",
+                              package = "MATSS", mustWork = TRUE)
+     raw_data <- read.csv(data_path)
+     
+     list(abundance = dplyr::select(raw_data, -date_tag, -surf.t, -bot.t, -Y, -M, -D) %>%
+              mutate_all(~round(. + 1e-10)),
+          covariates = dplyr::select(raw_data, date_tag, surf.t, bot.t, Y, M, D))
+ }
> 
> ## Get raw data
> datasets_raw <- drake_plan(
+     bbs_data_tables = rdataretriever::fetch("breed-bird-survey"),
+     sdl_data_tables = rdataretriever::fetch("veg-plots-sdl"),
+     mtquad_data_tables = rdataretriever::fetch("mapped-plant-quads-mt")
+ )
Warning message:
Converting double-quotes to single-quotes because the `strings_in_dots` argument is missing. Use the file_in(), file_out(), and knitr_in() functions to work with files in your commands. To remove this warning, either call `drake_plan()` with `strings_in_dots = "literals"` or use `pkgconfig::set_config("drake::strings_in_dots" = "literals")`. 
> 
> ## Clean and transform the data into the appropriate format
> datasets <- drake_plan(
+     portal_data = get_portal_rodents(),
+     maizuru_data = get_maizuru_data(),
+     jornada_data = process_jornada_data(),
+     sgs_data = process_sgs_data(),
+     bbs_data = get_bbs_data(bbs_data_tables, region = 7),
+     sdl_data = get_sdl_data(sdl_data_tables),
+     mtquad_data = get_mtquad_data(mtquad_data_tables),
+     bad_portal = portal_data[[1]]
+ 
+ )
> 
> ## Analysis methods
> methods <- drake_plan(
+     lda = function(dataset) {run_LDA(dataset, max_topics = 6, nseeds = 20)}
+ )
> 
> ## Define how results are collected
> collect <- function(list_of_results, plan)
+ {
+     names(list_of_results) <- all.vars(match.call()$list_of_results)
+     list_of_results
+ }
> 
> ## The combination of each method x dataset
> analyses <- drake_plan(
+     # expand out each `fun(data)`, where
+     #   `fun` is each of the values in methods$target
+     #   `data` is each of the values in datasets$target
+     # note: tidyeval syntax is to get all the values from the previous plans,
+     #       but keep them as unevaluated symbols, so that drake_plan handles
+     #       them appropriately
+     analysis = target(fun(data),
+                       transform = cross(fun = !!rlang::syms(methods$target),
+                                         data = !!rlang::syms(datasets$target))
+     ),
+     # create a list of the created `analysis` objects, grouping by the `fun`
+     # that made them - this keeps the results from the different methods
+     # separated, so that the reports/syntheses can handle the right outputs
+     results = target(collect(analysis, ignore(analyses)),
+                      transform = combine(analysis, .by = fun)),
+     trace = TRUE
+ )
> print(analyses)
# A tibble: 3 x 3
  target   command               transform                               
  <chr>    <chr>                 <chr>                                   
1 analysis fun(data)             "cross(fun = list(lda), data = list(por…
2 results  collect(analysis, ig… combine(analysis, .by = fun)            
3 trace    TRUE                  NA                                      
>

diazrenata commented 5 years ago

Alas, no. It got rid of the warning message, but the rest of the output stayed the same.

Stepping through the code that builds the analysis section, I get a couple of kinds of error:

analysis = target(fun(data),
                  transform = cross(fun = !!rlang::syms(methods$target),
                                    data = !!rlang::syms(datasets$target))
)

1. `rlang`: For example:

> fun_test = !!rlang::syms(methods$target)
Error in !rlang::syms(methods$target) : invalid argument type
> data_test = !!rlang::syms(datasets$target)
Error in !rlang::syms(datasets$target) : invalid argument type

Googling led me to update everything (R, all tidyverse packages, rlang), which didn't help. If I understand correctly, this line is trying to create lists fun and data (or in this case, fun_test and data_test) of all the things in methods$target and datasets$target, but as symbols rather than quoted strings. I can accomplish this if I remove the !!:

> fun_test = rlang::syms(methods$target)
> data_test = rlang::syms(datasets$target)
> fun_test
[[1]]
lda

> data_test
[[1]]
portal_data

[[2]]
maizuru_data

(cut off for length)

I don't know if this is an OK solution or if the issue is specific to my setup? I'd explore further, but....

2. `cross`: This happens:

transform = cross(rlang::syms(methods$target),
                  rlang::syms(datasets$target))
Error in cross(rlang::syms(methods$target), rlang::syms(datasets$target)) : could not find function "cross"

or

transform = cross(fun_test,
                  data_test)
Error in cross(fun_test, data_test) : could not find function "cross"

Which package is `cross` coming from in this case?

I tried some detective work (see below) but am coming up confused. I think (rather than necessarily wading through what I've tried) it would help me if I could see what analysis, the transform pieces, and results look like when the pipeline is working properly? I.e. I'm not sure when the expansion is supposed to happen vs. passing around the command to do the expansion, if that makes sense...

cross efforts (these don't lead to any resolution, so probably a good time for tl;dr): Googling got me to purrr::cross, which doesn't work with this syntax:

> transform = purrr::cross(fun_test,
+                   data_test)
Error in .l[[j]][[index]] : object of type 'symbol' is not subsettable

but purrr::cross2 does:

> transform = purrr::cross2(fun_test, data_test)
> str(transform)
List of 8
 $ :List of 2
  ..$ : symbol lda
  ..$ : symbol portal_data
 $ :List of 2
  ..$ : symbol lda
  ..$ : symbol maizuru_data

(again cut off for length)

As I understand it, this still isn't quite what we want. We want transform to look like:

> transform_goal = list(list(fun = rlang::sym('lda'), data = rlang::sym('portal_data')))
> str(transform_goal)
List of 1
 $ :List of 2
  ..$ fun : symbol lda
  ..$ data: symbol portal_data

But even this doesn't work, because transform is getting quoted as a string instead of evaluated and expanded to populate analysis or analyses.

ha0ye commented 5 years ago

Ok, sorry, this might be my bad. I think you might need version 6.2.1.9002+ of drake, as that implements the new syntax for specifying complex plans. Can you reinstall drake from GitHub and try again?
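
A quick way to check this from the console is sketched below. It uses only base R version comparison; the install hint in the message assumes the development version lives in the `ropensci/drake` GitHub repository and that the `remotes` package is available.

```r
# Minimum development version mentioned above
required <- package_version("6.2.1.9002")

# A plain release number sorts *below* the development build,
# so having "6.2.1" installed is not enough.
package_version("6.2.1") < required   # TRUE

# Look up whatever is installed locally (NULL if drake is absent)
installed <- tryCatch(utils::packageVersion("drake"),
                      error = function(e) NULL)
needs_update <- is.null(installed) || installed < required

if (needs_update) {
    # Assumption: the development version is hosted at ropensci/drake
    message("drake is missing or older than ", as.character(required),
            "; consider reinstalling the development version, e.g.\n",
            "  remotes::install_github(\"ropensci/drake\")")
}
```

This would explain why having 6.2.1 installed still produced the unexpanded `fun(data)` command.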

diazrenata commented 5 years ago

That's it! Thanks!

ha0ye commented 5 years ago

Also, welcome to the land of rlang and NSE 🙀
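
For anyone landing here later, a small sketch of why `!!` errors when run on its own line but is fine inside a plan: outside a function that supports tidy evaluation, R parses `!!x` as two ordinary logical negations. The example uses `rlang::expr()` as a stand-in for such a function, and `method_names` is a hypothetical stand-in for `methods$target`.

```r
library(rlang)

method_names <- c("lda")          # stand-in for methods$target
method_syms  <- syms(method_names)

# At the console, `!!x` is parsed as !(!x): two logical negations.
# Negating a list of symbols is not defined, so this raises an error,
# which is captured here as a message string.
top_level <- tryCatch(!!method_syms,
                      error = function(e) conditionMessage(e))
print(top_level)

# Inside a quasiquoting function such as rlang::expr(), `!!` instead
# means "insert the value here", so the list of symbols is spliced
# into the unevaluated call.
quoted <- expr(cross(fun = !!method_syms))
print(quoted)
```

So stepping through the `transform = ...` argument line by line is expected to fail: `cross()` and `!!` only have meaning when a function that understands them processes the whole expression unevaluated.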

weecology / MATSS

pipeline fails at analysis #65