Closed ha0ye closed 6 years ago
One part of this is the point of the 'year' or 'season' function (#2, also discussed in #64). There should be one function that takes any data (rodent, weather, plant, ant) and summarizes it by the timescale specified (month, 2-season, 4-season, year). That should get the summarizing steps out of the actual data cleaning functions. I think @bleds22e is working on that?
Also, I had envisioned the abundance function actually being separate from the biomass/energy function e.g.
abundance()
biomass()
biomass(currency="energy")
That's because the data cleaning is significantly different. (I'm only just realizing, looking at the PR, that you guys have them together.) But if it is possible to clean up the function and still keep abundance, biomass and energy together, then okay!
PR #81 was getting messy, so I think some sort of discussion or decision-making on what parts of the pipeline overlap or not is imminent.
The general pipeline is retrieve -> clean -> subset -> summarize
Those should not overlap at all. And they should even be split apart further within a step depending on the data. (We started with a few big specific functions, and have been slowly working on sequestering chunks into their own functions, like with the data_processing functions.)
Sorry, I meant overlap in the sense that "retrieve" and "clean" might be similar or identical between abundance
and biomass
.
So the cleaning/summarizing got complicated because of the issue with tidyr reporting empty records as NA or not reporting them at all when you summarize or complete (I don’t remember which without the code). that’s why there’s some apparently redundant chunks and why the cross tab section gets so complicated depending on how the data is being summarized. I think I ended up turning all nas to 0 and then getting the trapping table back and re-assigning nas to the periods that weren’t trapped. How you want to populate the zeros/nas depends on how the output needs to be including both level and shape.
That might be something to take into account in separating cleaning from summarizing? There’s probably a sleeker route, but that was a pitfall that I ran into in this first pass...
On Thu, Jan 18, 2018 at 16:58 Hao Ye notifications@github.com wrote:
Sorry, I meant overlap in the sense that "retrieve" and "clean" might be similar or identical between abundance and biomass.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/weecology/portalr/issues/83#issuecomment-358795197, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ_ht-4NS2lXeyCGEzXA6P9ToJAv_M2Oks5tL75ygaJpZM4RjgdJ .
I was also imagining vertical overlap, if you will, as being minimal. That's because there are so many peculiarities in the data that you may actually want to clean it differently depending on whether you're using the weights or not, for example. It's unfortunate, but true.
That's just my guess. If it works out to be simpler, then great. We can take it on a case-by-case basis.
e.g. @diazrenata 's current issue, suggests that combining abundance and biomass makes them both significantly more complicated. But hopefully there is a way around that, and combining the two is really as simple as adding fill_weights to the cleaning step? (We can accommodate biological peculiarities, but we should try to avoid programming ones.)
I remember seeing that same problem, and it's described here
The abundance output uses table to compute counts, which is not ideal, but does preserve missing species. I looked into using summarize
and tidyr::complete
, but it ends up being slower because of how tidyr::complete works.
I think there's a relevant question here on what to do with 0s. In the cross-tab output, you have to have something, but in the flat (long format) output, you can drop the rows, which makes the output smaller. If you keep them the same, then tidyr::gather and tidyr::spread will convert from one to the other fairly easily though...
How strict is the format of the output that is being checked in test-regression.R
?
For example, the current output for abundance at level = "treatment"
has:
# A tibble: 1,468 x 23
period treatment BA DM DO DS `NA` OL OT PB PE PF PH PI
* <dbl> <fctr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 control 0 42 0 6 4 0 1 0 2 3 0 0
2 2 control 0 63 1 13 4 0 0 0 0 8 0 0
3 3 control 0 45 0 24 4 0 0 0 2 5 0 0
4 4 control 0 40 3 19 6 3 3 0 2 1 0 0
whereas, for abundance at level = "plot"
has:
# A tibble: 30 x 23
period plot BA DM DO DS `NA` OL OT PB PE PF PH PI PL
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 0 2 0 0 0 0 0 0 0 1 0 0 0
2 1 2 0 1 0 0 1 0 0 0 1 0 0 0 0
3 1 3 0 2 0 1 1 0 0 0 0 0 0 0 0
...
23 1 23 0 3 0 0 0 0 0 0 0 0 0 0 0
24 1 24 NA NA NA NA NA NA NA NA NA NA NA NA NA
Whereas the first table drops missing levels of period x treatment, the second table includes the unsampled period x plot levels, but sets abundance to NA.
I feel like we should be consistent in what we do, rather than have it depend on whether the user chooses level = "treatment" or "plot".
cc @diazrenata, @skmorgane, @gmyenni
Oh yeah, this is one of those 'Portal peculiarities' situations. Plot 24 didn't always exist. We dealt with that at the plot level by forcing NAs.
At the treatment level, if a period x treatment doesn't exist, it's probably because trapping wasn't done at all that month. (Treatments are divided between nights, so we specifically try to avoid a case where one treatment is trapped but not another.) Currently, skipped months are just skipped, not filled with NAs.
I think it's like that because we assumed lots of NAs was worse than just skipped months? Well, that, and that NA means something different depending on the situation.
Is there a way to make this consistent? We can use a complete table every time (the full list of new moon numbers, and just fill with NAs as needed). Would that be terrible? @skmorgane ? I feel like we've been through this before and decided it was terrible.
You know, if a significant number of these issues are just arising from the crosstab option, we can also just stop doing crosstabs. People can deal with it themselves, if they want it.
It's not crosstab, specifically. I'm rewriting some of the summary code, and I think we can use some of the same summary code in R if the output is consistent. Right now, it isn't, so there's some messy hacks to handle level = plot
differently from level = treatment
. (related to @diazrenata's comment above)
Is this missing data issue related to issue @DAPPERstats is talking about over at portal predictions? https://github.com/weecology/portalPredictions/issues/161 In that issue I think we've decided that we need the missed months need to be documented as NAs in the data used for forecasting so that the models can deal with missing observations appropriately. There's a lot going on but it seems like we need the NAs inserted for that.
@gmyenni , I remember us deciding not to include the NAs in the summary output but I can't remember why except for size of the file (especially the monthly plot by species output).
Having the complete table with missing factor combinations does get large.
Discussing with @DAPPERstats, one solution is to add an argument to allow for missing factor combinations to be dropped or not. This can have default settings that depend on whether plot or treatment is selected for the level, but also allow the models in PortalPredictions to specify whether they want the NAs or not.
one solution is to add an argument to allow for missing factor combinations to be dropped or not
I think this is a good general solution. Let folks have the option, default to whatever is most usable.
That said, @gmyenni indicated that the definition of NA
's varied:
NA means something different depending on the situation
Looking at the discussion surrounding this quote it felt like they were effectively the same in that NA
indicated no data due to non-trapping. That could be plot 24 for several periods, or a bunch of plots for a single period, but it can still be described as "no data due to non-trapping". I may very well have missed something since this is a longer conversation.
At the treatment level, if a period x treatment doesn't exist, it's probably because trapping wasn't done at all that month.
When this happens is the abundance value being corrected for the different number of plots trapped to provide an estimate of treatment abundance or is it being filled with the actual abundance based on a subset of the plots?
Either way I think we may want to give users the option to just use fully sampled months.
A further complication is with biomass, where the output reports 0 for identified, but unweighed animals (and without imputing missing weights). Commonly, this might be reported as NA, but if the output is in a cross-table format with species as columns, it can't be distinguished from animals that weren't observed.
When this happens is the abundance value being corrected for the different number of plots trapped to provide an estimate of treatment abundance or is it being filled with the actual abundance based on a subset of the plots?
Right now, it's just doing a simple sum, but we have this noted in #47.
Either way I think we may want to give users the option to just use fully sampled months.
I think setting length = "longterm"
should do this, or were you thinking of something different?
Right now, the default is to only use fully sampled months, with incomplete = FALSE
. But #47 will give better options than just incomplete = TRUE
.
Re: the meaning of NAs --
If we fill missing combinations of year month treatment species
or year month plot species
with NA
, that could represent slightly different meanings (ignoring the obvious real 0
vs NA
differentiation):
1) that combo doesn't exist 2) that combo was skipped that month 3) that combo was trapped, but is unusable (a negative period code)
[This is just an important clarification for us, to make sure we're doing things correctly. I don't imagine this difference would matter to the average user of the package.]
Thanks @gmyenni & @ha0ye. That all makes sense to me. :+1: :+1:
WRT biomass - I think if we're producing a biomass value then it almost has to include imputed values since it so much better an estimate than not counting it at all. Anyone who needs to do something more sophisticated (e.g., do their own imputation) would need to get the individual level data anyway.
@ethanwhite, not all the rodents get imputed biomass; I opened a separate issue to discuss this: #86
Ok, I've diverted the new questions into new issues (#88 and #89) to continue those discussions if need be, so I'm going to close this issue (now that the code has been merged into master).
Right now the
abundance
function does a lot, which makes the pipeline hard to test and debug. (and it will only get worse with the implementations for biomass and energy).We should consider splitting it up, possibly in multiple private functions, and documenting the inputs and outputs to each step.