yihui / knitr

A general-purpose tool for dynamic report generation in R
https://yihui.org/knitr/

Unknown bug arises from using cache = TRUE & data.table #1457

Closed MichaelChirico closed 6 years ago

MichaelChirico commented 7 years ago

I'm going crazy from this; apologies in advance for not being able to provide a reproducible example -- my data set is quite large (hence the need for cache).

I have the following code in a chunk:

DT[DT[ , .(w = sum(V1)),
              by = .(id1, id2)
              ][ , .(m = median(w)), by = id1
                 ][ , .(id1, h =
                          cut(m, breaks = break_vals,
                              include.lowest = TRUE, right = FALSE))],
      .(id1, id3, h, V1), on = .(id1)
      # there are about 1,000,000 rows in the result here
      ][ , median(V1), by = h]

When I run it from a fresh session, there's no problem. And when I load the data into memory, it runs fine, as well.

However, when the chunk that creates DT is run with cache = TRUE, this chunk errors:

Error in gmedian(week_hrs) : negative length vectors are not allowed
Calls: ... eval -> eval -> [ -> [.data.table -> gforce -> gmedian

gmedian is an internal data.table function. The fact that [.data.table was dispatched suggests this is not the usual problem with a data.table loaded from a binary file, where we need to call setDT on the object if it was added with, e.g., load. In fact, if I put everything before the last median call in a chunk just before this one, it runs completely fine:

cat(capture.output({
  # same join as in the failing chunk, minus the final median(V1) step
  DT[DT[ , .(w = sum(V1)),
         by = .(id1, id2)
         ][ , .(m = median(w)), by = id1
            ][ , .(id1, h =
                     cut(m, breaks = break_vals,
                         include.lowest = TRUE, right = FALSE))],
     .(id1, id3, h, V1), on = .(id1)]
}), sep = '\n')
# halt here so the erroring chunk below never runs
stop()

This confirms that DT is still being treated as a data.table and that GForce is being dispatched correctly (the m = median(w) line dispatches GForce) up through that point.
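For reference, the load-from-binary situation mentioned above can be sketched as follows. This is only an illustration: the table DT0, its values, and the use of saveRDS/readRDS as a stand-in for whatever restores the object are all assumptions, and nothing here reproduces the error in this issue.

```r
library(data.table)

# Tiny made-up table; not the data from this issue
DT0 <- data.table(id1 = 1:3, V1 = c(2.5, 1, 4))

# Round-trip through an .rds file in a temporary location
f <- tempfile(fileext = ".rds")
saveRDS(DT0, f)
DT1 <- readRDS(f)

# The class attribute survives, so `[` still dispatches to [.data.table
print(is.data.table(DT1))

# truelength() reports the allocated column slots; comparing the value
# before and after setDT() shows whether the allocation was restored
print(truelength(DT1))
setDT(DT1)
print(truelength(DT1))

# A grouped median on the reloaded table
print(DT1[ , median(V1), by = id1])
```

Checking is.data.table() and truelength() on the real DT inside the failing chunk would be one way to rule this scenario in or out.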

Further, if I replace the real V1 with V1 = rnorm(.N) to try to generate anonymized data to share here, the code does not reliably error (extra surprising, since I assumed the error was related to h being a factor).

That's as far as I've gotten... it certainly seems like a bug somewhere. This is a pain because the cached chunk takes about 20 minutes to run -- a perfect use case for only re-running it when the underlying code changes. But as it stands this isn't feasible.

yihui commented 7 years ago

You don't need real data to create a reproducible example. Use simulated / random data instead.
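A minimal sketch of what such a simulated example might look like, reusing the column names from the report (id1, id2, id3, V1). The sizes, distributions, seed, and the quantile-based break_vals are all assumptions made up for illustration; the real break points are not shown in this thread, and this sketch is not known to trigger the error.

```r
library(data.table)

set.seed(1457)  # arbitrary seed so the simulated data are repeatable
n <- 1e4        # assumed size; the real data are far larger

# Simulated stand-in for DT; every value here is made up
DT <- data.table(
  id1 = sample(50L, n, replace = TRUE),
  id2 = sample(20L, n, replace = TRUE),
  id3 = sample(5L, n, replace = TRUE),
  V1  = rexp(n, rate = 1 / 40)
)

# Per-id1 median of the (id1, id2) totals, as in the original chunk
grp_med <- DT[ , .(w = sum(V1)), by = .(id1, id2)
               ][ , .(m = median(w)), by = id1]

# Hypothetical break points: quartiles of the simulated medians
break_vals <- quantile(grp_med$m, probs = seq(0, 1, by = 0.25))

# Join the binned medians back onto DT, then take the median of V1 per bin
res <- DT[grp_med[ , .(id1, h = cut(m, breaks = break_vals,
                                    include.lowest = TRUE, right = FALSE))],
          .(id1, id3, h, V1), on = .(id1)
          ][ , median(V1), by = h]
print(res)
```

To exercise the caching path, the DT-creating part could go in one chunk with cache = TRUE and the join in the next chunk.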

Since the same error occurred without using knitr (e.g. https://github.com/Rdatatable/data.table/issues/2046), I doubt it is really a knitr issue. Anyway, without a reproducible example, I cannot do anything about it (i.e. I cannot fix an issue only by guessing, especially when the error comes from another package that I don't maintain). Sorry.

MichaelChirico commented 7 years ago

I don't think the issue is the same, and I haven't been able to reliably reproduce it with mocked-up data, though I've tried a bit to force it.

Of course I understand you can't devote much time to open-ended guessing. I posted this 1) in case something obvious came to mind or you had a suggestion for probing the issue a bit more, and 2) so I can update if/when I nail the issue down to something reproducible. Thanks.

yihui commented 7 years ago

Sounds good. I'll be happy to re-open this issue when we are sure it is knitr's fault. Thanks!
