yihui / litedown

A lightweight version of R Markdown
https://yihui.org/litedown/

General questions about litedown #10

Open statquant opened 1 month ago

statquant commented 1 month ago

Hello, I've been very interested to see that you're giving R Markdown another look. I've read the docs and I have some questions; please let me know if this is not the right place to ask.

You say

HTML widgets are not supported yet, but may be reimagined in the future with some minimal support.

At work we use widgets exclusively, but because we never managed to find a way to load them lazily in the browser (so loading the document was unusably slow), we started doing the following: we use webshot (or webshot2, but it's buggy) to create snapshots, and we link the actual HTML widget file from the snapshot picture. This has been a game changer for us; can I assume this will work as well? Happy to hear if you have a better way to do this. BTW, we actually defer saving each widget to HTML on another thread, which speeds up the rendering a lot. I don't know if you plan to do something similar, but this might be food for thought.
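For reference, the snapshot-plus-link workaround described above can be sketched roughly like this (a minimal sketch, not litedown functionality; the file names and the final `cat()` step are illustrative):

```r
# Sketch of the snapshot workaround: save the widget to a standalone HTML
# file, take a static snapshot of it, and embed the snapshot as a link.
library(plotly)

w <- plot_ly(x = 1:10, y = rnorm(10), type = 'scatter', mode = 'lines')

htmlwidgets::saveWidget(w, 'widget.html', selfcontained = TRUE)
webshot2::webshot('widget.html', 'widget.png')

# In a chunk with results='asis', link the static snapshot to the
# interactive version:
cat('[![snapshot](widget.png)](widget.html)')
```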

You say

By default, litedown produces bare minimal HTML output, but many features can be enabled by adding CSS/JS assets (see 4).

Having pleasing output is pretty important, and output that is too raw might deter people. Will you be showing examples of how to reproduce something like the default rmarkdown output? I am thinking of something with a collapsible TOC as well.

You say

You can feel more confident with using the chunk option cache = TRUE in litedown than in knitr. The new cache system is more robust and intelligent.

No hooks at the moment.

This is what perplexed me the most: I never could make satisfying use of caching. As long as you work with medium-sized data (say a few million rows and a few columns), chunks that load/update/save the data will always take too long. I went as far as swapping the rds cache format for qs (a binary format with multi-threaded reading) to speed up cache loading, but that's still too slow. For the same reason I never could use the preview efficiently, the full re-rendering being too slow. My view is that the only right thing is to do nothing at all when the code has not changed, very much like what Jupyter does. I believe most users will knit the document in-session anyway, so the data will be there. I personally solved this by creating a hook that won't eval a chunk if the hash of the chunk's code has not changed, but if there are no hooks I'd be lost. Do you have a nice way to support working "à la Jupyter" (as far as caching/knitting is concerned)?
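For context, the skip-if-unchanged hook described above could be sketched in knitr roughly like this (a minimal sketch; `run_if_changed()` is a hypothetical helper, not part of knitr or litedown):

```r
library(digest)

# Session-level store of chunk code hashes (hypothetical helper state).
.chunk_hashes <- new.env()

# Returns TRUE only if the chunk's source code has changed since the last
# time it was evaluated in this R session.
run_if_changed <- function(label = knitr::opts_current$get('label')) {
  code <- paste(knitr::knit_code$get(label), collapse = '\n')
  h <- digest(code)
  changed <- !identical(.chunk_hashes[[label]], h)
  if (changed) .chunk_hashes[[label]] <- h
  changed
}
```

A chunk would then opt in with a header like `{r slow-step, eval = run_if_changed()}`, so re-knitting in the same session skips chunks whose code is unchanged.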

Many thanks for your work, I am using it every day; to me, being able to use R code in `eval` is still a killer feature that no other tool provides.

yihui commented 2 weeks ago

For HTML widgets, I need to know more about your use case and understand the bottleneck before I can tell if the problem is solvable. Without knowing the details, I think at least the widgets can be lazy-rendered, i.e., don't render them until they are scrolled into view. If the JS rendering is not the bottleneck, this won't help much. If you have hundreds of widgets on the same page, I guess lazy-rendering won't help much regarding the memory/CPU usage in the end (it only helps in the beginning).

> will you be showing examples or how to reproduce something like the default rmarkdown output, I am thinking something with a collapsible TOC as well?

Yes. For collapsible TOC, you need the following css/js:

````md
---
title: "Test TOC"
output:
  litedown::html_format:
    meta:
      css: ["@default", "@article"]
      js: ["@sidenotes", "@toc-highlight"]
    options:
      toc: true
      number_sections: true
knit: litedown:::knit
---

```{r, results='asis'}
cat(paste(
  strrep('#', sample(0:3, 100, TRUE, c(.6, .1, .1, .2))), 'test',
  collapse = '\n\n'
))
```
````

> I never could make use of caching satisfyingly.

Caching _is_ hard. I completely rewrote the caching system for **litedown** in `xfun::cache_exec()` (it is not tied to **litedown** but can be used in other places). It supports both `rds` and `qs`, and you can bring your own read/write methods if the built-in methods are not satisfactory (e.g., not fast enough). Loading cache is always lazy, i.e., cache files won't be read unless cached objects are to be used. If you have a large object and it is not used in an uncached chunk, the object will not be read from cache, which can save you substantial time. Caching in **litedown** still needs substantial work on documentation. Users have to understand how caching works to take full advantage of it, otherwise caching does not necessarily make the build faster.
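A minimal example of the function mentioned above might look like this (a sketch; I'm assuming a `path` argument for the cache directory — check `?xfun::cache_exec` for the authoritative argument list and the custom read/write options):

```r
# Cache an expensive computation; on a second run with unchanged code,
# the expression is not re-evaluated, and the cached result is only
# loaded when it is actually used.
res <- xfun::cache_exec({
  Sys.sleep(2)                 # stand-in for an expensive data step
  data.frame(x = 1:10)
}, path = 'cache/')
```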

> the only right thing is to not do anything at all when the code has not moved, very much like what jupyter is doing

I think it is simple enough to implement that. The question is whether it's the right thing to do. [Jupyter is notorious for the hidden-state problem](https://yihui.org/en/2018/09/notebook-war/#1-hidden-state-and-out-of-order-execution). I'd like to avoid that.

With the proper use of caching, I think the preview should be fast enough. Most documents should take no more than one second to build. Performance is a priority of **litedown**, and I'll try my best to make it faster as I learn more from practical use cases.

Thanks for your feedback!

statquant commented 2 weeks ago

Hello, thanks for coming back to me,

> you need the following css/js

Thanks! Can I suggest that sometime in the future you show an example that produces mostly the same styled output one would get by default in rmarkdown?

> For HTML widgets, I need to know more about your use case and understand the bottleneck before I can tell if the problem is solvable. Without knowing the details, I think at least the widgets can be lazy-rendered, i.e., don't render them until they are scrolled into view. If the JS rendering is not the bottleneck, this won't help much. If you have hundreds of widgets on the same page, I guess lazy-rendering won't help much regarding the memory/CPU usage in the end (it only helps in the beginning).

We routinely produce HTML reports through rmarkdown with 100 plotly graphs. None is "big" (as in 100,000 points), but they all need to be rendered (i.e., displayable in the browser) before the page is usable (can be looked at and scrolled). I've never seen a way to render lazily (for instance, hafen/lazyrmd does not work). As I said, we found a hacky workaround by saving each graph as HTML and showing a static image linking to each HTML file in the report. Obviously, needing no workaround would be amazing.

> I think it is simple enough to implement that. The question is whether it's the right thing to do. Jupyter is notorious for the hidden-state problem. I'd like to avoid that.

Unfortunately for me, I stand strongly on the other side: in practice, I do not think the hidden-state problem is much of a problem. As you say, one just has to re-execute in a clean session, which is what everybody does. Adding friction to the research process (like having to wait for a document to re-render from scratch, even with caching) is very detrimental, and the more data-heavy the research, the worse it gets. Schematically speaking, if I could re-render a 70-chunk Rmd document by only eval-ing the one chunk that changed and see the output update instantly, that would change my work life (same for a few of my colleagues).

> With the proper use of caching, I think the preview should be fast enough. Most documents should take no more than one second to build. Performance is a priority of litedown, and I'll try my best to make it faster as I learn more from practical use cases.

Pragmatically, that's orders of magnitude away from what I see. Let me describe schematically what happens in practice.

  1. I start with a data step that takes several minutes to produce, say, a data.table `data` of 20M rows and 20 columns. This is loaded in RAM and stored on SSD as an fst or qs file. I have my own caching, but loading this file back still takes 10-30 seconds.
  2. I add several dozen chunks that usually work from `data` with aggregations, etc. Most chunks take a handful of seconds to process and end up producing a plotly graph of a time series of, say, 1,000 points, sometimes several series of 1,000 points sharing the same x axis.

In Jupyter, I would only execute the last chunk when ready and work incrementally.

In R, the equivalent workflow requires rendering several times, and that's too slow even though: I always render within the current environment; I created a hook that simply does not execute a chunk if a given list of objects already exists in the environment (typically I test for `data` in the data chunk); and I always use caching (in qs or fst format, according to the class of the object).

I really think the problem is that loading the cache ends up being slow at some point, and unless we can keep the output of a chunk that has not changed, this cannot be solved.

yihui commented 2 weeks ago

I've been extremely busy recently and probably won't free up until August, so let me give you some quick answers first.

I've spent a large amount of time on the problem of state vs performance when designing litedown. It's a tricky problem, but I think the solutions that I have come up with so far should work well enough. In short, I have provided two approaches:

  1. In-memory caching: if you set `litedown::reactor(cache = ':memory:')` in the first code chunk, your objects will be cached in memory, which can save you substantial time because it no longer needs to read/write files. The price to pay is higher memory consumption. This approach is similar to your hook, but should be more intelligent (it can decide whether a chunk should be re-computed by testing whether its dependencies have changed).
  2. On-disk caching for all chunks: set `litedown::reactor(cache = TRUE)` in the first chunk to simply cache all code chunks. This should achieve the goal you mentioned: only execute the code chunks that have changed. However, the caching is also intelligent in the sense that if a chunk's dependencies have changed, the chunk's cache will be invalidated. This avoids the aforementioned Jupyter problem.
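Concretely, opting in to either approach above would look like this in the first chunk of a document (following the calls mentioned above; pick one):

```r
# First code chunk of the document: choose a caching strategy.
litedown::reactor(cache = ':memory:')  # in-memory caching (faster, more RAM)
# litedown::reactor(cache = TRUE)      # on-disk caching for all chunks
```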

I haven't tested these approaches thoroughly myself, so it's possible that they are buggy somewhere. It will be great if you can help test them. Of course, the in-memory caching only works when you keep previewing a document in the same R session, which is what litedown::roam() does.

Re: the HTML widgets issue, it will be great if you can provide a reproducible example. I will take a closer look later.