Separate notebook source from cached output

riedelcastro commented 9 years ago

I quite often run into the problem that moro hangs for one reason or another. When this happens while I am editing, moro seems to cache the hung state, which then shows up in the static view. To get the static view to show the right results I need to go to the editor view and re-run. If I happen to push the notebook in this state to GitHub, the server inherits the same problem. It's also weird because sometimes I push and get a lot of changes to the notebook files without actually having made a lot of changes, just because the cached state has changed and is committed. These problems should disappear if the server caches results in a separate file (not under source control) rather than in the source file.

sameersingh commented 9 years ago

Separate file or not, the main question is whether the output is generated during editing or during static viewing. Currently it is the former: the output is always synced to the source notebook (it doesn't really matter whether it is a separate file or not), so you would want both to be under revision control.

The main reason to save output during editing is so that the notebooks can be viewed non-interactively without having a server running, similar to the IPython notebook viewer. Again, it doesn't matter whether it is a different file or not.

On the other hand, we can just cache it the first time a file is rendered for static view. It still means the first time someone runs the static view it will run the whole code, which might be undesirable (imagine if it needs hours on a cluster, for example). Are you proposing we should only be saving the output after running it in static view? In that case, we can't have a version of notebook that can be rendered without a full server running somewhere.

I think saving during editing makes sense. The only thing that is actually fixed by keeping a separate file is that diffs will be smaller, but you will still need to commit both versions of the file for caching to work. And if Moro hangs, and the source and output are mismatched, the notebook will look weird anyway.

What I think could be reasonable is to (1) have a way for notebooks to not save their output during editing, and (2) if there is no output, for the server to run the code and save the output as a separate file.

riedelcastro commented 9 years ago

For me the main issue is still separate file or not :) Generally source code should be under source control, generated files should not (ipython seems like a bad exception in my view; in any other setting this distinction is always clear). Beyond just a principle---which, for example, enables different ways of rendering the same moro content---I really run into problems with this. Oftentimes the notebook source is totally fine, but because of a memory leak moro hangs and screws up the output. Now I commit, and the web server has screwed-up output too, even though the source is fine. Then I need to go to my local editor, run the notebook again, and commit again, etc. In addition, notebook JSON files were somewhat readable at some point; now they are bloated and hard to parse.

My proposal would be: any moro server implementation can cache output as much as it needs, but internally without changing source files. The server could generate the cache when the editor is used, or on the fly when needed for the static view, I don't care. If someone wants to see the generated static html files without a server, she will call sbt runMain generateStaticNotebooks, and this will put some static output somewhere. I really see no need for storing output in source files.

sameersingh commented 9 years ago

That sounds reasonable, but it will take a while to get this implemented, which might mean no caching in the meantime. Part of the reason I had it in a single file was that it was easy :)

Here's the new proposal:

Notebooks only have source
During edits, I generate the output as well, in a separate file, but maybe it's not in source control
During static view, I check for the file, and if it is there, assume it's correct and synced up to the source
If it doesn't exist, there's no caching, and I run the whole code
(future) create a cache by issuing some command like "/doc/gen_cache/...", and maybe also a "/doc/gen_cache/all"
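The steps in this proposal can be sketched roughly as follows. This is only an illustration in Python, not moro's actual implementation: the `.cache` suffix, the function names, and the `run_notebook` callback are all assumptions made up for the example.

```python
from pathlib import Path
from typing import Callable


def cache_path_for(notebook: Path) -> Path:
    """Hypothetical naming scheme: nb.json -> nb.json.cache, next to the source."""
    return notebook.with_name(notebook.name + ".cache")


def save_from_editor(notebook: Path, source: str, output: str) -> None:
    """Editor save: the notebook file holds only source, output goes to the cache file."""
    notebook.write_text(source, encoding="utf-8")
    cache_path_for(notebook).write_text(output, encoding="utf-8")


def render_static(notebook: Path, run_notebook: Callable[[str], str]) -> str:
    """Static view: use the cache file if it exists, otherwise run the whole code."""
    cache = cache_path_for(notebook)
    if cache.exists():
        # Assume the cache is correct and in sync with the source.
        return cache.read_text(encoding="utf-8")
    # No cache file: run everything, and do not write a cache from here.
    return run_notebook(notebook.read_text(encoding="utf-8"))
```

Under this sketch, keeping the cache out of source control would just mean ignoring the hypothetical `*.cache` files in the repository.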

sameersingh commented 9 years ago

Implemented a version of this, let me know if something is broken. Should be backward compatible in that all notebooks can be read, but previously cached output in the notebook file itself will be ignored. Cache file is generated at editor time, but not checked in.

gen_cache is still needed.

sameersingh commented 9 years ago

gen_cache also present now, which you can call with /doc/gen_cache/path/to/nb to generate the cache for the path/to/nb notebook, which is then used automatically in static views.
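For illustration, the URL for a given notebook could be assembled like this. It is a small Python sketch: `gen_cache_url` is a hypothetical helper, and the only thing taken from the comment above is the `/doc/gen_cache/path/to/nb` pattern. The base URL is whatever server is running moro.

```python
from urllib.parse import quote


def gen_cache_url(base_url: str, notebook_path: str) -> str:
    """Build the gen_cache URL for a notebook path (hypothetical helper)."""
    # Strip surrounding slashes so "/a/b/" and "a/b" give the same URL.
    clean = notebook_path.strip("/")
    # quote() leaves "/" intact by default and escapes characters such as spaces.
    return f"{base_url.rstrip('/')}/doc/gen_cache/{quote(clean)}"
```

Requesting the resulting URL (in a browser or with any HTTP client) is what triggers the cache generation; the helper itself makes no network call.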

I think this is good enough to be closed now. Reopen if any issues come up.

riedelcastro commented 9 years ago

Like this http://moro.wolfe.ml:9000/doc/gen_cache/wolfe-docs/concepts/02_terms? I can't seem to get this to work (and the cached output for this nb is out of sync).

sameersingh commented 9 years ago

Hadn't updated the server.

This is now cached: http://moro.wolfe.ml:9000/doc/wolfe-static/wolfe-docs/concepts/02_terms

If you run your gen_cache command, it'll regenerate the cache, which takes 30-45 seconds on this notebook.

wolfe-pack / moro

Separate notebook source from cached output #67