westpa / west_tools

Supporting analysis tools for WESTPA (legacy; now merged into westpa)

Interactive, online analysis tool (devname: w_ipython) #18

Open astatide opened 8 years ago

astatide commented 8 years ago

Hi all,

Inside of the DEVELOPMENT branch of west_tools, I've started work on an interactive tool to ease interactive and automated analysis; the current name is w_ipython. I'm totally up for a different name.

The idea stemmed from the fact that we routinely needed to access the raw HDF5 data, either to debug a simulation or to analyze it in a way that the tools don't currently (and probably won't ever) support. It would have been nice, I figured, to have a script that would just load the main HDF5 file (typically west.h5), sparing you from importing numpy and h5py, loading up the iterations by hand, and so on, and maybe throw in a few convenience functions.
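
For context, the boilerplate being automated looks roughly like this (a sketch; the `iterations/iter_XXXXXXXX` group layout and `seg_index` dataset reflect a typical west.h5, but the exact paths here are assumptions, not this tool's code):

```python
# A sketch of the manual h5py boilerplate the tool is meant to replace.
# The 'iterations/iter_%08d' group naming matches a typical west.h5 layout.
import h5py

def summarize_iteration(h5path, n_iter):
    """Open a WESTPA HDF5 file and report what one iteration contains."""
    with h5py.File(h5path, 'r') as f:
        grp = f['iterations/iter_%08d' % n_iter]
        names = sorted(grp.keys())          # datasets stored for this iteration
        n_segs = grp['seg_index'].shape[0]  # one row per walker/segment
        return names, n_segs
```

Every analysis session starts with some variant of this; having a single object do it once (and cache the results) is the whole point.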

It sort of grew from there. It currently looks through the main configuration file (west.cfg), pulls in analysis parameters, runs functions that it needs to, and drops you at an ipython prompt with a 'w' object that contains all the information from your simulation and the analysis you've selected to do.

The initial and current development goals, as well as their implementation, are as follows.

Some issues that would need to be ironed out before release:

  1. The default parameters are set up the way I found them convenient (cumulative evolution with a step size of 1). That's probably fine as a default, but it means I hardcoded a few 'this is always going to be an evolution plot with a step size of 1' assumptions here and there that need to change. It should work with a bit of tweaking, but I still need to do said tweaking.
  2. It can currently only construct rectilinear bin mappers, with code blatantly ripped from w_assign. I should either call that function from w_assign directly to de-duplicate the code, or think of a better way to do it altogether.
  3. MORE DOCUMENTATION.
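
For the curious, one-dimensional rectilinear bin assignment itself boils down to very little; the mapper essentially wraps something like this (a numpy sketch, not the actual RectilinearBinMapper code; the boundaries mirror the config example below):

```python
# What a one-dimensional rectilinear bin mapper boils down to:
# assign each progress-coordinate value to the interval that brackets it.
import numpy as np

boundaries = [0.0, 4.0, 10.0, 100000.0]  # same boundaries as the config below
coords = np.array([3.99, 4.0, 50.0])
# np.digitize returns 1-based interval indices for ascending boundaries,
# so subtract 1 to get 0-based bin assignments
assignments = np.digitize(coords, boundaries) - 1
print(assignments.tolist())  # [0, 1, 2]
```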

For the most part, it calls functionality from existing code whenever it can, so it should be easy enough to maintain.

A few screenshots or configuration options, for the unbelievers:

Inside my west.cfg:

  w_ipython:
    directory: ANALYSIS
    postanalysis: True
    w_kinavg:
      bootstrap: True
    analysis_schemes:
      BOUND:
        enabled: True
        states:
          - label: unbound
            coords: [[10.0]]
          - label: bound
            coords: [[3.99]]
        bins:
          - type: RectilinearBinMapper
            boundaries: [[0.0,4.0,10.00,100000]]
      NOCORREL:
        enabled: True
        w_kinavg:
          bootstrap: True
          correl: False
        states:
          - label: unbound
            coords: [[10.0]]
          - label: bound
            coords: [[3.99]]
        bins:
          - type: RectilinearBinMapper
            boundaries: [[0.0,4.0,10.00,100000]]
      PROB:
        enabled: True
        w_kinavg:
          bootstrap: True
          correl: False
        states:
          - label: unbound
            coords: [[10.0]]
          - label: bound
            coords: [[3.99]]
        bins:
          - type: RectilinearBinMapper
            boundaries: [[0.0,4.0,100000]]
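
Under the hood, reading that section is just YAML parsing; something like this sketch (PyYAML assumed; the nesting matches the snippet above, though in a real west.cfg the w_ipython block may sit under additional keys):

```python
# Sketch: pull the enabled analysis schemes out of a w_ipython config section.
import yaml

def enabled_schemes(cfg_text):
    """Return the names of analysis schemes marked enabled, sorted."""
    cfg = yaml.safe_load(cfg_text)
    schemes = cfg['w_ipython']['analysis_schemes']
    return sorted(name for name, s in schemes.items() if s.get('enabled'))

example = """
w_ipython:
  directory: ANALYSIS
  analysis_schemes:
    BOUND:
      enabled: True
    NOCORREL:
      enabled: False
"""
print(enabled_schemes(example))  # ['BOUND']
```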

Startup, selecting an iteration, and what's available in the current iteration: [screenshot]

Plotting from state 0 to 1 from the reweighting code: [screenshot]

Output from a trace. Easily plotted with pyplot, if one chose to do so:

[screenshot]

Comments, suggestions, criticisms, design suggestions, usability concerns, etc., are all appreciated. It's worth noting that all the tools have been updated so that they can run according to a particular 'analysis scheme' (in addition to their normal functionality), so it should be easy to integrate into an existing workflow. One can also pass the 'analyze only' flag to just run everything and call it a day.

Adam

astatide commented 8 years ago

An output of the help, to give you an idea of the sort of information it exposes:

[screenshot]

synapticarbors commented 8 years ago

@ajoshpratt It's an interesting idea, and certainly with newer versions of IPython (>5.0) that have multi-line editing, it could be helpful. I know, however, that I tend to do most of my analysis in a Jupyter notebook when possible. That usually involves moving a copy of the data (or some relevant intermediate result) to my local machine, so I can see the advantage of having something that can be run remotely from the command line. What I never explored, but might be relevant, is running a remote Jupyter notebook kernel and then attaching a local browser to it, so you get the best of both worlds:

http://jupyter-notebook.readthedocs.io/en/latest/public_server.html

Again, I've never done this, so there might be some major limitations, but maybe it's worth looking at so users could potentially leverage all of the niceties of the notebook and also have full-fledged plotting capabilities.
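
For reference, the remote-kernel setup described in that link usually comes down to two commands (host name, user, and port below are placeholders):

```shell
# On the remote machine: start a notebook server without opening a browser
jupyter notebook --no-browser --port=8888

# On the local machine: forward the port over ssh, then browse to localhost:8888
ssh -N -L 8888:localhost:8888 user@remote.host
```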

Also, I wanted to note from a workflow standpoint that I'd discourage you from having a generic DEVELOPMENT branch that all development goes into. Instead, each feature should have its own branch that comes off of a common development branch (or possibly master directly). That development branch should always strive to be fully deployable, with the goal of merging it into master when it's time to spin off a release. When a feature branch has been discussed and approved, it gets merged in.

But more generally, I think the WESTPA team should have a well-defined workflow for adding features. Other big projects spell them out in the docs:

http://scikit-learn.org/stable/developers/contributing.html http://msmbuilder.org/3.6.0/contributing.html etc.

I know this is diverging from the main topic of the issue, but to preserve the long-term maintainability of the code, I think it behooves us to have a well-defined process that includes automated test runs and pull requests.

astatide commented 8 years ago

@synapticarbors, thanks for the workflow suggestion. I agree; we don't have a well-defined workflow, so it's easy to stumble into a development situation where changes and fixes end up getting built on top of each other without being merged.

For what it's worth, I'd been thinking about breaking development of this off into another branch to keep this one focused on changes to the kinetics code, but hadn't decided if it was worth it. Development sins aside, though, I wanted feedback before making any more changes. I'll be opening another topic on the kinetics changes soon, once I can work through the writeup and document why the changes are necessary (as well as cleaning the code).

Anyway, the suggestion about leveraging Jupyter is worth looking into, and it's nice to hear about other people's workflows. We could provide 'easy to import' modules (and sample notebooks) that work with a Jupyter notebook and greatly simplify analysis for new users, exposing the same sort of data we're exposing here. Actually, that's probably pretty straightforward with this tool: when the object is created, it does all the work necessary to prepare the various datasets. I suppose you'd really only have to:

  1. have the data locally available, and
  2. instantiate the object.

Which is something I hadn't really thought of before. There are some 'convenience' plotting functions built on matplotlib that would already work reasonably well here.

Adam

ltchong commented 8 years ago

Hi Josh,

Thanks for bringing up these ideas of yours again about the workflow -- they have been on our list of things to do. It is very useful to see how projects like MSMBuilder that have been around for much longer than WESTPA have evolved in terms of handling workflow, etc. and we will keep it in mind. We still have a very small group of developers, so it will take some time to get everything in place.

Best, Lillian



astatide commented 7 years ago

Looking around, it seems it's a little difficult (if not impossible) to cleanly launch an interface-agnostic IPython notebook server from a script, and impossible to attach an IPython notebook to an already-running kernel.

There may be a magic command, but it's probably much easier to simply modify the west script in $WEST_ROOT/bin to accept a '--notebook' flag that launches a Jupyter notebook. The user could then create a notebook, import the module (we could provide examples of how to do this), and use the convenience functions in w_ipython, if they wanted.

On the user's end, this takes care of all the variable setting required to launch a WESTPA script. On our end, it's not that difficult, either: the west script already accepts flags (strace, etc.) that aren't passed on to the Python binary, so the framework is there, so to speak.
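
The flag handling on the Python side would be trivial; a sketch (the parser and flag name here are assumptions for illustration, not the actual w_ipython interface):

```python
# Sketch: branch on a hypothetical --notebook flag before dropping the user
# into an analysis session.
import argparse

def make_parser():
    parser = argparse.ArgumentParser(prog='w_ipython')
    parser.add_argument('--notebook', action='store_true',
                        help='launch a Jupyter notebook instead of an '
                             'interactive IPython prompt')
    return parser

args = make_parser().parse_args(['--notebook'])
print(args.notebook)  # True
```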

[screenshot]

Seems to work well enough. The user could then launch

w_ipython --notebook

to launch a Jupyter notebook, or just

w_ipython

to drop into an interactive prompt.

Still thinking of a good name for this. Also, you can tell that I started from w_kinavg as a base for this, given that the class is still named Kinetics. Hah.