yvs314 / epi-net-m

Data Processing and Simulation Tools for Networked SIR+
MIT License

Proper modular structure for Oboe (Julia data processing routines) #13

Open yvs314 opened 3 years ago

yvs314 commented 3 years ago

Oboe.jl is growing too large; it ought to be carved up.

This can be done one step at a time, e.g. first isolating the loading, collating, and censoring of airport data (BTS and Openflights).

I am not sure about the “proper” way; I have seen “submodules” added via simple include("sub-air.jl") in the main module.

I see the following meaningful submodules and substructures:

Also, I prefer to separate input and output from the overall logic.

This is a rather vague assignment, we will see what can be worked out here.

karaign commented 3 years ago

For posterity, and to help me organize my thoughts, I'll be writing them here.

The terminology used by the Julia package ecosystem is somewhat confusing, at least to someone like me who's more used to npm than virtualenv. The distinction between "package" and "application" is of particular note here:

Package: a project which provides reusable functionality that can be used by other Julia projects via import X or using X.

Application: a project which provides standalone functionality not intended to be reused by other Julia projects. For example a web application or a command-line utility, or simulation/analytics code accompanying a scientific paper.

This project overall seems to fit the definition of "application" rather than "package", although I think there is a case for turning Oboe.jl into a package at some point, since the functionality in it can, in theory, be reused for different kinds of projects. I don't know if that's something you would be interested in, and it probably would complicate the setup, so I think it's best to stick with an application for now.

As for how to actually divide Oboe.jl, here's one idea for the submodules:

This structure more or less matches the data flow through the program (io being both the first and last step), so ideally it would allow each submodule to focus on just one step of the processing. If that's too many submodules, airports could be merged with tracts and travel with epi.
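For concreteness, here is a rough sketch of how the top-level module could wire these submodules together with plain include; the file and module names are placeholders, not a final layout:

```julia
module Oboe

include("io.jl")        # would define module OboeIO: read BTS/Openflights/FluTE, write outputs
include("airports.jl")  # module Airports: load, collate, censor airport data
include("tracts.jl")    # module Tracts: tract handling and aggregation
include("travel.jl")    # module Travel: commute and passenger matrices
include("epi.jl")       # module Epi: epidemic-specific output

using .OboeIO, .Airports, .Tracts, .Travel, .Epi

end # module
```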

karaign commented 3 years ago

I found this package called FromFile that allows for importing modules directly from a file while avoiding the issues with include. To me, this seems more straightforward and less restrictive than grappling with the package system.
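The usage is roughly as follows; the file and module names below are just placeholders:

```julia
using FromFile

# Each submodule lives in its own file and declares its own `module`;
# @from loads that file (only once, even if several files import it)
# and brings the requested name into scope.
@from "airports.jl" import Airports
@from "travel.jl" import Travel
```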

yvs314 commented 3 years ago

Apps, packages and DevOps

I found this package called FromFile that allows for importing modules directly from a file while avoiding the issues with include. To me, this seems more straightforward and less restrictive than grappling with the package system.

Oh my, nearly missed that comment. Good catch, let's go with FromFile then.

[Pkg.jl] terminology

Yes, that hit me too.

This project overall seems to fit the definition of "application" rather than "package" ... there is a case for turning Oboe.jl into a package at some point

Right now it's definitely an application. And I would rather be light on infrastructure, in the sense of minimizing the overhead of updating the code. I can imagine a 3-level overhead classification, all assuming you have the necessary packages installed and have just changed some function in your code:

- no overhead, L1: (1) just run julia whatever.jl ARGS...

- little overhead, L2: (1) run this build.sh, (2) and then a.out ARGS...

- too much overhead, L3: (1) re-compile the package, (2) create a new package environment, (3) install ThisPackage.jl, at long last, (4) run julia whatever.jl ARGS...

As it stands, we're hopefully at L1. If a full-blown package is still at L2, it's fine to have it, but we need to stop before trying to hire a full-time DevOps.

Semantically, I would term this data processing routine a package if it were genericized vis-a-vis input data. That is, if it would make sense to use the same aggregation and processing for different input data. Right now reading from FluTE/BTS/Openflights is soldered in, not connected with a modular plug.
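Purely as an illustration of what a modular plug could look like (none of these names exist in Oboe), the pipeline would accept any reader that yields the expected table, instead of hard-wiring a particular loader:

```julia
using DataFrames

# Illustrative only: a "reader" is any function that returns the expected table.
read_flute_tracts(path) = DataFrame(id = ["t1", "t2"], pop = [100, 200])  # stand-in parser

function process_tracts(read_tracts::Function, path)
    tracts = read_tracts(path)                       # any source with id + pop will do
    combine(groupby(tracts, :id), :pop => sum => :pop)
end

process_tracts(read_flute_tracts, "data/tracts.dat")
```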

Submodules

Overall, the subdivision is sound.

I think airports ought to be treated separately, no need to lump them together with tract processing.

The epi stuff is too small at present; small enough that it is OK to lump it together with output.

One point to make is that this (1) Oboe data processing routine is one of three components; the other two are (2) the differential equations/optimal control problem solver (/m-core) and (3) the visualizer (that VegaLite scratch notebook).

It would make sense for these to be aware of each other, especially (1) and (3) because showing the air travel network is important. Perhaps commute can be visualized too.

As it stands, (2) reads the problem instance generated by (1), and (3) reads the solution, and the tract data. I think I explicitly made the id column in Oboe's output compatible with state/county IDs in the geo shape data set used to draw county borders.

karaign commented 3 years ago

I have set up the basic submodule structure in the modular branch. This technically works, but it still requires some cleanup and refinement (as well as better documentation). For one thing, I think the submodules are too tightly coupled at the moment, mostly because some submodules import methods from other submodules for the sake of providing default arguments. However, getting rid of some of these interdependencies would cause breaking changes in the Oboe module, which I have avoided so far. That said, I'm not sure to what extent the modules should be decoupled, or whether breaking changes are acceptable, so please let me know your thoughts on the matter.

yvs314 commented 3 years ago

Some quick thoughts:

  1. Default arguments were there for some lazy debugging without proper tracking of state. The idea was for the entry point (processOboe in oboe-main.jl) to track them explicitly, which is mostly implemented anyhow. It is OK to remove anything that smells of shortcuts, which they mostly were.
  2. Modules should not be decoupled for decoupling's sake alone (enter MATLAB with its “won't let you call a function in REPL unless it is in its own .m file unless you wrap it into a function object”—ugh!). Some coupling is not just due to ad hoc development but has to do with actual computation.
  3. Frankly, the main decoupling is reading and censoring airport data vs. everything else.
  4. Wonder if the modularization as done in the branch has any effect on performance (of mkPsgMx).
karaign commented 3 years ago

1, 2

Got it.

Frankly, the main decoupling is reading and censoring airport data vs. everything else.

This actually brings me to my next question: should mkPsgMx remain dependent on grpBTS and mkFlightMx2? It seems more appropriate to have the client pass this stuff into mkPsgMx, but once again, breaking stuff is my concern.

Wonder if the modularization as done in the branch has any effect on performance (of mkPsgMx).

Also curious about this. It's kind of an inherently time-consuming test, but I will report on this when I get to it.

yvs314 commented 3 years ago

This actually brings me to my next question: should mkPsgMx remain dependent on grpBTS and mkFlightMx2? It seems more appropriate to have the client pass this stuff into mkPsgMx, but once again, breaking stuff is my concern.

It is indeed more appropriate to call these grpBTS and mkFlightMx2 beforehand, in the main procedure, and pass their results to mkPsgMx. Calling them directly from mkPsgMx was mostly a shortcut, as in, to save a couple of function arguments.
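Schematically, it is the usual "hoist the hidden call into the caller" change; here is a toy version with made-up names rather than Oboe's real signatures:

```julia
# Toy version of the refactor; none of these are Oboe's real signatures.
load_flights() = ["SEA-LAX", "PDX-SFO"]            # stand-in for grpBTS / mkFlightMx2

# Before: the hidden default argument couples this function to the loader.
make_matrix_old(nodes; flights = load_flights()) =
    fill(length(flights), length(nodes), length(nodes))

# After: the caller computes the flight data once and passes it in explicitly.
make_matrix(nodes, flights) =
    fill(length(flights), length(nodes), length(nodes))

nodes   = 1:3
flights = load_flights()                           # done once, in the main routine
M       = make_matrix(nodes, flights)              # 3×3 matrix filled with 2
```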

My original idea of the structure was to have a big state object which would be filled in as the main routine went on: first the AP-AP travel matrix, then the elementary (tract) nodes with designated APs assigned, then the aggregated nodes and partition info, so that most functions that do the real lifting would take this Instance object as a common interface.

But here I stumbled in transitioning from C++ to Julia and went full ad hoc. I mean, it's so scary when you can't call by constant reference or declare methods “const”.
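For the record, a toy sketch of that Instance idea; all the field names are invented here:

```julia
using DataFrames

mutable struct Instance
    flightMx::Union{Matrix{Float64},Nothing}   # AP-to-AP travel matrix
    tracts::Union{DataFrame,Nothing}           # elementary (tract) nodes with designated APs
    aggNodes::Union{DataFrame,Nothing}         # aggregated nodes and partition info
end
Instance() = Instance(nothing, nothing, nothing)

# Each stage of the main routine fills in its slice of the state:
ins = Instance()
ins.flightMx = zeros(2, 2)
ins.tracts   = DataFrame(id = ["t1", "t2"], ap = ["SEA", "LAX"])
ins.aggNodes = DataFrame(id = ["WA", "CA"])
```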

Take your time with tests, feel free to automate them. DataFrames feel rather convenient for this statistical treatment.

karaign commented 3 years ago

I finally got around to benchmarking the new implementation compared to the old one. I can upload the full Jupyter notebook I used if necessary (would I have to make a brand new branch just for that?), but the bottom line is that mkPsgMx now takes ~0.5s less time and 74-75MiB less memory. This can partly be attributed to the fact that mkFlightMx2 is no longer called by mkPsgMx but even then, running a few @time tests showed that mkFlightMx2 only takes up around 0.2s and 19.699 MiB, so that doesn't explain it fully. If I had to guess, I'd say that using proper modules and import/using instead of include speeds things up because of the way Julia handles precompilation or something similar.
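For reference, something like this Base.@timed loop is enough to collect the per-run numbers; workload() is just a stand-in for the real calls:

```julia
# Collect time and memory for repeated runs with Base.@timed.
workload() = sum(rand(10^7))        # stand-in for mkPsgMx / mkFlightMx2 etc.

runs = map(1:5) do _
    s = @timed workload()
    (time_s = s.time, mem_MiB = s.bytes / 2^20)
end
```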

yvs314 commented 3 years ago

This is good news, and it does make sense with precompilation, etc. Thou shalt respect thy optimizing compiler: the modular structure is apparently clearer to it, and perhaps removing the default arguments that were function calls has also helped. I can't help ruminating that, judging at least from my C++ experience, 1970s-style code golf (fewer variables, inlining functions by hand, etc.) can safely be abandoned, or rather, left to the compiler, which does it better than most coders, unless something is distinctly ill-optimized, and that will be caught by the benchmarks.

I can upload the full Jupyter notebook I used if necessary

I'd prefer a table-like summary here in this conversation. Consider writing or finding a Markdown table pretty-printer for the benchmarks so it's easier to post them here. This is not the last you'll see of performance benchmarks, and some will crop up in visualization too; these quality-of-life thingies should be done from time to time.
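Even a hand-rolled one would do; something like this (plain Base, column names made up):

```julia
# Tiny Markdown table printer: header plus equal-length rows.
function md_table(io::IO, header::Vector{String}, rows::Vector{<:Vector})
    println(io, "| ", join(header, " | "), " |")
    println(io, "|", repeat(" --- |", length(header)))
    for r in rows
        println(io, "| ", join(string.(r), " | "), " |")
    end
end

md_table(stdout,
         ["run", "mkPsgMx_TIME [s]", "mkPsgMx_MEM [MiB]"],
         [[1, 0.731, 693.4], [2, 0.728, 693.4]])
```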

would I have to make a brand new branch just for that?

No, no need for a new branch. It's quite OK to commit documentation and similar fixes directly to master. In fact, I have updated the contributing guidelines and repository access to that effect about a month ago. Please test that you can commit directly to master on something safe, like a couple words in the main readme.

karaign commented 3 years ago

After much data wrangling, here are the tables containing the time and memory taken by the 3 major steps (aggregation, commute matrix, passenger matrix) for different aggregation levels, before and after the recent changes. I only did 5 runs for each configuration to save time (but now that I think about it, it would be pretty easy to extend the sample by just adding onto the data I already have). Mean values are in bold, and everything here was run on WA+OR+CA.

By tract

Old version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.867369 | 949.982 | 1.310275 | 767.869 |
| 0 | 0 | 1.239195 | 949.982 | 1.256512 | 767.870 |
| 0 | 0 | 0.936267 | 949.982 | 1.439364 | 767.870 |
| 0 | 0 | 1.161487 | 949.982 | 1.163816 | 767.870 |
| 0 | 0 | 1.177220 | 949.982 | 1.201242 | 767.870 |
| **0.0** | **0.0** | **1.076308** | **949.982** | **1.274242** | **767.8698** |

New version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1.146760 | 949.981 | 0.731168 | 693.360 |
| 0 | 0 | 1.116433 | 949.981 | 0.727866 | 693.360 |
| 0 | 0 | 1.167107 | 949.981 | 0.694101 | 693.360 |
| 0 | 0 | 1.105395 | 949.981 | 0.718226 | 693.360 |
| 0 | 0 | 0.953838 | 949.981 | 0.744612 | 693.360 |
| **0.0** | **0.0** | **1.097907** | **949.981** | **0.723195** | **693.36** |

By state

Old version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.965378 | 415.555 | 0.869572 | 285.477 | 0.985289 | 146.388 |
| 0.957478 | 415.555 | 0.844367 | 285.477 | 0.978177 | 146.388 |
| 0.969878 | 415.555 | 0.869694 | 285.477 | 0.994103 | 146.388 |
| 0.959070 | 415.555 | 0.855980 | 285.477 | 0.972084 | 146.388 |
| 0.661272 | 415.555 | 0.851297 | 285.477 | 1.004495 | 146.388 |
| **0.902615** | **415.555** | **0.858182** | **285.477** | **0.98683** | **146.388** |

New version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.970889 | 415.555 | 0.875064 | 285.469 | 0.497077 | 71.226 |
| 0.970832 | 415.555 | 0.863808 | 285.469 | 0.492479 | 71.226 |
| 0.976225 | 415.554 | 0.860635 | 285.469 | 0.493527 | 71.225 |
| 1.006727 | 415.555 | 0.869938 | 285.471 | 0.501679 | 71.226 |
| 0.657816 | 415.555 | 0.872459 | 285.473 | 0.491906 | 71.226 |
| **0.916498** | **415.5548** | **0.868381** | **285.4702** | **0.495334** | **71.2258** |

By county

Old version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.163754 | 19.509 | 0.972251 | 218.461 | 0.952619 | 146.243 |
| 0.153433 | 19.509 | 1.004088 | 218.461 | 0.933829 | 146.243 |
| 0.150711 | 19.509 | 0.763788 | 218.461 | 0.930468 | 146.243 |
| 0.150629 | 19.509 | 0.777311 | 218.461 | 0.919843 | 146.243 |
| 0.153005 | 19.509 | 0.763526 | 218.461 | 0.908921 | 146.243 |
| **0.154306** | **19.509** | **0.856193** | **218.461** | **0.929136** | **146.243** |

New version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.161269 | 19.509 | 0.996443 | 218.454 | 0.446715 | 71.081 |
| 0.157475 | 19.509 | 0.979300 | 218.456 | 0.447303 | 71.080 |
| 0.149928 | 19.509 | 0.740349 | 218.453 | 0.443181 | 71.080 |
| 0.154606 | 19.509 | 0.987242 | 218.453 | 0.444287 | 71.081 |
| 0.155693 | 19.509 | 0.993001 | 218.454 | 0.444809 | 71.081 |
| **0.155794** | **19.509** | **0.939267** | **218.454** | **0.445259** | **71.0806** |

Will put out the notebook in a bit. Fair warning, it's pretty haphazard at the moment; I got a little carried away trying to figure out the best way to present the data, and there's some weirdness where I couldn't decide whether things should be treated as numbers or strings...

karaign commented 3 years ago

Please test that you can commit directly to master on something safe, like a couple words in the main readme.

I just tried to do that and it seems that the branch is still protected so I can't push to it. Here is the output from git push:

```
> git push origin master:master
remote: error: GH006: Protected branch update failed for refs/heads/master.
remote: error: At least 1 approving review is required by reviewers with write access.
To https://github.com/yvs314/epi-net-m.git
 ! [remote rejected] master -> master (protected branch hook declined)
error: failed to push some refs to 'https://github.com/yvs314/epi-net-m.git'
```
yvs314 commented 3 years ago

I just tried to do that and it seems that the branch is still protected so I can't push to it.

Right. Well, I have just scrapped the branch protection entirely, so please try again. Any features must still go through a pull request and review, but feel free to upload notebooks and update the documentation straight in the mainline branch. Careful, non-breaking changes are OK in master too, but those would still be better served by a separate branch.

Thanks for the data wrangling; the tables are quite clear. So it's safe to say that the improvement in mkPsgMx time and memory usage in the new modular version is not a fluke, which is good to know.

Will put out the notebook in a bit. Fair warning, it's pretty haphazard at the moment...

Thanks, I might have a look at it. It's OK to have it haphazard (look at any other NBs here); I treat these more as an experiment record than as proper code presentation.