yvs314 / epi-net-m

Data Processing and Simulation Tools for Networked SIR+
MIT License

Proper modular structure for Oboe (Julia data processing routines) #13

Open yvs314 opened 3 years ago

yvs314 commented 3 years ago

Oboe.jl is growing too large; it ought to be carved up.

This can be done one step at a time, e.g. first isolating the loading, collating, and censoring of airport data (BTS and Openflights).

I am not sure about the “proper” way; I have seen “submodules” added via simple include("sub-air.jl") in the main module.

I see the following meaningful submodules and substructures:

Also, I prefer to separate input and output from the overall logic.

This is a rather vague assignment, we will see what can be worked out here.

karaign commented 3 years ago

For posterity, and to help me organize my thoughts, I'll be writing them here.

The terminology used by the Julia package ecosystem is somewhat confusing, at least to someone like me who's more used to npm than virtualenv. The distinction between "package" and "application" is of particular note here:

Package: a project which provides reusable functionality that can be used by other Julia projects via import X or using X.

Application: a project which provides standalone functionality not intended to be reused by other Julia projects. For example a web application or a command-line utility, or simulation/analytics code accompanying a scientific paper.

This project overall seems to fit the definition of "application" rather than "package", although I think there is a case for turning Oboe.jl into a package at some point, since the functionality in it can, in theory, be reused for different kinds of projects. I don't know if that's something you would be interested in, and it probably would complicate the setup, so I think it's best to stick with an application for now.

As for how to actually divide Oboe.jl, here's one idea for the submodules:

This structure more or less matches the data flow through the program (io being both the first and last step), so ideally it would allow each submodule to focus on just one step of the processing. If that's too many submodules, airports could be merged with tracts and travel with epi.
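For concreteness, here is a rough sketch of how the top-level module could wire these submodules together with plain include; the file and module names are placeholders, not a final layout:

```julia
module Oboe

include("io.jl")        # would define module OboeIO: read BTS/Openflights/FluTE, write outputs
include("airports.jl")  # module Airports: load, collate, censor airport data
include("tracts.jl")    # module Tracts: tract handling and aggregation
include("travel.jl")    # module Travel: commute and passenger matrices
include("epi.jl")       # module Epi: epidemic-specific output

using .OboeIO, .Airports, .Tracts, .Travel, .Epi

end # module
```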

karaign commented 3 years ago

I found this package called FromFile that allows for importing modules directly from a file while avoiding the issues with include. To me, this seems more straightforward and less restrictive than grappling with the package system.
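The usage is roughly as follows; the file and module names below are just placeholders:

```julia
using FromFile

# Each submodule lives in its own file and declares its own `module`;
# @from loads that file (only once, even if several files import it)
# and brings the requested name into scope.
@from "airports.jl" import Airports
@from "travel.jl" import Travel
```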

yvs314 commented 3 years ago

Apps, packages and DevOps

I found this package called FromFile that allows for importing modules directly from a file while avoiding the issues with include. To me, this seems more straightforward and less restrictive than grappling with the package system.

Oh my, nearly missed that comment. Good catch, let's go with FromFile then.

[Pkg.jl] terminology

Yes, that hit me too.

This project overall seems to fit the definition of "application" rather than "package" ... there is a case for turning Oboe.jl into a package at some point

Right now it's definitely an application. And I would rather be light on infrastructure, in the sense of minimizing the overhead of updating the code. I can imagine a 3-level overhead classification, all assuming you have the necessary packages installed and have just changed some function in your code:

- no overhead, L1: (1) just run julia whatever.jl ARGS...

- little overhead, L2: (1) run this build.sh, (2) and then a.out ARGS...

- too much overhead, L3: (1) re-compile the package, (2) create a new package environment, (3) install ThisPackage.jl, at long last, (4) run julia whatever.jl ARGS...

As it stands, we're hopefully at L1. If a full-blown package is still at L2, it's fine to have it, but we need to stop before trying to hire a full-time DevOps.

Semantically, I would term this data processing routine a package if it were genericized vis-a-vis input data. That is, if it would make sense to use the same aggregation and processing for different input data. Right now reading from FluTE/BTS/Openflights is soldered in, not connected with a modular plug.
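Purely as an illustration of what a modular plug could look like (none of these names exist in Oboe), the pipeline would accept any reader that yields the expected table, instead of hard-wiring a particular loader:

```julia
using DataFrames

# Illustrative only: a "reader" is any function that returns the expected table.
read_flute_tracts(path) = DataFrame(id = ["t1", "t2"], pop = [100, 200])  # stand-in parser

function process_tracts(read_tracts::Function, path)
    tracts = read_tracts(path)                       # any source with id + pop will do
    combine(groupby(tracts, :id), :pop => sum => :pop)
end

process_tracts(read_flute_tracts, "data/tracts.dat")
```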

Submodules

Overall, the subdivision is sound.

I think airports ought to be treated separately, no need to lump them together with tract processing.

The epi stuff is too small at present; small enough that it is OK to lump it together with output.

One point to make is that this (1) Oboe data processing routine is one of three components; the other two are (2) the differential equations/optimal control problem solver (/m-core) and (3) the visualizer (that VegaLite scratch notebook).

It would make sense for these to be aware of each other, especially (1) and (3) because showing the air travel network is important. Perhaps commute can be visualized too.

As it stands, (2) reads the problem instance generated by (1), and (3) reads the solution, and the tract data. I think I explicitly made the id column in Oboe's output compatible with state/county IDs in the geo shape data set used to draw county borders.

karaign commented 3 years ago

I have set up the basic submodule structure in the modular branch. This technically works, but it still requires some cleanup and refinement (as well as better documentation). For one thing, I think the submodules are too tightly coupled at the moment, mostly because some submodules import methods from other submodules for the sake of providing default arguments. However, getting rid of some of these interdependencies would cause breaking changes in the Oboe module, which I have avoided so far. That said, I'm not sure to what extent the modules should be decoupled, or whether breaking changes are acceptable, so please let me know your thoughts on the matter.

yvs314 commented 3 years ago

Some quick thoughts:

  1. Default arguments were there for some lazy debugging without proper tracking of state. The idea was for the entry point (processOboe in oboe-main.jl) to track them explicitly, which is mostly implemented anyhow. It is OK to remove anything that smells of shortcuts, which they mostly were.
  2. Modules should not be decoupled for decoupling's sake alone (enter MATLAB with its “won't let you call a function in REPL unless it is in its own .m file unless you wrap it into a function object”—ugh!). Some coupling is not just due to ad hoc development but has to do with actual computation.
  3. Frankly, the main decoupling is reading and censoring airport data vs. everything else.
  4. Wonder if the modularization as done in the branch has any effect on performance (of mkPsgMx).
karaign commented 3 years ago

1, 2

Got it.

Frankly, the main decoupling is reading and censoring airport data vs. everything else.

This actually brings me to my next question: should mkPsgMx remain dependent on grpBTS and mkFlightMx2? It seems more appropriate to have the client pass this stuff into mkPsgMx, but once again, breaking stuff is my concern.

Wonder if the modularization as done in the branch has any effect on performance (of mkPsgMx).

Also curious about this. It's kind of an inherently time-consuming test, but I will report on this when I get to it.

yvs314 commented 3 years ago

This actually brings me to my next question: should mkPsgMx remain dependent on grpBTS and mkFlightMx2? It seems more appropriate to have the client pass this stuff into mkPsgMx, but once again, breaking stuff is my concern.

It is indeed more appropriate to call these grpBTS and mkFlightMx2 beforehand, in the main procedure, and pass their results to mkPsgMx. Calling them directly from mkPsgMx was mostly a shortcut, as in, to save a couple of function arguments.
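Schematically, it is the usual "hoist the hidden call into the caller" change; here is a toy version with made-up names rather than Oboe's real signatures:

```julia
# Toy version of the refactor; none of these are Oboe's real signatures.
load_flights() = ["SEA-LAX", "PDX-SFO"]            # stand-in for grpBTS / mkFlightMx2

# Before: the hidden default argument couples this function to the loader.
make_matrix_old(nodes; flights = load_flights()) =
    fill(length(flights), length(nodes), length(nodes))

# After: the caller computes the flight data once and passes it in explicitly.
make_matrix(nodes, flights) =
    fill(length(flights), length(nodes), length(nodes))

nodes   = 1:3
flights = load_flights()                           # done once, in the main routine
M       = make_matrix(nodes, flights)              # 3×3 matrix filled with 2
```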

My original idea of the structure was to have a big state object which would be filled in as the main routine went on: first the AP-AP travel matrix, then the elementary (tract) nodes with designated APs assigned, then the aggregated nodes and partition info, so that most functions that do the real lifting would take this Instance object as a common interface.

But here I stumbled in transitioning from C++ to Julia and went full ad hoc. I mean, it's so scary when you can't call by constant reference or declare methods “const”.
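For the record, a toy sketch of that Instance idea; all the field names are invented here:

```julia
using DataFrames

mutable struct Instance
    flightMx::Union{Matrix{Float64},Nothing}   # AP-to-AP travel matrix
    tracts::Union{DataFrame,Nothing}           # elementary (tract) nodes with designated APs
    aggNodes::Union{DataFrame,Nothing}         # aggregated nodes and partition info
end
Instance() = Instance(nothing, nothing, nothing)

# Each stage of the main routine fills in its slice of the state:
ins = Instance()
ins.flightMx = zeros(2, 2)
ins.tracts   = DataFrame(id = ["t1", "t2"], ap = ["SEA", "LAX"])
ins.aggNodes = DataFrame(id = ["WA", "CA"])
```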

Take your time with tests, feel free to automate them. DataFrames feel rather convenient for this statistical treatment.

karaign commented 3 years ago

I finally got around to benchmarking the new implementation compared to the old one. I can upload the full Jupyter notebook I used if necessary (would I have to make a brand new branch just for that?), but the bottom line is that mkPsgMx now takes ~0.5s less time and 74-75MiB less memory. This can partly be attributed to the fact that mkFlightMx2 is no longer called by mkPsgMx but even then, running a few @time tests showed that mkFlightMx2 only takes up around 0.2s and 19.699 MiB, so that doesn't explain it fully. If I had to guess, I'd say that using proper modules and import/using instead of include speeds things up because of the way Julia handles precompilation or something similar.
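For reference, something like this Base.@timed loop is enough to collect the per-run numbers; workload() is just a stand-in for the real calls:

```julia
# Collect time and memory for repeated runs with Base.@timed.
workload() = sum(rand(10^7))        # stand-in for mkPsgMx / mkFlightMx2 etc.

runs = map(1:5) do _
    s = @timed workload()
    (time_s = s.time, mem_MiB = s.bytes / 2^20)
end
```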

yvs314 commented 3 years ago

This is good news, and it does make sense with precompilation, etc. Thou shalt respect thy optimizing compiler: the modular structure is apparently clearer to it, and perhaps removing the default arguments that were function calls has also helped. I can't help ruminating that, judging at least from my C++ experience, 1970s-style code golf (fewer variables, inlining functions by hand, etc.) can safely be abandoned, or rather, left to the compiler, which does it better than most coders, unless something is distinctly ill-optimized, and that will be caught by the benchmarks.

I can upload the full Jupyter notebook I used if necessary

I'd prefer a table-like summary here in this conversation. Consider writing or finding a Markdown table pretty-printer for the benchmarks so it's easier to post them here. This is not the last you'll see of performance benchmarks, and some will crop up in visualization too; these quality-of-life thingies should be done from time to time.
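Even a hand-rolled one would do; something like this (plain Base, column names made up):

```julia
# Tiny Markdown table printer: header plus equal-length rows.
function md_table(io::IO, header::Vector{String}, rows::Vector{<:Vector})
    println(io, "| ", join(header, " | "), " |")
    println(io, "|", repeat(" --- |", length(header)))
    for r in rows
        println(io, "| ", join(string.(r), " | "), " |")
    end
end

md_table(stdout,
         ["run", "mkPsgMx_TIME [s]", "mkPsgMx_MEM [MiB]"],
         [[1, 0.731, 693.4], [2, 0.728, 693.4]])
```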

would I have to make a brand new branch just for that?

No, no need for a new branch. It's quite OK to commit documentation and similar fixes directly to master. In fact, I have updated the contributing guidelines and repository access to that effect about a month ago. Please test that you can commit directly to master on something safe, like a couple words in the main readme.

karaign commented 3 years ago

After much data wrangling, here are the tables containing the time and memory taken by the 3 major steps (aggregation, commute matrix, passenger matrix) for different aggregation levels, before and after the recent changes. I only did 5 runs for each configuration to save time (but now that I think about it, it would be pretty easy to extend the sample by just adding onto the data I already have). Mean values are in bold, and everything here was run on WA+OR+CA.

By tract

Old version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.867369 | 949.982 | 1.310275 | 767.869 |
| 0 | 0 | 1.239195 | 949.982 | 1.256512 | 767.870 |
| 0 | 0 | 0.936267 | 949.982 | 1.439364 | 767.870 |
| 0 | 0 | 1.161487 | 949.982 | 1.163816 | 767.870 |
| 0 | 0 | 1.177220 | 949.982 | 1.201242 | 767.870 |
| **0.0** | **0.0** | **1.076308** | **949.982** | **1.274242** | **767.8698** |

New version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1.146760 | 949.981 | 0.731168 | 693.360 |
| 0 | 0 | 1.116433 | 949.981 | 0.727866 | 693.360 |
| 0 | 0 | 1.167107 | 949.981 | 0.694101 | 693.360 |
| 0 | 0 | 1.105395 | 949.981 | 0.718226 | 693.360 |
| 0 | 0 | 0.953838 | 949.981 | 0.744612 | 693.360 |
| **0.0** | **0.0** | **1.097907** | **949.981** | **0.723195** | **693.36** |

By state

Old version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.965378 | 415.555 | 0.869572 | 285.477 | 0.985289 | 146.388 |
| 0.957478 | 415.555 | 0.844367 | 285.477 | 0.978177 | 146.388 |
| 0.969878 | 415.555 | 0.869694 | 285.477 | 0.994103 | 146.388 |
| 0.959070 | 415.555 | 0.855980 | 285.477 | 0.972084 | 146.388 |
| 0.661272 | 415.555 | 0.851297 | 285.477 | 1.004495 | 146.388 |
| **0.902615** | **415.555** | **0.858182** | **285.477** | **0.98683** | **146.388** |

New version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.970889 | 415.555 | 0.875064 | 285.469 | 0.497077 | 71.226 |
| 0.970832 | 415.555 | 0.863808 | 285.469 | 0.492479 | 71.226 |
| 0.976225 | 415.554 | 0.860635 | 285.469 | 0.493527 | 71.225 |
| 1.006727 | 415.555 | 0.869938 | 285.471 | 0.501679 | 71.226 |
| 0.657816 | 415.555 | 0.872459 | 285.473 | 0.491906 | 71.226 |
| **0.916498** | **415.5548** | **0.868381** | **285.4702** | **0.495334** | **71.2258** |

By county

Old version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.163754 | 19.509 | 0.972251 | 218.461 | 0.952619 | 146.243 |
| 0.153433 | 19.509 | 1.004088 | 218.461 | 0.933829 | 146.243 |
| 0.150711 | 19.509 | 0.763788 | 218.461 | 0.930468 | 146.243 |
| 0.150629 | 19.509 | 0.777311 | 218.461 | 0.919843 | 146.243 |
| 0.153005 | 19.509 | 0.763526 | 218.461 | 0.908921 | 146.243 |
| **0.154306** | **19.509** | **0.856193** | **218.461** | **0.929136** | **146.243** |

New version

| AGGREGATION_TIME [s] | AGGREGATION_MEM [MiB] | mkCmtMx_TIME [s] | mkCmtMx_MEM [MiB] | mkPsgMx_TIME [s] | mkPsgMx_MEM [MiB] |
| --- | --- | --- | --- | --- | --- |
| 0.161269 | 19.509 | 0.996443 | 218.454 | 0.446715 | 71.081 |
| 0.157475 | 19.509 | 0.979300 | 218.456 | 0.447303 | 71.080 |
| 0.149928 | 19.509 | 0.740349 | 218.453 | 0.443181 | 71.080 |
| 0.154606 | 19.509 | 0.987242 | 218.453 | 0.444287 | 71.081 |
| 0.155693 | 19.509 | 0.993001 | 218.454 | 0.444809 | 71.081 |
| **0.155794** | **19.509** | **0.939267** | **218.454** | **0.445259** | **71.0806** |

Will put out the notebook in a bit. Fair warning, it's pretty haphazard at the moment; I got a little carried away trying to figure out the best way to present the data, and there's some weirdness where I couldn't decide whether things should be treated as numbers or strings...

karaign commented 3 years ago

Please test that you can commit directly to master on something safe, like a couple words in the main readme.

I just tried to do that and it seems that the branch is still protected so I can't push to it. Here is the output from git push:

```
> git push origin master:master
remote: error: GH006: Protected branch update failed for refs/heads/master.
remote: error: At least 1 approving review is required by reviewers with write access.
To https://github.com/yvs314/epi-net-m.git
 ! [remote rejected] master -> master (protected branch hook declined)
error: failed to push some refs to 'https://github.com/yvs314/epi-net-m.git'
```
yvs314 commented 3 years ago

I just tried to do that and it seems that the branch is still protected so I can't push to it.

Right. Well, I have just scrapped the branch protection entirely, so please try again. Any features must still go through a pull request and review, but feel free to upload notebooks and update the documentation straight in the mainline branch. Careful, non-breaking changes are OK in master too, but those would still be better served by a separate branch.

Thanks for the data wrangling; the tables are quite clear. So it's safe to say that the improvement in mkPsgMx time and memory usage in the new modular version is not a fluke, which is good to know.

Will put out the notebook in a bit. Fair warning, it's pretty haphazard at the moment...

Thanks, I might have a look at it. It's OK to have it haphazard (look at any other NBs here); I treat these more as an experiment record than as proper code presentation.