workforce-data-initiative / skills-ml

Data Processing and Machine learning methods for the Open Skills Project
https://workforce-data-initiative.github.io/skills-ml/

Implementing versioning #10

Open thcrock opened 7 years ago

thcrock commented 7 years ago

What does it mean for us to have versioned corpora? Do we create a new version every time we run the pipeline? How do we promote a new version to the tables that power the API?

robinsonkwame commented 7 years ago

We haven't pinned down the exact versioning schema, but we know it should allow someone to reconstruct the input data and results of a versioned machine learning model in production.

Ideally, the versioning schema will summarize the rows selected from skills_master_table and jobs_master_table, along with any NLP transforms, etc., applied to preprocess the corpora before they are passed to the ML side of things.

thcrock commented 7 years ago

I guess there are two general approaches (both examples presuppose that we hold what the API is looking at in an S3 folder called 'current_corpora'):

  1. Every time the pipeline runs, a new corpus version is created. The act of publishing a corpus to the API works more like an alias: we point 'current_corpora' at 'current_corpora_xyz123'. Keeping all of these around could explode our AWS bill.

  2. The pipeline's general behavior is to push to an unversioned store. The act of publishing a corpus to the API actually copies 'current_corpora' to an archived version, then replaces 'current_corpora' with what is being promoted from the unversioned store.
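To make the difference concrete, here's a minimal sketch of approach 1's alias-style publish. S3 has no symlinks, so the alias would likely be a small pointer object; a dict stands in for the bucket here, and all key names are hypothetical, but in production the same two operations would be `put_object`/`get_object` calls.

```python
import json

store = {}  # key -> bytes, standing in for the S3 bucket in this sketch

def publish(version_id):
    """'Publish' by repointing the alias at an already-uploaded
    versioned prefix -- no corpus data is copied."""
    pointer = {"prefix": "corpora/%s/" % version_id}
    store["current_corpora.json"] = json.dumps(pointer).encode("utf-8")

def current_prefix():
    """Resolve the alias the API endpoints should read from."""
    return json.loads(store["current_corpora.json"])["prefix"]
```

Publishing is then a constant-time metadata write, which is what makes approach 1 cheap to run but expensive to store: every versioned prefix it ever pointed at has to stick around.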

robinsonkwame commented 7 years ago

I see the versioned corpora as typically being useful only for the machine learning pipeline itself, with most other API endpoints referring to the current versions of the tables. This is because we will likely need to select subsets of table rows and perform different transforms to achieve (and reproduce) the results the ML-powered API endpoints need.

So I would argue for the reconstructive approach, at least for the ML-related work:

Here, the skills_master_table (for the sake of argument) is always growing with new content, and each row has a row_id larger than the one before it (monotonically increasing). We can then reconstruct any version of the corpora by storing a unique vector of integer indices into the skills_master_table (or even just an offset from zero for the rows included), along with references to the NLP transform GitHub commits to pull down when transforming the corpora. This exchanges computation for storing multiple versions of the corpora, and the reconstruction is embarrassingly parallel.
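A minimal sketch of that manifest, assuming the table and field names from the discussion (the `CorpusVersion` structure itself is hypothetical):

```python
from collections import namedtuple

# A version is fully described by the row_ids it drew from
# skills_master_table plus the commit of the transform code.
CorpusVersion = namedtuple("CorpusVersion", ["row_ids", "transform_commit"])

def reconstruct(version, skills_master_table):
    """Rebuild the exact input corpus for a versioned model.

    skills_master_table: mapping of row_id -> row content. Because
    row_ids are assigned monotonically, the stored ids select the same
    rows no matter how much the table has grown since the version was cut.
    """
    return [skills_master_table[i] for i in sorted(version.row_ids)]
```

Each row lookup is independent, which is what makes the reconstruction embarrassingly parallel: the sorted id vector can be sharded across workers with no coordination.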

If we're afraid the vector of indices will grow too large, we could use something like a Bloom filter and check the indices from 0 to the size of the current corpus at runtime, yielding the included indices in parallel. Or we could apply some kind of simple differential-compression scheme to the vector (probably the better choice).
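The differential-compression option is a few lines: since the index vector is sorted, store the gaps between consecutive indices rather than the absolute values. For a mostly-contiguous corpus the gaps are tiny integers that compress far better than raw ids. A sketch (function names are mine, not from any library):

```python
def delta_encode(indices):
    """Sorted absolute indices -> first index plus successive gaps."""
    indices = sorted(indices)
    return [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating a running total."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

Unlike the Bloom-filter option, this stays exact (no false positives), and the delta stream can be fed to any general-purpose compressor afterward.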

I also think this discussion would benefit from a technical deep dive and/or working group feedback.