workforce-data-initiative / skills-ml

Data Processing and Machine learning methods for the Open Skills Project
https://workforce-data-initiative.github.io/skills-ml/

Figure out versioning of ONET skill and title extractors #7

Open thcrock opened 7 years ago

thcrock commented 7 years ago

ONET is a temporal dataset, but we haven't been treating it as such. It makes sense to have a 'job_titles_master' and 'skills_master' that are not temporal, but how do we populate these? The Airflow DAG simply processes our S3 copy of the ONET database, regardless of which quarter is being processed. It also repeats this for every quarter, which under the current non-temporal assumptions just results in wasted work.

robinsonkwame commented 7 years ago

I recommend that we store metadata on the latest ONET version used within 'job_titles_master' and 'skills_master', and periodically check the ONET database releases page with something like the following:

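import re
import pandas as pd

db_releases_url = "http://www.onetcenter.org/db_releases.html"
# read_html returns one DataFrame per <table> on the page; take the first
versioned_releases = pd.read_html(db_releases_url)[0]
# Assumes the first cell of that table holds the latest "major.minor" release number
first_cell = str(versioned_releases.iloc[0, 0])
latest_release = float(re.search(r"\d+\.\d+", first_cell).group())
# current_onet_release: the version recorded with the master tables (not defined here)
if latest_release > current_onet_release:
    ...  # fetch, incorporate new ONET release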

I have ONET table extraction code that downloads and extracts tables across subsets of ONET versions, which should be useful for this; I will make it available soon. (The code is related to WDI, but it was written for a separate analysis that I'm working on with @pviechnicki of Deloitte.)
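
To make the metadata idea concrete, here is a minimal sketch of how the version used to build the master tables might be recorded and compared. Everything in it is an assumption for illustration rather than existing skills-ml code: the JSON metadata file, the `onet_version` key, and the helper names `parse_version`, `read_stored_version`, `needs_update`, and `record_version` are all hypothetical. It compares versions as integer tuples instead of floats, so that a release such as "22.10" would sort after "22.9".

```python
import json
import os
import tempfile


def parse_version(text):
    """Turn a version string such as '22.3' into a comparable tuple (22, 3)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def read_stored_version(metadata_path):
    """Return the ONET version recorded for the master tables, or None."""
    if not os.path.exists(metadata_path):
        return None
    with open(metadata_path) as f:
        return json.load(f).get("onet_version")


def needs_update(metadata_path, latest_version):
    """True when no version is recorded or the latest release is newer."""
    stored = read_stored_version(metadata_path)
    if stored is None:
        return True
    return parse_version(latest_version) > parse_version(stored)


def record_version(metadata_path, version):
    """Write the ONET version the master tables were built from."""
    with open(metadata_path, "w") as f:
        json.dump({"onet_version": version}, f)


if __name__ == "__main__":
    # Hypothetical metadata location; a real setup might keep this next to the tables.
    path = os.path.join(tempfile.mkdtemp(), "skills_master_metadata.json")
    print(needs_update(path, "22.3"))   # nothing recorded yet
    record_version(path, "22.3")
    print(needs_update(path, "22.3"))   # same version already processed
```

With a guard like this, a quarterly run could skip the ONET step entirely whenever `needs_update` returns False, which addresses the wasted work described in the first comment.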

robinsonkwame commented 7 years ago

See https://github.com/pviechnicki/taskExplorer/blob/master/create_design_matrix.py, which has useful code snippets for regularly downloading new ONET database releases and parsing tables out of them.
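
For readers who want a feel for the table-parsing step, the following is a rough sketch and is not taken from the linked script. It assumes a release arrives as a zip archive of tab-delimited text files with a header row; the helper names `read_tables` and `make_example_archive`, and the file names, columns, and row in the stand-in archive, are made up for illustration so the example runs without a download.

```python
import csv
import io
import zipfile


def read_tables(zip_bytes, wanted):
    """Parse selected tab-delimited tables out of a zip archive held in memory.

    Returns {table file name: list of row dicts keyed by the header row}.
    """
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            name = member.rsplit("/", 1)[-1]  # ignore any folder prefix
            if name not in wanted:
                continue
            with archive.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text, delimiter="\t"))
    return tables


def make_example_archive():
    """Build a tiny stand-in archive (illustrative contents, not real ONET data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "db_example/Skills.txt",
            "Code\tElement Name\n11-1011.00\tReading Comprehension\n",
        )
        archive.writestr("db_example/Read Me.txt", "not a table\n")
    return buffer.getvalue()


if __name__ == "__main__":
    tables = read_tables(make_example_archive(), {"Skills.txt"})
    print(tables["Skills.txt"][0]["Element Name"])
```

In practice the bytes passed to `read_tables` would come from whatever download step fetches a given release, and the set of wanted file names would depend on which tables feed 'job_titles_master' and 'skills_master'.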