Closed by egonw 7 years ago
Any updates on this? Can I help with it somehow?
I plan to continue working on nanopub-based versioning starting next week, and such historic WikiPathways nanopubs would be a perfect dataset for that.
@mkutmon, I'd like to have historic GPML dumps too for Denise, as we want to count the growth in the number of metabolites over time... can we chat next week to see what is needed, and what Denise might be able to contribute here?
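A minimal sketch of how the metabolite growth could be counted once GPML snapshots are available. It assumes that GPML marks metabolites as `DataNode` elements with `Type="Metabolite"` and a child `Xref` carrying `Database` and `ID` attributes; the function names, the sample documents, the `HypoDB` database name and the release labels are all hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, so any GPML schema version matches."""
    return tag.rsplit("}", 1)[-1]


def metabolite_ids(gpml_text):
    """Return the set of (database, id) pairs for metabolite DataNodes.

    Assumption: metabolites are DataNode elements with Type="Metabolite".
    Nodes without an Xref ID fall back to their TextLabel.
    """
    root = ET.fromstring(gpml_text)
    found = set()
    for node in root.iter():
        if local_name(node.tag) != "DataNode":
            continue
        if node.get("Type") != "Metabolite":
            continue
        xref = next((c for c in node if local_name(c.tag) == "Xref"), None)
        if xref is not None and xref.get("ID"):
            found.add((xref.get("Database", ""), xref.get("ID")))
        else:
            found.add(("label", node.get("TextLabel", "")))
    return found


def metabolites_per_snapshot(snapshots):
    """Map {release_name: [gpml_text, ...]} to {release_name: distinct count}."""
    return {
        name: len(set().union(*(metabolite_ids(g) for g in gpmls)))
        for name, gpmls in snapshots.items()
    }


# Hypothetical sample documents; real GPML files carry many more attributes.
SAMPLE_OLD = (
    '<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Demo">'
    '<DataNode TextLabel="met-1" Type="Metabolite">'
    '<Xref Database="HypoDB" ID="M1"/></DataNode>'
    '</Pathway>'
)
SAMPLE_NEW = (
    '<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Demo2">'
    '<DataNode TextLabel="met-1" Type="Metabolite">'
    '<Xref Database="HypoDB" ID="M1"/></DataNode>'
    '<DataNode TextLabel="met-2" Type="Metabolite">'
    '<Xref Database="HypoDB" ID="M2"/></DataNode>'
    '<DataNode TextLabel="gene-1" Type="GeneProduct">'
    '<Xref Database="HypoDB" ID="G1"/></DataNode>'
    '</Pathway>'
)

if __name__ == "__main__":
    counts = metabolites_per_snapshot(
        {"older": [SAMPLE_OLD], "newer": [SAMPLE_OLD, SAMPLE_NEW]}
    )
    for release, count in counts.items():
        print(release, count)
```

Counting distinct (database, id) pairs rather than raw nodes keeps a metabolite that appears in several pathways from being counted more than once per snapshot.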
While the complete history of every GPML is represented in the wiki database, we do not have "historic GPML dumps". We only started making and storing versioned releases of GPML this past year. So, as far as I know, the only way to get this information is to set up a MySQL instance, load one of our recent SQL dumps and then compose SQL queries to extract the GPML text per pathway within given date limits. I am familiar with the database schema and can help with composing the query (if needed). But the first obstacle is the fact that we can't make our SQL dumps public; they contain user information. So someone from the dev team needs to do this, or at least set up a clean instance for whoever is going to execute the queries.
Here's a rundown of the schema flow for this type of query:
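As a minimal sketch of such a query, assuming the older stock MediaWiki layout (`page` → `revision` → `text`) rather than the actual WikiPathways schema, which is not shown in this thread. The helper names and demo rows are hypothetical, and an in-memory SQLite database stands in for the MySQL instance so the example runs on its own.

```python
import sqlite3

# Assumed schema: older MediaWiki page -> revision -> text chain.
# The real WikiPathways tables may differ, and a namespace filter
# may be needed to select pathway pages only.
# For MySQL drivers the placeholder is usually %s instead of ?.
QUERY = """
SELECT p.page_title, r.rev_timestamp, t.old_text
FROM page p
JOIN revision r ON r.rev_page = p.page_id
JOIN text t ON t.old_id = r.rev_text_id
WHERE r.rev_timestamp = (
    SELECT MAX(r2.rev_timestamp)
    FROM revision r2
    WHERE r2.rev_page = p.page_id
      AND r2.rev_timestamp <= ?
)
ORDER BY p.page_title
"""


def gpml_as_of(conn, cutoff):
    """Return {page_title: gpml_text} for the latest revision at or before cutoff.

    cutoff is a MediaWiki-style timestamp string, YYYYMMDDHHMMSS.
    """
    return {title: body for title, _ts, body in conn.execute(QUERY, (cutoff,))}


def demo_connection():
    """Build a tiny in-memory database with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER,
            rev_text_id INTEGER,
            rev_timestamp TEXT
        );
        CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);

        INSERT INTO page VALUES (1, 'WP1'), (2, 'WP2');
        INSERT INTO text VALUES
            (10, '<Pathway v1/>'),
            (11, '<Pathway v2/>'),
            (12, '<Pathway other/>');
        INSERT INTO revision VALUES
            (100, 1, 10, '20150101000000'),
            (101, 1, 11, '20150601000000'),
            (102, 2, 12, '20150701000000');
        """
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for title, gpml in gpml_as_of(conn, "20150301000000").items():
        print(title, gpml)
```

Running the same function with one cutoff per month would yield a monthly series of snapshots from a single database load.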
For the purpose of my nanopub case study, a couple of months of data is probably sufficient. More is better, but having 6 months or so of data would already be great. How frequently are these GPML dumps/releases made?
Ah, in that case, you're in luck! :)
We've got 10 months of GPML dumps here: http://data.wikipathways.org/
Just click on a monthly release dir and then on the "gpml" dir.
...I was thinking of "historical" as the past 9 years :)
Yes, "historic" is a big word for what I am after :)
These 10 monthly snapshots are a very good starting point, thanks! Egon, what is required to make nanopublications from them?
I guess having more fine-grained (e.g. daily) snapshots is as difficult as going farther back in time?
Historical note: when we first discussed making these incremental nanopub sets, it was not 10 months yet :)
The data downloads have GPML sets. We need to convert these to WPRDF and then to nanopubs, because I don't think we have archived WPRDF for these GPML data sets yet. I've put it on my todo list.
What about the files in the 'rdf' subdirectories? http://data.wikipathways.org/
@AlexanderPico, yes, I realized that too, but have not had time yet to check if there is RDF for every release... I have that on my todo list... but from your reply I guess it's all there... that simplifies it a lot :)
@tkuhn, sorry for the delay... some unexpected high-priority things came up...
I have started a code base which is now undergoing a first test run: https://github.com/egonw/histwpnanopubs
I'll ping you soon, once I have a batch of TRiG files!
Nice!
Did you manage to look into issue #9 too?
OK, four months of historic nanopubs have been emailed to @tkuhn, based on this code: https://github.com/egonw/histwpnanopubs/commit/94ef84f7476cc9478f31d5301e48c69f30341a8d
based on GPML from the past, maybe the same years as we calculate gene counts for?
Tina, so, a second reason I'd like to have historic GPML files :)