wikipathways / nanopublications

Project to convert content of WPRDF into nanopublications.
BSD 3-Clause "New" or "Revised" License

make historic nanopublications #8

Closed egonw closed 7 years ago

egonw commented 7 years ago

Based on GPML from the past; maybe the same years as we calculate gene counts for?

Tina, so, a second reason I'd like to have historic GPML files :)

tkuhn commented 7 years ago

Any updates on this? Can I help with it somehow?

I plan to continue working on nanopub-based versioning from next week on, and such historic WikiPathways nanopubs would be a perfect dataset for that.

egonw commented 7 years ago

@mkutmon, I'd like to have historic GPML dumps too for Denise, as we want to count the growth of the number of metabolites over time... can we chat next week to see what is needed, and what Denise might be able to contribute here?

AlexanderPico commented 7 years ago

While the complete history of every GPML is represented in the wiki database, we do not have "historic GPML dumps". We only started making and storing versioned releases of GPML this past year. So, as far as I know, the only way to get this information is to set up a MySQL instance, load one of our recent SQL dumps and then compose SQL queries to extract the GPML text per pathway within given date limits. I am familiar with the database schema and can help with composing the query (if needed). But the first obstacle is the fact that we can't make our SQL dumps public; they contain user information. So someone from the dev team needs to do this or at least set up a clean instance for whoever is going to execute the queries.
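A minimal sketch of the kind of query described above: pick, for every pathway, the latest revision at or before a cutoff date. The thread does not spell out the schema, so the `page`, `revision` and `text` tables and their columns below are assumptions modelled on a generic MediaWiki-style layout, and an in-memory SQLite database stands in for the MySQL instance (a MySQL driver would typically use `%s` rather than `?` as the placeholder).

```python
import sqlite3

# Hypothetical MediaWiki-style layout (page -> revision -> text).
# Table and column names are assumptions for illustration only,
# not the confirmed WikiPathways schema.
SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);
CREATE TABLE revision (
    rev_id INTEGER PRIMARY KEY,
    rev_page INTEGER REFERENCES page(page_id),
    rev_text_id INTEGER REFERENCES text(old_id),
    rev_timestamp TEXT  -- fixed-width YYYYMMDDHHMMSS string
);
"""

# Latest revision of each page at or before the cutoff timestamp.
QUERY = """
SELECT p.page_title, r.rev_timestamp, t.old_text
FROM page p
JOIN revision r ON r.rev_page = p.page_id
JOIN text t ON t.old_id = r.rev_text_id
WHERE r.rev_timestamp = (
    SELECT MAX(r2.rev_timestamp)
    FROM revision r2
    WHERE r2.rev_page = p.page_id AND r2.rev_timestamp <= ?
)
ORDER BY p.page_title
"""


def gpml_as_of(conn, cutoff):
    """Return {pathway title: GPML text} as it stood at `cutoff`."""
    return {title: gpml for title, _ts, gpml in conn.execute(QUERY, (cutoff,))}


def demo_connection():
    """Build a tiny in-memory database with made-up rows for testing."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO page VALUES (?, ?)", [(1, "WP1"), (2, "WP2")])
    conn.executemany(
        "INSERT INTO text VALUES (?, ?)",
        [(10, "<Pathway v1/>"), (11, "<Pathway v2/>"), (12, "<Pathway other/>")],
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?, ?)",
        [
            (100, 1, 10, "20120101000000"),
            (101, 1, 11, "20150601000000"),
            (102, 2, 12, "20160101000000"),
        ],
    )
    return conn
```

Calling `gpml_as_of(demo_connection(), "20131231235959")` returns only the first revision of the made-up pathway `WP1`, because the other rows are dated later. Comparing timestamps as strings works here only because they are fixed-width.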

AlexanderPico commented 7 years ago

Here's a rundown of the schema flow for this type of query:

tkuhn commented 7 years ago

For the purpose of my nanopub case study, a couple of months of data is probably sufficient. More is better, but having 6 months or so of data would already be great. How frequently are these GPML dumps/releases made?

AlexanderPico commented 7 years ago

Ah, in that case, you're in luck! :)

We've got 10 months of GPML dumps here: http://data.wikipathways.org/

Just click on a monthly release dir and then on the "gpml" dir.
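For scripting rather than clicking, the same navigation can be sketched as a small URL builder. This assumes that release directory names sort chronologically as plain strings (for example a date stamp); the thread does not state the naming scheme, so the names passed in are up to the caller, and `release_urls` is a hypothetical helper.

```python
BASE_URL = "http://data.wikipathways.org/"


def release_urls(release_names, subdir="gpml", latest=None):
    """Build directory URLs for the given releases, oldest first.

    Assumes release directory names sort chronologically as strings;
    that naming is an assumption, not something stated in the thread.
    """
    names = sorted(release_names)
    if latest is not None:
        # Keep only the `latest` most recent releases (none if latest is 0).
        names = names[max(len(names) - latest, 0):]
    return [f"{BASE_URL}{name}/{subdir}/" for name in names]
```

Passing `latest=6` would select roughly the six months of data mentioned above, provided the caller supplies one directory name per monthly release.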

AlexanderPico commented 7 years ago

...I was thinking of "historical" as the past 9 years :)

tkuhn commented 7 years ago

Yes, "historic" is a big word for what I am after :)

These 10 monthly snapshots are a very good starting point, thanks! Egon, what is required to make nanopublications from them?

I guess having more fine-grained (e.g. daily) snapshots is as difficult as going farther back in time?

egonw commented 7 years ago

Historical note: when we first discussed making these incremental nanopub sets, it was not 10 months yet :)

The data downloads have GPML sets. We need to convert these to WPRDF and then to nanopubs, because I don't think we have archived WPRDF for these GPML data sets yet. I've put it on my todo list.
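To make the last step of that pipeline concrete, here is a rough sketch of wrapping already-converted RDF triples into a nanopublication and serializing it as TriG, using the standard four named graphs (head, assertion, provenance, publication info). It is not the conversion code used for this project: `nanopub_trig` is a hypothetical helper, the `example.org` IRIs and the date are placeholders, and triples are passed in as ready-made Turtle terms.

```python
NP_NS = "http://www.nanopub.org/nschema#"


def _graph(iri, triples):
    """Render one named graph in TriG syntax."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
    return f"<{iri}> {{\n{body}\n}}"


def nanopub_trig(np_iri, assertion, provenance, pubinfo):
    """Serialize one nanopublication as TriG with its four named graphs."""
    this = f"<{np_iri}>"
    graphs = {
        name: f"{np_iri}#{name}"
        for name in ("Head", "assertion", "provenance", "pubinfo")
    }
    # The head graph links the nanopublication to its three content graphs.
    head = [
        (this, "a", "np:Nanopublication"),
        (this, "np:hasAssertion", f"<{graphs['assertion']}>"),
        (this, "np:hasProvenance", f"<{graphs['provenance']}>"),
        (this, "np:hasPublicationInfo", f"<{graphs['pubinfo']}>"),
    ]
    parts = [
        f"@prefix np: <{NP_NS}> .",
        _graph(graphs["Head"], head),
        _graph(graphs["assertion"], assertion),
        _graph(graphs["provenance"], provenance),
        _graph(graphs["pubinfo"], pubinfo),
    ]
    return "\n\n".join(parts) + "\n"


# Placeholder data: every IRI and literal below is illustrative only.
example = nanopub_trig(
    "http://example.org/np/WP1-r1",
    assertion=[(
        "<http://example.org/pathway/WP1>",
        "<http://purl.org/dc/terms/title>",
        '"Example pathway"',
    )],
    provenance=[(
        "<http://example.org/np/WP1-r1#assertion>",
        "<http://www.w3.org/ns/prov#wasDerivedFrom>",
        "<http://data.wikipathways.org/>",
    )],
    pubinfo=[(
        "<http://example.org/np/WP1-r1>",
        "<http://purl.org/dc/terms/created>",
        '"2017-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>',
    )],
)
```

Generating one such nanopub per statement and per release would give one nanopub set per monthly snapshot; an RDF library would be the safer choice for real data, since this sketch does no escaping or validation.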

AlexanderPico commented 7 years ago

What about the files in the 'rdf' subdirectories? http://data.wikipathways.org/

egonw commented 7 years ago

@AlexanderPico, yes, I realized that too, but had not had time yet to check if there is RDF for every release... have that on my todo list... but from your reply I guess it's all there... that simplifies it a lot :)

egonw commented 7 years ago

@tkuhn, sorry for the delay... some unexpected high-priority things came up...

I have started a code base which is now undergoing a first test run: https://github.com/egonw/histwpnanopubs

I'll ping you soon, once I have a batch of TriG files!

tkuhn commented 7 years ago

Nice!

Did you manage to look into issue #9 too?

egonw commented 7 years ago

OK, four months of historic nanopubs have been emailed to @tkuhn, based on this code: https://github.com/egonw/histwpnanopubs/commit/94ef84f7476cc9478f31d5301e48c69f30341a8d