wgc-hackathon / covid

Analysis of publicly available COVID-19 data to identify the next variant of concern.
GNU General Public License v3.0

Potentially dominant variants #14

Open maddyboo opened 3 years ago

maddyboo commented 3 years ago

One to file under 'project proposal'/suggestion...

Is there enough sequencing data, spread around the world, to identify variants that appear to be 'winning' the evolution race against other strains?

If there is enough data, would a tool be useful that highlights strains which, although currently at relatively low numbers, show signs of becoming dominant? (A 'locally' significant rise in percentage of cases: one that you want to contain in a region to prevent further spread.) Maybe a bit late for COVID-19, but it could be pointed at other datasets in future I guess...?

I've some python experience, although certainly not a data-processing pro - there are definitely better placed/skilled people to do this! Although either way this could be too ambitious/'big picture' for the scope of the hackathon...?! 😁

bethsampher commented 3 years ago

Hi @maddyboo , thanks for your great suggestion! There is a lot of data out there, certainly for the UK and I think this would still be a really useful tool for COVID-19. For example, once everyone is vaccinated and case numbers are lower it'll be important to spot any variants showing signs of spreading as these could potentially be vaccine resistant! If you want to have a go at this, please do- that's what the hackathon is all about! Feel free to do as much or as little as you'd like and shout if you need any help :)

maddyboo commented 3 years ago

Good point about keeping an eye out for escape mutations!

Definitely up for giving something a go! I'll keep an eye on #7 as @mathewcsims will break some of the headache of public COVID data extraction, in the meantime I can make a fake dataset to try and crack the data analysis / trend detection side of things.

Will keep you posted - need to get up to speed on git too! 🙈
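
As a rough illustration of the fake-dataset idea above, here is a minimal sketch that invents weekly sequence counts for a baseline strain and an emerging one. It is not the dataset used in the thread: the function name `make_fake_weekly_counts`, the logistic growth curve, and every number are assumptions chosen only to give trend-detection code something to chew on.

```python
import math
import random


def make_fake_weekly_counts(weeks=20, total_per_week=500, seed=0):
    """Return invented weekly counts for a baseline and an emerging variant.

    The emerging variant's share follows a logistic curve, so it starts
    rare and gradually takes over. All numbers are made up.
    """
    rng = random.Random(seed)
    rows = []
    for week in range(weeks):
        # Logistic share: close to 0 early on, 0.5 at the midpoint week
        share = 1.0 / (1.0 + math.exp(-0.6 * (week - weeks / 2)))
        # Add a little noise and keep the share within [0, 1]
        noisy_share = min(1.0, max(0.0, share + rng.gauss(0, 0.02)))
        emerging = round(total_per_week * noisy_share)
        rows.append({
            "week": week,
            "baseline": total_per_week - emerging,
            "emerging": emerging,
        })
    return rows


fake_data = make_fake_weekly_counts()
```

Fixing the seed keeps the fake data reproducible, which makes it easier to compare results after tweaking the analysis.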

bethsampher commented 3 years ago

Sounds good! Looking forward to seeing your progress😁

JamesABaker commented 3 years ago

This is a good guide for just the basics of making a pull request on git:

https://www.dataschool.io/how-to-contribute-on-github/

Note you only actually make your coding changes at step 9.

maddyboo commented 3 years ago

Ok, a first proof of concept is here. I've forked, cloned, branched - hopefully correctly?!

I've not done a pull request back to wgc as it just uses some data I randomly typed into a spreadsheet, trying to create something reminiscent of what was shown when B.1.1.7 became dominant in Kent/Tier 4 regions. This helped keep it simple whilst trying to get my head around data analysis in Pandas/python etc. I suspect there is a more elegant way to iterate through data than the for loops I'm using, but it seems to work!

Next step is to import real data into it, but figure I might be best waiting to see if that nut is cracked in other forks before duplicating headscratching...?

The next major issue then is deciding what counts as a worthy trigger to raise it as a variant of interest - statisticians and epidemiologists would be better placed here than this scientist (I'm a chemist)!

JamesABaker commented 3 years ago

This looks great! Nothing wrong with the loops. It looks clean and readable to me :)

Feel free to PR even though the data is not real. Other people are working on extracting the real data, and it should not be that difficult to fudge the format into what your script accepts.

The statistics can certainly come later once we know what form the data takes exactly. I agree.

bethsampher commented 3 years ago

As James said, looks great! Well done @maddyboo ! If you did want to try using real data you could download CSVs for a few variants on Cov-GLUE and go from there. Deciding what the variant of interest trigger should be is definitely a challenge and should probably consider various factors. I think the way you're doing it now based on case number growth is a very good place to start though!
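
For anyone wanting to experiment with a growth-based trigger, here is a minimal sketch. It is not the script in the repository: it flags a variant once its share of all sequences has risen for several consecutive weeks, and the function names, `min_rise`, and `run_length` are hypothetical placeholders rather than tuned values.

```python
def weekly_shares(variant_counts, total_counts):
    """Fraction of sequences belonging to the variant in each week."""
    return [v / t if t else 0.0 for v, t in zip(variant_counts, total_counts)]


def first_flagged_week(variant_counts, total_counts, min_rise=0.02, run_length=3):
    """Index of the week that completes `run_length` consecutive rises in share.

    Each rise must be at least `min_rise` (as a fraction of all sequences).
    Returns None if no such run occurs. Both thresholds are placeholders.
    """
    shares = weekly_shares(variant_counts, total_counts)
    streak = 0
    for week in range(1, len(shares)):
        if shares[week] - shares[week - 1] >= min_rise:
            streak += 1
            if streak >= run_length:
                return week
        else:
            # Any week without a big enough rise resets the run
            streak = 0
    return None


# Invented example: the variant climbs from 1% to 20% of sequences
variant = [1, 1, 2, 5, 9, 14, 20]
total = [100] * 7
flagged = first_flagged_week(variant, total)
```

Requiring a run of rises rather than a single jump is one simple way to trade a little sensitivity for fewer false alarms on noisy data.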

maddyboo commented 3 years ago

Pull requested! 😃

I agree with you both - step by step!

I'll have a look on Cov-GLUE and see how the code performs on a csv combining a few variants (perhaps try and find some of the more notorious variants of interest at the moment too and see if/when they are spotted).

As you say should be pretty easy to add to the code to marry up other efforts on what is the harder bit - sweeping up a huge amount of data and making it nice to use!

bethsampher commented 3 years ago

Now merged :)

maddyboo commented 3 years ago

Cool! About 2 seconds ahead of me updating the version log, but I'll delete the branch and it'll go in the next one as a super minor change!

Next - try to get my head around Cov-GLUE and find some data to combine and test it with 🤓

bethsampher commented 3 years ago

Oh no, sorry about that! Sounds good- thanks for your efforts so far!

maddyboo commented 3 years ago

Some really nice datasets here - https://covariants.org/per-country

Don't want to jinx it, but kinda also wanted to say/tease early signs are positive I think! 🤞😀

bethsampher commented 3 years ago

Yes, looks good! It's a shame there's no CSV to download from there. I think most of these sites get their data from GISAID, although I'm not quite sure how to use it.

maddyboo commented 3 years ago

Yeah, JSON really isn't my favourite!

I tried through Cov-Glue but kept breaking the website trying to get the largest/most common mutation (the first major variant I think) to use as baseline data... Then stumbled on this which made life (sort of) easier.

After chopping away at the JSON yesterday I got something readable, then hurriedly tried my code. From 14 variants, only 3 flagged (the two major plus a false alarm). Need to check and double check, then clean up my JSON hacking...

bethsampher commented 3 years ago

That's great progress! Looking forward to seeing the results :)

JamesABaker commented 3 years ago

JSON to CSV is a common problem that crops up for me. I have bookmarked this how to.

Simply put, you load the file as a JSON object, then write each record out as a CSV row.

# Python program to convert
# JSON file to CSV

import csv
import json

# Open the JSON file and load its contents into `data`
with open('data.json') as json_file:
    data = json.load(json_file)

# List of per-variant records (one dict per CSV row)
variant_data = data['var_details']

# newline='' stops the csv module writing blank rows on Windows;
# the with-block closes the file even if a write fails
with open('data_file.csv', 'w', newline='') as data_file:
    csv_writer = csv.writer(data_file)

    for count, var in enumerate(variant_data):
        if count == 0:
            # Write the CSV header from the first record's keys
            csv_writer.writerow(var.keys())

        # Write one CSV row per record
        csv_writer.writerow(var.values())

maddyboo commented 3 years ago

Nice! I'll give that a whirl, thanks!

There is a JSON to Pandas importer, but I couldn't work out if it was me (certainly partially), pandas (maybe.. there are lots of options) or that the JSON has too much information in it (it's kind of like tables within tables - each country has a data table). So I think I have to break it down into countries first (cue butchering in notepad for a rough and ready test).
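
The "break it down into countries first" step can be sketched with the standard library alone. The nested layout below is an invented stand-in for a "tables within tables" file, not the real CoVariants schema, and `flatten_country` plus all column names are hypothetical.

```python
import csv
import io
import json

# Invented stand-in for a nested file with one table per country.
# This is NOT the real CoVariants schema.
raw = """
{
  "United Kingdom": {
    "week": ["2020-11-02", "2020-11-09"],
    "total_sequences": [900, 1000],
    "variant_a": [30, 80]
  },
  "Spain": {
    "week": ["2020-11-02", "2020-11-09"],
    "total_sequences": [60, 75],
    "variant_a": [0, 2]
  }
}
"""


def flatten_country(data, country):
    """Turn one country's column lists into a list of row dicts."""
    table = data[country]
    columns = list(table)
    # zip(*...) walks the column lists in step, giving one tuple per week
    return [dict(zip(columns, values)) for values in zip(*table.values())]


data = json.loads(raw)
uk_rows = flatten_country(data, "United Kingdom")

# Write the rows as CSV text; swap io.StringIO for open(..., newline="") to save a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(uk_rows[0]))
writer.writeheader()
writer.writerows(uk_rows)
csv_text = buffer.getvalue()
```

Once each country is a flat list of rows like this, it should load into a DataFrame or spreadsheet without any hand editing.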

maddyboo commented 3 years ago

A new pull request is in for v0.5 - big change is obviously real data!!

I've stuck to just using a data snapshot at the moment so it's easier for us to get the same result out the other end when trying modifications to trigger rules etc., but it will be relatively easy to switch over to pulling 'live' data I think. It also currently only processes one country at a time whilst being tweaked/refined etc.

I use Spyder so the plots automatically show, but I've added some commented out lines to allow you to output .pngs to the folder you are running from if you don't see plots when running. Likewise I've added a csv output at the end which you could turn back on.

Have a look through the results folder (or give it a run yourselves!) and see what you think... It's still a bit trigger-happy I feel - quite a few false alarms on spiky data from France for example. However, setting a threshold of at least 70 sequenced results helped stop those false alarms from the Spain data (when sequences/week were quite low to begin with). If you change line 95 to the line below you'll see the improvement it had:

total_query = str(""+variant+"> 0")

Not really sure what the next step is to try and improve the trigger rules further, perhaps have a play yourselves with ideas/solutions? A difficult balance between being sensitive enough to pick up variants early enough, but not so sensitive it constantly alerts.
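
To make the minimum-count filter easier to play with, here is a minimal pandas sketch. It assumes the threshold applies to the variant's own weekly count, which the thread does not confirm, and the DataFrame, its column names, and `min_sequences` are invented for illustration.

```python
import pandas as pd

# Invented weekly counts for one country; column names are hypothetical
df = pd.DataFrame({
    "week": ["2020-10-05", "2020-10-12", "2020-10-19", "2020-10-26"],
    "variant_a": [3, 12, 70, 150],
})

variant = "variant_a"
min_sequences = 70  # threshold mentioned in the thread; how it is applied is assumed

# Build the query string in the same spirit as the one-line snippet above,
# but with the threshold held in a variable rather than hard-coded
total_query = f"{variant} >= {min_sequences}"
enough_data = df.query(total_query)

# With the threshold dropped to 0, every week with any sequences is kept
any_data = df.query(f"{variant} > 0")
```

Holding the threshold in one variable makes it straightforward to rerun the same snapshot with different values and compare how many alerts each setting produces.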

bethsampher commented 3 years ago

Apologies for the delay on approval but I've just had a look through and merged! Definitely a good idea to add that 70 sequences threshold. Happy to have a think about how the trigger rules could be improved further, but I think it's good enough that you could start pulling live data if you wanted :)

maddyboo commented 3 years ago

Great! Thanks! (Tried to resync my /main fork and didn't have much success, so the branch stays! 🤷)

Likely won't get around to the next step until the weekend - one thought was whether it was worth seeing what CoVariants/Nextstrain thought of it before making it a standalone checker/reporter type thing here?

bethsampher commented 3 years ago

That's strange, does this help- https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork ? It would be good to hear what the experts think and maybe get their input on the trigger rules too! We could potentially contact them next week. Is there anything in particular you would like to know?

maddyboo commented 3 years ago

Phew, back up to date! I'd been following James' guide, however using the Git Desktop interface instead... Which then meant I didn't realise I should've merged my branch back into my main fork before pulling to the wgc one. 🤦‍♂️ So merged yours back down into my main and all is good!

Yeah, I agree - I think it's worth seeing if it is potentially useful or not, and then offering to add it into the CoVariants development process/GitHub. I think it's wise to do this before putting the effort into making it run nicely on its own (giving a summary of what is new/interesting after that run etc.). Otherwise we'd later have to trim back/adjust the features to fit into CoVariants/wherever, sort of doubling the work!

If it is useful, but doesn't fit with their project then I know it makes sense to finish fleshing out the automation/reporting side of it and we keep it running here instead! 😃 Of course it'd be really valuable to get more input on how to make the triggers better too!

Aside from that my main question really is how/when the variants become listed as a variant in the CoVariants data table. If you look at https://covariants.org/per-country you can see grey "other" variants (although much less in the more recent months). At what point does an "other" gain a variant tag (how prevalent does it need to be?).

My thinking - v0.5 says for the UK 20I/501Y.V1 (B.1.1.7) was looking interesting by the 9th November 2020 by the rapid trigger. But if it was still just an 'other' at this point, that trigger wouldn't have happened.

So if the variant labels come later, then the alerting only works retrospectively on this data. We'd have to go to the GISAID (or CoV-Glue etc.) horse's mouth, where the data is less refined, so noisier and more difficult as a result! Hence why I went for the easy CoVariants option first, where the data is lovely and refined!

Definitely think it's worth getting in touch and seeing where it leads! 👍

bethsampher commented 3 years ago

Hi @maddyboo, sorry I'm only just getting round to responding. Glad to hear your fork is up to date again! Very good point about when variants get separated out from the 'other' category. CoVariants may have already implemented something similar to yours to identify these significant variants, but if not this is where your work could come in! Definitely worth finding out - I think the best way to start the conversation is adding a discussion on their GitHub 👍

maddyboo commented 3 years ago

You've probably had a notification, but see - https://github.com/hodcroftlab/covariants/issues/143 @bethsampher & @JamesABaker