Soapfuse format support

avkitex commented 8 years ago

Dear yhoogstrate, Thanks for nice program. It was quite easy to install and run it, thanks to prowided wiki page. It would be great to have SOAPfuse format support. There is description of it here: https://sourceforge.net/p/soapfuse/wiki/Output_Files/

Is there some fuma-specific input format? It could be usefull for some other not implemented programms. Users will convert their results into it and use fuma. For example, there is programm ericscript. It works only with hg38 and it's very complicated to compare it's results with others (hg19). It could be nice to convert it's output to fuma-specific format and compare with chimerascan or defuse.

Thank you!

Best regards, Nikita

yhoogstrate commented 8 years ago

Hey Nikita, @avkitex,

Great to see you enthousiasm,

It's quite easy for me to implement other formats. I do not have direct access to an infrastructure capable of quickly running these tools right now, so if you could provide a truncated output file (e.g. as github gist) for ericscript or soapfuse, I can use it for implementation and immediately test whether the patch works. The website of SOAPFuse is for example not clear if the positions are 0 or 1 based, columns are delimited with spaces or tabs and which chars may be used for comments, if they are allowed. This I can easily extract from an example output file and I will update FuMa a.s.a.p.

Is there some fuma-specific input format?

There is not really such thing and I don't want to design yet another file format because there are quite some around. You could use any of the formats that are already implemented.

Please, if you find anything not working properly or if you have more suggestions or questions don't hazitate and feel free to ask.

All best and thank you too,

Youri

avkitex commented 8 years ago

Here are results for prostate cancer dataset ERR031017 obtained with soapfuse ERR031017.final.Fusion.specific.for.genes.txt - specific for genes, ERR031017.final.Fusion.specific.for.trans.txt - specific for transcripts. I think soapfuse is zero-based, tab delimited. There are lots of extra information in soapfuse output, but these 2 are main results for fusions.; chimerascan (for you to compare) ERR031017_chs_res.txt; And ericscan: ERR031017.results.total.txt

They all have common fusion AZGP1--GJC3.

avkitex commented 8 years ago

Soapfuse results are always called as samplename.final.Fusion.specific.for.genes or samplename.final.Fusion.specific.for.transcripts Ericscan results are always called as samplename.results.total.tsv or samplename.results.filtered.tsv Filtered - all with ericscore (some fusion event score) less than 0,5 are filtered out. Thanks a lot for the improvement!

yhoogstrate commented 8 years ago

I have downloaded the files. If you want to keep the data confidential, please let me know. I usually keep 2 lines of such files as example within the code, which I won't do if it's confidential of course...

avkitex commented 8 years ago

These datasets are publicly available so there is no need of keeping it confidential

2016-02-07 18:47 GMT+03:00 yhoogstrate notifications@github.com:

I have downloaded the files. If you want to keep the data confidential, please let me know. I usually keep 2 lines of such files as example within the code, which I won't do if it's confidential of course...

— Reply to this email directly or view it on GitHub https://github.com/yhoogstrate/fuma/issues/14#issuecomment-181035336.

yhoogstrate commented 8 years ago

Hey @avkitex,

I have updated FuMa and included your suggested file formats (https://github.com/yhoogstrate/fuma/releases/tag/v2.11.2). Thanks again, it's really appreciated! Feel free to test and give feedback if you like :).

I have a small concern about EricScript. The output file contains quite often 'Unable to predict breakpoint position'. Based on just such output it is impossible to load those fusion genes. These will be skipped and FuMa will give a warning. If you want to use such files, you should replace those entries with an arbitrary genomic position within the gene of interest. If this is really important to you we can discuss to write some conversion script although I would prefer to let the authors of EricScript propose a solution to this (e.g. adding a column with the gene coordinates).

Todo's before I can close this issue:

[x] update the galaxy wrapper
[x] update readme by updating the file format section
[ ] update readme explaining how to obtain new gene annotation files and what to do when you compare across genomes (e.g. liftOver)

avkitex commented 8 years ago

I've already asked Ericscript autors about it https://groups.google.com/forum/#!topic/ericscript/ucO6vM07hiM ----How can i deal with "Unable to predict breakpoint position"? YARS2 SLC41A2 12 Unable to predict breakpoint position - 12 Unable to predict breakpoint position - ENSG00000139131 ENSG00000136052 4 11 101.75

They answered: -----These are cases for which the predicted fusion junction has a sequence that is distant from the original reference. I use to consider these events as likely false positives.

I think this is because ericscan found no junction reads for this fusion genes pair So skipping these hits is the best solution for now.

avkitex commented 8 years ago

I've downloaded hg19 bed file from https://github.com/yhoogstrate/fuma/tree/master/tests/data Do you have prebuild hg38? Ericscript supports only hg38 -_-

avkitex commented 8 years ago

I've tested the latest version runstring $fuma --strand-specific-matching --acceptor-donor-order-specific-matching -a "hg19:refseq_genes_hg19.bed.txt" -s "chimerascan:chimerascan:ERR031017_chs_res.txt" "soapfuse:soapfuse-final-gene:ERR031017.final.Fusion.specific.for.genes" "ericscript:ericscript:ERR031017.results.total.tsv" -l "chimerascan:hg19" "soapfuse:hg19" "ericscript:hg19" -f "list" -o "fuma_list.txt"

It fails with error

Traceback (most recent call last):
  File "/usr/local/bin/fuma", line 5, in <module>
    pkg_resources.run_script('fuma==2.11.2', 'fuma')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1401, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/fuma-2.11.2-py2.7.egg/EGG-INFO/scripts/fuma", line 145, in <module>

Exception: Sample 'ericscript' could not be parsed as filetype: ericscript

$fuma --formats
FuMa supports the following file formats:

Tools              | File                  | Format string
----------------------------------------------------------
Chimera            | prettyPrint() output  | chimera
ChimeraScan        | chimeras.bedpe        | chimerascan
Complete Genomics  | highConfidenceJu*.tsv | complete-genomics
Complete Genomics  | allJunctionsBeta*.tsv | complete-genomics
DeFuse             | results.txt           | defuse
DeFuse             | results.classify.txt  | defuse
DeFuse             | results.filtered.txt  | defuse
EricScript         | .results.total.txt    | ericscript *************
Fusion Catcher     | final-list_cand*.txt  | fusion-catcher_final
FusionMap          |                       | fusionmap
Trinity + GMAP     |                       | trinity-gmap
OncoFuse           |                       | oncofuse
RNA STAR           | Chimeric.out.junction | rna-star_chimeric
SOAPFuse           | final.*.for.genes.txt | soapfuse-final-gene
SOAPFuse           | final.*.for.trans.txt | soapfuse-final-transcript
STAR Fusion        | _candidates.final     | star-fusion_final
TopHat Fusion pre  | fusions.out           | tophat-fusion_pre
TopHat Fusion post | potential_fusion.txt  | tophat-fusion_post_potential_fusion
TopHat Fusion post | result.txt            | tophat-fusion_post_result
TopHat Fusion post | result.html           | tophat-fusion_post_result_html

 The file formats that are supported in the direction (5' -> 3')
 specific mode are:

- chimerascan
- defuse
- fusion-catcher_final
- tophat-fusion_pre
- tophat-fusion_post_potential_fusion
- rna-star_chimeric
- soapfuse-final-gene
- soapfuse-final-transcript

************* EricScript often contains entries with unknown breakpoints.
Because no genomic coordinates are given those fusion genes can not be
imported into FuMa and only those with breakpoints will be taken into account.

yhoogstrate commented 8 years ago

Hey Nikita,

Excuse me for wasting your time on that mistake. I accidentally choose to use the SOAPfuse parser when the EricScript parser was selected. It should be fixed in v2.11.3 (https://github.com/yhoogstrate/fuma/releases/tag/v2.11.3). I also changed the error handling a bit so we won't get error massages as unclear as the one you got anymore.

I tried to solve a numpy installation issue. Hopefully this doesn't break your installation...

Regarding your question on hg38, if you would obtain such a file from a data provider like e.g. UCSC you're more or less comparing apples with oranges because certain genes annotated in hg38 are not annotated in hg19. Thererore, for your use case I would recommend to liftOver the one from hg19 to hg38 to ensure you're using a consistent nomenclature (exact same gene names). If you don't know about liftOver or don't want to install such tool you can quite easiliy do this on https://usegalaxy.org/ and search for 'liftover' in the left pane. If you would like to do this using a tool on your own computer I would advise you to use CrossMap because it is fast.

Still you have a fair point that I should include a hg38 annotation and explain how to do this for other genomes. I will try do this soon.

Thanks again,

Youri

avkitex commented 8 years ago

Hello, Youri

I've tested last version. runsting: fuma --strand-specific-matching --acceptor-donor-order-specific-matching -a "hg19:../refseq_genes_hg19.bed.txt" "hg38:../hg38.bed" -s "chimerascan:chimerascan:ERR031018_chs_res.txt" "soapfuse:soapfuse-final-gene:ERR031018.final.Fusion.specific.for.genes" "ericscript:ericscript:ERR031018.results.total.tsv" -l "chimerascan:hg19" "soapfuse:hg19" "ericscript:hg38" -f "list" -o "fuma_list_3prog.txt" hg38 was obtained as 1st 4 columns from ucsc https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=473773079_aPcyY1wZINZcANzzS4i8jGLi3WvG&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_knownGene&hgta_ctDesc=table+browser+query+on+knownGene&hgta_ctVis=pack&hgta_ctUrl=&fbQual=whole&fbUpBases=200&fbExonBases=0&fbIntronBases=0&fbDownBases=200&hgta_doGetBed=get+BED

There are some questions: 1) It it ok that final list file looks a bit truncated in chimerascan column? 2) I think it would be better to use gene names instead of transcripts list for ericscript 3) Here is an xls document with colored lines for fusions found with multiple programs. Look at 18 sample, sheet Tumor_all(18). sf_chs_es.xlsx

yhoogstrate commented 8 years ago

Nikita,

It it ok that final list file looks a bit truncated in chimerascan column?

Yes, this could happen. There are 2 reasons why fusion genes will be removed:

Duplicates (NAIP-OCLN 2x, SERF1-SERF1A 4x, etc.) - incidence of this is really tool specific.
Breakpoints that do not fall within a gene regions. Although the breakpoints fall in genes according to ChimeraScan, it could be that those genes are not present in the gene set I have supplied. This may happen with hypothetical or newly discovered genes for example. When I tested FuMa I saw the indidence of this was about 2%.

Note that if you carefully examine the output you must be able to calculate this back. For example, when I used your chimerascan output and I did:

2016-02-08 13:17:40,072 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Parsing file: ERR031017_chs_res.txt
2016-02-08 13:17:40,072 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Parsed fusion genes: 14
2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Annotating genes on the left junction: chres1 - hg19
2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Annotating genes on the right junction: chres1 - hg19
2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Duplication removal: chres1 (14 fusions)
2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - * Full: 14
2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - * Gene-spanning: 14
2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - * Unique: 14

Here you see that the Full dataset contains 14 entries, of which 14 are Gene-spanning. Of those 14 gene spanning, each of them are Unique based on the annotated genes. The unique ones are those where FuMa will proceed the analysis with. So if you see those numbers drop, you know for which reason they are not taken into account.

I think it would be better to use gene names instead of transcripts list for ericscript

Yes, I would suggest to use gene names as well but maybe it is even more important to use the same gene identifiers across the BED files. I will try to figure out how to obtain such data in the most convenient way and update the manual since this is really important and obviously not clear in the current manual.

Could you try to use this file for hg38 and see if you find (better) overlap: https://raw.githubusercontent.com/yhoogstrate/fuma/hg38_bed_file/tests/data/refseq_genes_hg19__liftOver_to_hg38.bed

This file should be compatible with the one you used on hg19

avkitex commented 8 years ago

The last file you've provided was good. Thanks!

2016-02-08 15:32 GMT+03:00 yhoogstrate notifications@github.com:

Nikita,

It it ok that final list file looks a bit truncated in chimerascan column?

Yes, this could happen. There are 2 reasons why fusion genes will be removed:

Duplicates (NAIP-OCLN 2x, SERF1-SERF1A 4x, etc.) - incidence of this is really tool specific.

Breakpoints that do not fall within a gene regions. Although the breakpoints fall in genes according to ChimeraScan, it could be that those genes are not present in the gene set I have supplied. This may happen with hypothetical or newly discovered genes for example. When I tested FuMa I saw the indidence of this was about 2%.

Note that if you carefully examine the output you must be able to calculate this back. For example, when I used your chimerascan output and I did:

2016-02-08 13:17:40,072 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Parsing file: ERR031017_chs_res.txt 2016-02-08 13:17:40,072 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Parsed fusion genes: 14 2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Annotating genes on the left junction: chres1 - hg19 2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Annotating genes on the right junction: chres1 - hg19 2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - Duplication removal: chres1 (14 fusions) 2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - * Full: 14 2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - * Gene-spanning: 14 2016-02-08 13:17:40,073 - FuMA::Readers::ReadChimeraScanAbsoluteBEDPE - INFO - * Unique: 14

Here you see that the Full dataset contains 14 entries, of which 14 are Gene-spanning. Of those 14 gene spanning, each of them are Unique based on the annotated genes. The unique ones are those where FuMa will proceed the analysis with. So if you see those numbers drop, you know for which reason they are not taken into account.

I think it would be better to use gene names instead of transcripts list for ericscript

Yes, I would suggest to use gene names as well but maybe it is even more important to use the same gene identifiers across the BED files. I will try to figure out how to obtain such data in the most convenient way and update the manual since this is really important and obviously not clear in the current manual.

Could you try to use this file for hg38 and see if you find (better) overlap:

https://raw.githubusercontent.com/yhoogstrate/fuma/hg38_bed_file/tests/data/refseq_genes_hg19__liftOver_to_hg38.bed

This file should be compatible with the one you used on hg19

— Reply to this email directly or view it on GitHub https://github.com/yhoogstrate/fuma/issues/14#issuecomment-181347241.

Left-genes Right-genes Spans large gene (>200000bp) chimerascan soapfuse ericscript AAMDC:INTS4 AK057766 FALSE TRUE FALSE FALSE CD99 RNASEH2B FALSE TRUE FALSE FALSE CD99 RNASEH2B FALSE TRUE FALSE FALSE ... ADCK4 NUMBL FALSE TRUE TRUE TRUE RAD50 BC030525:PDLIM4 FALSE TRUE TRUE TRUE AZGP1 GJC3 FALSE TRUE TRUE TRUE

Left-genes Right-genes Spans large gene (>200000bp) chimerascan soapfuse ericscript AAMDC:INTS4 AK057766 FALSE CLUSTER22=chr11:77602364-chr7:64693537
CD99 RNASEH2B FALSE CLUSTER6=chrY:2594412-chr13:51530899
CD99 RNASEH2B FALSE CLUSTER24=chrX:2644412-chr13:51530899
... ANAPC16 AK097351:CD37 TRUE 850=chr10:72234414-chr19:49336279 KIF20B CD3EAP:ERCC1 FALSE 1360=chr10:89774696-chr19:45410482 NRP1 ZN

yhoogstrate commented 8 years ago

Hey @avkitex ,

I have added a small utility that allows you to convert GTF files to BED files (https://github.com/yhoogstrate/fuma/pull/22; https://github.com/yhoogstrate/fuma/releases/tag/v2.11.5). Therefore I will close this issue as I believe it has been completely resolved.

Thanks again :+1:

avkitex commented 8 years ago

Thank you!

2016-03-11 14:37 GMT+03:00 yhoogstrate notifications@github.com:

Hey @avkitex https://github.com/avkitex ,

I have added a small utility that allows you to convert GTF files to BED files (#22 https://github.com/yhoogstrate/fuma/pull/22; https://github.com/yhoogstrate/fuma/releases/tag/v2.11.5). Therefore I will close this issue as I believe it has been completely resolved.

Thanks again [image: :+1:]

— Reply to this email directly or view it on GitHub https://github.com/yhoogstrate/fuma/issues/14#issuecomment-195332713.

yhoogstrate / fuma

Soapfuse format support #14