xfengnefx / PPGfinder

Discover processed pseudogene from long structural variations.
MIT License
2 stars 2 forks source link

proper README file #1

Open AhmedArslan opened 2 years ago

AhmedArslan commented 2 years ago

Hi, I have a human bam fine, sniffles-vcf file, gff3 file and protein-coding-sequencing (.fa) file and I want to identify possible PPG. I am facing problem of understanding the code(s), and input files; is it possible for you to elaborate more each step and please explain what is the input file(s)/format in each step and in which sequence code(s) needed to run, which step is first and which is second and so on... ? This information would be very useful and your feedback is much appreciated. Many thanks. Ahmed.

xfengnefx commented 2 years ago

Hi, sorry for the mess and late reply. The repo was opened to document the scripts used for a manuscript. Let me cleanup the code and write a readme. I will try to get back to you within this week.

AhmedArslan commented 2 years ago

Hi, Thank you very much for doing that.

AhmedArslan commented 2 years ago

Hi, Any updates regarding the issue? Thanks.

xfengnefx commented 2 years ago

@AhmedArslan sorry I was caught up in some covid related stuff last week. I'm refactoring the code today.

xfengnefx commented 2 years ago

@AhmedArslan updated on master branch. Tested but let me know if something does not work, thanks.

AhmedArslan commented 2 years ago

Thank you, I will keep you updated. Can you please let me know what is the source of refgene file?

xfengnefx commented 2 years ago

It is part of a BED file converted from Gencode's GTF release (hs38, v31). I grepped "TDG". The format conversion was not performed by me, but I guess it was paftools.js gff2bed.

xfengnefx commented 2 years ago

By the way, github seems to not notify people about edits. Please feel free to append new comments.

AhmedArslan commented 2 years ago

Hi, I am getting following error at step 4:

python3 PPGfinder/findPPG_parsealn.py -t8 gencode.v40.annotation.pc.bed GRCh38_latest_genomic.fa aln_SV.paf > PPGFparse findPPG.py Version v0.1.5, Jun 28, 2020 [log::main] requested nb_cpu==8 [log::main] is_outputall: False [log::main] loading references... [log::main] Got references. [log::main] multi thread [log_multi::main] total 8223 querie(s) [log_multi::main] each thread will have 1028 querie(s) to process multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "tools/PPGfinder/findPPG_parsealn.py", line 215, in worker cov = group_overlaps(intervals, queries) File "tools/PPGfinder/findPPG_parsealn.py", line 97, in group_overlaps intervals.sort() numpy.AxisError: axis -1 is out of bounds for array of dimension 0 """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "PPGfinder/findPPG_parsealn.py", line 325, in p.map(worker, bundles) File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 364, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 771, in get raise self._value numpy.AxisError: axis -1 is out of bounds for array of dimension 0

xfengnefx commented 2 years ago

Oh, you need to pull the current HEAD. I updated both the code (functionally identical to the manuscript's version, just CLI changes) and the readme.

Output looks like it's the older commit.

AhmedArslan commented 2 years ago

Got it, I have downloaded the new version of the codes: with step-two I am getting the following error, I have sniffles output vcf file:

python3 tools/PPGfinder/findPPG_fileprep.py -o Pc.sorted.filter.new.tsv /analysis/sniffles/Pc.sorted.filter.vcf 1> SV.fa [M] at /analysis/sniffles/Pc.sorted.filter.vcf... total SVs now: 0.

xfengnefx commented 2 years ago

The script used to accept VCF in addition to paftools' .var file, specifically for using sniffles. However, when revamping I recalled that (iirc) some VCF files may specify a . base while some wouldn't, e.g. REF=. ALT=A vs REF=G ALT=GA, and removed the support. Please correct me if this is false. It maybe does not matter, but I am not testing it at the moment.

If you just need it to work for sniffles' VCF, I can bring it back.

AhmedArslan commented 2 years ago

The script used to accept VCF in addition to paftools' .var file, specifically for using sniffles. However, when revamping I recalled that (iirc) some VCF files may specify a . base while some wouldn't, e.g. REF=. ALT=A vs REF=G ALT=GA, and removed the support. Please correct me if this is false. It maybe does not matter, but I am not testing it at the moment.

If you just need it to work for sniffles' VCF, I can bring it back.

Presently, I am working with sniffles' VCF file. That would be great, if you can also have a function to adjust this file formate.

xfengnefx commented 2 years ago

Ok, let me bring it back. Will ping you once I push.

xfengnefx commented 2 years ago

I went to modify the code but remembered that there were actually some issues with the sniffles/VCF - we tried this for some time, hoping it to be a compliment to the assembly-based results (therefore there was a vcf2fa stub in the old code), but:

  1. With read variant calling, the query read of the SV (i.e. the read that contains this SV) and the query-to-reference alignment information is lost. We do need them when we look for TSDs and polyA tails in the next steps. Rather than variant calling and then recovering the info, it makes more sense to just assemble and call from the contig alignments. We did not try to do local assembly or alignment-guided assembly for this.
  2. We thought that pulling long SVs from long read alignment was not worth it. The original plan is to say something about assembly-vs-assembly's PPG discovery false negatives using the reads, but it was hard to justify.

Sorry, I completely forgot that we dropped the sniffles route in the end.

If you want to try 2)'s SV pulling anyway, you will need to roll a script and output a .var-formated file. paftools call is for getting SVs out of assembly-vs-assembly alignment. It makes some assumptions about that iirc (see manual here), and will not do read variant calling.

AhmedArslan commented 2 years ago

Hi Feng, after generating the de novo assembly at step-4, I'm getting the following error:

python /tools/PPGfinder/findPPG_parsealn.py -o C.PPGFparse -t14 -e 2 /reference/GRCh38/PPGfinder.gff3bed /reference/GRCh38/GRCh38_latest_genomic.primary.2.fa aln_SV.paf [M] loading references... [M] total 12908 SVs aligned [M] each thread will have 922 queries to process multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/anaconda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/anaconda3/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/tools/PPGfinder/findPPG_parsealn.py", line 115, in worker intervals = np.array(gencode[name_gene][-1]) KeyError: 'ENST00000465292.5|processed_transcript|KCNS3' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/tools/PPGfinder/findPPG_parsealn.py", line 223, in p.map(worker, bundles) File "/anaconda3/lib/python3.8/multiprocessing/pool.py", line 364, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/anaconda3/lib/python3.8/multiprocessing/pool.py", line 771, in get raise self._value KeyError: 'ENST00000465292.5|processed_transcript|KCNS3'

I have generated "PPGfinder.gff3bed" file with paftools, as suggested. Please suggest as solution. Thanks

xfengnefx commented 2 years ago

Blind guess, but how did you do the alignment for GRCh38_latest_genomic.primary.2.fa aln_SV.paf? Its reference should be the gene references (refgene.fa in the demo), not the whole genome (ref.fa in the demo). All sequence names in this reference are expected to present in the gff/bed file, with coordinates with respect to the whole genome reference.

AhmedArslan commented 2 years ago

Blind guess, but how did you do the alignment for GRCh38_latest_genomic.primary.2.fa aln_SV.paf? Its reference should be the gene references (refgene.fa in the demo), not the whole genome (ref.fa in the demo). All sequence names in this reference are expected to present in the gff/bed file, with coordinates with respect to the whole genome reference.

in step-3, I aligned with "refgene.fa" but in step-1 and step-4, I used whole genome as you mentioned "ref.fa"; that is whole genome reference?

xfengnefx commented 2 years ago

Thanks for checking up, and sorry it's actually my fault. I've just pushed a fix to the master branch, please retry at step 4 and further. There was two remnant lines that required the gene reference to be labelled as "protein_coding". I forgot to alter them when refactoring the interface...

I would suggest to only use protein coding genes as the reference if you're looking for processed pseudogenes.

AhmedArslan commented 2 years ago

Thanks for checking up, and sorry it's actually my fault. I've just pushed a fix to the master branch, please retry at step 4 and further. There was two remnant lines that required the gene reference to be labelled as "protein_coding". I forgot to alter them when refactoring the interface...

I would suggest to only use protein coding genes as the reference if you're looking for processed pseudogenes.

Thanks, it did work.