xunchen85 / ERVcaller

ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and real benchmark whole-genome sequencing (WGS) datasets. ERVcaller is capable of accurately detecting various TE insertions of any length, particularly ERVs. It allows the use of a TE reference library regardless of sequence complexity, such as the entire RepBase database. It is easy to install and to use from the command line.
http://www.uvm.edu/genomics/software/ERVcaller.html

Do you have any recommended standard filter? #8

Closed Leehyeonjin93 closed 4 years ago

Leehyeonjin93 commented 5 years ago

Thank you for developing a great tool. I have some final VCF files. I didn't find any additional or other filters. The status of a detected TE ranges from 0 to 5; is type 0 OK? Also, are the chr and position in the VCF the breakpoint? I don't understand where the breakpoint is or what START and END mean. For example:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TE_seq

chr1 5617379 . T . . TSD=NULL,NULL;INFOR=HERVK, 1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000 GT:GQ:GL:DPN:DPI 1/1:40:0,0,1:0:67

Where is the HERVK insertion position?

xunchen85 commented 5 years ago

According to feedback from users, researchers have different filtering standards. Type 0 does indicate lower confidence according to the supporting reads, although I would not suggest filtering those calls out directly. The chr and position do indicate the breakpoint, as in most tools. The START and END you highlighted below are coordinates on the inserted TE, not human coordinates. As the example shows, the predicted HERV insertion position is chr1:5617379.
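To make the distinction between human coordinates and TE coordinates concrete, here is a minimal Python sketch that reads one record like the example above. It assumes a tab-separated line with the ten columns from the header, and it assumes the INFOR value is laid out as TE name, TE start, TE end, length, strand, status. Only the name, START/END, and status are discussed in this thread; the length and strand labels are a guess from the example values. The function names and the "." placeholder in the ALT column are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def describe_insertion(vcf_line):
    """Return the genomic breakpoint and the TE-side coordinates of one record."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos = cols[0], int(cols[1])  # human coordinates: the breakpoint
    info = parse_info(cols[7])
    # Assumed INFOR layout: name, TE start, TE end, length, strand, status
    parts = [p.strip() for p in info["INFOR"].split(",")]
    name, te_start, te_end, te_len, strand, status = parts
    return {
        "breakpoint": (chrom, pos),
        "te_name": name,
        "te_start": int(te_start),  # position on the TE reference, not on chr1
        "te_end": int(te_end),
        "status": int(status),
    }


# Illustrative record built from the values quoted in the question;
# the ALT column is a placeholder because it is not shown in the thread.
EXAMPLE = "\t".join([
    "chr1", "5617379", ".", "T", ".", ".", ".",
    "TSD=NULL,NULL;INFOR=HERVK, 1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000",
    "GT:GQ:GL:DPN:DPI", "1/1:40:0,0,1:0:67",
])

if __name__ == "__main__":
    print(describe_insertion(EXAMPLE))
```

With this layout, `breakpoint` is where the insertion sits in the human genome, while `te_start` and `te_end` describe which part of the HERVK reference sequence was detected.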

I would suggest the following: filter out low-quality TE genotypes, for example by requiring GQ > 20 or another threshold; check whether a called TE insertion is located within a reference TE of the same type; filter by the total depth of supporting reads (DPI) according to your sequencing depth; and filter out TEs with an extremely long TSD, which are rare but may still be real, unless you want to analyze this group.
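Two of these suggestions, the GQ and DPI thresholds, can be sketched in a few lines of Python. This is only an illustration under assumptions: it expects tab-separated records with FORMAT in column 9 and a single sample in column 10, it reads "GQ>20" as "keep genotypes with GQ above 20", and the `min_dpi` default of 10 is a hypothetical value that should be adjusted to your own sequencing depth. The function names are made up for this example and are not part of ERVcaller.

```python
def sample_fields(format_col, sample_col):
    """Pair FORMAT keys (e.g. GT:GQ:GL:DPN:DPI) with the sample's values."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


def passes_filters(vcf_line, min_gq=20.0, min_dpi=10.0):
    """Keep a record only if GQ > min_gq and DPI >= min_dpi (assumed thresholds)."""
    cols = vcf_line.rstrip("\n").split("\t")
    sample = sample_fields(cols[8], cols[9])
    return float(sample["GQ"]) > min_gq and float(sample["DPI"]) >= min_dpi


def filter_records(lines, **thresholds):
    """Yield header lines unchanged and only those records that pass."""
    for line in lines:
        if line.startswith("#") or passes_filters(line, **thresholds):
            yield line


# Illustrative record with the FORMAT and sample values from the question;
# the other columns are placeholders.
RECORD = "\t".join([
    "chr1", "5617379", ".", "T", ".", ".", ".", "SR=3",
    "GT:GQ:GL:DPN:DPI", "1/1:40:0,0,1:0:67",
])

if __name__ == "__main__":
    kept = list(filter_records(["#CHROM\tPOS", RECORD]))
    print(len(kept), "lines kept")
```

The checks against reference TEs of the same type and against very long TSDs are not covered here, since they need an annotation file and the TSD encoding, neither of which is described in this thread.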

Xun

Leehyeonjin93 commented 5 years ago

Thanks. I have another question: I didn't find any GATK processing steps, such as MarkDuplicates, realignment, or recalibration, in your paper. Why didn't you use them?

xunchen85 commented 4 years ago

Thanks for your suggestions.

We want to make ERVcaller easy to install and use. We did consider including QC steps, but they can be done beforehand with many other tools. We also assumed most users would already perform read QC separately, such as removing redundant reads, low-quality reads, adaptor sequences, etc. Thanks again; we may reconsider including those steps in our next version.

We also perform a realignment step in ERVcaller, which may be slightly different from GATK's.

We do not have a recalibration step because, as far as I know, it is for evaluating quality scores for SNPs and InDels, not for TE insertions.

Thanks, Xun

Leehyeonjin93 commented 4 years ago

Thanks for the perfect answer.