xunchen85 / ERVcaller

ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and real benchmark whole-genome sequencing (WGS) datasets. ERVcaller can accurately detect TE insertions of any length, particularly ERVs. It allows the use of a TE reference library regardless of sequence complexity, such as the entire RepBase database. It is easy to install and to run from the command line.
http://www.uvm.edu/genomics/software/ERVcaller.html
14 stars 4 forks

Assemble inserted virus sequence through intermediate output files #35

Open xxYaaoo opened 1 month ago

xxYaaoo commented 1 month ago

Dear Professor Chen,

First of all, I would like to thank you and your team for developing ERVcaller, which has greatly assisted my project.

Recently, I have been trying to assemble the inserted virus sequences for further investigation. De novo assembly with other software has not performed satisfactorily, so I am wondering whether some of ERVcaller's intermediate output files could be useful. For example, I think the '*_ERV.visualization' file in the '_temp' folder might work, but I am not sure. Could you please explain these files or share some suggestions? Thank you so much!

Sincerely, Yaaoo

xunchen85 commented 1 month ago

Hi, you are correct that you could use the "_ERV.visualization" file in the '_temp' folder to reconstruct the inserted viral sequences. It is definitely useful for getting the boundary sequences at single-nucleotide resolution.

However, we only kept the chimeric and split reads, so if the insertion you are interested in is too long, you may lose the middle region.

The visualization file contains the TE reference ID and the read IDs; you can view it in the terminal or in a text editor. You could also manually extract those reads and align them against the TE consensus sequence using blastn, blat, etc.

Best, Xun
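
The manual route suggested above (pull out the supporting reads, then align them to the TE consensus) might be sketched as follows. This is only a minimal sketch and not part of ERVcaller: the helper names, the read IDs, and the file names are hypothetical, and it assumes NCBI BLAST+ is installed. `-query`, `-subject`, and `-outfmt 6` are standard `blastn` options.

```python
import subprocess


def write_fasta(reads, path):
    """Write a {read_id: sequence} mapping to a FASTA file (hypothetical helper)."""
    with open(path, "w") as handle:
        for read_id, seq in reads.items():
            handle.write(f">{read_id}\n{seq}\n")


def blastn_command(query_fasta, te_fasta):
    """Build a blastn call that aligns the reads against a TE consensus FASTA.

    -subject aligns directly against a FASTA file (no database needed);
    -outfmt 6 requests tabular output.
    """
    return ["blastn", "-query", query_fasta, "-subject", te_fasta, "-outfmt", "6"]


def run_blastn(query_fasta, te_fasta):
    """Run blastn and return its tabular output as text (needs blastn on PATH)."""
    result = subprocess.run(
        blastn_command(query_fasta, te_fasta),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, after copying the read sequences of one locus out of the visualization file, `write_fasta(reads, "reads.fa")` followed by `run_blastn("reads.fa", "te_consensus.fa")` would return one tab-separated line per hit; both file names here are placeholders.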

xxYaaoo commented 1 month ago

Thanks for your reply and suggestions, I'll try all of them! Thank you very much!

Sincerely, Yaaoo

xxYaaoo commented 4 weeks ago

Dear Professor Chen,

I have successfully assembled the viral insertion sequence, thanks to your suggestions. I am now trying to build a pipeline to assemble a batch of loci. However, because I am not clear about the content and meaning of the 'visualization' file, especially each 'O1' line, I cannot retrieve by script the part of the 'visualization' file that corresponds to each locus in the sample VCF file. The '*visualization' file does not seem to record the same insertion position for a locus as the VCF does, so I cannot search directly by position, and I don't know how to locate the correct part. Could you please share some ideas or suggestions? Thank you so much!

Sincerely, Yaaoo

xunchen85 commented 3 weeks ago

Hi,

The coordinates in the .visualization file should be consistent with the human and ERV coordinates in the VCF file. Specifically, O1 includes the summary of all reads, and O2 has the alignment info per read. You could also check the O3 section, which visualizes the read mapping on the human and viral genomes: lowercase characters are the mapped sequences, and uppercase characters are the clipped and unmapped sequences.

Best, Xun
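
The case convention described for the O3 section (lowercase for mapped bases, uppercase for clipped or unmapped bases) can be exploited when scripting. The following is a minimal sketch, assuming a read sequence has already been pulled out of an O3 line as a plain string; the function name and the example read are hypothetical, and how the sequence sits within an O3 line is not specified in this thread.

```python
from itertools import groupby


def split_by_case(seq):
    """Split a sequence into consecutive (label, segment) runs by letter case.

    Assumes lowercase = mapped and uppercase = clipped/unmapped, as described
    for the O3 section. Any non-lowercase character (including gaps or
    padding, if present) ends up in a 'clipped' run.
    """
    runs = []
    for is_lower, chars in groupby(seq, key=str.islower):
        label = "mapped" if is_lower else "clipped"
        runs.append((label, "".join(chars)))
    return runs


# Hypothetical read: a clipped stretch followed by a mapped stretch.
print(split_by_case("GGATCCacgtacgt"))
# [('clipped', 'GGATCC'), ('mapped', 'acgtacgt')]
```

The boundary between a clipped run and a mapped run is where the read crosses a breakpoint, which is why these reads can give boundary sequences at single-nucleotide resolution.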

xunchen85 commented 3 weeks ago

Hi,

Sorry, it is not easy for me to log in to GitHub and reply to your question here. If you have questions, you could also email me.

Best, Xun

xxYaaoo commented 3 weeks ago

Dear Professor Chen,

(This time I am replying to your message directly through the mailbox interface. I am not sure whether you will receive my mail, but I hope it is more convenient for you.) I have attached two files for one insertion locus: the txt file shows the locus INFO from the final sample VCF output, and the other file shows the corresponding content of the 'visualization' file.

I tried to locate each locus by comparing the VCF 'chr pos' with the 'chr pos pos' on the O1 line [for example, the VCF pos falls in the region of the latter pos], but found a few exceptions where this failed. For example:

    The position info in the VCF: 'chr6:100872571'
    The info in the visualization file: 'chr6 100872117 100872514 unknown DF0004301.2'

I think '100872117' might be the position of the front-most read, but I am confused about the meaning of '100872514'. How can I find the matching INFO for such exceptions precisely? I want to build an assembly pipeline. Or do you have any other good ideas?

I am very grateful for your continued help and replies. Much appreciated!

Sincerely,

Yaaoo 2024/8/19


xunchen85 commented 3 weeks ago

Hi,

I did not receive the files you attached to the email, so I am not sure about the positions you mentioned here.

Anyhow, if you are interested in the de novo assembly of inserted TEs, you could rely on the actual read alignments in the visualization file. Some of the breakpoints in the VCF file were predicted because split reads were lacking. You could also check the BAM files using the read names and positions in the visualization file.

Meanwhile, if you do de novo assembly, you could reconstruct the entire region including the breakpoints, and the actual breakpoints could then be fine-mapped accurately.

Best, Xun
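
Since some VCF breakpoints are predicted rather than observed in split reads, a VCF position may fall slightly outside the read-supported interval recorded in the visualization file. That could explain the mismatch reported earlier in the thread (VCF 'chr6:100872571' versus 'chr6 100872117 100872514 unknown DF0004301.2'), though the thread does not confirm it. One possible workaround for a batch pipeline is to match with a tolerance window. The sketch below assumes the record fields are whitespace-separated in the order chromosome, start, end, as in that single quoted example; the function names and the `slack` value of 100 bp are hypothetical and would need tuning on real data.

```python
def parse_o1_fields(text):
    """Parse 'chrom start end ...' from the fields of an O1-style record.

    Assumption: whitespace-separated fields in this order, based only on
    the one example quoted in this thread.
    """
    fields = text.split()
    return fields[0], int(fields[1]), int(fields[2]), fields[3:]


def near_interval(chrom, pos, record, slack=100):
    """Return True if a VCF position lies within `slack` bp of the interval."""
    rec_chrom, start, end, _ = record
    return chrom == rec_chrom and start - slack <= pos <= end + slack


record = parse_o1_fields("chr6 100872117 100872514 unknown DF0004301.2")
print(near_interval("chr6", 100872571, record, slack=0))    # strict containment
print(near_interval("chr6", 100872571, record, slack=100))  # tolerant match
```

With these numbers the VCF position lies 57 bp past the second coordinate, so the strict check prints `False` and the tolerant check prints `True`. If several records match one VCF locus, comparing the TE reference ID as well may help disambiguate.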

xxYaaoo commented 3 weeks ago

Dear Professor Chen,

Thank you for your reply and suggestions! I've gained a lot of inspiration! I apologize in advance if I need to come back and take up your time with further questions. Much appreciated!

Sincerely,

Yaaoo 2024/8/22