Assemble inserted virus sequence through intermediate output files

xxYaaoo commented 3 months ago

Dear Professor Chen,

First of all, I would like to thank you and your team for developing ERVcaller, which has greatly assisted my project.

I have recently been trying to assemble the inserted virus sequences for further investigation. De novo assembly with other software has not given satisfactory results, so I am wondering whether some of ERVcaller's intermediate output files could be useful. For example, I think the '*_ERV.visualization' file in the '_temp' folder might work, but I am not sure. Could you please explain these files or share some suggestions? Thank you so much!

Sincerely, Yaaoo

xunchen85 commented 3 months ago

Hi, you are correct that you could use the "_ERV.visualization" file in the '_temp' folder to reconstruct the inserted viral sequences. It is definitely useful for getting the boundary sequences at single-nucleotide resolution.

However, we only kept the chimeric and split reads, so if the locus you want is too long, you may lose the middle region.

The visualization file has the TE reference ID and read IDs; you can view it in the terminal or a text editor. You could also manually extract those reads and align them against the TE consensus sequence using blastn, blat, etc.
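As a rough illustration of the "extract those reads" step, here is a minimal Python sketch that pulls reads out of a FASTQ file by ID and writes them as FASTA. The function names, the `/1` and `/2` mate-suffix convention, and the idea of collecting the IDs by hand from the visualization file are assumptions for illustration, not part of ERVcaller.

```python
from typing import Iterable, Iterator, Set, Tuple


def iter_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        next(it, "")  # quality line (not needed for FASTA output)
        # Read ID = header without the leading '@', up to the first whitespace
        yield header[1:].split()[0], seq


def strip_mate(read_id: str) -> str:
    """Drop a trailing /1 or /2 mate suffix (assumed naming convention)."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


def reads_to_fasta(lines: Iterable[str], wanted_ids: Set[str]) -> str:
    """Return FASTA text for every read (both mates) whose ID is wanted."""
    return "".join(
        f">{read_id}\n{seq}\n"
        for read_id, seq in iter_fastq(lines)
        if strip_mate(read_id) in wanted_ids
    )
```

The resulting FASTA could then be aligned to the TE consensus, for example with `blastn -query reads.fa -subject consensus.fa -outfmt 6` (both file names are placeholders).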

Best, Xun

xxYaaoo commented 3 months ago

Thanks for your reply and suggestions. I'll try all of them! Thank you very much!

Sincerely, Yaaoo

xxYaaoo commented 3 months ago

Dear Professor Chen,

I have successfully assembled the viral insertion sequence, thanks to your suggestions. I am now trying to build a pipeline to assemble a batch of loci. However, because I am not clear about the content and meaning of the 'visualization' file, especially the 'O1' lines, my scripts cannot find the part of the 'visualization' file that corresponds to each locus in the sample VCF file. The '*visualization' file does not seem to record the same insertion position for a locus as the VCF does, so I cannot search directly by position. I don't know how to locate the correct part. Could you please share some ideas or suggestions? Thank you so much!

Sincerely, Yaaoo

xunchen85 commented 3 months ago

Hi,

The coordinates in the .visualization file should be consistent with the human and ERV coordinates in the VCF file. Specifically, O1 includes the summary of all reads; O2 has the alignment info per read. You could also check the O3 section, which visualizes the read mapping on the human and viral genomes: lowercase characters are the mapped sequences, and uppercase characters are the clipped and unmapped sequences.
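To make the case convention concrete, here is a small Python sketch that splits one read string into mapped and clipped/unmapped segments. It assumes the read has already been pulled out of an O3 line as a plain string of letters; the function name and the labels are hypothetical, and the exact O3 line layout is not covered here.

```python
from itertools import groupby
from typing import List, Tuple


def split_by_case(read: str) -> List[Tuple[str, str]]:
    """Split a read into runs of lowercase (mapped) and other (clipped/unmapped) characters.

    Assumes the string contains only sequence letters; any non-letter
    character would be grouped with the uppercase (clipped) runs.
    """
    segments = []
    for is_lower, chars in groupby(read, key=str.islower):
        label = "mapped" if is_lower else "clipped_or_unmapped"
        segments.append((label, "".join(chars)))
    return segments
```

For a read written as `ACGTacgtacGG`, this gives a clipped run `ACGT`, a mapped run `acgtac`, and a clipped run `GG`; the clipped runs are the parts that may extend into the inserted sequence.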

Best, Xun

xunchen85 commented 3 months ago

Hi,

Sorry, it is not easy for me to log in to GitHub and reply to your question here. If you have questions, you could email me too.

Best, Xun

xxYaaoo commented 3 months ago

Dear Professor Chen,

(This time I am replying to your message directly by email. I am not sure whether you will receive it, but I hope it is more convenient for you.) I have attached two files for one insertion locus: the txt file shows the locus INFO in the final sample VCF output, and the other file shows the corresponding content in the 'visualization' file. I tried to locate each locus by comparing the VCF 'chr pos' with the 'chr pos pos' on the O1 line [for example, the VCF pos lies in the region of the latter pos], but found a few exceptions. For example:

The position info in the VCF: 'chr6:100872571'
The info in the visualization file: 'chr6 100872117 100872514 unknown DF0004301.2'

I think '100872117' might be the position of the front-most read, but I am confused about the meaning of '100872514'. How can I find the matching INFO for these exceptions precisely? I want to build an assembly pipeline. Or do you have any other good ideas?

I am very grateful for your continued help and replies!

Sincerely,

Yaaoo 2024/8/19

xunchen85 commented 3 months ago

Hi,

I did not receive the files you sent by email, so I am not sure about the positions you mentioned here.

Anyhow, if you are interested in de novo assembly of the inserted TEs, you could rely on the actual read alignments in the visualization file. Some of the breakpoints in the VCF file were predicted because split reads were lacking. You could also check the BAM files for the read names and positions listed in the visualization file.

Meanwhile, if you do de novo assembly, you could reconstruct the entire region including the breakpoints, and then the actual breakpoints could be fine-mapped accurately.
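Since some VCF breakpoints are predicted rather than observed, an exact position lookup against the visualization file can miss. One possible workaround is a tolerant lookup, sketched below in Python. It treats the two numbers on the example line quoted earlier in this thread (`chr6 100872117 100872514 unknown DF0004301.2`) as an interval, which is an assumption the thread does not confirm; the record type, the function names, and the 500 bp default tolerance are also hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class O1Record(NamedTuple):
    chrom: str
    start: int
    end: int
    te_id: str


def parse_o1_fields(text: str) -> O1Record:
    """Parse 'chrom start end <field> te_id' (layout assumed from one quoted example)."""
    fields = text.split()
    return O1Record(fields[0], int(fields[1]), int(fields[2]), fields[4])


def distance_to_interval(pos: int, start: int, end: int) -> int:
    """Distance in bp from pos to the closed interval [start, end]; 0 if inside."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def nearest_o1(chrom: str, pos: int, records: Iterable[O1Record],
               max_dist: int = 500) -> Optional[O1Record]:
    """Return the closest record on chrom within max_dist bp of pos, else None."""
    best, best_dist = None, max_dist + 1
    for rec in records:
        if rec.chrom != chrom:
            continue
        dist = distance_to_interval(pos, rec.start, rec.end)
        if dist < best_dist:
            best, best_dist = rec, dist
    return best
```

With the example above, the VCF position chr6:100872571 lies 57 bp past the second coordinate, so it matches at the default tolerance but not at `max_dist=50`. Any match found this way is only a candidate and is worth confirming against the read names in the BAM file.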

Best, Xun

xxYaaoo commented 3 months ago

Dear Professor Chen,

Thank you for your reply and suggestions! They have given me a lot of ideas. I apologize in advance if I need to come back and ask for further advice later. Much appreciated!

Sincerely,

Yaaoo 2024/8/22

xxYaaoo commented 3 weeks ago

Dear Professor Chen,

Thank you for developing ERVcaller, and congratulations on publishing GraffiTE in NC! Both tools play very important roles in my research projects. Much appreciated!

While using these two tools, I have run into some difficulties and would really appreciate your help and advice:

For ERVcaller: In my own pipeline, I use 'Calculate_reads_among_nonTE_locations.pl' from ERVcaller to calculate nonTE for every sample. My cohort usually includes over 1,500 individuals. After merging all potential candidate loci, the VCF file may contain 6k-8k or more loci. In this case, the 'Calculate nonTE locations' step takes around 24 h with 30 GB maxRSS per individual. The resource consumption and especially the elapsed time have become a difficulty for me. I am wondering whether there is a way to speed this up or otherwise deal with the problem.

For GraffiTE: My question is also about resources. Currently, I am using 105 individuals as input to GraffiTE to compare the output of the 'GT-sv-GA' and 'GT-svsn-GA' modes, but the intermediate files of the two runs exceed 12 TB in total. Is this normal? Is there any way to reduce the storage burden? Also, which mode would you recommend? Sorry to bother you again; I look forward to your reply. Thank you for your constant help!

Best wishes!!

Yao 2024/11/3

xunchen85 commented 2 weeks ago

Hi,

Thanks, I am also glad that you find our tools useful. Regarding the file sizes and the ERVcaller running time for the nonTE loci, you could add a filtering step before running the script, which could lower your resource usage significantly. You could follow our protocol paper for the details.

Regarding GraffiTE: I think 12 TB of usage is abnormal. Cristian is taking care of the tool's maintenance, so I would suggest you reach out to him for advice.

Best, Xun

xxYaaoo commented 2 weeks ago

Dear Professor Chen,

Thank you for your reply! Congratulations again on your work and publication!

Best wishes!

Yao 2024/11/8

xunchen85 / ERVcaller

Assemble inserted virus sequence through intermediate output files #35