xunchen85 / ERVcaller

ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and real benchmark whole-genome sequencing (WGS) datasets. ERVcaller is capable of accurately detecting TE insertions of any length, particularly ERVs. It allows for the use of a TE reference library regardless of sequence complexity, such as the entire RepBase database. It is easy to install and run from the command line.
http://www.uvm.edu/genomics/software/ERVcaller.html

ERVcaller v1.4

Introduction

ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and benchmark whole-genome sequencing (WGS) datasets. ERVcaller is capable of accurately detecting TE insertions of any length, particularly ERVs. It can be applied to both paired-end and single-end WGS, WES, or targeted DNA sequencing data. It supports FASTQ file(s) or BAM file(s) generated by different aligners (only BWA and Bowtie were tested). In addition, ERVcaller can detect insertion breakpoints at single-nucleotide resolution. It allows the use of either consensus TE sequences or a TE library containing many TE sequences, such as the entire RepBase database, as the reference. Thus, ERVcaller can be used to detect insertions from highly mutated or novel TE sequences. It is easy to install and run from the command line. Complementary to ERVcaller, other bioinformatics tools designed to detect large deletions may be used to detect TEs that are present in the human reference genome but absent from the samples being tested.

We have also published a book chapter that provides a step-by-step guide to using ERVcaller and other tools to characterize polymorphic TE insertions in human populations.

• Xun Chen, Guillaume Bourque, and Clement Goubert (2023): Genotyping of Transposable Element Insertions Segregating in Human Populations Using Short-Read Realignments, Transposable Elements: Methods and Protocols, Methods in Molecular Biology, vol. 2607, https://doi.org/10.1007/978-1-0716-2883-6_4

Installation

Extract the latest ERVcaller installer

$ tar vxzf ERVcaller_v.1.4.tar.gz  

Installing dependent software

Install the following software separately and make each tool available in the default search path (for example, by using the Linux “export” command or by adding the tool directories to your .bashrc).

• BWA-0.7.10: http://bio-bwa.sourceforge.net/bwa.shtml
• Samtools-1.6 (or any version later than 1.2): http://www.htslib.org/doc/samtools.html
• R-3.3.2 (or higher): https://www.r-project.org/
• SE_MEI (Modified version included in the Scripts folder of the ERVcaller installer)
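
Before running ERVcaller, it can help to confirm that the dependencies listed above are actually visible in the search path. The following is a minimal sketch, not part of ERVcaller: `check_tools` is a hypothetical helper, and the executable names `bwa`, `samtools`, `Rscript`, and `perl` are assumptions about how the tools are installed locally. SE_MEI is left out because it ships in the Scripts folder of the installer.

```shell
# Report whether each given command can be found on PATH.
# Returns 0 if all are present, 1 if at least one is missing.
check_tools() {
    local tool missing
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to the local installation.
check_tools bwa samtools Rscript perl || echo "Add the missing tools to PATH before running ERVcaller."
```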

Preparing the references

Human reference genome (hg38 by default; if BAM file(s) are used as input, use the same build that was used as the reference for the alignment)

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz  
$ gunzip hg38.fa.gz  
$ bwa index hg38.fa  

TE reference. The ERVcaller installer provides several TE references (the TE consensus sequences, consisting of one consensus sequence each for Alu, LINE1, SVA, and HERV-K; the human TE library containing 23 TE sequences; and the ERV library extracted from the Repbase database). A user-defined TE reference library can be used instead.

$ cd user_installed_full_path/Database/  
$ bwa index TE_consensus.fa  
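
Indexing hg38 takes a while, so it may be worth checking whether an index already exists before re-running `bwa index`. The sketch below relies on the fact that `bwa index` writes five files next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`); `bwa_index_complete` is a hypothetical helper name, not something provided by ERVcaller or BWA.

```shell
# Return 0 if all five BWA index files exist (and are non-empty) next to the FASTA.
# Otherwise print the files that are absent and return 1.
bwa_index_complete() {
    local fasta ext status
    fasta=$1
    status=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$fasta.$ext" ]; then
            echo "missing index file: $fasta.$ext"
            status=1
        fi
    done
    return "$status"
}

# Possible usage (only index when needed):
#   bwa_index_complete hg38.fa || bwa index hg38.fa
#   bwa_index_complete TE_consensus.fa || bwa index TE_consensus.fa
```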

Running ERVcaller

Make the installed dependent tools available in the default search path

$ export PATH=$PATH:$home/bwa-master/  
$ export PATH=$PATH:$home/samtools-1.6/  
$ export PATH=$PATH:$home/SE-MEI/  
$ export PATH=$PATH:$home/R/  

Print help page

$ perl user_installed_full_path/ERVcaller_v1.4.pl  

ERVcaller: running command line

$ perl user_installed_path/ERVcaller_v1.4.pl -i sample_ID -f .bam -H hg38.fa -T TE_consensus.fa -S 20 -BWA_MEM -t No._threads  

Detecting TE insertions using a BAM file as input

$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM  

Detecting TE insertions using paired-end FASTQ file as input

$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .fq.gz -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM  

Detecting TE insertions using multiple BAM files as input

$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .list -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -m  

Detecting and genotyping TE insertions using a BAM file as input

$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -G  
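
When a folder holds many BAM files, the single-sample command above can be generated once per file. The following is a dry-run sketch under stated assumptions: `print_ervcaller_commands` is a hypothetical helper, the variables `ERVCALLER`, `HG`, `TE`, and `THREADS` are placeholders, and the sample ID passed to `-i` is assumed to be the file name without the `-f` suffix, as the examples above suggest. It only prints commands; nothing is executed.

```shell
# Placeholders; override them in the environment to match the local setup.
ERVCALLER=${ERVCALLER:-user_installed_path/ERVcaller_v.1.4.pl}
HG=${HG:-hg38.fa}
TE=${TE:-TE_consensus.fa}
THREADS=${THREADS:-12}

# Print one ERVcaller detect-and-genotype command per *.bam file in in_dir.
print_ervcaller_commands() {
    local in_dir out_dir bam sample
    in_dir=$1
    out_dir=$2
    for bam in "$in_dir"/*.bam; do
        [ -e "$bam" ] || continue          # the glob matched no BAM files
        sample=$(basename "$bam" .bam)     # assumed: -i is the file name minus the suffix
        echo "perl $ERVCALLER -i $sample -f .bam -H $HG -T $TE -I $in_dir -O $out_dir -t $THREADS -S 20 -BWA_MEM -G"
    done
}

# Possible usage: review the printed commands, then pipe them to a shell.
#   print_ervcaller_commands folder_of_input_data folder_for_output_files
```

Printing rather than executing makes it easy to inspect the commands or hand them to a job scheduler. Paths containing spaces would need extra quoting.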

Output file

Output for each sample

The output VCF file (VCFv4.2) will be generated after running. All annotations are listed below:

##fileformat=VCFv4.2
##fileDate=2019121
##source=ERVcaller_v.1.4
##reference=file:hg38.fa
##ALT=<ID=INS:MEI:HERVK,Description="HERVK insertion">
##INFO=<ID=TSD,Number=2,Type=String,Description="NUCLEOTIDE,LEN, Nucleotides and length of the Target Site Duplication (NULL for unknown)">
##INFO=<ID=INFOR,Number=6,Type=String,Description="NAME,START,END,LEN,DIRECTION,STATUS; NULL for unknown values. Status of detected TE: 0 = Inconsistent direction for the supporting reads; 1 = One breakpoint detected by only chimeric and/or improper reads without split reads; 2 = Only one breakpoint is detected and covered by split reads; 3 = Two breakpoints detected, and both of them are not covered by split reads; 4 = Two breakpoints detected, and one of them are not covered by split reads; 5 = Two breakpoints detected, and both of them are covered by split reads;">
##INFO=<ID=CR,Number=1,Type=Integer,Description="Number of chimeric and improper reads support the TE insertion">
##INFO=<ID=SR,Number=1,Type=String,Description="Number of split reads support TE insertion and the breakpoint">
##INFO=<ID=GTF,Number=1,Type=String,Description="If the detected TE insertions genotyped">
##INFO=<ID=GR,Number=1,Type=Float,Description="The ratio of the number of reads support TE insertions versus the total number of reads at this TE insertion location">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality (Phred transformed)">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype likelihood">
##FORMAT=<ID=DPI,Number=1,Type=Integer,Description="The number of reads support TE insertions">
##FORMAT=<ID=DPN,Number=1,Type=Integer,Description="The number of reads support non-TE insertions">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TE_seq
chr1    5617379 .       T       <INS_MEI:HERV>  .       .       TSD=NULL,NULL;INFOR=HERVK,1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000   GT:GQ:GL:DPN:DPI        1/1:40:0,0,1:0:67
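
For a quick look at the calls, the fields described in the header can be pulled out with awk. This is a minimal sketch that assumes the layout shown above (INFO entries separated by `;`, `INFOR` holding `NAME,START,END,LEN,DIRECTION,STATUS`, and one sample column); `summarize_ervcaller_vcf` is a hypothetical helper name.

```shell
# Print CHROM, POS, TE name, status code, genotype and GQ for each VCF record.
# Reads the files given as arguments, or standard input if none are given.
summarize_ervcaller_vcf() {
    awk '
    /^#/ { next }                                  # skip header lines
    NF >= 10 {
        te = "NA"; status = "NA"; gt = "NA"; gq = "NA"
        # INFOR=NAME,START,END,LEN,DIRECTION,STATUS
        n = split($8, info, ";")
        for (i = 1; i <= n; i++) {
            if (info[i] ~ /^INFOR=/) {
                sub(/^INFOR=/, "", info[i])
                m = split(info[i], f, ",")
                te = f[1]; status = f[m]
            }
        }
        # Look up GT and GQ by name, since the FORMAT order can vary.
        k = split($9, keys, ":"); split($10, vals, ":")
        for (i = 1; i <= k; i++) {
            if (keys[i] == "GT") gt = vals[i]
            if (keys[i] == "GQ") gq = vals[i]
        }
        print $1, $2, te, status, gt, gq
    }' "$@"
}

# For the example record above this prints:
#   chr1 5617379 HERVK 4 1/1 40
```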

Merging multiple samples

Create a file containing the sample list

Combine multiple samples, providing a list of consensus TE loci

$ perl user_installed_path/Scripts/Combine_VCF_files.pl -l sample_list -c 1KGP.TE.sites.vcf -o Output_merged.vcf  

Combine multiple samples without providing a list of consensus TE loci

$ perl user_installed_path/Scripts/Combine_VCF_files.pl -l sample_list -o Output_merged.vcf  

Calculate the number of reads supporting non-insertions at the consensus TE loci for each sample (it is recommended to filter out low-quality TE loci from the combined VCF file before running this script)

$ perl user_installed_path/Scripts/Calculate_reads_among_nonTE_locations.pl -i Output_merged.vcf -S sampleID -o output.nonTE -b bamFile.bam -s paired-end -l length_insertsize -L std_insertsize -r read_length -t threads

Distinguish missing genotypes and non-insertion genotypes at the consensus TE loci to get the final genotypes for all samples

$ cat *.nonTE >nonTE_allsamples
$ perl user_installed_path/Scripts/Distinguish_nonTE_from_missing_genotype.pl -n nonTE_allsamples -v Output_merged.vcf -o Output_merged-final.vcf
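
Because the read-counting step runs once per sample, the commands can be generated from a list of sample IDs. The sketch below is a dry run under assumptions that are not part of ERVcaller: `print_nonte_commands` is a hypothetical helper, sample IDs are read one per line from standard input, each BAM file is assumed to be named `<bam_dir>/<sampleID>.bam`, and the insert size, its standard deviation, and the read length are supplied by the caller. Each command writes `<sampleID>.nonTE`, so the `cat *.nonTE` step above picks up all of them.

```shell
# Print (not run) one Calculate_reads_among_nonTE_locations.pl command per sample ID.
# Arguments: merged_vcf bam_dir insert_size insert_size_sd read_length threads
print_nonte_commands() {
    local script vcf bam_dir insert sd readlen threads sample
    script=user_installed_path/Scripts/Calculate_reads_among_nonTE_locations.pl
    vcf=$1; bam_dir=$2; insert=$3; sd=$4; readlen=$5; threads=$6
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue       # skip blank lines
        echo "perl $script -i $vcf -S $sample -o $sample.nonTE -b $bam_dir/$sample.bam -s paired-end -l $insert -L $sd -r $readlen -t $threads"
    done
}

# Possible usage (the numeric values are illustrative only):
#   printf 'S1\nS2\n' | print_nonte_commands Output_merged.vcf bams 500 100 150 4
```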

FAQ

How to install dependent tools?

You can follow the links listed below for information about downloading and/or installing all the dependent tools except the modified SE_MEI which is already included with ERVcaller:
• BWA-0.7.10: http://bio-bwa.sourceforge.net/bwa.shtml
• Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
• Samtools-1.6 (or any version later than 1.2): http://www.htslib.org/doc/samtools.html
• R: https://www.r-project.org/

How to set the shell environment variables for the installed dependent tools?

You can set the variables temporarily by running the Linux “export” command each time before you run ERVcaller, or you can modify your shell profile file (e.g., .bashrc) for long-term use. Do this for all of the tools above except R, which is usually added to the search path when it is installed. For example:

$ export PATH=$PATH:/home/Tools/samtools/  
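
If the export lines are placed in .bashrc, PATH grows a little each time the profile is re-read. One way to avoid duplicate entries is a small guard function; this is a sketch, and `add_to_path` is a hypothetical helper name rather than a standard shell command.

```shell
# Append a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

# Directory taken from the example above; adjust it to the local install location.
add_to_path /home/Tools/samtools
```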

Where to get the human reference genome?

You can download hg38 here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/. It is recommended that the file hg38.fa.gz is downloaded and unzipped for reference indexing.

Can we use other TE references we collected ourselves?

Yes, you can. You should be able to use any TE reference sequences specific to your research.

Where can I find test data?

You can find the test input data under the ERVcaller_v.1.4/test/ folder. There is example input data in both BAM and FASTQ format for testing.
There is also an example VCF output file in the folder: ERVcaller_v.1.4/test/example_output/

Where can I find more information about the output format?

You can find the full information here: https://samtools.github.io/hts-specs/VCFv4.2.pdf.

Which parameters were used to produce the example test output file?

The following command line was used to produce the example file:

$ perl ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -G  

How to speed up ERVcaller?

You can use the “-t” option to enable multi-threaded computing. You can also skip the genotyping function, which significantly speeds up ERVcaller. In addition, you may increase the length of split reads (“-S”) to reduce the number of split reads that are potentially caused by sequencing errors.

Do we need to provide the full path to the human reference genome and ERV reference genome in the command line, even if they’re in the executable’s directory?

Yes.

Do we need to provide the full path to the ERVcaller in the command line?

Yes.

Can ERVcaller be used to detect potential nested TE insertions?

Yes. ERVcaller reports TE insertions even when they fall within reference TE sequences of the same type. However, accuracy increases significantly when potential nested TEs are removed.

Are any filtering steps suggested to retain confident TE insertions in the output?

To keep high-quality TE insertions, it is important to filter out TE insertions located within reference TEs of the same type using BEDtools, and to filter out TE loci with a low genotype quality (e.g., GQ < 10).
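
The genotype-quality part of this filtering can be sketched with awk. The example assumes a single-sample ERVcaller VCF, reads GQ from the first sample column, and drops records whose GQ is missing or below the threshold; `filter_vcf_by_gq` is a hypothetical helper name. The BEDtools step for nested insertions is not shown.

```shell
# Keep header lines and records whose GQ (first sample column) is at least min_gq.
# Reads a VCF on standard input and writes the filtered VCF to standard output.
filter_vcf_by_gq() {
    awk -v min_gq="$1" '
    /^#/ { print; next }
    {
        gq = ""
        # Locate GQ by name in the FORMAT column, then read it from the sample column.
        k = split($9, keys, ":"); split($10, vals, ":")
        for (i = 1; i <= k; i++) if (keys[i] == "GQ") gq = vals[i]
        if (gq != "" && gq != "." && gq + 0 >= min_gq + 0) print
    }'
}

# Possible usage (file names are illustrative):
#   filter_vcf_by_gq 10 < sample.vcf > sample.GQ10.vcf
```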

Copyright

ERVcaller is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. It may be used for non-commercial use only. For inquiries about a commercial license, please contact the first or the corresponding author or The University of Vermont Innovations.

Download

Download: www.uvm.edu/genomics/software/ERVcaller.html

Citation

Chen X and Li D. ERVcaller: Identifying and genotyping non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) using next-generation sequencing data. Bioinformatics, Volume 35, Issue 20, Pages 3913–3922. https://doi.org/10.1093/bioinformatics/btz205.