Cannot progress beyond step 2

xunchen85 / ERVcaller

ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and real benchmark whole-genome sequencing (WGS) datasets. ERVcaller is capable of accurately detecting various TE insertions of any length, particularly ERVs. It allows for the use of a TE reference library regardless of sequence complexity, such as the entire RepBase database. It is easy to install and use from the command line.

http://www.uvm.edu/genomics/software/ERVcaller.html


Cannot progress beyond step 2 #34

Open jasminjlee opened 3 months ago

jasminjlee commented 3 months ago

Hello Professor,

I am running ERVcaller with the aim of applying it to my dataset and a consensus HERV-K sequence, to detect HERV-K polymorphisms in my samples. I have been testing your tool using the following command:

perl "$SOFTWARE"/ERVcaller_v1.4.pl \
    -i TE_seq \
    -f .bam \
    -H "$HUMANGENOME"/hg38.fa \
    -T "$CONSENSUS"/TE_consensus.fa \
    -I "$INPUT"/ \
    -O "$OUTPUT"/ \
    -t 12 \
    -S 20 \
    -BWA_MEM

However, when I checked the output file, I noticed that the tool is not going beyond Chimeric reads:

And the error file is as below.

I'm at a total loss for what is going on. I've made sure that the dependent tools are available in the search path. I would appreciate any guidance. Thank you very much.
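For readers who want to repeat the search-path check mentioned above, here is a minimal sketch. The helper name check_tools is hypothetical, and the tool names in the example are only the ones mentioned in this thread, not a confirmed list of ERVcaller dependencies.

```shell
# Report which of the given tools can be resolved on the current PATH.
# Prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if location=$(command -v "$tool" 2>/dev/null); then
            printf 'found    %s -> %s\n' "$tool" "$location"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (tool names are assumptions taken from this thread):
# check_tools perl samtools bwa Rscript
```

A tool that resolves here can still be the wrong version, so this only rules out the "not found" case.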

xunchen85 commented 3 months ago

Hi,

It has something to do with your samtools. Can you check the version?

You could also run the samtools fastq command to check whether it has the -@ option for specifying the number of CPUs. I have checked, and samtools v1.20 still has this option.
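A rough sketch of that check, assuming samtools is on the PATH. The helper name has_threads_option is hypothetical. It only scans the help text for "-@" and does not prove that the option behaves as ERVcaller expects.

```shell
# Succeeds if the help text on stdin mentions the -@ (threads) option.
has_threads_option() {
    grep -q -e '-@'
}

# Example (stderr is merged in case the usage text is printed there):
# samtools --version | head -n 1
# samtools fastq 2>&1 | has_threads_option && echo "samtools fastq lists -@"
```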

Best, Xun

jasminjlee commented 3 months ago

Hello Dr. Chen,

Thank you for your help! Changing to samtools v1.2 solved the issue. I have successfully run through ERVcaller with the test data. I am now trying to get ERVcaller to run on one of my samples before I run it on the whole dataset.

However, I am running into errors in step 2 with this file as well. The header looks like this: it starts out fine but then starts running into errors at lines 356/358 of ERVcaller.pl. The error lines continue until the tail end. FYI, I have censored the sample IDs in pink.

The tail end after the NA error lines look like this.

After seeing "failed to locate index files" I thought that bwa index might be the problem. I run bwa index on TE_consensus.fa and GRCh38_full_analysis_set_plus_decoy_hla.fa (I tried hg38.fa but had the same issue) at the start of my script.
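One way to confirm that an index is actually present is to look for the files that bwa index writes next to the FASTA (.amb, .ann, .bwt, .pac and .sa). The sketch below does that. The helper name missing_bwa_index is hypothetical, and it does not check whether ERVcaller looks for the index in the same location.

```shell
# List the BWA index files that are missing (or empty) beside a FASTA reference.
# Returns non-zero if at least one of the five expected files is absent.
missing_bwa_index() {
    ref=$1
    rc=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$ref.$ext" ]; then
            printf 'missing or empty: %s.%s\n' "$ref" "$ext"
            rc=1
        fi
    done
    return "$rc"
}

# Example, using the paths from the command earlier in this thread:
# missing_bwa_index "$CONSENSUS"/TE_consensus.fa
# missing_bwa_index "$HUMANGENOME"/hg38.fa
```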

My dependencies are as follows:

samtools v1.2
bwa 0.7.1
R 4.3 (could this be the issue? The test files ran fine)

Again, thank you in advance for your guidance - I really appreciate it.

xunchen85 commented 3 months ago

Hi,

Can you show the list of temporary files generated in the folder? What is your command line and parameters?

Have you also checked whether the extractSoftclipped command works correctly in the script folder? You could follow the installation step in the manual (./extractSoftclipped)...

It should be fine if you have already indexed the reference sequences for both human and TEs. There is another indexing step in our validation of candidate loci. The error may also occur because no candidate loci were found.

Best, Xun

jasminjlee commented 3 months ago

Hi Dr. Chen,

Here is my command line, where the input directory contains just one participant file from my dataset:

Here is the list of temporary files generated - as you can see, it looks like some of the files are empty. Outside of the temporary directory (so the outputs folder), ERVcaller generates a .vcf file, but it is empty.
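A quick way to see which temporary files are empty is to list the zero-byte files. This is a minimal sketch: the helper name list_empty_files is hypothetical, and the directory in the example is a placeholder for wherever ERVcaller wrote its temporary files.

```shell
# Print every zero-byte regular file under a directory, one per line.
list_empty_files() {
    find "$1" -type f -size 0
}

# Example (the directory name is a placeholder):
# list_empty_files "$OUTPUT"/temporary_directory
```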

When I run ./extractSoftclipped in the SE-MEI folder (ERVcaller-1.4/Scripts/SE-MEI), it works fine. However, it does not when I run it anywhere else (i.e., where the script for my project is, in the below manner). Could this be what is causing the issue? I did not run into these issues when I was going through your test data, so it makes me think that there is something going on with my sample files.

Thank you very much for your suggestions.

xunchen85 commented 2 months ago

Hi, I realized that you are using the ERV_library.fa as the reference genome. Genome-wide, polymorphic ERV loci are relatively rare. To confirm this, you could run the pipeline with the TE_consensus.fa first.

The "TE.f" file indicates that the pipeline went well. Are the errors you see related to the bwa index during the validation steps? If so, they may also be due to the empty output.

You could check some positive loci first, for example some well-known polymorphic ERV loci that we previously reported. You could also confirm the sequencing depth you have. You may not have enough reads if the depth is too low.
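To get a rough figure for the sequencing depth, the per-position output of samtools depth can be averaged. The sketch below assumes a coordinate-sorted BAM. The helper name mean_depth and the file name sample.bam are hypothetical. The -a flag (report all positions, including zero coverage) exists in recent samtools releases but may be missing from older ones. Without it the mean covers only positions with at least one read.

```shell
# Mean of column 3 (per-position depth) from `samtools depth`-style
# tab-separated input on stdin: chromosome, position, depth.
mean_depth() {
    awk -F '\t' '{ sum += $3 } END { if (NR > 0) printf "%.2f\n", sum / NR; else print "NA" }'
}

# Example (sample.bam is a placeholder file name):
# samtools depth -a sample.bam | mean_depth
```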

best, Xun