xunchen85 / VIcaller

A software to detect virome-wide integrations
14 stars 5 forks source link

Vicaller.pl 1164 line is strange. #2

Closed byeonggill closed 4 years ago

byeonggill commented 5 years ago

Hi, I have three issue.

  1. I find a strange line at Vicaller.pl I don't know perl well but line 1164 seems to something wrong. image system ("perl ${directory}Scripts/Extract_fasta.pl $GI $virus_genome >${directory}Database/GI/${GI}.fa");

  2. and file name is incorrect: "Scripts/Extract_read_informationpl "--> "Scripts/Extract_read_information.pl" Are the two changes correct?

  3. I have trouble running validate function. I get the following warning: Warning: [blastn] Query is Empty! Warning: [blastn] Query is Empty! Warning: [blastn] Query is Empty!

but my config and blastn probably have no problem. so, I take a look at Vicaller.pl blastn run using query "${input_sampleID2}_aligned_both.fas" That means "${input_sampleID2}_aligned_both.fas" is not created.

${input_sampleID2}_aligned_both.fas is made this function.

validate function

sub v_obtain_seq { system ("perl ${directory}Scripts/Extract_specific_loci_final_reads.pl ${input_sampleID}_f2 ${input_sampleID2}.virus_f2 >${input_sampleID2}.information"); my $cmd=q(awk '{print$8}'); system ("$cmd ${input_sampleID2}.information |sort |uniq >${input_sampleID2}.id"); system ("perl ${directory}Scripts/Extract_fastq.pl -f ${input_sampleID}_1.1fuq -b ${input_sampleID2}.id -o ${input_sampleID2}_aligned_1.fuq"); system ("perl ${directory}Scripts/Extract_fastq.pl -f ${input_sampleID}_2.1fuq -b ${input_sampleID2}.id -o ${input_sampleID2}_aligned_2.fuq"); system ("perl ${directory}Scripts/Extract_fuq_split.pl -f ${input_sampleID}_1sf.fuq -b ${input_sampleID2}.id -o ${input_sampleID2}_aligned_sf.fuq"); system ("perl ${directory}Scripts/Convert_fastq_to_fasta.pl ${input_sampleID2}"); system ("perl ${directory}Scripts/Convert_fastq_to_fasta_for_split_read.pl ${input_sampleID2}"); system ("perl ${directory}Scripts/Convert_fastq_to_fasta.pl ${input_sampleID2}"); system ("perl ${directory}Scripts/Convert_fastq_to_fasta_for_split_read.pl $input_sampleID2"); system ("cat ${input_sampleID2}_aligned.fas ${input_sampleID2}_aligned_sf.fas >${input_sampleID2}_aligned_both.fas"); }

In conclusion, i wonder ${input_sampleID2}.virus_f2 is a previously generated file. Or is there something else wrong?

  1. Among the parameter values ​​in the validation function, Is the virus name only an abbreviation? Should not the full name in the output file 'Candidate_virus' column is possible?
xunchen85 commented 5 years ago

Thanks for running the VIcaller_v1.1.

  1. Yes, that path is the one i use on our server, i have revised it correspondingly.

  2. I have delete the Extract_read_information.pl script which is only used when i debug the tool. And the two changes you made are correct.

  3. There is a file is not created, I have revised the VIcaller.pl script. I also corrected the script of "Extract_fuq_split.pl" which have a bug about the "\n".

With those changes, you should be able to run the validation function now.

Please let me know if you found any other bugs. Thanks.

byeonggill commented 5 years ago

I appreciate your help.

I download revised scripts and running validate function and calculate function. but, I got the same error.

My data is targeted sequencing data. and I attach output file. fcd_Genareal_1.xlsx

file lists: image

validate cmd: nohup perl /data/program/VIcaller_v1.1/VIcaller.pl validate -i 11FCD -c /data/program/VIcaller_v1.1/VIcaller.config -t 4 -S 11FCD_14_75126278_75126928_abelson_murine_leukemia_virus_61487 -G 61487 -V murine_leukemia_virus > validate.out validate.out: image

calculate cmd: nohup perl /data/program/VIcaller_v1.1/VIcaller.pl calculate -i 11FCD -c /data/program/VIcaller_v1.1/VIcaller.config -t 4 -F .fastq.gz -C 14 -P 75126936 -B 2 -N 147 > cal.out cal.out: image

xunchen85 commented 5 years ago

A small correction was made for the VIcaller.pl script, can you try the latest one.

May I ask if you enriched specific viruses for target sequencing? What is the read length?

byeonggill commented 5 years ago

I'm sorry that the answer was delayed. The validate function seems to work normally. but, calculate function have error. In my samtools version, system ("${samtools_d}samtools view ${input_sampleID}_s_h.bam chr${Chr}:${position1}-${position2} >${inputsampleID}${Chr}_${Position}_h.sort.sam") chr1 not 1

Anyway, why does it change to 1 even if I put another number in the c parameter?

cmd: nohup perl /data/program/VIcaller_v1.1/VIcaller.pl calculate -i PYH024 -t 4 -F _s_h.bam -C 4 -I -P 112387642 -B 2 -N 96 > cal.out

this is my revision. image

this is results. image

xunchen85 commented 5 years ago

If the input file is FASTQ format, you can use "" for -F parameter. Or you can change to "-F _h.bam" if you have the PYH024_h.bam file indexed.

For the chromosome ID issue, because the "chr" for the chromosome IDs are deleted in the output file in the current VIcaller version, thus you can add the "chr" in the script to extract corresponding reads from the BAM file for now.

I will update the VIcaller.pl in the next version to support any chromosome ID format soon.

byeonggill commented 5 years ago

My previous question is even if i take parameter -C 4, VIcaller receive chr1.

Sorry but i have another question. I want to modify the virus library. I want viral genome library composed only hepatitis. so, I modified viral.fa, viral.taxonomy and viral.virus_list files.
but, virus names in output files were recorded of unknown. virus GI seem to normally.

Please let me know how to modify viral genome library in detail.

xunchen85 commented 5 years ago

Sorry for my late reply.

Really appreciate for your feedback, i have correct the bug in the VIcaller main script. Specifically, you can either downloaded the latest VIcaller script, or directly revised the line 8 defining the chromosome ID as follow: " "C|Chr=s" => \$Chr,"

For your second question: Can you show me an example of your modified three files? If you are able to detect the virus GI in the FASTA file, but cannot find the corresponding virus name, it means something wrong with your modified virus_list, or viral.taxonomy file; Or it may means that you did not specify the correct paths for those two files in the config file.

virus_list file: only have two column, 1) GI #, and 2) virus_name without space; viral.taxonomy: 11 columns separated by "$": column 1: GI; column 2: GB, column 3: length; column 6: original virus name; column 9: original virus name; column 10: modified_virus_name (no space)

As an example of HBV, you can modify the corresponding columns, and you should be able to find all the info from the NCBI NT database using the GI:

110225259$AB246317.1$3185$Hepatitis B virus; Viruses; Retro-transcribing viruses; Hepadnaviridae; Orthohepadnavirus.$Hepatitis B virus DNA, complete genome, isolate: TA112.$Hepatitis B virus$TA112$0$Hepatitis B virus$hepatitis_b_virus$0

byeonggill commented 5 years ago

thank you, I'm running it just using non-modified virome reference.

but, I have another question, I have some problem when using parameter -r in detect function such as Repeatmasker is so confused and i can't download Rebase database and TRF output position is strange.

so, I did not use parameter -r. but, I'm worried that this might be big difference. Can you give me advice on removing repeat region?

xunchen85 commented 5 years ago

For the repeatmasker, you need to apply for Rebase database and then use it with repeatmasker follow their online manuscript.

You can show some example of TRF output, I can check it out for you.

I would suggest you to disable Repeatmasker function in the script, and then only use TRF and DUST to remove repeat regions. I would be happy to help you with it if you want.