Extract spanning reads and junction reads around integration site

yeli7068 commented 5 years ago

Hi Dr. Xun

Thanks for your works.

Could you give some suggestions to extract spanning reads and junction reads around integration site?

According to the manual, the file seq_1_24020575_24020787_hum an_papillomavirus_type_2189314 04.CS3 might be the one I am looking for. But It was not produced.

Here the codes to run example data: VIcaller.pl detect -d WGS -i seq -f .fastq.gz -m standard -t 12 -Q 20 -a -r -c /VIcaller_v1.1/VIcaller.config

The seq.output looks fine: ` Sample_ID VIcaller_mode QC Reciprocal_alignment Candidate_virus GI Chr. Start End No._chimeric_reads No._split_reads Upstream_breakpoint_on_human Downstream_breakpoint_on_human Upstream_breakpoint_on_virus Downstream_breakpoint_on_virus Information_of_both_upstream_and_downstream_breakpoints Integration_site_in_the_human_genome Integration_allele_fraction No._reads_supporting_nonVI No._reads_supporting_VI Average_alignment_score Is_cell_line_contamination Is_vector Validation_chimeric_confident Validation_chimeric_weak Validation_chimeric_false Validation_split_confident Validation_split_weak Validation_split_false seq standard yes yes human_papillomavirus_type 218931404 1 23694041 23694297 0 0 23694257 23694211 7876 7905 D(+-);D(-+) 23694234 - - 38 64.1578947368421 - - - -

`

Best Regards,

Yang

xunchen85 commented 5 years ago

Hi Yang,

Thanks for your interest.

1) You can find out the read ID from the file seq.virus_f2 (column 31, e.g. read_200_3266186|1|read_200_1803970|1|read_200_6441240|1), the supporting reads of all VIs in FASTQ format can be found in seq_1.1fuq, seq_2.1fuq, and seq_sf.1fuq (split reads).

2) seq.CS3 can also be produced after the running of the validation function per VI. After you run it, the supporting reads will be separate by each validated VI location.

Best, Xun

yeli7068 commented 5 years ago

Thx for your response.

There are some difference between reads reported in .1fuq and .output. The manual suggested the reads reported in .1fuq are potential. Could you please give me some details about the potential? The soft-clip part of reads are larger than 20 bp?

Another question is for the .visulization file. The seqs reported are not the same in the specific region. For example, the region of human part started from 43200, all the seqs reported are very different to each other. To my understandings, the visulization file suggested these seqs shared this region.

Cheers,

Yang

xunchen85 commented 5 years ago

Sorry for my late reply.

As you said, reads reported in .1fuq are potential. So you can use the read ID reported in the seq.virus_f2 file as i mentioned previously to extract the read sequences.

The soft-clip part of reads are larer than 20 bp.

For the .visualization file, you can use the start and end position and reported GI in the .output file to extract corresponding records in the .visualization file. PS: make sure the reference build is the same. The current example output is using hg19 as reference.

You also can send me the .output file and .visualization file to my email, I will be more than happy to check it out.

Cheers, Xun

xunchen85 / VIcaller

Extract spanning reads and junction reads around integration site #4