wurmlab / flo

Same species annotation lift over pipeline.
96 stars 28 forks source link

flo does not work if run with '-qMask' blat option is included in the .yaml file #3

Open TakashiKoyama opened 8 years ago

TakashiKoyama commented 8 years ago

Hi,

I recently started liftover my gff with flo. Since I concern if repeats in the genome affect the liftover result, I would like to try running flo with target genome with and without RepeatMask. For the target genome with RepeatMask, I added 'qMask=mygenome.fasta.out' where mygenome.fasta.out is outfile of RepeatMasker.

The flo seems stuck at blat step because target genome is split and mygenome.fasta.out is not. Could you give me any solutions?

Thanks, Takashi

yeban commented 8 years ago

Do you mean it is taking too long or that the program aborts at blat step. Is it because mygenome.fasta.out can't be found or is it because it should have been split as well. I haven't worked with repeat masked data, so I am not sure what to expect.

TakashiKoyama commented 8 years ago

Hi,

Thank you for your quick response. I mean the program aborts. The log is as below.

mkdir run
cp ../source/source.fasta run/source.fa
cp ../target/mygenome.fasta.masked run/target.fa
faToTwoBit run/source.fa run/source.2bit
faToTwoBit run/target.fa run/target.2bit
twoBitInfo run/source.2bit stdout | sort -k2nr > run/source.sizes
twoBitInfo run/target.2bit stdout | sort -k2nr > run/target.sizes
faSplit sequence run/target.fa 35 run/chunk_
parallel --joblog run/joblog.faSplit -j 8 -a run/joblst.faSplit
946 pieces of 1371 written
2718 pieces of 3871 written
3401 pieces of 4619 written
3154 pieces of 4711 written
2942 pieces of 4086 written
3777 pieces of 5254 written
58962 pieces of 67611 written
59191 pieces of 68066 written
parallel --joblog run/joblog.blat -j 8 -a run/joblst.blat
Loaded 631070304 letters in 358189 sequences
Searched 18871550 bases in 3777 sequences
Loaded 631070304 letters in 358189 sequences
Searched 4728198 bases in 946 sequences
Loaded 631070304 letters in 358189 sequences
Searched 13581909 bases in 2718 sequences
Loaded 631070304 letters in 358189 sequences
Searched 14694033 bases in 2942 sequences
Loaded 631070304 letters in 358189 sequences
Searched 15754023 bases in 3154 sequences
Loaded 631070304 letters in 358189 sequences
Searched 16996379 bases in 3401 sequences
Loaded 631070304 letters in 358189 sequences
Searched 295953418 bases in 59191 sequences
Loaded 631070304 letters in 358189 sequences
Searched 294805428 bases in 58962 sequences
parallel --joblog run/joblog.liftUp -j 8 -a run/joblst.liftUp
Got 3401 lifts in run/chunk_01.fa.lft
Lifting run/chunk_01.fa.psl
run/chunk_01.fa.psl is empty
Got 3154 lifts in run/chunk_04.fa.lft
Lifting run/chunk_04.fa.psl
run/chunk_04.fa.psl is empty
Got 2718 lifts in run/chunk_00.fa.lft
Lifting run/chunk_00.fa.psl
run/chunk_00.fa.psl is empty
Got 2942 lifts in run/chunk_03.fa.lft
Lifting run/chunk_03.fa.psl
run/chunk_03.fa.psl is empty
Got 946 lifts in run/chunk_05.fa.lft
Lifting run/chunk_05.fa.psl
run/chunk_05.fa.psl is empty
Got 3777 lifts in run/chunk_02.fa.lft
Lifting run/chunk_02.fa.psl
run/chunk_02.fa.psl is empty
Got 59191 lifts in run/chunk_07.fa.lft
Lifting run/chunk_07.fa.psl
run/chunk_07.fa.psl is empty
Got 58962 lifts in run/chunk_06.fa.lft
Lifting run/chunk_06.fa.psl
run/chunk_06.fa.psl is empty
parallel --joblog run/joblog.axtChain -j 8 -a run/joblst.axtChain
run/chunk_02.fa.psl.lifted is empty
run/chunk_01.fa.psl.lifted is empty
run/chunk_03.fa.psl.lifted is empty
run/chunk_05.fa.psl.lifted is empty
run/chunk_00.fa.psl.lifted is empty
run/chunk_07.fa.psl.lifted is empty
run/chunk_04.fa.psl.lifted is empty
run/chunk_06.fa.psl.lifted is empty
parallel --joblog run/joblog.chainSort -j 8 -a run/joblst.chainSort
chainMergeSort run/*.chn.sorted | chainSplit run stdin -lump=1
mv run/000.chain run/combined.chn.sorted
rake aborted!
Errno::ENOENT: No such file or directory @ rb_file_s_rename - (run/000.chain, run/combined.chn.sorted)
/home/koyama/Documents/GenomeAssembly/13LiftOver/flo/wRM/Rakefile:189:in `block in <top (required)>'
/home/koyama/Documents/GenomeAssembly/13LiftOver/flo/wRM/Rakefile:211:in `block in <top (required)>'
Tasks: TOP => run/liftover.chn
(See full trace by running task with --trace)

The problem seems that run/chunk_*.fa.psl is empty. So I checked joblst.blat and the code is as below.

blat -noHead -fastMap -tileSize=12 -minIdentity=98 -qMask=../target/mygenome.fasta.out run/source.fa run/chunk_01.fa run/chunk_01.fa.psl

In the ../target/mygenome.fasta.out, position of the repeats are shown as below.

$ head ../target/mygenome.fasta.out
   SW   perc perc perc  query     position in query                 matching          repeat                  position in repeat
score   div. del. ins.  sequence  begin     end            (left)   repeat            class/family        begin   end    (left)      ID

   15    0.0  0.0  0.0  chr1              1        17   (5013216) + (GT)n             Simple_repeat             1     17     (0)      1  
   14   15.9  0.0  0.0  chr1           1149      1176   (5012057) + (ATA)n            Simple_repeat             1     28     (0)      2  
   42    0.0  0.0  0.0  chr1           3000      3035   (5010198) + (GT)n             Simple_repeat             1     36     (0)      3  
   33    0.0  0.0  0.0  chr1           3415      3442   (5009791) + (GT)n             Simple_repeat             1     28     (0)      4  
   37    5.1  0.0  0.0  chr1           3871      3911   (5009322) + (GT)n             Simple_repeat             1     41     (0)      5  
   22    0.0  0.0  0.0  chr1           4073      4096   (5009137) + (CA)n             Simple_repeat             1     24     (0)      6  
   42    0.0  0.0  0.0  chr1           6914      6949   (5006284) + (CT)n             Simple_repeat             1     36     (0)      7

I think the problem is conflict between ../target/mygenome.fasta.out and run/chunk_*.fa.

Thanks, Takashi

yeban commented 8 years ago

Very likely - because of the chunking, the coordinates specified in RepeatMasker's output no longer tally with target assembly's coordinate.

I think you will need to get your hands dirty here. Relevant file here is Rakefile. The first half is helper fucntions, and the second half the scripting. I would comment out L166-168, remove -fastMap from blat_opts and try again.

TakashiKoyama commented 8 years ago

Mmm, I still got errors as below.

mkdir run
cp ../source/source.fasta run/source.fa
cp ../target/mygenome.fasta.masked run/target.fa
faToTwoBit run/source.fa run/source.2bit
faToTwoBit run/target.fa run/target.2bit
twoBitInfo run/source.2bit stdout | sort -k2nr > run/source.sizes
twoBitInfo run/target.2bit stdout | sort -k2nr > run/target.sizes
faSplit sequence run/target.fa 35 run/chunk_
parallel --joblog run/joblog.blat -j 8 -a run/joblst.blat
Loaded 631070304 letters in 358189 sequences
Searched 26256550 bases in 5 sequences
Query sequence chr10 has size 6424017, it might take a while.
Query sequence chr12 has size 8541060, it might take a while.
Query sequence chr14 has size 5470449, it might take a while.
Loaded 631070304 letters in 358189 sequences
Searched 6853198 bases in 1 sequences
Query sequence chr24 has size 6853198, it might take a while.
Loaded 631070304 letters in 358189 sequences
Searched 23539023 bases in 5 sequences
Query sequence chr19 has size 5582122, it might take a while.
Loaded 631070304 letters in 358189 sequences
Searched 23086379 bases in 5 sequences
Query sequence chr5 has size 5469732, it might take a while.
Query sequence chr9 has size 5583017, it might take a while.
Loaded 631070304 letters in 358189 sequences
Searched 20414033 bases in 4 sequences
Query sequence chr15 has size 6390933, it might take a while.
Query sequence chr18 has size 5455016, it might take a while.
Loaded 631070304 letters in 358189 sequences
Searched 19346909 bases in 4 sequences
Query sequence chr1 has size 5013233, it might take a while.
Query sequence chr2 has size 5346433, it might take a while.
Query sequence chr4 has size 5249697, it might take a while.
Loaded 631070304 letters in 358189 sequences
Searched 338050428 bases in 1 sequences
Query sequence chrUn1 has size 338050428, it might take a while.
Loaded 631070304 letters in 358189 sequences
Searched 340328418 bases in 1 sequences
Query sequence chrUn2 has size 340328418, it might take a while.
parallel --joblog run/joblog.liftUp -j 8 -a run/joblst.liftUp
Couldn't open run/chunk_01.fa.lft , No such file or directory
Couldn't open run/chunk_07.fa.lft , No such file or directory
Couldn't open run/chunk_04.fa.lft , No such file or directory
Couldn't open run/chunk_00.fa.lft , No such file or directory
Couldn't open run/chunk_06.fa.lft , No such file or directory
Couldn't open run/chunk_03.fa.lft , No such file or directory
Couldn't open run/chunk_02.fa.lft , No such file or directory
Couldn't open run/chunk_05.fa.lft , No such file or directory
rake aborted!
Command failed with status (8): [parallel --joblog run/joblog.liftUp -j 8 -...]
/home/koyama/Documents/GenomeAssembly/13LiftOver/flo/wRM/Rakefile:143:in `parallel'
/home/koyama/Documents/GenomeAssembly/13LiftOver/flo/wRM/Rakefile:174:in `block in <top (required)>'
/home/koyama/Documents/GenomeAssembly/13LiftOver/flo/wRM/Rakefile:211:in `block in <top (required)>'
Tasks: TOP => run/liftover.chn
(See full trace by running task with --trace)

I might have to seriously look into or give up to use masked one. Since I have never use Ruby, it takes a while. If you find out a solution, please let me know.

Takashi

yeban commented 8 years ago

My bad. Since we aren't chunking liftUp doesn't need to happen either and thus L174-176 should be commented out as well.

Repeats mess up alignment edges, causing the resulting chains to be miss out parts of an annotation. Annotations that are only partially covered by a chain are not lifted. I have observed such cases in our fire ant dataset when trying to understand why some annotations aren't lifted, but don't have an exact count.

I am very interested in your experiment - understanding the effect of repeat masking on the lift over process. But it is unlikely that this is something I will pursue myself in the foreseeable future. I am happy to help if you would like to continue and incorporate a solution in the pipeline once it presents itself (I am thinking of adding an option to disable chunking, but can make the change only if it works).