wososa / PSI-Sigma

PSI-Sigma
Other
35 stars 10 forks source link

Lots of errors are generated, files keep getting longer and longer names #8

Closed angel-bee2018 closed 4 years ago

angel-bee2018 commented 4 years ago

Hi! Thank you for developing this tool! I am very excited to include it in my analysis! However, I haven't been able to get it up and running. I have updated to the latest version, and I have all the latest versions of the prerequisite tools installed. I ran STAR according to the guide in the wiki and following the instructions, I created groupa.txt, groupb.txt etc. but it always seems to throw the following errors:

[E::hts_idx_push] Unsorted positions on sequence #23: 14104 followed by 14067
[E::sam_index] Read 'HWI-ST1213:112:C0NNEACXX:3:1101:1105:40419' with ref_name='MT', ref_length=16569, flags=163, pos=14067 cannot be indexed
samtools index: failed to create index for "BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.SortedbyName.bam.tmp.0032.bam"
BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.SortedbyName.bam.tmp.0033.bam.bai doesn't exist. Creating a new index...
[E::hts_idx_push] Unsorted positions on sequence #24: 155942013 followed by 155939731
[E::sam_index] Read 'HWI-ST1213:112:C0NNEACXX:3:1101:1130:62488' with ref_name='X', ref_length=156040895, flags=163, pos=155939731 cannot be indexed
samtools index: failed to create index for "BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.SortedbyName.bam.tmp.0033.bam"
BM_MSC_to_OB_12h_r1_Aligned.sortedByCoord.out.bam.bai is ready.
BM_MSC_to_OB_12h_r2_Aligned.sortedByCoord.out.bam.bai is ready.
BM_MSC_to_OB_12h_r3_Aligned.sortedByCoord.out.bam.bai is ready.
Generating mapping file...
Checking splice-junction files...
Generating .SJ.out.tab will need to re-sort BM_MSC_to_OB_12h_r1__STARpass1:.SJ.out.tab.SJ.out.tab file by read names.
It will consume a lot of time, do you want to proceed? (Y/N)[main_samview] fail to read the header from "BM_MSC_to_OB_12h_r1__STARpass1:.SJ.out.tab.SJ.out.tab".
Starting to sort BM_MSC_to_OB_12h_r1__STARpass1:.SJ.out.tab.SJ.out.tab by read names...
samtools sort: failed to read header from "BM_MSC_to_OB_12h_r1__STARpass1:.SJ.out.tab.SJ.out.tab"
[E::hts_open_format] Failed to open file "BM_MSC_to_OB_12h_r1__STARpass1:.SJ.out.tab.SJ.out.tab.SortedbyName.bam" : No such file or directory
samtools view: failed to open "BM_MSC_to_OB_12h_r1__STARpass1:.SJ.out.tab.SJ.out.tab.SortedbyName.bam" for reading: No such file or directory
Generating .SJ.out.tab will need to re-sort BM_MSC_to_OB_12h_r2_.SJ.out.tab file by read names.
It will consume a lot of time, do you want to proceed? (Y/N)[main_samview] fail to read the header from "BM_MSC_to_OB_12h_r2_.SJ.out.tab".
Starting to sort BM_MSC_to_OB_12h_r2_.SJ.out.tab by read names...
samtools sort: failed to read header from "BM_MSC_to_OB_12h_r2_.SJ.out.tab"
[E::hts_open_format] Failed to open file "BM_MSC_to_OB_12h_r2_.SJ.out.tab.SortedbyName.bam" : No such file or directory
samtools view: failed to open "BM_MSC_to_OB_12h_r2_.SJ.out.tab.SortedbyName.bam" for reading: No such file or directory
Generating .SJ.out.tab will need to re-sort BM_MSC_to_OB_12h_r1_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab file by read names.
It will consume a lot of time, do you want to proceed? (Y/N)[main_samview] fail to read the header from "BM_MSC_to_OB_12h_r1_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab".
Starting to sort BM_MSC_to_OB_12h_r1_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab by read names...
samtools sort: failed to read header from "BM_MSC_to_OB_12h_r1_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab"
[E::hts_open_format] Failed to open file "BM_MSC_to_OB_12h_r1_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab.SortedbyName.bam" : No such file or directory
samtools view: failed to open "BM_MSC_to_OB_12h_r1_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab.SortedbyName.bam" for reading: No such file or directory
Generating .SJ.out.tab will need to re-sort BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab file by read names.
It will consume a lot of time, do you want to proceed? (Y/N)[main_samview] fail to read the header from "BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab".
Starting to sort BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab by read names...
samtools sort: failed to read header from "BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab"
[E::hts_open_format] Failed to open file "BM_MSC_to_OB_12d_r3_Aligned.sortedByCoord.out.bai.SJ.out.tab.SJ.out.tab.SortedbyName.bam" : No such file or directory

The files outputted look very bizzare like this:

image

I have confirmed that STAR successfully outputted the Aligned.sortedByCoord.out.bam.bai, Aligned.sortedByCoord.out.bam and the *SJ.out.tab files, yet, the script seems to still throw errors that the .bam files can't be read or aren't sorted etc...

I am at a loss to explain this. I would greatly appreciate any assistance in troubleshooting this!

Kind Regards

wososa commented 4 years ago

In the groupa.txt file, could you please try putting the full .bam file name? Could you modify the “-“ to “.” in the bam file name? For example, BM_MSC_to_OB_12h_r1.Aligned.sortedByCoord.out.bam. It should be “.Aligned” instead of “_Aligned”. Thanks for reporting the bug.

Thanks, Woody

angel-bee2018 commented 4 years ago

Hi, thanks for the fix, it runs past that point now! However, I have run into a separate problem. I'm running into these frightening messages at the *IR.out generation step. Has something gone wrong?

The output:

Getting intron reads....
Checking BM_MSC_to_OB_12h_r1.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12h_r1.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12h_r1
chromosome format = 0
read format = paired-end
Doing... chromosome 1
Undefined subroutine &main::median called at (eval 1) line 229.
Checking BM_MSC_to_OB_12h_r2.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12h_r2.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12h_r2
chromosome format = 0
read format = paired-end
mkdir: cannot create directory ‘_wososatmp’: File exists
Doing... chromosome 1
Undefined subroutine &main::median called at (eval 1) line 229.
Checking BM_MSC_to_OB_12d_r1.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12d_r1.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12d_r1
chromosome format = 0
read format = paired-end
mkdir: cannot create directory ‘_wososatmp’: File exists
Doing... chromosome 1

Regards, Angel

angel-bee2018 commented 4 years ago

Update: I rolled back to PSI-Sigma 1.7 and just ran that - The "undefined subroutine" error was not present and the r_ir.txt and r_ir_sorted.txt files were able to be generated. However, the *IR.out.tab files are all empty.

wososa commented 4 years ago

Hi @angel-bee2018 , Sorry for the late reply. Could you please try version 1.8 instead? The error "subroutine &main::median" looks strange. Some codes in 1.7 might be missing. Thanks, Woody

angel-bee2018 commented 4 years ago

Hi Woody @wososa , version 1.8 was acting up, that's why I tried every single one of the older versions to no avail.

angel-bee2018 commented 4 years ago

UPDATE: I re-downloaded PSI-Sigma 1.8 and ran it again. This time, the error is not present, but the temporary intron files are empty.

Number of valid .SJ.out.tab files = 6
time = 0.27 mins
===Database spent 0.3161 hours.===
Getting intron reads....
Checking BM_MSC_to_OB_12d_r2.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12d_r2.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12d_r2
read format = paired-end
Doing... 1
Output... 1
Doing... 10

image

wososa commented 4 years ago

@angel-bee2018 Thanks for trying. Is the chromosome name in BM_MSC_to_OB_12d_r2.Aligned.sortedByCoord.out.bam like "1"?

angel-bee2018 commented 4 years ago

@wososa I don't think so. I am aware that there can be errors where the chromosome name is in the format "chr1" but i couldn't find any "chr" anywhere.

> samtools view /srv/scratch/z3463471/PGNEXUS_kassem_MSC/Kassem_OB/mapping_pass_1/BM_MSC_to_OB_12d_r2.Aligned.sortedByCoord.out.bam | head

HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 163 1   10563   0   67M2D34M    =   181305  170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:1  AS:i:178    nM:i:7
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   10839   170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:2  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   181276  170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:3  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   10738   170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:4  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   10709   170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:5  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   10680   170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:6  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   181189  170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:7  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   10651   170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:8  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 419 1   10563   0   67M2D34M    =   181160  170963  CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCC   B?<BDDF3CD1AC<AG:@?:0??FFAGGI<;AF@FDC;7CC57?E>B</=A>;>;?25;?BB;B>BC##################################   NH:i:9  HI:i:9  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 339 1   10651   0   27M170777N32M1I39M2S    =   10563   -170963 GCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGCACGCCTACGGGTG   ##############################A@AA:C@33@:@;8755==<?B?@@>E=?EAGGIIHGBG>HF>GGHEG@FF;IHBDAEDHDBDADDDB@@@   NH:i:9  HI:i:8  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 339 1   10680   0   27M170748N32M1I39M2S    =   10563   -170963 GCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGCACGCCTACGGGTG   ##############################A@AA:C@33@:@;8755==<?B?@@>E=?EAGGIIHGBG>HF>GGHEG@FF;IHBDAEDHDBDADDDB@@@   NH:i:9  HI:i:6  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 339 1   10709   0   27M170719N32M1I39M2S    =   10563   -170963 GCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGCACGCCTACGGGTG   ##############################A@AA:C@33@:@;8755==<?B?@@>E=?EAGGIIHGBG>HF>GGHEG@FF;IHBDAEDHDBDADDDB@@@   NH:i:9  HI:i:5  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 339 1   10738   0   27M170690N32M1I39M2S    =   10563   -170963 GCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGCACGCCTACGGGTG   ##############################A@AA:C@33@:@;8755==<?B?@@>E=?EAGGIIHGBG>HF>GGHEG@FF;IHBDAEDHDBDADDDB@@@   NH:i:9  HI:i:4  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1316:10480:22620 339 1   10839   0   27M170589N32M1I39M2S    =   10563   -170963 GCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGCACGCCTACGGGTG   ##############################A@AA:C@33@:@;8755==<?B?@@>E=?EAGGIIHGBG>HF>GGHEG@FF;IHBDAEDHDBDADDDB@@@   NH:i:9  HI:i:2  AS:i:177    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:2306:8598:2244   355 1   12008   0   22S79M  =   12028   121 CACGTGGCATGTTACCACTGTGAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGG   CCCFFFFFHHHHHJJJIJJFHIHJIAFFHIJIIIJJJIJJJJJJJJIJJJJJJHHJJIJJJ3CGEIIIHHHHHEFCEFAE;ABDB;ACDCDDDDDCAAA<B   NH:i:7  HI:i:6  AS:i:178    nM:i:0

> samtools view /srv/scratch/z3463471/PGNEXUS_kassem_MSC/Kassem_OB/mapping_pass_1/BM_MSC_to_OB_12d_r2.Aligned.sortedByCoord.out.bam | tail

HWI-ST1213:112:C0NNEACXX:5:1304:18188:63259 403 KI270718.1  29521   0   101M    =   29512-110   TACAATAGATGCTTAATCAGTGCCTTTAATAGGAAGTTAGAAACTCCCAATCCAATCAACAAGGCTTTGATTCTACCTCCTAAGTACTACCCAGATCACAGCCDEEDDDCEEECEEFEFDFHHEGHGIJJIJIJIJIJJIGGIHIIIIJJHHEIJIGIGGJJJJJIJJIIGGIHFEGEBGIGJHEGEHFCGHHHFFFFDC@C  NH:i:5HI:i:3    AS:i:200    nM:i:0
HWI-ST1213:112:C0NNEACXX:5:2315:16342:75458 403 KI270718.1  29524   0   100M1S  =   29449-175   AATAGATGCTTAATCAGTGCCTTTAATAGGAAGTTAGAAACTCCCAATCCAATCAACAAGGCTTTGATTCTACCTCCTAAGTACTACCCAGATCACAGACG@B>DC>:CACDCCC@@?@6C;EAA?E?GEIGHHCGA@F9IIH=ECIFED??>CGHGB>D;BCHHEIHHHCEADF>C<?EJHICF<C@DBFFFA>FDDB?=B  NH:i:5HI:i:3    AS:i:197    nM:i:1
HWI-ST1213:112:C0NNEACXX:5:1114:15306:45341 403 KI270718.1  29596   1   47M1D54M    =   29507   -191    CCTCCTAAGTACTACCCAGATCACAGAAAATCCAAATATCAGAACTAGGGGCTGGGCAGAGAGGACAAATCATCTATTAGGGAGTGGGACAGAAAGTGGAA   >>3ACDCEC@A=??D@DHHHCA;CJHGIHC=:IGIJHGGJGHGFDDJIHGCFJJJIJIIIIHEHGIJIIFHDEJIIIIJGJIIIJIHGFHFHBFDFFFCCCNH:i:3 HI:i:3  AS:i:194    nM:i:1
HWI-ST1213:112:C0NNEACXX:5:1308:8894:98570  419 KI270718.1  29600   0   43M1D58M    =   29647   148 CTAAGTACTACCCAGATCACAGAAAATCCAAATATCAGAACTAGGGGCTGGGCAGAGAGGACAAATCATCTATTAGGGAGTGGGACAGAAAGTGGAAACAT   B?@DD=B?<?FFHIE@?FBFBBFDFDHHIE>GG9FBDBDCFADFBHEHJIIBBBFCEEEHEEH?CAC;CB?BDF3;@AC9,;??BAB?:CC@:AC>AC>>>NH:i:5 HI:i:4  AS:i:192    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:1308:8894:98570  339 KI270718.1  29647   0   101M    =   29600-148   GCTGGGCAGAGAGGACAAATCATCTATTAGGGAGTGGGACAGAAAGTGGAAACATTACAAAGGAGCAAGTAGGCTCAGAACAGAGGGGAGAGTAGTACTGG?;CBCCCCCCEEEDBD>?;EEEAGD==;GHCDGBFFHGGIIHGFBDGFD9*?ECF?:IGIIIEHG@EHHGGGF>CA<F?CIHF?IHBHDHFDHDB?BD@@@  NH:i:5HI:i:4    AS:i:192    nM:i:2
HWI-ST1213:112:C0NNEACXX:5:2102:7506:96478  355 KI270718.1  29704   1   101M    =   29748145    CAAAGGAGCAAGTAGGCTCAGAACAGAGGGGAGAGTAATACTGGGGAAACTCCCTACTGAGGACAGGTTTGCCACAGTGGATATGTATGTGATGCTTTTGTCCCFFFFFHHHHHJJJJJJJJJJJJJIJJJJGHFHBGGJJJJJJIJIIGIJJIJJJJIHHHHHFFFFACEDEDDDDDCDDCDDDDDEFEDEEEDDDEDDDB  NH:i:3HI:i:3    AS:i:200    nM:i:0
HWI-ST1213:112:C0NNEACXX:5:2102:7506:96478  403 KI270718.1  29748   1   101M    =   29704-145   GGAAACTCCCTACTGAGGACAGGTTTGCCACAGTGGATATGTATGTGATGCTTTTGTCCTCATGGGCTGGAATGGAAGCTGATAAAGAAGTTCAGACGCTACEDC?DCDDDEDDEEEEEEFFFFHHHHHJIJJJJIJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJIJJJIIJJJJJJIJJJJJJJJHHHHHFFFFFCCC  NH:i:3HI:i:3    AS:i:200    nM:i:0
HWI-ST1213:112:C0NNEACXX:5:2106:18822:50243 99  KI270718.1  29810   1   2S99M   =   29995286    CGTGGGTTGGAATGGAAGCTGATAAAGAAGTTCAGATGCTACATGTTGCTTAGGTCTGCTTTGTTGAAGCCTGTCTCTTTCAGAGTTCCTAAACACAATAT@@?DFF@DHHBFHGHIIIJIIIJJIGAFGHFCECEHGIJHIIIDHHGGIJJEIGBFG>GIGGCHGE@@EHIEHHF=BDFFDCDACA@C@CA:AC@?BDD@A  NH:i:3HI:i:2    AS:i:188    nM:i:5
HWI-ST1213:112:C0NNEACXX:5:1214:1816:57743  163 KI270718.1  29879   255 2S99M   =   29994214    CCTGTCTCTTTCAGAGTTCCTAAACACAATATTCCTGTGGCTTCTAATCCCAGTGAGTGCTCCAAATCCAGGCTGTGTATCAGATGCCACGGGAAATTCTG@@@FDDFFHHHHHGEHCFEHGIJIIGHHEHGGIIIJIHEEGHGEGDDGGIEDGHHIJ<<FGHGE;F@GHEGJI@HCE=EEHEHC>>@BDCCDB>/<CCDDD  NH:i:1HI:i:1    AS:i:190    nM:i:3
HWI-ST1213:112:C0NNEACXX:5:2306:10174:23925 163 KI270718.1  29901   3   101M    =   29937137    CACAATATTCCCGTGGCATCTAATCCCAGTGAGTGCTCCGAATCCAGGCTGTGTATCAGATGCCACGGGAAATTCTGCCTCAGGACTGAGGTTGGTTCAAGC@CFFFFFHHHHHIJJIIJJJJJJJJJIJHIIJEGHIIJ0:DHIJJJJJJGGHEIJIJJJJJJJJJHHFDFDEEEEECDDDDDDCDDDCAB?BDD<ACDEC  NH:i:2HI:i:1    AS:i:198    nM:i:1
HWI-ST1213:112:C0NNEACXX:5:2306:10174:23925 83  KI270718.1  29937   3   101M    =   29901-137   TCCAAATCCAGGCTGTGTATCAGATGCCACGGGAAATTCTGCCTCAGGACTGAGGTTGGTTCAAGCATCTGGATGATGTCAAAAGTCCACGTTGTGTGCTGCDDDDDDDDDDDDDDDDDEDDDDDDACDFFFHHHHHHHJJGIIFJJIIHJIJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIFHHHHFFFFFCCC  NH:i:2HI:i:1    AS:i:198    nM:i:1
angel-bee2018 commented 4 years ago

@wososa after leaving it to run overnight, there was no error output to the console and results are as follows:

image

inside ~/_wososatmp/ ...

image

Basically I've got everything, except for the intronic reads!

wososa commented 4 years ago

Hi @angel-bee2018,

Thanks for posting the screenshots! Very helpful. Could you please post the file content of your groupa.txt? It looks like file name error. Sorry about my late reply.

Woody

angel-bee2018 commented 4 years ago

hi @wososa , that's no problem!

groupa.txt: image

groupb.txt: image

angel-bee2018 commented 4 years ago

i just generated the contents of groupa and groupb.txt by using ls

wososa commented 4 years ago

Everything looks normal. Is the chromosome name in your .gtf file like "chr1" or "1"? Could you please try PSIsigma-ir-v.1.1.pl [database name].db BM_MSC_to_OB_12d_r1.Aligned.sortedByCoord.out.bam 1? The command will extract the IR files from the .bam file. It may show errors.

angel-bee2018 commented 4 years ago

@wososa the GTF file is in the format "1" (see below): NOTE that the GTF that I provide has been pre-sorted.

$ head -n 10 Homo_sapiens.GRCh38.98.sorted.gtf
#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.28
#!genebuild-last-updated 2019-06
1       havana  exon    11869   12227   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1       havana  gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1       havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; tag "basic"; transcript_support_level "1";
1       havana  exon    12010   12057   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; exon_id "ENSE00001948541"; exon_version "1"; tag "basic"; transcript_support_level "NA";
1       havana  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; tag "basic"; transcript_support_level "NA";
angel-bee2018 commented 4 years ago

I also ran what you told me to run. It outputs an identical message to when I ran it before and I still get no intronic reads.

$ perl $PSIsigma_package_dir"PSIsigma-ir-v.1.1.pl" BM_MSC_to_OB_12d_vs_BM_MSC_to_OB_12h.db BM_MSC_to_OB_12d_r1.Aligned.sortedByCoord.out.bam 1
read format = paired-end
mkdir: cannot create directory ‘_wososatmp’: File exists
Doing... 1
Output... 1
Doing... 10
Output... 10
Doing... 11
Output... 11
Doing... 12
Output... 12
Doing... 13
Output... 13
Doing... 14
Output... 14
Doing... 15
Output... 15
Doing... 16
Output... 16
Doing... 17
Output... 17
Doing... 18
Output... 18
Doing... 19
Output... 19
Doing... 2
Output... 2
Doing... 20
Output... 20
Doing... 21
Output... 21
Doing... 22
Output... 22
Doing... 3
Output... 3
Doing... 4
Output... 4
Doing... 5
Output... 5
Doing... 6
Output... 6
Doing... 7
Output... 7
Doing... 8
Output... 8
Doing... 9
Output... 9
Doing... MT
Output... MT
Doing... X
Output... X
Doing... Y
Output... Y
Spent 48.4500 mins

image image

angel-bee2018 commented 4 years ago

UPDATE: I also tested using the original unsorted GTF file. still same result.

wososa commented 4 years ago

@angel-bee2018 Interesting. It ran for 48 mins and didn't generate any output numbers. You mentioned that only v1.8 has no error for the median function. It made me wonder if the error is due to the version of Perl. Could you please check which version of Perl are you using? perl --version I will take a closer look at why the Undefined subroutine &main::median called at (eval 1) line 229. happened in earlier versions later.

wososa commented 4 years ago

@angel-bee2018 By the way, are you using the bash function of Windows 10 to run PSI-Sigma?

angel-bee2018 commented 4 years ago

@wososa version is perl/5.28.0

and no im using bash terminal on an HPC cluster.

$ perl --version

This is perl 5, version 28, subversion 0 (v5.28.0) built for x86_64-linux-thread-multi

Copyright 1987-2018, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
wososa commented 4 years ago

@angel-bee2018 I found a bug in the database generation step. PSI-Sigma will add "chr" to the chromosome name if it doesn't have "chr" prefix... I will fix this soon!

wososa commented 4 years ago

@angel-bee2018 Please try https://github.com/wososa/PSI-Sigma/releases/tag/v1.8a Thanks! Woody

angel-bee2018 commented 4 years ago

@wososa oh amazing! i will try this out straight away!]

thanks so much

wososa commented 4 years ago

@angel-bee2018 Could you please use https://github.com/wososa/PSI-Sigma/releases/tag/v1.8b? I removed all scripts that will edit chromosome names. Thanks!

angel-bee2018 commented 4 years ago

@wososa I am currently halfway through running the first comparison with PSI-Sigma 1.8b, and the first thing I noticed was that the database generation step is on average over 4 times faster than before (0.3117 hours vs. 0.0744 hours).

This seemed a bit fishy so I checked the .db file, and it is empty.

image

UPDATE: the IRtmp.txt files are also empty image

wososa commented 4 years ago

@angel-bee2018 I found another line editing the chromosome names in dummyai.pl. Please removed the .db file and run https://github.com/wososa/PSI-Sigma/releases/tag/v1.8c. Thanks!

angel-bee2018 commented 4 years ago

UPDATE2: the console output:

name = BM_MSC_to_OB_12d_vs_BM_MSC_to_OB_12h
type = 1
nread = 5
skipratio = 0.05
fmode = 3
Path = /home/z3463471/isoform_software/PSIsigma_1.8
BM_MSC_to_OB_12d_r1.Aligned.sortedByCoord.out.bam.bai is ready.
BM_MSC_to_OB_12d_r2.Aligned.sortedByCoord.out.bam.bai is ready.
BM_MSC_to_OB_12d_r3.Aligned.sortedByCoord.out.bam.bai is ready.
BM_MSC_to_OB_12h_r1.Aligned.sortedByCoord.out.bam.bai is ready.
BM_MSC_to_OB_12h_r2.Aligned.sortedByCoord.out.bam.bai is ready.
BM_MSC_to_OB_12h_r3.Aligned.sortedByCoord.out.bam.bai is ready.
Generating mapping file...
Checking splice-junction files...
===Splice-junction files spent 0.0000 hours.===
Generating BM_MSC_to_OB_12d_vs_BM_MSC_to_OB_12h.db...
Doing... chr1
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr10
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr11
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
Doing... chr12
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr13
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr14
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr15
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
Doing... chr16
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr17
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr18
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr19
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
Doing... chr2
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr20
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr21
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr22
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
Doing... chr3
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr4
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr5
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
Doing... chr6
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr7
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr8
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chr9
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
Doing... chrMT
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chrX
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... chrY
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
===Database spent 0.0744 hours.===
Getting intron reads....
Checking BM_MSC_to_OB_12h_r1.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12h_r1.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12h_r1
read format = paired-end
Doing... 1
Output... 1
Doing... 10
Output... 10
Doing... 11
angel-bee2018 commented 4 years ago

@angel-bee2018 I found another line editing the chromosome names in dummyai.pl. Please removed the .db file and run https://github.com/wososa/PSI-Sigma/releases/tag/v1.8c. Thanks!

@wososa The database generation step just finished running with v1.8c. Unfortunately, the .db, .bed AND the IRtmp.txt files are still empty :(

What is different this time round is that the database generation also included unplaced reference contigs and also the X, Y and MT chromosomes. (I forgot if they were there before or not, but the X, Y and MT chromosomes are important.)

Additionally, the console output has error message:

cat: chr*.db: No such file or directory cat: chr*.bed: No such file or directory rm: cannot remove ‘chr*.db’: No such file or directory rm: cannot remove ‘chr*.bed’: No such file or directory The whole output:

Doing... KI270734.1
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... KI270744.1
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... KI270750.1
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
Doing... MT
Number of valid .SJ.out.tab files = 6
time = 0.17 mins
Doing... X
Number of valid .SJ.out.tab files = 6
time = 0.48 mins
Doing... Y
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
cat: chr*.db: No such file or directory
cat: chr*.bed: No such file or directory
rm: cannot remove ‘chr*.db’: No such file or directory
rm: cannot remove ‘chr*.bed’: No such file or directory
===Database spent 0.3411 hours.===
Getting intron reads....
Checking BM_MSC_to_OB_12h_r1.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12h_r1.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12h_r1
read format = paired-end
Doing... 1
Output... 1
Doing... 10
Output... 10
wososa commented 4 years ago

@angel-bee2018 PSI-Sigma will merge database files of different chromosomes to a single file. I forgot to remove the “Chr” prefix in the step in dummy’s I.pl. I will fix it soon.

wososa commented 4 years ago

@angel-bee2018 Could you please try this: https://github.com/wososa/PSI-Sigma/releases/tag/v1.8d? Thanks!

angel-bee2018 commented 4 years ago

@wososa Thanks so much for the fast update.

v1.8d just finished generating the database. I can get back the .db and .bed files now, but the pesky Undefined subroutine &main::median called at (eval 1) line 204. problem is present now when PSI-SIgma tries to write the intron reads.

I checked the IRtmp.txt files and it only contains that of chromosome 1 and is empty.

The context:

time = 0.48 mins
Doing... Y
Number of valid .SJ.out.tab files = 6
time = 0.18 mins
===Database spent 0.3439 hours.===
Getting intron reads....
Checking BM_MSC_to_OB_12d_r2.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12d_r2.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12d_r2
read format = paired-end
Doing... 1
Output... 1
Undefined subroutine &main::median called at (eval 1) line 204.
Checking BM_MSC_to_OB_12h_r1.Aligned.sortedByCoord.out.bam...
Checking BM_MSC_to_OB_12h_r1.IR.out.tab...
Generating .IR.out for BM_MSC_to_OB_12h_r1
wososa commented 4 years ago

@angel-bee2018 Could you please check if your Perl has "Statistics::Basic" installed? I think that it caused the median problem.

angel-bee2018 commented 4 years ago

@wososa It seems to be installed... for good measure, I reinstalled it and checked the module again and they appear to be installed succesfully. EDIT: furthermore, I have confirmed that the directory containing Statistics/basic.pm is included in the $PERL5LIB variable EDIT2: re-running the analyses after reinstalling did not help.

$ perl -e "use PDL::LiteF"
$ perl -e "use PDL::GSL::CDF"
$ perl -e "use Statistics::Multtest"
$ perl -e "use Statistics::Basic"
$ cpanm Statistics::Basic
--> Working on Statistics::Basic
Fetching http://www.cpan.org/authors/id/J/JE/JETTERO/Statistics-Basic-1.6611.tar.gz ... OK
Configuring Statistics-Basic-1.6611 ... OK
Building and testing Statistics-Basic-1.6611 ... OK
Successfully installed Statistics-Basic-1.6611
1 distribution installed
$ perl -e "use Statistics::Basic"
$
wososa commented 4 years ago

@angel-bee2018 Thanks for the detailed test. I have added a subroutine "median" to replace Statistics::Basic's median. Please try: https://github.com/wososa/PSI-Sigma/releases/tag/v1.8e. Thanks for your generous helps!

angel-bee2018 commented 4 years ago

@wososa many thanks for the quick fix! the IRtmp.txt and IR.out.tab files seem to have contents now, and nothing unexpected has occurred in the database generation step as far as I can tell. I will give another update once the whole comparison is finished. hopefully they all would have run successfully

Fingers crossed!

angel-bee2018 commented 4 years ago

@wososa UPDATE: over the course of today, a few comparisons have run already without any error messages and I have confirmed that the r_ir.(out|sorted.out) files contain retained intron info. Many thanks for all your help to get this working so far!

At this stage, I would like to confirm that few things are working as intended (i.e. nothing is silently broken):

  1. I noticed that there is no strand information in the final results output table r_ir.(out|sorted.out). However, in the *.gtf.mapping.txt file, there is ENST ID, gene ID AND strand. Was there supposed to be a strand column incorporated into a separate column or the target/event co-ordinates? If not, are there plans to include it in the future? It could help streamline the workflow for the end user.

  2. Intron retention events in the final *sorted.txt table seem to be further classified into IR and IR (overlapping region). Is this normal behaviour? (see below). If this is normal, then what is the difference between the two? image

  3. Lastly, this is not an issue with the software but could you please describe the criteria in order for differential exons to be considered as NMD? And has this information been described elsewhere? (I could not find this information in any of the papers which used PSI-Sigma)

Regards, Angel

wososa commented 4 years ago

@angel-bee2018 glad that your runs were successful!

  1. Strand information is not available now because most RNA-seq data is not stranded. I can add an extra column in the future.

  2. Overlapping region means that the “event region” overlaps with two or more gene regions in the .gtf file. Such IR event may not be true because of overlapping UTRs. Sometimes the .gtf file annotates a large region due to promoter regions. It can overestimate the overlapping regions.

  3. The NMD annotation is based on the “reference transcript” column. If the transcript is an NMD transcript, PSI-Sigma will annotate the exon as an NMD exon. This is based on .gtf file, and may not be accurate.

angel-bee2018 commented 4 years ago

Thank you so much for your help @wososa The program is working perfectly now, so consider this issue closed.

I would like to leave brief troubleshooting summary for anyone else who might encounter similar problems in the future.

export PERL5LIB=/home/<myusername>/perl5/lib/perl5/x86_64-linux-thread-multi:/home/<myusername>/perl5/lib/perl5/x86_64-linux-thread-multi/PDL:/home/<myusername>/perl5/lib/perl5