wososa / PSI-Sigma

PSI-Sigma
Other
35 stars 10 forks source link

Attempt to reload PDL/GSL/CDF.pm aborted #35

Closed umranyaman closed 2 years ago

umranyaman commented 2 years ago

Dear Woody,

I am running ONT long-read data, and getting the "Attempt to reload PDL/GSL/CDF.pm aborted." error, also Segmentation fault (core dumped).

All modules seem to be uploaded, previously I was having the "Can't locate X module" errors, but I suppose I managed to install the modules properly now.

Do you know how can I solve this?

perl dummyai.pl --gtf /media/data/driuserData/psi/gencode.vM28.chr_patch_hapl_scaff.sorted.gtf --name PSIsigma --type 2 -nread 10

gtf = /media/data/driuserData/psi/gencode.vM28.chr_patch_hapl_scaff.sorted.gtf name = PSIsigma type = 2 nread = 10 skipratio = 0.05 fmode = 0 irmode = 0 adjp = 0 trimp = 5 denominator = 0 irrange = 5 variance assumption = equal (Student's t-test) Path = /media/data/driuserData/psi/PSI-Sigma barcode01.Aligned.sortedByCoord.out.bam.bai is ready. barcode02.Aligned.sortedByCoord.out.bam.bai is ready. Generating mapping file... Checking splice-junction files... ===Splice-junction files spent 0.0000 hours.=== ===Database spent 0.0000 hours.=== Getting intron reads.... Checking barcode02.Aligned.sortedByCoord.out.bam... Checking barcode02.IR.out.tab... barcode02.IR.out.tab existed. Pass... Checking barcode01.Aligned.sortedByCoord.out.bam... Checking barcode01.IR.out.tab... barcode01.IR.out.tab existed. Pass... ===Intron-read file spent 0.0000 hours.=== Ready to do PSI analysis... Attempt to reload PDL/GSL/CDF.pm aborted. Compilation failed in require at (eval 1) line 830. BEGIN failed--compilation aborted at (eval 1) line 830. ===PSI analysis spent 0.0003 hours.=== Filtering ΔPSI results... Not enough samples for p-value calculation, so switch to fmode = 1. Filtering mode = 1 Reading... /media/data/driuserData/psi/gencode.vM28.chr_patch_hapl_scaff.sorted.gtf.mapping.txt Reading... PSIsigma.db Aborting.. Can't open PSIsigma_r10_ir3.txt : No such file or directory ===Filtering spent 0.0003 hours.===

***Total: 0.0006 hours (or 0.036mins). Segmentation fault (core dumped)

Thanks very much, Best, Umran

umranyaman commented 2 years ago

Hi Woody,

I actually managed to run it. Below is what I get do you think the sample number is the only reason for the p-value calculation problem e.g. switching mode to the short-read?

perl dummyai.pl --gtf /media/data/driuserData/psi/gencode.vM28.chr_patch_hapl_scaff.sorted.gtf --name PSIsigma --type 2 -nread 10 gtf = /media/data/driuserData/psi/gencode.vM28.chr_patch_hapl_scaff.sorted.gtf name = PSIsigma type = 2 nread = 10 skipratio = 0.05 fmode = 0 irmode = 0 adjp = 0 trimp = 5 denominator = 0 irrange = 5 variance assumption = equal (Student's t-test) Path = /media/data/driuserData/psi/PSI-Sigma barcode01.Aligned.sortedByCoord.out.bam.bai is ready. barcode02.Aligned.sortedByCoord.out.bam.bai is ready. Generating mapping file... Checking splice-junction files... ===Splice-junction files spent 0.0000 hours.=== ===Database spent 0.0000 hours.=== Getting intron reads.... Checking barcode02.Aligned.sortedByCoord.out.bam... Checking barcode02.IR.out.tab... barcode02.IR.out.tab existed. Pass... Checking barcode01.Aligned.sortedByCoord.out.bam... Checking barcode01.IR.out.tab... barcode01.IR.out.tab existed. Pass... ===Intron-read file spent 0.0000 hours.=== Ready to do PSI analysis... Group A has 1 samples. Group B has 1 samples. Reading... barcode01 Reading... barcode01.SJ.out.tab accession = barcode01 (N) barcode01 Checking IR reads checking... barcode01.IR.out.tab Checking SJ reads... Calculating PSI values... Reading... barcode02 Reading... barcode02.SJ.out.tab accession = barcode02 (T) barcode02 Checking IR reads checking... barcode02.IR.out.tab Checking SJ reads... Calculating PSI values... Number of events = 76619 Number of samples = 2 Statistics option = Not enough samples for t-test number of p-value = 39033 Number of final p-value = 37993 Skipping p-value adjustment. number of fdr(BH) = 37993 Attempt to free unreferenced scalar: SV 0x5616ea5e00d8, Perl interpreter: 0x5616ea5de2a0 during global destruction. ===PSI analysis spent 0.0178 hours.=== Filtering ΔPSI results... Not enough samples for p-value calculation, so switch to fmode = 1. Filtering mode = 1 Reading... /media/data/driuserData/psi/gencode.vM28.chr_patch_hapl_scaff.sorted.gtf.mapping.txt Reading... PSIsigma.db ===Filtering spent 0.0006 hours.=== Archiving... PSIsigma_r10_ir3.txt PSIsigma_r10_ir3.txt

***Total: 0.0184 hours (or 1.104mins). Segmentation fault (core dumped)

wososa commented 2 years ago

@umranyaman ,

You have an interesting working environment. Attempt to reload PDL/GSL/CDF.pm aborted this is a strange error message. I haven't seen this before. The Segmentation fault (core dumped) message in the end of your post shows that your environment doesn't have enough RAM (memory) while running PSI-Sigma. Attempt to free unreferenced scalar: SV 0x5616ea5e00d8, Perl interpreter: 0x5616ea5de2a0 during global destruction. <-- this one is also new to me.

Would it be possible for you to use Singularity or Docker to run PSI-Sigma?

singularity exec docker://woodydon/psi_sigma_pipeline:3.6 perl /usr/local/bin/PSI-Sigma-1.9r/dummyai.pl --gtf Homo_sapiens.GRCh38.100.sorted.gtf --nread 10 --name PSIsigma1d9r --type 2 --fmode 3 --irmode 0

If this doesn't solve your problem, we can schedule a meeting or discuss through email (klin@cshl.edu). It will be more efficient.

Best, Woody

umranyaman commented 2 years ago

Dear Woody,

Thanks for your quick reply,

I am able to run the Psi-sigma after following the previous issue about "Can't locate the PDL/GSL/CDF.pm".

It takes ~6-8 hrs for 4 samples now, and I am planning to run with 12 samples. It is going to be shorter for the first 4, but not sure how long it would take in total. Is there any solution to this?

Also, I have condition and gender groups, and I want to include gender in the model when identifying the AS events between disease and controls. How can I incorporate the genders, please? I only have groupa.txt and groupb.txt

Thanks very much, All the best, Umran

wososa commented 2 years ago

@umranyaman ,

Wonderful~! I am glad that the issue has been resolved. Yes, there's a ways to speed up by generating IR.out.tab files in parallel. Each IR.out.tab file needs ~1 hour to generate, so it will take 12 hours for 12 samples. You can follow these two steps to make PSI-Sigma faster!

  1. Database creation manual
  2. IR.out.tab creation manual

Please let me know if this works for you. I will be happy to help.

Best, Woody

wososa commented 2 years ago

@umranyaman ,

Identifying gender-associated splicing pattern will require splitting groupa and groupb based on gender. In other words, you will need to do more pair-wise comparisons. You should use the .db, .SJ.out.tab, and IR.out.tab files as input for PSI-Sigma. It will speed up the analysis significantly.

Best, Woody

umranyaman commented 2 years ago

Dear Woody,

Thanks very much,

The results are quite interesting and some of the splicing events confirms the DEXSeq exon usage analysis results I have performed previously. Actually I want to perform differential exon usage analysis with the same bam files (Aligned.sortedByCoord.out.bam) or same aligner settings, just to compare the two again. I will need to find a solution for the different results coming from pairwise comparison in PSI-Sigma vs fitted glm model in DEXSEq later on.

I have now PSIsigma_r10_ir3.txt and PSIsigma_r10_ir3.sorted.txt results. I was thinking sorted.txt is coming from the PSIsigma_r10_ir3.txt. How these files generated? How they differ from each other?

I will need one list of detailed event type (PSIsigma_r10_ir3.sorted.txt) and all the FDR values - not sorted, how can I generate it?

Thanks very much, Best, Umran

wososa commented 2 years ago

@umranyaman ,

Glad that you find the output of PSI-Sigma useful. If you visualize .bam file with IGV, you should see the events more clearly.

Yes, PSIsigma_r10_ir3.sorted.txt is directly reorganized from PSIsigma_r10_ir3.txt. PSIsigma_r10_ir3.sorted.txt is shown in a more human friendly format and filtered records based on the parameters that you specified.

I am not sure that I understand your third question. Could you elaborate a bit more?

Best, Woody

umranyaman commented 2 years ago

Thanks very much @wososa,

I understand now, the PSIsigma_r10_ir3.txt is the list I need. I may just need the more detailed event type names. How can I get more detailed info about the event types for the full-list? It would be interesting to check full-list and compare with exon usage results in my case.

Last thing I wanted to ask is that why I only have 1 gene name for the gene below in filtered, but 2-3 gene name*(fusion genes?) in the PSIsigma_r10_ir3.txt. If they are fusion genes, would be also very important to know.

One example:

chr2_164675117_164675949_W_ENSMUST00000127650.8_1 Ctsa, Neurl2 chr2:164675728-164675770 W 6 6 - ENSMUST00000127650.8 -10.46 0.0004   0.0004 59.3967517401392, 60.6508875739645, 57.6237623762376, 64.7230320699708, 63.8190954773869, 54.5936395759717

Filtered:

chr2:164675117-164675949 Ctsa chr2:164675728-164675770 SES 6 6 - ENSMUST00000127650.8 -10.46 0.0004 0.0004 73.2421875\ 71.7687074829932\ 72.419627749577\ 66.9291338582677\ 70.6948640483384\ 68.5009487666034 59.3967517401392\ 60.6508875739645\ 57.6237623762376\ 64.7230320699708\ 63.8190954773869\ 54.5936395759717 chr2_164675117_164675949_W_ENSMUST00000127650.8_1

Thanks very much, All the best, Umran

wososa commented 2 years ago

Hi @umranyaman ,

You should use --fmode 3 to get full table in PSIsigma_r10_ir3.sorted.txt. PSIsigma_r10_ir3.txt is not recommended because the format isn't easy to follow.

The Ctsa, Neurl2 is saying that the event is located at the overlapping region of Ctsa and Neurl2 genes. In other words, Ctsa and Neurl2 are neighbors. You will need to manually check if the event is from read through transcripts. PSI-Sigma doesn't report whether the event is from fusion-gene or read-through transcripts.

Best, Woody

umranyaman commented 2 years ago

Thanks very much @wososa,

I have a results file T test p-value and FDR BH values are the same, how can I get the correct FDR values please?

Event Region Gene Symbol Target Exon Event Type N T Exon Type Reference Transcript ΔPSI (%) T-test p-value FDR (BH) N Values T Values Database ID
chr4:44025561-44032400 Clta chr4:44031400-44031435 MES 6 6 - ENSMUST00000107851.10 5.33 0.00001 0.00001 60.9573091849935|61.4223692371304|59.3663366336634|60.8349420849421|60.9047443913203|62.8776012530768 66.4478482859227|65.6721008920332|66.8020108275329|67.8717763316462|66.2293853073463|65.3428432485006 chr4_44025561_44032400_W_ENSMUST00000107851.10_2
chr4:44025561-44032400 Clta chr4:44030215-44030268 MES 6 6 - ENSMUST00000107851.10 5.24 0.00001 0.00001 63.2082794307891|62.1459582385776|61.3267326732673|62.2828185328185|63.2585509378448|64.6229581561871 68.9764162411865|66.6564134112581|69.0255220417633|68.3056158110388|68.2533733133433|67.0935321770141 chr4_44025561_44032400_W_ENSMUST00000107851.10_1
chr17:8530118-8537675 Sft2d1 chr17:8536682-8536703 SES 6 6 - ENSMUST00000135627.8 9.17 0.00001 0.00001 56.8336425479283|52.6481384373361|52.6555386949924|51.1750881316099|52.7750730282376|52.9741863075196 62.9792583280955|60.6990014265335|63.7176050044683|59.1060707138092|64.7181628392484|62.8318584070796 chr17_8530118_8537675_W_ENSMUST00000135627.8_1
chr1:132156775-132184434 Lemd1 chr1:132183159-132183232 SES 3 4 NMD ENSMUST00000187339.7 -15.13 0.00001 0.00001 na|21.7391304347826|22.7272727272727|na|na|22.2222222222222 7.14285714285714|na|8.69565217391304|6.66666666666667|na|5.88235294117647 chr1_132156775_132184434_W_ENSMUST00000187339.7_1
chr3:9000101-9009670 Tpd52 chr3:9001499-9001540 SES 6 6 - ENSMUST00000091355.12 3.7 0.00007 0.00007 41.1384217335058|40.0954653937947|41.2984670874662|42.0157068062827|42.2932330827068|42.7218934911243 45.5357142857143|43.6893203883495|45.8372310570627|44.7570332480818|46.6950959488273|45.2423698384201 chr3_9000101_9009670_W_ENSMUST00000091355.12_1

To make sure I included the disease vs control group to groupa.txt and groupb.txt, respectively. Is deltaPSI calculated as groupa(level 1) - groupb(level 0)?

Thanks so much, Best, Umran

wososa commented 2 years ago

Hi @umranyaman ,

By default, PSI-Sigma doesn't do FDR BH, so the value of FDR BH will be the same as p-value. You can use --adjp 2 to get FDR BH reported (you will need R installed).

delta-PSI is calculated as the average PSI between groupa and groupb samples. I am not sure who you meant by level 1 and level 0. Let me know if I didn't answer your question correctly.

Best, Woody

umranyaman commented 2 years ago

Dear @wososa,

Thanks for your quick replies, very helpful!

I was wondering about the direction of the deltaPSI values, if group a is disease and group b is control (as in my case), does positive deltaPSI mean the higher PSI in groupa or groupb? Sorry for this basic question, I just want to make sure I included my disease and control groups in the right groups.

Best, Umran

wososa commented 2 years ago

Dear @umranyaman ,

Positive means increased in ‘groupb’. You can check the N and T columns for the PSI values of individual samples in ‘groupa’ and ‘groupb’, respectively.

Best, Woody

umranyaman commented 2 years ago

Dear @wososa,

Thanks for your help, I have na values in FDR values where (I suppose) I have deltaPSI -1<0<1. What would be the reason for getting significant p values but cannot get any FDR values for those examples? Are these less likely to be correct although p value is significant? One example:

chr7:19431573-19432112 Apoe chr7:19432111-19432113 A5SS 6 6 - Ex.ENSMUST00000003066.16 0.01 0.04272 na 0.0418521477761291\ 0.0292804333504136\ 0.0364732772455381\ 0.0496175315278065\ 0.0451739195904231\ 0.0526788153119757 0.0636672325976231\ 0.0445344129554656\ 0.0508060576453346\ 0.0539083557951483\ 0.0655539307145378\ 0.0464990118959972 chr7_19431573_19432112_S_Ex.ENSMUST00000003066.16_1

Not particularly a technical issue, but as I am running PCR-cDNA long-read data, I may need to discard the spurious antisense event, I am looking for a solution for this. Do you think there is any indication whether the event type is the results of those unreliable matches/events e.g. A5SS events with low deltaPSI values may indicate spurious antisense? or do other event types with low deltaPSI values are less reliable although the p value and FDR is significant?

Thanks very much, Best, Umran

wososa commented 2 years ago

@umranyaman ,

The splicing record of APOE has a value of 0.01% delta-PSI. The small delta-PSI value suggests that the record can be artifacts from the sequencing data. Also, 0.04 p-value shows that the splicing pattern isn't too distinct between groupa and groupb. I usually use |delta-PSI| >= 10%+p<0.01 as cutoff for short-read RNA-seq and |delta-PSI| >= 5%+p<0.01 for nanopore RNA-seq. I don't use FDR as cutoff because FDR is sometimes too stringent.

Please note that the delta-PSI column reports % value, so you don't need to time 100 to get the %.

Best, Woody

umranyaman commented 2 years ago

Thanks very much @wososa, all good now!