zhqingit / giremi

GIREMI is a method that can identify RNA editing sites using one RNA-seq data set without requiring genome sequence data.
42 stars 15 forks source link

Allelic depth calculated, but glm & ifRNAE not calculated, MI statistics are off #27

Closed JohnMCMa closed 7 years ago

JohnMCMa commented 7 years ago

I'm running GIREMI with a list of 2,233 locations, with the first 10 lines as follows:

chr1 135124 135125 Inte 0 # chr1 944681 944682 NOC2L 0 - chr1 1296622 1296623 ACAP3 0 - chr1 1635542 1635543 CDK11B 0 - chr1 1825442 1825443 GNB1 0 - chr1 6185501 6185502 RPL22 0 - chr1 6223953 6223954 ICMT 0 - chr1 6646402 6646403 DNAJC11 0 - chr1 7970927 7970928 PARK7 0 + chr1 7985013 7985014 PARK7 0 +

The commandline output was something like this:

[mpileup] 1 samples in 1 input files coor:135125 chr:chr1coor:944682 chr:chr1[long list of locations with no newlines]

Calculating the MI values... meanMI:-nan sdMI:-nan Estimating the allelic ratios... R CMD BATCH --no-save --no-restore '--args finput="/scratch/lym_myl_rsch/mma/Biovest/RNA-seq/RNAEditingSites/GIREMI/1112961-Tumor/1112961-Tumor-GIREMI.out-2.txt" fout="/scratch/lym_myl_rsch/mma/Biovest/RNA-seq/RNAEditingSites/GIREMI/1112961-Tumor/1112961-Tumor-GIREMI.out-2.txt.res"' giremi.r /scratch/lym_myl_rsch/mma/Biovest/RNA-seq/RNAEditingSites/GIREMI/1112961-Tumor/1112961-Tumor-GIREMI.out-2.txt.Rout Analysis DONE!

GIREMI failed to run the R script, and reviewing the intermediate output, it looks like samtools was able to do its job as the allelic depths were counted corrected, but the stats were strange. ifRNAE doesn't exist, the GLM stats are all zero, and while raw MI looks fine, ifMI, pvalue_MI, and allelic ratio were off:

chr coor strand ifsnp gene refB upB downB majorB majorN totN majorR ifmi mi mip ar ifneg iflm lmp type A C G T chr1 135125 # 0 Inte C C G T 89 89 1 -1 -1 -1.00E+00 0.5 0 0 0 CT 0 0 0 89 chr1 944682 - 0 NOC2L A C G A 92 95 0.968421 2 0.064177 0.00E+00 0.5 0 0 0 TC 92 0 3 0 chr1 1296623 - 0 ACAP3 C G G G 16 18 0.888889 -1 -1 -1.00E+00 0.5 0 0 0 GC 0 2 16 0 chr1 1635543 - 0 CDK11B T G C C 39 39 1 2 0 0.00E+00 0.5 0 0 0 AG 0 39 0 0 chr1 1825443 - 0 GNB1 A A G A 288 290 0.993103 -1 -1 -1.00E+00 0.5 0 0 0 TC 288 0 2 0 chr1 6185502 - 0 RPL22 A G G A 248 250 0.992 2 0.016064 0.00E+00 0.5 0 0 0 TC 248 0 2 0 chr1 6223954 - 0 ICMT A C A A 99 102 0.970588 2 0.059706 0.00E+00 0.5 0 0 0 TC 99 0 3 0 chr1 6646403 - 0 DNAJC11 T G G T 6 9 0.666667 -1 -1 -1.00E+00 0.5 0 0 0 AC 0 0 3 6 chr1 7970928 + 0 PARK7 A G A A 282 283 0.996466 -1 -1 -1.00E+00 0.5 0 0 0 AG 282 0 1 0 chr1 7985014 + 0 PARK7 T G G T 278 292 0.952055 -1 -1 -1.00E+00 0.5 0 0 0 TC 3 7 4 278

I would appreciate any assistance.

PS: I can confirm the chromosome names are fine, and since I used the same index files for GATK, I take that to mean it wasn't corrupted.

zhqingit commented 7 years ago

Hi,

How big is the bam file? It looks GIREMI can't calculate the MI value.

Best, Qing

2016-10-21 11:13 GMT-07:00 John Ma notifications@github.com:

I'm running GIREMI with a list of 2,233 locations, with the first 10 lines as follows:

chr1 135124 135125 Inte 0 # chr1 944681 944682 NOC2L 0 - chr1 1296622 1296623 ACAP3 0 - chr1 1635542 1635543 CDK11B 0 - chr1 1825442 1825443 GNB1 0 - chr1 6185501 6185502 RPL22 0 - chr1 6223953 6223954 ICMT 0 - chr1 6646402 6646403 DNAJC11 0 - chr1 7970927 7970928 PARK7 0 + chr1 7985013 7985014 PARK7 0 +

The commandline output was something like this:

[mpileup] 1 samples in 1 input files coor:135125 chr:chr1coor:944682 chr:chr1[long list of locations with no newlines]

Calculating the MI values... meanMI:-nan sdMI:-nan Estimating the allelic ratios... R CMD BATCH --no-save --no-restore '--args finput="/scratch/lym_myl_rsch/ mma/Biovest/RNA-seq/RNAEditingSites/GIREMI/1112961-Tumor/1112961-Tumor-GIREMI.out-2.txt" fout="/scratch/lym_myl_rsch/mma/Biovest/RNA-seq/RNAEditingSites/GIREMI/ 1112961-Tumor/1112961-Tumor-GIREMI.out-2.txt.res"' giremi.r /scratch/lym_myl_rsch/mma/Biovest/RNA-seq/RNAEditingSites/GIREMI/ 1112961-Tumor/1112961-Tumor-GIREMI.out-2.txt.Rout Analysis DONE!

GIREMI failed to run the R script, and reviewing the intermediate output, it looks like samtools was able to do its job as the allelic depths were counted corrected, but the stats were strange. ifRNAE doesn't exist, the GLM stats are all zero, and while raw MI looks fine, ifMI, pvalue_MI, and allelic ratio were off:

chr coor strand ifsnp gene refB upB downB majorB majorN totN majorR ifmi mi mip ar ifneg iflm lmp type A C G T chr1 135125 # 0 Inte C C G T 89 89 1 -1 -1 -1.00E+00 0.5 0 0 0 CT 0 0 0 89 chr1 944682 - 0 NOC2L A C G A 92 95 0.968421 2 0.064177 0.00E+00 0.5 0 0 0 TC 92 0 3 0 chr1 1296623 - 0 ACAP3 C G G G 16 18 0.888889 -1 -1 -1.00E+00 0.5 0 0 0 GC 0 2 16 0 chr1 1635543 - 0 CDK11B T G C C 39 39 1 2 0 0.00E+00 0.5 0 0 0 AG 0 39 0 0 chr1 1825443 - 0 GNB1 A A G A 288 290 0.993103 -1 -1 -1.00E+00 0.5 0 0 0 TC 288 0 2 0 chr1 6185502 - 0 RPL22 A G G A 248 250 0.992 2 0.016064 0.00E+00 0.5 0 0 0 TC 248 0 2 0 chr1 6223954 - 0 ICMT A C A A 99 102 0.970588 2 0.059706 0.00E+00 0.5 0 0 0 TC 99 0 3 0 chr1 6646403 - 0 DNAJC11 T G G T 6 9 0.666667 -1 -1 -1.00E+00 0.5 0 0 0 AC 0 0 3 6 chr1 7970928 + 0 PARK7 A G A A 282 283 0.996466 -1 -1 -1.00E+00 0.5 0 0 0 AG 282 0 1 0 chr1 7985014 + 0 PARK7 T G G T 278 292 0.952055 -1 -1 -1.00E+00 0.5 0 0 0 TC 3 7 4 278

I would appreciate any assistance.

PS: I can confirm the chromosome names are fine, and since I used the same index files for GATK, I take that to mean it wasn't corrupted.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/27, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zXqMhxqoYvswuqhNYJwAw3lgcSDWks5q2QDGgaJpZM4Kddop .

JohnMCMa commented 7 years ago

1,179,375kb. This is the default bamout from GATK, of which I start to wonder if it has dropped too much reads. There're several flags in GATK of which I can obtain a more comprehensive set of reads; I may have to play with them.

I'm convinced, however, that if the location list used in GIREMI originates from HaplotypeCaller or MuTect2, then it must be some kind of GATK bamout from the same run that should be used as the BAM as well.

JohnMCMa commented 7 years ago

Hi Qing, Do you think a .bam with 51,996,757 reads (as counted by samtools view -c ) has enough reads for GIREMI?

JohnMCMa commented 7 years ago

Hi Qing,

I have just re-ran GIREMI with a fresh more "conventional" BAM but with the same list. This time the situation seems to have improved in that the the mi and if_mi fields seems to be calculated correctly, but the pvalue of MI (mip) in the file and the mean and SDs of MI emitted on the screen are still off. mip is either 0 or -1, and both meanMI and sdMI are -nan.

I have attached the output of the C code for your reference. SNPiR_1112961-Tumor.UC.GIREMI.results.txt

zhqingit commented 7 years ago

Hi John,

I found the nucleotide ratios of many variants are very high 0.98~1. I believed there are many homozygous SNV in your list that might affect the calculation. Please use more strict filters to remove these SNVs and try again. Thanks.

Best, Qing

2016-11-03 12:43 GMT-07:00 John Ma notifications@github.com:

Hi Qing,

I have just re-ran GIREMI with a fresh more "conventional" BAM but with the same list. This time the situation seems to have improved in that the the mi and if_mi fields seems to be calculated correctly, but the pvalue of MI (mip) in the file and the mean and SDs of MI emitted on the screen are still off. mip is either 0 or -1, and both meanMI and sdMI are -nan.

I have attached the output of the C code for your reference. SNPiR_1112961-Tumor.UC.GIREMI.results.txt https://github.com/zhqingit/giremi/files/569949/SNPiR_1112961-Tumor.UC.GIREMI.results.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/27#issuecomment-258252537, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zd_RhNMnKAX5EwBZ9eThqADgFtdwks5q6jlZgaJpZM4Kddop .

JohnMCMa commented 7 years ago

Sorry for the delay; I had other projects to handle.

I have further filtered the input to limit it to SNPs that have alt allelic fraction between (and including) 0.1-0.9 from both UnifiedGenotyper and SNPiR. It comes out the same--mip are either -1 or 0.

The list and the output by the C code attached. The coordinates are from GRCh38, primary chromosomes only, GENCODE version.

Tumor.filtered-rmhomop.GIREMI.list.txt Tumor.filtered-rmhomop.GIREMI.out.txt

zhqingit commented 7 years ago

Hi John,

I found there are no any known SNP in your list.

Best, Qing

2016-11-09 14:40 GMT-07:00 John Ma notifications@github.com:

Sorry for the delay; I had other projects to handle.

I have further filtered the input to limit it to SNPs that have alt allelic fraction between (and including) 0.1-0.9 from both UnifiedGenotyper and SNPiR. It comes out the same--mip are either -1 or 0.

The list and the output by the C code attached. The coordinates are from GRCh38, primary chromosomes only, GENCODE version.

Tumor.filtered-rmhomop.GIREMI.list.txt https://github.com/zhqingit/giremi/files/581881/Tumor.filtered-rmhomop.GIREMI.list.txt Tumor.filtered-rmhomop.GIREMI.out.txt https://github.com/zhqingit/giremi/files/581882/Tumor.filtered-rmhomop.GIREMI.out.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/27#issuecomment-259534280, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zZAur1SBWMfI0_3MF_OMWo-vXxHTks5q8j24gaJpZM4Kddop .

JohnMCMa commented 7 years ago

@zhqingit: Yes, I have filtered away dbSNP positions beforehand. Are those required for GIREMI calculations?

zhqingit commented 7 years ago

Yes, giremi needs these known SNP information to create the MI distribution.

Best, Qing

2016-11-10 7:11 GMT-07:00 John Ma notifications@github.com:

@zhqingit https://github.com/zhqingit: Yes, I have filtered away dbSNP positions beforehand. Are those required for GIREMI calculations?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/27#issuecomment-259699330, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zXa5pSyBZl0ZO0EyXqocCHXmxMTOks5q8yYvgaJpZM4Kddop .

JohnMCMa commented 7 years ago

@zhqingit: Spiking in dbSNP variants does complete the program. Thanks! Closing ticket