zhqingit / giremi

GIREMI is a method that can identify RNA editing sites using one RNA-seq data set without requiring genome sequence data.
42 stars 15 forks source link

Error when run giremi.r #6

Open liyao001 opened 8 years ago

liyao001 commented 8 years ago

When I run giremi.pl, I will get an error in the *.Rout file. The error message is: Error in data.frame(..., check.names = FALSE) : 参数值意味着不同的行数: 0, 1 Calls: cbind -> cbind -> data.frame And here is my SNV list: chr21 47739577 47739578 A 0 - chr21 47739643 47739644 B 0 - chr21 47739646 47739647 C 0 - chr21 47739723 47739724 D 0 - chr21 47739724 47739725 E 0 - chr21 47739763 47739764 F 0 - chr21 47740294 47740295 G 0 - chr21 47741149 47741150 H 0 - chr21 47741220 47741221 I 0 - The output of giremi(the executable file) is: chr coor strand ifsnp gene refB upB downB majorB majorN totN majorR ifmi mi mip ar ifneg iflm lmp type A C G T chr21 47739578 - 0 A T C G T 14 17 0.823529 2 0.095463 0.000000e+00 0.500000 0 0 0.000000 AG 0 3 0 14 chr21 47739644 - 0 B T C A T 14 20 0.700000 4 0.262167 0.000000e+00 0.500000 0 0 0.000000 AG 0 6 0 14 chr21 47739647 - 0 C T C A C 12 20 0.600000 4 0.173686 0.000000e+00 0.500000 0 0 0.000000 AG 0 12 0 8 chr21 47739724 - 0 D T A T T 6 8 0.750000 4 0.167866 0.000000e+00 0.500000 0 0 0.000000 AG 0 2 0 6 chr21 47739725 - 0 E T T G T 5 8 0.625000 4 0.256663 0.000000e+00 0.500000 0 0 0.000000 AG 0 3 0 5 chr21 47739764 - 0 F T A T T 5 7 0.714286 2 0.151391 0.000000e+00 0.500000 0 0 0.000000 AG 0 2 0 5 chr21 47740295 - 0 G T G G T 8 12 0.666667 -1 -1.000000 -1.000000e+00 0.500000 0 0 0.000000 AG 0 4 0 8 chr21 47741150 - 0 H T C T T 33 36 0.916667 1 0.278906 0.000000e+00 0.500000 0 0 0.000000 AG 0 3 0 33 chr21 47741221 - 0 I T C A T 52 57 0.912281 1 0.278906 0.000000e+00 0.500000 0 0 0.000000 AG 0 5 0 52 Would you mind to help me to figure it out?

doughertyr commented 8 years ago

I am running into an identical problem.

giremi_output.22.Rout.txt giremi_output.22.txt snv_list.22.txt

zhqingit commented 8 years ago

Hi

  1. Could you send me your output file for my checking.
  2. I noticed you might use the old version, please download the latest one from the github.

Best, Qing

liyao001 commented 8 years ago

Hi When I switched to the newest version, I got the same output. Below is the input snv file snv.txt And here are the outputs giremi_out.Rout.txt giremi_out.txt My reference genome is GRCh38(release 81). Best, Li

zhqingit commented 8 years ago

Hi Li,

I found your SNVs list only includes about 120 SNVs, which caused no MIs could be calculated. I guess you didn't use all SNVs. So please try all the SNVs and chromosomes.

Best, Qing

2016-03-01 22:05 GMT-07:00 Li Yao notifications@github.com:

Hi When I switched to the newest version, I got the same output. Below is the input snv file snv.txt https://github.com/zhqingit/giremi/files/154130/snv.txt And here are the outputs giremi_out.Rout.txt https://github.com/zhqingit/giremi/files/154134/giremi_out.Rout.txt giremi_out.txt https://github.com/zhqingit/giremi/files/154133/giremi_out.txt My reference genome is GRCh38(release 81). Best, Li

— Reply to this email directly or view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-191063945.

ghost commented 7 years ago

Hi Qing,

I find my .Rout.file has many N's in the columns refB, upB, downB and majorB, nearly half of the snv sites. Then I've met the same question when i run giremi.pl. And the error in .Rout.file is :

dat.pos <- cbind(dat.pos[,c("upstream_1base","downstream_1base")],abs(dat.pos[,"major_ratio"]-dat.pos[,"estimated_allelic_ratio"]),1) Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 1 Calls: cbind -> cbind -> data.frame Execution halted

I used GATK to call variants. I think it 's a strict calling process. But my meanMI is only 0.192806, and sdMI is 0.310413. I wonder what's the cause of this question. Would you mind to help me figure it out?

Best, Zhang

zhqingit commented 7 years ago

Hi Zhang,

Could you send me piece of your input files for my testing? Thanks.

Best, Qing

2017-02-19 17:24 GMT-07:00 oytree notifications@github.com:

Hi Qing,

I find my .Rout.file has many N's in the columns refB, upB, downB and majorB, nearly half of the snv sites. Then I've met the same question when i run giremi.pl. And the error in .Rout.file is :

dat.pos <- cbind(dat.pos[,c("upstream_1base","downstream_1base")], abs(dat.pos[,"major_ratio"]-dat.pos[,"estimated_allelic_ratio"]),1) Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 1 Calls: cbind -> cbind -> data.frame Execution halted

I used GATK to call variants. I think it 's a strict calling process. But my meanMI is only 0.192806, and sdMI is 0.310413. I wonder what's the cause of this question. Would you mind to help me figure it out?

Best, Zhang

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-280962281, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zTjudNXPac2nkvQwoKIsPdyK3xT1ks5reN0hgaJpZM4HjwZF .

ghost commented 7 years ago

Hi Qing,

The attachment is my input and output files. The input file is test.GiremiInput.txt, which is transformed from the output of GATK calling variants. The output files are test.GiremiOutput.txt and test.GiremiOutput.Rout.txt. Thanks for your help.

Best, Zhang

At 2017-02-22 00:35:59, "zhqingit" notifications@github.com wrote: Hi Zhang,

Could you send me piece of your input files for my testing? Thanks.

Best, Qing

2017-02-19 17:24 GMT-07:00 oytree notifications@github.com:

Hi Qing,

I find my .Rout.file has many N's in the columns refB, upB, downB and majorB, nearly half of the snv sites. Then I've met the same question when i run giremi.pl. And the error in .Rout.file is :

dat.pos <- cbind(dat.pos[,c("upstream_1base","downstream_1base")], abs(dat.pos[,"major_ratio"]-dat.pos[,"estimated_allelic_ratio"]),1) Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 1 Calls: cbind -> cbind -> data.frame Execution halted

I used GATK to call variants. I think it 's a strict calling process. But my meanMI is only 0.192806, and sdMI is 0.310413. I wonder what's the cause of this question. Would you mind to help me figure it out?

Best, Zhang

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-280962281, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zTjudNXPac2nkvQwoKIsPdyK3xT1ks5reN0hgaJpZM4HjwZF .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

从网易126邮箱发来的云附件 test.GiremiInput.txt (5.67M, 2017年3月9日 10:48 到期) 在线预览 | 下载 test.GiremiOutput.Rout.txt (2.54K, 2017年3月9日 10:48 到期) 在线预览 | 下载 test.GiremiOutput.txt (17.39M, 2017年3月9日 10:48 到期) 在线预览 | 下载

dfv commented 7 years ago

Hi @ liyao001,

Can you Please tell me, how did you prepare the SNV file list after calling the variants ?

Thanks. I look forward to hear back from you soon.

liyao001 commented 7 years ago

Hi @dfv , I wrote a helper python script to prepare this file which will add gene annotations and SNP annotations to the final file with the help of reference files (If you are working with hg38, you can try my files. Or you can download them from UCSC genome Table Browser). You can obtain this file from here. Before using this script you need to run pip install pysam. Here is an example of running this script: python prepare-data-for-giremi.py -i input_file.txt -s snp146Common.gtf.gz -g refGene.gtf.gz input_file.txt is the result of variants calling. If you are using GATK, extract chromosome info and position info from the vcf as the first and second column separately.

dfv commented 7 years ago

Dear Li,

Thanks a lot for replying.

Is it possible to know which is this snp146Common.gtf.gz file , Is this dbsnp file ?

Thanks

liyao001 commented 7 years ago

Hi, snp146Common.gtf.gz is the dbSNP file. By the way, refGene.gtf.gz is the gene annotation file. I uploaded the two files with their indexes for human reference genome hg38 to dropbox. Hope this will be helpful to you.

dfv commented 7 years ago

Hi,

Thank you so much for your help!

Cheers.

dfv commented 7 years ago

Hi Li,

Is it possible to how you extracted the third and fourth column in snp146Common.sorted.bed.gz from dbsnp.vcf ?

I couldn't find the information of third and fourth column of snp146Common.sorted.bed.gz in dbsnp.vcf.

Thanks

liyao001 commented 7 years ago

Hi dfv, You can try to download a dbSNP file in gtf format from UCSC genome Table Browser, I obtained my copy from that site. image

zhqingit commented 7 years ago

Thanks, Li.

Best, Qing

2017-04-21 22:55 GMT-07:00 Li Yao notifications@github.com:

Hi dfv, You can try to download a dbSNP file in gtf format from UCSC genome Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=589282007_ipU29X5WkP7dQfNRdeirUQ6aAW7R&clade=mammal&org=Human&db=hg38&hgta_group=varRep&hgta_track=snp146&hgta_table=snp146Common&hgta_regionType=genome&position=chr9%3A133252000-133280861&hgta_outputType=gff&hgta_outFileName=, I obtained my copy from that site. [image: image] https://cloud.githubusercontent.com/assets/17058337/25301821/42ae8db8-2763-11e7-8cd7-b40b84bb90d5.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-296350319, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zTopQ8_sx0fGYi2KBhLdLu4nvdQOks5ryZZCgaJpZM4HjwZF .

dfv commented 7 years ago

Hi Li,

So for the preparation of SNV list. I am getting following output : chr1 10001 10002 Inte 0 # chr1 10002 10003 Inte 0 # chr1 10003 10004 Inte 0 # chr1 10004 10005 Inte 0 # chr1 10005 10006 Inte 0 # chr1 10006 10007 Inte 0 # chr1 10007 10008 Inte 0 # chr1 10008 10009 Inte 0 # chr1 10009 10010 Inte 0 # chr1 10010 10011 Inte 0 # chr1 10011 10012 Inte 0 # chr1 10012 10013 Inte 0 # chr1 10013 10014 Inte 0 # chr1 10014 10015 Inte 0 # chr1 10015 10016 Inte 0 # chr1 10016 10017 Inte 0 # chr1 10017 10018 Inte 0 # chr1 10018 10019 Inte 0 # chr1 10019 10020 Inte 0 # chr1 10020 10021 Inte 0 # chr1 10021 10022 Inte 0 # chr1 10022 10023 Inte 0 # chr1 10023 10024 Inte 0 #

Is it possible to know if its right from the script ? I am using hg19 genome.

Thanks

liyao001 commented 7 years ago

Hi dfv, I think the result is a little bit odd, would you mind share the original vcf file or part of the vcf file to the forum?

dfv commented 7 years ago

Hi Li,

Thanks a lot for replying. Below mentioned is the part of vcf file:

chr1 10001 chr1 10002 chr1 10003 chr1 10004 chr1 10005 chr1 10006 chr1 10007 chr1 10008 chr1 10009 chr1 10010 chr1 10011 chr1 10012 chr1 10013 chr1 10014 chr1 10015 chr1 10016 chr1 10017 chr1 10018 chr1 10019 chr1 10020 chr1 10021 chr1 10022 chr1 10023 chr1 10024 chr1 10025 chr1 10026 chr1 10027 chr1 10028 chr1 10029 chr1 10030

Thanks!

liyao001 commented 7 years ago

Hi dfv, I think there may be some error in your variant calling process. Because in the human reference genome, positions less than 10,000 are all 'N's and the real sequence begins from nucleartides where the position is greater than 10,000. According to your result, it seems that your variant calling process declared that every position is varianted. So please recheck whether you are using the right reference genome file.

dfv commented 7 years ago

Hi Li,

I checked twice there was no error in variant calling . Yes I am using the correct human reference that is hg19 . I have got the output from GATK and verified with the GATK team. I have used other algorithm for Identification of RNA editing sites and it worked fine.

dfv commented 7 years ago

I have rechecked and the real sequences in my Variant calling file are beginning after 10,000 only. I think the problem is in python script, because with other datasets also I am getting the same output.

ghost commented 7 years ago

Hi Qing,

I tried to apply Giremi on RNA-seq data of D.melanogaster, but I always get error in *.Rout.file :

dat.pos <- cbind(dat.pos[,c("upstream_1base","downstream_1base")],abs(dat.pos[,"major_ratio"]-dat.pos[,"estimated_allelic_ratio"]),1) Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 1 Calls: cbind -> cbind -> data.frame Execution halted I used GATK to call variants and my meanMI is only 0.018867, and sdMI is 0.083809. I wonder what's the cause of this question. Or is Giremi need some specially additional filter steps on non-human data, like Rat or Mouse ? I'm looking forward to your reply. The attachment is my input file. The input file is SRR974820.INPUT.txt, which is transformed from the output of GATK calling variants.

Best, Zhang SRR974820.INPUT.txt

zhqingit commented 7 years ago

Hi Zhang,

Based on the meanMI values, I believed that most of the SNPs are not the real SNP. They might be the sequencing error or RNA editing sites. I suggest that you use more strict filters to call the SNV. The theoretical MI value should be around 0.6. Thanks.

Let me know freely if you have other questions.

Best, Qing

2017-06-06 7:14 GMT-07:00 oytree notifications@github.com:

Hi Qing,

I tried to apply Giremi on RNA-seq data of D.melanogaster, but I always get error in *.Rout.file :

dat.pos <- cbind(dat.pos[,c("upstream_1base","downstream_1base")], abs(dat.pos[,"major_ratio"]-dat.pos[,"estimated_allelic_ratio"]),1) Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 1 Calls: cbind -> cbind -> data.frame Execution halted I used GATK to call variants and my meanMI is only 0.018867, and sdMI is 0.083809. I wonder what's the cause of this question. Or is Giremi need some specially additional filter steps on non-human data, like Rat or Mouse ? I'm looking forward to your reply. The attachment is my input file. The input file is SRR974820.INPUT.txt, which is transformed from the output of GATK calling variants.

Best, Zhang SRR974820.INPUT.txt https://github.com/zhqingit/giremi/files/1055265/SRR974820.INPUT.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-306499198, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zacJeg-48OilLDm3IDgtO3P_U2qpks5sBV7DgaJpZM4HjwZF .

ghost commented 7 years ago

Hi Qing,

Could you suggest a strict SNV-calling method? I used STAR+samtools to call variants and filtered the low qual variants, which I think is strict but also met the same error. It seems that my method is not strict enough.

Also, I am confused about "-m, --min INT minimal number of total reads covering candidate editing sites [default: 5]" parameter in Giremi. I think "-m" is parameter indirectly controlling the number of predicted RNA editing number. I want to have more predicted RNA editings so I set "-m 1", but I get fewer RNA editings than I set "-m 5". I wonder is there some other parameter to get more predicted RNA editings ?

Thank you for your help. Best, Zhang

zhqingit commented 7 years ago

Hi Zhang,

Sorry for the delay reply. Actually, you can find we used 7 filters to remove the sequencing error, PCR errors and other errors in our Nature Methods paper. But I don't have a public script to do the filtering.

For the -m option, I suggested not to use 1 or smaller than 5, though you can chose them. Because the low value of m might generate the unstable MI distribution, which makes the results unpredictable. Thanks.

Please let me know freely if you have more questions.

Best, Qing

2017-06-07 20:26 GMT-07:00 oytree notifications@github.com:

Hi Qing,

Could you suggest a strict SNV-calling method? I used STAR+samtools to call variants and filtered the low qual variants, which I think is strict but also met the same error. It seems that my method is not strict enough.

Also, I am confused about "-m, --min INT minimal number of total reads covering candidate editing sites [default: 5]" parameter in Giremi. I think "-m" is parameter indirectly controlling the number of predicted RNA editing number. I want to have more predicted RNA editings so I set "-m 1", but I get fewer RNA editings than I set "-m 5". I wonder is there some other parameter to get more predicted RNA editings ?

Thank you for your help. Best, Zhang

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-306988135, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zeHpTPF4H3dMiYtFc7iw2m9rtxgEks5sB2n_gaJpZM4HjwZF .

ghost commented 7 years ago

Hi Qing,

I noticed there are 7 filters to remove sequencing error and I conducted what you say in your paper. But I still got error in giremi.r and got a low meanMI, especially in non-human species, like rat or mouse or D.mel. I wonder what the reason is. Also, I have another question. Can Giremi apply to a species without dbsnp file? What should I do to perform Giremi if there isn't dbsnp for the species? Thanks!

Best, Zhang

zhqingit commented 7 years ago

Hi Zhang,

In theory, giremi should work well for any diploid species. Would you mind to send me the output file from giremi? thanks.

So far giremi can't work without dbSNP information. I am working a new version on it, but I can't guarantee when the new software can be released. Thanks.

Best, Qing

2017-06-16 5:18 GMT-07:00 camellia notifications@github.com:

Hi Qing,

I noticed there are 7 filters to remove sequencing error and I conducted what you say in your paper. But I still got error in giremi.r and got a low meanMI, especially in non-human species, like rat or mouse or D.mel. I wonder what the reason is. Also, I have another question. Can Giremi apply to a species without dbsnp file? What should I do to perform Giremi if there isn't dbsnp for the species? Thanks!

Best, Zhang

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-309011821, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zUf2mU-bfmSgR3StL36TblpS9TTnks5sEnKCgaJpZM4HjwZF .

ghost commented 7 years ago

Hi Qing,

The attachment is my input and output files. The input file is wgEncodeCaltechRnaSeq10t12C3hFR2x75Th131Il200Rep1.INPUT.txt, which is filtered by 7 steps. The output files are wgEncodeCaltechRnaSeq10t12C3hFR2x75Th131Il200Rep1.GIREMI.txt and wgEncodeCaltechRnaSeq10t12C3hFR2x75Th131Il200Rep1.GIREMI.Rout.txt. Thanks for your help.

Best, Zhang wgEncodeCaltechRnaSeq10t12C3hFR2x75Th131Il200Rep1.INPUT.txt wgEncodeCaltechRnaSeq10t12C3hFR2x75Th131Il200Rep1.GIREMI.txt wgEncodeCaltechRnaSeq10t12C3hFR2x75Th131Il200Rep1.GIREMI.Rout.txt

ghost commented 7 years ago

Hi Qing, I used STAR and GATK to call the variants. And the raw RNA-seq data is from Mouse Encode project.

Best, Zhang

zhqingit commented 7 years ago

Hi Zhang,

Sorry for the late reply. I have checked your files and plot the nucleotide frequencies of the SNP (see attachment). It looks most of these marked SNP are actually not the real SNP or they are the homozygous SNP because there should be a peak at around 0.5 for the real heterozygous SNPs. Please double check. Thanks.

Best, Qing

2017-06-22 5:06 GMT-07:00 camellia notifications@github.com:

Hi Qing, I used STAR and GATK to call the variants. And the raw RNA-seq data is from Mouse Encode project.

Best, Zhang

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/6#issuecomment-310360722, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zXSAkGNVZBt3F4x_vKV3UBwSYcPHks5sGljOgaJpZM4HjwZF .