extract the variant information

by-young commented 4 years ago

Hi Zilu, DENDRO is a wonderful tool and may be very useful to me, but I have run into a big problem while building the pipeline:

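In the example script you showed, you call the indels at sample resolution, right? You then get a gVCF file containing the indels called from the sample's BAM (like SRR5023621.bam). So how do you extract the indel information for each cell, and how do you get the three matrices from the gVCF produced by indel calling? I am not sure I have explained this clearly, so let me restate it:

First, I can follow every step in your example script. My question is this: you call indels per sample, and each sample contains many cells and is a mixed sample that has not been separated by cell (I am not sure whether I understand this correctly). The resulting gVCF therefore holds the indel information of a whole sample, so how do you extract the indel information of each cell?

In other words, how are the three input matrices of the R package, X, N and Z, each obtained? Thanks.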

mabraao commented 4 years ago

Hi,

I have the same question about the steps after generating the combined VCF file. In the vignette, it says that "DENDRO extract information of X,N,Z". How is it performed?

Thanks!

zhouzilu commented 4 years ago

Hi all,

Apologies for the delay. I have made a quick R script to convert the VCF to DENDRO input here. Please find the script and let me know if it works.

Best, Zilu

by-young commented 4 years ago

Thank you for your script, but I still have a question:

The last step of calling variants is to use this command:

    java -jar pathtogatk/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs \
      -R pathtostar/STAR_hg19/ucsc.hg19.fasta \
      -V SRR2973275.sorted.rg.dedup.realigned.recal.raw.snps.indels.cof20.erc.g.vcf \
      -V SRR2973351.sorted.rg.dedup.realigned.recal.raw.snps.indels.cof20.erc.g.vcf \
      -o output.g.vcf

My question is that output.g.vcf contains the variant information of all samples, but there is no per-sample information in it; only merged information is left. So how do you know each sample's (or each cell's) variant information?

zhouzilu commented 3 years ago

Sorry for the delay. You will have a sample-by-variant matrix, with each row a variant and each column a sample. Do you mind elaborating on "each sample's variants information"?
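One quick way to see whether a combined VCF carries one column per sample is to read its `#CHROM` header line: in the VCF format the first nine columns are fixed (CHROM through FORMAT) and every column after FORMAT is a sample. The sketch below is only an illustration; the function name `vcf_sample_names` and the example header are hypothetical, with the two cell IDs from this thread used as sample names.

```python
def vcf_sample_names(vcf_lines):
    """Return the sample column names found on a VCF's #CHROM header line."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # Columns 1-9 are CHROM..FORMAT; everything after is one column per sample.
            return fields[9:]
    raise ValueError("no #CHROM header line found")


# Hypothetical header of a two-cell VCF (not taken from a real output file).
example_header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSRR2973275\tSRR2973351\n",
]

samples = vcf_sample_names(example_header)
print(len(samples), samples)
```

For a real file, the same function can be fed an open file handle, e.g. `vcf_sample_names(open("output.g.vcf"))`. If it returns a single name, the file holds only one sample column.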

by-young commented 3 years ago

Thank you for your reply.

After the last step I mentioned above (the command starting with 'java -jar pathtogatk/gatk/GenomeAnalysisTK.jar'), I got a merged-variants VCF file instead of a sample-by-variant matrix.

"each sample's variants information" just means the sample-by-variant matrix, with each row a variant and each column a sample.

So I wonder whether you have any further steps after merging the variant information of all the samples.

By the way, the dataset you used is single-cell data from the Smart-seq2 protocol. Does that mean each sample is just one cell?

Best wishes.

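Thank you for your reply. I am not sure whether my question came across, so let me state it once more:

    java -jar pathtogatk/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs \
      -R pathtostar/STAR_hg19/ucsc.hg19.fasta \
      -V SRR2973275.sorted.rg.dedup.realigned.recal.raw.snps.indels.cof20.erc.g.vcf \
      -V SRR2973351.sorted.rg.dedup.realigned.recal.raw.snps.indels.cof20.erc.g.vcf \
      -o output.g.vcf

After running this last command, the output.g.vcf file I get is a VCF with the variant information of all samples. It contains only the variant sites, not the variant matrix you describe, with rows as variant sites and columns as samples. So do you have any other steps before you obtain this variant matrix?

Also, the data used in your paper is Smart-seq2 single-cell data. I would like to confirm with you: does each sample contain the data of only one cell?

Thank you again. I hope to hear from you!

Wishing you all the best in your work!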

zhouzilu commented 3 years ago

OK, I understand now. With Smart-seq2 data, each cell has its own BAM/FASTQ file. You are correct. Every erc.g.vcf file contains the information of a single cell: SRR2973275 and SRR2973351 are two cells. The above code will return a D-by-2 table, with D the total number of variants. Some more details of the pipeline:

1. If the data is not distributed by individual cell, we can first separate the combined BAM file into small BAM files by cell barcode (as with 10X), so that each BAM file contains the reads of an individual cell.
2. We then run mutation detection on each cell's BAM and get a g.vcf file per cell (this step can be highly parallelized on a computing cluster).
3. We further combine them with GenotypeGVCFs into the final output.g.vcf (as shown here).
4. We can then extract the cell-by-mutation matrix with this little R tool.
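The last step, going from a multi-sample VCF to per-cell matrices, can be sketched as follows. This is not the R tool mentioned above, only a minimal Python illustration under stated assumptions: each genotype entry carries `GT` and `AD` in its FORMAT field (as GATK output normally does), X is taken as the alternative-allele read count, N as the total read count (sum of `AD`), and Z as a 0/1 mutation call derived from `GT`, with `None` where the cell has no call. That reading of X, N and Z is my interpretation of the DENDRO inputs, and the function name `vcf_to_xnz` and the example records are hypothetical.

```python
def vcf_to_xnz(vcf_lines):
    """Build variant-by-cell X, N, Z tables from the text lines of a multi-sample VCF.

    Assumption: FORMAT contains GT and AD for every record.
    Returns (samples, variants, X, N, Z); rows are variants, columns are cells.
    """
    samples, variants, X, N, Z = [], [], [], [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # columns after FORMAT are the cells
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        keys = fields[8].split(":")  # FORMAT keys, e.g. GT:AD:DP
        x_row, n_row, z_row = [], [], []
        for entry in fields[9:]:
            values = dict(zip(keys, entry.split(":")))
            # AD lists read depths as ref,alt1,alt2,...; non-numeric parts are ignored.
            ad = [int(d) for d in values.get("AD", "").split(",") if d.isdigit()]
            total = sum(ad)
            gt = values.get("GT", ".").replace("|", "/").split("/")
            if total == 0 or "." in gt:
                z = None  # no coverage or no genotype call for this cell
            else:
                z = int(any(allele != "0" for allele in gt))
            x_row.append(sum(ad[1:]))
            n_row.append(total)
            z_row.append(z)
        variants.append(f"{chrom}:{pos}:{ref}>{alt}")
        X.append(x_row)
        N.append(n_row)
        Z.append(z_row)
    return samples, variants, X, N, Z


# Hypothetical two-cell, two-variant VCF body (values invented for illustration).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSRR2973275\tSRR2973351",
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:AD:DP\t0/1:3,2:5\t0/0:4,0:4",
    "chr2\t200\t.\tC\tT\t60\t.\t.\tGT:AD:DP\t./.:0,0:0\t1/1:0,6:6",
]

samples, variants, X, N, Z = vcf_to_xnz(example_vcf)
print(samples)
print(variants)
print(X, N, Z)
```

Each returned table has D rows and one column per cell, so two per-cell g.vcf inputs give the D-by-2 shape described above. How uncovered sites should be encoded for DENDRO itself is best checked against the package documentation.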

Sorry for the confusion...

by-young commented 3 years ago

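I am very sorry to bother you again. I ran the code above and got a file in this form; the second image shows the last two columns (FORMAT, sample). It does not contain the D-by-2 table with per-sample information that you described, so is there something wrong with the command I ran?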

zhouzilu / DENDRO

extract the variant information #9