Open aleednn opened 1 month ago
Hello @aleednn,
First of all, thank you for your interest in scBayes and your question. To be clear, the yaml file is not automatically generated by subcloneseeker, but rather something you have to compose manually. This is intentional as we did not want to make our tools tightly coupled together. Someone might prefer a different subclone reconstruction algorithm and we want them to be able to use scBayes as well.
Because of this, some level of understanding of the subclone analysis software is necessary to carry out the yaml file construction. At a minimal, the subclone analysis software should provide you with the following information:
subcloneseeker does not perform 2 directly, but rather, it relies on the clustering results from the clustering analysis that is upstream of subclone analysis. It does produce 1, from which you should be able to infer which mutations are in what sub clones. BTW, just in case you want to use subclone seeker, you should make sure that you use the updated subclone seeker version 2 (aka superseeker) at https://github.com/yiq/superseeker. Unfortunately I have not had the time to update the document. But if this is something you want to do, let me know and I can help you with getting it to run. Here is an example:
Let's say that from clustering analysis (perhaps with pyclone) you identified two mutation clusters A and B. You can use subclone seeker to compute the subclone structure. Let's say it produces a linear evolution tree A -> AB. This means that you have two sub clones, one with mutations in cluster A, and the other with mutations in both cluster A and cluster B. Armed with this knowledge, here is what you need to do to create the yaml file:
clusters:
- name: A
vcf: ClusterA.vcf
- name: B
vcf: ClusterB.vcf
subclones:
- name: normal
fraction: 0.1
clusters: []
defining_cluster: ""
- name: subclone1
fraction: 0.7
clusters: [A]
defining_cluster: A
- name: subclone2
fraction: 0.2
clusters: [A, B]
defining_cluster: B
notice that in the clusters
section, you define the mutations for each cluster using VCF files; and in the subclones
section, you define the subclone structure using the mutation clusters. Both information is necessary to carry out the scAssign
function.
Let me know if this makes sense, and if you have further questions.
@yiq
Thank you for your response!
My question is just about step 2, I have already done the clustering of mutations with another tool (DPclust). But I don't understand how to assign the clusters to the subclones identified by subcloneseeker.
In particular, I have three questions:
0 0.11 0.80 1 I get this tree:
digraph { n1 [label="n1: 0%"]; n2 [label="n2: 8.26%"]; n3 [label="n3: 80.3%"]; n4 [label="n4: 11.4%"]; n1->n2; n2->n3; n2->n4; }
How should clusters be assigned to subclones?
Thank you for your help! AD
Hi @aleednn,
I believe you are using the old version of subclone seeker. I strongly recommend that you switch to v2 (aka superseeker) as it is much more recent and a bit easier to work with. you can download the release tarball with configure
script generated at https://github.com/yiq/SuperSeeker/releases/tag/v2.0.0-rc2
I am not too familiar with DPclust output, but I imagine you can find out about which variants belong to which clusters. To use superseeker, you want to annotate your VCF file so that you add an extra INFO field to each of your somatic mutations. The field label should be called "AFCLU" and the value should be an distinct, integer number denoting the cluster identity. So if a variant belongs to cluster 1, you want something like this:
20 14370 . G A 29 PASS NS=3;DP=14;AO=3;RO=11;DB;H2;AFCLU=1 [... rest of the vcf line ...]
superseeker will take this vcf file as input, and compute the cluster cellular prevalence by averaging the allele frequencies of all variants belonging to the same cluster. There are a few advantages of doing it this way:
scGenotype
AFCLU=<cluster-id>
from your master vcf file (making sure you duplicate the header as well).Just two notes:
DP
, AO
, and RO
fields in the GENOTYPE columns (after FORMAT). If your vcf file does not contain them, you can let me know what it looks like and I can update the code so it work with them properly.AFCLU
INFO field, you want to stick this line into the header: ##INFO=<ID=AFCLU,Number=1,Type=String,Description="The allele frequency cluster this variant belongs to">
The result will be a vcf file printed on stdout (so you can redirect to a file). It is going to be identical to the input file except a few extra header lines that starts with ##subclone=
which describes the subclone structure similar to the n1->n2
notation you posted above. You can use that information to create the yaml file. So here is an example output:
##subclone="1->2, 1->3, n->1" trace=<tumor="0:0.00161995, 1:-0.00161995, 2:0.12, 3:0.87">
This means you have a two layer, tree where the second layer is bifurcating (imaging an upside down Y). n
is a special node that denotes the normal root; 1, 2, and 3 are the cluster ids you provide in the AFCLU fields. So this describes four subclones
normal: at nearly 0 cellular prevalence clone 1: only contains mutations in cluster 1 at almost 0 cellular prevalence clone 2: contains mutations in cluster 1 and 2 (because 2 descends from 1) at 12% cellular prevalence clone 3: contains mutations in cluster 1 and 3 (because 3 descends from 1) at 87% cellular prevalence
With this, you would write the yaml file as follows:
clusters:
- name: C1
vcf: Cluster1.vcf
- name: C2
vcf: Cluster2.vcf
- name: C3
vcf: Cluster3.vcf
subclones:
- name: normal
fraction: 0.01
clusters: []
defining_cluster: ""
- name: subclone1
fraction: 0.01
clusters: [C1]
defining_cluster: C1
- name: subclone2
fraction: 0.12
clusters: [C1, C2]
defining_cluster: C2
- name: subclone2
fraction: 0.87
clusters: [C1, C3]
defining_cluster: C3
Notice that I gave normal and subclone 1 a small but non-zero CP at 0.01. You always want to give some small number even if it is at near zero cellular prevalence. Also, the numbers don't need to add up to 1 as it will be normalized internally by scAssign
@aleednn just want to check in. Let me know if you have managed to run this successfully.
In the paper it is suggested to use subcloneseeker to build yalm files, however the output of this tool is a tree of subclones, but for each subclone it is not specified what mutations/clusters are present. How do you build the yalm file from the output of subcloneseeker?
thank you AD