yalm file - Githubissues

In the paper it is suggested to use subcloneseeker to build yalm files, however the output of this tool is a tree of subclones, but for each subclone it is not specified what mutations/clusters are present. How do you build the yalm file from the output of subcloneseeker?

thank you AD

Hello @aleednn,

First of all, thank you for your interest in scBayes and your question. To be clear, the yaml file is not automatically generated by subcloneseeker, but rather something you have to compose manually. This is intentional as we did not want to make our tools tightly coupled together. Someone might prefer a different subclone reconstruction algorithm and we want them to be able to use scBayes as well.

Because of this, some level of understanding of the subclone analysis software is necessary to carry out the yaml file construction. At a minimal, the subclone analysis software should provide you with the following information:

what is the subclone structure
what mutations are found in each subclone

subcloneseeker does not perform 2 directly, but rather, it relies on the clustering results from the clustering analysis that is upstream of subclone analysis. It does produce 1, from which you should be able to infer which mutations are in what sub clones. BTW, just in case you want to use subclone seeker, you should make sure that you use the updated subclone seeker version 2 (aka superseeker) at https://github.com/yiq/superseeker. Unfortunately I have not had the time to update the document. But if this is something you want to do, let me know and I can help you with getting it to run. Here is an example:

Let's say that from clustering analysis (perhaps with pyclone) you identified two mutation clusters A and B. You can use subclone seeker to compute the subclone structure. Let's say it produces a linear evolution tree A -> AB. This means that you have two sub clones, one with mutations in cluster A, and the other with mutations in both cluster A and cluster B. Armed with this knowledge, here is what you need to do to create the yaml file:

Identify which exact mutations belong to cluster A and B (using your clustering tool outputs). Create separate VCF files for A mutations and B mutations

Create a yaml file that looks something like this:

clusters:
- name: A
  vcf: ClusterA.vcf
- name: B
  vcf: ClusterB.vcf
subclones:
- name: normal
  fraction: 0.1
  clusters: []
  defining_cluster: ""
- name: subclone1
  fraction: 0.7
  clusters: [A]
  defining_cluster: A
- name: subclone2
  fraction: 0.2
  clusters: [A, B]
  defining_cluster: B

notice that in the clusters section, you define the mutations for each cluster using VCF files; and in the subclones section, you define the subclone structure using the mutation clusters. Both information is necessary to carry out the scAssign function.

Let me know if this makes sense, and if you have further questions.

@yiq

Thank you for your response!

My question is just about step 2, I have already done the clustering of mutations with another tool (DPclust). But I don't understand how to assign the clusters to the subclones identified by subcloneseeker.

In particular, I have three questions:

cluster2db requires a file with CP of the primary sample and relapse sample, in case you only have the primary sample, is it possible to use this function anyway?
Output CP is not between 0-1, should it be normalized in this range?
Considering only the primary sample (clusters CP):

0 0.11 0.80 1 I get this tree:

digraph { n1 [label="n1: 0%"]; n2 [label="n2: 8.26%"]; n3 [label="n3: 80.3%"]; n4 [label="n4: 11.4%"]; n1->n2; n2->n3; n2->n4; }

How should clusters be assigned to subclones?

Thank you for your help! AD

Hi @aleednn,

I believe you are using the old version of subclone seeker. I strongly recommend that you switch to v2 (aka superseeker) as it is much more recent and a bit easier to work with. you can download the release tarball with configure script generated at https://github.com/yiq/SuperSeeker/releases/tag/v2.0.0-rc2

I am not too familiar with DPclust output, but I imagine you can find out about which variants belong to which clusters. To use superseeker, you want to annotate your VCF file so that you add an extra INFO field to each of your somatic mutations. The field label should be called "AFCLU" and the value should be an distinct, integer number denoting the cluster identity. So if a variant belongs to cluster 1, you want something like this:

20     14370   . G      A       29   PASS   NS=3;DP=14;AO=3;RO=11;DB;H2;AFCLU=1    [... rest of the vcf line ...]

superseeker will take this vcf file as input, and compute the cluster cellular prevalence by averaging the allele frequencies of all variants belonging to the same cluster. There are a few advantages of doing it this way:

you don't need to worry about all different types of files. It's one vcf in, one vcf out
you can have one or more samples represented in the vcf file. So it can do single sample, primary / relapse, or multiple ones all the same
you can use this master VCF file for scGenotype
you can generate the cluster-specific vcf files by simply grepping AFCLU=<cluster-id> from your master vcf file (making sure you duplicate the header as well).

Just two notes:

the VCF file will need the DP, AO, and RO fields in the GENOTYPE columns (after FORMAT). If your vcf file does not contain them, you can let me know what it looks like and I can update the code so it work with them properly.
To be spec-complaint, once you added the AFCLU INFO field, you want to stick this line into the header: ##INFO=<ID=AFCLU,Number=1,Type=String,Description="The allele frequency cluster this variant belongs to">
some fancy clustering algorithms can attempt to correct CP using copy number calls. In our experience we have found it to be very unreliable as the absolute copy number is often difficult to estimate. So we restrict our mutations to copy number 2, LOH-free regions. superseeker makes this assumption, and calculate CP using a simple formula that is CP=2*AF

The result will be a vcf file printed on stdout (so you can redirect to a file). It is going to be identical to the input file except a few extra header lines that starts with ##subclone= which describes the subclone structure similar to the n1->n2 notation you posted above. You can use that information to create the yaml file. So here is an example output: ##subclone="1->2, 1->3, n->1" trace=<tumor="0:0.00161995, 1:-0.00161995, 2:0.12, 3:0.87"> This means you have a two layer, tree where the second layer is bifurcating (imaging an upside down Y). n is a special node that denotes the normal root; 1, 2, and 3 are the cluster ids you provide in the AFCLU fields. So this describes four subclones

normal: at nearly 0 cellular prevalence clone 1: only contains mutations in cluster 1 at almost 0 cellular prevalence clone 2: contains mutations in cluster 1 and 2 (because 2 descends from 1) at 12% cellular prevalence clone 3: contains mutations in cluster 1 and 3 (because 3 descends from 1) at 87% cellular prevalence

With this, you would write the yaml file as follows:

clusters:
    - name: C1
      vcf: Cluster1.vcf
    - name: C2
      vcf: Cluster2.vcf
    - name: C3
      vcf: Cluster3.vcf
subclones:
    - name: normal
      fraction: 0.01
      clusters: []
      defining_cluster: ""
    - name: subclone1
      fraction: 0.01
      clusters: [C1]
      defining_cluster: C1
    - name: subclone2
      fraction: 0.12
      clusters: [C1, C2]
      defining_cluster: C2
    - name: subclone2
      fraction: 0.87
      clusters: [C1, C3]
      defining_cluster: C3

Notice that I gave normal and subclone 1 a small but non-zero CP at 0.01. You always want to give some small number even if it is at near zero cellular prevalence. Also, the numbers don't need to add up to 1 as it will be normalized internally by scAssign

@aleednn just want to check in. Let me know if you have managed to run this successfully.

yiq / scBayes

yalm file #2