zrcjessica / demux

A Snakemake pipeline for running single-cell demultiplexing simulations.

detecting doublets whose cells came from the same sample #1

Open aryarm opened 4 years ago

aryarm commented 4 years ago

I like to use Github issues as TODO lists. First item of business: conflicting UMIs.

The problem

UMIs are used by demuxlet and appear in the UB tags within the BAM file. When we create doublets, there is the possibility of duplicate UMIs appearing within the same doublet.

Is this a cause for concern?

Probably. demuxlet might discard duplicate UMIs or otherwise treat duplicates differently. But it's hard to say how prevalent this problem will be until we actually encounter it.

The solution

Not sure about this yet. Ideally, the new_bam.py script that changes the CB tags could change the UMIs in a way that prevents them from conflicting. But I'm not sure how we would do this given our current workflow design: new_bam.py is called once for each sample's BAM file; they aren't considered in tandem. Another idea would be to merge the BAM files together and then fix the duplicate UMIs after that. That might be fastest, but it would probably involve extra steps, like sorting the merged BAM by UB tag (ugh).
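To make the conflict concrete, here's a minimal sketch (toy barcodes and a hypothetical helper name, not code from the pipeline) of how two cells merged into one simulated doublet can end up sharing a UMI:

```python
def conflicting_umis(umis_a, umis_b):
    """Return the UMIs shared between the two cells of a simulated doublet."""
    return set(umis_a) & set(umis_b)

# toy example: the UB tags of two cells destined for the same doublet
cell1 = ["AACGTT", "GGCATC", "TTTAGC"]
cell2 = ["GGCATC", "CCATGA"]
print(sorted(conflicting_umis(cell1, cell2)))  # ['GGCATC']
```

Any UMI in that intersection would look like a PCR duplicate to demuxlet even though the two reads came from different cells.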

aryarm commented 4 years ago

ok, after analyzing the results of the simulation, I've come to the conclusion that this issue is actually something we need to be worried about

# how many doublets did we create with the same sample?
len(list(filter(lambda samp: samp[0] == samp[1], truth[truth['type'] == 'DBL']['sample'])))
979
# and how many doublets were predicted to have the same sample?
len(list(filter(lambda samp: samp[0] == samp[1], predicts[predicts['type'] == 'DBL']['sample'])))
0
zrcjessica commented 4 years ago

If I understand correctly, we're only concerned with UMI collisions in simulated doublets? If so, can we go in and check the UMIs for the doublets only, and change them if necessary?

aryarm commented 4 years ago

@zrcjessica you're right! The BAM file is too big to load into memory all at once to check for conflicting barcodes among the doublets, but I can try hashing the old cellular barcode together with the UMI. Since at least one of the two is guaranteed to be unique for any given read, I should get a new, unique UMI that works!

Now all I have to do is find a good hash function... It needs to take the cellular and molecular barcodes as input and return a new UMI consisting of only A, G, C, T. hmmm...
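For what it's worth, one off-the-shelf possibility (a sketch, not necessarily what new_bam.py should use, and `rehash_umi` is a made-up name) is to run the concatenated barcodes through a cryptographic hash and map the digest bytes back onto A/C/G/T:

```python
import hashlib

BASES = "ACGT"

def rehash_umi(ub, cb):
    """Derive a new UMI, the same length as `ub` and over A/C/G/T only,
    from an MD5 hash of the cell barcode + UMI.
    Assumes the UMI is at most 16 bp (one MD5 digest's worth of bytes)."""
    digest = hashlib.md5((cb + ub).encode()).digest()
    # map each digest byte to one of the four bases
    return "".join(BASES[b % 4] for b in digest[:len(ub)])
```

It's deterministic, so the same read always gets the same new UMI, but like any hash into a small space it can't rule out collisions entirely.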

zrcjessica commented 4 years ago

Another potential solution - how many cases of redundant UMIs do we have across all of the original .bam files? If it's not very many, we could simply drop those cells from this simulation.

aryarm commented 4 years ago

It's hard to know, because there are too many molecular barcodes to check. Every time I tried to build a unique list of the molecular barcodes (in order to count how many duplicates there are), my job used up too much memory on the cluster and got killed.

But my intuition tells me there will be a non-negligible number of conflicts. Every bead in every droplet was probably manufactured identically except for the cellular barcode, which means the pool of unique molecular identifiers on every bead is probably the same. Granted, only some (not all) of each bead's molecular identifiers actually get used and sequenced, but I'd still guess there will be quite a few collisions.
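A quick back-of-the-envelope check supports that intuition. If UMIs were drawn uniformly at random, the expected number of UMIs shared between two cells would follow the birthday approximation (the read counts below are made-up numbers, not measurements from our data):

```python
def expected_collisions(n_reads_a, n_reads_b, umi_len):
    """Birthday-problem estimate: expected number of UMI collisions between
    two cells' reads, assuming UMIs are uniform over the 4**umi_len space."""
    return n_reads_a * n_reads_b / 4 ** umi_len

# e.g. two cells with ~20k reads each and 10bp UMIs (assumed numbers)
print(expected_collisions(20_000, 20_000, 10))  # ≈ 381.5
```

So even under the most optimistic (uniformly random) model, a doublet built from two well-covered cells should share hundreds of UMIs.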

aryarm commented 4 years ago

I came up with a hash function in case we want to do it this way:

def hash_ub(ub, cb):
    """
        return a hash of the molecular barcode (ub)
        using the cellular barcode (cb) as salt
    """
    # Convert bases to integers
    ub = [('A','C','G','T').index(ch) for ch in ub]
    cb = [('A','C','G','T').index(ch) for ch in cb]

    # The cb is usually longer than the ub, so its extra trailing
    # characters would otherwise be ignored. To retain their
    # information, fold each extra character into every nth
    # character of the shortened cb (where n = len(ub)//extra_len)
    extra_len = len(cb)-len(ub)
    for i in range(extra_len):
        cb[i * (len(ub) // extra_len)] += cb[-(i+1)]

    # Finally, we can add the salt (the cb characters)
    for i in range(len(ub)):
        # 1) add the cb[i] value
        # 2) mod by 4 to ensure 0 <= ub[i] <= 3
        # 3) convert back from integer to ACGT
        ub[i] = ('A','C','G','T')[(ub[i] + cb[i]) % 4]
    # Join the bases together into a string
    return "".join(ub)

In new_bam.py, we can apply this function to every read that's going into a simulated doublet in order to get new, unique molecular barcodes. The function as written cannot guarantee 100% uniqueness. I'm not smart enough to figure out how to do that (it might even be impossible given the problem as stated). But I think it's good enough: it should at least reduce the number of conflicts substantially, I think (hope).

aryarm commented 4 years ago

ok, Graham acknowledged that this might be a problem, but said it doesn't matter because we're going to be throwing out the doublets when we use demuxlet anyway. Also, he doesn't think demuxlet can identify doublets whose cells come from the same sample, but he's interested in the idea of us doing this ourselves after running demuxlet (perhaps using a statistical test to check for an outsized number of UMIs).

aryarm commented 4 years ago

Ok, I'm actually going to reopen this issue to keep track of work for our new idea: discovering doublets whose cells come from the same sample.

The first order of business is to create some sort of a histogram for the number of UMIs across each droplet, colored by doublets vs singlets.
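As a starting point, here's a sketch of the tally that histogram needs (toy records stand in for real CB/UB tags, and the droplet names are made up):

```python
from collections import defaultdict

# toy (droplet_barcode, umi) pairs standing in for CB/UB tags
reads = [
    ("CELL1", "AAAA"), ("CELL1", "CCCC"), ("CELL1", "AAAA"),
    ("CELL2", "GGGG"),
    ("DBL1", "AAAA"), ("DBL1", "TTTT"), ("DBL1", "CCCC"),
]

# collect the distinct UMIs seen in each droplet
umis = defaultdict(set)
for cb, ub in reads:
    umis[cb].add(ub)
counts = {cb: len(s) for cb, s in umis.items()}
print(sorted(counts.items()))  # [('CELL1', 2), ('CELL2', 1), ('DBL1', 3)]
```

With real data the counts would come from streaming the BAM, and the histogram itself could be drawn by splitting these counts on the truth table's `type` column (DBL vs. singlet) before plotting.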