zdebruine / singlet

Single-cell analysis with non-negative matrix factorization

Extracting barcodes per component #52

Open kmh005 opened 6 months ago

kmh005 commented 6 months ago

Hello,

Terrific package; I started using it in October 2023. I've hit a few snags with MetadataPlot in the latest release (new install), and I'll post a separate issue for those.

Regarding extraction of the barcodes that contribute to each component: is my understanding correct that, with the sparse representation, every barcode with a non-0 value in the cell embeddings is treated as counting towards that component? I need to match this up with the barcodes that are retained per component in MetadataPlot.

Or should I try to apply something more like the Kim et al. 2007 approach to score and extract barcodes, as is done in the NMF package?

Your guidance would be most appreciated here.

Thanks, kmh005

zdebruine commented 6 months ago

Your intuition is correct: any non-0 value indicates that the sample or feature (i.e. cell barcode or transcript ID) contributes to that component. While in theory you can do any type of scoring, enrichment analysis, or summary statistic on the model, it is sometimes most effective (and least error-prone) to just stay with the actual component weights for interpretation.
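As a minimal sketch of reading contributions straight from the weights, the base R snippet below uses a small made-up embeddings matrix with barcodes in rows and components in columns. The matrix `emb`, its barcode names, and the helper `barcodes_for_component` are hypothetical illustrations and not part of singlet.

```r
# Toy stand-in for a cells x components embedding matrix (hypothetical values).
emb <- matrix(
  c(0.0, 0.8,
    0.5, 0.0,
    0.2, 0.1,
    0.0, 0.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("AAAC", "AAAG", "AACT", "AAGT"), c("NMF_1", "NMF_2"))
)

# Hypothetical helper: barcodes with a non-zero weight in one component,
# ordered from largest to smallest weight.
barcodes_for_component <- function(emb, component) {
  w <- emb[, component]
  w <- w[w != 0]
  names(sort(w, decreasing = TRUE))
}

nmf1_barcodes <- barcodes_for_component(emb, "NMF_1")

# Number of barcodes with a non-zero weight in each component.
nonzero_per_component <- colSums(emb != 0)

print(nmf1_barcodes)
print(nonzero_per_component)
```

Sorting by weight keeps the ranking available, so a non-zero cut can later be compared against a stricter one without recomputing anything.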

Of course, bear in mind that the resolution (rank) of the model is very important. The model can "hallucinate" by squishing together information that should not be in the same component (underfitting due to too low a rank), or it can fail to appreciate information that should indeed be viewed jointly (overfitting due to too high a rank). This tradeoff is a hard one to really understand.

kmh005 commented 6 months ago

Thank you for the explanation and quick response, it's much appreciated. Hopefully my first model has a good rank (21 for 125k barcodes). Feel free to close out!

kmh005 commented 6 months ago

Hi again,

Just following up here, if you don't mind. How are the barcodes for MetadataSummary/MetadataPlot pulled? When I use all non-0 values of object@reductions[["nmf"]]@cell.embeddings[,component], I get a much larger number of barcodes than are represented in MetadataSummary. For component 1, the split is roughly 51k non-zero to 72k zero. Looking at the MetadataPlot, component 1 is largely dominated by 3 cell types, but they don't add up to nearly 51k. The same pattern holds for every component. A little more explanation would be appreciated, if you have the time. Thanks!
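How MetadataSummary aggregates barcodes is not stated in this thread, so the following is only a diagnostic sketch, not a description of the package's method. It assumes a toy one-component embedding `emb` and made-up `cell_type` labels (both hypothetical), and compares two views of the same component: how many barcodes per cell type have a non-zero weight, and what share of the component's total weight each cell type carries.

```r
# Toy embeddings: 6 barcodes x 1 component (hypothetical values).
emb <- matrix(
  c(0.90, 0.80, 0.01, 0.02, 0.00, 0.01),
  ncol = 1,
  dimnames = list(paste0("bc", 1:6), "NMF_1")
)

# Hypothetical cell type labels for the same six barcodes.
cell_type <- factor(c("T", "T", "B", "B", "B", "NK"))

w <- emb[, "NMF_1"]

# View 1: number of barcodes with any non-zero weight, per cell type.
nonzero_counts <- tapply(w != 0, cell_type, sum)

# View 2: share of the component's total weight carried by each cell type.
weight_share <- tapply(w, cell_type, sum) / sum(w)

print(nonzero_counts)
print(round(weight_share, 3))
```

In this toy data the T cells account for only 2 of the 5 non-zero barcodes but carry about 98% of the weight, so a count of non-zero barcodes and a weight-based summary can give very different pictures of the same component. Whether that explains the gap reported above depends on how MetadataSummary is actually computed.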