zktuong / dandelion

dandelion - A single cell BCR/TCR V(D)J-seq analysis package for 10X Chromium 5' data
https://sc-dandelion.readthedocs.io/
GNU Affero General Public License v3.0
109 stars 25 forks source link

Adapt the new scverse AIRR data structure #261

Closed grst closed 10 months ago

grst commented 1 year ago

Is your feature request related to a problem?

Hi @zktuong and @DennisCambridge,

the new scverse AIRR data structure is out in scirpy v0.13 RC and ready to be experimented with.

As discussed previously, it would be fantastic to get dandelion aligned to it and I'm opening this issue to track this challenge.

Describe the solution you'd like

In an ideal world, dandelion could directly work on the new data structure -- with the potential to get rid of the separate Dandelion class and the h5ddl format.

Describe alternatives you've considered

This could turn out to be a tremendous amount of work to transfer to a new data structure (I experienced this myself with scirpy). Therefore an easier solution could be to just support import and export from the new data structure.

Additional context

No response

zktuong commented 1 year ago

Thanks @grst.

yes indeed it probably will break lots of things but nevertheless i am committed to implementing this somehow.

At the first instance, I'm thinking I can probably get rid of Query class https://github.com/zktuong/dandelion/blob/878c1a0a8796ef36b6e8f1881d280777c497fd5a/dandelion/utilities/_core.py#L1470-L1481 and use the awkward array you use.

Will keep you updated on the progress here (while i try to potentially find a student to work on this !)

DennisCambridge commented 1 year ago

Hi @grst and @zktuong, after upgrading scirpy with pip install --upgrade --pre scirpy, it takes much longer for ddl.from_scirpy(irdata) to complete.

grst commented 1 year ago

Hi @DennisCambridge,

i have an idea why (https://github.com/scverse/scirpy/pull/399).

Can you try that branch?

pip install git+https://github.com/scverse/scirpy.git@speed-up-to-airr-cells
DennisCambridge commented 1 year ago

Hi Gregor @grst ,

Yes, it seems to have been restored after I installed this branch.

grst commented 1 year ago

perfect, thanks for your feedback! I'll merge it as soon as the checks ran through.

zktuong commented 1 year ago

some initial scalability test

scalability_test.pdf

zktuong commented 1 year ago

incorporate scverse-airr https://github.com/zktuong/dandelion/issues/261

zktuong commented 1 year ago

@grst am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?

Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")? I noticed that if i do this on mudata, the mudata.obs doesn't show it.

grst commented 1 year ago

am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function? Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")?

Exactly! There's also a neat context manager to just temporarily add a column for e.g. plotting and automatically remove it again:

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="VJ_1_v_call")

I noticed that if i do this on mudata, the mudata.obs doesn't show it.

Mudata doesn't propagate changes to obs of one modality automatically. Essentially think of mdata.obs as a separate static data frame that is updated on-demand.

You can either assign columns to mdata.obs directly:

mdata.obs["airr:vj_1_locus"] = ir.get.airr(adata, "locus", "VJ_1")

or use the update_obs to propagate changes:

adata = mdata["airr"]
adata.obs["vj_1_locus"] =  ir.get.airr(adata, "locus", "VJ_1")
mdata.update_obs()

I made scirpy functions add to both adata.obs and mdata.obs by default: See DataHandler class.

zktuong commented 1 year ago

Gotcha!

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m: mu.pl.umap(mdata, color="VJ_1_v_call")

This didn't work for me unfortunately i had to do

with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata["airr"], color="VJ_1_v_call")

Anyway - ok now i know what the the new datastructure is doing and i've put in a couple of issues/requests

The roadmap that i have in mind at the moment is for dandelion to work using adata.obsm['airr'] as a start. This would require its own function similar to ir.get.airr.

zktuong commented 1 year ago

and support mudata too

grst commented 1 year ago

This didn't work for me unfortunately i had to do

Sorry, my bad. When using mdata directly, the key is color="airr:VJ_1_v_call".

This would require its own function similar to ir.get.airr.

Just for clarification: Would it need its own function because you don't want to have scirpy as a heavy dependency, or because it really needs to work differently? If it's the former, there's always the option to factor that out in a helper package, see also https://github.com/scverse/scirpy/issues/385

zktuong commented 1 year ago

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

still not working for mdata

for completeness this is my code:

mdata = ir.datasets.wu2020_3k()
mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata["airr"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[14], line 2
      1 with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)
with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[17], line 2
      1 with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)
grst commented 1 year ago

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

fair enough. the scirpy.get.airr function is of course somewhat specific to the scirpy receptor model, so makes sense that dandelion has a different one.

still not working for mdata

mu.pl.umap looks for X_umap in the MuData object, not in the anndata objects. In general, whenever you use mdata.obs{m}, it will look in the "global" version of MuData. If you want to access a value from a modality, you need the airr: or gex: prefix.

So what should work here is either

mdata.obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mu.pl.umap(mdata, color="airr:VJ_1_v_call")

or

mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata.update() # <-- not sure if this is even needed
mu.pl.embedding(mdata, basis="gex:X_umap", color="airr:VJ_1_v_call")
zktuong commented 1 year ago

OK i've made some preliminary progress over at (https://github.com/zktuong/dandelion/commit/d5fb09f724acc94f6eb4e0952c2ed80fa96cb016).

There's a list of TODOs that are probably quite trivial to work through - might get a few grad students to work on them over the next quarter.

I also had to deactivate the cell level (.obs equivalent) data generation because it takes a long time to load, compared to just operating from pandas dataframe natively.

zktuong commented 10 months ago

implemented in #342

grst commented 10 months ago

Ok, then I should probably change the scirpy functions io.from_dandelion and io.to_dandelion to be aliases of your functions once this is released?

zktuong commented 10 months ago

yes. perhaps @amoschoomy can do in the PR on scirpy's side and you can do a quick check?

grst commented 10 months ago

that would be fab!