Closed grst closed 10 months ago
Thanks @grst.
yes indeed it probably will break lots of things but nevertheless i am committed to implementing this somehow.
At the first instance, I'm thinking I can probably get rid of Query
class https://github.com/zktuong/dandelion/blob/878c1a0a8796ef36b6e8f1881d280777c497fd5a/dandelion/utilities/_core.py#L1470-L1481 and use the awkward array you use.
Will keep you updated on the progress here (while i try to potentially find a student to work on this !)
Hi @grst and @zktuong, after upgrading scirpy with pip install --upgrade --pre scirpy
, it takes much longer for ddl.from_scirpy(irdata)
to complete.
Hi @DennisCambridge,
i have an idea why (https://github.com/scverse/scirpy/pull/399).
Can you try that branch?
pip install git+https://github.com/scverse/scirpy.git@speed-up-to-airr-cells
Hi Gregor @grst ,
Yes, it seems to have been restored after I installed this branch.
perfect, thanks for your feedback! I'll merge it as soon as the checks ran through.
some initial scalability test
@grst am i right to think that the new structure essentially holds off on the generation of the cell level .obs
data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?
Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")
? I noticed that if i do this on mudata
, the mudata.obs
doesn't show it.
am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function? Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")?
Exactly! There's also a neat context manager to just temporarily add a column for e.g. plotting and automatically remove it again:
with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
mu.pl.umap(mdata, color="VJ_1_v_call")
I noticed that if i do this on mudata, the mudata.obs doesn't show it.
Mudata doesn't propagate changes to obs of one modality automatically. Essentially think of mdata.obs
as a separate static data frame that is updated on-demand.
You can either assign columns to mdata.obs
directly:
mdata.obs["airr:vj_1_locus"] = ir.get.airr(adata, "locus", "VJ_1")
or use the update_obs
to propagate changes:
adata = mdata["airr"]
adata.obs["vj_1_locus"] = ir.get.airr(adata, "locus", "VJ_1")
mdata.update_obs()
I made scirpy functions add to both adata.obs
and mdata.obs
by default: See DataHandler class.
Gotcha!
with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m: mu.pl.umap(mdata, color="VJ_1_v_call")
This didn't work for me unfortunately i had to do
with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
mu.pl.umap(mdata["airr"], color="VJ_1_v_call")
Anyway - ok now i know what the the new datastructure is doing and i've put in a couple of issues/requests
The roadmap that i have in mind at the moment is for dandelion to work using adata.obsm['airr']
as a start. This would require its own function similar to ir.get.airr
.
and support mudata
too
This didn't work for me unfortunately i had to do
Sorry, my bad. When using mdata
directly, the key is color="airr:VJ_1_v_call"
.
This would require its own function similar to ir.get.airr.
Just for clarification: Would it need its own function because you don't want to have scirpy as a heavy dependency, or because it really needs to work differently? If it's the former, there's always the option to factor that out in a helper package, see also https://github.com/scverse/scirpy/issues/385
at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.
still not working for mdata
for completeness this is my code:
mdata = ir.datasets.wu2020_3k()
mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata["airr"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
mu.pl.umap(mdata, color="airr:VJ_1_v_call")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
142 try:
--> 143 mod, basis_mod = basis.split(":")
144 except ValueError:
ValueError: not enough values to unpack (expected 2, got 1)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[14], line 2
1 with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2 mu.pl.umap(mdata, color="airr:VJ_1_v_call")
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
268 """
269 UMAP Scatter plot
270
271 See :func:`muon.pl.embedding` for details.
272 """
--> 273 return embedding(mdata, basis="umap", **kwargs)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
143 mod, basis_mod = basis.split(":")
144 except ValueError:
--> 145 raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
147 if mod not in data.mod:
148 raise ValueError(
149 f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
150 )
ValueError: Basis umap is not present in the MuData object (.obsm)
with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
mu.pl.umap(mdata, color="airr:VJ_1_v_call")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
142 try:
--> 143 mod, basis_mod = basis.split(":")
144 except ValueError:
ValueError: not enough values to unpack (expected 2, got 1)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[17], line 2
1 with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2 mu.pl.umap(mdata, color="airr:VJ_1_v_call")
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
268 """
269 UMAP Scatter plot
270
271 See :func:`muon.pl.embedding` for details.
272 """
--> 273 return embedding(mdata, basis="umap", **kwargs)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
143 mod, basis_mod = basis.split(":")
144 except ValueError:
--> 145 raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
147 if mod not in data.mod:
148 raise ValueError(
149 f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
150 )
ValueError: Basis umap is not present in the MuData object (.obsm)
at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.
fair enough. the scirpy.get.airr
function is of course somewhat specific to the scirpy receptor model, so makes sense that dandelion has a different one.
still not working for mdata
mu.pl.umap
looks for X_umap
in the MuData object, not in the anndata objects. In general, whenever you use mdata.obs{m}
, it will look in the "global" version of MuData. If you want to access a value from a modality, you need the airr:
or gex:
prefix.
So what should work here is either
mdata.obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mu.pl.umap(mdata, color="airr:VJ_1_v_call")
or
mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata.update() # <-- not sure if this is even needed
mu.pl.embedding(mdata, basis="gex:X_umap", color="airr:VJ_1_v_call")
OK i've made some preliminary progress over at (https://github.com/zktuong/dandelion/commit/d5fb09f724acc94f6eb4e0952c2ed80fa96cb016).
There's a list of TODOs that are probably quite trivial to work through - might get a few grad students to work on them over the next quarter.
I also had to deactivate the cell level (.obs equivalent) data generation because it takes a long time to load, compared to just operating from pandas dataframe natively.
implemented in #342
Ok, then I should probably change the scirpy functions io.from_dandelion
and io.to_dandelion
to be aliases of your functions once this is released?
yes. perhaps @amoschoomy can do in the PR on scirpy's side and you can do a quick check?
that would be fab!
Is your feature request related to a problem?
Hi @zktuong and @DennisCambridge,
the new scverse AIRR data structure is out in scirpy v0.13 RC and ready to be experimented with.
As discussed previously, it would be fantastic to get dandelion aligned to it and I'm opening this issue to track this challenge.
Describe the solution you'd like
In an ideal world, dandelion could directly work on the new data structure -- with the potential to get rid of the separate
Dandelion
class and theh5ddl
format.Describe alternatives you've considered
This could turn out to be a tremendous amount of work to transfer to a new data structure (I experienced this myself with scirpy). Therefore an easier solution could be to just support import and export from the new data structure.
Additional context
No response