zktuong / dandelion

dandelion - A single cell BCR/TCR V(D)J-seq analysis package for 10X Chromium 5' data
https://sc-dandelion.readthedocs.io/
GNU Affero General Public License v3.0
103 stars 25 forks source link

ddl.to_scirpy throws an error: "Value passed for key 'airr' is of incorrect shape" #398

Closed Nithishwer closed 2 months ago

Nithishwer commented 3 months ago

I encountered an issue while converting TCR data from a Dandelion object to a Scirpy object for further analysis. Despite successfully performing all analyses with Dandelion, I received an error about object shape mismatches during the conversion.

Steps to Reproduce

  1. Performed analyses on TCR data using Dandelion, including filtering for contigs, preprocessing for variable genes, PCA, clone finding, generating network, and transferring vdj data to the AnnData object.
  2. Attempted to convert the resulting data into a Scirpy object using the conversion function.
  3. Encountered an error when setting mudata=False and supplying gex to the conversion function.

Behavior

Received an error indicating that some objects are not in the correct shape. The issue does not occur when setting mudata=True or omitting gex in the conversion function.

Minimal example

import dandelion as ddl
adata = sc.read_h5ad("adata_filtered_trab.h5ad")
vdj =  ddl.read_h5ddl("vdj_network_trab.h5ddl")
mudata = ddl.to_scirpy(vdj, to_mudata=False, gex_adata=adata)

Any error message produced by the code above

[/users/immune-rep/buo283/.local/lib/python3.11/site-packages/anndata/utils.py:334](http://localhost:10000/users/immune-rep/buo283/.local/lib/python3.11/site-packages/anndata/utils.py#line=333): ExperimentalFeatureWarning: Support for Awkward Arrays is currently experimental. Behavior may change in the future. Please report any issues you may encounter!
[/users/immune-rep/buo283/.local/lib/python3.11/site-packages/anndata/_core/anndata.py:1818](http://localhost:10000/users/immune-rep/buo283/.local/lib/python3.11/site-packages/anndata/_core/anndata.py#line=1817): UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 mudata = ddl.to_scirpy(vdj, to_mudata=False, gex_adata=adata)

File [~/.local/lib/python3.11/site-packages/dandelion/utilities/_io.py:1174](http://localhost:10000/lab/workspaces/auto-r/tree/dandelion/analysis_v1_blood_biopsy_CD4_CD8/~/.local/lib/python3.11/site-packages/dandelion/utilities/_io.py#line=1173), in to_scirpy(data, transfer, to_mudata, gex_adata, key, **kwargs)
   1172     return _create_mudata(gex_adata, adata, key)
   1173 else:
-> 1174     adata = _create_anndata(airr, obs, gex_adata)
   1176     if transfer:
   1177         tf(adata, data)

File [~/.local/lib/python3.11/site-packages/dandelion/utilities/_io.py:1074](http://localhost:10000/lab/workspaces/auto-r/tree/dandelion/analysis_v1_blood_biopsy_CD4_CD8/~/.local/lib/python3.11/site-packages/dandelion/utilities/_io.py#line=1073), in _create_anndata(airr, obs, adata)
   1072     temp = temp[temp.obs_names.isin(cell_names)].copy()
   1073     adata.obsm = dict() if adata.obsm is None else adata.obsm
-> 1074     adata.obsm.update(temp.obsm)
   1076 return adata

File <frozen _collections_abc>:949, in update(self, other, **kwds)

File [~/.local/lib/python3.11/site-packages/anndata/_core/aligned_mapping.py:199](http://localhost:10000/lab/workspaces/auto-r/tree/dandelion/analysis_v1_blood_biopsy_CD4_CD8/~/.local/lib/python3.11/site-packages/anndata/_core/aligned_mapping.py#line=198), in AlignedActualMixin.__setitem__(self, key, value)
    198 def __setitem__(self, key: str, value: V):
--> 199     value = self._validate_value(value, key)
    200     self._data[key] = value

File [~/.local/lib/python3.11/site-packages/anndata/_core/aligned_mapping.py:268](http://localhost:10000/lab/workspaces/auto-r/tree/dandelion/analysis_v1_blood_biopsy_CD4_CD8/~/.local/lib/python3.11/site-packages/anndata/_core/aligned_mapping.py#line=267), in AxisArraysBase._validate_value(self, val, key)
    266         msg = "Index.equals and pd.testing.assert_index_equal disagree"
    267         raise AssertionError(msg)
--> 268 return super()._validate_value(val, key)

File [~/.local/lib/python3.11/site-packages/anndata/_core/aligned_mapping.py:89](http://localhost:10000/lab/workspaces/auto-r/tree/dandelion/analysis_v1_blood_biopsy_CD4_CD8/~/.local/lib/python3.11/site-packages/anndata/_core/aligned_mapping.py#line=88), in AlignedMapping._validate_value(self, val, key)
     83         dims = tuple(("obs", "var")[ax] for ax in self.axes)
     84         msg = (
     85             f"Value passed for key {key!r} is of incorrect shape. "
     86             f"Values of {self.attrname} must match dimensions {dims} of parent. "
     87             f"Value had shape {actual_shape} while it should have had {right_shape}."
     88         )
---> 89     raise ValueError(msg)
     91 if not self._allow_df and isinstance(val, pd.DataFrame):
     92     name = self.attrname.title().rstrip("s")

ValueError: Value passed for key 'airr' is of incorrect shape. Values of obsm must match dimensions ('obs',) of parent. Value had shape (11259,) while it should have had (11346,).

OS information

Linux

Version information

dandelion==0.3.5 pandas==2.2.0 numpy==1.26.4 matplotlib==3.8.4 networkx==3.3 scipy==1.11.4 scanpy==1.10.0 anndata==0.10.6 umap==0.5.5 numpy==1.26.4 scipy==1.11.4 pandas==2.2.0 scikit-learn==1.4.1.post1 statsmodels==0.14.1 igraph==0.11.4 pynndescent==0.5.11

Additional context

No response

zktuong commented 3 months ago

hi @Nithishwer,

is it because you are trying to use mu.pl.embedding instead of sc.pl.embedding ?

Nithishwer commented 3 months ago

Hi @zktuong,

I'm sorry, I think I uploaded the wrong output. Unrelatedly, mu.pl.embedding and sc.pl.embedding seem to run into this error.

But my issue is converting the ddl object into an scirpy object with gex. The ddl.to_scirpy call itself runs into an error. the plotting errors are just downstream errors due to a faulty object when I pass to mudata=True.

Let me know what you think!

zktuong commented 2 months ago

Ok so your error log is saying that the adata that you are trying to pass doesn't have as many cells as the number of cells in your vdj object.

so perhaps a quick solution could be to subset your vdj object to contain the same cells as in your adata?

Nithishwer commented 2 months ago

Hi Kelvin,

I looked into the size of the adata and vdj objects and this is how it seems:

VDJ object:

Dandelion class object with n_obs = 11259 and n_contigs = 20627
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'rearrangement_status', 'j_source', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'umi_count_abT_VDJ', 'umi_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 7702 vertices, layout for 7702 vertices
    graph: networkx graph of 7702 vertices, networkx graph of 7702 vertices 

adata object:

AnnData object with n_obs × n_vars = 16405 × 15981
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'Sample.Type', 'cell_type', 'source', 'RNA_snn_res.1', 'seurat_clusters', 'cell_type_overall', 'refined_annotations', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'umi_count_abT_VDJ', 'umi_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    uns: 'neighbors', 'refined_annotations_colors'
    obsm: 'X_harmony', 'X_pca', 'X_umap'
    varm: 'HARMONY', 'PCs'
    obsp: 'connectivities', 'distances'

It seems like vdj object has lesser cells than the adata object. Also, all the cell barcodes present in the vdj object are present in the data object.

Can you elaborate what the numbers 11259 and 11346 indicate in the error message?

zktuong commented 2 months ago

hmm that's weird. are there any duplicated cell barcodes by any chance?

Nithishwer commented 2 months ago

There seems to be some duplicate barcodes in the data object

print(len(adata.obs.index),len(np.unique(adata.obs.index)))

returns:

16405 16302

I removed all the duplicates with:

adata = adata[~adata.obs.index.duplicated(keep='first')] and it seems to work now.

zktuong commented 2 months ago

ok. well, instead of removing them, i would try and make them unique in the future (e.g. https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.obs_names_make_unique.html). anyway. closing this for now.