zktuong / dandelion

dandelion - A single cell BCR/TCR V(D)J-seq analysis package for 10X Chromium 5' data
https://sc-dandelion.readthedocs.io/
GNU Affero General Public License v3.0

I am confused, and need your help #162

Closed saramoein372 closed 2 years ago

saramoein372 commented 2 years ago

Description of the question

Hi Kelvin,

I have some basic questions about how dandelion works, and I am trying to understand the biological meaning of each step. I am asking these questions to complete the puzzle.

Would you please help me answer these questions:

1- Is each node in the dandelion network a clone?

2- How is the clone network generated? I have already read all the tutorials and papers, but I think there are some inconsistencies between the paper and the tutorial. It would be great if you could briefly list the steps. I am very confused.

3- Why, after generating the .tsv file, do some of the cells have a different cluster_id?

4- We expected to see the same germline in all the cells in the network, but the germlines of cells in the network are different. Why?

Thank you, Sara

Minimal example

NA

Any error message produced by the code above

NA

OS information

NA

Version information

NA

Additional context

No response

zktuong commented 2 years ago

Hi Sara,

1- is each node in the dandelion network a clone?

Each node is a single cell ([image] https://user-images.githubusercontent.com/26215587/178317378-fe93bfca-b00d-44f4-8981-8d3c93ceb32c.png), and each connected component (network) would most often be 1 clone. There are situations where a network can comprise multiple clones, because some cells have multiple BCRs/TCRs and dandelion merges them into a single network just for the visualisation.

2- how the clone network is generated?

In a simple example, all cells that were assigned a clone id of 1_1_1_1, including cells that have clone ids like 1_2_3_4|1_1_1_1 (an example of a single cell expressing two pairs of BCRs), are selected, and pairwise Levenshtein distances are calculated for every pair of cells within this subset. The calculation is performed on each IGH/IGK/IGL layer separately. The layers are then just summed (simple matrix addition), forming a distance matrix like this: [image] https://user-images.githubusercontent.com/26215587/178321696-49800642-edd5-43f5-a88a-640752814772.png

I've coloured the upper triangle grey because it just mirrors the lower triangle.

A minimum spanning tree is then calculated, which will form something like this: [image] https://user-images.githubusercontent.com/26215587/178322113-2623d4e2-651a-4489-85f2-b806ec3fdd64.png

I've coloured the edge weights (Levenshtein distances) blue.

In the constructed minimum spanning tree, a special circumstance here is that Cell 1 being connected to Cell 4 is totally random: Cell 3 and Cell 2 have equal chances of being selected for Cell 1's position because they are the same distance apart. So I added a step to 'rescue' those connections/edges, making it look like this: [image] https://user-images.githubusercontent.com/26215587/178322811-bb4aec4f-1299-4543-921e-94e074f0d797.png

I've coloured the rescued edges orange.

That's it.
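
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not dandelion's actual code: the toy sequences, the function names, and in particular the tie rule used for the 'rescue' step (add back any edge whose weight ties the nearest-neighbour distance of one of its endpoints) are hypothetical choices made for this example.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def summed_distances(cells, layers):
    """Compute one distance per layer for each pair of cells, then sum the layers."""
    return {
        (u, v): sum(levenshtein(cells[u][layer], cells[v][layer]) for layer in layers)
        for u, v in combinations(sorted(cells), 2)
    }


def minimum_spanning_tree(nodes, dist):
    """Kruskal's algorithm with a small union-find; ties are broken by cell name."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = set()
    for (u, v), _ in sorted(dist.items(), key=lambda kv: (kv[1], kv[0])):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.add((u, v))
    return tree


def rescue_tied_edges(nodes, dist, tree):
    """ASSUMED tie rule: re-add edges that tie an endpoint's smallest distance."""
    nearest = {n: min(w for edge, w in dist.items() if n in edge) for n in nodes}
    return {
        edge for edge, w in dist.items()
        if edge not in tree and (w == nearest[edge[0]] or w == nearest[edge[1]])
    }


# Hypothetical toy clone: cell1 is equally far from cell2, cell3 and cell4.
cells = {
    "cell1": {"IGH": "CARGGW", "IGL": "CQQSYA"},
    "cell2": {"IGH": "CARDYW", "IGL": "CQQSYT"},
    "cell3": {"IGH": "CARDYW", "IGL": "CQQSYT"},
    "cell4": {"IGH": "CARDYW", "IGL": "CQQSYS"},
}
dist = summed_distances(cells, layers=("IGH", "IGL"))
tree = minimum_spanning_tree(cells, dist)
rescued = rescue_tied_edges(cells, dist, tree)
print(sorted(tree))
print(sorted(rescued))
```

In this toy clone the spanning tree has to pick one of three equally good partners for cell1, and the rescue step restores the two that were not picked, so the choice no longer depends on tie-breaking order.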

3- why after generating the .tsv file, some of the cells have different cluster_id?

I'm not sure what you mean by this, unless you are asking why the numbers change each time you run it. That has to do with a random argsort whenever lists of V/J and lengths are sorted. The numbers don't have any particular meaning other than to say whether or not two different clones share similar criteria, so I've never enforced that the numbers stay identical all the time.

4- We expected to see the same germline in all the cells in the network. But the germlines of cells in the network are different. Why?

I'm unsure how this can happen, other than the possibility I described above where a cell can have multiple BCRs, and also when cells have multiple light chains. Are you sure that the different germlines you are seeing are not just because of IGH/IGK/IGL? Otherwise, I'll need an example where you've observed this.

saramoein372 commented 2 years ago

Thank you Kelvin!

One thing I still cannot find a justification for is the network generated from dandelion['edges']: all the clusters are connected and intra-connections are generated. That means multiple clones are connected in dandelion.

1- My confusion is how we can biologically justify the intra-connections of clones. Do you have any comments on the justification of intra-cluster edges from a biological perspective?

2- Most of the clones in my dandelion file have an unassigned clone-id. Why can this happen?

Thank you, Sara

zktuong commented 2 years ago

Hi Sara,

1- My confusion is how we can biologically justify the intra-connections of clones. Do you have any comments on the justification of intra-cluster edges from a biological perspective?

The network structure should look like this:

[image] https://user-images.githubusercontent.com/26215587/178472236-e01d6ce4-430c-4cdc-988c-50353c8303ff.png

Just a side note: in the latest update (v0.2.4), .edges has been removed because its behaviour was a bit random in terms of which nodes were selected as source/target, which can make the edge table unstable, and it is used at an intermediate step. In the latest version, the networkx graph holds the final edge list, which should hopefully be more consistent.

2- Most of the clones in my dandelion file have unassigned clone-id. Why can this happen?

Can you try updating your dandelion version and see if this persists?

saramoein372 commented 2 years ago

Thank you Kelvin!

Two questions I have:

1- From the dandelion network, how can I extract the single-cell IDs in the biggest clone?

2- Are you saying there is no biological justification for using the "edges" that were in the previous version?

Thanks, Sara

saramoein372 commented 2 years ago

Hi Kelvin again:

My questions are:

1- From the dandelion network, how can I extract the single-cell IDs in the biggest clone?

2- Are you saying there is no biological justification for using the "edges" that were in the previous version?

3- Why do I see different VDJs in the same clone?

Thanks, Sara

saramoein372 commented 2 years ago

Sorry Kelvin,

Some other questions are added here:

1- From the dandelion network, how can I extract the single-cell IDs in the biggest clone?

2- How can I get the size of clones?

2- Are you saying there is no biological justification for using the "edges" that were in the previous version?

3- Why do I see different VDJs in the same clone?

Thanks, Sara

zktuong commented 2 years ago

1- From the dandelion network, how can I extract the single cell ID's in the biggest clone?

The biggest clone should have a clone_id_by_size of 1, so you can just use the cell ids from the metadata that match that.

2- How can get the size of clones?

run ddl.tl.clone_size
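
To make the relationship between clone sizes and clone_id_by_size concrete, here is a minimal plain-Python sketch. It does not call dandelion; it only mimics the idea on a hypothetical barcode-to-clone_id mapping. It assumes that a cell with several clone ids separated by "|" counts towards each of them, and that rank 1 means the largest clone, as the reply above implies. The helper names are made up for the example.

```python
from collections import Counter

# Hypothetical metadata: cell barcode -> clone_id ("" means unassigned).
clone_ids = {
    "cellA": "1_1_1_1",
    "cellB": "1_1_1_1",
    "cellC": "1_2_3_4|1_1_1_1",  # one cell carrying two clone ids
    "cellD": "2_1_1_1",
    "cellE": "",
}


def clone_sizes(clone_ids):
    """Count cells per clone; multi-clone cells contribute to every listed clone."""
    counts = Counter()
    for clone_id in clone_ids.values():
        for part in filter(None, clone_id.split("|")):
            counts[part] += 1
    return counts


def rank_by_size(counts):
    """Rank clones from largest (1) to smallest; ties are broken by clone id."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {clone_id: rank for rank, (clone_id, _) in enumerate(ordered, start=1)}


def cells_in_clone(clone_ids, target):
    """Barcodes of all cells whose clone_id includes the target clone."""
    return sorted(bc for bc, cid in clone_ids.items() if target in cid.split("|"))


sizes = clone_sizes(clone_ids)
ranks = rank_by_size(sizes)
largest = min(ranks, key=ranks.get)  # the clone with rank 1
print(largest, sizes[largest], cells_in_clone(clone_ids, largest))
```

With a real Dandelion object the same lookup would be a filter on the metadata table rather than on a dictionary.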

2- Are you saying there is no biological justification to use the "edges" that was in the previous version?

that's correct. no justification

3- why in the same clone, I see different VDJs?

I'm not sure how this can happen. can you show me an example?

saramoein372 commented 2 years ago

Thanks Kelvin for all your answers. I really appreciate your time! I could see that after running vdj, adata2 = ddl.pp.filter_contigs(new_vdj, adata, filter_rna = True), I got a clone_id for all cells.

My question is: how can I extract all the cells in the BCR network (the visualized network)? I want to extract the clone_id and cell_barcode from the visualized BCR network.

Thank you again!

Sara

zktuong commented 2 years ago

i see.

for that you need to extract from the graph itself: vdj.graph[0] or vdj.graph[1] - either will work.

you would want to follow the instructions here: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.connected_components.html

which basically should look like:

import networkx as nx

G = vdj.graph[1]
# find the largest connected network (a set of node names)
largest_cc = max(nx.connected_components(G), key=len)
# subset the graph to the largest connected component
S = G.subgraph(largest_cc).copy()

# this should give you the list of nodes that are in this network
list(S.nodes)

Then you should be able to just match them against the metadata:

newvdj = vdj[vdj.metadata_names.isin(list(S.nodes))].copy()
newvdj.metadata
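
For readers who want to see what the connected-component step does, or who want every network rather than only the largest, the following dependency-free sketch may help. It is an illustration only: the node names, edge list and clone_id mapping are hypothetical, and the breadth-first search stands in for networkx's nx.connected_components.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the connected components as sets of nodes, largest first."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:  # breadth-first walk over everything reachable from start
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical network: barcodes as nodes, plus a barcode -> clone_id lookup.
nodes = ["AAAC-1", "AAAG-1", "AATT-1", "CCGA-1", "GGTA-1", "TTCA-1"]
edges = [("AAAC-1", "AAAG-1"), ("AAAG-1", "AATT-1"), ("CCGA-1", "GGTA-1")]
clone_id = {"AAAC-1": "1_1_1_1", "AAAG-1": "1_1_1_1", "AATT-1": "1_1_1_1",
            "CCGA-1": "2_1_1_1", "GGTA-1": "2_1_1_1", "TTCA-1": "3_1_1_1"}

components = connected_components(nodes, edges)
largest_cc = components[0]

# Barcode and clone_id for every cell, grouped by the network it belongs to.
for index, component in enumerate(components, start=1):
    for barcode in sorted(component):
        print(index, barcode, clone_id[barcode])
```

The set in largest_cc plays the same role as the node list taken from the networkx subgraph, so it can be passed to the same kind of metadata filter.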

saramoein372 commented 2 years ago

Thank you Kelvin. I could see you made a lot of updates on your tutorial. That helped me to find the problem.

Best, Sara

saramoein372 commented 2 years ago

So Kelvin I have another question:

I got a much better network before filtering the contigs, and after filtering, my network shrank significantly. Is there any justification for using the network without contig filtering (in which most of the clones have no ids)?

zktuong commented 2 years ago

Hmm, I think the data may be artificial, in that those cells are connected because they do not have a good set of BCRs. So my current judgement is no: it's not recommended to use a network that is formed by unassigned ids.

However, you could disregard dandelion's mode of assigning clone ids and just replace them with your own clone ids, set by your preferred criterion (or don't specify any and leave all clone ids blank, in which case you will end up with a fully connected network, and you can use other methods to break this up, e.g. Louvain clustering as implemented in various graphing tools). I guess you have to ask yourself what the purpose of this approach would be, to validate why you chose that route.

Kelvin

saramoein372 commented 2 years ago

So Kelvin, is this filter_contigs command necessary?

zktuong commented 2 years ago

Yes, because the whole point is to remove all ambiguous BCR chains.

You can also use scirpy's method to define clones and see if that makes a difference

saramoein372 commented 2 years ago

Thanks Kelvin.

One more question: how can it happen that my Cell Ranger results have the v-call and j-call information for each cell, but dandelion has left the V and J genotype columns empty, along with the clone-id column? As a result I have unassigned clones, and my BCR network shows all these cells in one big clone.

How is this possible?

saramoein372 commented 2 years ago

And one more question: can I ask for the correct singularity command for the preprocessing step, one that has all the necessary parameters for correct filtering, including contig filtering and everything?

zktuong commented 2 years ago

One more question: how this can happen that my cell ranger results has the v-call and j-call information for each cell. But dandelion has put empty for v and j genotypes columns, and also empty column for clone-id? Then I have unassigned clone and my bcr network is showing all these cells in a big clone.

The pre-processing will reannotate the V and J calls using igblastn and blastn. Where a call is deemed too low confidence, dandelion will remove the V/J call annotation, but this should largely be consistent with how igblastn is performed (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692102/).

During post-processing, i.e. filter_contigs or check_contigs, a contig-level QC assessment is performed where I ask whether the assignments make sense:

e.g. an IGHV must pair with an IGHJ in the same contig; if either is missing, then it's not a good productive contig. There are several other logical checks like that along the way, to ensure that what we end up with are good sets of contigs.

Where filter_contigs and check_contigs differ is that filter_contigs is stricter and also performs a hard cell-level QC, where it checks whether a cell has 1 or many sets of heavy+light chains. If many, filter_contigs will remove them. For check_contigs, the cell-level QC is a soft check that just populates the chain_status column in .metadata to indicate whether particular cells display ambiguous contigs.
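
The difference between a hard and a soft cell-level check can be pictured with a small sketch. It is a deliberate simplification and not dandelion's logic: here "more than one heavy-chain contig" stands in for "many sets of heavy+light chains", the hard filter is assumed to drop the whole cell, and the status labels "ok" and "ambiguous" are hypothetical rather than the real chain_status values.

```python
def is_ambiguous(contigs):
    """ASSUMED proxy for 'many sets': the cell carries more than one IGH contig."""
    return sum(1 for contig in contigs if contig["locus"] == "IGH") > 1


def hard_filter(cells):
    """Hard check: ambiguous cells are removed from the result."""
    return {bc: contigs for bc, contigs in cells.items() if not is_ambiguous(contigs)}


def soft_check(cells):
    """Soft check: every cell is kept and only annotated with a status."""
    return {bc: "ambiguous" if is_ambiguous(contigs) else "ok"
            for bc, contigs in cells.items()}


# Hypothetical cells, each a list of contigs described only by their locus.
cells = {
    "cellA": [{"locus": "IGH"}, {"locus": "IGK"}],
    "cellB": [{"locus": "IGH"}, {"locus": "IGH"}, {"locus": "IGL"}],
}
kept = hard_filter(cells)
status = soft_check(cells)
print(sorted(kept), status)
```

The soft variant keeps the decision with the user, who can still subset on the status column afterwards.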

clone_id thus relies on all these checks succeeding:

  1. It MUST have a V gene, a J gene and a CDR3 sequence.
  2. It MUST have at least 1 heavy chain. If a cell only has light chains, then the clone id will not be defined. The rationale is that biologically, IGH rearrangement occurs prior to IGK/IGL rearrangement, i.e. you must have a productive heavy chain before the light chain will be rearranged.
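
The two requirements can be written down as a small predicate. This is a sketch of the stated rules only, not the package's implementation; the field names v_call, j_call, junction and locus are assumed to follow AIRR-style column naming, and the function names are hypothetical.

```python
REQUIRED_FIELDS = ("v_call", "j_call", "junction")  # V gene, J gene, CDR3/junction


def is_complete(contig):
    """Rule 1: the contig has a non-empty V call, J call and CDR3 sequence."""
    return all(contig.get(field) for field in REQUIRED_FIELDS)


def can_assign_clone_id(contigs):
    """Rule 2: at least one complete contig in the cell must be a heavy chain."""
    return any(is_complete(c) and c.get("locus") == "IGH" for c in contigs)


# Hypothetical cells: one paired, one with light chain only, one missing its V call.
paired = [
    {"locus": "IGH", "v_call": "IGHV3-23", "j_call": "IGHJ4", "junction": "TGTGCGAGA"},
    {"locus": "IGK", "v_call": "IGKV1-39", "j_call": "IGKJ1", "junction": "TGTCAACAG"},
]
light_only = [
    {"locus": "IGK", "v_call": "IGKV1-39", "j_call": "IGKJ1", "junction": "TGTCAACAG"},
]
no_v_call = [
    {"locus": "IGH", "v_call": "", "j_call": "IGHJ4", "junction": "TGTGCGAGA"},
]

for name, contigs in [("paired", paired), ("light_only", light_only), ("no_v_call", no_v_call)]:
    print(name, can_assign_clone_id(contigs))
```

A cell whose V or J call was removed during reannotation fails the first rule, which is one way it can end up without a clone id even though Cell Ranger reported calls for it.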

So unless you are still using an older version of dandelion, I'm not sure how it's possible to form a network of unassigned clones. Regardless, this is still a bug, and such a network should be removed/ignored. I'll need a more concrete example to be able to diagnose this bug.

And one more question is: can I ask the correct singularity command in preprocessing step, that has all the necessary parameters for correct filtering, including contig filtering and everything?

The current singularity script just does the pre-processing. All the filtering steps are considered post-processing, and you'll have to follow the tutorial.

saramoein372 commented 2 years ago

Thanks Kelvin. During filter_contigs, when I run "vdj, adata2 = ddl.pp.filter_contigs(new_vdj, adata, library_type ='tr-ab', filter_rna = True)"

I get this error:

TypeError: update_metadata() got an unexpected keyword argument 'library_type'

How can I get rid of this error? I ask because it is recommended to define the type of library.

saramoein372 commented 2 years ago

And sorry Kelvin,

I am going to generate a network of all the edges with the nx package (like the graph that you sent me a few days ago), and you mentioned that the 'edges' from the dandelion package are not reliable. I need a way to get the edges, but it is not clear to me how to do that.

Any comments?

saramoein372 commented 2 years ago

And Kelvin,

Would you please provide a short explanation about graph[0] and graph[1]?

After plotting, it looks like all clones are connected together. I am confused about how they are connected.

Thanks, Sara

zktuong commented 2 years ago

During filter_contig "vdj, adata2 = ddl.pp.filter_contigs(new_vdj, adata, library_type ='tr-ab', filter_rna = True)" I get this error: TypeError: update_metadata() got an unexpected keyword argument 'library_type'

You are not using the correct version of dandelion. Please uninstall and reinstall it. dandelion.__version__ has to be 0.2.4.
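
One way to confirm which version is actually installed in the active environment is sketched below. It relies only on the standard library; the helper name installed_version is hypothetical, and sc-dandelion is assumed to be the distribution name because that is the name used on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


REQUIRED = "0.2.4"  # version named in the reply above
found = installed_version("sc-dandelion")  # assumed PyPI distribution name

if found is None:
    print("sc-dandelion is not installed in this environment")
elif found != REQUIRED:
    print(f"found sc-dandelion {found}, expected {REQUIRED}")
else:
    print(f"sc-dandelion {found} is installed")
```

Running this in the same environment as the analysis notebook helps rule out the case where the package was reinstalled into a different environment.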

I am going to generate a network of all the edges from nx package (like the graph that you sent me a few days ago) and you mentioned that the 'edges' from the dandelion package is not reliable. I need a way that gives me edges. But it is not clear for me how to do that.

I would suggest that you learn how to use the networkx package, because this isn't the place to ask questions related to it. https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.to_pandas_edgelist.html

Would you please provide a short explanation about graph[0] and graph[1]? It looks after plotting all clones are connected together. I am confused about how they are connected?

graph[0] contains all nodes (including singletons) and graph[1] contains only connected nodes.

Sorry, the code I used above is wrong. It should be:

S = G.subgraph(largest_cc)
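
For context, a minimal networkx sketch of the ideas in this reply follows. The toy graph G is an assumption standing in for the real dandelion graph, and largest_cc is computed with the usual networkx idiom, which may differ from the earlier snippet being corrected here.

```python
import networkx as nx

# Toy graph: nodes are cells, edges join cells that belong to the same clone.
G = nx.Graph()
G.add_nodes_from(["cell_A", "cell_B", "cell_C", "cell_D", "cell_E", "cell_F"])
G.add_edges_from([("cell_A", "cell_B"), ("cell_B", "cell_C"), ("cell_D", "cell_E")])

# Analogous to graph[0]: every node, singletons included.
all_nodes = set(G.nodes)

# Analogous to graph[1]: only nodes with at least one edge.
connected_nodes = {node for node, degree in G.degree() if degree > 0}

# Largest connected component, followed by the corrected subgraph call.
largest_cc = max(nx.connected_components(G), key=len)
S = G.subgraph(largest_cc)

# Plain edge list; nx.to_pandas_edgelist(G) returns the edges as a DataFrame.
edges = sorted(tuple(sorted(edge)) for edge in G.edges())
print(edges)
print(sorted(S.nodes))
```

Here cell_F is a singleton, so it appears in all_nodes but not in connected_nodes, and S holds the three cells of the largest component.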

saramoein372 commented 2 years ago

Hi Kelvin,

One question I have: how does the filter_contigs function work? Does dandelion remove the light chain?

We want to see which criteria filter_contigs uses to filter contigs.

Because we see that many of our cells are excluded in the filtering step, which is strange.

Thanks, Sara

On Tue, Jul 19, 2022 at 11:14 PM Sara Moien @.***> wrote:

Thanks Kelvin,

I made a mess of updating my dandelion. I uninstalled it, and to reinstall it I am using the instructions from your tutorial: https://sc-dandelion.readthedocs.io/en/latest/README.html#installation . When I run the installation command conda install -c conda-forge python-igraph leidenalg, I get an error:

ERROR: Failed building wheel for leidenalg

Do you have any idea how I can install leidenalg? Or is there any other way to install dandelion?

Thanks,

Sara

saramoein372 commented 2 years ago

Hi Kelvin,

How can we tell dandelion to consider both heavy and light chains? Currently, it is only generating clone_id based on the heavy chain, but we need to look at both chains.

Thanks, Sara

saramoein372 commented 2 years ago

Sorry Kelvin,

How can I have the original version of dandelion?

zktuong commented 2 years ago

How can we tell dandelion to consider both heavy and light chains? Currently, it is only generating clone_id based on the heavy chain, but we need to look at both chains.

This assertion is not true. Dandelion will consider both heavy and light chains if they are there. Thus, what you describe is only possible if your light chain rows are not there or are not formed properly; in both cases they would have been filtered away because of quality issues.

One question I have: how does the filter_contigs function work? Does dandelion remove the light chain?

It does not normally remove them.

Do you see a lot of situations where a single cell barcode has more than two contigs assigned to it?

If so, then your original data needs to be assessed to check that it's correct and of high quality.
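
A quick way to answer that question is to count contigs per barcode. The sketch below is a plain-Python illustration under assumptions: the contig records are toy data, cell_id is assumed to be the barcode column (as in AIRR-formatted tables), and the helper names are hypothetical.

```python
from collections import Counter


def contigs_per_barcode(contigs):
    """Count how many contigs were assigned to each cell barcode."""
    return Counter(contig["cell_id"] for contig in contigs)


def barcodes_with_extra_contigs(contigs, max_contigs=2):
    """Return the barcodes carrying more than max_contigs contigs."""
    counts = contigs_per_barcode(contigs)
    return sorted(bc for bc, n in counts.items() if n > max_contigs)


contigs = [
    {"cell_id": "AAAC-1", "locus": "IGH"},
    {"cell_id": "AAAC-1", "locus": "IGK"},
    {"cell_id": "AAAG-1", "locus": "IGH"},
    {"cell_id": "AAAG-1", "locus": "IGK"},
    {"cell_id": "AAAG-1", "locus": "IGL"},
]

flagged = barcodes_with_extra_contigs(contigs)
fraction = len(flagged) / len(contigs_per_barcode(contigs))
print(flagged, fraction)
```

If a large fraction of barcodes is flagged this way, that would point to the input data rather than to the filtering step.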

We want to see which criteria filter_contigs uses to filter contigs.

This is in the documentation (https://sc-dandelion.readthedocs.io/en/latest/modules/dandelion.preprocessing.filter_contigs.html). Please read it.

How can I have the original version of dandelion?

You can pip install an earlier version, as they are all on PyPI (https://pypi.org/project/sc-dandelion/).

However, earlier versions should not change this behavior of missing clone_ids, as I highly suspect that your issue is with your data rather than the tool itself.

Please provide a screenshot of your data/error, or send the data to my email so I can diagnose whether it's a genuine problem. I wouldn't need the full data; just a couple of the rows with which you are experiencing issues will suffice. If that is not possible, then I suggest that you start from the original cellranger outputs and just read them in with ddl.read_10x_vdj (https://sc-dandelion.readthedocs.io/en/latest/modules/dandelion.utilities.read_10x_vdj.html) or ddl.read_10x_airr (https://sc-dandelion.readthedocs.io/en/latest/modules/dandelion.utilities.read_10x_airr.html).

saramoein372 commented 2 years ago

Thanks Kelvin. I think for now, we are trying to make sure we are doing the correct steps. I have some other questions:

1- What is the "criteria of connecting" one cluster to another cluster? Do clusters connect to each other through cells that "have one base nucleotide" difference?

2- I ran one dataset with dandelion 1.12 and the output vdj had around 6000 rows (in vdj.metadata). But with dandelion 2.4, running on the same data generates a vdj.metadata with around 300 rows. How are these two versions of dandelion different?

Thanks, Sara

zktuong commented 2 years ago

1- What is the "criteria of connecting" one cluster to another cluster? Do clusters connect to each other through cells that "have one base nucleotide" difference?

As I've explained above, cells/clusters are connected if they are found to share a clone_id entry.
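
The sharing rule can be illustrated with a small sketch. It is not dandelion's code: it only assumes that a cell's clone_id entry may hold several ids joined by "|" (as in the 1_2_3_4|1_1_1_1 example earlier in this thread), and the helper names are hypothetical.

```python
from itertools import combinations


def clone_ids_of(entry):
    """Split a clone_id entry such as '1_2_3_4|1_1_1_1' into its parts."""
    return set(entry.split("|")) if entry else set()


def edges_from_shared_clone_ids(cell_to_clone_id):
    """Connect every pair of cells whose clone_id entries share a clone id."""
    edges = []
    cells = sorted(cell_to_clone_id.items())
    for (cell_a, id_a), (cell_b, id_b) in combinations(cells, 2):
        if clone_ids_of(id_a) & clone_ids_of(id_b):
            edges.append((cell_a, cell_b))
    return edges


cell_to_clone_id = {
    "cell_A": "1_1_1_1",
    "cell_B": "1_2_3_4|1_1_1_1",  # one cell expressing two BCR pairs
    "cell_C": "1_2_3_4",
    "cell_D": "5_1_1_1",
}

print(edges_from_shared_clone_ids(cell_to_clone_id))
```

In this toy example cell_B bridges two clone ids, which is how cells from different clones can end up in one connected network; cell_D shares nothing and stays a singleton.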

2- I ran one dataset with dandelion 1.12 and the output vdj had around 6000 rows (in vdj.metadata). But with dandelion 2.4, running on the same data generates a vdj.metadata with around 300 rows. How are these two versions of dandelion different?

You can see the various code changes here: https://github.com/zktuong/dandelion/releases

The largest difference between v0.1.12 and v0.2.x is that the preprocessing step has a 'strict' mode by default, which could be why your dataset is now reduced. The rest of the changes are to do with speed upgrades.

So, instead of using filter_contigs, can you use check_contigs and report back if you still only see ~300 rows?