yhoogstrate / fuma

:dash::leopard: FuMa: reporting overlap in RNA-seq detected fusion genes
GNU General Public License v3.0

Complex subset problem with n > 2 datasets #1

Closed yhoogstrate closed 9 years ago

yhoogstrate commented 9 years ago

Suppose we have three datasets with one fusion each. In all three fusions the left junction is identical and spans the same gene, but the right junction spans a different (sub)set of genes:

Genes dataset 1: [Left],[A,B]
Genes dataset 2: [Left],[A,B,C]
Genes dataset 3: [Left],[B,C]

Then the outcome is dependent on the order of comparison:

Order 1:
[A,B] + [A,B,C] → Overlap: [A,B,C]
[A,B,C] + [B,C] → Overlap: [A,B,C]

Order 2:
[A,B] + [B,C] → no overlap
[no overlap] + [A,B,C] → no overlap

We expect this bug to be rare, but it means the outcome can change solely because the order of the samples changes. Because of the object-oriented structure of the code (the concatenated datasets are used as new datasets), it is nearly impossible to solve this issue without losing a lot of (time) performance. There are no plans to fix this bug at the moment.
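
The order dependence can be reproduced with a small sketch. This is not FuMa's actual code: the names `merge_superset` and `compare_in_order` are hypothetical, and the matching rule (two right-junction gene sets match when one contains the other) is an assumption inferred from the example above.

```python
from functools import reduce

# Hypothetical model, not FuMa's actual code: each fusion is reduced to the
# set of genes spanned by its right junction (the left junction is identical).
D1 = frozenset("AB")
D2 = frozenset("ABC")
D3 = frozenset("BC")


def merge_superset(a, b):
    """Merge two gene sets, keeping the superset; None means 'no overlap'."""
    if a is None or b is None:
        return None  # an earlier comparison already failed
    if a <= b or b <= a:  # assumed rule: one set must contain the other
        return a | b  # keep the larger of the two sets
    return None


def compare_in_order(datasets):
    """Fold the datasets left to right, reusing each merged result."""
    return reduce(merge_superset, datasets)


order_1 = compare_in_order([D1, D2, D3])  # [A,B] + [A,B,C], then + [B,C]
order_2 = compare_in_order([D1, D3, D2])  # [A,B] + [B,C], then + [A,B,C]

print(sorted(order_1) if order_1 is not None else "no overlap")
print(sorted(order_2) if order_2 is not None else "no overlap")
```

Under these assumptions, the first order ends with the merged set [A,B,C] while the second ends with no overlap, matching the two orders listed above.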

yhoogstrate commented 9 years ago

This issue has been solved as of version 2.* by keeping subsets instead of supersets when merging:

Order 1:
[A,B] + [A,B,C] → Overlap: [A,B]* << that's the subset
[A,B]* + [B,C] → No Overlap

Order 2:
[A,B] + [B,C] → no overlap
[no overlap] + [A,B,C] → no overlap

Both orders will produce the same output.
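
A minimal sketch of the subset approach, checked against every ordering of the three example datasets. As before, this is not FuMa's actual code: `merge_subset` is a hypothetical name, and the containment-based matching rule is an assumption inferred from the example. The check covers only this example and is not a general proof of order independence.

```python
from functools import reduce
from itertools import permutations

# Right-junction gene sets of the three example datasets.
DATASETS = [frozenset("AB"), frozenset("ABC"), frozenset("BC")]


def merge_subset(a, b):
    """Merge two gene sets, keeping the subset; None means 'no overlap'."""
    if a is None or b is None:
        return None  # an earlier comparison already failed
    if a <= b or b <= a:  # assumed rule: one set must contain the other
        return a & b  # keep the smaller of the two sets
    return None


# Collect the final outcome for all six possible comparison orders.
results = {reduce(merge_subset, order) for order in permutations(DATASETS)}

if results == {None}:
    print("every order ends in: no overlap")
else:
    print("outcomes differ between orders")
```

Keeping the subset means a merged fusion never matches more than its most specific member did, which is why [A,B] can no longer act as a bridge to [B,C] via [A,B,C] in this example.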