yunshiuan / label4MRI

Label the brain MNI coordinate by AAL/BA system
GNU General Public License v3.0
55 stars 19 forks source link

About speed optimization #2

Closed Losses closed 3 years ago

Losses commented 5 years ago

For a large number of queries, the speed of label4MRI seems to be a little bit slow, I'm very curious about how to optimize this lib for a better speed and try two ways, to use matrix calculation and cache the result with hdfs5. Just got some very interesting result.

This is the calculating speed with original mni_to_region_index:

> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.69    0.19    1.89 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.51    0.22    1.73 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.51    0.34    1.86 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.59    0.17    1.81 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.50    0.26    1.76 

Optimize with matrix

Here's the code:

mni2aal_matrix <- t(
  matrix(
    c(mni2aal[, 1], mni2aal[, 2], mni2aal[, 3]),
    ncol = 3
  )
)

mni_to_region_index_new<-function(x, y, z, distance=T)
{

  index=as.numeric(
    subset(
      mni2aal,
      x_mni==round(x) & y_mni==round(y) & z_mni==round(z),
      select=Region_index
      )
    )

  if(distance==T)
  {
    if(is.na(index)==F)
    {
      return(list(index,distance=0))
    }
    else
    {
#######################################
    d2 = colSums((mni2aal_matrix - c(x, y, z))^2)
    index=mni2aal$Region_index[which.min(d2)]
    distance=sqrt(min(d2))
#######################################
    return(list(index,distance))
    }
  }
  else if (distance==F)
  {
    if(is.na(index)==F)
    {return(list(index))}
    else{(return(paste0("Not exactly correspond to aal-labeled brain region. Please set distance=T if youn want the nearest aal-labeled region name.")))}
  }
}

The query speed has improved a little:

> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.39    0.17    1.61 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.21    0.22    1.42 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.19    0.27    1.46 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.31    0.20    1.53 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.32    0.20    1.51 

Optimize with cached result

I tried batch query all the result of x ∈ [-100, 100], y ∈ [-150, 150], z ∈ [-100, 100] in a Python implementation, and cache all result to a hdfs5 database.

The DB size is not very big (231 MB), and memory usage is quite small (1248 bytes)

here're the code and result:

Code:

library(h5)

a <- h5file('PATH TO H5 DB', mode = 'r')

query_h5 <- function(x, y, z) { return (list(
    as.numeric(a['aal.region'][x+100, y+150, z+100]),
    as.numeric(a['aal.distance'][x+100, y+150, z+100])
))}
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.08    0.02    0.11 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.06    0.00    0.07 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.08    0.00    0.08 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.08    0.00    0.08 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.07    0.00    0.08 

The third solution is quite useful for a massive query. maybe we could have a discussion about this.

Py implication: https://github.com/ezPsycho/brainSpy.py

Database: https://github.com/ezPsycho/brainSpy-data (MNI2BA included)

CLI tool: https://github.com/ezPsycho/brainSpy-cli

Finally, we generated a MNI2BA list which is similar to mni2aal, but to apply it to this project, some architure adjustment may need to be applied, if you are interested in the list, I can send a PR to the repo.

yunshiuan commented 5 years ago

I appreciate your kind effort and insights. I am actually currently working on the matrix implementation of the toolbox. I would love to see your implementation with cached result. Please send me the PR :)

Losses commented 5 years ago

Here are the finally optimization result:

> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   0.97    0.37    1.35 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   0.98    0.32    1.31 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   1.01    0.39    1.41 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   1.08    0.29    1.38 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   1.09    0.22    1.32 

But some breaking change happened, and maybe we need some discussion about these changes:

Code Format

The code was reformated with tidyR, and adjusted to Google's coding style to make sure it could be accepted by CRAN, sending a PR with such a modification seems rude, but for the readability of the code, these change is strongly suggested.

Repo Cleanup

Useless files like .Rproj.user, .Rhistory, .RData, .Ruserdata are not necessary and may have the risk of exposing developer privacy information. It's strongly recommended to remove these file, not only for the concern of disk space.

Dataset

The dataset was moved to a single list called label4mri_metadata, organized by atlas, for the ease of data query.

API Breaking Change

region_index_to_name

region_index_to_name was removed to reduce the complexity of the code under the new architecture.

mni_to_region_name

The API of mni_to_region_name was changed, added a parameter template for users to choose they want to query region from which template (AAL, BA provided now, marsAtlas and MNI template under consideration on my own implementation).

The format of the returned result was also changed, we may need to discuss the format of it seriously, it's unavoidable. current returned result format is:

> mni_to_region_name(0, 0, 0)
$`aal.index`
[1] 77

$aal.distance
[1] 7.81025

$aal.label
[1] "Thalamus_L"

$ba.index
[1] 40

$ba.distance
[1] 4.242641

$ba.label
[1] "Left-Thalamus (50)"

let me know if we need to adjust the format.

mni_to_region_index

This function was refactored for the robustness concern:

Suggestion about README

Rewrite the install section with a copy-and-paste-friendly format may make the install process not so painful :P

You can find my code in https://github.com/Losses/label4MRI/tree/master/label4MRI . It these changes are acceptable, I'll send a PR, if anything need to be discussed, plz reply :)

yunshiuan commented 5 years ago

I earnestly appreciate your advise and effort. In general, I am with your suggestions. I am going to address the issues you raised and would like to know your opinions.

1. Code Format:

If "tidyR" refers to tidyverse packages (e.g. dplyr, tidyr), I am not a big fan of using tidyR in package development. Although I rely heavily on them for interaction coding, the fact that they are subjected to package updates make it less stable, compared to using the functions embedded in base R.

However, if you are referring to the tidyverse style guide (https://style.tidyverse.org/), I agree with you. Let's adjust to code format to align with it.

2.Repo Cleanup:

Thanks for reminding me of this ;)

3.Dataset:

Agreed. This makes the data set tidier and allow new template to be added easily.

4. API Breaking Change:

4-1. region_index_to_name:

Encapsulating the projection from index to name by a function makes the code more readable. But I am okay to remove it. It does not make a big difference though.

4-2. mni_to_region_name:

4-2-1. The format of the returned result:

4-2-2. A bug in mni_to_region_name() & mni_toregion_index():

$aal.label [1] "Caudate_R"

$ba.distance [1] 0

$ba.label [1] "Right-Caudate (48)"

(2) When distance = T and without exact match:

mni_to_region_name(0, 0, 0) $aal.distance [1] 7.81025

$aal.label [1] "Thalamus_L"

$ba.distance [1] 4.242641

$ba.label [1] "Left-Thalamus (50)"

(3) When distance = F and with exact match:

mni_to_region_name(10, 10, 10, distance = F) $aal.distance [1] "NULL"

$aal.label [1] "Caudate_R"

$ba.distance [1] "NULL"

$ba.label [1] "Right-Caudate (48)"

(4) When distance = F and without exact match:

mni_to_region_name(0, 0, 0, distance = F) $aal.distance [1] "NULL"

$aal.label [1] "NULL"

$ba.distance [1] "NULL"

$ba.label [1] "NULL"


I believe this makes the format of returned values more consistent and readable. Please let me know if you would prefer the other ways.

### 5. Other refinements that I have been working on
(This part is out of the scope of this issue and is only for your interest)
#### 5-1. Improve efficiency by changing the data structure.
The current implementation of both of our versions stores the mapping data (i.e., brain coordinates - brain label ) in a list. This requires the program to search through the whole list just to find a coordinate of interest. Storing the mapping data in a 3d matrix would speed up the searching process by utilizing. matrix indexing. This is what I have been working on.

#### 5-2. Improve efficiency by using a better searching algorithm.
The other step that takes a huge amount of time is finding the nearest brain region when there is no exact match, e.g., mni_to_region_name(0, 0, 0). The current implementation computes the distances of EVERY voxel to the interested coordinate and then output the voxel with minimum distance. This could be speeded up if we choose to use a "spiral-out" approach, i.e., incrementally expanding the sphere of the given coordinate until the sphere reaches the first closest brain region.

My R version for testing the code is 3.4.2

Thank you once again for your endeavor. Please let me know if my replies make sense. I am happy to have your PR once we are on the same page.
Losses commented 5 years ago

Sorry, really busy recent days, until next weekend,

some idea about 5-2, we can get max distance value from brainSpy's distance db, we only need to search within a box with maxmium distance result as border width, it will be much faster.

yunshiuan commented 5 years ago

It would be great if you could send the PR soon. I'd like to make my progress based on your revision.

Losses commented 5 years ago

:joy: not in the town recently, and no computer in my hand

yunshiuan commented 5 years ago

I have pulled your fork and merge it into mine and make progress on it. I will keep this issue open to allow and encourage further discussion.