For a large number of queries, the speed of label4MRI seems to be a little bit slow, I'm very curious about how to optimize this lib for a better speed and try two ways, to use matrix calculation and cache the result with hdfs5. Just got some very interesting result.

This is the calculating speed with original mni_to_region_index:

> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.69    0.19    1.89 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.51    0.22    1.73 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.51    0.34    1.86 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.59    0.17    1.81 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.50    0.26    1.76

Optimize with matrix

Here's the code:

mni2aal_matrix <- t(
  matrix(
    c(mni2aal[, 1], mni2aal[, 2], mni2aal[, 3]),
    ncol = 3
  )
)

mni_to_region_index_new<-function(x, y, z, distance=T)
{

  index=as.numeric(
    subset(
      mni2aal,
      x_mni==round(x) & y_mni==round(y) & z_mni==round(z),
      select=Region_index
      )
    )

  if(distance==T)
  {
    if(is.na(index)==F)
    {
      return(list(index,distance=0))
    }
    else
    {
#######################################
    d2 = colSums((mni2aal_matrix - c(x, y, z))^2)
    index=mni2aal$Region_index[which.min(d2)]
    distance=sqrt(min(d2))
#######################################
    return(list(index,distance))
    }
  }
  else if (distance==F)
  {
    if(is.na(index)==F)
    {return(list(index))}
    else{(return(paste0("Not exactly correspond to aal-labeled brain region. Please set distance=T if youn want the nearest aal-labeled region name.")))}
  }
}

The query speed has improved a little:

> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.39    0.17    1.61 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.21    0.22    1.42 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.19    0.27    1.46 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.31    0.20    1.53 
> system.time(mapply(mni_to_region_index_new, -10:10, -10:10, -10:10))
   user  system elapsed 
   1.32    0.20    1.51

Optimize with cached result

I tried batch query all the result of x ∈ [-100, 100], y ∈ [-150, 150], z ∈ [-100, 100] in a Python implementation, and cache all result to a hdfs5 database.

The DB size is not very big (231 MB), and memory usage is quite small (1248 bytes)

here're the code and result:

Code:

library(h5)

a <- h5file('PATH TO H5 DB', mode = 'r')

query_h5 <- function(x, y, z) { return (list(
    as.numeric(a['aal.region'][x+100, y+150, z+100]),
    as.numeric(a['aal.distance'][x+100, y+150, z+100])
))}

> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.08    0.02    0.11 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.06    0.00    0.07 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.08    0.00    0.08 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.08    0.00    0.08 
> system.time(mapply(query_h5, -10:10, -10:10, -10:10))
   user  system elapsed 
   0.07    0.00    0.08

The third solution is quite useful for a massive query. maybe we could have a discussion about this.

Py implication: https://github.com/ezPsycho/brainSpy.py

Database: https://github.com/ezPsycho/brainSpy-data (MNI2BA included)

CLI tool: https://github.com/ezPsycho/brainSpy-cli

Finally, we generated a MNI2BA list which is similar to mni2aal, but to apply it to this project, some architure adjustment may need to be applied, if you are interested in the list, I can send a PR to the repo.

I appreciate your kind effort and insights. I am actually currently working on the matrix implementation of the toolbox. I would love to see your implementation with cached result. Please send me the PR :)

Here are the finally optimization result:

> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   0.97    0.37    1.35 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   0.98    0.32    1.31 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   1.01    0.39    1.41 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   1.08    0.29    1.38 
> system.time(mapply(mni_to_region_index, -10:10, -10:10, -10:10, template = 'aal'))
   user  system elapsed 
   1.09    0.22    1.32

But some breaking change happened, and maybe we need some discussion about these changes:

Code Format

The code was reformated with tidyR, and adjusted to Google's coding style to make sure it could be accepted by CRAN, sending a PR with such a modification seems rude, but for the readability of the code, these change is strongly suggested.

Repo Cleanup

Useless files like .Rproj.user, .Rhistory, .RData, .Ruserdata are not necessary and may have the risk of exposing developer privacy information. It's strongly recommended to remove these file, not only for the concern of disk space.

Dataset

The dataset was moved to a single list called label4mri_metadata, organized by atlas, for the ease of data query.

API Breaking Change

`region_index_to_name`

region_index_to_name was removed to reduce the complexity of the code under the new architecture.

`mni_to_region_name`

The API of mni_to_region_name was changed, added a parameter template for users to choose they want to query region from which template (AAL, BA provided now, marsAtlas and MNI template under consideration on my own implementation).

The format of the returned result was also changed, we may need to discuss the format of it seriously, it's unavoidable. current returned result format is:

> mni_to_region_name(0, 0, 0)
$`aal.index`
[1] 77

$aal.distance
[1] 7.81025

$aal.label
[1] "Thalamus_L"

$ba.index
[1] 40

$ba.distance
[1] 4.242641

$ba.label
[1] "Left-Thalamus (50)"

let me know if we need to adjust the format.

`mni_to_region_index`

This function was refactored for the robustness concern:

If nothing queried out with distance = F, the function will return $distance = NULL but not a string with a warning message. The warning message is not necessary if the developers defined the behavior of the function clearly. If anyone thinks a warning is necessary, we could provide it with the warning method, but in most use cases, it's annoying.
The returned result is provided in a consistent format. Returning results in different formats in different situations can cause a lot of trouble for the user's debugging process, and returning consistent results can help improve the robustness of the library, we can have a discussion about this.

Suggestion about README

Rewrite the install section with a copy-and-paste-friendly format may make the install process not so painful :P

You can find my code in https://github.com/Losses/label4MRI/tree/master/label4MRI . It these changes are acceptable, I'll send a PR, if anything need to be discussed, plz reply :)

I earnestly appreciate your advise and effort. In general, I am with your suggestions. I am going to address the issues you raised and would like to know your opinions.

1. Code Format:

If "tidyR" refers to tidyverse packages (e.g. dplyr, tidyr), I am not a big fan of using tidyR in package development. Although I rely heavily on them for interaction coding, the fact that they are subjected to package updates make it less stable, compared to using the functions embedded in base R.

However, if you are referring to the tidyverse style guide (https://style.tidyverse.org/), I agree with you. Let's adjust to code format to align with it.

2.Repo Cleanup:

Thanks for reminding me of this ;)

3.Dataset:

Agreed. This makes the data set tidier and allow new template to be added easily.

4. API Breaking Change:

4-1. region_index_to_name:

Encapsulating the projection from index to name by a function makes the code more readable. But I am okay to remove it. It does not make a big difference though.

4-2. mni_to_region_name:

4-2-1. The format of the returned result:

I believe that returning "aal.index"/"ba.index" is redundant because few people really know/care about the numeric index. It is the brain region name that is of the main interest. I suggest that the returning values to include only aal/ba.distance and aal/ba.label. See the chunk at the button for my preferred output format.

4-2-2. A bug in mni_to_region_name() & mni_toregion_index():

There is a bug in your version when there is a exact match but with distance = F. For example, mni_to_region_name(x = 10,y = 10,z = 10,distance = F,template = "aal") throws an error:Error in ans[!test & ok] <- rep(no, length.out = length(ans))[!test & : replacement has length zero.

I figured out that the error is resulted from the line 8 of mni_toregion_index(): r_distance <- ifelse(distance, 0, NULL). The function ifelse() somehow does not allow NULL to be the returned value. My workaround is to have it return the character "NULL". So the codes become r_distance <- ifelse(distance, 0, "NULL"), as well as for the line 16 r_distance <- "NULL". My preferred codes are shown as below.

if (length(matrix_index) != 0) {
r_index <- label4mri_metadata[[template]]$coordinate_label[matrix_index]
r_distance <- ifelse(distance, 0, "NULL")
} else {
if (distance) {
  d2 <- colSums((label4mri_metadata[[template]]$coordinate_list - c(x, y, z))^2)
  r_index <- label4mri_metadata[[template]]$coordinate_label[which.min(d2)]
  r_distance <- sqrt(min(d2))
} else {
  r_index <- "NULL"
  r_distance <- "NULL"
}
}

Following the format above, I would also like to change the returned value of result$label from "character(0)" to "NULL" when there is no exact match with distance = F. E.g., mni_to_region_name(x = 0,y = 0,z = 0,distance = F) returned "character(0)" in your version. Therefore, I preferred to convert it to NULL by inserting the ifelse() line.
```
result$label <- as.character(
    label4mri_metadata[[.template]]$label[label4mri_metadata[[.template]]$label$Region_index == result$index, "Region_name"]
  )
  result$label <- ifelse(length(result$label)==0,"NULL",result$label)
  result
```

To summarize the formatting rules:


# (1) When distance = T and with exact match: 
> mni_to_region_name(10, 10, 10)
$aal.distance
[1] 0

$aal.label [1] "Caudate_R"

$ba.distance [1] 0

$ba.label [1] "Right-Caudate (48)"

(2) When distance = T and without exact match:

mni_to_region_name(0, 0, 0) $aal.distance [1] 7.81025

$aal.label [1] "Thalamus_L"

$ba.distance [1] 4.242641

$ba.label [1] "Left-Thalamus (50)"

(3) When distance = F and with exact match:

mni_to_region_name(10, 10, 10, distance = F) $aal.distance [1] "NULL"

$aal.label [1] "Caudate_R"

$ba.distance [1] "NULL"

$ba.label [1] "Right-Caudate (48)"

(4) When distance = F and without exact match:

mni_to_region_name(0, 0, 0, distance = F) $aal.distance [1] "NULL"

$aal.label [1] "NULL"

$ba.distance [1] "NULL"

$ba.label [1] "NULL"


I believe this makes the format of returned values more consistent and readable. Please let me know if you would prefer the other ways.

### 5. Other refinements that I have been working on
（This part is out of the scope of this issue and is only for your interest)
#### 5-1. Improve efficiency by changing the data structure.
The current implementation of both of our versions stores the mapping data (i.e., brain coordinates - brain label ) in a list. This requires the program to search through the whole list just to find a coordinate of interest. Storing the mapping data in a 3d matrix would speed up the searching process by utilizing. matrix indexing. This is what I have been working on.

#### 5-2. Improve efficiency by using a better searching algorithm.
The other step that takes a huge amount of time is finding the nearest brain region when there is no exact match, e.g., mni_to_region_name(0, 0, 0). The current implementation computes the distances of EVERY voxel to the interested coordinate and then output the voxel with minimum distance. This could be speeded up if we choose to use a "spiral-out" approach, i.e., incrementally expanding the sphere of the given coordinate until the sphere reaches the first closest brain region.

My R version for testing the code is 3.4.2

Thank you once again for your endeavor. Please let me know if my replies make sense. I am happy to have your PR once we are on the same page.

Sorry, really busy recent days, until next weekend,

some idea about 5-2, we can get max distance value from brainSpy's distance db, we only need to search within a box with maxmium distance result as border width, it will be much faster.

It would be great if you could send the PR soon. I'd like to make my progress based on your revision.

:joy: not in the town recently, and no computer in my hand

I have pulled your fork and merge it into mine and make progress on it. I will keep this issue open to allow and encourage further discussion.

yunshiuan / label4MRI

About speed optimization #2

Optimize with matrix

Optimize with cached result

Code Format

Repo Cleanup

Dataset

API Breaking Change

`region_index_to_name`

`mni_to_region_name`

`mni_to_region_index`

Suggestion about README

1. Code Format:

2.Repo Cleanup:

3.Dataset:

4. API Breaking Change:

4-1. region_index_to_name:

4-2. mni_to_region_name:

4-2-1. The format of the returned result:

4-2-2. A bug in mni_to_region_name() & mni_toregion_index():

(2) When distance = T and without exact match:

(3) When distance = F and with exact match:

(4) When distance = F and without exact match: