yjmantilla / sovabids

A python package for the automatic conversion of EEG datasets to the BIDS standard, with a focus on making the most out of metadata.
https://sovabids.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Populating the bids information on the rules file from different sources #24

Open yjmantilla opened 3 years ago

yjmantilla commented 3 years ago

I have been thinking about how bidscoin and sovabids work. One idea is to collect information from the files (bidscoin does it from the DICOM header) and return that information as "attributes". The bids information is then populated from those collected attributes. See the following example:


  subject: <<entities.subject>>
  session: <<entities.session>>

    attributes:
      sidecar:
      channels.name:
      channels.type:
      channels.units:
      entities.subject:
      entities.task:
      entities.session:
      entities.run:
      dataset_description:
    bids:
      task: <<entities.task>>
      acq: 
      run: <<entities.run>>       
      suffix: eeg

In our case I mainly extract information from the path, since (correct me if I'm wrong) the headers of EEG files usually do not contain such information. In any case, if we generalize the idea to extracting information from any given source, we could implement a rules file (or template/configuration file) that has two main parts: a heuristics part and a bids part.

The heuristics part would set up how the information is collected from the sources. A single heuristic in the heuristics part corresponds to a function in heuristics.py. These functions should return dictionaries.

heuristics:
  h0: # arbitrary name to refer to a particular instance of a heuristic (needed to refer to its results later)
    heuristic: # name of the python function (inside heuristics.py) that executes the heuristic, it must return a dictionary
    args: # a dictionary with the arguments of  the function
      arg1:
      arg2:
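
A minimal sketch of how such a heuristics section could be executed, assuming the rules file has already been loaded into a Python dictionary. The `HEURISTICS` registry, the `from_constant` toy heuristic and `run_heuristics` are hypothetical names used only for illustration; they are not part of sovabids.

```python
from typing import Any, Callable, Dict


def from_constant(**values: Any) -> Dict[str, Any]:
    """Toy (hypothetical) heuristic: returns its arguments as a dictionary."""
    return dict(values)


# Hypothetical registry standing in for the functions defined in heuristics.py.
HEURISTICS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "from_constant": from_constant,
}


def run_heuristics(section: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Run every named heuristic instance and store its result under that name."""
    results: Dict[str, Dict[str, Any]] = {}
    for name, spec in section.items():
        func = HEURISTICS[spec["heuristic"]]
        result = func(**(spec.get("args") or {}))
        if not isinstance(result, dict):
            raise TypeError(f"heuristic {name!r} must return a dictionary")
        results[name] = result
    return results


# Equivalent of the "heuristics:" part of the rules file, already parsed.
rules = {"h0": {"heuristic": "from_constant", "args": {"subject": "01", "task": "rest"}}}
results = run_heuristics(rules)
```

Keeping the results keyed by the instance name (`h0`) is what lets the bids part refer back to them later.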

Then, whenever we need the result of a heuristic, we can refer to it with an angle-bracket placeholder. For example:

bids:
  subject: <h0.subject>
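
One way these references could be resolved is sketched below, assuming the heuristic results are stored in a dictionary keyed by heuristic name. The helper names `lookup` and `resolve` are hypothetical, and the single-angle-bracket syntax is taken from the example above.

```python
import re
from typing import Any, Dict

# Matches references such as <h0.subject>.
PLACEHOLDER = re.compile(r"<([^<>]+)>")


def lookup(results: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted key such as 'h0.subject' through nested dictionaries."""
    value: Any = results
    for key in dotted.split("."):
        value = value[key]
    return value


def resolve(node: Any, results: Dict[str, Any]) -> Any:
    """Recursively replace every placeholder in a parsed rules structure."""
    if isinstance(node, dict):
        return {key: resolve(value, results) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, results) for item in node]
    if isinstance(node, str):
        return PLACEHOLDER.sub(lambda m: str(lookup(results, m.group(1))), node)
    return node  # numbers, None, etc. are left untouched


results = {"h0": {"subject": "01"}}
bids = {"subject": "<h0.subject>", "suffix": "eeg", "run": None}
resolved = resolve(bids, results)
```

A reference to a key that a heuristic did not return raises `KeyError` here; whether that should be an error or an empty value is a design decision left open.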

For tabular data the semantics could be:

bids:
   channels:
      0:
        name: <h0.someRow.someColumn>
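
As a rough sketch of the tabular case, a heuristic could return a dictionary of rows, each row being a dictionary of columns, so that `<h0.someRow.someColumn>` becomes two nested lookups. The version below is an assumption for illustration: it reads from a string instead of a file to stay self-contained, and the `index_column` parameter and the sample table are made up.

```python
import csv
import io
from typing import Dict


def from_tabular(text: str, split: str = ",", index_column: str = "name") -> Dict[str, Dict[str, str]]:
    """Parse delimited text into {row_key: {column: value}}.

    Rows are keyed by the value found in `index_column` (hypothetical choice).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=split)
    return {row[index_column]: dict(row) for row in reader}


# Made-up table contents, only for the example.
table_text = "name,type,x\nFCz,EEG,0.0\nCz,EEG,1.5\n"

h0 = from_tabular(table_text)

# <h0.FCz.x> would then translate to:
value = h0["FCz"]["x"]
```

Note that `csv` returns every value as a string, so numeric columns would need an explicit conversion before any arithmetic.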

Here is what a more complete example would look like, along with some ideas and thoughts regarding technical details of this approach:

heuristics :

  h0:
    heuristic: pattern_from_example  # example-based inference
    args:
      source: source example path
      target: target example path
  h1: # each heuristic has an arbitrary name chosen by the user (obviously it must be unique)
    heuristic : from_regex_pattern #name of the functions in heuristics.py
    args  : 
       pattern: some regex pattern
       fields:  
          - task
          - subject
          - session
          - group
  h1b:  # heuristics could take as input the outputs of another heuristic
   # this feature would be related to the idea of example-based inference, after all those would be heuristics too
    heuristic : from_regex_pattern 
    args  : 
       pattern:  <h0.pattern>
       fields:  <h0.fields>    
  h2:
    heuristic: from_placeholder_pattern
    args : 
        pattern: some placeholder pattern
  h3:
      heuristic: from_tabular
      args:
        file: some csv file for example
        split: ,
  h4:
    heuristic: from_dictionary
    args:
      file: some json or yaml file #extracts as a dictionary
  h5:  
    heuristic: from_tabular
    args:
      file: some tsv file
      split: \t

bids:
  entities : # or path, what is better? After all the entities define the path. 
    session : <h1.session> 
    task : <h1.task>
    acquisition : 
    run : 
    subject: <h1.subject>

  # Two options to refer to electrodes/channels: either by the original name in the raw file, or by the index position in the raw file. Each has its pros and cons.

  electrodes : # configures electrodes.tsv
    - nameOfElectrodeInRawFile : # Referring to the electrode by the name
        # renaming can be implemented here, but it may add technical complications if we use the name itself to identify
        # or index the electrodes. Should we drop support for that specific functionality?
        # A power user could just use the code_execution feature if they need to rename
         name : <h3.somerow.somecolumn>  # New name given (renaming)
         # should somerow and somecolumn be the indexes or the name/key of the value (i.e. h3.FCz.XCoordinate)?
         x : <h5.somerow.somecolumn> + <h5.somerow.somecolumn> #should we include basic operations?

  channels:
    #maybe drop support for renaming since the name works as an index, and keep the possibility of retyping
    - $indexinRawFile :  # Referring to the electrode by its position in the raw file. Assumes channels are not reordered at any point
        name : <h3.somerow.somecolumn>

  sidecar:
    PowerLineFrequency : <h4.line_freq>

  # Setting up the participants tsv columns from another file
  participants:
    group: <h1.group> # or maybe <htable.<h1.subject>> (h1 heuristic output serves as an index for the table)
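
The chaining idea in `h1b` (one heuristic consuming the output of another) could work roughly as sketched below. Everything here is an assumption for illustration: `fixed_pattern` is only a stand-in for `pattern_from_example` (whose inference is not implemented), the extra `path` argument represents the file being converted, and the example path and regex are made up.

```python
import re
from typing import Any, Callable, Dict, List

# A whole-value reference such as "<h0.fields>" (keeps lists as lists).
WHOLE_REFERENCE = re.compile(r"^<(\w+)\.(\w+)>$")


def fixed_pattern(path: str, pattern: str, fields: List[str]) -> Dict[str, Any]:
    """Stand-in for pattern_from_example: just hands back a pattern and fields."""
    return {"pattern": pattern, "fields": fields}


def from_regex_pattern(path: str, pattern: str, fields: List[str]) -> Dict[str, str]:
    """Pair each regex capture group found in the path with a field name."""
    match = re.search(pattern, path)
    if match is None:
        return {}
    return dict(zip(fields, match.groups()))


REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "fixed_pattern": fixed_pattern,
    "from_regex_pattern": from_regex_pattern,
}


def resolve_arg(value: Any, results: Dict[str, Dict[str, Any]]) -> Any:
    """Replace "<name.key>" by the raw value an earlier heuristic produced."""
    if isinstance(value, str):
        match = WHOLE_REFERENCE.match(value.strip())
        if match:
            return results[match.group(1)][match.group(2)]
    return value


def run_in_order(heuristics: Dict[str, Dict[str, Any]], path: str) -> Dict[str, Dict[str, Any]]:
    """Run heuristics in file order so later ones can use earlier results."""
    results: Dict[str, Dict[str, Any]] = {}
    for name, spec in heuristics.items():
        args = {key: resolve_arg(val, results) for key, val in spec["args"].items()}
        results[name] = REGISTRY[spec["heuristic"]](path=path, **args)
    return results


heuristics = {
    "h0": {
        "heuristic": "fixed_pattern",
        "args": {
            "pattern": r"session(\d+)/s(\d+)_([a-z]+)_([a-z]+)\.cnt",
            "fields": ["session", "subject", "task", "group"],
        },
    },
    "h1b": {
        "heuristic": "from_regex_pattern",
        "args": {"pattern": "<h0.pattern>", "fields": "<h0.fields>"},
    },
}

results = run_in_order(heuristics, path="lemon/session2/s07_resting_control.cnt")
```

This sketch relies on heuristics being listed before the ones that depend on them; a real implementation might instead need to sort them by dependency.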

So I was wondering if this was something worth exploring for the community. @civier

Comments regarding this approach

MNE-BIDS and mappings

References to inspire us

Oren's Proposal

General configuration file format

Example of one row in the table:

Origin location:
  subject1/eeg/session${A}/config.csv/1/8
  subject1/eeg/session${A}/config.csv/1/9
Target location:
  eeg/eeg_rest_raw/sub-1/ses-rest-ses${A}/eeg/sub-1_ses-rest_coordsystem.json/“Coordinates”/“NAS”/2
Formula:
  ${1} + ${2}

yjmantilla commented 1 year ago

I partially introduced a way to operate on different fields from the path analysis, in a feature called "operation". It is described in the Rules File Schema.