Populating the bids information on the rules file from different sources

I have been thinking about how bidscoin and sovabids work. One idea they share is to collect information from the files (bidscoin does it from the DICOM header) and return it as "attributes". The bids information is then populated from those collected attributes. See the following example:


  subject: <<entities.subject>>
  session: <<entities.session>>

    attributes:
      sidecar:
      channels.name:
      channels.type:
      channels.units:
      entities.subject:
      entities.task:
      entities.session:
      entities.run:
      dataset_description:
    bids:
      task: <<entities.task>>
      acq: 
      run: <<entities.run>>       
      suffix: eeg

In our case I mainly extract information from the path, since usually (correct me if I'm wrong) the headers of EEG files do not contain such information. In any case, if we generalize the idea to extracting information from any given source, then we could implement a rules file (or template/configuration file) that has two main parts:

heuristics
bids

The heuristics part would set up how the information is collected from the sources. Each heuristic in the heuristics part corresponds to a function in heuristics.py. These functions should return dictionaries.

heuristics:
  h0: # arbitrary name to refer to a particular instance of a heuristic (needed to refer to its results later)
    heuristic: # name of the python function (inside heuristics.py) that executes the heuristic, it must return a dictionary
    args: # a dictionary with the arguments of  the function
      arg1:
      arg2:
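
To make the idea above concrete, here is a minimal sketch of what heuristics.py and its dispatcher could look like. Everything in it is an assumption for illustration: the HEURISTICS registry, the run_heuristic helper, the choice of passing the file path as an implicit first argument, and the example regex are not part of sovabids.

```python
import re


def from_regex_pattern(path, pattern, fields):
    """Hypothetical heuristic: match `pattern` on `path`, name the groups after `fields`."""
    match = re.search(pattern, path)
    if match is None:
        return {}  # nothing could be inferred from this path
    return dict(zip(fields, match.groups()))


# Assumed registry: maps the `heuristic:` name in the rules file to a function.
HEURISTICS = {"from_regex_pattern": from_regex_pattern}


def run_heuristic(spec, path):
    """Run one entry of the `heuristics:` section (already loaded as a dict)."""
    func = HEURISTICS[spec["heuristic"]]
    return func(path, **spec.get("args", {}))


# One heuristic entry, as it would look after loading the rules file.
spec = {
    "heuristic": "from_regex_pattern",
    "args": {
        "pattern": r"sub-(\w+?)_ses-(\w+?)_task-(\w+?)\.vhdr",
        "fields": ["subject", "session", "task"],
    },
}
result = run_heuristic(spec, "data/sub-01_ses-02_task-rest.vhdr")
print(result)
```

Keeping the registry explicit means the rules file can only call functions we chose to expose, which seems safer than looking names up dynamically.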

Then, whenever we need the result of a heuristic, we can refer to it with a placeholder such as <h0.subject>. For example:

bids:
  subject: <h0.subject>
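
A placeholder like the one above could be resolved with a small substitution step. This is only a sketch: the resolve function, the PLACEHOLDER regex and the shape of the results dictionary (heuristic name mapped to its returned dictionary) are assumptions, not existing sovabids code.

```python
import re

# Matches placeholders of the form <name.key>, e.g. <h0.subject>.
PLACEHOLDER = re.compile(r"<(\w+)\.(\w+)>")


def resolve(template, results):
    """Replace every <name.key> in `template` with results[name][key]."""

    def lookup(match):
        name, key = match.groups()
        return str(results[name][key])

    return PLACEHOLDER.sub(lookup, template)


# Assumed outputs of already-executed heuristics.
results = {"h0": {"subject": "01", "task": "rest"}}
resolved = resolve("<h0.subject>", results)
combined = resolve("sub-<h0.subject>_task-<h0.task>", results)
print(resolved, combined)
```

An unknown heuristic name or key raises a KeyError here, which is probably what we want so that typos in the rules file fail loudly.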

For tabular data the semantics could be:

bids:
   channels:
      0:
        name: <h0.someRow.someColumn>
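
For the tabular case, one possible reading is that the heuristic returns a dictionary of rows, each row being a dictionary of columns, so the dotted reference is just a nested lookup. The sketch below assumes rows are keyed by the value in the first column and reads from a string instead of a file to stay self-contained; from_tabular and lookup are hypothetical names.

```python
import csv
import io


def from_tabular(text, split=","):
    """Hypothetical heuristic: parse delimited text and index each row by its first column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=split)
    key_column = reader.fieldnames[0]
    return {row[key_column]: row for row in reader}


def lookup(results, dotted):
    """Walk a dotted reference such as 'h0.FCz.type' through nested dictionaries."""
    value = results
    for part in dotted.split("."):
        value = value[part]
    return value


# Stand-in for the content of a csv file describing the channels.
table = "name,type,units\nFCz,EEG,uV\nEOG1,EOG,uV\n"
results = {"h0": from_tabular(table)}
channel_type = lookup(results, "h0.FCz.type")
print(channel_type)
```

Note that splitting on dots breaks for row or column names that contain a dot themselves, so the real syntax would need an escaping rule or a different separator.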

Here is what a more complete example would look like, along with some ideas and thoughts regarding technical details of this approach:

heuristics :

  h0:
    heuristic: pattern_from_example  # example-based inference
    args:
      source: source example path
      target: target example path
  h1: # each heuristic has an arbitrary name given by the user (obviously it must be unique)
    heuristic : from_regex_pattern # name of the function in heuristics.py
    args  : 
       pattern: some regex pattern
       fields:  
          - task
          - subject
          - session
          - group
  h1b:  # heuristics could take as input the outputs of another heuristic
   # this feature would be related to the idea of example-based inference, after all those would be heuristics too
    heuristic : from_regex_pattern 
    args  : 
       pattern:  <h0.pattern>
       fields:  <h0.fields>    
  h2:
    heuristic: from_placeholder_pattern
    args : 
        pattern: some placeholder pattern
  h3:
      heuristic: from_tabular
      args:
        file: some csv file for example
        split: ,
  h4:
    heuristic: from_dictionary
    args:
      file: some json or yaml file #extracts as a dictionary
  h5:  
    heuristic: from_tabular
    args:
      file: some tsv file
      split: \t

bids:
  entities : # or path, what is better? After all the entities define the path. 
    session : <h1.session> 
    task : <h1.task>
    acquisition : 
    run : 
    subject: <h1.subject>

  # Two options to refer to electrodes/channels: either by the original name in the raw file, or by the index position in the raw file. Each has its pros and cons.

  electrodes : # configures electrodes.tsv
    - nameOfElectrodeInRawFile : # Referring to the electrode by the name
        # renaming can be implemented here, but it may add some technical complications if we use the name itself to identify
        # or index the electrodes. Should we drop support for that specific functionality?
        # A power user could just use the code_execution feature if they need to rename
         name : <h3.somerow.somecolumn>  # New name given (renaming)
         # should somerow and somecolumn be indexes or the name/key of the value (e.g. h3.FCz.XCoordinate)?
         x : <h5.somerow.somecolumn> + <h5.somerow.somecolumn> # should we include basic operations?

  channels:
    #maybe drop support for renaming since the name works as an index, and keep the possibility of retyping
    - $indexinRawFile :  # Referring to the electrode by its position in the raw file. Assumes channels are not reordered at any point
        name : <h3.somerow.somecolumn>

  sidecar:
    PowerLineFrequency : <h4.line_freq>

  # Setting up the participants tsv columns from another file
  participants:
    group: <h1.group> # or maybe <htable.<h1.subject>> (h1 heuristic output serves as an index for the table)

So I was wondering if this was something worth exploring for the community. @civier

Comments regarding this approach

Idea: Maybe develop an API standard for heuristics. The input args could vary but the output should be a dictionary-like return that allows retrieval of tabular data, dictionary data, and single pure data types.
Warning: If the info inferred by mne collides with what is obtained from the rules file (actually, "template" would be a better name now), then sovabids must have a way to know that it needs to change what mne-bids wrote. Potential problems may arise with info encoded in the EEG files written by mne-bids (specifically the information inside vhdr, vmrk, eeg, edf, bdf, set and fdt files).
Idea: Maybe it would be better to use $something$ as the enclosing delimiter instead of <>; it may be easier to parse.
Warning: The channel type counts (EEGChannelCount and similar) in the sidecar json should probably be removed or handled, since if the user retypes channels then the counts have to be updated.
A good thing about this design is that it is extensible.
An "execute heuristic" step would be needed whenever a placeholder is found. Now, whether the heuristic is executed every time it is called, or whether we keep an old result in memory, is an open question. One approach is to add a "return" key to the dictionary of heuristics in memory. If that key does not exist we run the heuristic; otherwise we just reuse the stored result.
To think about: This covers having information from a single file that applies to all subjects. What if each subject has its own metadata file? How could we implement such an idea?
One way to do the previous is to allow a heuristic's output to become another heuristic's input. Essentially, inferring the location of a metadata file describing a single EEG file would be a heuristic whose output goes into another heuristic that reads the metadata file. This may be too complicated for users though.
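
The caching and chaining ideas from the comments above can be sketched together: a heuristic runs only when its result is first requested, stores it under a "return" key, and may take another heuristic's output as an argument. All names here (make_runner, locate_metadata, read_metadata, FAKE_FILES) are hypothetical, and the in-memory FAKE_FILES dictionary stands in for real per-subject metadata files.

```python
import re

# An argument that is exactly one placeholder, e.g. "<h0.path>".
REF = re.compile(r"^<(\w+)\.(\w+)>$")


def make_runner(specs, functions):
    """Return a function that runs heuristics lazily and caches their output."""
    calls = []  # records execution order, only for illustration

    def run(name):
        spec = specs[name]
        if "return" not in spec:  # not executed yet
            args = {}
            for key, value in spec.get("args", {}).items():
                ref = REF.match(value) if isinstance(value, str) else None
                # An argument like <h0.path> triggers (or reuses) heuristic h0.
                args[key] = run(ref.group(1))[ref.group(2)] if ref else value
            calls.append(name)
            spec["return"] = functions[spec["heuristic"]](**args)
        return spec["return"]

    return run, calls


FAKE_FILES = {"sub-01.json": {"line_freq": 50}}  # stands in for files on disk


def locate_metadata(eeg_file):
    """Infer the metadata file that sits next to a given EEG file."""
    return {"path": eeg_file.rsplit(".", 1)[0] + ".json"}


def read_metadata(path):
    """Read the metadata file (here: look it up in FAKE_FILES)."""
    return FAKE_FILES[path]


specs = {
    "h0": {"heuristic": "locate_metadata", "args": {"eeg_file": "sub-01.vhdr"}},
    "h1": {"heuristic": "read_metadata", "args": {"path": "<h0.path>"}},
}
functions = {"locate_metadata": locate_metadata, "read_metadata": read_metadata}
run, calls = make_runner(specs, functions)
first = run("h1")
second = run("h1")  # served from the cached "return" key
print(first, calls)
```

This sketch has no cycle detection, so two heuristics that refer to each other would recurse forever; a real implementation would need to guard against that.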

MNE-BIDS and mappings

Should mne/mne-bids themselves be heuristics? If we concentrate on the metadata rather than the actual file, we could offer the special possibility of inferring channels.tsv and the sidecar json (and other files) from mne/mne-bids. Another way is to find everything mne-bids did (for example by writing to a temporary isolated directory and reading all the metadata files it wrote) and put that in the mappings, so the info is transparent there. It is ugly but easier to maintain than knowing all of the mne-bids logic.
In general the challenge is to identify everything mne-bids did in the output and find a way to encode that info in the mappings. The problem arises when both mne-bids and the user write to the same info.
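
The "temporary isolated directory" option could be sketched as follows. No mne-bids call is made here: the two files written in the example are hand-made stand-ins for what mne-bids would produce, and collect_metadata is a hypothetical helper that only uses the standard library.

```python
import csv
import json
import tempfile
from pathlib import Path


def collect_metadata(root):
    """Read every .json and .tsv file under `root` into one mapping keyed by relative path."""
    collected = {}
    for path in sorted(Path(root).rglob("*")):
        key = path.relative_to(root).as_posix()
        if path.suffix == ".json":
            collected[key] = json.loads(path.read_text(encoding="utf-8"))
        elif path.suffix == ".tsv":
            with path.open(newline="", encoding="utf-8") as handle:
                collected[key] = list(csv.DictReader(handle, delimiter="\t"))
    return collected


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the sidecar and channels files a conversion would write.
    eeg_dir = Path(tmp) / "sub-01" / "eeg"
    eeg_dir.mkdir(parents=True)
    (eeg_dir / "sub-01_task-rest_eeg.json").write_text(
        json.dumps({"PowerLineFrequency": 50}), encoding="utf-8"
    )
    (eeg_dir / "sub-01_task-rest_channels.tsv").write_text(
        "name\ttype\tunits\nFCz\tEEG\tuV\n", encoding="utf-8"
    )
    mapping = collect_metadata(tmp)

print(sorted(mapping))
```

The resulting mapping could then be merged with what the user asked for in the rules file, which is exactly where the collision problem mentioned above would have to be resolved.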

References to inspire us

Bidscoin's template file uses attributes inferred from headers as sources of bids information. It does not generalize, though, to more sources of information (miscellaneous metadata files). Sovabids currently works around this: it returns information inferred from mne as if it were attributes of a header.
The ARTEMIS extension could be implemented as a function in heuristics.py (see #12)
The idea of extracting info from arbitrary files can be seen in @civier's original configuration file proposal:

Oren's Proposal

General configuration file format

Example of one row in the table:

Origin location	Target location	Formula
subject1/eeg/session${A}/config.csv/1/8 subject1/eeg/session${A}/config.csv/1/9	eeg/eeg_rest_raw/sub-1/ses-rest-ses${A}/eeg/sub-1_ses-rest_coordsystem.json/“Coordinates”/“NAS”/2	${1} + ${2}

yjmantilla / sovabids