The Discover Rules Module [ is it really necessary?]

yjmantilla commented 3 years ago

So in the proposed design there was the "Discover Rules Module" whose purpose was to "obtain the Rules File" by applying different heuristics to the source-files available.

Main Point

The heuristics are actually the rules themselves, there isn't an easy automated way to obtain them.

Do we really need a "Discover Rules" module? In essence, this module applies heuristics to infer heuristics, which seems redundant. If anything, we would have different "Rules File" templates that suit the needs of different labs/studies, with a human selecting which rules template to apply (that is, the human selects which set of heuristics is applied).

My proposal is to get rid of this module.

Advantages:

- simplifies the software architecture
- easier for the user to understand
- easier to maintain in the long run

Explanation

The idea to have a "heuristics" module came from a shallow review of both bidscoin and heudiconv.

Now that I have gotten more into the project itself I'm noting the following:

- Bidscoin heuristics are the template and study bidsmaps, which are equivalent to our rules file.
- Heudiconv heuristics are Python scripts, which are not so user-friendly.

Here is a comparison of how the heuristics work in these three converters:

| Functionality | SOVABIDS | BIDSCOIN | HEUDICONV |
| --- | --- | --- | --- |
| Matching the correct files for the heuristic | The "non-bids.extension" key of the rules | "provenance" in the bidsmap and the is_sourcefile function of a plugin | filter_files function |
| Extracting information from the files matched | The "non-bids.path_analysis" key in the rules & the mne-powered file reading | "attributes" in the bidsmap and the get_attribute function of a plugin | infotodict & infotoids functions |
| Assigning the information extracted to bids properties | The "non-bids.path_analysis" key in the rules & hard-coded logic | "bids" in the bidsmap and (I think) the bidsmapper_plugin function of a plugin | infotodict & infotoids functions |

So what I initially called "rules" is what those tools call "heuristics".

I would appreciate hearing what you guys think of this @stebo85 @aswinnarayanan @civier @DavidjWhite33 @TomEmotion

PS: Both bidscoin's and heudiconv's heuristics work from the point of view of mapping input files to output files, i.e. "this goes here, and that goes over there". For example, I have not found anything similar to the idea of "correcting bad channel types".

stebo85 commented 3 years ago

Yes, that makes sense. I agree that this module is actually the user :p

aswinnarayanan commented 3 years ago

Agreed.

Just an idea for an enhancement: Would it be useful for a "Discover module" to apply a bunch of available bids rulesets to a dataset, and provide a result table recommending the one with the closest match as the starting template?

yjmantilla commented 3 years ago

Would it be useful for a "Discover module" to apply a bunch of available bids rulesets to a dataset, and provide a result table recommending the one with the closest match as the starting template?

@aswinnarayanan That is actually what I had in mind, just that I wasn't sure how to actually develop such a thing.
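One possible way to approach the "closest match" table, as a rough sketch only: score each candidate ruleset by the fraction of source files it can parse. Here every ruleset is reduced to a single path regex; the names `rank_templates`, `TEMPLATES`, `lab_a` and `lab_b` are hypothetical and do not come from sovabids or bidscoin.

```python
import re


def rank_templates(paths, templates):
    """Return (template_name, fraction_of_paths_matched), best match first."""
    scores = []
    for name, pattern in templates.items():
        regex = re.compile(pattern)
        matched = sum(1 for path in paths if regex.fullmatch(path))
        fraction = matched / len(paths) if paths else 0.0
        scores.append((name, fraction))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical templates: each ruleset is represented only by its path regex.
TEMPLATES = {
    "lab_a": r"sub(?P<subject>\d+)/ses(?P<session>\d+)/.+\.vhdr",
    "lab_b": r"(?P<subject>[A-Za-z]+\d+)_(?P<task>[a-z]+)\.edf",
}

PATHS = ["sub01/ses1/rest.vhdr", "sub02/ses1/rest.vhdr", "notes.txt"]

ranking = rank_templates(PATHS, TEMPLATES)
for name, fraction in ranking:
    print(f"{name}: {fraction:.0%} of files matched")
```

A real score would probably also need to check that the extracted metadata is plausible, not only that the path matches.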

Path Heuristic

The most difficult and important thing one has to infer is the input structure. In bidscoin it is fixed to a set of acceptable structures, whereas we leave that to the user so as to be more flexible. In this case one would need to make a set of structure guesses, the problem being that the path structure has too many possibilities. Anyhow, one could enumerate them in a combinatorial way with itertools or something similar.
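To make the combinatorial idea concrete, here is a minimal sketch that enumerates every way of assigning a few entities to the parts of one sample path. The `%entity%` placeholder syntax and the function name `guess_patterns` are assumptions made for illustration; they are not necessarily what the rules file uses.

```python
from itertools import permutations
from pathlib import PurePosixPath


def guess_patterns(sample_path, entities):
    """List every assignment of entities to path components as a pattern."""
    path = PurePosixPath(sample_path)
    # Candidate slots: each directory name plus the file stem.
    tokens = list(path.parts[:-1]) + [path.stem]
    guesses = []
    for positions in permutations(range(len(tokens)), len(entities)):
        fields = list(tokens)
        for entity, position in zip(entities, positions):
            fields[position] = "%" + entity + "%"
        guesses.append("/".join(fields) + path.suffix)
    return guesses


guesses = guess_patterns("lab/P01/S2/rest.vhdr", ["subject", "session", "task"])
print(len(guesses))  # 4 slots and 3 entities give 4 * 3 * 2 = 24 guesses
```

Even this tiny example yields 24 candidates, and it ignores entities embedded inside a component (such as the "01" in "P01"), which shows how quickly the search space grows.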

EOG channels heuristic

One heuristic I thought of, for example, is assuming that channels named "EOG", "HEOG", "HVEOG" and variations of these are EOG channels by default. This heuristic would be quite simple: it would translate to a "retype the matching channels to EOG" entry in the rules file.
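A minimal sketch of such a heuristic follows. The exact set of "variations" is an assumption (an optional H/V prefix and an optional numeric or L/R suffix), and the returned dictionary is only a stand-in for whatever form the rule takes in the rules file.

```python
import re

# Assumed variations: optional H/V prefix, optional separator, optional
# numeric or left/right suffix, e.g. "HEOG", "veog1", "EOG-L".
EOG_NAME = re.compile(r"[HV]{0,2}EOG[ _-]?(?:\d+|[LR])?", re.IGNORECASE)


def eog_retype_rule(channel_names):
    """Map every channel whose name looks like an EOG channel to type 'EOG'."""
    return {
        name: "EOG"
        for name in channel_names
        if EOG_NAME.fullmatch(name.strip())
    }


rule = eog_retype_rule(["Fp1", "HEOG", "veog1", "EOG-L", "ECG", "Cz"])
print(rule)
```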

Minimum Heuristic to justify the DISCOVER RULES Module

I think that having heuristics like the EOG channel one I gave doesn't justify the "DISCOVER RULES" module, because the user still needs to input the path structure for us to be able to infer the bidspaths. So in a sense it wouldn't be useful.

Now, if we manage to devise a way to do the path inference heuristic, I think the DISCOVER RULES module would be worthwhile. It would finally almost automate the conversion, with only minimal user intervention.

@aswinnarayanan Do you have any ideas or comments regarding these thoughts?

aswinnarayanan commented 3 years ago

@yjmantilla that's a very fair point. The channel name recommendation wouldn't justify a complete module, and the path inference is the bigger issue. Do you see it as viable for the user community to submit their edited mappings into some kind of central repository?

yjmantilla commented 3 years ago

Do you see it as viable for the user community to submit their edited mappings into some kind of central repository?

I think that is possible. The important part is that the context in which the submitted mapping was applied is made explicit, so that the user can say "oh, this looks like my case" and test the mapping to see if any further changes are required.

civier commented 3 years ago

As far as I remember (correct me if I'm wrong), the idea we ended up considering is having heuristics in the sense that we learn the rules from a single example of raw EEG files converted into BIDS. This has several advantages:

1) Saving the user time, as some of the rules would be inferred from the converted example.
2) Helping the user understand the format of the rules file. What is better than seeing some of the rules for your actual data? I envisage our tool to really be self-explanatory, without users needing to read a manual on how to write the rules file.
3) More practical. Most studies start with pilot data that needs to be converted rapidly for testing. Nobody has time to write a rules file at this point, and conversion is usually done manually. Why not take advantage of this conversion that is done anyway?

Now, I do not expect we can learn EVERYTHING from a simple example, but learning where data files go in the folder hierarchy or learning which channels are classified as EOG sounds quite feasible to me. I do not expect it to be super clever at this point (so python scripts like the ones heudiconv uses are acceptable), but I do hope we can demonstrate that learning from an example has potential for BIDS conversion.

@yjmantilla Do you think it's feasible?

@aswinnarayanan Regarding a central repository, I'm all for it, and it was actually one of my original proposals for the project. Having examples for each EEG instrument will be super useful, and even if we cannot base our heuristics on them, we can indeed try these mappings on the input dataset. Even just referring the user to such an online resource will be useful. The only question is whether we can fit it into the timeframe of the current project -- @yjmantilla, do you think Bryan can work on this, or will he be investing his time in a rudimentary GUI in the end?

yjmantilla commented 3 years ago

@civier

the idea we ended up considering is having heuristics in the sense that we learn the rules from a single example of raw EEG files converted into BIDS.

It was one of the possible ideas for heuristics, along with developing heuristics for common metadata formats.

Saving the user time, as some of the rules would be inferred from the converted example

I think this is debatable. Some bids files are harder to write than the rules file, while others aren't. For example, the sidecar eeg json and the channels.tsv file wouldn't be trivial to fill in, or at least they would be cumbersome. The user would need to write the jsons with all the fields having the correct names, meaning they have to consult the bids documentation and type everything by hand. For the tsv they would need to know what a tsv is for starters, and then fill it in (I imagine with some spreadsheet software).

One thing I do see that the user can do easily is placing files in sub-folders with the correct bids name. From this we could infer the path pattern, and the user wouldn't need to write that rule (which is the most important one).

What is better than seeing some of the rules for your actual data? I envisage our tool to really be self-explanatory, without users needing to read a manual on how to write the rules file.

I agree that this is the ideal. Nevertheless, almost every workflow will require an explanation, even the one where the user does an initial conversion and then passes it on to infer the rules. If the rules file inferred from the example is not complete (which is the most probable case), then the user still needs to understand the rules file to complete it.

The main way I see to get rid of explanations is to use a form. Maybe something like ARTEMIS/COBIDAS would be the one to follow. Such a form would need to show a preview of the conversion so that the user knows what is wrong and can change it.

Most studies start with pilot data that needs to be converted rapidly for testing.

I'm not sure if this is true at a general level. In our lab, people doing the pilot study collect data following their own scheme and don't convert them to bids until it is required by someone else. Of course this is because we don't have a mature bids culture. The main people thinking of data collection are the ones inclined to that field.

Maybe it is a matter of institutional guidelines and workflows. Here eeg data is taken by nurses, medical students, psychology students, engineering students and so on, and they do that following a path pattern and that's it. They don't write sidecar files. At most they fill in some form with the current session information.

So here the workflow is to convert to bids once the data is fully collected. Tests are done on the original source data, not on bids data, which is bad practice but is what we have managed to do given the constraints.

yjmantilla commented 2 years ago

Now, I do not expect we can learn EVERYTHING from a simple example, but learning where data files go in the folder hierarchy or learning which channels are classified as EOG sounds quite feasible to me. I do not expect it to be super clever at this point (so python scripts like the ones heudiconv uses are acceptable), but I do hope we can demonstrate that learning from an example has potential for BIDS conversion.

So what I take from this is that a prototype for the heuristics module should be able to infer the path pattern from where the user placed a source file inside a bids directory. That is feasible.

The channel classification is feasible from our side. From the user side, though, I don't imagine anyone filling in the channels.tsv by hand. But if they did, we could infer the classification.
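As a sketch of how that inference could look, the snippet below compares an automatically generated channels.tsv with the user-edited one and records renamed and retyped channels. It assumes the user kept the row order and that both files have the `name` and `type` columns; the function names and the shape of the returned dictionary are hypothetical.

```python
import csv
import io


def read_channels(tsv_text):
    """Parse channels.tsv content into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def infer_channel_rules(original_tsv, edited_tsv):
    """Infer rename and retype rules, assuming rows are in the same order."""
    original = read_channels(original_tsv)
    edited = read_channels(edited_tsv)
    if len(original) != len(edited):
        raise ValueError("row counts differ; cannot align channels by position")
    renames, retypes = {}, {}
    for before, after in zip(original, edited):
        if before["name"] != after["name"]:
            renames[before["name"]] = after["name"]
        if before["type"] != after["type"]:
            retypes[after["name"]] = after["type"]
    return {"name": renames, "type": retypes}


ORIGINAL = "name\ttype\tunits\nFp1\tEEG\tuV\nHEOG\tEEG\tuV\nECG1\tEEG\tuV\n"
EDITED = "name\ttype\tunits\nFp1\tEEG\tuV\nHEOG\tEOG\tuV\nECG\tECG\tuV\n"

rules = infer_channel_rules(ORIGINAL, EDITED)
print(rules)
```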

One way I can see this kind of example-based workflow working without too much effort spent on doing the bids conversion by hand is that the user places the file where it belongs so that we infer the path pattern. Once that is done, we do the conversion with only that rule. The user inspects the conversion and changes what is needed accordingly. That way they wouldn't need to type everything by hand from scratch, but rather just correct what is wrong. Nevertheless, such a workflow would still need to be explained. Here is the workflow diagram:

[Workflow diagram: mermaid-diagram-20210721171636]

So to wrap this up, I will do a prototype for these heuristics:

[ ] reading a channels.tsv and inferring what names and types were changed by the user, and saving them as rules.
[x] inferring path pattern from a sourcepath and a desired output path done by the user.
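The path-pattern item could be prototyped roughly as follows: read the entity values from the BIDS filename the user chose, then locate each value in the source path and replace it with a placeholder. This is only a sketch under assumptions; the `%entity%` placeholder syntax, the function name and the entity list are illustrative, and the function refuses to guess when a value appears zero or several times in the source path.

```python
import re

# Short BIDS entity keys mapped to longer placeholder names (illustrative).
LONG_NAMES = {"sub": "subject", "ses": "session", "task": "task",
              "acq": "acquisition", "run": "run"}


def infer_path_pattern(source_path, bids_path):
    """Replace entity values found in the BIDS filename inside the source path."""
    filename = bids_path.rsplit("/", 1)[-1]
    entities = re.findall(r"(sub|ses|task|acq|run)-([A-Za-z0-9]+)", filename)
    spans = []
    for key, value in entities:
        hits = [m.start() for m in re.finditer(re.escape(value), source_path)]
        if len(hits) != 1:
            raise ValueError(f"cannot place {key}={value!r} unambiguously")
        spans.append((hits[0], hits[0] + len(value), LONG_NAMES[key]))
    pattern, cursor = "", 0
    for start, end, name in sorted(spans):
        if start < cursor:
            raise ValueError("entity values overlap in the source path")
        pattern += source_path[cursor:start] + "%" + name + "%"
        cursor = end
    return pattern + source_path[cursor:]


pattern = infer_path_pattern(
    "data/lab/P01/S02/rest.vhdr",
    "bids/sub-01/ses-02/eeg/sub-01_ses-02_task-rest_eeg.vhdr",
)
print(pattern)  # data/lab/P%subject%/S%session%/%task%.vhdr
```

Raising on ambiguous matches keeps the user in the loop, which fits the workflow where the user inspects and corrects the result.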

do you think Bryan can work on this, or will he be investing his time in a rudimentary GUI in the end?

He had advanced a bit with the rudimentary GUI; nevertheless, I can tell him to focus his efforts on this feature rather than the GUI if you think it is more important.

PS: I just split the comment before this one for easier reading.
