Overhaul library database for ML and other analysis procedures

wincowgerDEV commented 3 years ago

I kind of backed us into a corner with the way the library database is set up, because it contains only clean, processed spectra. Many machine learning procedures would benefit from having unprocessed spectra in the library, and some matching procedures might perform better too. We may also want to let people propose a "better cleaned" version of the library spectra in the app, which would require having the raw spectra loaded somewhere in the app.

I am thinking about a new library structure with Raw_Spectra, Clean_Spectra, Sample_ID, and Peak_Number columns. If we rewrite some of the code, we can get around the current need for a peaks-only version of the library (which is creating some duplication of data) by filling in Peak_Number only at the peak locations and putting NA where there is no peak. It would be nice if we could then create a tool in Open Specy where people can adjust the Clean_Spectra values for the matches by reprocessing the Raw_Spectra column. The adjustments would be sent to MongoDB, and we would need to review them.

I also want to add the preprocessing parameters we used in Open Specy to clean the spectra to the metadata. The metadata view is hard to read right now when there is a lot of metadata. Perhaps we can add on-hover information or something similar so that the metadata doesn't bleed across the screen.

[x] Create new library format.
[ ] Integrate into the Shiny app and the function workflows in the app.
[ ] Improve metadata display when there is a lot of metadata.
[ ] Develop a tool for users to clean the database.
[ ] Integrate a MongoDB alert system.
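
To make the proposed structure concrete, here is a minimal sketch of a single table holding raw values, cleaned values, and peak numbers side by side, with a peaks-only view derived on demand instead of stored as a second library. It is written in Python purely for illustration and is not Open Specy code. The `wavenumber` field, the helper `peaks_only`, and all values are hypothetical; `None` stands in for NA.

```python
from typing import NamedTuple, Optional


class LibraryRow(NamedTuple):
    sample_id: str
    wavenumber: float            # hypothetical field, not named in the proposal
    raw_spectra: float           # intensity before any processing
    clean_spectra: float         # intensity after processing
    peak_number: Optional[int]   # None plays the role of NA (no peak here)


def peaks_only(rows):
    """Derive the peaks-only view instead of storing a second library."""
    return [row for row in rows if row.peak_number is not None]


# Made-up values for one sample.
library = [
    LibraryRow("s1", 1000.0, 0.42, 0.02, None),
    LibraryRow("s1", 1010.0, 0.95, 0.60, 1),
    LibraryRow("s1", 1020.0, 0.47, 0.05, None),
    LibraryRow("s1", 1030.0, 0.88, 0.51, 2),
]

peak_rows = peaks_only(library)
```

In this sketch the full table stays the single source of truth, and any peaks-only matching would filter on the peak number column at query time.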

wincowgerDEV commented 3 years ago

Check out these resources for improving the metadata. They will join CAS numbers with common names of many materials. https://docs.openrefine.org/manual/wikidata https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine

zsteinmetz commented 3 years ago

I really like the idea of making the raw spectra a part of the Open Specy library. But maybe we can get around changing the current library structure too much. Although MongoDB certainly has many advantages, migrating the library would mean:

- lots of changes to the current code base that will probably break something
- extra work to keep the old structure alive until the majority of package users have switched to the new version
- a DB architecture that makes offline use much more difficult

In my opinion, I wouldn't call it duplication of data but rather modularization :wink: In its current form, package users can download only the library they need, and both package and app users might save some computing power if cleaned and raw spectra are not loaded simultaneously.

Having said this, I'd rather suggest adding the raw spectra as separate library files (raman_raw.rds and ftir_raw.rds) to OSF and supplementing the metadata rds files with information on the cleaning parameters. The dropdown radio buttons under "Identify Spectrum" would then get an extra option for the raw spectra.

What do you think?

wincowgerDEV commented 3 years ago

Hey Zacharias,

Yeah, that makes sense to me and sounds like a good way to add the raw spectra to our current workflow. I will need to look into how searches against raw spectra are best conducted. I know that they can produce a ton of unexpected matches when using correlations, because the correlation fits the baseline more than it fits the peaks. However, that kind of fit may be what someone wants for assessing a match to a fluorescence signal.
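
A small synthetic sketch of the baseline effect described above, in Python and not taken from Open Specy: two made-up spectra share a sloping background but have peaks in different places. The slope, peak positions, and peak widths are arbitrary assumptions, and the "clean" spectra here are simply the peaks without the known synthetic baseline, standing in for real baseline correction.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def peak(center, points, width=3.0):
    """Gaussian peak of unit height centered at index `center`."""
    return [math.exp(-((i - center) ** 2) / (2 * width ** 2))
            for i in range(points)]


N = 100
baseline = [0.05 * i for i in range(N)]  # shared sloping background (assumed)
peak_a = peak(30, N)                     # material A: peak at index 30
peak_b = peak(70, N)                     # material B: peak at index 70

raw_a = [b + p for b, p in zip(baseline, peak_a)]
raw_b = [b + p for b, p in zip(baseline, peak_b)]

raw_r = pearson(raw_a, raw_b)      # dominated by the shared baseline
clean_r = pearson(peak_a, peak_b)  # reflects only the (different) peaks
```

With these assumptions `raw_r` comes out much higher than `clean_r` even though the peaks do not overlap, which is the kind of unexpected match a correlation on raw spectra can produce.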

Warm Regards, Win

zsteinmetz commented 3 years ago

What about limiting matches from raw data to peaks only?

wincowgerDEV commented 3 years ago

I'm not sure I have seen that used in other software so I don't know how well it would be understood by the user. Do you have any examples of that type of reference database design?

zsteinmetz commented 3 years ago

No .. it was just an idea.

wincowgerDEV commented 3 years ago

Oh ok yeah, definitely something to think about though.

wincowgerDEV commented 3 years ago

I recently noticed that the Chabuka library still has some CO2 signals in it. Definitely worth taking another look at to clean up a little more when we go back through.

wincowgerDEV commented 3 years ago

Moving this to the top of my list now. It is the most requested feature from users, and I think it will improve the accuracy of the tool. I might merge it with wincowgerDEV/OpenSpecy-shiny#4 too so that we can speed up the analysis at the same time.

zsteinmetz commented 3 years ago

Sounds good to me! I'd also add wincowgerDEV/OpenSpecy-package#92. Since this will break compatibility with the current DB structure, I guess it might be best to make this a milestone for v1.0.0

wincowgerDEV commented 3 years ago

Yeah good idea! I'll try to merge that in too. Agree that this one will be a haul and should be called 1.0.0. I guess that 2.0.0 will be when we get the AI running and maybe 3.0.0 will be when we have all the login functionality working. At least that's how I'm currently prioritizing things in my mind.

Warm Regards, Win

wincowgerDEV commented 2 years ago

The first task is done, a new dataset has been added to the library database. The raw format keeps the metadata and spectra as close to the original raw datasets as I can currently get them. Next, I need to clean up that raw format into something that is more readable for machine learning.

@ardcarvalho The dataset we spoke about is mostly ready now.

ardcarvalho commented 2 years ago

> The first task is done, a new dataset has been added to the library database.

Just to confirm that these new libraries are not yet available through get_lib(), right?

zsteinmetz commented 2 years ago

> Just to confirm that these new libraries are not yet available through get_lib(), right?

Nope, they shouldn't. Only files that fit the pattern below are downloaded: https://github.com/wincowgerDEV/OpenSpecy/blob/b5a89a3a35482f282761f6b5b951b7f89576f59a/R/manage_lib.R#L105-L110

Maybe we should put some more thought into how best to manage the library files in the future so that we don't mess with people's local installations. Off the top of my head, I could imagine:

1. Adding a version tag to the files and making get_lib() download the latest version, or a specific version that fits the package installation.
2. Putting new library files into a new folder with a new OSF node key, and switching out that key once we want people to use the new library together with a new package version.

What do you think? My current preference would be option 2 but I may have not thought it through yet.
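
To compare the two options, here is a rough Python sketch and not Open Specy code. The file names, the `_v<N>` tag format, the helper names `pick_files` and `node_for`, and the node keys are all hypothetical placeholders; the actual pattern used by get_lib() and the real OSF layout may differ.

```python
import re

# Hypothetical file listing; real OSF file names may look different.
FILES = [
    "ftir_library_v1.rds",
    "ftir_library_v2.rds",
    "raman_library_v1.rds",
    "ftir_raw_v2.rds",  # does not match the pattern below, so it is skipped
]

# Assumed naming scheme: <type>_library_v<version>.rds
TAGGED = re.compile(r"^(ftir|raman)_library_v(\d+)\.rds$")


def pick_files(files, max_version):
    """Option 1: per spectrum type, keep the newest tagged file whose
    version does not exceed what the installed package supports."""
    best = {}
    for name in files:
        match = TAGGED.match(name)
        if match is None:
            continue
        kind, version = match.group(1), int(match.group(2))
        if version <= max_version and version > best.get(kind, (0, ""))[0]:
            best[kind] = (version, name)
    return sorted(name for _, name in best.values())


# Option 2: one storage node per library generation (placeholder keys).
NODE_KEYS = {1: "node-key-for-v1", 2: "node-key-for-v2"}


def node_for(package_major):
    """Look up the node that belongs to a package's major version."""
    return NODE_KEYS[package_major]
```

Under these assumptions, option 1 needs version parsing and a selection rule on the client side, while option 2 reduces the client logic to a single lookup at the cost of maintaining one node per library generation.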

ardcarvalho commented 2 years ago

> Only files that fit the pattern below are downloaded:

Ah, thanks. So, if I understood it correctly, it's basically due to the "raw" in the file name?

> What do you think? My current preference would be option 2 but I may have not thought it through yet.

Does changing the library imply changing the package version? If functions are affected by a new library (I'm not sure that's the case), I would go for an approach that allows a library version that fits the package installation (option 1, I guess).

zsteinmetz commented 2 years ago

> Ah, thanks. So, if I understood it correctly, it's basically due to the "raw" in the file name?

Yes, exactly.

> Does changing the library imply changing the package version? If functions are affected by a new library (I'm not sure that's the case), I would go for an approach that allows a library version that fits the package installation (option 1, I guess).

It depends on the kind of change. If the table structure is changed, we might need to adapt the package code accordingly. If just some new spectra are added, the older package versions should work just fine. Option 2 would also allow us to link specific library and package versions with one another (using different OSF nodes). But it could make the file handling easier than with option 1.

wincowgerDEV commented 2 years ago

Great discussion. I've been thinking about this too. I'm trying to figure out what the best strategy will be in the future if the libraries expand rapidly. I'm thinking that once we get this ML model running for classification, we won't actually need people to download the libraries any more. We can just have them download the individual spectra they want to look at, when they want to look at them. That could make the tool a lot lighter on folks' devices and make the database queries faster. The drawback is that they couldn't easily use it offline.

wincowgerDEV commented 1 year ago

I think this has been resolved; we can resurrect it if needed.
