wincowgerDEV / OpenSpecy-package

Analyze, Process, Identify, and Share, Raman and (FT)IR Spectra
http://wincowger.com/OpenSpecy-package/
Creative Commons Attribution 4.0 International

Develop a Predictive Model for Identifying Spectra #78

Closed by wincowgerDEV 1 year ago

wincowgerDEV commented 3 years ago

@ardcarvalho and @wincowgerDEV are working on a predictive model for identifying spectra, starting with PCA. The end goal is a model that can accurately identify any raw, unprocessed spectrum. Such a model would speed up identification and allow us to rapidly expand our resources. If we use an interpretable model, we may also learn which peaks matter most for identification. Ideally, the model's accuracy will exceed 90%, which is the current accuracy of our default settings. If we pull it off, this work is ripe for publication and could have wide implications beyond Open Specy. The model will eventually be folded into the Open Specy package as a function (as long as the model file isn't too large) and offered as a feature in the online version of the tool.

Steps

Some other model options that might work:

  1. https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
  2. https://github.com/wincowgerDEV/OpenSpecyAI

zsteinmetz commented 3 years ago

Sounds great! Let me know if you'd like any help.

wincowgerDEV commented 3 years ago

You are welcome to take part and jump in however you would like. I will definitely post about challenges on this issue and @-mention you if neither of us knows what to do.

Cheers Win

On Mon, May 17, 2021, 1:55 AM Zacharias Steinmetz @.***> wrote:

Sounds great! Let me know if you'd like any help.


wincowgerDEV commented 3 years ago

A recent publication made some headway on this problem for us.

They have a github page with the code here: https://github.com/EdsonCilos/mp_classification

It is in Python, but it will give us some exposure to the format of the models, and most can be implemented in R.

ardcarvalho commented 3 years ago

Great! I'll take a look and get more involved from next week

ardcarvalho commented 2 years ago

Hi guys, I've finally jumped in, to stay! I'm starting to understand the data structure and how to manage collaborative work and communication through GitHub. I don't know the best place for this kind of discussion, so please feel free to correct and guide me whenever you want. Here I go:

  1. Supposing the updated OpenSpecy database is available through get_lib(), we currently have 636 spectra, right?
  2. The spectrum_identity variable has several duplicate identities, e.g. "poly(ethylene terephthalate" and "poly(ethylene terepthalate". Is a more standardized classification conceivable? If so, I could try to label them and reduce this variability, at least for the purposes of this issue.
  3. I'm collecting and diving into more specific references, such as the last pub you shared, and indeed a lot could be implemented in OpenSpecy and made available to the scientific community. I'm more familiar with multivariate analysis than machine learning, but willing to learn. As for available spectra, such as the ones in https://github.com/EdsonCilos/mp_classification, is it possible to include them in our database? Should we also work as "spectra hunters"?
  4. I've started the exploratory analysis with a simple PCA followed by hierarchical clustering (OpenSpecy_MVA_script.txt). What is the best way to share my code through GitHub?
  5. Several other data analysis tools could be implemented in OpenSpecy, such as differentiating spectra across space and time or among experimental conditions, that is, using metadata as explanatory variables. Happy to hear your ideas!
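To make the label clean-up from item 2 concrete, here is a minimal sketch of how near-duplicate identities could be grouped for manual review. It is written in Python with only the standard library. The function names, the 0.9 similarity threshold, and the sample labels other than the two quoted above are assumptions for illustration. It is not the OpenSpecy code, nor the approach taken in the other issue.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def group_labels(labels, threshold=0.9):
    """Greedily group labels whose normalized text is nearly identical.

    Each label joins the first group whose first member is at least
    `threshold` similar (difflib ratio); otherwise it starts a new group.
    The threshold is an illustrative assumption, not a tuned value.
    """
    groups = []
    for label in labels:
        key = normalize(label)
        for group in groups:
            representative = normalize(group[0])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                group.append(label)
                break
        else:
            groups.append([label])
    return groups


# Hypothetical sample of spectrum_identity values.
labels = [
    "poly(ethylene terephthalate",
    "poly(ethylene terepthalate",
    "Polystyrene",
    "polystyrene ",
    "polyethylene",
]
groups = group_labels(labels)
for group in groups:
    print(group)
```

Fuzzy matching like this can also merge names that are chemically distinct but spelled similarly, so the groups are best treated as suggestions to check by hand rather than as final labels.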

Best, Aline

wincowgerDEV commented 2 years ago

Hey Aline

Awesome! Glad to have you on board.

Yeah this is the right place to have these convos 🙂

Answers below:

  1. Supposing the updated OpenSpecy database is available through get_lib(), we currently have 636 spectra, right?

That's right.

  2. The spectrum_identity variable has several duplicate identities, e.g. "poly(ethylene terephthalate" and "poly(ethylene terepthalate". Is a more standardized classification conceivable? If so, I could try to label them and reduce this variability, at least for the purposes of this issue.

That would be awesome. I know there is another issue we opened where @Shreyas Patankar made some headway on that problem, so you might start by getting his code implemented.

  3. I'm collecting and diving into more specific references, such as the last pub you shared, and indeed a lot could be implemented in OpenSpecy and made available to the scientific community. I'm more familiar with multivariate analysis than machine learning, but willing to learn. As for available spectra, such as the ones in https://github.com/EdsonCilos/mp_classification, is it possible to include them in our database? Should we also work as "spectra hunters"?

Definitely! We should include those, and we are spectra hunters. The main thing I'm working on right now is overhauling that database with about 5k new spectra that folks have given us. As you can imagine, it's taking some time to get them all formatted together, but I'm about a quarter of the way there.

  4. I've started the exploratory analysis with a simple PCA followed by hierarchical clustering (OpenSpecy_MVA_script.txt https://github.com/wincowgerDEV/OpenSpecy/files/7474226/OpenSpecy_MVA_script.txt). What is the best way to share my code through GitHub?

The best way to share it is to create a new branch of this repo and put the code where it belongs in the repo. I guess this could be a function that builds a hierarchical clustering algorithm in the Open Specy package, in addition to the app, so we would probably start by implementing it as a function there. After the code is working, you'll submit a pull request, and Zacharias or I will review it and go back and forth with you on edits. It will then get merged into the code once everyone is happy with it. We can set up a video call to walk through some of this if you would like.

  5. Several other data analysis tools could be implemented in OpenSpecy, such as differentiating spectra across space and time or among experimental conditions, that is, using metadata as explanatory variables. Happy to hear your ideas!

That would be sweet! Do you think all reference data would need the metadata for a routine like that to work, or could some have it and some not? I ask because many of the spectra have poor metadata.
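As a rough picture of what the hierarchical clustering function discussed under item 4 might look like, here is a small sketch. It is in Python with only the standard library, not R, and it is not Aline's script. It skips the PCA step and clusters directly with average linkage on a distance of 1 minus the Pearson correlation. The distance choice, the function names, and the toy spectra are all assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length intensity vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)  # undefined for flat spectra


def average_linkage(spectra, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(spectra))]

    def cluster_distance(c1, c2):
        # Average of (1 - r) over every pair of members.
        total = sum(1.0 - pearson(spectra[i], spectra[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy spectra on a shared wavenumber grid: two with a peak at index 2,
# two with a peak at index 5.
spectra = [
    [0, 1, 5, 1, 0, 0, 0, 0],
    [0, 2, 9, 2, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 6, 1, 0],
    [1, 0, 0, 0, 2, 8, 2, 0],
]
result = average_linkage(spectra, 2)
print(result)
```

This brute-force version recomputes every pairwise distance on each merge, so it only suits small examples. A package function would more likely rely on an existing clustering routine.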

Warm Regards Win

On Thu, Nov 4, 2021, 4:26 AM ardcarvalho @.***> wrote:


ardcarvalho commented 2 years ago

Hey!

Great Win, things are becoming clear.

Indeed, Shreyas Patankar did great work on polymer categorization in #95. I'll work to adapt it and implement it in OS (https://github.com/Ocean-Wise/OpenSpecy_data_sorting), but it would be nice to apply it to the database that you're working on.

The next step would build on what was done by Back et al. (the pub you sent, https://doi.org/10.1016/j.chemosphere.2021.131903): developing the model to better identify an unknown spectrum. They propose an interesting pipeline and worked with a dataset of 958 spectra. It would be nice to develop ours with the 5k+ database, but I'll start the code with what we have now.

But beyond improving spectrum identification, I'd like to implement data analysis tools (perhaps that should be another issue). The user would be the one to enter the metadata for the analysis (such as experimental conditions), so I don't think the metadata of our dataset will be important.

We could definitely chat =) Let's set up a video call in late November. Give me some time to get deeper into the whole issue. =p

Best, Aline

wincowgerDEV commented 2 years ago

Hey Aline,

By the time you have a first implementation of the code running in OS, I should have the new database up and running. I will put this at the top of my priorities.

I agree that the new data analysis tools you are thinking of, with experimental conditions as inputs to spectral analysis, should probably be a new issue. I will soon send you an email about the video call for late November :)

Let me know if you have any other questions in the meantime.

Warm Regards, Win

On Fri, Nov 5, 2021 at 10:20 AM Aline Carvalho, PhD < @.***> wrote:



afalty commented 2 years ago

Hey folks, I recently spoke with Win about getting some hyperspectral data on OpenSpecy. I've built a model for classifying spectra using SIMCA (a PCA-based method), which has worked very well. I haven't used SVM, but I have colleagues who work with it, and I know from the literature that it performs very well. If you'd like to chat, just let me know. I'll also share my code for the models I work with once I get it cleaned up a bit.
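For readers unfamiliar with SIMCA, the core idea can be sketched in a few lines: fit a separate PCA model per class, then assign a new spectrum to the class whose model leaves the smallest residual. The Python below is a stripped-down illustration under loud assumptions. It keeps a single principal component per class (found by power iteration), uses no statistical acceptance thresholds, and runs on made-up data with hypothetical class names. It is not the model described in this comment.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_class(rows, iters=100):
    """Return (mean, first principal direction) for one class of spectra."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    v = [1.0 / math.sqrt(p)] * p  # starting guess for power iteration
    for _ in range(iters):
        scores = [dot(c, v) for c in centered]
        # w = X^T (X v), computed without forming the covariance matrix
        w = [sum(s * c[j] for s, c in zip(scores, centered)) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    return mean, v


def residual(x, model):
    """Distance from x to the one-component PCA model of a class."""
    mean, v = model
    d = [xi - mi for xi, mi in zip(x, mean)]
    t = dot(d, v)  # score on the class component
    return math.sqrt(sum((di - t * vi) ** 2 for di, vi in zip(d, v)))


def classify(x, models):
    """Pick the class whose model reconstructs x with the smallest residual."""
    return min(models, key=lambda name: residual(x, models[name]))


# Made-up training data: each class is one spectral shape at varying intensity.
base_a = [1.0, 0.0, 2.0, 0.0]
base_b = [0.0, 3.0, 0.0, 1.0]
training = {
    "class_a": [[s * v for v in base_a] for s in (1.0, 1.5, 2.0)],
    "class_b": [[s * v for v in base_b] for s in (1.0, 1.5, 2.0)],
}
models = {name: fit_class(rows) for name, rows in training.items()}

sample_a = [1.2 * v for v in base_a]
sample_b = [0.8 * v for v in base_b]
print(classify(sample_a, models), classify(sample_b, models))
```

A fuller SIMCA implementation would keep several components per class and compare residuals against class-specific limits, which also lets it reject a spectrum that fits no class.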

Looking forward to collaborating :)

zsteinmetz commented 2 years ago

  4. I've started the exploratory analysis with a simple PCA followed by hierarchical clustering (OpenSpecy_MVA_script.txt https://github.com/wincowgerDEV/OpenSpecy/files/7474226/OpenSpecy_MVA_script.txt). What is the best way to share my code through GitHub?

The best way to share it is to create a new branch of this repo and put the code where it belongs in the repo. I guess this could be a function that builds a hierarchical clustering algorithm in the Open Specy package, in addition to the app, so we would probably start by implementing it as a function there. After the code is working, you'll submit a pull request, and Zacharias or I will review it and go back and forth with you on edits. It will then get merged into the code once everyone is happy with it. We can set up a video call to walk through some of this if you would like.

Glad you're in, @ardcarvalho!

If you like, you can use GitHub Gists for prototyping (https://gist.github.com/). In addition, you can fork our repo and work on a new branch of that fork from within your private account. Once you have fleshed it out, you can create a pull request from that fork. That's a little easier for us to manage than creating branches here.

This is the so-called "Fork and Pull Request" workflow; see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork for details.

ardcarvalho commented 2 years ago

If you like, you can use GitHub Gists for prototyping (https://gist.github.com/). In addition, you can fork our repo and work on a new branch of that fork from within your private account. Once you have fleshed it out, you can create a pull request from that fork. That's a little easier for us to manage than creating branches here.

This is the so-called "Fork and Pull Request" workflow; see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork for details.

Great, thank you @zsteinmetz !

ardcarvalho commented 2 years ago

By the time you have a first implementation of the code running in OS, I should have the new database up and running. I will put this at the top of my priorities.

That would be really nice!

I agree that the new data analysis tools you are thinking of, with experimental conditions as inputs to spectral analysis, should probably be a new issue. I will soon send you an email about the video call for late November :)

Yeah, let's discuss how best to organize both projects.

Let me know if you have any other questions in the meantime.

Thanks, Win! See you soon

wincowgerDEV commented 1 year ago

Following up on this, the package and app now support multinomial classification 🎉.

wincowgerDEV commented 1 year ago

Going to close this for now. We can open a new issue when we want to make another push on model development, depending on need and requests from users.