wincowgerDEV / OpenSpecy-package

Analyze, Process, Identify, and Share, Raman and (FT)IR Spectra
http://wincowger.com/OpenSpecy-package/
Creative Commons Attribution 4.0 International

Add cluster levels as an option to matching #95

Closed wincowgerDEV closed 11 months ago

wincowgerDEV commented 3 years ago

Shreyas did a cluster analysis on the FTIR Open Specy database. We should add an option to search based on the cluster group, which simplifies analysis. In the future, it would also be useful when developing AI: we could do feature extraction using his standard deviation analysis and classify the clusters instead of the classes currently in the list. His notes are below:

In short, I used SciPy’s hierarchical clustering and set the threshold to cluster spectra together if (pearson_coefficient > 0.3). With this criterion, the data neatly separates into 33 clusters. To make this more useful:

I’ve created a figure (simplified_cluster_grid.png, attached) that shows the mean and standard deviation of all spectra within each cluster; to me this boosts confidence in the reliability of the process.

The original OpenSpecy web download includes a metadata file. The clustering code adds a column to this file (see the final column, “cluster_ix”, in the attached file ftir_metadata_clusters.csv). Up to this point everything is processed by code.

As a final step, I used human judgement: in the attached file cluster_keys_simplified.csv, I added a last column, “simplified_cluster_name”, with simplified polymer category names similar to Primpke 2018.
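For readers who want to try the idea, here is a minimal sketch of correlation-threshold clustering with SciPy, followed by the per-cluster mean and standard deviation. It is not the code from the notebook: the average linkage method, the helper names `cluster_spectra` and `cluster_summary`, and the synthetic spectra are all assumptions made for illustration. SciPy's `correlation` metric is 1 minus the Pearson coefficient, so the criterion pearson_coefficient > 0.3 is approximated by cutting the tree at a distance of 0.7.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_spectra(spectra, min_pearson=0.3, method="average"):
    """Assign a cluster label to each row (spectrum) of `spectra`.

    Assumption: average linkage. The source only says SciPy hierarchical
    clustering was used, so the linkage method here is a guess.
    """
    # metric="correlation" is the distance d = 1 - r (r = Pearson coefficient).
    tree = linkage(spectra, method=method, metric="correlation")
    # Cut the tree where the linkage distance exceeds 1 - min_pearson.
    # With average linkage this bounds the mean between-group distance,
    # not every individual pair.
    return fcluster(tree, t=1.0 - min_pearson, criterion="distance")


def cluster_summary(spectra, labels):
    """Return {cluster label: (mean spectrum, standard deviation spectrum)}."""
    return {
        int(k): (spectra[labels == k].mean(axis=0), spectra[labels == k].std(axis=0))
        for k in np.unique(labels)
    }


# Synthetic stand-in for a spectral library (hypothetical data):
# three peak positions, five noisy copies of each.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
peaks = [np.exp(-0.5 * ((x - c) / 0.03) ** 2) for c in (0.2, 0.5, 0.8)]
spectra = np.array([p + rng.normal(0.0, 0.02, x.size) for p in peaks for _ in range(5)])

labels = cluster_spectra(spectra)
summary = cluster_summary(spectra, labels)
```

In this sketch, `labels` plays the role of the “cluster_ix” column, and `summary` holds the curves one would plot in a grid like simplified_cluster_grid.png. A different linkage method (for example `"complete"`) would enforce the threshold more strictly and could give a different number of clusters.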

I’m incorporating these into our lab’s analysis code; hopefully some of our data going forward will be labelled with these simplified cluster names after fitting with the OpenSpecy database. I also think this clustering and the simplified category names could be useful to other OpenSpecy users. All of this is now available on GitHub (with a more descriptive README file and a step-by-step Jupyter notebook).
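As a rough illustration of the labelling step, the sketch below maps the library entry returned by a match to its simplified cluster name using the two CSV files. Only the column names “cluster_ix” and “simplified_cluster_name” come from the notes above. The `sample_name` column, the row values, and the helper names `load_cluster_names` and `label_match` are hypothetical placeholders, and the real files may be laid out differently.

```python
import csv
import io

# Hypothetical stand-ins for ftir_metadata_clusters.csv and
# cluster_keys_simplified.csv; the real files have more columns and rows.
METADATA_CSV = """sample_name,cluster_ix
lib_001,1
lib_002,1
lib_003,2
"""
KEYS_CSV = """cluster_ix,simplified_cluster_name
1,polyethylene
2,polyamide
"""


def load_cluster_names(metadata_file, keys_file):
    """Build {library entry id: simplified cluster name} from the two tables."""
    keys = {
        row["cluster_ix"]: row["simplified_cluster_name"]
        for row in csv.DictReader(keys_file)
    }
    return {
        row["sample_name"]: keys[row["cluster_ix"]]
        for row in csv.DictReader(metadata_file)
    }


def label_match(best_match_id, names):
    """Return the simplified name for a matched library entry, if it has one."""
    return names.get(best_match_id, "unassigned")


names = load_cluster_names(io.StringIO(METADATA_CSV), io.StringIO(KEYS_CSV))
print(label_match("lib_003", names))
```

With real data, the `io.StringIO` objects would be replaced by the opened CSV files, and `best_match_id` would be whatever identifier the matching step reports for the top hit.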

ftir_metadata_clusters.csv cluster_keys_simplified.csv simplified_cluster_grid

Shreyas-Patankar commented 3 years ago

To add some more context: this process is based on the guidelines in Primpke et al., 2018, and is motivated by the fact that FTIR spectra from different materials can sometimes be too similar to distinguish because of instrument limitations.

wincowgerDEV commented 11 months ago

This is actually built in now with the medoid clustered library.