wconnell / genplasmid

MIT License

gLM2 Model Fine-Tuning and Embedding Evaluation #1

Open wconnell opened 1 month ago

wconnell commented 1 month ago

Objectives

Train fine-tuned versions of gLM2:

Tasks

Model Fine-Tuning (v1):

Embedding Extraction

Baseline Methods

Evaluation Metric

Analysis

wconnell commented 1 month ago

I created two evaluations: one for common CDS features and one for common payload genes. In both, we expect the embeddings to cluster by label. We take the embeddings, apply k-means clustering with the number of clusters set to the number of label categories, and then compute the normalized mutual information (NMI) and adjusted Rand index (ARI) between the cluster assignments and the label categories.
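For anyone who wants to reproduce this kind of score, here is a minimal sketch of the procedure described above, assuming scikit-learn and NumPy. The function name `clustering_agreement`, the `seed` and `n_init` settings, and the synthetic data are illustrative assumptions, not the code used for the results in this issue.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def clustering_agreement(embeddings, labels, seed=0):
    """Cluster embeddings with k-means and score the clusters against labels.

    Hypothetical helper: k is set to the number of distinct labels, and the
    resulting cluster assignments are compared to the labels with NMI and ARI.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # One cluster per label category.
    n_clusters = len(np.unique(labels))

    # n_init and random_state are assumed settings, chosen for repeatability.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(embeddings)

    return {
        "nmi": normalized_mutual_info_score(labels, cluster_ids),
        "ari": adjusted_rand_score(labels, cluster_ids),
    }


# Synthetic stand-in for real embeddings: three well-separated 2-D blobs
# with placeholder label names (not real CDS features or genes).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_labels = np.repeat(["class_a", "class_b", "class_c"], 50)
toy_embeddings = np.repeat(centers, 50, axis=0) + rng.normal(scale=0.5, size=(150, 2))

scores = clustering_agreement(toy_embeddings, toy_labels)
print(scores)
```

Both metrics are invariant to how the cluster IDs are numbered, so the k-means output needs no alignment with the label names. With real embeddings, `toy_embeddings` would be replaced by one vector per sequence and `toy_labels` by the corresponding feature or gene annotation.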

Results

| Model | Eval | NMI | ARI |
| --- | --- | --- | --- |
| gLM2_150-pretrained | CDS-curated-features | 0.1950 | 0.1156 |
| gLM2_150-finetuned | CDS-curated-features | 0.2098 | 0.1203 |
| gLM2_150-finetuned-augment | CDS-curated-features | 0.2152 | 0.1510 |
| plasmidGPT | CDS-curated-features | 0.1097 | 0.0609 |
| one-hot-encoding | CDS-curated-features | 0.1806 | 0.0517 |
| gLM2_150-pretrained | common-entrez-gene | 0.2429 | 0.0802 |
| gLM2_150-finetuned | common-entrez-gene | 0.3070 | 0.1564 |
| gLM2_150-finetuned-augment | common-entrez-gene | 0.3072 | 0.1372 |
| plasmidGPT | common-entrez-gene | 0.3024 | 0.1056 |
| one-hot-encoding | common-entrez-gene | 0.2581 | 0.0952 |

pretrained PCA

[Two screenshots of the pretrained-embedding PCA plots, taken 2024-10-10; images not included]

finetuned PCA

[Two screenshots of the finetuned-embedding PCA plots, taken 2024-10-10; images not included]