wconnell / genplasmid

MIT License

gLM2 Model Fine-Tuning and Embedding Evaluation #1

Open wconnell opened 4 hours ago

wconnell commented 4 hours ago

Objectives

Train fine-tuned versions of gLM2:

Tasks

Model Fine-Tuning (v1):

Embedding Extraction

Baseline Methods

Evaluation Metric

Analysis

wconnell commented 2 hours ago

I created two evaluations: one for common CDS features and one for common payload genes. In both, we expect the embeddings to cluster by these labels. We take the embeddings, apply k-means clustering with the number of clusters set to the number of label categories, and then compute the normalized mutual information (NMI) and adjusted Rand index (ARI) between the cluster assignments and the label categories.
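For reference, here is a minimal standard-library sketch of the two scores, given one label and one cluster id per embedding. It is not the code used for the results in this thread. The function names `nmi` and `ari` are illustrative, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the comment does not say which variant was used. The cluster ids would come from the k-means step described above.

```python
from collections import Counter
from math import comb, log


def _entropy(counts, n):
    """Shannon entropy (in nats) of a partition, given its group sizes."""
    return -sum((c / n) * log(c / n) for c in counts if c > 0)


def nmi(labels, clusters):
    """Normalized mutual information between two labelings.

    Assumption: normalize by the arithmetic mean of the two entropies.
    """
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    row = Counter(labels)
    col = Counter(clusters)
    mi = sum(
        (nij / n) * log(nij * n / (row[a] * col[b]))
        for (a, b), nij in joint.items()
    )
    h_mean = (_entropy(row.values(), n) + _entropy(col.values(), n)) / 2
    # Both partitions trivial (one group each): treat as perfect agreement.
    return mi / h_mean if h_mean > 0 else 1.0


def ari(labels, clusters):
    """Adjusted Rand index: pair-counting agreement corrected for chance."""
    n = len(labels)
    if n < 2:
        return 1.0
    joint = Counter(zip(labels, clusters))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_row = sum(comb(c, 2) for c in Counter(labels).values())
    sum_col = sum(comb(c, 2) for c in Counter(clusters).values())
    expected = sum_row * sum_col / comb(n, 2)
    max_index = (sum_row + sum_col) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


# Toy example (hypothetical labels, not data from this repo):
# cluster ids are arbitrary, so a relabeled perfect split still scores ~1.0.
toy_labels = ["promoter", "promoter", "ori", "ori"]
toy_clusters = [1, 1, 0, 0]
print(nmi(toy_labels, toy_clusters), ari(toy_labels, toy_clusters))
```

Both scores are invariant to how the cluster ids are numbered, which is why they suit comparing k-means output against categorical labels. ARI can go negative when agreement is worse than chance.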

Results

| Model | Eval | NMI | ARI |
| --- | --- | --- | --- |
| gLM2_150-pretrained | CDS-curated-features | 0.1950 | 0.1156 |
| gLM2_150-finetuned | CDS-curated-features | 0.2098 | 0.1203 |
| gLM2_150-pretrained | common-entrez-gene | 0.2429 | 0.0802 |
| gLM2_150-finetuned | common-entrez-gene | 0.3070 | 0.1564 |
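To read the table as fine-tuned minus pretrained differences, a small sketch like the following may help. The scores are copied from the rows above. The dictionary layout and the helper name `finetune_deltas` are illustrative only. The thread reports one score per configuration, so the differences carry no variance estimate.

```python
# (NMI, ARI) per eval and model variant, copied from the results table.
results = {
    ("CDS-curated-features", "pretrained"): (0.1950, 0.1156),
    ("CDS-curated-features", "finetuned"): (0.2098, 0.1203),
    ("common-entrez-gene", "pretrained"): (0.2429, 0.0802),
    ("common-entrez-gene", "finetuned"): (0.3070, 0.1564),
}


def finetune_deltas(results):
    """Return {eval: (delta_NMI, delta_ARI)} as finetuned minus pretrained."""
    deltas = {}
    for (eval_name, variant), scores in results.items():
        if variant != "finetuned":
            continue
        base = results[(eval_name, "pretrained")]
        deltas[eval_name] = tuple(round(f - p, 4) for f, p in zip(scores, base))
    return deltas


deltas = finetune_deltas(results)
for eval_name, (d_nmi, d_ari) in deltas.items():
    print(f"{eval_name}: NMI {d_nmi:+.4f}, ARI {d_ari:+.4f}")
```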

pretrained PCA

[Two screenshots of PCA plots, captured 2024-10-10 at 11:45:51 AM and 11:46:25 AM]

finetuned PCA

[Two screenshots of PCA plots, captured 2024-10-10 at 11:47:02 AM and 11:47:16 AM]