Open wconnell opened 1 month ago
I created two evaluations, one for common CDS features and one for common payload genes. In both, we expect the embeddings to cluster by their labels. We take the embeddings, apply k-means clustering (with the number of clusters set to the number of label categories), and then calculate the normalized mutual information (NMI) and adjusted Rand index (ARI) between the cluster assignments and the label categories.
Model | Eval | NMI | ARI |
---|---|---|---|
gLM2_150-pretrained | CDS-curated-features | 0.1950 | 0.1156 |
gLM2_150-finetuned | CDS-curated-features | 0.2098 | 0.1203 |
gLM2_150-finetuned-augment | CDS-curated-features | 0.2152 | 0.1510 |
plasmidGPT | CDS-curated-features | 0.1097 | 0.0609 |
one-hot-encoding | CDS-curated-features | 0.1806 | 0.0517 |
gLM2_150-pretrained | common-entrez-gene | 0.2429 | 0.0802 |
gLM2_150-finetuned | common-entrez-gene | 0.3070 | 0.1564 |
gLM2_150-finetuned-augment | common-entrez-gene | 0.3072 | 0.1372 |
plasmidGPT | common-entrez-gene | 0.3024 | 0.1056 |
one-hot-encoding | common-entrez-gene | 0.2581 | 0.0952 |
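
For anyone who wants to reproduce the numbers, the evaluation described above can be sketched roughly as follows. This is a minimal sketch, assuming scikit-learn and NumPy; the function name `clustering_eval`, the `seed` parameter, and the `n_init` setting are illustrative assumptions, not the actual evaluation code from this repo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def clustering_eval(embeddings, labels, seed=0):
    """Cluster embeddings with k-means and score agreement with the labels.

    Hypothetical helper: the number of clusters is set to the number of
    distinct label categories, as described in the issue text.
    """
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    n_clusters = len(np.unique(labels))

    # n_init and random_state are assumptions chosen for reproducibility.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    clusters = kmeans.fit_predict(embeddings)

    # Both metrics are invariant to how cluster IDs are numbered.
    return {
        "NMI": normalized_mutual_info_score(labels, clusters),
        "ARI": adjusted_rand_score(labels, clusters),
    }


if __name__ == "__main__":
    # Synthetic stand-in for real model embeddings: three separated blobs.
    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [20.0, 20.0], [-20.0, 20.0]])
    X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
    y = np.repeat(["feature_a", "feature_b", "feature_c"], 30)
    print(clustering_eval(X, y))
```

Because k-means depends on its initialization, fixing the seed (or averaging over several seeds) is probably worth doing before reading much into small differences between rows of the table.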
Objectives
Train fine-tuned versions of gLM2:
Tasks
Model Fine-Tuning (v1):
Embedding Extraction
Baseline Methods
Evaluation Metric
Analysis