zenoxygen / bayespam

A simple bayesian spam classifier written in Rust.
https://crates.io/crates/bayespam
MIT License

Docs Outdated & Classifier oddity #3

Open franzos opened 5 months ago

franzos commented 5 months ago

Hi there, awesome crate! I trained it on roughly 18k messages and it works really well.

extern crate bayespam;
extern crate csv;
extern crate serde;

use bayespam::classifier::Classifier;
use serde::Deserialize;
use std::fs::File;

#[derive(Debug, Deserialize)]
struct Message {
    message: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file_path = "my_super_model.json";

    // Create a new classifier with an empty model
    let mut classifier = Classifier::new(file_path, true);

    ...

    // Serialize the model and save it as JSON into a file
    classifier.save(file_path);

    Ok(())
}

In this example I'm training a new model:

  1. I supply a file path and indicate that this is a new model: Classifier::new(file_path, true)
  2. Then I supply the same file path again to save the model: classifier.save(file_path)

I didn't look too closely, but wouldn't something like this be enough:

  1. Classifier::new(file_path) <- default to new model
  2. classifier.save() <- no need to supply path

If I have time, I'll provide a PR.
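
A minimal sketch of the shape proposed above, where the path is given once at construction and `save()` takes no arguments. This is not bayespam's API: `PathBoundModel`, `open`, and the `model_json` field are hypothetical names, and the string field only stands in for the serialized model.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a classifier that remembers where its model lives.
/// `model_json` is a placeholder for the serialized model, not a real format.
pub struct PathBoundModel {
    path: PathBuf,
    model_json: String,
}

impl PathBoundModel {
    /// Start from an empty model. The path is only stored; nothing is read.
    pub fn new<P: AsRef<Path>>(path: P) -> Self {
        Self {
            path: path.as_ref().to_path_buf(),
            model_json: String::from("{}"),
        }
    }

    /// Load an existing model from `path`. Keeping this separate from `new`
    /// removes the need for a boolean "is this a new model" flag.
    pub fn open<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        let model_json = fs::read_to_string(path.as_ref())?;
        Ok(Self {
            path: path.as_ref().to_path_buf(),
            model_json,
        })
    }

    /// Write the model back to the path given at construction.
    pub fn save(&self) -> io::Result<()> {
        fs::write(&self.path, &self.model_json)
    }

    /// The path this model is bound to.
    pub fn path(&self) -> &Path {
        &self.path
    }
}
```

With this shape, training a new model would read `PathBoundModel::new("my_super_model.json")` followed by `save()`, and the path cannot drift between the two calls.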

Houski commented 2 months ago

@franzos If you have a better JSON model than the default, maybe you could open a pull request with it? :D

franzos commented 2 months ago

I'm not sure that it's better. I mostly trained it on data from form submissions collected over the years, of which 99% were spam. It now seems to catch about 90% of it; with some additional rules, like failing all messages with fewer than 10 characters, it's more like 95%.
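
A minimal sketch of the short-message rule mentioned above, layered in front of a classifier verdict. The names `reject_message` and `classifier_says_spam` are hypothetical, the classifier is passed in as a closure so the block does not depend on any particular library, and trimming whitespace before counting is an assumption.

```rust
/// Minimum message length in characters; 10 is the threshold quoted above.
const MIN_CHARS: usize = 10;

/// Returns true when the message should be rejected.
///
/// `classifier_says_spam` is a hypothetical hook, for example a closure that
/// wraps a trained classifier and returns true for spam.
pub fn reject_message<F>(message: &str, classifier_says_spam: F) -> bool
where
    F: Fn(&str) -> bool,
{
    // Count characters rather than bytes so non-ASCII text is measured fairly.
    // Trimming surrounding whitespace first is an assumption of this sketch.
    if message.trim().chars().count() < MIN_CHARS {
        return true;
    }
    // Long enough: defer to the classifier.
    classifier_says_spam(message)
}
```

Running the cheap length check first means the classifier is only consulted for messages that pass it.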

my_super_model_v0.0.1.json

There are a couple of things I'm working on:

I suppose with IP filtering, email testing, some basic rules (character repetition) and sieve, it might get to 99.5%. That's before LLMs though. Interestingly enough, these aren't in use much at all yet... Once they are, I guess this will be useless.
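
One way the character-repetition rule mentioned above could look, as a sketch only. The function names and the `max_run` threshold are hypothetical; the thread does not say how repetition is measured, so this version simply flags a long run of one repeated character.

```rust
/// Length of the longest run of one repeated character in `message`.
pub fn longest_run(message: &str) -> usize {
    let mut longest = 0;
    let mut current = 0;
    let mut previous: Option<char> = None;
    for c in message.chars() {
        if Some(c) == previous {
            // Same character as before: extend the current run.
            current += 1;
        } else {
            // Different character: start a new run of length one.
            current = 1;
            previous = Some(c);
        }
        longest = longest.max(current);
    }
    longest
}

/// Returns true when any single character repeats more than `max_run` times
/// in a row. `max_run` is a tunable assumption, not a value from the thread.
pub fn has_excessive_repetition(message: &str, max_run: usize) -> bool {
    longest_run(message) > max_run
}
```

A rule like this would sit next to the length check as a pre-filter, with the threshold tuned against real submissions.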