zxsecurity / steamer

For importing, searching, and managing public password breach data
MIT License
159 stars 17 forks

Generic Importer #11

Open mlr0p opened 4 years ago

mlr0p commented 4 years ago

It would be beneficial to have a generic importer (or at least something close to generic) that can parse and import arbitrary dumps in any format.

SoftPoison commented 4 years ago

Expanding on this: this is necessary because there is a lot of duplicated code across the importers, which makes it quite daunting to write a new importer or fix issues. Having some sort of generic importer or importing method should both reduce that duplication and lower the barrier to entry for newcomers.

Go has interfaces and a form of inheritance through embedding (https://golang.org/doc/effective_go.html#embedding), so we could use those to structure an importer. In terms of how this should work on a technical level, here's what I propose:

importers/util/common.go

This file should be something of the form:

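package util

import (
    "github.com/cheggaaa/pb" // assumed import path for pb.ProgressBar
    "gopkg.in/mgo.v2"        // assumed import path for mgo.Session
)

// LineParser is implemented once per dump format.
type LineParser interface {
    // ParseLine turns one line of a dump into zero or more records.
    ParseLine(line string) ([]interface{}, error)
    // EstimateCount estimates how many creds a line holds (for the progress bar).
    EstimateCount(line string) (int64, error)
}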

type Importer struct {
    parser LineParser
    bar *pb.ProgressBar
    numThreads int
    threader chan string
    doner chan bool
    mongo *mgo.Session
    verbose bool
    fileName string
    // ... other variables that it needs
}

func MakeImporter(parser LineParser, verbose bool /*, other variables... */) *Importer {
    // basically just do what the main funcs of the current importers do, but in here
    // should just initialise everything, but not run the main loop (yet)
    // should set up the progress bar too (if verbose enabled)
    // creating the progress bar should call parser.EstimateCount(line) to get an estimate of how many creds are on that line
    return &Importer{parser: parser, verbose: verbose /* , other fields ... */}
}

func (i *Importer) Run() {
    // this part should have the threader <- r.ReadLine() loop and the <- doner loop
}

func (i *Importer) importLine() {
    // basically just copy paste the current importLine() functionality, but call i.parser.ParseLine(line) instead for the parsing (and handle any errors)
}

importers/importer-(sql-)template.go

These two files should show how easy the new system will be. They should be somewhat of the form:

package main

import (
    "github.com/zxsecurity/steamer/importers/util"
    "gopkg.in/mgo.v2/bson" // assumed import path for bson.ObjectId
)

type GenericData struct {
    Id           bson.ObjectId `json:"id" bson:"_id,omitempty"`
    MemberID     int           `bson:"memberid"`
    Email        string        `bson:"email"`
    Liame        string        `bson:"liame"`
    PasswordHash string        `bson:"passwordhash"`
    Password     string        `bson:"password"`
    Breach       string        `bson:"breach"`
}

type TemplateLineParser struct{}

func (t TemplateLineParser) ParseLine(line string) ([]interface{}, error) {
    data := make([]interface{}, 0)
    // code to parse a line into its GenericData blobs and append them to data

    return data, nil
}

func (t TemplateLineParser) EstimateCount(line string) (int64, error) {
    // code to estimate how many pieces of data are in a line (for the progress bar)
    return 1, nil // placeholder: assume one record per line
}

func main() {
    parser := TemplateLineParser{}

    // other setup code ...

    importer := util.MakeImporter(parser /* , other args ... */)
    importer.Run()
}
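
As a rough illustration of what a filled-in parser could look like, here is a sketch for dumps with one `email:password` pair per line. `ColonLineParser`, `Record` and `reverse` are hypothetical names, `Record` is a trimmed-down `GenericData` without the bson tags so the example runs with only the standard library, and filling `Liame` with the reversed email is an assumption about what that field holds.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Record is a trimmed-down GenericData for this sketch.
type Record struct {
	Email    string
	Liame    string // assumption: the email reversed
	Password string
	Breach   string
}

// reverse returns s with its runes in reverse order.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// ColonLineParser is a hypothetical parser for "email:password" dumps.
type ColonLineParser struct {
	Breach string // name stored on every record this parser produces
}

// ParseLine splits on the first colon only, so passwords that contain
// colons are kept intact.
func (p ColonLineParser) ParseLine(line string) ([]interface{}, error) {
	line = strings.TrimSpace(line)
	idx := strings.Index(line, ":")
	if idx <= 0 {
		return nil, errors.New("expected email:password")
	}
	email := line[:idx]
	record := Record{
		Email:    email,
		Liame:    reverse(email),
		Password: line[idx+1:],
		Breach:   p.Breach,
	}
	return []interface{}{record}, nil
}

// EstimateCount assumes exactly one credential per line.
func (p ColonLineParser) EstimateCount(line string) (int64, error) {
	return 1, nil
}

func main() {
	parser := ColonLineParser{Breach: "ExampleBreach"}
	records, err := parser.ParseLine("alice@example.com:hunter2\n")
	fmt.Printf("%+v %v\n", records, err)
}
```

Everything format-specific lives in ParseLine, so a new importer would only differ in this one type.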

SoftPoison commented 4 years ago

The plan to close this issue:

  1. Implement modular parsing to reduce code duplication
  2. Move the old code to the new system
  3. Using the modular parsing, implement a generic parser that should work for most things

One pull request should be made for 1. and 2. and a separate one for 3.
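
For step 3, one possible starting point is a parser that tries a few common delimiters and looks for a field that resembles an email address. This is only a heuristic sketch, not a settled design: `GenericLineParser`, `Credential`, `looksLikeEmail` and the delimiter list are all assumptions, and it treats everything after the email as a single secret without telling passwords and hashes apart.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Credential is a hypothetical minimal record.
type Credential struct {
	Email  string
	Secret string // password or hash; the sketch does not distinguish them
}

// candidateDelimiters is an assumed list, tried in this order.
var candidateDelimiters = []string{"\t", ";", ":", ",", "|"}

// looksLikeEmail is a deliberately loose check: something before an "@",
// a dot in the domain, and no delimiter or space characters.
func looksLikeEmail(s string) bool {
	at := strings.Index(s, "@")
	if at <= 0 || strings.ContainsAny(s, " \t:;,|") {
		return false
	}
	domain := s[at+1:]
	return strings.Contains(domain, ".") && !strings.Contains(domain, "@")
}

// GenericLineParser guesses the layout of each line independently.
type GenericLineParser struct{}

// ParseLine returns the first email-like field together with everything
// that follows it on the line.
func (GenericLineParser) ParseLine(line string) ([]interface{}, error) {
	line = strings.TrimSpace(line)
	for _, d := range candidateDelimiters {
		fields := strings.Split(line, d)
		if len(fields) < 2 {
			continue
		}
		for i, f := range fields {
			if !looksLikeEmail(f) {
				continue
			}
			secret := strings.Join(fields[i+1:], d)
			if secret == "" {
				continue
			}
			return []interface{}{Credential{Email: f, Secret: secret}}, nil
		}
	}
	return nil, errors.New("no email/secret pair found")
}

// EstimateCount assumes one credential per line.
func (GenericLineParser) EstimateCount(line string) (int64, error) {
	return 1, nil
}

func main() {
	parser := GenericLineParser{}
	for _, line := range []string{
		"alice@example.com:hunter2",
		"42,bob@example.com,5f4dcc3b5aa765d61d8327deb882cf99",
		"no credentials here",
	} {
		records, err := parser.ParseLine(line)
		fmt.Println(records, err)
	}
}
```

Guessing per line keeps the parser stateless, at the cost of redoing the detection on every line. Sniffing the delimiter once from a sample of the file would be an alternative.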