wschroder / NewGeorgia-ElectionFinance

New Georgia Project - Election Finance Open Access Project

Web-scraping component (first pass) #7

Open wschroder opened 5 years ago

wschroder commented 5 years ago

First pass of the web-scraping component should do the following:

I'm leaning toward the former (just saving it as raw, unprocessed data to a "RawData" table), so that we can have a separate module which handles the complications that may arise, such as data-cleanup, identifying & filtering duplicate data, etc. (Separation of Concerns)
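
A minimal sketch of what saving to a "RawData" table could look like, assuming SQLite as the store. The column names (`source_url`, `scraped_at`, `raw_text`) and the function names `init_db` and `save_raw` are hypothetical and not taken from this repo; only the table name comes from the comment above.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per scraped line of text, stored untouched.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS RawData ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "source_url TEXT NOT NULL, "
        "scraped_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP, "
        "raw_text TEXT NOT NULL)"
    )


def save_raw(conn, source_url, raw_lines):
    # No cleanup and no duplicate filtering here; that is left to a separate module.
    rows = [(source_url, line) for line in raw_lines]
    conn.executemany(
        "INSERT INTO RawData (source_url, raw_text) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

The scraper would only ever call `save_raw`, so duplicates are stored as-is and the cleanup module can change without touching the scraping code.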

bbrewington commented 5 years ago

By "raw, unprocessed data" do you mean the raw html?

wschroder commented 5 years ago

By "raw, unprocessed data", I mean the text that is obtained as output from the web-scraping, so it would not contain any HTML. Example: "Bob Jones, 123 Elm Avenue, $100.00, ..." From a "web-scraping" perspective, it might look processed, but there is more work to be done, such as matching up the foreign key references (donor, candidate, etc.) and recognizing and filtering out duplicate data.
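
A rough sketch of the kind of downstream cleanup this leaves for a separate module. It assumes each raw line is comma-separated and looks only at the first three fields shown in the example (name, address, amount); the field names and the functions `parse_raw` and `dedupe` are hypothetical. The naive comma split would break on addresses or amounts that contain commas, so a real version would need a sturdier parser.

```python
from decimal import Decimal


def parse_raw(raw_text):
    # Assumption: the first three comma-separated fields are name, address, amount.
    name, address, amount = [part.strip() for part in raw_text.split(",")[:3]]
    return {
        "donor": name,
        "address": address,
        "amount": Decimal(amount.lstrip("$")),
    }


def dedupe(records):
    # Treat records as duplicates when name and address match case-insensitively
    # and the amount is equal; the first occurrence is kept.
    seen = set()
    unique = []
    for rec in records:
        key = (rec["donor"].lower(), rec["address"].lower(), rec["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Matching the foreign key references (donor, candidate, etc.) would be a further step after parsing and is not shown here.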