Open denten opened 10 years ago
Moving syllabus directories into a hopper now. Many (most?) of the documents don't have extensions, or have incorrect ones, so we'll need a format detector.
Most of what we have is PDF, DOC(X), and HTML. The first three (PDF, DOC, DOCX) have magic numbers that we can match on (there are not enough non-Word MS Office files to matter); HTML will be the catch-all. It won't be perfect, but it is good enough for our use for now. It will need enhancement when we process the Cohen set.
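For reference, a rough sketch of what such a detector could look like in Python. The function name `sniff_format` and the returned labels are made up for illustration. It assumes every OLE2 compound file is a legacy `.doc` and every ZIP container is a `.docx`, which matches the "not enough non-Word MSOffice files to matter" simplification but will mislabel other ZIP or OLE2 files.

```python
# Leading byte signatures ("magic numbers") for the formats we care about.
PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file, used by legacy .doc
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header; .docx is a ZIP container


def sniff_format(head: bytes) -> str:
    """Guess a format label from the first bytes of a file.

    Assumption: any OLE2 file is treated as DOC and any ZIP as DOCX.
    Anything unrecognized falls through to the HTML catch-all.
    """
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"
    return "html"  # catch-all, as proposed in this issue


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of a file and sniff them."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(8))
```

Because only the first few bytes are read, file names and extensions are ignored entirely, which is the point given how unreliable ours are.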
I opened a separate issue on the sniffer. The ingestor should sniff, hash, and put in the right directory. How are we going to create the log file with crawl data?
Files should be stored under the hash of their binary contents, with a log manifest containing contextual information such as
URL, date of creation, date of capture, ingested flag, and hash.
The directory should be: 14/45/.......pdf
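One possible shape for the hash, directory, and manifest steps, as a sketch only. Everything here is an assumption rather than a decision: SHA-256 as the hash, two directory levels of two hex characters each, the full digest as the file name, a JSON Lines manifest, and the names `sharded_path` and `ingest`.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def sharded_path(digest: str, ext: str) -> Path:
    """Map a hex digest to <first 2 chars>/<next 2 chars>/<digest>.<ext> (assumed layout)."""
    return Path(digest[:2]) / digest[2:4] / f"{digest}.{ext}"


def ingest(data: bytes, ext: str, url: str, created: Optional[str],
           captured: str, root: Path, manifest: Path) -> dict:
    """Store one document under its content hash and append a manifest entry."""
    digest = hashlib.sha256(data).hexdigest()  # hash of the binary contents
    target = root / sharded_path(digest, ext)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is only written once
        target.write_bytes(data)
    entry = {
        "url": url,
        "created": created,    # date of creation, if known
        "captured": captured,  # date of capture
        "ingested": True,
        "hash": digest,
    }
    # One JSON object per line, so the log can be appended to during a crawl.
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

The `ext` argument would come from the sniffer rather than from the original file name. Keying on the content hash means the same document captured from two URLs gets two manifest lines but only one stored file.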
Use a sniffer to check file type?