xpmethod / opensyllabus

smarter fs #45

Open denten opened 10 years ago

denten commented 10 years ago

hashlog

Files should be stored under the hash of their binary content, with a log manifest containing contextual information such as:

URL, date of creation, date of capture, ingested flag, hash.

The directory should be: 14/45/.......pdf
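
A rough sketch of how a path like that could be derived, in case it helps the discussion. Everything here is an assumption rather than a decision: SHA-256 as the hash, two directory levels taken from the first four hex characters, the full digest as the filename, and the hypothetical helper name `sharded_path`.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(data: bytes, ext: str = "") -> PurePosixPath:
    """Build a two-level, hash-derived relative path for a file's bytes.

    Assumptions (not settled in this thread): SHA-256 as the hash, and
    the full hex digest reused as the filename.
    """
    digest = hashlib.sha256(data).hexdigest()
    # The first two hex characters name the top directory, the next two
    # the subdirectory, so no single directory grows too large.
    return PurePosixPath(digest[:2], digest[2:4], digest + ext)


if __name__ == "__main__":
    print(sharded_path(b"example syllabus bytes", ".pdf"))
```

Identical files always map to the same path, so duplicates collapse for free; the extension would come from whatever the sniffer reports.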

Use a sniffer to check the file type?

alexduryee commented 10 years ago

Moving syllabus directories into a hopper now. Many (most?) of the documents don't have extensions, or have incorrect ones, so we'll need a format detector.

Most of what we have is PDF, DOC(X), and HTML. The first three (PDF, DOC, DOCX) have magic numbers we can match on (there are not enough non-Word MS Office files to matter); HTML will be the catch-all. It won't be perfect, but it is good enough for our purposes for now. It will need enhancement when we process the Cohen set.
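
A minimal sketch of the magic-number check described above, not the project's actual detector. The function name `sniff` and the returned labels are made up for illustration. The signatures themselves are standard: PDF files start with `%PDF-`, legacy DOC is an OLE2 compound file, and DOCX is a ZIP container.

```python
# Leading bytes for the three formats that have usable signatures.
PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # legacy MS Office (DOC, but also XLS/PPT)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP container (DOCX, but also any other zip)


def sniff(head: bytes) -> str:
    """Guess a format label from the first bytes of a file.

    Anything without a recognised signature falls through to "html",
    mirroring the catch-all described in the comment above.
    """
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"
    return "html"


if __name__ == "__main__":
    print(sniff(b"%PDF-1.4\n..."))
    print(sniff(b"<!DOCTYPE html><html>"))
```

The known weak spots match the caveats above: any OLE2 file is labelled `doc`, any ZIP is labelled `docx`, and plain text or unknown binaries end up as `html`.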

denten commented 10 years ago

I opened a separate issue on the sniffer. The ingestor should sniff, hash, and put in the right directory. How are we going to create the log file with crawl data?
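
One possible shape for the log file, as a starting point only. It assumes a JSON Lines manifest with one record per ingested file; the field names, SHA-256, and the helper names `manifest_entry` and `append_entry` are all hypothetical, chosen to mirror the fields listed at the top of this issue.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional, TextIO


def manifest_entry(url: str, data: bytes, captured: datetime,
                   created: Optional[datetime] = None,
                   ingested: bool = False) -> dict:
    """Build one manifest record for a crawled file (field names are assumed)."""
    return {
        "url": url,
        # The creation date is often unknown for crawled files, so allow None.
        "created": created.isoformat() if created else None,
        "captured": captured.isoformat(),
        "ingested": ingested,
        "hash": hashlib.sha256(data).hexdigest(),
    }


def append_entry(log: TextIO, entry: dict) -> None:
    """Append a record as a single JSON line."""
    log.write(json.dumps(entry, sort_keys=True) + "\n")


if __name__ == "__main__":
    import sys
    entry = manifest_entry("http://example.edu/syllabus", b"file bytes",
                           captured=datetime.now(timezone.utc))
    append_entry(sys.stdout, entry)
```

An append-only file keeps the crawler and the ingestor loosely coupled: the crawler writes records as it captures, and the ingestor can later write a new record with the ingested flag set rather than editing old lines.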