Refine document extraction

maebert commented 8 years ago

Collecting a few examples here where things are not working well yet:

http://www.urbandictionary.com/define.php?term=brobdingnagian gives me the doc

"Type your email address below to get our Emails are sent from daily@urbandictionary.com. We'll never spam you."
http://thescene.whro.org/hear-cool-stuff has title: None
https://en.wiktionary.org/wiki/defenestration gives doc

"First attested circa "
http://dictionary.reference.com/browse/defenestration gives doc

"follow Dictionary.com follow Dictionary.com Geddie, for his part, fought his Slavata immediately resolved on refuting this work, written by the originator of the A tablet stating that the Many of the stormy meetings of the Bohemian nobles that preceded the 1620, \"the action of throwing out of a window,\" from Latin "
http://www.defenestration.org/ gives

dedicated to the memory of Jesse Nelson... © 1997-2010 Defenestration

So, in general, I think we should ignore documents that have fewer than 10 spaces. Highly fragmented sites like dictionaries mostly don't work particularly well yet.
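
A minimal sketch of that filter, reading the suggestion as "skip any extracted doc with fewer than 10 spaces". The function name `is_too_fragmented` and the `min_spaces` parameter are made up for illustration and are not part of serapis:

```python
def is_too_fragmented(doc, min_spaces=10):
    """Return True when an extracted doc looks too short or fragmented to keep.

    Hypothetical helper: `min_spaces` is the threshold suggested above,
    counted as plain ASCII space characters in the extracted text.
    """
    if doc is None:
        return True
    return doc.count(" ") < min_spaces


# The defenestration.org result quoted above contains 9 spaces,
# so it would be dropped under this rule.
footer = "dedicated to the memory of Jesse Nelson... © 1997-2010 Defenestration"
print(is_too_fragmented(footer))
```

Counting spaces is only a cheap proxy for word count. It would also drop truncated results such as "First attested circa ", which hides the parsing problem rather than fixing it.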

clarecorthell commented 8 years ago

None of those render FRDs. I wasn't developing for page types that don't. I did not work with dictionaries because I assume we will address them on specific bases (urbandictionary). We don't load javascript or bypass any overlays, which I was aware of.

Which documents should we ignore? I'm not clear on your comment.

clarecorthell commented 8 years ago

The example missing a title indeed has no title in the html whatsoever.

maebert commented 8 years ago

I think we should ignore e.g. the last one (defenestration.org). But the problem is not with the sites; it's a technical one, I think: parsing Wiktionary, for example, stops at the first link in the text.
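
One possible cause (an assumption, not verified against the serapis code) is reading only an element's leading text node instead of all of its text. The sketch below uses the standard library's `xml.etree.ElementTree` purely for illustration. The snippet and the two helper names are invented; the snippet is not the real Wiktionary markup.

```python
import xml.etree.ElementTree as ET

# Invented snippet: a paragraph whose text is interrupted by a link.
SNIPPET = '<p>First attested circa <a href="#">a linked year</a>, per the entry.</p>'


def text_before_first_child(element):
    """Return only the text that precedes the first child element."""
    return element.text or ""


def full_text(element):
    """Return all text inside the element, including text in and after children."""
    return "".join(element.itertext())


paragraph = ET.fromstring(SNIPPET)
print(repr(text_before_first_child(paragraph)))  # stops at the link
print(repr(full_text(paragraph)))                # keeps the whole sentence
```

If the extractor does something like the first helper, that would explain docs that end right where a link begins.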

There's still this:

http://www.slate.com/blogs/xx_factor/2015/04/30/what_is_the_dad_bod_america_s_leading_expert_explains.html

"Photo by Michael Yarish/AMC The youth of America have been whispering about something they call the \u201cdad bod\u201d Amanda Hess is a \u201cIn case you haven't noticed lately, girls are all about that dad bod,\u201d Pearson wrote. \u201cThe dad bod is a nice balance between a beer gut and working out. The dad bod says, \u2018I go to the gym occasionally, but I also drink heavily on the weekends and enjoy eating eight slices of pizza at a time.\u2019 \u201d \u201cThere is just something about the dad bod,\u201d Pearson continued, \"that makes boys seem more human, natural, and attractive.\u201d Pearson\u2019s piece has since emerged as the definitive primer on the dad bod, educating the ",

http://www.novaroma.org/nr/Lararium

"The forms of "

http://www.wikihow.com/Make-a-Lararium

" Categories: Thanks to all authors for creating a page that has been read 17,717 times."

http://home.scarlet.be/mauk.haemers/collegium_religionis/lararium.htm

"\r\n \u00a0\r\n "

http://www.philosophybasics.com/movements_aristotelianism.html

"His immediate followers were also known as the Aristotelian Aristotelian Although much of The distinctively Aristotelian idea of ",

maebert commented 8 years ago

Mh, may I suggest using html2text instead? It's pretty fast and does a very solid job on these examples, e.g. on the last one:

import html2text

# `html` is the page's raw HTML as a string
parser = html2text.HTML2Text()
parser.ignore_links = parser.ignore_images = parser.ignore_emphasis = True
print(parser.handle(html))

General | By Branch/Doctrine | By Historical Period | By Movement/School | By Individual Philosopher

A huge subject broken down into manageable chunks

Random Quote of the Day:

By Movement / School > Ancient > Aristotelianism

Aristotelianism is a school or tradition of philosophy from the Socratic (or Classical) period of ancient Greece, that takes its defining inspiration from the work of the 4th Century B.C. philosopher Aristotle.

His immediate followers were also known as the Peripatetic School (meaning itinerant or walking about, after the covered walkways at the Lyceum in Athens where they often met), and among the more prominent members (other than Aristotle himself) were Theophrastus (322 - 288 B.C.), Eudemus of Rhodes (c. 370 - 300 B.C.), Dicaearchus (c. 350 - 285 B.C.), Strato of Lampsacus (288 - 269 B.C.), Lyco of Troas (c. 269 - 225 B.C.), Aristo of Ceos (c. 225 - 190 B.C.), Critolaus (c. 190 - 155 B.C.), Diodorus of Tyre (c. 140 B.C.), Erymneus (c. 110 B.C.) and Alexander of Aphrodisias (c. 200 A.D.).

(...)

Bonus points: it was written by the Internet's own Aaron Swartz.

wordnik / serapis

Refine document extraction #39