whatevery1says / we1s-collector

article collector utils for WE1S (WhatEvery1Says)
https://whatevery1says.github.io/we1s-collector/
MIT License

multibody duplicate data? #4

Open jeremydouglass opened 6 years ago

jeremydouglass commented 6 years ago

Articles may have multiple <body> tags. Sometimes their contents might be redundant?

If so, here is a potential fix that takes only the first body tag. If the first is a preview, then perhaps they should be merged -- or only the second should be taken....

Potential patch on search.py:

             try:  # move dictionary keys
                 soup = BeautifulSoup(article.pop('full_text'), 'lxml')
                 body_divs = soup.find_all("div", {"class":"BODY"})
-                txt = ''
-                for b in body_divs:
-                    txt = txt + b.get_text(separator=u' ')
+                txt = body_divs[0].get_text(separator=u' ')
                 txt = string_cleaner(txt)
                 if bagify:
                     txt = ' '.join(sorted(txt.split(' '), key=str.lower))
scottkleinman commented 6 years ago

I've noticed this. Perhaps

# If the second body tag contains the text in the first body tag, take the second
# (re.escape keeps the first body's text from being treated as a regex pattern)
if re.search(re.escape(body_divs[0].get_text(separator=u' ')), body_divs[1].get_text(separator=u' ')):
    txt = body_divs[1].get_text(separator=u' ')
# Otherwise, join them
else:
    txt = ' '.join([body_divs[0].get_text(separator=u' '), body_divs[1].get_text(separator=u' ')])

Would that ensure the maximum amount of text without duplication?
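Another option (just a sketch, not tested against the collector) would be to handle any number of body divs, dropping a div whose text is already contained in a longer one and otherwise joining them. merge_body_divs here is a hypothetical helper; body_divs is assumed to be the list returned by soup.find_all("div", {"class": "BODY"}) as in the patch above.

def merge_body_divs(body_divs):
    """Join div.BODY texts, skipping any div whose text already appears
    inside a longer div (e.g. a preview repeated in the full body)."""
    texts = [b.get_text(separator=u' ').strip() for b in body_divs]
    merged = []
    for t in texts:
        # Drop this div if it is a substring of some longer div.
        if any(t in other for other in texts if other is not t and len(other) > len(t)):
            continue
        # Drop exact repeats that have already been collected.
        if t and t not in ' '.join(merged):
            merged.append(t)
    return ' '.join(merged)

The call site in search.py could then be txt = merge_body_divs(body_divs), which would keep the longer body when one contains the other and concatenate genuinely distinct bodies.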