whatevery1says / we1s-collector

article collector utils for WE1S (WhatEvery1Says)
https://whatevery1says.github.io/we1s-collector/
MIT License

multibody duplicate data? #4

Open jeremydouglass opened 6 years ago

jeremydouglass commented 6 years ago

Articles may have multiple <body> tags. Sometimes their contents might be redundant?

If so, here is a potential fix that takes only the first body tag. If the first is a preview, then perhaps they should be merged -- or only the second should be taken....

Potential patch on search.py:

             try:  # move dictionary keys
                 soup = BeautifulSoup(article.pop('full_text'), 'lxml')
                 body_divs = soup.find_all("div", {"class":"BODY"})
-                txt = ''
-                for b in body_divs:
-                    txt = txt + b.get_text(separator=u' ')
+                txt = body_divs[0].get_text(separator=u' ')
                 txt = string_cleaner(txt)
                 if bagify:
                     txt = ' '.join(sorted(txt.split(' '), key=str.lower))
scottkleinman commented 6 years ago

I've noticed this. Perhaps

# If the second body tag contains the text in the first body tag, take the second
# (re.escape keeps the first body's text from being treated as a regex pattern)
if re.search(re.escape(body_divs[0].get_text(separator=u' ')), body_divs[1].get_text(separator=u' ')):
    txt = body_divs[1].get_text(separator=u' ')
# Otherwise, join them
else:
    txt = ' '.join([body_divs[0].get_text(separator=u' '), body_divs[1].get_text(separator=u' ')])

Would that ensure the maximum amount of text without duplication?
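Another option (just a sketch, not tested against the collector) would be to handle any number of body divs, dropping a div whose text is already contained in a longer one and otherwise joining them. merge_body_divs here is a hypothetical helper; body_divs is assumed to be the list returned by soup.find_all("div", {"class": "BODY"}) as in the patch above.

def merge_body_divs(body_divs):
    """Join div.BODY texts, skipping any div whose text already appears
    inside a longer div (e.g. a preview repeated in the full body)."""
    texts = [b.get_text(separator=u' ').strip() for b in body_divs]
    merged = []
    for t in texts:
        # Drop this div if it is a substring of some longer div.
        if any(t in other for other in texts if other is not t and len(other) > len(t)):
            continue
        # Drop exact repeats that have already been collected.
        if t and t not in ' '.join(merged):
            merged.append(t)
    return ' '.join(merged)

The call site in search.py could then be txt = merge_body_divs(body_divs), which would keep the longer body when one contains the other and concatenate genuinely distinct bodies.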