Open jeremydouglass opened 6 years ago
I've noticed this. Perhaps:
# Extract both texts once
first = body_divs[0].get_text(separator=u' ')
second = body_divs[1].get_text(separator=u' ')
# If the second body tag contains the text in the first body tag, take the second
# (plain substring test; re.search would treat the article text as a regex pattern)
if first in second:
    txt = second
# Otherwise, join them (str.join takes a single iterable, so pass a list)
else:
    txt = ' '.join([first, second])
Would that ensure the maximum amount of text without duplication?
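For what it's worth, the same idea could be extended to any number of body tags. Below is a rough sketch, not a tested patch for search.py: `merge_body_texts` is a hypothetical helper name, and it assumes the text has already been pulled out of each tag with `get_text`. It normalizes whitespace, drops any text that is contained in another, and joins the rest in order.

```python
def merge_body_texts(texts):
    """Join body texts, skipping any text fully contained in another one.

    Hypothetical helper: `texts` is a list of strings already extracted
    from each <body> tag.
    """
    # Collapse runs of whitespace so spacing differences don't hide duplicates
    cleaned = [' '.join(t.split()) for t in texts]
    kept = []
    for i, text in enumerate(cleaned):
        if not text:
            continue  # ignore empty body tags
        # Redundant if another text contains this one; for exact duplicates,
        # only the later copy counts as redundant so one copy survives
        redundant = any(
            j != i and text in other and (text != other or j < i)
            for j, other in enumerate(cleaned)
        )
        if not redundant:
            kept.append(text)
    return ' '.join(kept)
```

With the variables from the snippet above, it might be called as `txt = merge_body_texts([d.get_text(separator=u' ') for d in body_divs])`. Note that this only catches the case where one body is a verbatim substring of another; a preview that was truncated mid-word or reworded would still be joined in.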
Articles may have multiple <body> tags, and sometimes their contents appear to be redundant.
If so, here is a potential fix: take only the first body tag. If the first is only a preview, then perhaps the two should be merged, or only the second should be taken.
Potential patch on search.py: