Open tfnribeiro opened 2 weeks ago
Module view diffs:
After a discussion with Mircea, we identified some edge-cases that needed to be covered correctly.
I have made a series of improvements. To illustrate with an example: "vejret" is currently associated with the topic "News", and I want to remove this mapping.
I will run the script with the following options:
URL_KEYWORD_TO_UPDATE = "vejret"
DELETE_ARTICLE_NEW_TOPICS = True
RECALCULATE_TOPICS = True
RE_INDEX_ONLY_ARTICLES_IN_ES = True
ITERATION_STEP = 1000
In other words, I want to delete the mapping "vejret" -> "News" and recalculate the topics for the affected articles.
This results in the following output:
Started at: 2024-11-08 13:06:32.807006
Got articles with url_keyword 'vejret', total: 650
Deleting new_topics 'News' for articles which have the keyword: 'vejret'
Found '650' topic mappings to delete.
MySQL deletion completed.
Re-indexing only existing articles in ES...
Starting re-indexing process...
0%| | 0/1 [00:00<?, ?it/s]
Batch finished. ADDED/UPDATED:650 | ERRORS: 0
100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00, 20.68s/it]
[]
Total articles added/updated: 650
Ended at: 2024-11-08 13:07:26.641374
Process took: 0:00:53.834368
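To make the run above easier to follow, here is a minimal in-memory sketch of the steps the options select. It is not the code in the script: the option names are taken from the run above, but `batches`, `update_keyword`, and the dictionary layout of an article are hypothetical, and MySQL and ES are replaced by plain Python structures.

```python
from math import ceil

# Option names mirror the run above; everything else here is hypothetical.
URL_KEYWORD_TO_UPDATE = "vejret"
DELETE_ARTICLE_NEW_TOPICS = True
RECALCULATE_TOPICS = True
ITERATION_STEP = 1000


def batches(items, step):
    """Yield consecutive slices of at most `step` items."""
    for start in range(0, len(items), step):
        yield items[start:start + step]


def update_keyword(articles, keyword_to_topic, keyword):
    """Drop the keyword's topic mapping and recompute topics per batch.

    Returns (number of articles updated, number of batches processed).
    """
    affected = [a for a in articles if keyword in a["url_keywords"]]
    if DELETE_ARTICLE_NEW_TOPICS:
        # Stand-in for deleting the keyword -> topic mapping in the database.
        keyword_to_topic.pop(keyword, None)
    updated = 0
    for batch in batches(affected, ITERATION_STEP):
        for article in batch:
            if RECALCULATE_TOPICS:
                # Topics are re-derived from the mappings that still exist.
                article["topics"] = sorted(
                    {
                        keyword_to_topic[k]
                        for k in article["url_keywords"]
                        if k in keyword_to_topic
                    }
                )
            updated += 1  # stand-in for re-indexing the article
    return updated, ceil(len(affected) / ITERATION_STEP)


# Toy data: 650 articles tagged with the keyword, as in the run above.
articles = [{"url_keywords": ["vejret"], "topics": ["News"]} for _ in range(650)]
mapping = {"vejret": "News"}
updated, n_batches = update_keyword(articles, mapping, URL_KEYWORD_TO_UPDATE)
print(updated, n_batches)  # 650 1
```

With 650 articles and ITERATION_STEP = 1000, everything fits in a single batch, which matches the 0/1 and 1/1 progress bar in the output.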
Now, going back to the home page:
The articles are no longer stored as "News", but can still be found by searching:
@mircealungu Take a look when you have time and let me know if it makes more sense now!
I have reviewed the code and hope it is easier to follow now. I renamed some variables and methods and tried to make the steps clearer.
Let me know if it reads better.
I also fixed a bug introduced by the quicker method of filtering articles in ES. It only occurs when the user doesn't have the ES index zeeguu. Overall, the script should be more robust this way.
As described in the issue, when labeling I assumed 'vejret' would contain news similar to what is associated with the keyword 'klima'. However, 'klima' articles focus on larger climate events, whereas 'vejret' articles describe the weather on a daily or weekly basis.
This means we should remove the mapping, so that this type of news is not inferred as 'Science & Technology' in the future.
I added a script (tools/update_es_based_on_url_keyword.py) that takes a keyword, deletes the current topic mappings for that keyword, and re-indexes the articles accordingly. It can also be used to update a keyword that previously had no mapping and has since been mapped to a topic.