zeeguu / api

API for tracking a learner's progress when reading materials in a foreign language and recommending further personalized exercises and readings.
https://zeeguu.org
MIT License
8 stars 24 forks

281 vejret shouldn't be classified as technology science #283

Open tfnribeiro opened 2 weeks ago

tfnribeiro commented 2 weeks ago

As described in the issue, when labeling I assumed 'vejret' would cover news similar to what is associated with the keyword 'klima', which focuses on larger climate events, rather than the daily / weekly weather descriptions that we actually see in 'vejret'.

This means we should remove this mapping so that these types of news are no longer inferred as 'Science & Technology' in the future.

I added a script (tools/update_es_based_on_url_keyword.py) that takes a keyword, deletes the current topic mappings for that keyword, and re-indexes the affected articles. It can also be used to simply update a keyword that previously had no mapping and has since been mapped to a topic.

github-actions[bot] commented 2 weeks ago

Module view diffs: diff, diff, diff

tfnribeiro commented 2 weeks ago

After a discussion with Mircea, we identified some edge cases that needed to be covered correctly.

I have made a series of improvements, and I will walk through an example here. Currently I have "vejret" associated with the topic "News", and I want to remove this mapping.

image

I will run the script with the following options:

URL_KEYWORD_TO_UPDATE = "vejret"
DELETE_ARTICLE_NEW_TOPICS = True
RECALCULATE_TOPICS = True
RE_INDEX_ONLY_ARTICLES_IN_ES = True
ITERATION_STEP = 1000

This means I want to delete the mapping "vejret" -> "News" and recalculate the topics for the affected articles.
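To make the intent of these options concrete, here is a minimal in-memory sketch of how they could interact. It is not the code in tools/update_es_based_on_url_keyword.py: `Article`, `infer_topics`, and `update_keyword` are hypothetical names, plain Python dicts stand in for MySQL and ES, and the reading of RE_INDEX_ONLY_ARTICLES_IN_ES (skip articles that are not already indexed) is an assumption.

```python
from dataclasses import dataclass, field

# Option names come from the comment above; all other names are hypothetical.
URL_KEYWORD_TO_UPDATE = "vejret"
DELETE_ARTICLE_NEW_TOPICS = True
RECALCULATE_TOPICS = True
RE_INDEX_ONLY_ARTICLES_IN_ES = True
ITERATION_STEP = 1000


@dataclass
class Article:
    id: int
    url_keywords: list
    topics: set = field(default_factory=set)


def infer_topics(article, keyword_to_topic):
    """Hypothetical stand-in: derive topics from the keyword mappings that remain."""
    return {keyword_to_topic[k] for k in article.url_keywords if k in keyword_to_topic}


def update_keyword(articles, keyword_to_topic, es_index, keyword):
    """Apply the options to in-memory data; returns the number of re-indexed articles."""
    affected = [a for a in articles if keyword in a.url_keywords]

    if DELETE_ARTICLE_NEW_TOPICS:
        # Remove the keyword -> topic mapping and the topic it assigned.
        removed_topic = keyword_to_topic.pop(keyword, None)
        for article in affected:
            article.topics.discard(removed_topic)

    if RECALCULATE_TOPICS:
        # Re-derive topics from whatever mappings are still in place.
        for article in affected:
            article.topics = infer_topics(article, keyword_to_topic)

    if RE_INDEX_ONLY_ARTICLES_IN_ES:
        # Assumption: articles that were never indexed are left out.
        affected = [a for a in affected if a.id in es_index]

    updated = 0
    for start in range(0, len(affected), ITERATION_STEP):
        # One batch per ITERATION_STEP articles.
        for article in affected[start:start + ITERATION_STEP]:
            es_index[article.id] = sorted(article.topics)
            updated += 1
    return updated
```

In this sketch, an article whose only keyword is "vejret" ends up with no topics, while an article that also carries "klima" keeps the topic still mapped to "klima".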

This results in the following output:

Started at: 2024-11-08 13:06:32.807006
Got articles with url_keyword 'vejret', total: 650
Deleting new_topics 'News' for articles which have the keyword: 'vejret'
Found '650' topic mappings to delete.
MySQL deletion completed.
Re-indexing only existing articles in ES...
Starting re-indexing process...
  0%|                                                                                 | 0/1 [00:00<?, ?it/s]
Batch finished. ADDED/UPDATED:650 | ERRORS: 0
100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00, 20.68s/it]
[]
Total articles added/updated: 650
Ended at: 2024-11-08 13:07:26.641374
Process took: 0:00:53.834368
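As a quick sanity check on the numbers in this output, the snippet below recomputes the batch count and the elapsed time from the logged values. It assumes the script splits the articles into batches of ITERATION_STEP using ceiling division, which matches the `0/1` and `1/1` shown by the progress bar but is not confirmed by the script itself.

```python
import math
from datetime import datetime

total_articles = 650     # from "total: 650"
iteration_step = 1000    # ITERATION_STEP

# Assumed batching rule: ceiling division, so 650 articles fit in one batch.
batches = math.ceil(total_articles / iteration_step)

# Timestamps copied from the "Started at" / "Ended at" lines.
started = datetime.fromisoformat("2024-11-08 13:06:32.807006")
ended = datetime.fromisoformat("2024-11-08 13:07:26.641374")
elapsed = ended - started

print(batches)   # 1
print(elapsed)   # 0:00:53.834368
```

Of the roughly 54 seconds in total, the progress bar attributes about 20 seconds to the single re-indexing batch.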

Now, going back to the home page:

image

The articles are no longer stored as "News", but can still be found by searching:

image

@mircealungu Take a look when you have time and let me know if it makes more sense now!

tfnribeiro commented 4 days ago

I have reviewed the code, and I hope it is now more understandable. I have renamed some variables / methods and tried to make the steps clearer.

Let me know if it is clearer now.

tfnribeiro commented 4 days ago

I also fixed a bug introduced by the quicker method of filtering articles in ES. The bug only occurs when the user doesn't have the ES index zeeguu. Overall, it should be more robust this way.