zeeguu / api

API for tracking a learner's progress when reading materials in a foreign language and recommending further personalized exercises and readings.
https://zeeguu.org
MIT License

Crawler got stuck when running article_crawler.py #110

Closed tfnribeiro closed 8 months ago

tfnribeiro commented 9 months ago

I am running the article_crawler.py to test if it works with the new sources using newspaper.

When running the process for the danish sources, it got stuck in loading this article: https://www.dr.dk/stories/1288510966/det-skal-du-laegge-maerke-til-ved-aarets-oscar-nomineringer

This happened at line 22 of zeeguu\core\content_retriever\parse_with_readability_server.py, where the request is made without a timeout parameter.

I would add a timeout of 1 to 2 minutes to avoid situations like this. I am not sure whether this is also what causes the crawler to shut down.
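A minimal sketch of the proposed fix, assuming the readability server is called via `requests` (the function name, URI variable, and payload shape below are illustrative, not the actual code in parse_with_readability_server.py):

```python
import requests

TIMEOUT_SECONDS = 120  # 2 minutes, the upper bound suggested above


def fetch_cleaned_article(server_uri, article_url, timeout=TIMEOUT_SECONDS):
    # The timeout applies to connecting and to each read; if it is exceeded,
    # requests raises requests.exceptions.Timeout instead of hanging forever.
    response = requests.post(
        server_uri,
        data={"url": article_url},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text
```

With no `timeout` argument, `requests` waits indefinitely on a server that never responds, which matches the stuck-crawler behavior described above.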

mircealungu commented 9 months ago

Hi Tiago! It could be that we're hitting the server too hard? Definitely worth trying to add a timeout to see what happens!

tfnribeiro commented 9 months ago

The link I sent is kind of an Instagram story, so I think the server might not be able to parse it; adding the timeout ensures we don't wait forever.

mircealungu commented 9 months ago

Ah!

I didn’t actually open the link :) Now I see.

So a timeout, not a sleep between requests.

For sure an article should be downloadable in 10 seconds, or 30 on a slow connection. Anything more would mean that something's wrong.


tfnribeiro commented 8 months ago

@mircealungu I got a notification from Sentry about a timeout in the crawler, so I guess this is also happening in production. Should we add a timeout exception handler to capture those cases, or leave it as is?

mircealungu commented 8 months ago

Yes, that's a good idea!
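A sketch of what catching the timeout in the crawler loop could look like, so one stuck article is logged and skipped instead of aborting the whole run (`crawl_all` and the `fetch` callable are hypothetical names, not the real crawler API; Sentry would pick the message up via its logging integration, if configured):

```python
import logging

import requests

logger = logging.getLogger(__name__)


def crawl_all(article_urls, fetch):
    """Try to download every article; collect URLs that timed out."""
    failed = []
    for url in article_urls:
        try:
            fetch(url)
        except requests.exceptions.Timeout:
            # Logged (and reported to Sentry, if the SDK's logging
            # integration is enabled) rather than crashing the crawl.
            logger.warning("Timeout while downloading %s; skipping.", url)
            failed.append(url)
    return failed
```

This keeps the Sentry signal (each timeout is still reported) while making the crawler resilient to a single unparseable page like the DR story above.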