Closed · tfnribeiro closed this issue 8 months ago
Hi Tiago! It could be that we're hitting the server too hard? Definitely worth trying to add a timeout to see what happens!
The link I sent is kind of an Instagram story, so I think the server might not be able to parse it. Adding the timeout ensures we don't wait forever.
Ah!
I didn’t actually open the link :) Now I see.
So a timeout, not a sleep between requests.
For sure an article should be downloadable in 10 seconds. 30 on a slow connection. But more? More would mean that something’s wrong.
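To make the difference concrete, a toy sketch (URL made up): a sleep paces successive requests, while a timeout bounds how long any single request may take.

```python
import time
import requests

# A sleep between requests paces the crawler (politeness towards the server):
time.sleep(1)

# A timeout bounds a single request: 10 seconds is enough for most
# articles, 30 for a slow connection; beyond that, give up.
requests.get("https://example.com/article", timeout=30)
```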
@mircealungu I got a notification from Sentry about a timeout in the crawler, so I guess this is also happening in production. Should we catch the timeout exception to handle those cases, or leave it as is?
Yes, that's a good idea!
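A rough sketch of what catching it could look like, assuming the request goes through the requests library; the endpoint and helper names are invented for illustration:

```python
import logging
import requests

def download_with_readability(url):
    # Hypothetical stand-in for the actual request in
    # parse_with_readability_server.py, now with a timeout attached.
    return requests.get("http://readability-server/parse",
                        params={"url": url}, timeout=120)

def crawl_one(url):
    try:
        return download_with_readability(url).text
    except requests.exceptions.Timeout:
        # requests.exceptions.Timeout covers both connect and read
        # timeouts, so one stuck article no longer kills the crawl.
        logging.warning("Timed out fetching %s; skipping it.", url)
        return None
```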
I am running the article_crawler.py to test if it works with the new sources using newspaper. When running the process for the Danish sources, it got stuck loading this article: https://www.dr.dk/stories/1288510966/det-skal-du-laegge-maerke-til-ved-aarets-oscar-nomineringer

This happens on line 22 of zeeguu\core\content_retriever\parse_with_readability_server.py, where the request is made without a timeout parameter. I would add a timeout of 1 to 2 minutes to avoid situations like this. I am not sure if this is also what causes the crawler to shut down?
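For reference, the proposed fix is just passing timeout= to that request; a minimal sketch, with the endpoint and payload invented since the actual line 22 isn't quoted here:

```python
import requests

# Illustrative URL; the real endpoint is configured in
# zeeguu\core\content_retriever\parse_with_readability_server.py.
READABILITY_SERVER_URL = "http://readability-server/parse"

def parse_article(url):
    # Without timeout=, requests waits indefinitely when the server never
    # answers (as with the dr.dk story above); with it, a stuck request
    # raises requests.exceptions.Timeout after at most two minutes.
    response = requests.get(
        READABILITY_SERVER_URL,
        params={"url": url},
        timeout=120,
    )
    response.raise_for_status()
    return response.text
```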