Web Scraper Improvements - Work Better with Bot Detectors

zero-to-mastery / breads-server

Server code for Breads. Keep track of what you read online, and see what your friends are reading.

https://www.breads.io/

Web Scraper Improvements - Work Better with Bot Detectors #40

Open maxemileffort opened 3 years ago

maxemileffort commented 3 years ago

Original thread - Link

Fixed the issues that made the initial scrape fail. If the scrape still fails, the URL is passed to a webdriver, which tries to mimic normal user behavior to get past bot detection.
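For anyone following along, the two-step flow described above could be sketched roughly like this. It is a minimal illustration and not the code from the linked repo: `scrape_with_fallback`, `fast_fetch`, and `browser_fetch` are hypothetical names, and the two fetchers are passed in as plain callables so that no particular HTTP or webdriver library is assumed.

```python
from typing import Callable, Optional


def scrape_with_fallback(
    url: str,
    fast_fetch: Callable[[str], Optional[str]],
    browser_fetch: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try the cheap fetch first; fall back to the browser-driven one.

    Both fetchers are hypothetical stand-ins: fast_fetch for a plain HTTP
    request, browser_fetch for the slower webdriver path.
    """
    try:
        html = fast_fetch(url)
    except Exception:
        # Treat any error from the plain request as "blocked or broken".
        html = None
    if html:
        return html

    # The plain scrape failed or came back empty: hand the URL to the
    # webdriver path, which is slower but looks more like a real user.
    try:
        return browser_fetch(url)
    except Exception:
        return None
```

Keeping the fetchers as parameters also makes the fallback easy to test without launching a browser.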

Repo: Link

Works about 90% of the time now. The slower it goes, the better it works. Which brings me to the TODO list:

[ ] Throttle requests - check the database for previously failed attempts and create a batch of re-scrapes, rate-limited to 30-60 sec between requests
[x] Refactor to classes - make the code a little more readable and user-friendly for code improvements later
[x] Placeholders - sometimes we can't get a description or an image, so instead of leaving them blank, create a placeholder
[ ] PDFs - I know there's a PDF scraper being developed, too. It will just need to be added when it's done.
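As a rough illustration of the throttling and placeholder items, here is a minimal sketch. Everything in it is an assumption made for the example: the function names, the placeholder values, and the shape of the article dict are hypothetical, and the database lookup for failed attempts is left out (the caller is assumed to pass in the URLs to retry).

```python
import random
import time
from typing import Callable, Dict, List, Optional

# Hypothetical placeholder values; the real ones are not given in this thread.
PLACEHOLDER_DESCRIPTION = "No description available."
PLACEHOLDER_IMAGE = "/static/placeholder.png"


def with_placeholders(article: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Return a copy with a missing description or image filled in."""
    filled = dict(article)
    if not filled.get("description"):
        filled["description"] = PLACEHOLDER_DESCRIPTION
    if not filled.get("image"):
        filled["image"] = PLACEHOLDER_IMAGE
    return filled


def rescrape_batch(
    urls: List[str],
    scrape: Callable[[str], Dict[str, Optional[str]]],
    min_delay: float = 30.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Optional[str]]]:
    """Re-scrape previously failed URLs one at a time, with a random pause."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait 30-60 seconds (by default) between consecutive requests.
            sleep(random.uniform(min_delay, max_delay))
        results.append(with_placeholders(scrape(url)))
    return results
```

The `sleep` parameter is only there so the delay can be swapped out in tests; in normal use the default `time.sleep` applies.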

Also, this is my first contribution on a public project ever, so any pointers/feedback are welcome!

hiki270bis commented 3 years ago

@maxemileffort I read your code but didn't try it out. I don't know much about the earlier issues with some URLs that you fixed by adding the transform_url(url) function, but if you give me one or two examples of problematic URLs, I'd be interested in testing them to see how you solved the problem, just for my own learning. About auto_browser.py: I agree that headless is more easily detected as a bot (or at least that it fails more often), but I wonder whether, when using a web app, it would be desirable for the browser to automatically open a new tab to scrape the article...

maxemileffort commented 3 years ago

@hiki270bis Here's a couple of text files from @aTmb405 that I used to check everything:

unsuccessful_articles.txt bloomberg_unsuccessful_articles.txt

Sometimes the error I got was an invalid URL error, so I made sure every URL started with http and www, and that seemed to fix the issue.
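To make that concrete, a normalization step along these lines might look like the sketch below. This is not the actual transform_url from the repo; `normalize_url` is a hypothetical name, and the blanket "add www" rule is only an approximation of what is described above (it would break hosts that are already subdomains, such as blog.example.com).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Make sure a URL has an http(s) scheme and a www. host prefix.

    Hypothetical helper: it mirrors the fix described in this comment,
    not the real transform_url implementation.
    """
    url = url.strip()
    if not url.lower().startswith(("http://", "https://")):
        # No scheme given, so assume plain http.
        url = "http://" + url

    parts = urlsplit(url)
    host = parts.netloc
    if not host.lower().startswith("www."):
        # Assumption: every host should get a www. prefix.
        host = "www." + host

    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For example, `normalize_url("bloomberg.com/news/a")` gives `http://www.bloomberg.com/news/a`, while a URL that already has both parts comes back unchanged.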

Do you mean it makes sense for the browser to open a tab on the user's computer? It opens on the server side right now. I could see that being helpful, but I think it would be a function of the Chrome extension and not really of the server.

hiki270bis commented 3 years ago

@maxemileffort Oh, so it won't open the new tab in the user's browser? Ok, I didn't know that. In this case it should be alright.

maxemileffort commented 3 years ago

> @maxemileffort Oh, so it won't open the new tab in the user's browser? Ok, I didn't know that. In this case it should be alright.

It’s actually a really good idea.

It would basically make everyone that uses the app a sort of “proxy” and solve a lot of bot detection problems. Haha!

The only issue I can see is that it’s not something we can (ethically) do from the server. I think there’s a task on the Trello board for the Chrome extension where that type of functionality would be a lot easier to implement.

aubundy commented 3 years ago

This was merged in #47, and I think everything went well with the Heroku build.

aubundy commented 3 years ago

@maxemileffort The new additions are working well, but a few Bloomberg links are not returning any data from the server. Here are two articles that are causing this error.

Could you try uploading these articles in a local build to see if this is a production issue?

aubundy commented 3 years ago

Update article scraper