Open maxemileffort opened 3 years ago
@maxemileffort I read your code but didn't try it out. I don't know much about the previous issues with some URLs that you fixed by adding the transform_url(url) function, but if you give me one or two examples of problematic URLs, I'd be interested in testing them to see how you solved the problem, just for my own learning. As for auto_browser.py, I agree that headless mode is more easily detected as a bot (or at least that it fails more often), but I wonder whether, in a webapp, it would be desirable for the browser to automatically open a new tab to scrape the article...
@hiki270bis Here's a couple of text files from @aTmb405 that I used to check everything:
unsuccessful_articles.txt bloomberg_unsuccessful_articles.txt
Sometimes the error I got was an invalid url error, so I just made sure they all started with http and www and that seemed to fix the issue.
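For anyone following along, here is a rough sketch of that kind of normalization. It is not the actual transform_url from the PR, just a guess at the idea under the assumption that "started with http and www" means adding a missing scheme and a missing www. prefix. Defaulting to https is my assumption, and blindly adding www. would be wrong for hosts that already use another subdomain.

```python
def transform_url(url: str) -> str:
    """Illustrative only: make sure a URL starts with a scheme and a www. host."""
    url = url.strip()
    # Add a scheme when it is missing (https is an assumed default).
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    scheme, rest = url.split("://", 1)
    # Add the www. prefix when the host part lacks it.
    if not rest.startswith("www."):
        rest = "www." + rest
    return f"{scheme}://{rest}"
```

URLs that already have both parts pass through unchanged, so it should be safe to run on every URL before the scrape.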
Do you mean it makes sense for the browser to open a tab on the user's computer? It opens on the server side right now. I could see that as being a helper, but it would be a function of the chrome extension I think and not really of the server.
@maxemileffort Oh, so it won't open the new tab in the user's browser? Ok, I didn't know that. In this case it should be alright.
It’s actually a really good idea.
It would basically make everyone that uses the app a sort of “proxy” and solve a lot of bot detection problems. Haha!
The only issue I can see is that it’s not something we can (ethically) do from the server. I think there’s a task on the Trello board for the chrome extension where that type of functionality would be a lot easier to implement.
This was merged in #47, and I think everything went well with the Heroku build.
Original thread - Link
Fixed issues with the initial scrape failing. If the scrape still fails, the URL is passed to a webdriver, which attempts to mimic normal user behavior to get past bot detection.
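The fallback flow can be sketched roughly like this. It is a minimal outline under assumptions, not the code in the PR: scrape_with_fallback, fast_scrape, and browser_scrape are hypothetical names, and the two scrapers are passed in as plain callables so the sketch does not depend on any particular HTTP or webdriver library.

```python
from typing import Callable, Optional

Scraper = Callable[[str], Optional[str]]


def scrape_with_fallback(
    url: str,
    fast_scrape: Scraper,
    browser_scrape: Scraper,
) -> Optional[str]:
    """Try the plain scrape first; hand the URL to the browser-based scraper if it fails."""
    try:
        text = fast_scrape(url)
        if text:
            return text
    except Exception:
        # Treat any error from the first attempt as a failed scrape.
        pass
    # Empty result or error: fall back to the slower webdriver path.
    return browser_scrape(url)
```

In practice fast_scrape would wrap the normal request-based scrape and browser_scrape would wrap whatever auto_browser.py does with the webdriver.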
Repo: Link
Works about 90% of the time now. The slower it goes, the better it works. Which brings me to the TODO list:
Also, this is my first contribution on a public project ever, so any pointers/feedback are welcome!