wmgeolab / scope


BUG: Some URLs aren't successfully scraped by webscraper #101

Closed michaelrfoster closed 2 years ago

michaelrfoster commented 2 years ago

Some of the relevant websites listed in BRIGHTDATA's spreadsheet aren't being parsed by the webscraper. Identify the failing URLs and fix them individually.

UPDATE: When the webscraper accessed these websites, they flagged it as an attacker and blocked it. Trying a browser-based scraper, Selenium, instead.

UPDATE: Recreated the original web scraper using Selenium and Beautiful Soup. The new scraper is slower (about 100 sources every 10 minutes) but seems to work with most URLs.
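
For anyone picking this up later, here is a minimal sketch of the Selenium plus Beautiful Soup approach described above. It is not the code from the repository: the function names (`fetch_with_selenium`, `extract_text`, `scrape_all`), the headless Chrome setup, and the 30-second timeout are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Optional


def fetch_with_selenium(url: str, timeout: int = 30) -> str:
    """Load a page in headless Chrome and return the rendered HTML.

    Hypothetical helper; assumes selenium and a Chrome driver are installed.
    """
    # Imported lazily so the rest of the module works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()


def extract_text(html: str) -> str:
    """Strip scripts and styles, then return the visible text of the page."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_with_selenium,
    parse: Callable[[str], str] = extract_text,
) -> Dict[str, Optional[str]]:
    """Scrape each URL; record None for any URL that fails."""
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        try:
            results[url] = parse(fetch(url))
        except Exception:
            # Keep going so one blocked site does not stop the whole run.
            results[url] = None
    return results
```

Starting a new browser for every URL keeps the sketch simple but adds overhead per page; reusing a single driver across URLs would likely be faster. URLs that map to `None` in the result are the ones still worth inspecting individually.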