I've been switching from multiprocessing to asyncio with aiohttp, but whatever I try (e.g. sleeps, a custom user agent, etc.), the requests always get caught by bot detection.
Somehow the requests library retrieves the same pages just fine.
Any ideas?
@staticmethod
async def _get_links_from_one_parent(url: str) -> List[str]:
    """Scrape all the available housing items from one Funda search page."""
    try:
        async with aiohttp.ClientSession(headers=config.header) as session:
            async with session.get(url) as response:
                if response.status != 200:
                    logger.error(f"Failed to fetch {url}: HTTP {response.status}")
                    return []
                response_text = await response.text()
                # Introduce a random delay
                await asyncio.sleep(random.uniform(0.5, 2))
                soup = BeautifulSoup(response_text, "lxml")
                script_tags = soup.find_all("script", {"type": "application/ld+json"})
                if not script_tags:
                    logger.warning(f"No script tags found in {url}")
                    return []
                json_data = json.loads(script_tags[0].contents[0])
                urls = [item["url"] for item in json_data["itemListElement"]]
                return list(set(urls))
    except Exception as e:
        logger.error(f"Error fetching links from {url}: {e}")
        return []
The updated HTML you've provided shows that you are still receiving a verification page, not the content page you intend to scrape. Phrases like "Je bent bijna op de pagina die je zoekt" ("You are almost on the page you are looking for") and the Google reCAPTCHA script call grecaptcha.render("fundaCaptchaInput", {...}) suggest that the server is serving an intermediary page to verify that the request comes from a real user, not an automated script.
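To make that failure mode visible in the logs, the scraper could check for the verification page before parsing. Below is a minimal sketch: the marker strings come from the HTML quoted above, while the names CAPTCHA_MARKERS and looks_like_verification_page are hypothetical helpers, not part of the original code. Note that such a page may well be served with HTTP 200, in which case the status check alone would not catch it.

```python
# Marker strings taken from the verification page described above.
# The constant and function names are illustrative assumptions.
CAPTCHA_MARKERS = (
    "Je bent bijna op de pagina die je zoekt",
    "grecaptcha.render",
    "fundaCaptchaInput",
)


def looks_like_verification_page(html: str) -> bool:
    """Return True if the HTML contains any known verification-page marker."""
    return any(marker in html for marker in CAPTCHA_MARKERS)
```

Calling looks_like_verification_page(response_text) right after response.text() and logging a distinct message would separate "blocked by bot detection" from "page had no ld+json script tags", which the current code reports the same way.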