whchien / funda-scraper

FundaScraper scrapes data from Funda, the Dutch housing website. You can find listings from the house-buying or rental market, as well as historical data. 🏡
GNU General Public License v3.0
102 stars, 46 forks

Aiohttp fails for captcha #32

Open utkuarslan5 opened 5 months ago

utkuarslan5 commented 5 months ago

I've been implementing asyncio with aiohttp instead of multiprocessing, but whatever I try (e.g. sleep, user agent, etc.), the request always gets caught by bot detection. Somehow the requests library retrieves the pages just fine. Any ideas?

    @staticmethod
    async def _get_links_from_one_parent(url: str) -> List[str]:
        """Scrape all the available housing items from one Funda search page."""
        try:
            async with aiohttp.ClientSession(headers=config.header) as session:
                async with session.get(url) as response:
                    if response.status != 200:
                        logger.error(f"Failed to fetch {url}: HTTP {response.status}")
                        return []
                    response_text = await response.text()

                    # Introduce a random delay
                    await asyncio.sleep(random.uniform(0.5, 2))

            soup = BeautifulSoup(response_text, "lxml")
            script_tags = soup.find_all("script", {"type": "application/ld+json"})
            if not script_tags:
                logger.warning(f"No script tags found in {url}")
                return []

            json_data = json.loads(script_tags[0].contents[0])
            urls = [item["url"] for item in json_data["itemListElement"]]
            return list(set(urls))

        except Exception as e:
            logger.error(f"Error fetching links from {url}: {e}")
            return []

The updated HTML content you've provided still shows a verification page, not the content page you intend to scrape. Phrases like "Je bent bijna op de pagina die je zoekt" ("You are almost on the page you are looking for") and the Google reCAPTCHA call grecaptcha.render("fundaCaptchaInput", {...}) suggest that the server is serving an intermediary page to verify that the request comes from a real user, not an automated script.
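For what it's worth, the verification page can be detected explicitly, so the scraper logs a bot-detection hit instead of the misleading "No script tags found". Below is a minimal sketch, assuming the two strings quoted above are stable markers of that page (I have not confirmed this beyond the HTML in this thread). The names `CAPTCHA_MARKERS` and `looks_like_captcha_page` are hypothetical and not part of funda-scraper.

```python
# Markers taken from the verification page quoted in this thread.
# Assumption: Funda keeps using these strings; they may change at any time.
CAPTCHA_MARKERS = (
    "Je bent bijna op de pagina die je zoekt",
    "fundaCaptchaInput",
)


def looks_like_captcha_page(html: str) -> bool:
    """Return True if the HTML contains any known verification-page marker."""
    return any(marker in html for marker in CAPTCHA_MARKERS)
```

In `_get_links_from_one_parent`, this could be called right after `response_text = await response.text()`, logging a warning and returning `[]` when it is true. It only makes the failure visible; it does not get past the captcha.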

whchien commented 2 months ago

@utkuarslan5 Yes, indeed, that is what I encountered before. I don't have a solution, but feel free to create a pull request if you find one.