whchien / funda-scraper

FundaScraper scrapes data from Funda, the Dutch housing website. You can find listings from the house-buying or rental market, as well as historical data. 🏡
GNU General Public License v3.0
104 stars 48 forks

CSS scraping no longer works with beta website #40

Closed by mpgreg 3 months ago

mpgreg commented 3 months ago

Funda has released new beta pages, and the scraper's CSS selectors need to be updated. In the meantime, the old pages are still available, but the beta URLs need to be converted back to the original format with something like:

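from urllib.parse import urlparse, urlunparse

# Method on the scraper class: convert a beta listing URL back to the pre-beta format.
def fix_link(self, link: str) -> str:
    link_url = urlparse(link)
    link_path = link_url.path.split("/")
    # Take out path segment 5 (the property ID) and segment 4 (the address slug)
    property_id = link_path.pop(5)
    property_address = link_path.pop(4).split("-")
    # Keep only segments 2 and 3 of the remaining path
    link_path = link_path[2:4]
    # Put the ID right after the first token of the address slug
    property_address.insert(1, property_id)
    # "?old_ldp=true" asks for the pre-beta page
    link_path.extend(["-".join(property_address), "?old_ldp=true"])

    return urlunparse((link_url.scheme, link_url.netloc, "/".join(link_path), "", "", ""))

# Inside the scraper, apply the fix to every collected URL
urls = [self.fix_link(url) for url in urls]
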
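
For anyone who wants to try the rewrite outside the scraper class, here is a minimal standalone sketch of the same idea. The function name `to_legacy_url`, the example listing URL, and the URL layouts described in the docstring are assumptions for illustration, inferred from the indices used in the snippet above rather than from any Funda documentation. Unlike the snippet, it returns the link unchanged when the path does not match the assumed beta layout, and it passes `old_ldp=true` as the query component instead of appending it to the path.

```python
from urllib.parse import urlparse, urlunparse


def to_legacy_url(link: str) -> str:
    """Rewrite a beta-style listing URL into the pre-beta layout.

    Assumed beta layout (hypothetical, inferred from the workaround):
        /detail/<market>/<city>/<type>-<address>/<id>/
    Assumed pre-beta layout:
        /<market>/<city>/<type>-<id>-<address>/?old_ldp=true
    """
    parsed = urlparse(link)
    parts = [p for p in parsed.path.split("/") if p]

    # Leave anything that does not match the assumed beta layout untouched.
    if len(parts) != 5 or parts[0] != "detail":
        return link

    _, market, city, slug, property_id = parts

    # Insert the ID right after the first token of the slug.
    kind, _, address = slug.partition("-")
    legacy_slug = f"{kind}-{property_id}-{address}" if address else f"{kind}-{property_id}"

    path = f"/{market}/{city}/{legacy_slug}/"
    return urlunparse((parsed.scheme, parsed.netloc, path, "", "old_ldp=true", ""))


# Hypothetical listing URL, used only to show the transformation.
beta_url = "https://www.funda.nl/detail/koop/amsterdam/appartement-voorbeeldstraat-12/43012345/"
legacy_url = to_legacy_url(beta_url)
print(legacy_url)
```

The guard clause is a design choice for mixed input: if a list contains both beta and pre-beta links, the pre-beta ones pass through as-is instead of raising an `IndexError` on `pop`.
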
whchien commented 3 months ago

Hi @mpgreg, thanks for the PR! I just merged it and published a new release (v1.2.0).

mpgreg commented 3 months ago

Thanks @whchien. Obviously this is just a workaround and the correct fix is to update the scraper logic. Hopefully Funda will keep the pre-beta pages available for a while.