whchien / funda-scraper

FundaScraper scrapes data from Funda, the Dutch housing website. You can find listings from the house-buying or rental market, as well as historical data. 🏡
GNU General Public License v3.0

Extract all the links, but get a smaller number of results and not from the selected city #10

Closed ValenteMendez closed 1 year ago

ValenteMendez commented 1 year ago

Hi Chien, thank you very much for updating the code!

I downloaded the latest version and executed it. My parameters were the following:

scraper = FundaScraper(area="amsterdam", want_to="buy", find_past=False, n_pages=275)

Although the code extracts 4125 links, the CSV output contains only 1615 rows of data. Of those 1615 data points, around 97% of the postal codes are not from Amsterdam but from other cities.
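One quick way to quantify how many rows actually belong to Amsterdam is to check the four-digit part of the postcode, since Amsterdam postcodes fall roughly in the 1011–1109 range. A minimal sketch with pandas, assuming the CSV has a `zip` column with values like "1012 AB" (the column name and sample rows here are illustrative, not taken from the scraper's actual output):

```python
import pandas as pd

# Hypothetical sample rows; in practice, load the scraper's CSV output instead.
df = pd.DataFrame({"zip": ["1012 AB", "1075 XZ", "2611 CD", "3512 EF"]})

# Extract the leading four digits of each postcode.
four_digit = df["zip"].str.extract(r"^(\d{4})")[0].astype(int)

# Amsterdam four-digit postcodes lie roughly between 1011 and 1109.
is_amsterdam = four_digit.between(1011, 1109)
share = is_amsterdam.mean()
print(f"{share:.0%} of rows look like Amsterdam postcodes")
```

Running this over the full CSV would confirm whether the ~97% figure holds and which cities the stray rows come from.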

Do you have any insights on what may be causing this?

Also, when running the code, I encountered an error:

requests.exceptions.SSLError: None: Max retries exceeded with url: /koop/[random listing].

I was able to work around this by adding a delay in the scrape_one_link method:

    def scrape_one_link(self, link: str) -> List[str]:
        """Scrape all the features from one house item given a link."""
        # Initialize for each page
        response = requests.get(link, headers=config.header)
        time.sleep(3)  # Add the delay here. Adjust the duration as needed.
        soup = BeautifulSoup(response.text, "lxml")
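An alternative to a fixed sleep is to let requests retry transient failures automatically with exponential backoff, using urllib3's Retry on a Session. A minimal sketch (the retry counts, backoff factor, and status codes below are illustrative choices, not settings from funda_scraper):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries: int = 3, backoff: float = 1.0) -> requests.Session:
    """Build a Session that retries transient errors with exponential backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,  # waits ~1s, 2s, 4s, ... between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

With this, `scrape_one_link` could call `session.get(link, headers=config.header)` instead of `requests.get(...)`, so "Max retries exceeded" errors are smoothed over by backoff rather than a constant delay per request.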
whchien commented 1 year ago

Hi @ValenteMendez

Thanks for spotting the bug. I am looking into it and will keep you posted.

whchien commented 1 year ago

Hi @ValenteMendez

I just released the package with some new updates (funda_scraper==1.0.7). The previous problems should be solved. Please let me know if the issues remain.