whchien / funda-scraper

FundaScaper scrapes data from Funda, the Dutch housing website. You can find listings from the house-buying or rental market, as well as historical data. 🏡
GNU General Public License v3.0
104 stars 48 forks

Valid funda listings not included in output #5

Closed by MarcellusKovacek 1 year ago

MarcellusKovacek commented 1 year ago

I noticed that some listings matching my given criteria are not included when raw_data=False. If I set it to True I do get the listings in the output.

Arguments: want_to="buy", find_past=False, n_pages=99

Some example links that are not in the output but should be (note: as of opening this issue, the houses are still available for purchase):
https://www.funda.nl/en/koop/amsterdam/huis-42138444-maria-austriastraat-853/
https://www.funda.nl/en/koop/amsterdam/huis-88599610-johan-hofmanstraat-273-pp/

I noticed that one column is shifted, which is probably a hint. I highlighted it below. [image]

Python 3.11.3. Run from a .py script.

MarcellusKovacek commented 1 year ago

I think it might also be the date_list column

whchien commented 1 year ago

Hi, did you try setting n_pages to 999 to see if they are still excluded? I notice there are 200+ pages of houses for sale in Amsterdam. Could it be that the two examples you shared were listed on a page later than 99?

MarcellusKovacek commented 1 year ago

Hi Chien, no, it's related to the following line in config.yaml: listed_since: ".fd-flex~ .fd-align-items-center:nth-child(6)"

Sometimes listed_since will be at nth-child(8) instead. In that case the house is filtered out by the following code in preprocess:

    if not is_past:
        # Only check current data
        df["date_list"] = df.listed_since.apply(clean_list_date)
        df = df[df["date_list"] != "na"]
        df["date_list"] = pd.to_datetime(df["date_list"])

The reason it sometimes moves to nth-child(8) is that another field, for example 'Service charges', occupies the nth-child(6) position. The selector then picks up the wrong value for listed_since, and the listing gets filtered out by the code above.
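To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use the package's real code: `FIXED_POSITION`, `pick_by_position`, `clean_date_stub`, and the sample values are all hypothetical stand-ins, and the stub only assumes that the real `clean_list_date` yields "na" for text that is not a date.

```python
from datetime import datetime

# Stand-in for the hard-coded nth-child index in config.yaml (hypothetical).
FIXED_POSITION = 1


def pick_by_position(values):
    """Mimic a positional selector: always read the same slot."""
    return values[FIXED_POSITION]


def clean_date_stub(text):
    """Hypothetical stand-in for clean_list_date: "na" if text is not a date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return text
    except ValueError:
        return "na"


# Made-up field values in page order; the second listing has an extra
# 'Service charges' value that pushes the date one slot further down.
listings = {
    "house": ["€ 400,000", "2023-06-01", "Available"],
    "apartment": ["€ 300,000", "€ 150 per month", "2023-06-03"],
}

# Same shape as the filter in preprocess: keep rows whose date is not "na".
kept = [
    name
    for name, values in listings.items()
    if clean_date_stub(pick_by_position(values)) != "na"
]
print(kept)
```

The apartment is a valid listing with a valid date, but the date is never read because the fixed slot holds the service charges instead.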

Some examples of results that get filtered out (listed_since at nth-child(8)):

https://www.funda.nl/en/koop/almere/appartement-42134911-bandastraat-42/
https://www.funda.nl/en/koop/schagen/appartement-88597019-nesserpark-47/
https://www.funda.nl/en/koop/haarlem/appartement-42131121-pieter-van-musschenbroekstraat-39/
https://www.funda.nl/en/koop/castricum/appartement-42131579-heereweg-36/
https://www.funda.nl/en/koop/wormerveer/appartement-42137172-wandelweg-58-c/
https://www.funda.nl/en/koop/amersfoort/appartement-42122087-liendertseweg-81-k/
https://www.funda.nl/en/koop/leusden/appartement-42122480-lepelaar-6-a/
https://www.funda.nl/en/koop/leusden/appartement-42122498-lepelaar-6-b/
https://www.funda.nl/en/koop/leusden/appartement-42122483-lepelaar-6-d/
https://www.funda.nl/en/koop/wormerveer/appartement-42126983-celebesstraat-4/
https://www.funda.nl/en/koop/wormerveer/appartement-42126906-zuideinde-15/
https://www.funda.nl/en/koop/haarlem/appartement-42128629-pieter-van-musschenbroekstraat-127/
https://www.funda.nl/en/koop/hillegom/appartement-42128844-dijkoever-25/
https://www.funda.nl/en/koop/almere/appartement-42115844-marsstraat-30/
https://www.funda.nl/en/koop/amersfoort/appartement-42117115-leusderweg-158-c/
https://www.funda.nl/en/koop/amersfoort/appartement-42117950-leusderweg-158-a/
https://www.funda.nl/en/koop/amersfoort/appartement-42101135-baak-van-herkingen-98/
https://www.funda.nl/en/koop/naaldwijk/appartement-42103945-populier-8/
https://www.funda.nl/en/koop/volendam/appartement-42105063-harlingenlaan-54/
https://www.funda.nl/en/koop/nieuw-vennep/appartement-42197714-venneperweg-497/
https://www.funda.nl/en/koop/zaandam/appartement-42189799-provincialeweg-186-f/
https://www.funda.nl/en/koop/s-gravendeel/appartement-42189971-nieuweweg-12-f/
https://www.funda.nl/en/koop/huizen/appartement-42042300-duiker-127/
https://www.funda.nl/en/koop/zaandam/appartement-42164679-teakhout-3/
https://www.funda.nl/en/koop/julianadorp/huis-88528790-prinses-arianehof-2/
https://www.funda.nl/en/koop/leerdam/appartement-42031117-keramieklaan-31/
https://www.funda.nl/en/koop/almere/appartement-42018801-kerkinilaan-26/
https://www.funda.nl/en/koop/almere/appartement-42018030-kerkinilaan-16/
https://www.funda.nl/en/koop/ridderkerk/appartement-88463287-nassaustraat-263/
https://www.funda.nl/en/koop/gouda/appartement-88393023-sint-mariewal-29/
https://www.funda.nl/en/koop/veenendaal/appartement-42805002-coornhertpad-11/
https://www.funda.nl/en/koop/leersum/appartement-88088034-gebouw-c-2e-verdieping-bouwnr-24/
https://www.funda.nl/en/koop/sommelsdijk/huis-42778676-westelijke-achterweg-22/

Some results that do get included, for comparison (listed_since at nth-child(6)):

https://www.funda.nl/en/koop/bodegraven/huis-42134799-zevenster-13/
https://www.funda.nl/en/koop/lelystad/appartement-42145278-waagstraat-17/
https://www.funda.nl/en/koop/oegstgeest/appartement-42145649-floralaan-145/
https://www.funda.nl/en/koop/zeewolde/appartement-42145600-kaapsduinhof-27/
https://www.funda.nl/en/koop/hoofddorp/appartement-42132890-concourslaan-22-d/
https://www.funda.nl/en/koop/almere/appartement-88509069-ambonstraat-36/
https://www.funda.nl/en/koop/ijmuiden/appartement-42133515-frans-naereboutstraat-12/
https://www.funda.nl/en/koop/almere/huis-42134980-makassarweg-30/
https://www.funda.nl/en/koop/heerhugowaard/appartement-42132079-de-groene-trede-18/
https://www.funda.nl/en/koop/heerhugowaard/appartement-42134180-industriestraat-11-bwnr-38/
https://www.funda.nl/en/koop/den-haag/appartement-42145207-petroleumhaven-app-403/
https://www.funda.nl/en/koop/den-haag/appartement-42145200-petroleumhaven-app-504/
https://www.funda.nl/en/koop/katwijk-zh/appartement-88509786-zwenkgras-17/
https://www.funda.nl/en/koop/rijnsburg/appartement-42130300-jan-van-goyenplein-37/
https://www.funda.nl/en/koop/rijswijk-zh/appartement-42146474-koopmansstraat-1-f-508/
https://www.funda.nl/en/koop/rutten/appartement-88598430-venelaan-1-bnr-11/
https://www.funda.nl/en/koop/rutten/appartement-88598433-venelaan-1-bnr-3/

Then you've also got cases like this one, where it is possibly on the 10th child: https://www.funda.nl/en/koop/amersfoort/appartement-42787038-piet-mondriaanplein-197/
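Since the position can apparently be 6, 8, or 10, one possible direction is to look the value up by its label instead of by its position. The sketch below is only an illustration using the standard library: it assumes the features are rendered as `<dt>`/`<dd>` pairs, which I have not verified against Funda's actual markup, and `DefinitionListParser`, `listed_since`, and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect <dt>/<dd> text pairs into a dict (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None    # "dt" or "dd" while inside one, else None
        self._label = None  # text of the most recent <dt>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "dt":
            self._label = text
        elif self._label is not None:
            self.pairs[self._label] = text
            self._label = None
        self._tag = None


def listed_since(html):
    """Return the 'Listed since' value wherever it sits, or "na"."""
    parser = DefinitionListParser()
    parser.feed(html)
    return parser.pairs.get("Listed since", "na")


# Made-up snippet: the extra 'Service charges' row no longer matters.
sample = """
<dl>
  <dt>Asking price</dt><dd>€ 300,000</dd>
  <dt>Service charges</dt><dd>€ 150 per month</dd>
  <dt>Listed since</dt><dd>June 3, 2023</dd>
</dl>
"""
print(listed_since(sample))
```

With a lookup like this, an extra row before the date changes nothing, and a listing that truly has no 'Listed since' field still yields "na" so the existing filter keeps working.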

whchien commented 1 year ago

Hi @MarcellusKovacek

I just released the package with some new updates. The problems you mentioned should be solved. Please let me know if these issues persist.