Closed MarcellusKovacek closed 1 year ago
I think it might also be the date_list column
Hi, did you try setting n_pages to 999 to see if they are still excluded? I notice there are 200+ pages of houses for sale in Amsterdam. Could it be that the two examples you shared were listed on a page beyond 99?
Hi Chien, no, it's related to the following line in config.yaml:
listed_since: ".fd-flex~ .fd-align-items-center:nth-child(6)"
Sometimes the listed_since value is at nth-child(8) instead. In that case the house is filtered out by the following code in preprocess:
    if not is_past:
        # Only check current data
        df["date_list"] = df.listed_since.apply(clean_list_date)
        df = df[df["date_list"] != "na"]
        df["date_list"] = pd.to_datetime(df["date_list"])
The reason it sometimes moves to the 8th child is that, for example, the 'Service charges' field occupies the 6th spot. The selector then picks up the wrong value, and the listing gets filtered out by the code above.
Some examples of results that get filtered out (listed_since on the 8th child):
https://www.funda.nl/en/koop/almere/appartement-42134911-bandastraat-42/ https://www.funda.nl/en/koop/schagen/appartement-88597019-nesserpark-47/ https://www.funda.nl/en/koop/haarlem/appartement-42131121-pieter-van-musschenbroekstraat-39/ https://www.funda.nl/en/koop/castricum/appartement-42131579-heereweg-36/ https://www.funda.nl/en/koop/wormerveer/appartement-42137172-wandelweg-58-c/ https://www.funda.nl/en/koop/amersfoort/appartement-42122087-liendertseweg-81-k/ https://www.funda.nl/en/koop/leusden/appartement-42122480-lepelaar-6-a/ https://www.funda.nl/en/koop/leusden/appartement-42122498-lepelaar-6-b/ https://www.funda.nl/en/koop/leusden/appartement-42122483-lepelaar-6-d/ https://www.funda.nl/en/koop/wormerveer/appartement-42126983-celebesstraat-4/ https://www.funda.nl/en/koop/wormerveer/appartement-42126906-zuideinde-15/ https://www.funda.nl/en/koop/haarlem/appartement-42128629-pieter-van-musschenbroekstraat-127/ https://www.funda.nl/en/koop/hillegom/appartement-42128844-dijkoever-25/ https://www.funda.nl/en/koop/almere/appartement-42115844-marsstraat-30/ https://www.funda.nl/en/koop/amersfoort/appartement-42117115-leusderweg-158-c/ https://www.funda.nl/en/koop/amersfoort/appartement-42117950-leusderweg-158-a/ https://www.funda.nl/en/koop/amersfoort/appartement-42101135-baak-van-herkingen-98/ https://www.funda.nl/en/koop/naaldwijk/appartement-42103945-populier-8/ https://www.funda.nl/en/koop/volendam/appartement-42105063-harlingenlaan-54/ https://www.funda.nl/en/koop/nieuw-vennep/appartement-42197714-venneperweg-497/ https://www.funda.nl/en/koop/zaandam/appartement-42189799-provincialeweg-186-f/ https://www.funda.nl/en/koop/s-gravendeel/appartement-42189971-nieuweweg-12-f/ https://www.funda.nl/en/koop/huizen/appartement-42042300-duiker-127/ https://www.funda.nl/en/koop/zaandam/appartement-42164679-teakhout-3/ https://www.funda.nl/en/koop/julianadorp/huis-88528790-prinses-arianehof-2/ https://www.funda.nl/en/koop/leerdam/appartement-42031117-keramieklaan-31/ 
https://www.funda.nl/en/koop/almere/appartement-42018801-kerkinilaan-26/ https://www.funda.nl/en/koop/almere/appartement-42018030-kerkinilaan-16/ https://www.funda.nl/en/koop/ridderkerk/appartement-88463287-nassaustraat-263/ https://www.funda.nl/en/koop/gouda/appartement-88393023-sint-mariewal-29/ https://www.funda.nl/en/koop/veenendaal/appartement-42805002-coornhertpad-11/ https://www.funda.nl/en/koop/leersum/appartement-88088034-gebouw-c-2e-verdieping-bouwnr-24/ https://www.funda.nl/en/koop/sommelsdijk/huis-42778676-westelijke-achterweg-22/
Some results that do get included, for comparison (listed_since on the 6th child):
https://www.funda.nl/en/koop/bodegraven/huis-42134799-zevenster-13/ https://www.funda.nl/en/koop/lelystad/appartement-42145278-waagstraat-17/ https://www.funda.nl/en/koop/oegstgeest/appartement-42145649-floralaan-145/ https://www.funda.nl/en/koop/zeewolde/appartement-42145600-kaapsduinhof-27/ https://www.funda.nl/en/koop/hoofddorp/appartement-42132890-concourslaan-22-d/ https://www.funda.nl/en/koop/almere/appartement-88509069-ambonstraat-36/ https://www.funda.nl/en/koop/ijmuiden/appartement-42133515-frans-naereboutstraat-12/ https://www.funda.nl/en/koop/almere/huis-42134980-makassarweg-30/ https://www.funda.nl/en/koop/heerhugowaard/appartement-42132079-de-groene-trede-18/ https://www.funda.nl/en/koop/heerhugowaard/appartement-42134180-industriestraat-11-bwnr-38/ https://www.funda.nl/en/koop/den-haag/appartement-42145207-petroleumhaven-app-403/ https://www.funda.nl/en/koop/den-haag/appartement-42145200-petroleumhaven-app-504/ https://www.funda.nl/en/koop/katwijk-zh/appartement-88509786-zwenkgras-17/ https://www.funda.nl/en/koop/rijnsburg/appartement-42130300-jan-van-goyenplein-37/ https://www.funda.nl/en/koop/rijswijk-zh/appartement-42146474-koopmansstraat-1-f-508/ https://www.funda.nl/en/koop/rutten/appartement-88598430-venelaan-1-bnr-11/ https://www.funda.nl/en/koop/rutten/appartement-88598433-venelaan-1-bnr-3/
Then there are also cases like this one, where it is possibly on the 10th child: https://www.funda.nl/en/koop/amersfoort/appartement-42787038-piet-mondriaanplein-197/
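To make the position problem concrete, here is a minimal sketch comparing a positional pick (what nth-child(6) effectively does) with a lookup by label. It assumes the feature block can be modelled as a flat sequence of label, value, label, value children. The lists and their values are made up for illustration, and `pick_by_position` and `pick_by_label` are hypothetical helpers, not part of the package.

```python
# Made-up stand-ins for the children of a listing's feature block,
# modelled as a flat sequence: label, value, label, value, ...
WITHOUT_EXTRA_FIELD = [
    "Asking price", "EUR 400,000",
    "Asking price per m2", "EUR 5,000",
    "Listed since", "3 weeks",
]
WITH_SERVICE_CHARGES = [
    "Asking price", "EUR 400,000",
    "Asking price per m2", "EUR 5,000",
    "Service charges", "EUR 150 per month",
    "Listed since", "3 weeks",
]
WITH_TWO_EXTRA_FIELDS = [
    "Asking price", "EUR 400,000",
    "Asking price per m2", "EUR 5,000",
    "Some other field", "some value",  # hypothetical extra field
    "Service charges", "EUR 150 per month",
    "Listed since", "3 weeks",
]


def pick_by_position(children, n):
    """Mimic :nth-child(n): return the n-th child (1-based), or "na"."""
    return children[n - 1] if 0 < n <= len(children) else "na"


def pick_by_label(children, label):
    """Return the value that directly follows `label`, or "na" if absent."""
    for name, value in zip(children[0::2], children[1::2]):
        if name == label:
            return value
    return "na"


for children in (WITHOUT_EXTRA_FIELD, WITH_SERVICE_CHARGES, WITH_TWO_EXTRA_FIELDS):
    by_position = pick_by_position(children, 6)
    by_label = pick_by_label(children, "Listed since")
    print(f"position 6: {by_position!r:22} by label: {by_label!r}")
```

In this model the positional pick only returns the listing date when no extra field precedes it. Otherwise it returns a different field's value, which is presumably what ends up as "na" and gets dropped by the filter. A lookup keyed on the label does not depend on how many fields come first.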
Hi @MarcellusKovacek
I just released the package with some new updates. The problems you mentioned should be solved. Please let me know if these issues persist.
I noticed that some listings matching my given criteria are not included when raw_data=False. If I set it to True I do get the listings in the output.
Arguments: want_to="buy", find_past=False, n_pages=99
Some example links that are not in the output but should be (note: as of opening this issue, the houses are still available for purchase): https://www.funda.nl/en/koop/amsterdam/huis-42138444-maria-austriastraat-853/ https://www.funda.nl/en/koop/amsterdam/huis-88599610-johan-hofmanstraat-273-pp/
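For anyone who wants to reproduce this, a rough way to list the dropped listings is to compare the URLs from the two runs. This is only a sketch: it assumes each output can be reduced to a list of listing URLs (for example from a url column, which is an assumption about the output format), and `dropped_urls` is a hypothetical helper, not part of the package.

```python
def dropped_urls(raw_urls, processed_urls):
    """Return URLs present in the raw output but missing after preprocessing.

    The order of `raw_urls` is preserved so the result is easy to scan.
    """
    processed = set(processed_urls)
    return [url for url in raw_urls if url not in processed]


# Stand-in data: in practice these lists would come from the raw_data=True
# and raw_data=False outputs respectively.
raw = [
    "https://www.funda.nl/en/koop/amsterdam/huis-42138444-maria-austriastraat-853/",
    "https://www.funda.nl/en/koop/amsterdam/huis-88599610-johan-hofmanstraat-273-pp/",
    "https://www.funda.nl/en/koop/bodegraven/huis-42134799-zevenster-13/",
]
processed = [
    "https://www.funda.nl/en/koop/bodegraven/huis-42134799-zevenster-13/",
]

for url in dropped_urls(raw, processed):
    print(url)
```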
I notice that one column is shifted, probably a hint. I highlighted it below.
Python 3.11.3. Run from a .py script.