I had a go at running the scrapers but quickly ran into a few exceptions; it looks like some of the councils have updated their websites. Let me know your thoughts.
What changes to the code have you added?
Updated the Darebin scraper because it couldn't find the agenda link. See the exception below.
```
2024-05-23 23:30:48,305 [INFO] DarebinScraper: Starting darebin scraper
Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 98, in run_scrapers
    scraper_results = scraper_instance.scraper()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/scrapers/vic/darebin.py", line 32, in scraper
    target_a_tag = soup.find("a", href=lambda href: href and "Agenda" in href)
                   ^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find'
```
Fixed a similar issue with the Glen Eira scraper, caused by the agenda link being relative rather than absolute. See the error below.
```
2024-05-23 23:35:46,711 [INFO] GlenEiraScraper: Starting glen_eira scraper
2024-05-23 23:35:47,637 [INFO] GlenEiraScraper: Found link to agenda: /about-council/meetings-and-agendas/council-agendas-and-minutes/special-council-meeting-tuesday-28-may-2024
No time found in the input string.
2024-05-23 23:35:48,100 [INFO] GlenEiraScraper: Scraped: Special meeting, Date: 28 May 2024, Time: None, PDF URL: /media/gfad13kr/co_28052024_agn_1287_at_extra.pdf
2024-05-23 23:35:48,100 [INFO] YIMBY-Scraper: Link scraped! Downloading PDF...
Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 102, in run_scrapers
    processor(scraper_results, scraper_instance)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 36, in processor
    download_pdf(scraper_results.download_url, council_name)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/utils.py", line 19, in download_pdf
    response = requests.get(link)
               ^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/sessions.py", line 486, in prepare_request
    p.prepare(
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '/media/gfad13kr/co_28052024_agn_1287_at_extra.pdf': No scheme supplied. Perhaps you meant https:///media/gfad13kr/co_28052024_agn_1287_at_extra.pdf?
```
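The relative-link fix is essentially `urllib.parse.urljoin` against the council's base URL before handing the link to `requests`. The base URL below is an assumption for illustration; check the scraper's config for the real one:

```python
from urllib.parse import urljoin

GLEN_EIRA_BASE = "https://www.gleneira.vic.gov.au"  # assumed base URL


def absolutise(link: str) -> str:
    """Resolve a possibly relative href into an absolute URL.

    Already-absolute URLs pass through unchanged, so this is safe to apply
    to every link the scraper finds.
    """
    return urljoin(GLEN_EIRA_BASE, link)
```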
A similar issue occurred for Merri-bek, which failed to find the agenda name.
```
2024-05-23 23:38:57,962 [ERROR] YIMBY-Scraper: Running merri_bek scraper
Starting merri_bek scraper
a tag found
download url set
No date found in the input string.
Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 98, in run_scrapers
    scraper_results = scraper_instance.scraper()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/scrapers/vic/merribek.py", line 54, in scraper
    el_name = (grandparent_el).text
              ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'
```
The last exception happened when the program tried to read the Merri-bek agenda. It looks like the download is an HTML page, not a PDF, so the program fails when the PDF library tries to parse it. I added a try/except in main.py to handle this. It would be nicer to show a clear error to the user in cases like this, but that's probably overkill given the exception already kills the process.
```
Council Agenda None None https://www.merri-bek.vic.gov.au/my-council/council-and-committee-meetings/council-meetings/council-meeting-minutes/ https://www.merri-bek.vic.gov.au/my-council/council-and-committee-meetings/council-meetings/nextcouncilmeetingagenda/
2024-05-23 23:41:57,740 [INFO] YIMBY-Scraper: Link scraped! Downloading PDF...
2024-05-23 23:41:57,944 [INFO] YIMBY-Scraper: PDF downloaded!
2024-05-23 23:41:57,944 [INFO] YIMBY-Scraper: Reading PDF into memory...
Traceback (most recent call last):
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/fitz/__init__.py", line 2679, in __init__
    self.this = extra.Document_init( filename, stream, filetype, rect, width, height, fontsize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/fitz/extra.py", line 153, in Document_init
    return _extra.Document_init(filename, stream, filetype, rect, width, height, fontsize)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: code=0: no objects found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 102, in run_scrapers
    processor(scraper_results, scraper_instance)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 40, in processor
    text = read_pdf(council_name)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/utils.py", line 26, in read_pdf
    doc = fitz.open(f"files/{council_name}_latest.pdf")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/fitz/__init__.py", line 2686, in __init__
    raise FileDataError( MSG_BAD_DOCUMENT) from e
fitz.FileDataError: cannot open broken document
```
Checklist
- [x] Run Black to format the code
- [x] Run tests locally and added the cached results to the PR