yimbymelbourne / council-meeting-agenda-scraper

A method of getting and scraping council agendas to streamline housing abundance advocacy.

Fix exceptions from Darebin, Glen Eira, Merribek council scrapers #113

Closed SebastBake closed 5 months ago

SebastBake commented 5 months ago

I had a go at running the scrapers but quickly ran into a few exceptions. It looks like some of the councils have updated their websites.

Let me know your thoughts.

What changes to the code have you added?

  1. Updated Darebin scraper because it couldn't find the agenda link. See exception below.
2024-05-23 23:30:48,305 [INFO] DarebinScraper: Starting darebin scraper
Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 98, in run_scrapers
    scraper_results = scraper_instance.scraper()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/scrapers/vic/darebin.py", line 32, in scraper
    target_a_tag = soup.find("a", href=lambda href: href and "Agenda" in href)
                   ^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find'
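The `AttributeError` above means the object `.find` was called on was already `None`, presumably because an earlier lookup matched nothing. A minimal sketch of one way to make that failure easier to diagnose is shown here. The `require` helper and `ScraperError` class are hypothetical names, not part of the repository, and a plain dict stands in for the parsed page:

```python
class ScraperError(RuntimeError):
    """Hypothetical error type for 'expected element not found' failures."""


def require(value, description):
    """Return value unchanged, or raise a descriptive error if it is None."""
    if value is None:
        raise ScraperError(
            f"Could not find {description}; the council page layout may have changed"
        )
    return value


# Stand-in for a parsed page: a lookup that returns None when nothing matches.
page = {"header": "<h1>Council meetings</h1>"}

try:
    require(page.get("agenda-section"), "the agenda section")
except ScraperError as err:
    message = str(err)
    print(message)
```

Wrapping each intermediate lookup in such a guard would report which element went missing, rather than failing one step later with a bare `NoneType` error.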
  2. Fixed a similar issue in the Glen Eira scraper, caused by the agenda link being relative rather than absolute. See error below.
2024-05-23 23:35:46,711 [INFO] GlenEiraScraper: Starting glen_eira scraper
2024-05-23 23:35:47,637 [INFO] GlenEiraScraper: Found link to agenda: /about-council/meetings-and-agendas/council-agendas-and-minutes/special-council-meeting-tuesday-28-may-2024
No time found in the input string.
2024-05-23 23:35:48,100 [INFO] GlenEiraScraper: Scraped: Special meeting, Date: 28 May 2024, Time: None, PDF URL: /media/gfad13kr/co_28052024_agn_1287_at_extra.pdf
2024-05-23 23:35:48,100 [INFO] YIMBY-Scraper: Link scraped! Downloading PDF...
Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 102, in run_scrapers
    processor(scraper_results, scraper_instance)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 36, in processor
    download_pdf(scraper_results.download_url, council_name)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/utils.py", line 19, in download_pdf
    response = requests.get(link)
               ^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/sessions.py", line 486, in prepare_request
    p.prepare(
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/requests/models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '/media/gfad13kr/co_28052024_agn_1287_at_extra.pdf': No scheme supplied. Perhaps you meant https:///media/gfad13kr/co_28052024_agn_1287_at_extra.pdf?
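The `MissingSchema` error comes from passing a site-relative path straight to `requests.get`. One way to handle this, sketched below with the standard library's `urljoin`, is to resolve every scraped link against the site root before downloading. The `BASE_URL` value and the `absolute_url` helper are assumptions for illustration and may not match what the scraper actually uses:

```python
from urllib.parse import urljoin

# Assumption: the Glen Eira site root; adjust if the scraper uses a different base.
BASE_URL = "https://www.gleneira.vic.gov.au"


def absolute_url(base, link):
    """Resolve link against base; links that are already absolute are unchanged."""
    return urljoin(base, link)


pdf_url = absolute_url(BASE_URL, "/media/gfad13kr/co_28052024_agn_1287_at_extra.pdf")
print(pdf_url)
```

Because `urljoin` leaves absolute URLs untouched, the same call should keep working if the council switches back to absolute links.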
  3. Similar issue with the Merri-bek scraper, which failed to find the agenda name.
2024-05-23 23:38:57,962 [ERROR] YIMBY-Scraper: Running merri_bek scraper
Starting merri_bek scraper
a tag found
download url set
No date found in the input string.
Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 98, in run_scrapers
    scraper_results = scraper_instance.scraper()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/scrapers/vic/merribek.py", line 54, in scraper
    el_name = (grandparent_el).text
              ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'
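Here `grandparent_el` was `None`, so reading `.text` failed. A rough sketch of a more defensive way to walk up the element tree follows. `nth_ancestor` and `text_or_default` are hypothetical helpers, and `SimpleNamespace` objects stand in for parsed HTML elements that expose `.parent` and `.text`:

```python
from types import SimpleNamespace


def nth_ancestor(node, levels):
    """Walk up `levels` parents; return None as soon as the chain runs out."""
    for _ in range(levels):
        if node is None:
            return None
        node = getattr(node, "parent", None)
    return node


def text_or_default(node, default=None):
    """Return node.text, or default when the node is missing."""
    return node.text if node is not None else default


# Stand-in objects with .parent and .text, mimicking parsed HTML elements.
root = SimpleNamespace(parent=None, text="Council meeting agenda")
link = SimpleNamespace(parent=root, text="Download")

name = text_or_default(nth_ancestor(link, 1))     # the parent exists
missing = text_or_default(nth_ancestor(link, 2))  # the grandparent does not
```

With this shape the scraper can decide what to do when the name is missing (log it, fall back to a default, or raise a clearer error) instead of crashing.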
  4. The last exception happened when the program tried to read the Merri-bek agenda. It looks like the agenda is an HTML page, not a PDF, so the program fails when the PDF library tries to read it. I added a try/except in main.py to fix this. It would be better to show a big error to the user for cases like this, but that is probably overkill if the exception kills the process.
Council Agenda None None https://www.merri-bek.vic.gov.au/my-council/council-and-committee-meetings/council-meetings/council-meeting-minutes/ https://www.merri-bek.vic.gov.au/my-council/council-and-committee-meetings/council-meetings/nextcouncilmeetingagenda/
2024-05-23 23:41:57,740 [INFO] YIMBY-Scraper: Link scraped! Downloading PDF...
2024-05-23 23:41:57,944 [INFO] YIMBY-Scraper: PDF downloaded!
2024-05-23 23:41:57,944 [INFO] YIMBY-Scraper: Reading PDF into memory...
Traceback (most recent call last):
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/fitz/__init__.py", line 2679, in __init__
    self.this = extra.Document_init( filename, stream, filetype, rect, width, height, fontsize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/fitz/extra.py", line 153, in Document_init
    return _extra.Document_init(filename, stream, filetype, rect, width, height, fontsize)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: code=0: no objects found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 124, in <module>
    main()
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 117, in main
    run_scrapers(args)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 102, in run_scrapers
    processor(scraper_results, scraper_instance)
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/./aus_council_scrapers/main.py", line 40, in processor
    text = read_pdf(council_name)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/dev/yimby-melb/council-meeting-agenda-scraper/aus_council_scrapers/utils.py", line 26, in read_pdf
    doc = fitz.open(f"files/{council_name}_latest.pdf")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sebastianbaker/Library/Caches/pypoetry/virtualenvs/aus-council-scrapers-ZorRvoJm-py3.12/lib/python3.12/site-packages/fitz/__init__.py", line 2686, in __init__
    raise FileDataError( MSG_BAD_DOCUMENT) from e
fitz.FileDataError: cannot open broken document
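As a complement to the try/except, the downloaded file could be checked before it is handed to the PDF library. The sketch below is one possible approach, not what the PR does: it looks for the `%PDF-` header near the start of the file. `looks_like_pdf` is a hypothetical helper, and the temporary files only simulate a real PDF and an HTML page saved with a `.pdf` name:

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file's first 1024 bytes contain the '%PDF-' header."""
    with open(path, "rb") as f:
        return b"%PDF-" in f.read(1024)


with tempfile.TemporaryDirectory() as tmp:
    # An HTML page saved under a .pdf name, as in the Merri-bek case above.
    html_path = os.path.join(tmp, "merri_bek_latest.pdf")
    with open(html_path, "wb") as f:
        f.write(b"<!DOCTYPE html><html><body>Agenda</body></html>")

    # A file that at least starts like a PDF.
    pdf_path = os.path.join(tmp, "real_latest.pdf")
    with open(pdf_path, "wb") as f:
        f.write(b"%PDF-1.7\n")

    html_ok = looks_like_pdf(html_path)
    pdf_ok = looks_like_pdf(pdf_path)

print(html_ok, pdf_ok)
```

A check like this only confirms the header, not that the document is well formed, so the try/except around the PDF read is still worthwhile.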

Checklist

[x] Run Black to format the code
[x] Run tests locally and added the cached results to the PR