webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0
384 stars 58 forks source link

Confusing documentation around request filter #110

Closed baali closed 4 years ago

baali commented 4 years ago

Example case of filtering responses that don't have 200 status fails with:

Traceback (most recent call last):
  File "/usr/lib/python3.7/http/client.py", line 457, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 509, in readinto
    self._close_conn()
  File "/usr/lib/python3.7/http/client.py", line 411, in _close_conn
    fp.close()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 65, in close
    self.recorder.done()
  File "/home/baali/projects/pyenv/lib/python3.7/site-packages/warcio/capture_http.py", line 185, in done
    request, response = self.filter_func(request, response, self)
  File "create_warc.py", line 43, in filter_records
    if response.http_headers.get_statuscode() != '200':
AttributeError: 'RequestRecorder' object has no attribute 'http_headers'

And the object RequestRecorder doesn't have http_headers

wumpus commented 4 years ago

So the bug in the example is that the function signature is actually

filter(request, response, self)

in the code.

A potential second bug is that this final self argument is called warc_writer in test_capture_http.py, which makes me wonder if @ikreymer intended it to actually be the warc_writer object. (This 3rd argument is not tested.)

I'm happy to fix the docs, but I suspect Ilya should weigh in if the warc_writer is really intended to be the 3rd argument. It seems to be that having that would unlock some good functionality. I'm happy to write a test for it. Ilya?

baali commented 4 years ago

Hello @wumpus,

Thanks for your input. It helped me figure out what was wrong and how to fix it. Reordering function arguments got things working for me. I have submitted a small PR to update README. Hope it helps.