wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License

ValueError: BadGzipFile #555

Open jb11 opened 2 years ago

jb11 commented 2 years ago

I am using Selenium Wire to capture requests from a streaming website. Basically, I am just grabbing the m3u8 manifests to FFmpeg the video and subtitles. This issue does not prevent my program from working, but I am trying to figure out why it occurs and how to either fix it or handle it to get rid of the exceptions. I am simply loading seleniumwire, the GeckoDriverManager (I have tried Chromium as well), and the Service, and then configuring my driver. I have disable_encoding on. I do a driver.get(URL), and it loads the browser and starts grabbing the info.

The issue is that I am getting this gzip.BadGzipFile exception every time it loads a URL. It is in another thread so I cannot catch it in a try block. I have tried threading.excepthook and the function triggers but only after the exception. From what I can tell, the webdriver is trying to decode something it thinks is a Gzip but then fails because it is not. Like I said, this does not break my program. I just want to see if there is a way to fix the exception. Below is a sample of the console output when it occurs.

Exception in thread Http2SingleStreamLayer-17:
Traceback (most recent call last):
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\net\http\encoding.py", line 62, in decode
    decoded = custom_decode[encoding](encoded)
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\net\http\encoding.py", line 151, in decode_gzip
    return gfile.read()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\gzip.py", line 301, in read
    return self._buffer.read(size)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\_compression.py", line 118, in readall
    while data := self.read(sys.maxsize):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\gzip.py", line 488, in read
    if not self._read_gzip_header():
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\gzip.py", line 436, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'{"')
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\server\protocol\http2.py", line 719, in run
    layer()
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\server\protocol\http.py", line 206, in __call__
    if not self._process_flow(flow):
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\server\protocol\http.py", line 388, in _process_flow
    get_response()
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\server\protocol\http.py", line 373, in get_response
    self.send_request_headers(f.request)
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\server\protocol\http2.py", line 389, in wrapper
    result = func(self, *args, **kwargs)
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\server\protocol\http2.py", line 609, in send_request_headers
    raise e
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\server\protocol\http2.py", line 603, in send_request_headers
    end_stream=(False if request.content or request.trailers or request.stream else True),
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\net\http\message.py", line 134, in get_content
    content = encoding.decode(self.raw_content, ce)
  File "D:\Development\PyCharm Community Edition 2022.1.2\Projects\venv\lib\site-packages\seleniumwire\thirdparty\mitmproxy\net\http\encoding.py", line 71, in decode
    raise ValueError(
ValueError: BadGzipFile when decoding b'{"sdk_ve with 'gzip': BadGzipFile('Not a gzipped file (b\'{"\')')

wkeeling commented 2 years ago

Yes, it looks as though the server is sending a resource that it has labelled as gzip when the resource is not gzipped. Out of interest, does the exception still happen if you set mitm_http2 to False? For example:

from seleniumwire import webdriver

# Disable HTTP/2 in the underlying mitmproxy engine
driver = webdriver.Firefox(
    seleniumwire_options={'mitm_http2': False}
)

jb11 commented 2 years ago

Yep, that fixed the issue and is what I was looking for. I didn't see that as an option. So it resolves the issue for this program, but in disabling that, will there be any issues that I might expect elsewhere for some other reason?

wkeeling commented 2 years ago

The option isn't documented, but Selenium Wire allows mitmproxy options to be passed by adding the mitm_ prefix to the option name (mitmproxy is the engine that Selenium Wire uses behind the scenes). This particular option disables HTTP/2 and forces the browser to use HTTP/1.1. The server is likely still sending the broken gzip file, but the HTTP/1.1 code appears not to attempt to decode the body, hence no exception is seen. You probably won't see any difference using HTTP/1.1 unless the site you're visiting requires the browser to use HTTP/2 for some specific feature.

Let's keep this issue open, as I think it would make sense to try to quieten the exception. It could probably be logged as a warning rather than an error/traceback.

jb11 commented 2 years ago

The one benefit to the error was that it slowed my page load, so even with the exception being thrown, I could find the manifest in the requests. After fixing it, it now works so quickly that the requests don't have time to fully load before my code looks for the URL. I had to add a sleep to give the page time to load, but even with that, it still seems to work much more quickly.
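
As an alternative to a fixed sleep, Selenium Wire's driver.wait_for_request() can block until a matching request has been captured. Here is a rough sketch: the helper name wait_for_manifest, the \.m3u8 pattern, and the 30-second timeout are assumptions for illustration, not values from this thread.

```python
def wait_for_manifest(driver, pattern=r"\.m3u8", timeout=30):
    """Block until a request matching `pattern` is captured; return its URL.

    `driver` is assumed to be a Selenium Wire webdriver instance.
    """
    # wait_for_request raises a TimeoutException if nothing matches in time.
    request = driver.wait_for_request(pattern, timeout=timeout)
    return request.url
```

Calling wait_for_manifest(driver) right after driver.get(URL) returns as soon as the manifest request shows up, rather than after a guessed delay.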

And yes, I think a way to quiet the exception, so that it doesn't necessarily produce output or lag the process, would be good. I was going to suggest a simple option to turn gzip decoding off, but realized that legitimate gzip files would then be ignored too.
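
One way to avoid ignoring legitimate gzip files is to check for the gzip magic number (the two bytes 0x1f 0x8b that start every gzip stream) before decoding. The sketch below only illustrates that idea using the standard library; it is not how Selenium Wire or mitmproxy decode bodies, and the function name decode_if_gzipped is hypothetical.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_if_gzipped(body: bytes) -> bytes:
    """Decompress `body` only when it starts with the gzip magic number."""
    if body[:2] != GZIP_MAGIC:
        # Mislabelled payload (e.g. plain JSON): pass it through untouched.
        return body
    return gzip.decompress(body)
```

With this check, a body such as the b'{"sdk_ve...' payload from the traceback would be returned unchanged, while genuinely gzipped bodies would still be decompressed.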