Decoding response body - Githubissues

wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.

MIT License

Decoding response body #189

Closed: ksmeeks0001 closed this issue 3 years ago

ksmeeks0001 commented 3 years ago

I need to decode the response bodies in order to parse the JSON. I'm getting UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

# get the ajax calls to price spider
for request in browser.requests:
    if request.response and request.url[-3:] not in ['png', 'gif', 'jpg']:
        print(request.url,
            # response body is a bytes object that needs decoded to a string
              request.response.body.decode('utf-8')
              )

What can I do to get the ajax response as a string?

pawanpaudel93 commented 3 years ago

@ksmeeks0001 Maybe the response data is gzip-compressed. If it's gzip-compressed, you have to use gzip.decompress(response.body).decode("utf-8"). You could wrap the decompress in a try-except and, if it raises, fall back to a plain decode without decompressing.
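The try-except approach suggested above could look roughly like the sketch below. It uses only the standard library; the helper name body_to_text is made up for illustration and is not part of Selenium Wire, and it assumes the body is either gzip-compressed or plain text in the given encoding.

```python
import gzip
import zlib


def body_to_text(body: bytes, encoding: str = "utf-8") -> str:
    """Decode a response body, gunzipping it first if it is gzip-compressed.

    Hypothetical helper: not part of Selenium Wire.
    """
    try:
        # Raises gzip.BadGzipFile (an OSError subclass) when the gzip magic
        # bytes are missing, i.e. when the body is not gzip data at all.
        body = gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # Not gzip (or damaged gzip): fall back to decoding the raw bytes.
        pass
    return body.decode(encoding)


compressed = gzip.compress('{"price": 42}'.encode("utf-8"))
print(body_to_text(compressed))        # gzip-compressed body
print(body_to_text(b'{"price": 42}'))  # plain body, decoded directly
```

In the loop from the question, this would replace the direct call to request.response.body.decode('utf-8').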

wkeeling commented 3 years ago

@ksmeeks0001 adding to @pawanpaudel93's comment, you can ask the server to disable compression with the disable_encoding option:

from seleniumwire import webdriver

options = {
    'disable_encoding': True  # Ask the server not to compress the response
}
driver = webdriver.Firefox(seleniumwire_options=options)

Also, before you attempt to decode the body, you need to make sure that it actually contains text rather than binary data such as an image. You should probably check the Content-Type header first:

for request in browser.requests:
    if request.response and request.url[-3:] not in ['png', 'gif', 'jpg']:
        print(request.response.headers.get('Content-Type'))
        if request.response.headers.get('Content-Type', '').startswith('application/json'):
            print(request.url,
                # response body is a bytes object that needs decoded to a string
                  request.response.body.decode('utf-8')
                  )
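As a possible refinement of the Content-Type check above, the sketch below also honours a charset parameter in the header instead of always assuming UTF-8. It is standard-library only; decode_if_json is a hypothetical helper name, and falling back to UTF-8 when no charset is given is an assumption.

```python
import json
from email.message import Message


def decode_if_json(body: bytes, content_type: str):
    """Return the parsed JSON if the header says JSON, otherwise None.

    Hypothetical helper: not part of Selenium Wire.
    """
    # Let the standard library parse the header value and its parameters.
    msg = Message()
    msg["Content-Type"] = content_type or ""
    if msg.get_content_type() != "application/json":
        return None
    # Assumption: default to UTF-8 when the server sends no charset.
    charset = msg.get_content_charset() or "utf-8"
    return json.loads(body.decode(charset))


print(decode_if_json(b'{"price": 42}', "application/json; charset=utf-8"))
print(decode_if_json(b"\x89PNG", "image/png"))
```

With Selenium Wire, the two arguments would come from request.response.body and request.response.headers.get('Content-Type', '').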

Ylodi commented 3 years ago

This problem occurs in version 3.0.2. Version 2.1.2 works without problems. request.response.body is b'' in 3.0.2.

wkeeling commented 3 years ago

b'' is a valid byte string and b''.decode('utf-8') shouldn't cause a problem.

I suspect this issue is happening because automatic body decoding has been switched off in 3.0.2 - but you can still enable it manually with the disable_encoding option as described above. I'll look at switching body decoding back on again if it's causing issues.

wkeeling commented 3 years ago

Version 3.0.3 is now released, with automatic content decoding reinstated.

Ylodi commented 3 years ago

That doesn't seem to be the problem, because version 3.0.3 doesn't work either.

ksmeeks0001 commented 3 years ago

@wkeeling, yes, the disable_encoding option was exactly what I needed. Thank you.

wkeeling commented 3 years ago

@Ylodi are you able to share your code? I think something else is perhaps happening.

Ylodi commented 3 years ago

Python 3.8.6 (Linux). Example code:

import json
from seleniumwire import webdriver as wire

def test_json_decode():
    driver = wire.Chrome()

    driver.get('https://gurushots.com/challenge/peaceful7/rank/top-photographer')

    request = driver.wait_for_request(
        '/rest/get_top_photographer',
        30
    )

    data = json.loads(request.response.body.decode('utf-8'))

    driver.close()

test_json_decode()

wkeeling commented 3 years ago

Thanks @Ylodi. The issue is due to an OPTIONS request made by Chrome just before it makes the real request. The response to the OPTIONS request has a zero-byte body, and Selenium Wire captures that and returns it from driver.wait_for_request(). To fix this, add the ignore_http_methods option:

options = {
    'ignore_http_methods': ['OPTIONS']
}
driver = wire.Chrome(seleniumwire_options=options)

In versions before v3.0.0, Selenium Wire filtered out OPTIONS requests by default, but that was also causing issues for some people, so from v3.0.0 onwards Selenium Wire captures all requests, including OPTIONS. Given that OPTIONS requests are largely useless, perhaps it would be better to revert to filtering them by default and make that clearer in the docs.
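If changing the options is not convenient, a similar effect could be had by skipping OPTIONS requests by hand when looking through driver.requests. The sketch below is only an illustration: first_real_response is a hypothetical helper, and the SimpleNamespace objects stand in for captured requests so that the example runs without a browser. It relies only on the method, url, response and body attributes of captured requests.

```python
from types import SimpleNamespace


def first_real_response(requests, path_fragment):
    """Return the first matching request that is not a preflight and has a body.

    Hypothetical helper: not part of Selenium Wire.
    """
    for request in requests:
        if path_fragment not in request.url:
            continue
        if request.method == "OPTIONS":
            # Preflight request: its response has a zero-byte body.
            continue
        if request.response is None or not request.response.body:
            continue
        return request
    return None


# Stand-ins for captured requests, in the order Chrome would send them.
url = "https://example.com/rest/get_top_photographer"
captured = [
    SimpleNamespace(method="OPTIONS", url=url, response=SimpleNamespace(body=b"")),
    SimpleNamespace(method="POST", url=url, response=SimpleNamespace(body=b'{"ok": true}')),
]
match = first_real_response(captured, "/rest/get_top_photographer")
print(match.method, match.response.body)
```

With a real driver, the first argument would be driver.requests, checked after the page has finished making its calls.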

Ylodi commented 3 years ago

Thanks, it's working now when OPTIONS requests are ignored.