wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License

Failure to observe websocket data #140

lapp0 closed this issue 3 years ago

lapp0 commented 4 years ago

I'm trying to view websocket data using selenium. I'm testing out the recent changes that @wkeeling so kindly applied in https://github.com/wkeeling/selenium-wire/commit/af8247908e69a37d5e3f69a4a3c690e199953754

However, I'm a bit confused. The linked code doesn't appear to add websocket data to the response. The expected behavior is that either response.body or response.messages would contain a list of websocket messages.

Here is the code I attempted:

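    import time

    from selenium.webdriver.firefox.options import Options
    from seleniumwire import webdriver

    opt = Options()
    opt.headless = False
    seleniumwire_options = {
        'websocket': True,  # note: this does nothing because the "Upgrade" header determines whether websocket is used
    }
    driver = webdriver.Firefox(
        options=opt,
        seleniumwire_options=seleniumwire_options,
    )
    driver.implicitly_wait(10)  # seconds
    driver.get('https://colonist.io/')
    time.sleep(30)  # give the page time to open its websocket connection
    import pdb; pdb.set_trace()  # inspect driver.requests interactively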

And here is me observing a lack of websocket data available:

(Pdb) [x for x in driver.requests if x.url[-3:] == 'io/'][1]
Request(method='GET', url='https://colonist.io/', headers={'Host': 'colonist.io', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0', 'Accept': '*/*', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate, br', 'Sec-WebSocket-Version': '13', 'Origin': 'https://colonist.io', 'Sec-WebSocket-Extensions': 'permessage-deflate', 'Sec-WebSocket-Key': 'qTbihQnqpoo4U7NVcFWcsg==', 'Connection': 'keep-alive, Upgrade', 'Cookie': '__cfduid=db1f9aff8047411f454ef208e628215fd1595714153; jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VySWQiOiIxNTcxNDkwNiIsImlhdCI6MTU5NTcxNDE1MywiZXhwIjoxNTk4MzA2MTUzLCJhdWQiOiJodHRwczovL2NvbG9uaXN0LmlvLyIsImlzcyI6Imh0dHBzOi8vY29sb25pc3QuaW8vIn0.kyqjgSU7iIb2uJlwg3-d-24PeQlH2Wd-tZXCQ2Q_rsA; Indicative_a38719f2-d919-446b-b2e3-0da55a22a29a="%7B%22defaultUniqueID%22%3A%228933f9b7-04de-4698-ff40-797260252238%22%7D"', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 'Upgrade': 'websocket'}, body=b'')
(Pdb) [x for x in driver.requests if x.url[-3:] == 'io/'][1].response
Response(status_code=101, reason='Switching Protocols', headers={'Date': 'Sat, 25 Jul 2020 21:55:54 GMT', 'Connection': 'upgrade', 'Upgrade': 'websocket', 'Sec-WebSocket-Accept': 'UeZ7TvP5SmQSHHbsmgW2bkSuSQA=', 'CF-Cache-Status': 'DYNAMIC', 'cf-request-id': '042992c9d700000d725436b200000001', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '5b8920bc8a930d72-IAD'}, body=b'')
(Pdb) 

Is this code a WIP, or is there something more I'm missing?

wkeeling commented 4 years ago

The recent change to support websockets allows Selenium Wire to pass through websocket data being transferred back and forth, but Selenium Wire doesn't yet capture this data. Prior to the change, websocket connections wouldn't work at all and certain page functionality would break.

We need to look at what would be involved in capturing the websocket frames and exposing them via a suitable driver attribute.

lapp0 commented 4 years ago

I looked a bit into it. The fact that the responses are pickled presents some difficulty here. Is there any reason we need to pickle responses and save? Can't we just keep the objects in memory?

wkeeling commented 4 years ago

The idea is that all request and response data is immediately pickled to disk on capture, and Selenium Wire keeps just an index of this data in memory. It does this largely for scalability reasons as the volume of captured data could be very large, particularly if capture occurs over an extended time period.
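To make the storage model described above concrete, here is a minimal sketch of a store that pickles each captured payload to disk and keeps only a small index in memory. It is not Selenium Wire's actual implementation: the class name `DiskBackedStore` and its methods are hypothetical and exist purely for illustration.

```python
import os
import pickle
import tempfile
import uuid


class DiskBackedStore:
    """Hypothetical store: payloads live on disk, only an index stays in memory."""

    def __init__(self, base_dir=None):
        # Use a fresh temporary directory unless the caller supplies one.
        self._dir = base_dir or tempfile.mkdtemp(prefix="capture-")
        self._index = []  # (request_id, url) tuples, in capture order

    def _path(self, request_id):
        return os.path.join(self._dir, request_id + ".pkl")

    def save(self, url, payload):
        """Pickle the payload to disk immediately and record it in the index."""
        request_id = str(uuid.uuid4())
        with open(self._path(request_id), "wb") as f:
            pickle.dump(payload, f)
        self._index.append((request_id, url))
        return request_id

    def load(self, request_id):
        """Read a previously saved payload back from disk."""
        with open(self._path(request_id), "rb") as f:
            return pickle.load(f)

    def urls(self):
        """Return the captured URLs using only the in-memory index."""
        return [url for _, url in self._index]
```

A store like this writes a snapshot at the moment of capture, so anything that arrives later, such as websocket messages on a connection that stays open, is not part of what was pickled unless the file is rewritten.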

We could look at some redesign here however if it would aid websocket data capture. Did you have any ideas on an approach?

lapp0 commented 4 years ago

Because websockets can continue to receive messages after the request is "complete", I think there are three things we might do here:

1) simplest working solution: offer an "in memory mode" with no disk writing, and document that websocket messages can only be read if this mode is enabled, otherwise they'll be an empty list

Would love to hear what you think.

wkeeling commented 4 years ago

Many thanks for the proposals. I wonder whether we try solution 1) first, as that sounds like the easiest to get off the ground.

I guess we could create an InMemoryRequestStorage class which could be swapped with the current RequestStorage depending upon whether memory mode is enabled or not?
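As a rough sketch of what such an in-memory class might look like, assuming a simple save/append interface: only the class name `InMemoryRequestStorage` comes from this thread. The method names below are hypothetical and are not taken from the real RequestStorage or from the linked PR.

```python
import threading


class InMemoryRequestStorage:
    """Hypothetical in-memory storage; entries can keep growing after capture."""

    def __init__(self):
        self._lock = threading.Lock()  # the proxy and the test code may run on different threads
        self._entries = {}  # request_id -> {"request", "response", "messages"}

    def save_request(self, request_id, request):
        with self._lock:
            self._entries[request_id] = {
                "request": request,
                "response": None,
                "messages": [],
            }

    def save_response(self, request_id, response):
        with self._lock:
            self._entries[request_id]["response"] = response

    def append_message(self, request_id, message):
        # Websocket messages can arrive long after the handshake response,
        # so they are appended to the live entry rather than to a snapshot.
        with self._lock:
            self._entries[request_id]["messages"].append(message)

    def messages(self, request_id):
        # Return a copy so callers cannot mutate the stored list.
        with self._lock:
            return list(self._entries[request_id]["messages"])
```

Because the entry stays in memory, messages appended after the handshake are visible to any later read, which is the property the "in memory mode" proposal relies on.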

I'm away currently but can look at it when I get back. Or if you have time to have a play yourself and make a PR feel free.

lapp0 commented 4 years ago

> I guess we could create an InMemoryRequestStorage class

Funny enough, that's exactly what I named the class in the linked WIP PR.

lapp0 commented 4 years ago

I have changes in a PR, but I'm struggling with a bug where Request._body is not being set. I've tried to figure out what's going wrong, but couldn't. Please let me know if you have any ideas:

https://github.com/wkeeling/selenium-wire/pull/143#issuecomment-664555087

wkeeling commented 4 years ago

Many thanks @lapp0 for the PR. It will be a few days before I can take a look but I'll try and figure out why request._body is falling over.

lapp0 commented 4 years ago

Hi @wkeeling, please let me know if you get a chance to look at it.

wkeeling commented 4 years ago

@lapp0 sorry for the delay on this. I started to look at it and then wondered whether we should use the wsproto library for handling the websocket communication rather than trying to use our own implementation. The wsproto library is also what mitmproxy uses (a backend supported by Selenium Wire) and thus by using wsproto we would keep things consistent.

I'm a little pushed for time currently having recently started a new job, but I haven't forgotten and I will look at this as soon as time permits.

wkeeling commented 3 years ago

The core of Selenium Wire has been reworked and the old core thrown out, largely to address issues with performance. As a consequence we get much improved websocket handling for free - and websocket capture has been much easier to implement. Many thanks for your ideas and work on this issue initially. The current implementation stores websocket messages in memory and we may look to your original suggestions to improve this over time (e.g. writing the messages to a pickle object as they arrive).

lapp0 commented 3 years ago

I'm quite busy at the moment, but if I come back to this around the holidays, would you be open to an MR that pickles WS messages and allows access to Response.messages based on an optional argument when instantiating seleniumwire?

wkeeling commented 3 years ago

Certainly would be open to improving the storage of websocket messages, as currently they're just held in a list in memory. It would be better if they could be persisted and it would ensure they would scale.

In terms of message retrieval, the API for that is actually now in place: messages can be retrieved using request.ws_messages, where request is the originating websocket handshake request (i.e. its URL starts with wss://). The messages themselves are held in chronological order and have a from_client attribute to denote the direction in which they were sent.
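A minimal sketch of reading those messages might look like the following. The `ws_messages` and `from_client` names come from the comment above. The `content` attribute name is an assumption that this thread does not confirm, and the stand-in objects are used only so the snippet runs without a browser. With a real driver you would pass `driver.requests` instead.

```python
from types import SimpleNamespace


def collect_ws_messages(requests):
    """Return (url, direction, content) tuples for every captured websocket message."""
    collected = []
    for request in requests:
        # Non-websocket requests may have no messages (or no attribute at all).
        for message in getattr(request, "ws_messages", None) or []:
            direction = "client->server" if message.from_client else "server->client"
            # `content` is an assumed attribute name for the message payload.
            collected.append((request.url, direction, message.content))
    return collected


# Stand-in objects that mimic the attributes described above.
handshake = SimpleNamespace(
    url="wss://example.com/socket",
    ws_messages=[
        SimpleNamespace(from_client=True, content="ping"),
        SimpleNamespace(from_client=False, content="pong"),
    ],
)
plain = SimpleNamespace(url="https://example.com/", ws_messages=[])

for url, direction, content in collect_ws_messages([handshake, plain]):
    print(url, direction, content)
```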

ankurpandeyvns commented 3 years ago

@wkeeling we haven't got the functionality to send messages to the intercepted websocket connections if I'm not mistaken?

wkeeling commented 3 years ago

@ankurpandeyvns yes unfortunately it is not currently possible to send data to web socket connections, only capture the data that was sent and received. It sounds as though this is a requirement for you?

ankurpandeyvns commented 3 years ago

> @ankurpandeyvns yes unfortunately it is not currently possible to send data to web socket connections, only capture the data that was sent and received. It sounds as though this is a requirement for you?

Yeah it would have been great if the functionality was there. It's not a necessary requirement but would have been great if it was there.