Some issues and questions around requests & cookies

nrllh commented 3 years ago

Hi @wkeeling :-)

I think there are some issues with the requests. Here are a few points & questions:

1) driver.requests doesn't really contain all the requests. You can reproduce the issue by visiting the webpage www.internet-sicherheit.de, which embeds resources from youtube-nocookie.com via an iframe. That iframe loads some images (e.g. https://i.ytimg.com/vi/WsMv3m7t2KU/sddefault.jpg), but driver.requests doesn't seem to include them. You can also compare the number of requests from selenium-wire with the number shown in the developer panel of the native browser; you'll see that they differ. (EDIT: I just discovered driver.iter_requests(), which looks very good. But why are there differences?)
2) Is there any way to exclude requests made by the web browser itself? While launching and creating a new tab, some requests are sent to Chrome servers.
3) Is it possible to get the URL of the page that sent a request? Currently I'm doing it via driver.current_url, but that isn't a perfect solution.
4) I'm trying to decrypt the cookie values which are stored in the SQLite database in the profile directory. Calling driver.get_cookies() doesn't help me because I need all cookies that are stored while visiting foo.com. That's why I read the cookies from the Cookies file, but the values there are encrypted. Is there any way in selenium-wire to capture the full cookie value?

Thank you very much!

wkeeling commented 3 years ago

Hi @nrllh thanks for raising these points. I'll do my best to answer each one:

1) It depends on when you access driver.requests. After you call driver.get(...) (assuming you're using the normal page loading strategy), Selenium will wait until document.readyState is complete before continuing. But not all requests have necessarily completed at that point: async background requests for images, markup and other content may still be ongoing. So the contents of driver.requests won't necessarily mirror what you see in the network panel, because the network panel continues to display "live" requests, whereas driver.requests contains a snapshot taken at the moment it was accessed. Selenium Wire provides driver.wait_for_request() for this reason. This method blocks until it sees a specific request of interest. If I use that method to wait for https://i.ytimg.com/vi/WsMv3m7t2KU/sddefault.jpg then I can access the request:
```
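from seleniumwire import webdriver  # Selenium Wire's drop-in webdriver

driver = webdriver.Chrome()
driver.get('https://www.internet-sicherheit.de')

# Block until a request matching this URL has been captured
request = driver.wait_for_request(
    'https://i.ytimg.com/vi/WsMv3m7t2KU/sddefault.jpg'
)
print(request)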
```
Also note that Selenium Wire ignores (i.e. does not capture) OPTIONS requests by default - see ignore_http_methods.
2) Both Chrome and Firefox will send browser specific requests that are caught by Selenium Wire but don't show up in the network panel. Perhaps we could look at not capturing these by default since they're probably of little use to most people and just add noise? I can look at doing that in the next release. In the meantime there are other mechanisms to exclude requests - including driver.scopes, exclude_hosts and blocking requests.
3) It is not currently possible for Selenium Wire to know what URL was passed to driver.get() (I'm assuming that's what you mean?). That said, if we block browser requests in a future release as described in point 2, then the originating URL should just be driver.requests[0].url?

4) I think you should be able to use a response interceptor to intercept the headers that set the cookies? So something like:

def interceptor(request, response):  # A response interceptor takes two args
    cookie_header = response.headers['Set-Cookie']
    if cookie_header is not None:
        print(
            request.url,
            cookie_header
        )

driver.response_interceptor = interceptor
driver.get(...)

That will print the cookies that are sent by the server before the browser stores them.
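
To go from a raw header to individual cookie names and values, each Set-Cookie value can be parsed with the standard library's http.cookies module. The following is only a minimal sketch: parse_set_cookie is a hypothetical helper (not part of Selenium Wire), the header strings are made up, and it assumes the values have already been collected as plain strings inside the interceptor (for example one string per Set-Cookie header).

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_values):
    """Parse raw Set-Cookie header strings into a list of dicts.

    Hypothetical helper for illustration; not part of Selenium Wire.
    """
    cookies = []
    for raw in header_values:
        jar = SimpleCookie()
        jar.load(raw)  # one Set-Cookie header normally carries one cookie
        for name, morsel in jar.items():
            cookies.append({
                'name': name,
                'value': morsel.value,
                'domain': morsel['domain'],  # '' when the attribute is absent
                'path': morsel['path'],
            })
    return cookies


# Made-up header values, not captured from a real site
example_headers = [
    'sessionid=abc123; Path=/; HttpOnly',
    'lang=de; Domain=example.com; Path=/',
]
for cookie in parse_set_cookie(example_headers):
    print(cookie)
```

SimpleCookie is fairly strict, so malformed headers seen in the wild may be parsed only partially; keeping the raw header string alongside the parsed result is a reasonable precaution.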

nrllh commented 3 years ago

@wkeeling thank you very much for the information.

1) I visit the websites using the normal strategy. I just added time.sleep(5) after calling driver.get(), and now it seems much better. driver.wait_for_request() won't really help me because we are running a measurement on 1M websites (for a research project). My goal is to capture all requests and responses without any filtering (a full mirror of the traffic), so I also have to set the option 'ignore_http_methods': [].
2) Such requests generate a lot of noise (I saw up to 7 requests). Excluding them outright won't work in our case, because some websites may genuinely send such requests during a visit. It'd be very nice to get only the requests and responses made by the visited website (in our case, of course; I assume you'd add another option for that).
3) I think driver.requests[0].url won't help to get the URL that the requests originated from. It would give me the first request sent when the driver started, not the originating URL, especially if there are redirections during the visit.
4) I think the interceptor may not see all cookies in the cookie jar, for example if there are no new responses after a cookie is set. I think I have to decrypt the cookies directly to read all cookies that were stored during the visit (I'll look at this and this).
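
A fixed time.sleep(5) either wastes time on fast pages or cuts off slow ones, which adds up over a large crawl. One possible alternative, sketched here under assumptions, is to poll the number of captured requests and continue once it has stopped changing for a short period. wait_until_quiet and all of its parameters are hypothetical names, not Selenium Wire API; the clock and sleep arguments exist only so the function can be exercised without real waiting.

```python
import time


def wait_until_quiet(get_count, quiet_period=2.0, timeout=30.0, poll=0.25,
                     clock=time.monotonic, sleep=time.sleep):
    """Block until get_count() has not changed for quiet_period seconds.

    Returns True if the count went quiet, False if timeout elapsed first.
    Hypothetical helper for illustration; not part of Selenium Wire.
    """
    start = clock()
    last_count = get_count()
    last_change = start
    while True:
        now = clock()
        if now - last_change >= quiet_period:
            return True   # no new requests for a full quiet period
        if now - start >= timeout:
            return False  # traffic never settled within the time budget
        sleep(poll)
        count = get_count()
        if count != last_count:
            last_count = count
            last_change = clock()


# Intended use (assumes an existing Selenium Wire driver):
#   driver.get(url)
#   wait_until_quiet(lambda: len(driver.requests))
```

Pages that poll a server continuously never go quiet, so the timeout is what bounds the wait in that case.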

wkeeling commented 3 years ago

Thanks @nrllh

1) Glad that you're seeing better results after using time.sleep().

2) Can you not make use of driver.scopes to restrict request capture to just the website in question - for example:

driver.scopes = [
    '.*internet-sicherheit.de.*'
]
driver.get(...)

That will eliminate any requests that are not destined for internet-sicherheit.de - e.g. browser specific requests.

3) Yes I guess driver.requests[0] isn't reliable as it might return redirections as you mention. May require some further thought to see if anything is possible here.

4) Understood. Hopefully those GitHub repos will help you accomplish retrieving the stored cookies.

nrllh commented 3 years ago

No, this won't help because we really want to capture all requests, so it would defeat our goal. I'm really looking forward to the next release.
I'll get back to you when I find a solution; we could then integrate it into this project as well.

wkeeling commented 3 years ago

OK thanks.

So just to be clear regarding 2), we're talking about the "hidden" requests the browser makes (hidden meaning that they don't appear in the network tools panel)? For example when I open a new tab:

https://www.google.com/async/newtab_ogb?hl=en-GB&async=fixed:0
https://www.google.com/async/ddljson?async=ntp:2
https://www.google.com/async/newtab_promos
https://www.gstatic.com/og/...
https://apis.google.com/_/s...
... etc ...

I'll need to find a complete list of these before I can exclude them in Selenium Wire. It's possible that different Chrome versions might use slightly different URLs. I'll investigate and see what's possible.
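
Until such a list exists, one option that keeps every request in the capture is to tag likely browser requests after the fact instead of excluding them. This is only a sketch: the prefixes are the visible parts of the example URLs above, the list is incomplete and may vary by Chrome version, and BROWSER_URL_PREFIXES, is_browser_request and split_requests are hypothetical names.

```python
# Assumption: prefixes derived from the example URLs in this thread.
# The list is incomplete and may differ between Chrome versions.
BROWSER_URL_PREFIXES = (
    'https://www.google.com/async/',
    'https://www.gstatic.com/og/',
    'https://apis.google.com/_/',
)


def is_browser_request(url, prefixes=BROWSER_URL_PREFIXES):
    """Return True if the URL starts with one of the known browser prefixes."""
    return url.startswith(tuple(prefixes))


def split_requests(urls):
    """Partition URLs into (site_urls, browser_urls) without dropping any."""
    site, browser = [], []
    for url in urls:
        (browser if is_browser_request(url) else site).append(url)
    return site, browser


captured = [
    'https://www.google.com/async/newtab_promos',
    'https://www.internet-sicherheit.de/',
]
site_urls, browser_urls = split_requests(captured)
print('site:', site_urls)
print('browser:', browser_urls)
```

With a Selenium Wire driver the input would be something like [r.url for r in driver.requests]. Prefix matching cannot tell a browser-initiated request from a website that happens to load the same URL, so the tags are a heuristic only.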

nrllh commented 3 years ago

Yes, that's right. I don't know whether there is a dynamic solution for 2). I tried excluding all requests captured before the URL passed to driver.get() is requested, but it seems that Chrome still sends hidden requests even after the visit to the URL has started.

wkeeling commented 2 years ago

Filtering out hidden browser requests (e.g. Chrome and Firefox telemetry/phone home requests) is currently not being planned as part of core library functionality. It's up to client code to add rules to driver.scopes to exclude such requests if they wish to do so.

For example, to exclude requests to https://accounts.google.com/... that Chrome makes you would add the following rule:

driver.scopes = [
    r'^(?!.*accounts\.google\.com)'
]
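
To exclude several hosts this way, the hosts can be folded into a single negative lookahead. A minimal sketch, assuming that each entry in driver.scopes is a regex matched from the start of the URL and that a request is captured when any entry matches; under that assumption, two separate exclusion patterns would each let the other's host through, so one combined pattern is needed. exclusion_scope is a hypothetical helper, not part of Selenium Wire.

```python
import re


def exclusion_scope(hosts):
    """Build one regex that matches only URLs containing none of the hosts.

    Hypothetical helper for illustration; not part of Selenium Wire.
    """
    alternatives = '|'.join(re.escape(host) for host in hosts)
    return rf'^(?!.*(?:{alternatives}))'


pattern = exclusion_scope(['accounts.google.com', 'www.gstatic.com'])
print(pattern)

# Check the pattern locally before handing it to driver.scopes
for url in ('https://accounts.google.com/',
            'https://www.internet-sicherheit.de/'):
    print(url, 'captured' if re.match(pattern, url) else 'excluded')

# Intended use (assumes an existing Selenium Wire driver):
#   driver.scopes = [pattern]
```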

wkeeling / selenium-wire

Some issues and questions around requests & cookies #240