wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License

Is the get method asynchronous? #119

Open eanon opened 4 years ago

eanon commented 4 years ago

Hi. When I try to read the response immediately after a GET, I often get no response ("often" because it happens every time today and I don't remember it being the case yesterday, so context may matter). I tried against different websites and the behaviour is consistent (today), regardless of how fast the website responds. Here is some test code:

import os, sys
from time import sleep
from seleniumwire import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

URL = "http://google.com/"
ff = FirefoxBinary("C:/Program Files/Mozilla Firefox/firefox.exe")
gecko = os.path.join(os.path.dirname(os.path.realpath(sys.argv[0])), "geckodriver.exe")
opts = webdriver.FirefoxOptions()
opts.add_argument("--headless")
driver = webdriver.Firefox(firefox_binary=ff, executable_path=gecko, options=opts)
driver.get(URL)

# TEST #1: Returns no response
resp = driver.last_request.response

# TEST #2: Returns response
# sleep(1)
# resp = driver.last_request.response

# TEST #3: Returns response
# req = driver.wait_for_request(URL)
# resp = req.response

print("HTTP %d" % resp.status_code if resp else "No response")
driver.quit()

So, is the get method supposed to be asynchronous, in which case the third approach is the right one? Or is this a bug?

-- Tested w/ Windows 7, Python 3.7, Selenium 3.141.0, Selenium Wire 1.1.2.

wkeeling commented 4 years ago

My understanding is that webdriver.get() is synchronous, as it tells the browser to load the URL and display the page before moving on to a subsequent action - such as finding an element in the page. Selenium Wire doesn't modify that behaviour directly at least.

Selenium Wire captures and persists the responses before they return back to the browser, so by the time get() returns, they should all be present. That said, I guess there might be some strange timing issue where the last response hasn't been persisted before you ask for it, although that would need a bit of further debugging.

I also wonder whether --headless might be having an effect. Out of curiosity, does the issue still happen if you omit the --headless setting (assuming you can)?

eanon commented 4 years ago

Thanks for your quick reply. It's still the same without the --headless option (yes, I can omit it on Windows; the final code will run on CentOS 7 without a desktop). While I was at it, I also tried minimal code against Selenium without Selenium Wire, and the issue doesn't occur.

import os, sys
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

URL = "http://google.com/"
ff = FirefoxBinary("C:/Program Files/Mozilla Firefox/firefox.exe")
gecko = os.path.join(os.path.dirname(os.path.realpath(sys.argv[0])), "geckodriver.exe")
opts = webdriver.FirefoxOptions()
opts.add_argument("--headless")
driver = webdriver.Firefox(firefox_binary=ff, executable_path=gecko, options=opts)
driver.get(URL)
print("%s" % driver.title if driver.title else "No response")
driver.quit()

-- EDIT: If I print things as below, we see that even when the HTTP code is missing from Selenium Wire, Selenium did return the title (and the content, of course). So it sounds like what you said: a timing gap between the response arriving in Selenium and Selenium Wire storing the intercepted info.

print("HTTP code from Selenium Wire %d" % resp.status_code if resp else "No stored response")
print("Title from Selenium: %s" % driver.title if driver.title else "No title")

Output:

No stored response
Title from Selenium: Google

wkeeling commented 4 years ago

Thanks. I'll see if I can reproduce and find out what's going on.

wkeeling commented 4 years ago

I guess it's possible that driver.last_request.response doesn't necessarily correspond to the response containing the HTML markup. The last request might correspond to an image, CSS or some other asset which might be retrieved asynchronously (e.g. via an Ajax request). So potentially you could receive the document markup (including the title) before the response of the last request has arrived. Anyway, I'll dig a little deeper and see if I can discover what's happening. Thanks.
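To illustrate the difference between "the last request" and "the request for the document", here is a minimal sketch. It assumes that request.path holds the full URL, as the printed output elsewhere in this thread suggests for this version of Selenium Wire; response_for_url is a hypothetical helper name, not part of the library, and it does not account for redirects.

```python
def response_for_url(driver, url):
    """Return the response of the first captured request whose path equals url.

    Returns None if no such request was captured, or if its response has
    not been stored yet.
    """
    for request in driver.requests:
        # Assumption: request.path is the full URL of the captured request.
        if request.path == url:
            return request.response
    return None
```

Calling response_for_url(driver, URL) instead of driver.last_request.response would at least rule out the case where the last captured request is an unrelated asset.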

eanon commented 4 years ago

OK, I understand your explanation. Thanks! Just to "clarify", I also ran the test against a CGI of mine that just returns my public IP (that is what I actually assigned to the URL constant in my initial test code). So there are no resources like images, CSS or JS files around it, and no subsequent Ajax calls.

wkeeling commented 4 years ago

It turns out that this was easy to reproduce - and just as you said. It seems that webdriver.get() does not completely block until the page has finished loading. That means the responses for the last few requests captured by Selenium Wire might be None when asked for, unless you add an explicit wait.

I turned logging to DEBUG and ran the following:

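import logging

from seleniumwire import webdriver

# Show Selenium Wire's "Capturing request/response" log messages
logging.basicConfig(level=logging.DEBUG)

url = 'https://www.wikipedia.org/'
driver = webdriver.Firefox()
driver.get(url)

# Print each captured request with its response (None if not yet stored)
for request in driver.requests:
    print(request.path, request.response)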

I then saw the regular "capturing" logging messages, but interestingly these were still happening when the print output came in (I've annotated for clarity):

INFO:seleniumwire.proxy.handler:Capturing request: https://www.wikipedia.org/static/apple-touch/wikipedia.png
INFO:seleniumwire.proxy.handler:Capturing request: https://www.wikipedia.org/static/favicon/wikipedia.ico
DEBUG:urllib3.connectionpool:http://127.0.0.1:32815 "POST /session/b10c6b88-9fb5-4a66-893c-ea97b372c6a5/url HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:seleniumwire.proxy.handler:http://seleniumwire/requests 200
-- Print output starts --
https://www.wikipedia.org/ 200 OK
https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png 200 OK
https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-81a290a5.svg 200 OK
https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-0875d644b5.js 200 OK
https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikinews-logo_sister.png 200 OK
https://www.wikipedia.org/static/apple-touch/wikipedia.png None
https://www.wikipedia.org/static/favicon/wikipedia.ico None
-- Print output ends --
INFO:seleniumwire.proxy.handler:Capturing response: https://www.wikipedia.org/static/apple-touch/wikipedia.png 200 OK
INFO:seleniumwire.proxy.handler:Capturing response: https://www.wikipedia.org/static/favicon/wikipedia.ico 200 OK

Note the logging statements at the top indicate where the requests for wikipedia.png and wikipedia.ico have been captured, then the print statements that show None for the response, and then lastly the logging statements where the response has been captured. So that seems to confirm that the print statement was executing before the responses had fully arrived.

There's some discussion about whether webdriver.get() fully blocks, and it looks like it may be implementation dependent to some extent from what I've read, although I haven't been able to find anything definitive.

I guess the safest approach is to use Selenium Wire's wait_for_request() or other mechanism to ensure that a response triggered by webdriver.get() has fully arrived, although that wouldn't help for knowing when all responses had fully arrived. I wonder whether the API might benefit from a requests_completed() method that would block until all responses from a previous webdriver.get() (or other action) had come back?

wkeeling commented 4 years ago

Actually now I think about it some more, a requests_completed() method wouldn't work, because it wouldn't know which requests had completed. Unless it just considered those requests that were outstanding at the point in time that the method was called.
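A rough sketch of that last idea, purely for illustration: it takes a snapshot of the requests captured at call time and polls until each of them has a response. The function name, the timeout and the poll interval are all hypothetical, and the sketch assumes that driver.requests returns requests in capture order and that nothing clears the list while waiting.

```python
import time


def wait_for_outstanding_responses(driver, timeout=10.0, poll_interval=0.2):
    """Block until every request captured so far has a response.

    Only requests already captured when this function is called are
    considered; requests that start later are ignored. Returns True if all
    of them received a response before the timeout, otherwise False.
    """
    # Snapshot: how many requests exist at the point of the call.
    snapshot_size = len(driver.requests)
    deadline = time.monotonic() + timeout
    while True:
        # Re-read the list on each pass so newly stored responses are seen.
        snapshot = driver.requests[:snapshot_size]
        if all(request.response is not None for request in snapshot):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Requests that the page fires after the call are deliberately not covered, which is the limitation described above.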

eanon commented 4 years ago

Thanks for sharing your line of thinking, wkeeling. Just an idea that came to mind: doesn't the underlying driver emit an event on page load completion (I mean, the page itself and all its assets)?

wkeeling commented 4 years ago

I'm not sure that the driver emits an event itself. There is the Javascript document.readyState but this also looks like it may be unreliable. A quick check with:

driver.execute_script('return document.readyState') 

confirmed that the "complete" state was reached before the final responses had arrived.

If you know how many requests a driver.get() will trigger, then one "workaround" might be to count the responses before permitting further execution... just an idea:


import time

del driver.requests  # Ensure everything is cleared out
driver.get(....)  # We expect 7 requests

# Poll until all 7 expected wikipedia.org requests have a response
while len([request for request in driver.requests
    if request.path.startswith('https://www.wikipedia.org')
    if request.response is not None]) != 7:

    time.sleep(0.5)

eanon commented 4 years ago

From what I've read about document.readyState in the meantime, you're right: it doesn't guarantee that any JavaScript jobs and subsequent Ajax calls have completed.

As for the number of underlying requests, it could be manageable for your own web pages (or even a simple CGI like the one I mentioned, which just returns the public IP in plain text or JSON format), but it is a bit complex to maintain; I mean any change to the page would force you to fine-tune your code... Oops!

Searching a little, I found a lot of discussions elaborating various approaches. For example, this page: https://devqa.io/webdriver-wait-page-load-example-java/ or this SO thread: https://stackoverflow.com/questions/50327132/do-we-have-any-generic-function-to-check-if-page-has-completely-loaded-in-seleni... But all of this seems very convoluted and unreliable to me.

Among the easier solutions in the "magma", I've read people talking about waiting for jQuery to load, but that assumes the website you hit uses jQuery... Not universal enough!

However, at the end of the above thread, one proposed solution looked much simpler than the others. It's pguardiario's post of Apr 3 '19 at 0:12:

Something like this should work (please excuse the python in a java answer):

idle = driver.execute_async_script("""
  window.requestIdleCallback(() => {
    arguments[0](true)
  })
""")

This should block until the event loop is idle which means all assets should be loaded.

What do you think about this? Or is there anything else in the links above that could inspire you (I may have missed a valuable point)?

wkeeling commented 4 years ago

That last solution seems to work quite nicely as far as I can tell from doing a few tests, and perhaps that might be the best workaround to the problem.

We could even potentially use that to underpin a new driver.wait_for_all_requests() API method if that would be useful. It looks like the Javascript is supported in most browsers except Safari.

Thanks for taking the time to investigate!
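As a sketch only, the workaround could be wrapped in a helper along these lines. The name wait_until_idle and the default timeout are hypothetical; set_script_timeout() and execute_async_script() are standard Selenium calls. The script checks for requestIdleCallback first, since a browser without it (Safari, as noted above) would otherwise leave the callback uncalled until the script timeout expires.

```python
IDLE_SCRIPT = """
const done = arguments[arguments.length - 1];
if ('requestIdleCallback' in window) {
    window.requestIdleCallback(() => done(true));
} else {
    done(false);  // e.g. Safari: report that no idle wait took place
}
"""


def wait_until_idle(driver, timeout=30):
    """Block until the browser reports an idle event loop.

    Returns True if requestIdleCallback fired, or False if the browser
    does not support it (in which case no waiting took place).
    """
    # execute_async_script gives up once the driver's script timeout expires.
    driver.set_script_timeout(timeout)
    return bool(driver.execute_async_script(IDLE_SCRIPT))
```

Usage would be driver.get(url) followed by wait_until_idle(driver) before reading driver.requests. If the page never becomes idle within the timeout, Selenium raises its own timeout error from execute_async_script, and an idle event loop is only an indication, not a guarantee, that every response has been stored.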

eanon commented 4 years ago

Good news :) So, as you said, it could be implemented in driver.wait_for_all_requests(), perhaps with a distinction: wait for completion of all (associated) requests when the target is a main document (i.e. explicit or implicit HTML; implicit being when the path targets a directory and relies on the default file from the server configuration), and no extra waiting when it is a simple direct resource (i.e. img, js, css, etc.).

Also, as an aside, to come back to something I said earlier in this thread: what I said is not really correct! Even if my own CGI returns a simple plain text document, the gecko driver triggers a lot of side requests by itself (e.g. tracking-protection.cdn.mozilla.net, detectportal.firefox.com, etc.). I realized this by looking at the requests Selenium Wire stored. So, even with a very simple call where a single response is expected (and even if it's possible to limit these "parasite" requests by adjusting gecko's preferences), it's necessary to handle the asynchronous nature of the set of requests...

EDIT: I'm currently working on code with several driver.get() calls in a loop, and I see that we never know whether a stored request belongs to the current iteration or the previous one (which was indeed the subject of this thread)... So, on reflection, I think you're right: it would be better to implement the wait for idle unconditionally in driver.wait_for_all_requests() (I mean, without the distinction I described above). It will be clearer and will prevent any confusion with a series of driver.get() calls (so that the developer doesn't have to remember to del driver.requests to be sure the stored requests belong to the current driver.get()).
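Until something like that exists, the loop case can be handled by hand. The following is a minimal sketch using only calls that already appear in this thread (del driver.requests, driver.get() and driver.wait_for_request()); the function name status_codes_for is hypothetical, and the sketch assumes that the request returned by wait_for_request() corresponds to the URL just loaded.

```python
def status_codes_for(driver, urls, timeout=10):
    """Load each URL in turn and return {url: status code or None}."""
    results = {}
    for url in urls:
        # Drop requests captured during the previous iteration so that
        # anything stored afterwards belongs to this driver.get().
        del driver.requests
        driver.get(url)
        # Wait for the request matching this URL, as in TEST #3 above.
        request = driver.wait_for_request(url, timeout)
        response = request.response
        results[url] = response.status_code if response else None
    return results
```

If no matching request is seen within the timeout, wait_for_request() raises rather than returning, so a caller that wants to continue with the remaining URLs would need to catch that error.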