wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License
1.9k stars, 254 forks

selenium-wire very slow on Windows 10 #65

Closed nayanamana closed 3 years ago

nayanamana commented 5 years ago

I am using selenium-wire (version 1.0.8) on Windows 10, and it appears to be very slow. For example, the GET request for https://www.cnn.com takes more than 2 minutes to complete.

Do you know what could be the issue and how I can resolve it?

...
from seleniumwire import webdriver  # Import from seleniumwire
....
profile = FirefoxProfile(profile_ff_work)
profile.accept_untrusted_certs = True
profile.assume_untrusted_cert_issuer = True
profile.set_preference("app.normandy.startupRolloutPrefs.network.cookie.cookieBehavior", 0)
firefox_driver = r'C:\install\drivers\geckodriver-v0.24.0-win64\geckodriver.exe'
executable_path = firefox_driver
driver = webdriver.Firefox(firefox_binary=ff_binary, executable_path=executable_path, firefox_profile=profile)
driver.get('https://www.cnn.com')

for request in driver.requests:
    if request.response:
        print(
            request.path,
            request.response.status_code,
            request.response.headers
        )

wkeeling commented 5 years ago

Thanks for raising this. I'll see if I can reproduce the issue with your configuration and I'll let you know what I find.

wkeeling commented 5 years ago

I've attempted to reproduce this on Windows 10 using Firefox 69, version 0.24.0 of geckodriver and the latest version of Selenium Wire (1.0.9). On my machine, Selenium Wire was averaging 11 seconds to fully load https://www.cnn.com, whereas Selenium itself was averaging 3 seconds. The slower load time is to be expected due to the request/response capture that Selenium Wire performs (and the cnn.com homepage seems to trigger a particularly large number of requests for embedded resources, advertising etc.). However, I'm not seeing the 2+ minute response times that you are observing.

What kind of response time do you see if you run https://www.cnn.com directly through selenium? Also, is there anything in profile_ff_work that could be affecting performance? Could you run the test with a barebones profile?

nayanamana commented 5 years ago

Thanks for looking into it. I use the same versions of geckodriver and seleniumwire (Firefox + Windows 10), and with a barebones profile it still takes 1.5-2 minutes for cnn.com to load. With plain selenium, however, it only takes 3 seconds. I initially thought the problem was with my environment, but plain selenium on the same machine takes just 3 seconds to load cnn.com.

For your information, below is the script I use:


#!/C:\Program Files\Python37\python.exe

import os, sys
import json, datetime, time

from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

from seleniumwire import webdriver

# from selenium import webdriver

ff_install_path = r'C:\Program Files\Mozilla Firefox\firefox.exe'
firefox_driver = r'C:\install\drivers\geckodriver-v0.24.0-win64\geckodriver.exe'

ff_binary = FirefoxBinary(ff_install_path)

executable_path = firefox_driver

driver = webdriver.Firefox(firefox_binary=ff_binary, executable_path=executable_path,
                           seleniumwire_options={'verify_ssl': False})

print("START TIME: " + str(datetime.datetime.now()))
driver.get('https://www.cnn.com')

print("END TIME: " + str(datetime.datetime.now()))

driver.quit()
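
For anyone repeating this comparison, a minimal sketch of turning the START/END timestamps into an averaged elapsed time, in the spirit of the 11 second and 3 second averages quoted earlier. The helper name average_load_time is hypothetical and not part of Selenium or Selenium Wire; it only times whatever callable it is given.

```python
import statistics
import time


def average_load_time(load, runs=3):
    """Call `load` `runs` times and return the mean elapsed time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic clock suited to measuring intervals
        load()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


# Hypothetical usage with an already-created driver:
# print(average_load_time(lambda: driver.get('https://www.cnn.com')))
```

Repeated loads in the same browser session may be served partly from cache, so it is probably fairer to create a fresh driver for each run when comparing plain Selenium against Selenium Wire.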

wkeeling commented 5 years ago

Thanks for the update.

When Selenium Wire is loading the site, are you able to bring up Windows task manager and watch what's going on with the processes and CPU? It would be interesting to understand if there is a particular process that is consuming 100% CPU and causing the slow down you are seeing.

nayanamana commented 5 years ago

With Selenium Wire, total CPU usage peaks at only 65%. The process with the highest CPU usage is Firefox (13% on average). This is the same observation as with plain Selenium.

wkeeling commented 5 years ago

Thanks. Could you also try disabling capture of GET and POST requests using the ignore_http_methods option, for example:

driver = webdriver.Firefox(
    firefox_binary=ff_binary, 
    executable_path=executable_path, 
    seleniumwire_options={'verify_ssl': False, 
                          'ignore_http_methods': ['GET', 'POST', 'OPTIONS']}
)

That may give a clue as to whether the capture process is causing the problem.

Also, if you get a chance, could you try using a site that does not use https - for example http://web.mit.edu/ (or any others) and see how they behave?

nayanamana commented 5 years ago

With the ignore_http_methods option, www.cnn.com still takes just as long to load. However, the HTTP site you mentioned takes only 2 seconds with Selenium Wire.

wkeeling commented 5 years ago

Ok thanks. It sounds like the issue may be related to the underlying SSL interception, possibly something to do with openssl (openssl is bundled with the Windows version of Selenium Wire).

I think at this point we'd need to step through using a debugger to see which line of code is causing the problem. Would you be in a position to do that?

nayanamana commented 5 years ago

If you let me know the steps I can try..

wkeeling commented 5 years ago

Are you comfortable with using a Python editor such as PyCharm?

nayanamana commented 5 years ago

I use Visual Studio, but I can install PyCharm.

wkeeling commented 5 years ago

Ok well for PyCharm the steps would basically be something like this:

  • Clone the repo with git clone https://github.com/wkeeling/selenium-wire.git
  • Start PyCharm and open the project you just cloned with File > Open... > select the selenium-wire folder
  • Navigate to tests > acceptance.py in the left hand tree and double-click to edit it
  • In the first test method test_firefox_can_access_requests, change the url to be https://www.cnn.com
  • Right click the test method and select Run...
  • The test should run and should reproduce the performance problem

Once the test completes, try setting a break point and then running again:

  • In PyCharm, navigate to seleniumwire/proxy/proxy2.py in the left hand tree, double-click to open
  • Try setting a break point just inside the do_GET() method. Do this by clicking on the left hand margin next to the line number and a red dot should appear
  • Go back to acceptance.py and right-click the test method again, but this time select 'Debug...'
  • The test should run and should drop you onto the break point. From there you can use the Step Over button in the Debug panel at the bottom (or press F8 with PyCharm's default keymap) to step over each line of code. As you step over each line, you may find that one particular line causes the debugger to pause for a very long time. This may give some clues as to what's causing the problem.

Really appreciate your offer of help on this one. Getting to the bottom of the issue would really help, especially if it results in a fix. Let us know how you get on!

idxn commented 5 years ago

@wkeeling I can confirm selenium-wire is very slow on Windows 10. I needed to increase the request timeout to get the test for #69 to pass on my Windows 10 machine. I tried debugging and stepping over each line but could not find anything suspicious.

nayanamana commented 5 years ago

I will try these steps, but I need some time as I am busy with another project.

wkeeling commented 5 years ago

@idxn thanks - seems like it may be a general problem. Are you using any upstream proxy or is there anything special about your setup?

wkeeling commented 5 years ago

Thanks @nayanamana. From what @idxn has tried, it seems you may not find anything obvious, but if you do notice anything, let us know.

idxn commented 5 years ago

@wkeeling No, I do not have an upstream proxy, but I noticed that the target URL returns a 301. The failed test I got is for python.org, which returns a 301.

idxn commented 5 years ago

cnn also returns a 302.

idxn commented 5 years ago

I think this may be the cause. I have randomly checked the URLs in test_client, and the others seem to return only 200 without a redirect.

idxn commented 5 years ago

@wkeeling Do you have any other suspects? What should we do to fix the issue?

wkeeling commented 5 years ago

@idxn thanks for looking into it. The redirect however doesn't seem to make any difference to performance on my Windows 10 machine, so I'm not sure that the redirect is the underlying cause. The tests run fine regardless.

I think the issue is probably something to do with the ssl connection wrapping because it seems that non-https sites (e.g. http://web.mit.edu/) run fine?

Without a local reproducible test case I can only guess what the issue might be. I have another older Windows 10 machine so I will see if I can reproduce the issue there. If not, I'm going to be relying on somebody else who does have the issue to do a bit of debugging on this one.

idxn commented 5 years ago

@wkeeling Could you please point me to the line or method you think might be the issue? I'll try to look into it.

wkeeling commented 5 years ago

@idxn I think I would start by stepping through the do_GET() function in proxy2.py with the debugger, looking closely at the lines that deal with outbound requests e.g.

https://github.com/wkeeling/selenium-wire/blob/4e36c91cb06222024fbe76c5b962d4e7487ee894/seleniumwire/proxy/proxy2.py#L107

and seeing whether these lines are particularly slow to respond. It may end up being an unrelated line of code that's causing the issue. At this point it's going to need a bit of exploratory debugging unfortunately.
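
Since the slowdown is reported as intermittent, timing suspect calls can complement single-stepping. Below is a minimal sketch of a decorator that reports any call slower than a threshold. The name timed and its parameters are hypothetical and not part of Selenium Wire; applying it to do_GET() or to an outbound-request helper would mean editing a local checkout.

```python
import functools
import time


def timed(threshold=0.5, log=print):
    """Wrap a function and report any call that takes `threshold` seconds or more."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised
                elapsed = time.perf_counter() - start
                if elapsed >= threshold:
                    log(f"{func.__name__} took {elapsed:.3f}s")
        return wrapper
    return decorator
```

Because it only logs slow calls, a wrapper like this can be left in place across many runs until the intermittent slow case shows up.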

idxn commented 5 years ago

It is quite hard to debug. It happens sometimes but sometimes not :( Well, I will just post my Python environment here then for you to replicate the issue.

Python 3.7.0
Other package versions: pip_pkg.txt

Will try again and keep you posted.

wkeeling commented 5 years ago

Thanks. I'll also have another go at reproducing on a different machine.

wkeeling commented 4 years ago

The latest release of Selenium Wire (1.2.1) uses connection keep-alive by default. Previously Selenium Wire was creating new connections for each HTTP request - which was inefficient and was degrading performance.

@idxn @nayanamana You may have found a workaround/alternative solution by now, but if you're in a position to test version 1.2.1 on Windows 10 let me know how it goes.
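
To illustrate the keep-alive point in general terms, here is a small self-contained sketch using only the standard library. It is not Selenium Wire's implementation; CountingHandler and count_connections are hypothetical names. It starts a local HTTP/1.1 server and counts how many TCP connections the same number of requests opens with and without connection reuse.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 allows persistent connections
    connections = 0  # TCP connections accepted so far

    def setup(self):
        # setup() runs once per accepted connection, not once per request
        super().setup()
        type(self).connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(requests, reuse):
    """Send `requests` GETs to a local server and return the connections opened."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(requests):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
            if not reuse:
                conn.close()
        if reuse:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections
```

With reuse enabled the requests share a single connection, and without it each request opens its own. For HTTPS sites every extra connection also costs a TLS handshake, which is where most of the saving would be expected.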

wkeeling commented 4 years ago

One other thought I have: it's possible that Windows antivirus (e.g. Windows Defender) is intercepting the execution of openssl.exe which gets run by Selenium Wire for SSL based sites. For sites that contain a lot of external assets, openssl.exe will be run multiple times and all the antivirus checks could hugely increase the load time.

Assuming you're using an antivirus program, try adding an exception for openssl.exe (you can do this in Windows Defender I think). I'd be interested to see whether that improves things. I'll also see if I can reproduce based on this theory.
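
One rough way to test this theory is to measure how long a short-lived process takes to start and exit, with and without the antivirus exception in place. The sketch below is only an approximation: spawn_overhead is a hypothetical helper, and by default it launches the Python interpreter as a stand-in because the exact openssl.exe command line used by Selenium Wire is not shown in this thread.

```python
import subprocess
import sys
import time


def spawn_overhead(runs=5, command=None):
    """Return the mean time in seconds to start and finish a short-lived process."""
    # Stand-in command: a Python interpreter that exits immediately
    command = command or [sys.executable, "-c", "pass"]
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(command, check=True)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    print(f"mean spawn time: {spawn_overhead():.3f}s")
```

If each launch costs a sizeable fraction of a second, a page that triggers hundreds of HTTPS requests could plausibly add minutes to the load time, which would be consistent with the timings reported above.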

wkeeling commented 3 years ago

Selenium Wire no longer relies on openssl.exe, and the core of the library has been reworked to improve overall performance. Closing this issue.