wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License

Integration with undetected-chromedriver not working #242

Closed. schlabrendorff closed this issue 3 years ago

schlabrendorff commented 3 years ago

Hi!

I am trying to get selenium-wire to work with undetected-chromedriver as per the docs.

Conda Environment:

name: test-env
channels:
  - defaults
dependencies:
  - ca-certificates=2021.1.19=hecd8cb5_1
  - certifi=2020.12.5=py39hecd8cb5_0
  - libcxx=10.0.0=1
  - libedit=3.1.20191231=h1de35cc_1
  - libffi=3.3=hb1e8313_2
  - ncurses=6.2=h0a44026_1
  - openssl=1.1.1j=h9ed2024_0
  - pip=21.0.1=py39hecd8cb5_0
  - python=3.9.2=h88f2d9e_0
  - readline=8.1=h9ed2024_0
  - setuptools=52.0.0=py39hecd8cb5_0
  - sqlite=3.33.0=hffcf06c_0
  - tk=8.6.10=hb0a8c7a_0
  - tzdata=2020f=h52ac0ba_0
  - wheel=0.36.2=pyhd3eb1b0_0
  - xz=5.2.5=h1de35cc_0
  - zlib=1.2.11=h1de35cc_3
  - pip:
    - blinker==1.4
    - cffi==1.14.5
    - cryptography==3.4.6
    - h11==0.12.0
    - h2==4.0.0
    - hpack==4.0.0
    - hyperframe==6.0.0
    - kaitaistruct==0.9
    - pyasn1==0.4.8
    - pycparser==2.20
    - pyopenssl==20.0.1
    - pyparsing==2.4.7
    - pysocks==1.7.1
    - selenium==3.141.0
    - selenium-wire==4.2.1
    - six==1.15.0
    - undetected-chromedriver==2.1.2
    - urllib3==1.26.3
    - wsproto==1.0.0
prefix: /usr/local/Caskroom/miniconda/base/envs/test-env

My Code:

from time import sleep

from seleniumwire import webdriver

chrome_options = webdriver.ChromeOptions()
sw_options = {}

driver = webdriver.Chrome(  # Optimized for bot detection
    options=chrome_options,
    seleniumwire_options=sw_options
)
driver.get('https://www.google.com/search?hl=en&q=test')

while True:
    sleep(1)

This results in Google redirecting me to a captcha page.

In contrast, using undetected-chromedriver directly (without Selenium Wire), e.g. with this code:

import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get('https://www.google.com/search?hl=en&q=test')

works flawlessly, as does this code:

from time import sleep

import undetected_chromedriver as uc
uc.install()

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
sw_options = {}

driver = webdriver.Chrome(  # Optimized for bot detection
    options=chrome_options,
    # seleniumwire_options=sw_options
)
driver.get('https://www.google.com/search?hl=en&q=test')

Using the last code with seleniumwire instead of selenium, i.e.:

from time import sleep

import undetected_chromedriver as uc
uc.install()

from seleniumwire import webdriver

chrome_options = webdriver.ChromeOptions()
sw_options = {}

driver = webdriver.Chrome(  # Optimized for bot detection
    options=chrome_options,
    seleniumwire_options=sw_options
)
driver.get('https://www.google.com/search?hl=en&q=test')

while True:
    sleep(1)

results in:

Traceback (most recent call last):
  File "/Users/redacted/Documents/redacted.py", line 11, in <module>
    driver = webdriver.Chrome(  # Optimized for bot detection
  File "/usr/local/Caskroom/miniconda/base/envs/test-env/lib/python3.9/site-packages/undetected_chromedriver/__init__.py", line 53, in __new__
    instance.__init__(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/test-env/lib/python3.9/site-packages/seleniumwire/webdriver.py", line 109, in __init__
    super().__init__(*args, **kwargs)
TypeError: object.__init__() takes exactly one argument (the instance to initialize)

Does anybody have a hint what I am doing wrong or is this a bug?
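For reference, the final line of the traceback can be reproduced without either library. The sketch below is only an illustration of how this kind of TypeError can arise in plain Python; the class names PatchedDriver and WrappingDriver are hypothetical stand-ins, and this is not a claim about the actual code of undetected-chromedriver or Selenium Wire.

```python
class PatchedDriver:
    """Hypothetical stand-in for a class that defines __new__ but no __init__."""

    def __new__(cls, *args, **kwargs):
        instance = object.__new__(cls)
        # Calls __init__ explicitly and forwards every argument it received.
        instance.__init__(*args, **kwargs)
        return instance


class WrappingDriver(PatchedDriver):
    """Hypothetical stand-in for a subclass that forwards its arguments upward."""

    def __init__(self, *args, **kwargs):
        # No class further up the MRO defines __init__, so this call resolves
        # to object.__init__, which rejects any extra arguments.
        super().__init__(*args, **kwargs)


def reproduce():
    """Return the TypeError message raised when constructing WrappingDriver."""
    try:
        WrappingDriver(options=None)
    except TypeError as exc:
        return str(exc)
    return None


print(reproduce())
```

If the real cause is similar, the arguments passed to webdriver.Chrome are reaching a base class that does not accept them, which would point to the class hierarchy after patching rather than to the options themselves.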

wkeeling commented 3 years ago

Thanks for raising this issue.

With the code in your first example (the code which gives you the captcha page), can you try adding the disable_capture option and see whether that makes a difference? e.g.

sw_options = {
    'disable_capture': True
}

driver = webdriver.Chrome(  # Optimized for bot detection
    options=chrome_options,
    seleniumwire_options=sw_options
)

schlabrendorff commented 3 years ago

This works (Google no longer shows a captcha), but unfortunately it is not a solution for my use case, as I need to access the response of a request made in the background.
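To make the requirement concrete, reading a background response with capture enabled might look like the sketch below. It assumes Selenium Wire's documented driver.requests list, where each request exposes url and response (which is None until a response has arrived). The helper find_response, the example URLs and the stand-in objects are hypothetical and exist only so the sketch runs without a browser.

```python
from types import SimpleNamespace


def find_response(requests, url_part):
    """Return the response of the first request whose URL contains url_part.

    Returns None if no matching request has received a response.
    """
    for request in requests:
        if url_part in request.url and request.response is not None:
            return request.response
    return None


# Stand-in objects; with Selenium Wire you would pass driver.requests instead.
fake_requests = [
    SimpleNamespace(url='https://example.com/page', response=None),
    SimpleNamespace(
        url='https://example.com/api/data',
        response=SimpleNamespace(status_code=200, body=b'{"ok": true}'),
    ),
]

response = find_response(fake_requests, '/api/data')
print(response.status_code, response.body)
```

With disable_capture set to True, no requests are captured, so a lookup like this would have nothing to search.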

wkeeling commented 3 years ago

Ok thanks. That suggests that the Google captcha is able to detect that the SSL handshake is taking place with Selenium Wire and not with the browser. When disable_capture is True Selenium Wire just passes the traffic straight through (no HTTPS decryption takes place) and as such the SSL handshake happens directly with the browser.

I'll have to investigate exactly how the SSL handshake is triggering the captcha. In the meantime in terms of workarounds, is the request that you want to access on a different domain than the Google captcha? If so, you could potentially exclude just the Google captcha request from Selenium Wire with the exclude_hosts option:

sw_options = {
    'exclude_hosts': ['google-captcha-host.com']  # Put the host of Google captcha here
}

driver = webdriver.Chrome(  # Optimized for bot detection
    options=chrome_options,
    seleniumwire_options=sw_options
)

wkeeling commented 3 years ago

Actually thinking about it, the captcha is probably not being served separately - it's probably all originating from the same domain - so the above workaround is unlikely to be useful. I'll see if I can figure out why it's triggering in the first place.

schlabrendorff commented 3 years ago

Thank you for your answer! For my use case I actually don't need to access Google. Because seleniumwire does not print anything to indicate whether it is successfully using undetected_chromedriver, I used Google as a test site. Is there another way to check whether undetected_chromedriver is being used?

wkeeling commented 3 years ago

undetected_chromedriver will print out a log message when it starts up:

INFO:undetected_chromedriver:Selenium patched. Safe to import Chrome / ChromeOptions
INFO:undetected_chromedriver:starting undetected_chromedriver.Chrome((), {'options': <selenium.webdriver.chrome.options.Options object at 0x7fa070df3048>, 'seleniumwire_options': {}, 'executable_path': './chromedriver'})

You just need to ensure that you've activated logging at the very top of your script:

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('undetected_chromedriver').setLevel(logging.INFO)

from seleniumwire import webdriver

... code ...

schlabrendorff commented 3 years ago

Thank you! That is a good option!! May I propose mentioning it in the readme? i.e.

To ensure that seleniumwire uses the patched chrome driver you can activate logging before importing from seleniumwire:

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('undetected_chromedriver').setLevel(logging.INFO)

from seleniumwire import webdriver

... code ...

wkeeling commented 3 years ago

Yes good idea - will add that, thanks.
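
For anyone who would rather check this in code than read the console, a minimal sketch follows. It assumes only the logger name undetected_chromedriver and the "Selenium patched" message text shown in the log output above; that text may differ between versions. ListHandler, capture_uc_logs and was_patched are hypothetical helper names, not part of either library.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list instead of printing them."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_uc_logs():
    """Attach a ListHandler to the undetected_chromedriver logger and return it."""
    logger = logging.getLogger('undetected_chromedriver')
    logger.setLevel(logging.INFO)
    handler = ListHandler()
    logger.addHandler(handler)
    return handler


def was_patched(handler):
    """Return True if any captured message mentions that Selenium was patched."""
    return any('selenium patched' in m.lower() for m in handler.messages)


# Call this before importing from seleniumwire, then create the driver as usual
# and check was_patched(handler) afterwards.
handler = capture_uc_logs()
```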