wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License

Webdriver does not work in headless mode on AWS Lambda #193

Closed DanielGelfand closed 3 years ago

DanielGelfand commented 3 years ago

Chromedriver and Chromium Binary Version 86.0.4240.111

Message: unknown error: net::ERR_CONNECTION_CLOSED (Session info: headless chrome=86.0.4240.111) : WebDriverException

If I ran with the Selenium webdriver and without the seleniumwire_options, the code is able to run. Any advice for getting selenium-wire to work on aws lambda?


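import os

# seleniumwire's webdriver is required: the seleniumwire_options argument used below is not accepted by plain selenium
from seleniumwire import webdriver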

def lambda_handler(event,context):

    print("Over here!")
    options = webdriver.ChromeOptions()
    lambda_options = [
            '--autoplay-policy=user-gesture-required',
            '--disable-background-networking',
            '--disable-background-timer-throttling',
            '--disable-backgrounding-occluded-windows',
            '--disable-breakpad',
            '--disable-client-side-phishing-detection',
            '--disable-component-update',
            '--disable-default-apps',
            '--disable-dev-shm-usage',
            '--disable-domain-reliability',
            '--disable-extensions',
            '--disable-features=AudioServiceOutOfProcess',
            '--disable-hang-monitor',
            '--disable-ipc-flooding-protection',
            '--disable-notifications',
            '--disable-offer-store-unmasked-wallet-cards',
            '--disable-popup-blocking',
            '--disable-print-preview',
            '--disable-prompt-on-repost',
            '--disable-renderer-backgrounding',
            '--disable-setuid-sandbox',
            '--disable-speech-api',
            '--disable-sync',
            '--disk-cache-size=33554432',
            '--hide-scrollbars',
            '--ignore-gpu-blacklist',
            '--ignore-certificate-errors',
            '--metrics-recording-only',
            '--mute-audio',
            '--no-default-browser-check',
            '--no-first-run',
            '--no-pings',
            '--no-sandbox',
            '--no-zygote',
            '--password-store=basic',
            '--use-gl=swiftshader',
            '--use-mock-keychain',
            '--single-process',
            '--headless']

    # options.add_argument('--disable-gpu')
    for argument in lambda_options:
        options.add_argument(argument)
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
    options.add_argument(f'user-agent={user_agent}')

    options.binary_location = os.getcwd() + "/headless-chromium"

    selenium_options = {
            'request_storage_base_dir': '/tmp' 
        }
    print("before driver")
    driver = webdriver.Chrome(executable_path=os.getcwd() + "/chromedriver",chrome_options=options,
                              seleniumwire_options=selenium_options)
    print("got driver")
    driver.get("https://www.google.com")
    print(driver.page_source)

wkeeling commented 3 years ago

Thanks for this. If Chrome responded with an ERR_CONNECTION_CLOSED, that would suggest that Selenium Wire's internal proxy server died, which should have thrown some errors to the log/console. Are you able to post the contents of the log or the console output? It may be worth first enabling logging if it's not already enabled:

import os

import logging
logging.basicConfig(level=logging.DEBUG)

def lambda_handler(event,context):
    ...
DanielGelfand commented 3 years ago

Now the above code runs on AWS Lambda but the driver cannot go anywhere. If I try to grab the page source of any web page, it prints <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

wkeeling commented 3 years ago

Can you see anything in the log/console - any messages, tracebacks etc?

DanielGelfand commented 3 years ago

None. The only thing I know is that if I use the regular Selenium webdriver I am able to get the page source. With selenium-wire, it unfortunately fails.

DanielGelfand commented 3 years ago

Update: Webdriver is now able to go to webpages when '--proxy-bypass-list=*' argument is added. However, this prevents me from using my proxy. Any advice?

wkeeling commented 3 years ago

Yes, --proxy-bypass-list will also bypass Selenium Wire's embedded proxy, so it won't be able to intercept requests.

Without seeing an error it's hard to know what's happening. It's possible that the certificate generation may be failing. Does the environment you're running in have OpenSSL installed?
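One way to answer the OpenSSL question from inside the function is to log what the runtime can actually see. This is a minimal sketch, not something from the thread: `openssl_report` is a hypothetical helper name, and it assumes that both the library Python was built against and the `openssl` executable on PATH are worth checking, since the thread does not say how Selenium Wire locates OpenSSL.

```python
import shutil
import ssl
import subprocess


def openssl_report():
    """Collect basic facts about the OpenSSL visible to this process (hypothetical helper)."""
    report = {
        # Version of the OpenSSL library the Python ssl module is linked against.
        "python_ssl": ssl.OPENSSL_VERSION,
        # Full path of the openssl executable on PATH, or None if there is none.
        "cli_path": shutil.which("openssl"),
        "cli_version": None,
    }
    if report["cli_path"]:
        # Ask the command line tool for its own version string.
        result = subprocess.run(
            [report["cli_path"], "version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        report["cli_version"] = result.stdout.strip()
    return report


if __name__ == "__main__":
    print(openssl_report())
```

Calling `openssl_report()` at the top of `lambda_handler` and printing the result would put the versions in the function's log output.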

DanielGelfand commented 3 years ago

Yes, the lambda function has an OpenSSL layer.

DanielGelfand commented 3 years ago

OpenSSL version is OpenSSL 1.0.2k-fips 26 Jan 2017.

Debug mode shows that the proxy was created:

seleniumwire.proxy.backend - INFO - Created proxy listening on 127.0.0.1:35587

Any advice?

wkeeling commented 3 years ago

OK thanks. So it looks like the proxy is being created, but failing at the point where it's trying to capture the request and then killing the connection to Chrome. Maybe a silly question, but can you access /tmp without any issues (I notice that's where the request storage dir is pointing)?
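A quick way to confirm the /tmp question from inside the handler is to create, write and read back a throwaway file there. This is a minimal sketch using only the standard library; `can_write` is a hypothetical helper name and not part of Selenium Wire.

```python
import tempfile


def can_write(directory):
    """Return True if a file can be created, written and read back in `directory`."""
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, mode="w+") as handle:
            handle.write("probe")
            handle.flush()
            handle.seek(0)
            return handle.read() == "probe"
    except OSError:
        # Covers a missing directory, missing permissions and a full disk.
        return False


if __name__ == "__main__":
    print("/tmp writable:", can_write("/tmp"))
```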

One other thing to try if you can, is to switch the backend to mitmproxy. This uses different code to capture and proxy the request, so it may not suffer from the same problem. Even if it does, it may yield some new clues as to what's going on. Not sure how easy that would be in the Lambda environment though? You'd need to install mitmproxy with pip install mitmproxy and then set the backend option to mitmproxy in your Selenium Wire options.
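The backend switch described above might look like the following sketch. The `backend` key and the `mitmproxy` value are taken from the description in this comment and apply to the Selenium Wire versions discussed here; the helper names `build_seleniumwire_options` and `create_driver` are hypothetical, and the `webdriver.Chrome` arguments simply mirror the original report.

```python
def build_seleniumwire_options(storage_dir="/tmp", use_mitmproxy=True):
    """Build the Selenium Wire options dict (hypothetical helper)."""
    # /tmp is used for request storage, as in the original report.
    options = {"request_storage_base_dir": storage_dir}
    if use_mitmproxy:
        # Assumes mitmproxy was installed beforehand with: pip install mitmproxy
        options["backend"] = "mitmproxy"
    return options


def create_driver(chrome_options, driver_path):
    """Create a Chrome driver that captures requests through the chosen backend."""
    # Imported lazily so the options helper can be used without selenium-wire installed.
    from seleniumwire import webdriver

    return webdriver.Chrome(
        executable_path=driver_path,
        chrome_options=chrome_options,
        seleniumwire_options=build_seleniumwire_options(),
    )
```

Passing `use_mitmproxy=False` leaves the backend key out, which keeps the default backend and makes it easy to compare the two.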

DanielGelfand commented 3 years ago

Yes, I can access /tmp while the lambda function is running. I see the .seleniumwire file in the tmp directory. I tried the mitmproxy approach but I was unable to change the confdir option using 'mitm_confdir': '/tmp/.mitmproxy'. It always went to the default ~ directory. The modules get all messed up in a Lambda environment so I'd prefer to use the default selenium-wire proxy.

I'm not sure if this is of help but I see seleniumwire.proxy.handler - DEBUG - whatismyipaddress.com:443 200 in the debugger.

wkeeling commented 3 years ago

Thanks for trying the mitmproxy backend. Looks like you found a bug where the default conf directory wasn't being overridden properly. I've fixed it in version 3.0.6. Also in that version is a command line tool that allows you to start a stand-alone instance of Selenium Wire - that uses the default backend - which may help debug this problem.

If you update Selenium Wire to 3.0.6 in the Lambda environment, and then from the command line run:

python -m seleniumwire standaloneproxy addr=<your_public_ip> port=12345

Obviously change <your_public_ip> to whatever the public IP of the environment is. That will start a stand-alone proxy instance. Once done, try configuring the proxy settings in Chrome running on your local machine to point at the public IP and port above (search for "proxy" in Chrome's settings to find the proxy configuration page, and then from there enter the IP and port above for both http and https). Then open a new tab and try navigating to any site. I'd be interested to see what happens and what you see on the terminal running the standalone proxy.

DanielGelfand commented 3 years ago

I am getting

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/__main__.py", line 48, in <module>
    commands[args.command](*pargs, **kwargs)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/__main__.py", line 14, in standalone_proxy
    'verify_ssl': False,
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/proxy/backend.py", line 45, in create
    proxy = ProxyHTTPServer(addr, port, capture_request_handler, options=options)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/proxy/server.py", line 62, in __init__
    super().__init__(self.options.get('max_threads', 9999), (host, port), *args, **kwargs)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/proxy/server.py", line 18, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.7/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/usr/local/lib/python3.7/http/server.py", line 137, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/local/lib/python3.7/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 99] Cannot assign requested address

when I run python -m seleniumwire standaloneproxy addr=<your_public_ip> port=12345 with my instance's public IP as the address.
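Errno 99 (EADDRNOTAVAIL) from `bind` generally means the address is not assigned to any local network interface. On EC2 the public IPv4 address is typically mapped to the instance through NAT rather than configured on the instance itself, which would explain the failure. The sketch below is one way to check which addresses can be bound before starting the proxy; `try_bind` is a hypothetical helper, not part of Selenium Wire.

```python
import errno
import socket


def try_bind(addr, port=0):
    """Try to bind a TCP socket to (addr, port) and return (ok, error_message)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Port 0 asks the OS for any free port, so only the address is tested.
            sock.bind((addr, port))
        except OSError as exc:
            # Translate the numeric errno into its symbolic name where one exists.
            name = errno.errorcode.get(exc.errno, exc.errno)
            return False, "{}: {}".format(name, exc.strerror)
        return True, None


if __name__ == "__main__":
    # 0.0.0.0 means "all local interfaces"; add the instance's own addresses to compare.
    for candidate in ("127.0.0.1", "0.0.0.0"):
        print(candidate, try_bind(candidate))
```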

DanielGelfand commented 3 years ago

Hello,

I provided the private IPv4 address and was then able to connect to the proxy from my browser.

I would see the requests in the proxy server. However, I kept being hit with the message "NET::ERR_CERT_AUTHORITY_INVALID Subject: whatismyipaddress.com

Issuer: Selenium Wire CA

Expires on: Jan 28, 2031

Current date: Jan 30, 2021"

on any website I went to in my browser with https.

wkeeling commented 3 years ago

OK, that's promising. I'd forgotten that Selenium Wire automatically instructs the browser to ignore the certificate error when running normally, but when running standalone you'd need to do it manually.

One way to ignore the error is to start Chrome from the command line and pass the --ignore-certificate-errors option (and at the same time you can also pass the proxy config) - as shown below:

chrome.exe --ignore-certificate-errors --proxy-server=http://<your_public_ip>:12345

The above assumes your local machine is Windows, but you can use the same command on Linux - just omit the .exe.

Alternatively, start Chrome normally and import Selenium Wire's certificate. Save that link to a text file called ca.crt, then go to Chrome's settings, search for "certificates" (it's in the "security" section, under "manage certificates"). Select the "Authorities" tab, and press the "import" button to select the ca.crt file. Once imported, ensure that Chrome's proxy settings are still pointing at your Lambda environment, then open a new tab and try browsing.

DanielGelfand commented 3 years ago

Even after adding the certificate into my trusted authorities, I still got

NET::ERR_CERT_AUTHORITY_INVALID Subject: www.google.com

Issuer: Selenium Wire CA

wkeeling commented 3 years ago

OK that's strange. Maybe the best way forward now is if I sign up for a free AWS Lambda account and see if I can get Selenium Wire up and running myself. It'll likely be quicker than us going back and forth and it'd be good to get to the bottom of the issue, particularly as AWS is a popular platform. Thanks for helping to debug and for being open to trying my various ideas out. I'll get an account set up and report back once I've got any further info. Hopefully won't be too long.

DanielGelfand commented 3 years ago

Thank you Will, I really appreciate it. Selenium-wire would allow me to use user-pass authentication rather than IP whitelisting.

You can find the chromium binary here https://github.com/adieuadieu/serverless-chrome/releases Let me know if you need any help.
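For context, the user-pass use case mentioned above is usually expressed through the `proxy` entry of the Selenium Wire options. This is a minimal sketch under stated assumptions: the host, port and credentials are placeholders, `build_upstream_proxy_options` is a hypothetical helper, and the `scheme` to use depends on what the upstream proxy itself speaks.

```python
from urllib.parse import quote


def build_upstream_proxy_options(host, port, username, password, scheme="http"):
    """Build Selenium Wire options for an authenticated upstream proxy (hypothetical helper)."""
    # Percent-encode the credentials so characters such as '@' or ':' do not break the URL.
    credentials = "{}:{}".format(quote(username, safe=""), quote(password, safe=""))
    proxy_url = "{}://{}@{}:{}".format(scheme, credentials, host, port)
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            # Hosts that should not go through the upstream proxy.
            "no_proxy": "localhost,127.0.0.1",
        },
        # /tmp is used for request storage, as in the original report.
        "request_storage_base_dir": "/tmp",
    }


if __name__ == "__main__":
    # Placeholder values only; replace with the real proxy details.
    print(build_upstream_proxy_options("proxy.example.com", 8080, "user", "secret"))
```

The resulting dict would be passed as `seleniumwire_options` when creating the driver.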

OliverMorgans commented 3 years ago

Just to +1 this: I am having an identical issue with AWS Lambda on selenium-wire version 4.0.4, selenium version 3.141.0.

Tried replacing backend with MITMproxy and adding addr equal to the lambda ip address, to no avail. Similarly the calls work with the standard selenium library and with '--proxy-bypass-list=*', but the latter obviously returns no requests.

thanks!

wkeeling commented 3 years ago

I've managed to get this working on AWS Lambda. I created a .zip locally containing the lambda function (essentially the same code as @DanielGelfand's above), Selenium Wire, headless Chrome, the Chrome driver and all dependencies. Once created, I uploaded the whole thing - but had to use Amazon S3 for that, as the file is 60 MB and so too large to upload directly via the Lambda console.

Versions
headless-chromium = 1.0.0-57
chromedriver = 86.0.4240.22
selenium-wire = 4.0.4
selenium = 3.141.0

Lambda settings
Runtime = Python 3.6
Memory = 256 MB
Timeout = 30 seconds

Note that I started out with a timeout setting of 10 seconds, but this didn't seem to be enough and the function would timeout before it had a chance to complete. I had to increase to 30 seconds for things to work.

I've made the lambda zip file available if you want to download and compare it to your setup.

DanielGelfand commented 3 years ago

Hey Will,

I uploaded your file to lambda and was able to navigate the web. However, I am having trouble recreating your zip and having it work. I checked my packages and they match yours. What steps did you take to create your zip? Thanks.

wkeeling commented 3 years ago

Hi Daniel,

These are the steps:

When I tried running the exact same zip file this morning it failed, and it was because the execution was nudging above the 30 second timeout I had set. Increasing the timeout to 60 seconds fixed it, so it may be worth increasing your timeout just to rule that out.
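To see how close a run gets to the configured timeout, each slow step can be timed and the remaining Lambda time logged. This is a minimal sketch: `timed_step` is a hypothetical helper, and it relies only on the `get_remaining_time_in_millis()` method of the Lambda context object.

```python
import time


def timed_step(label, context, step):
    """Run step(), then print how long it took and how much Lambda time remains."""
    started = time.monotonic()
    result = step()
    elapsed = time.monotonic() - started
    # Milliseconds left before Lambda terminates this invocation.
    remaining_ms = context.get_remaining_time_in_millis()
    print("{}: {:.1f}s elapsed, {} ms remaining".format(label, elapsed, remaining_ms))
    return result
```

Inside the handler this could wrap the two slow calls, for example `timed_step("load page", context, lambda: driver.get("https://www.google.com"))`.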

wkeeling commented 3 years ago

Did you manage to get this working in the end, @DanielGelfand?

DanielGelfand commented 3 years ago

Yes, I did. I appreciate the help. Thanks!