Closed DanielGelfand closed 3 years ago
Thanks for this. If Chrome responded with an ERR_CONNECTION_CLOSED, that would suggest that Selenium Wire's internal proxy server died, which should have thrown some errors to the log/console. Are you able to post the contents of the log or the console output? It may be worth first enabling logging if it's not already enabled:
import os
import logging

# Emit DEBUG-level messages so Selenium Wire's proxy activity shows up in the log
logging.basicConfig(level=logging.DEBUG)

def lambda_handler(event, context):
    ...
Now the above code runs on AWS Lambda but the driver cannot go anywhere.
If I try to grab the page source at any web page it prints
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>
Can you see anything in the log/console - any messages, tracebacks etc?
None, only thing I know is that if I use the regular selenium webdriver I am able to get the page source. With selenium-wire, it unfortunately fails.
Update: Webdriver is now able to go to webpages when '--proxy-bypass-list=*' argument is added. However, this prevents me from using my proxy. Any advice?
Yes, --proxy-bypass-list will also bypass Selenium Wire's embedded proxy, so it won't be able to intercept requests.
Without seeing an error it's hard to know what's happening. It's possible that the certificate generation may be failing. Does the environment you're running in have OpenSSL installed?
Yes, the lambda function has an OpenSSL layer.
OpenSSL version is OpenSSL 1.0.2k-fips 26 Jan 2017.
Debug mode shows that the proxy was created:
seleniumwire.proxy.backend - INFO - Created proxy listening on 127.0.0.1:35587
Any advice?
OK thanks. So it looks like the proxy is being created, but failing at the point where it's trying to capture the request and then killing the connection to Chrome. Maybe a silly question, but can you access /tmp without any issues (I notice that's where the request storage dir is pointing)?
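One way to answer the /tmp question from inside the function itself is a small write test like the one below. This is only a sketch: the helper name storage_dir_is_writable is made up for illustration and is not part of Selenium Wire.

```python
import tempfile


def storage_dir_is_writable(path="/tmp"):
    """Return True if a file can be created and written under `path`."""
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"selenium-wire write test")
            handle.flush()
        return True
    except OSError:
        # Covers a missing directory, missing permissions, a full disk, etc.
        return False
```

Calling storage_dir_is_writable() at the top of lambda_handler and logging the result would show whether the storage directory is usable before the driver starts.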
One other thing to try, if you can, is to switch the backend to mitmproxy. This uses different code to capture and proxy the request, so it may not suffer from the same problem. Even if it does, it may yield some new clues as to what's going on. Not sure how easy that would be in the Lambda environment though? You'd need to install mitmproxy with pip install mitmproxy and then set the backend option to mitmproxy in your Selenium Wire options.
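As a rough sketch, the options for that might be built as follows. The option names 'backend' and 'mitm_confdir' are the ones used in this thread for the Selenium Wire versions under discussion, so check them against the version you are running. The helper names build_seleniumwire_options and create_driver are hypothetical.

```python
def build_seleniumwire_options(backend="mitmproxy", confdir=None):
    """Build the options dict that is passed to Selenium Wire's webdriver."""
    options = {"backend": backend}
    if confdir is not None:
        # Option name as used in this thread; it points mitmproxy's
        # config directory at a writable location such as /tmp.
        options["mitm_confdir"] = confdir
    return options


def create_driver(options):
    """Start Chrome through Selenium Wire with the given options."""
    # Imported lazily so the dict above can be built and inspected
    # on a machine that does not have Selenium Wire installed.
    from seleniumwire import webdriver

    return webdriver.Chrome(seleniumwire_options=options)
```

For example, build_seleniumwire_options(confdir="/tmp/.mitmproxy") gives a dict that can be handed to create_driver.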
Yes, I can access /tmp while the lambda function is running. I see the .seleniumwire file in the tmp directory. I tried the mitmproxy approach but I was unable to change the confdir option using 'mitm_confdir': '/tmp/.mitmproxy'. It always went to the default ~ directory. The modules get all messed up in a Lambda environment so I'd prefer to use the default selenium-wire proxy.
I'm not sure if this is of help but I see seleniumwire.proxy.handler - DEBUG - whatismyipaddress.com:443 200 in the debugger.
Thanks for trying the mitmproxy backend. Looks like you found a bug where the default conf directory wasn't being overridden properly. I've fixed that in version 3.0.6. Also in that version is a command line tool that allows you to start a stand-alone instance of Selenium Wire, using the default backend, which may help debug this problem.
If you update Selenium Wire to 3.0.6 in the Lambda environment, and then from the command line run:
python -m seleniumwire standaloneproxy addr=<your_public_ip> port=12345
Obviously change <your_public_ip> to whatever the public IP of the environment is. That will start a stand-alone proxy instance. Once done, try configuring the proxy settings in Chrome running on your local machine to point at the public IP and port above (search for "proxy" in Chrome's settings to find the proxy configuration page, and then from there enter the IP and port above for both http and https). Then open a new tab and try navigating to any site. I'd be interested to see what happens and what you see on the terminal running the standalone proxy.
I am getting
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/__main__.py", line 48, in <module>
    commands[args.command](*pargs, **kwargs)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/__main__.py", line 14, in standalone_proxy
    'verify_ssl': False,
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/proxy/backend.py", line 45, in create
    proxy = ProxyHTTPServer(addr, port, capture_request_handler, options=options)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/proxy/server.py", line 62, in __init__
    super().__init__(self.options.get('max_threads', 9999), (host, port), *args, **kwargs)
  File "/home/ec2-user/Test/env/lib/python3.7/site-packages/seleniumwire/proxy/server.py", line 18, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.7/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/usr/local/lib/python3.7/http/server.py", line 137, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/local/lib/python3.7/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 99] Cannot assign requested address
when I run python -m seleniumwire standaloneproxy addr=<your_public_ip> port=12345 with my instance's public IP as the address.
Hello,
I provided the private IPv4 address and was then able to connect to the proxy from my browser.
I would see the requests in the proxy server. However, I kept being hit with the message "NET::ERR_CERT_AUTHORITY_INVALID Subject: whatismyipaddress.com
Issuer: Selenium Wire CA
Expires on: Jan 28, 2031
Current date: Jan 30, 2021"
on any website I went to in my browser with https.
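On the earlier OSError: [Errno 99], the fact that the private IPv4 address worked is consistent with how EC2 usually handles public IPs: the public address is mapped to the instance by NAT and is not assigned to any local network interface, so a server socket cannot bind to it. A minimal sketch for checking which addresses the machine can actually bind to (can_bind is a hypothetical helper, not part of Selenium Wire):

```python
import socket


def can_bind(addr, port=0):
    """Return True if a TCP socket can bind to (addr, port) on this machine.

    Port 0 asks the operating system to pick any free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((addr, port))
        except OSError:
            # Includes errno 99 (address not available on this host).
            return False
        return True
```

Running can_bind() with the public and then the private address on the instance should show which one the standalone proxy can use for addr.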
OK, that's promising. I'd forgotten that Selenium Wire automatically instructs the browser to ignore the certificate error when running normally, but when running standalone you'd need to do it manually.
One way to ignore the error is to start Chrome from the command line and pass the --ignore-certificate-errors option (and at the same time you can also pass the proxy config), as shown below:
chrome.exe --ignore-certificate-errors --proxy-server=http://<your_public_ip>:12345
The above assumes your local machine is Windows, but you can use the same command on Linux - just omit the .exe.
Alternatively, start Chrome normally and import Selenium Wire's certificate. Save that link to a text file called ca.crt, then go to Chrome's settings and search for "certificates" (it's in the "security" section, under "manage certificates"). Select the "Authorities" tab, and press the "import" button to select the ca.crt file. Once imported, ensure that Chrome's proxy settings are still pointing at your Lambda environment, then open a new tab and try browsing.
Even after adding the certificate into my trusted authorities, I still got
NET::ERR_CERT_AUTHORITY_INVALID Subject: www.google.com
Issuer: Selenium Wire CA
OK that's strange. Maybe the best way forward now is if I sign up for a free AWS Lambda account and see if I can get Selenium Wire up and running myself. It'll likely be quicker than us going back and forth and it'd be good to get to the bottom of the issue, particularly as AWS is a popular platform. Thanks for helping to debug and for being open to trying my various ideas out. I'll get an account set up and report back once I've got any further info. Hopefully won't be too long.
Thank you, Will, I really appreciate it. Selenium-wire would allow me to use user-pass authentication rather than IP whitelisting.
You can find the chromium binary here https://github.com/adieuadieu/serverless-chrome/releases Let me know if you need any help.
Just to +1 this: I am having an identical issue with AWS Lambda on selenium-wire version 4.0.4, selenium version 3.141.0.
I tried replacing the backend with mitmproxy and setting addr to the Lambda IP address, to no avail. Similarly, the calls work with the standard selenium library and with '--proxy-bypass-list=*', but the latter obviously returns no requests.
Thanks!
I've managed to get this working on AWS Lambda. I created a .zip locally containing the lambda function (essentially the same code as @DanielGelfand's above), Selenium Wire, headless Chrome, the Chrome driver and all dependencies. Once created, I uploaded the whole thing, but had to use Amazon S3 for that as the file is 60 MB and so too large to upload directly via the Lambda console.
Versions
headless-chromium = 1.0.0-57
chromedriver = 86.0.4240.22
selenium-wire = 4.0.4
selenium = 3.141.0

Lambda settings
Runtime = Python 3.6
Memory = 256 MB
Timeout = 30 seconds
Note that I started out with a timeout setting of 10 seconds, but this didn't seem to be enough and the function would timeout before it had a chance to complete. I had to increase to 30 seconds for things to work.
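To see how close a run gets to the timeout, the Lambda context object's get_remaining_time_in_millis() method can be logged at a few points in the handler. The sketch below assumes the standard Python Lambda context; the helper name log_remaining_time and the labels are made up for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def log_remaining_time(context, label):
    """Log and return the milliseconds left before Lambda stops the function."""
    remaining_ms = context.get_remaining_time_in_millis()
    logger.info("%s: %d ms remaining", label, remaining_ms)
    return remaining_ms
```

Calling it as log_remaining_time(context, "before driver.get") and again after the page load gives a rough idea of where the time goes and whether 30 seconds leaves enough headroom.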
I've made the lambda zip file available if you want to download and compare it to your setup.
Hey Will,
I uploaded your file to lambda and was able to navigate the web. However, I am having trouble recreating your zip and having it work. I checked my packages and they match yours. What steps did you take to create your zip? Thanks.
Hi Daniel,
These are the steps:
1. Create a folder called lambda.
2. Open a terminal in the lambda folder and run:
pip install selenium-wire -t .
That will install Selenium Wire and all dependencies into the folder.
3. Copy lambda_function.py into the folder.
4. From inside the folder, run:
zip ../lambda_function.zip -r *
That's a Linux command, but if you're using Windows you should be able to create the zip file in Windows Explorer. It will create the .zip file one level above the lambda folder.
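If the zip command isn't available, the archiving step can also be done with Python's standard zipfile module, which behaves the same on Linux and Windows. This is a sketch rather than the exact procedure used above, and zip_folder is a hypothetical helper name.

```python
import os
import zipfile


def zip_folder(folder, zip_path):
    """Zip the contents of `folder` (not the folder itself) into `zip_path`.

    `zip_path` should live outside `folder`, otherwise the archive would
    try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the folder so that
                # lambda_function.py sits at the top level of the archive.
                archive.write(full_path, os.path.relpath(full_path, folder))
```

Run from the directory that contains the lambda folder, zip_folder("lambda", "lambda_function.zip") produces the archive one level above the folder's contents, matching the layout described in the steps.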
When I tried running the exact same zip file this morning it failed, and it was because the execution was nudging above the 30 second timeout I had set. Increasing the timeout to 60 seconds fixed it, so it may be worth increasing your timeout just to rule that out.
Did you manage to get this working in the end, @DanielGelfand?
Yes, I did. I appreciate the help. Thanks!
Chromedriver and Chromium Binary Version 86.0.4240.111
Message: unknown error: net::ERR_CONNECTION_CLOSED (Session info: headless chrome=86.0.4240.111) : WebDriverException
If I run with the regular Selenium webdriver and without the seleniumwire_options, the code runs. Any advice for getting selenium-wire to work on AWS Lambda?