zytedata / zyte-smartproxy-headless-proxy

A complementary proxy that helps use SPM with headless browsers
MIT License

SSLError if trying to connect to Crawlera Headless Proxy with Python's urllib or requests #6

Open · actionless opened this issue 5 years ago

actionless commented 5 years ago

1) First, I run Crawlera's proxy locally with Docker:

sudo docker run -ti -p 3128:3128 -p 3130:3130 scrapinghub/crawlera-headless-proxy -p 3128 -a "$CRAWLERA_API_KEY" -x profile=desktop

2) Next, I run curl against some URL using that proxy:

$ curl -x http://localhost:3128 -k https://www.google.com/

(and it works)

3) But if I start the Python prompt like this:

env 'HTTPS_PROXY=http://localhost:3128' python

it won't work:

>>> import requests
>>> r = requests.get('https://www.google.com/', verify=False)
...
SSLError: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

The problem reproduces only with Crawlera Headless Proxy; any other proxy server works just fine.

9seconds commented 5 years ago

I believe it's worth raising this issue with the Python libraries. I can also reproduce this error, and the root cause is somewhere in how the TLS handshake is managed by these libraries.

actionless commented 5 years ago

We did some investigation previously; the root cause is that crawlera-headless-proxy denies any HTTP/1.0 connection (it checks the HTTP version right from the CONNECT request line).

So even just hardcoding HTTP/1.1 in the Python code helps, but it feels a bit strange that the proxy server denies all HTTP/1.0 connections:

https://github.com/python/cpython/blob/master/Lib/http/client.py#L883
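To illustrate the finding, here is a toy sketch of the only difference between a denied and an accepted tunnel request (`build_connect_line` is a stand-in for the hardcoded line, not the actual CPython source):

```python
# Toy illustration of the CONNECT request line http.client builds when
# tunneling through a proxy: the linked line hardcoded the version to
# HTTP/1.0, which crawlera-headless-proxy rejects, while an otherwise
# identical HTTP/1.1 request line goes through.

def build_connect_line(host: str, port: int, version: str = "HTTP/1.0") -> str:
    """Build the first line of a proxy CONNECT request."""
    return "CONNECT %s:%d %s\r\n" % (host, port, version)

print(repr(build_connect_line("www.google.com", 443)))              # what http.client sent
print(repr(build_connect_line("www.google.com", 443, "HTTP/1.1")))  # the version the proxy accepts
```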

9seconds commented 5 years ago

I'm a little bit confused by your findings :/ This is probably undocumented behavior in a library we use (https://github.com/valyala/fasthttp). Thanks, I'm going to track this issue further.

actionless commented 5 years ago

Hm, I see that an issue with the same symptoms as in the first message has already been closed there: https://github.com/valyala/fasthttp/issues/16

ghost commented 3 years ago

Any movement on this issue? It's a major blocker for me.

actionless commented 3 years ago

@jjonte-berkeley I've described the workaround in one of the messages above, so technically it can't be "a blocker"

ghost commented 3 years ago

@actionless Okay, thanks. Modifying CPython's base code is the solution. It is probably worth linking to the latest commit's hash on client.py so the line number stays relevant.

https://github.com/python/cpython/blob/711381dfb09fbd434cc3b404656f7fd306161a64/Lib/http/client.py#L904

actionless commented 3 years ago

Modifying CPython is way too hardcore; you could just inherit from that class and override its _tunnel() method.

(I haven't worked on that Crawlera-based project for more than a year now, though :) )
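A minimal sketch of the subclass approach. It relies on an http.client internal: `_tunnel()` sends the CONNECT request through `self.send()` as a single bytes chunk, so we can wrap `send()` for the duration of the tunnel setup and rewrite the hardcoded HTTP/1.0 version. This may need adjusting across Python versions, so verify the assumption still holds on yours:

```python
import http.client


def force_http11_connect(data: bytes) -> bytes:
    """Rewrite the version in a CONNECT request line to HTTP/1.1."""
    if data.startswith(b"CONNECT"):
        # No-op on Python versions that already send the tunnel request as 1.1.
        return data.replace(b" HTTP/1.0\r\n", b" HTTP/1.1\r\n", 1)
    return data


class HTTP11TunnelConnection(http.client.HTTPSConnection):
    """HTTPSConnection whose proxy CONNECT request claims HTTP/1.1."""

    def _tunnel(self):
        original_send = self.send

        def send(data):
            # Rewrite only the CONNECT request line; pass everything
            # else through unchanged.
            original_send(force_http11_connect(data) if isinstance(data, bytes) else data)

        self.send = send
        try:
            super()._tunnel()
        finally:
            self.send = original_send
```

Usage would look roughly like `conn = HTTP11TunnelConnection("localhost", 3128)`, then `conn.set_tunnel("www.google.com", 443)` and `conn.request("GET", "/")`, with localhost:3128 being the headless proxy from the first message.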

anandsork commented 3 years ago

I am also getting this error. Changing the Python base code is not a solution for me, and I could not understand how to use an overridden _tunnel() method, since I use the requests library directly. Is there some other workaround for this issue? Any help is appreciated.

guimap commented 3 years ago

I'm having the same problem; even when using the curl method this error happens for me.