wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License
1.9k stars, 254 forks

Abort Running Requests from Response Interceptor #558

Open gravy-jones-locker opened 2 years ago

gravy-jones-locker commented 2 years ago

Hi, I don't know if this is a suggestion or an issue or an impossibility or an idiocy.

But I have a use-case where I would like to interrupt a request as soon as I have access to a 'Content-Type' response header. Basically I don't want Selenium Wire to download large .pdf files when it navigates to a link which would otherwise prompt that behaviour. The links themselves do not indicate whether or not this would be the case.

I can inspect that header using a simple response interceptor, but it seems that running request.abort() on the corresponding Request object has no effect, probably because it's already completed? My other idea has been to run requests.head(request.url) from within a request interceptor, but the site in question only returns CloudFlare headers.

Is this something that can already be done? And if not is it something that can be implemented within the scope of Selenium Wire? Cheers.

wkeeling commented 2 years ago

Yes, by the time the response interceptor executes, the response has already come back and the file has been downloaded (as far as Selenium Wire is concerned, at least). You can't abort the request at that point, as you discovered, but you could try setting the body of the response to empty bytes b'' and the Content-Length header to 0. That might make things run a little quicker at the point Selenium Wire hands back to the browser, although I suspect you may have suffered most of the latency by that point anyway.
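A minimal sketch of that suggestion, for anyone who wants to try it. The function name and the blocked_types parameter are made up for illustration; the sketch assumes the response object exposes a writable body and dict-like headers, as Selenium Wire's interceptor examples do. It does not avoid the download itself, it only shrinks what is handed back to the browser.

```python
def blank_large_downloads(request, response, blocked_types=("application/pdf",)):
    """Response interceptor: empty the body of unwanted content types.

    `blocked_types` is a hypothetical parameter added for this sketch.
    """
    raw_type = response.headers.get("Content-Type") or ""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = raw_type.split(";")[0].strip().lower()
    if media_type in blocked_types:
        response.body = b""
        # Delete first so the header is replaced rather than duplicated.
        del response.headers["Content-Length"]
        response.headers["Content-Length"] = "0"


# Usage (assumes an existing Selenium Wire driver):
# driver.response_interceptor = blank_large_downloads
```

Selenium Wire calls a response interceptor with two arguments, so the third parameter simply keeps its default.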

Running requests.head() from the interceptor sounds like it may be a better approach if you can avoid CloudFlare. I guess you're using requests to make the call? CloudFlare can probably detect that. Have you ensured that all of the headers are transferred from the original GET request to the HEAD request, so that the request from requests looks the same?
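Something along these lines is what I have in mind: a rough sketch, not tested against CloudFlare. The names make_head_probe_interceptor, head and blocked_types are invented for this example; head defaults to requests.head and is only a parameter so the function can be exercised without a network. Note that every GET would trigger an extra HEAD request, so in practice you would probably filter by URL first.

```python
def make_head_probe_interceptor(head=None, blocked_types=("application/pdf",)):
    """Build a request interceptor that probes each GET with a HEAD first.

    All names here are hypothetical; `head` must behave like requests.head.
    """
    if head is None:
        import requests  # third-party, imported lazily
        head = requests.head

    def interceptor(request):
        if request.method != "GET":
            return
        # Copy every header from the browser's request onto the probe.
        headers = dict(request.headers.items())
        probe = head(request.url, headers=headers,
                     allow_redirects=True, timeout=10)
        raw_type = probe.headers.get("Content-Type") or ""
        media_type = raw_type.split(";")[0].strip().lower()
        if media_type in blocked_types:
            # Stop the browser's request before anything is downloaded.
            request.abort()

    return interceptor


# Usage (assumes an existing Selenium Wire driver):
# driver.request_interceptor = make_head_probe_interceptor()
```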

gravy-jones-locker commented 2 years ago

Thanks for the reply! Yes the size of the file is the issue so I think that by the time we get to the response interceptor - if it has already completed - then the latency will remain much the same.

I've just tried transferring over the headers (had previously only tried with the cookies) but with no luck. I'll keep trying various combinations of those.

Basically I was wondering if it was possible to replicate something like this: resp = requests.get(url, stream=True) .... resp.iter_content() ... if resp.headers["Content-Type"] != 'text/html', but I guess that stream argument is not something which translates to Selenium.
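To spell out the requests behaviour I mean, here is a small sketch. The helper name probe_media_type and its get parameter are made up for illustration; get defaults to requests.get. With stream=True, requests returns once the headers have arrived and only fetches the body when it is read, so closing the response early skips the download.

```python
def probe_media_type(url, headers=None, get=None, timeout=10):
    """Return the response's media type without downloading the body.

    Hypothetical helper; `get` must behave like requests.get.
    """
    if get is None:
        import requests  # third-party, imported lazily
        get = requests.get
    # The with-block closes the response, so the body is never read.
    with get(url, headers=headers, stream=True, timeout=timeout) as resp:
        raw_type = resp.headers.get("Content-Type") or ""
    return raw_type.split(";")[0].strip().lower()


# Usage:
# if probe_media_type(url) != "text/html":
#     ...  # skip this link
```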

wkeeling commented 2 years ago

Actually the underlying engine that Selenium Wire uses (it uses mitmproxy) does support streaming of response bodies. Streaming can be controlled with the stream_large_bodies option which takes a threshold value. If a response body is larger than this value then the response will be streamed rather than buffered. You could try setting that threshold quite low - e.g.

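from seleniumwire import webdriver

# Stream (rather than buffer) any response body larger than 20k
driver = webdriver.Chrome(
    seleniumwire_options={'mitm_stream_large_bodies': '20k'}
)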

Note that when passing mitmproxy-specific options via Selenium Wire, they have to be prefixed with mitm_ as in the above example.

I'm not sure what effect that will have but might be worth a try.

gravy-jones-locker commented 2 years ago

This is exactly the kind of thing I was looking for. I will give it a try and let you know how it goes. Being able to edit response bodies on the fly before they (or the underlying requests) have finished execution does seem like a cool feature.

gravy-jones-locker commented 2 years ago

Hm, I'm struggling a bit with these options. I tried that option with no luck. Then I spotted body_size_limit, which would also suit my purposes, but after a bit of debugging I'm not sure that's working either. Check out the following block:

>>> from seleniumwire import undetected_chromedriver as uc
>>> class Driver(uc.Chrome):
...     def __init__(self, *args, **kwargs):
...         super().__init__(*args, **kwargs)
...         self.response_interceptor = self.intercept
...     def intercept(self, req, resp):
...         print(resp.headers["Content-Length"])
... 
>>> driver = Driver(seleniumwire_options={"mitm_body_size_limit": "2k"})
>>> driver.get('https://mdpi-res.com/d_attachment/jpm/jpm-12-00677/article_deploy/jpm-12-00677.pdf?version=1650708175')
1652758
>>> 

Unless I've done something stupid it seems like that option is having no effect.

EDIT: I've used an obvious .pdf link just for demonstration. Normally that aspect is hidden, as mentioned.

FloatingMind12 commented 2 years ago

Did you find a way to get the response headers before the request finishes? Using requests.head(request.url) also doesn't work in my case because it returns something else.

Regarding the mitm_body_size_limit option, I already tried it, but the response_interceptor function was still only called after the response body had finished downloading.