wkeeling / selenium-wire

Extends Selenium's Python bindings to give you the ability to inspect requests made by the browser.
MIT License
1.9k stars 254 forks

Chrome doesn't store files in Disk Cache #519

Closed alirf81 closed 2 years ago

alirf81 commented 2 years ago

Hi, thank you for your contribution. I found an issue: when I use selenium-wire with Chrome, files are not stored in the disk cache.

image

I need to store the static files of a web page in the disk cache, as in the image above, so I can reuse them when I run the script later. Could you give me some comments?

wkeeling commented 2 years ago

Thanks for raising this.

Is this a general problem or a problem with a specific site? I notice if I visit https://www.wikipedia.org/ using Selenium Wire and Chrome, the assets seem to be retrieved from the cache without any problem.

image

Do you see the same if you visit wikipedia with Selenium Wire?

I'm using ChromeDriver 99.0.4844.51 and Selenium Wire 4.6.3

alirf81 commented 2 years ago

Thank you for your reply. Would you please open https://m.facebook.com/reg and check whether it uses the disk cache? If it does, could you send me the Python script that initializes your Selenium driver? If the files are held in the memory cache only, they are discarded when the script finishes, and I have to load them again the next time I start it.

wkeeling commented 2 years ago

https://m.facebook.com/reg seems to use a combination of both:

image

But then if I don't use Selenium Wire, I see very similar caching behaviour with Chrome:

image

So it seems that Chrome is using caching (disk and memory) whether I use Selenium Wire or not.

Do you see the same issue if you use pure Selenium (no Selenium Wire)?

alirf81 commented 2 years ago

Pure Selenium uses the disk cache, but Selenium Wire doesn't. From my debugging, Selenium Wire adds a Chrome option for a proxy (I guess this is for capturing requests), and that makes Chrome use the memory cache only. As a result, the files are not stored in user-data-dir, and when I run the script again it loads them from the web URL again.

alirf81 commented 2 years ago

Is it possible to use a Selenium Wire interceptor instead of the cache?

alirf81 commented 2 years ago

If I create a mock response using an interceptor, the request won't be passed on to the website's server, right?

wkeeling commented 2 years ago

Yes, that's correct. With regard to your other question about using a Selenium Wire interceptor instead of the cache, you could try using an interceptor to set the Cache-Control header for certain requests. That indicates to the browser that it can reuse the response from the cache for those requests. For example:

def interceptor(request):
    if request.path == '/some/path/of/request/i/want/to/cache':
        del request.headers["Cache-Control"]
        request.headers["Cache-Control"] = "max-age=604800"

That might require a bit of experimentation.
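As one variation to experiment with, the same header can be set on the response rather than the request, since Cache-Control on a response is what tells the browser how long it may reuse it. The sketch below is a minimal, untested-in-a-browser example: `CACHEABLE_PREFIX` and `set_cache_control` are hypothetical names, and whether Chrome then writes the file to its disk cache while running behind the Selenium Wire proxy is an assumption that still needs checking.

```python
# Hypothetical path prefix; replace it with the requests you want cached.
CACHEABLE_PREFIX = "/static/"


def set_cache_control(request, response, max_age=604800):
    """Mark matching responses as reusable by the browser for max_age seconds.

    Returns True when the header was set, which is only used for checking;
    Selenium Wire ignores the return value of an interceptor.
    """
    if not request.path.startswith(CACHEABLE_PREFIX):
        return False
    # Delete any existing header first so it is replaced, not duplicated.
    if "Cache-Control" in response.headers:
        del response.headers["Cache-Control"]
    response.headers["Cache-Control"] = f"max-age={max_age}"
    return True


# With Selenium Wire this would be attached as a response interceptor:
#   driver.response_interceptor = set_cache_control
```

The function only relies on `request.path` and `response.headers`, so it can be tried out with stand-in objects before attaching it to a driver.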

alirf81 commented 2 years ago

By using the CacheControl library (https://pypi.org/project/CacheControl/) together with an interceptor, I managed to implement the cache. Thank you for your help.

ElliotSalisbury commented 2 years ago

Hi @alirf81

Are you able to share your solution?

I have a similar problem: selenium-wire isn't using the disk cache.

ElliotSalisbury commented 2 years ago

For anyone else searching for how to do this, here's how I kind of solved it. It's not a great solution and it's pretty messy, but it works and I have bigger fish to fry, so I'm moving on from this.

The reason I was looking for this was that it kept downloading the same static content from a CDN instead of receiving it from a cache. So I check whether the CDN URL is in the request, and if it is, I try to retrieve the response from the cache.

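# In-memory cache of responses, keyed by request URL.
response_cache = {}

def request_interceptor(request):
    # Serve CDN requests from the cache when a response is already stored.
    if "cdn" in request.url and request.url in response_cache:
        request.response = response_cache[request.url]

def response_interceptor(request, response):
    if "cdn" in request.url and response.status_code == 200:
        if request.url not in response_cache:
            # First time this URL is seen: remember the response.
            response_cache[request.url] = response
        else:
            # Otherwise reuse the body that was stored earlier.
            response.body = response_cache[request.url].body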

You may also need to put the CDN URL into the driver's scopes:

driver = webdriver.Chrome(executable_path=executable_path, chrome_options=chrome_options, seleniumwire_options=options, desired_capabilities=capabilities)
driver.scopes = [
    ".*cdn.com.*"
]
driver.request_interceptor = request_interceptor
driver.response_interceptor = response_interceptor
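
The cache above lives in memory, so it is lost when the script exits, which was the original concern in this issue. A rough sketch of a disk-backed variant follows. It is not the solution anyone in this thread actually used: `DiskResponseCache` and `make_interceptors` are hypothetical names, the `"cdn"` marker is carried over from the snippet above, and it assumes that `request.create_response(status_code=..., headers=..., body=...)` accepts a dict of headers and that `response.body` is bytes. Repeated headers such as Set-Cookie collapse to one value in this sketch.

```python
import base64
import hashlib
import json
from pathlib import Path


class DiskResponseCache:
    """Stores status, headers and body per URL as JSON files in a directory."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # Hash the URL so it is always safe to use as a file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / (digest + ".json")

    def get(self, url):
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        entry["body"] = base64.b64decode(entry["body"])
        return entry

    def put(self, url, status_code, headers, body):
        entry = {
            "status_code": status_code,
            "headers": [list(pair) for pair in headers],
            # JSON cannot hold raw bytes, so the body is base64-encoded.
            "body": base64.b64encode(body).decode("ascii"),
        }
        self._path(url).write_text(json.dumps(entry))


def make_interceptors(cache, marker="cdn"):
    """Build a request/response interceptor pair backed by the given cache."""

    def request_interceptor(request):
        entry = cache.get(request.url) if marker in request.url else None
        if entry is not None:
            # Answer from disk so the request is not sent to the server.
            request.create_response(
                status_code=entry["status_code"],
                headers=dict(entry["headers"]),
                body=entry["body"],
            )

    def response_interceptor(request, response):
        if marker in request.url and response.status_code == 200:
            if cache.get(request.url) is None:
                cache.put(request.url, response.status_code,
                          response.headers.items(), response.body)

    return request_interceptor, response_interceptor
```

Usage would look something like `driver.request_interceptor, driver.response_interceptor = make_interceptors(DiskResponseCache("response_cache"))`, where the directory name is arbitrary. Nothing here expires or revalidates entries, so stale files have to be deleted by hand.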