Closed: ericlathrop closed this issue 9 months ago
We've got a basic Phoenix app. In our controllers we use `Cachex.fetch` for a basic read-through scenario to load data from the database with Ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex or just hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages load again.
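For context, the read-through pattern described here might look roughly like the following sketch. The cache name (`:pages`) and the `load_page` loader are hypothetical; the loader stands in for the Ecto query a controller would run.

```elixir
Mix.install([{:cachex, "~> 3.6"}])

# Start a cache under a hypothetical name.
{:ok, _pid} = Cachex.start_link(:pages)

# Stand-in for the Ecto-backed loader.
load_page = fn id -> %{id: id, title: "Page #{id}"} end

# Miss: the fallback runs and its result is written to the cache ({:commit, _}).
{:commit, page} = Cachex.fetch(:pages, "home", fn id -> {:commit, load_page.(id)} end)

# Hit: the same key now comes straight from the cache ({:ok, _}).
{:ok, ^page} = Cachex.fetch(:pages, "home", fn id -> {:commit, load_page.(id)} end)
```

The `{:commit, value}` tuple tells Cachex to store the fallback's result; returning `{:ignore, value}` instead would serve the value without caching it.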
Next time, rather than toggle it off, I suggest you check several things about the state of the cache:

- The key you're trying to read/write
- The size of the cache (the keyspace)
- The memory consumption of the cache
- The memory consumption of your application

Rather than build a toggle to turn it on and off, I would suggest that you enable some sort of logging so we can see exactly where things are getting stuck. It would also be helpful to know the version you are using.
I suspect I'm running into the unfixed scenario mentioned here: https://github.com/whitfin/cachex/issues/287#issuecomment-1180564987
Maybe, but I was given no reproduction case in that issue, and I still have no reproduction case in this issue, so I'm not entirely sure how I'm expected to magically fix these things!
Since it seems like Cachex is unreliable, we're going to have to rip it out and build our own simple caching with ETS.
There was no real need for this comment... 🙂 I understand you're probably frustrated because your app isn't working, but filing this for help in this tone isn't particularly helpful.
Cachex is proven in production use in hundreds of systems (including several government systems). It's either something specific to your implementation, or a bug which can be fixed. I would recommend against building your own implementation until you understand exactly what is happening, because you'll likely just hit the same problem.
Sorry, you're right, I apologize for the last part. I was frustrated with debugging production issues for the last couple days.
The key is just the ID taken from the page URL, just a simple string.
I don't know the exact size of the caches, but the entire app is using around 300 MB total.
I think there's usually less than a hundred keys.
I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.
Are there any other things I can log to debug this?
I edited the description to remove my offensive comment. Sorry again.
> Sorry, you're right, I apologize for the last part. I was frustrated with debugging production issues for the last couple days.
No problem, totally get it - hopefully I can do at least something here to help!
> The key is just the ID taken from the page URL, just a simple string.
I mean more to see if there's a trend in the keys triggering the issue; there shouldn't be, but I'm really curious about the "after about a day" part. That implies something specific sets it off, so starting with the keys (or maybe the data behind them) could be a good place to begin.
> I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.
Yeah, makes perfect sense! Definitely would have done the same.
> I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.
Yeah, this is the most interesting thing here; it seems like nothing crashes, it just gets stuck. The only thing that can do this is the background process handling `fetch` calls, but I'm kinda at a loss as to why that would happen.
> Are there any other things I can log to debug this?
I think if you log before you call `fetch`, log inside your `fetch` function, and log afterwards, it might give me somewhere to start. It'll at least tell me where it's getting stuck, which is likely going to be after the first two logs but before the third.
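The suggested log placement might look roughly like this minimal sketch; `fake_fetch` is a hypothetical stand-in for the real `Cachex.fetch/3` call (it always misses, so the fallback runs), and the key is made up.

```elixir
require Logger

# Stand-in for Cachex.fetch/3: always a miss, so the fallback runs.
fake_fetch = fn key, fallback -> fallback.(key) end

fetch_page = fn id ->
  # Log 1: before entering the fetch call.
  Logger.info("before fetch: #{id}")

  result =
    fake_fetch.(id, fn key ->
      # Log 2: inside the fallback (i.e. the DB load is running).
      Logger.info("inside fetch fallback: #{key}")
      {:commit, "data for #{key}"}
    end)

  # Log 3: after the fetch call returned.
  Logger.info("after fetch: #{id} -> #{inspect(result)}")
  result
end

{:commit, "data for home"} = fetch_page.("home")
```

If the hang happens between logs 2 and 3, the fallback ran but the result never came back, which is exactly the behaviour being discussed here.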
What version of Cachex are you using? And is this totally reproducible for you? If we e.g. found a fix, would you be able to tell for sure if it helped? I'm also more than happy to try your exact code flow; if you don't want to expose this on GitHub, I'm reachable on the Elixir Slack or via the email on my GitHub profile.
In the other issue you linked, I spent some time staring at the `Courier` service inside Cachex, which is where this is likely to be caused - but I couldn't find anything specific. I rewrote it a bit to hopefully clean it up, but it seems there may still be a problem. I'm definitely willing to put the time in to figure this out; I'm just lacking an environment where I can reproduce the issue!
Edit: even if I can't get a reproduction case, I might try to clean up the Courier further. There are definitely a few weird cases of behaviour in there I can make more robust - it should be effectively the same, but it might help clear up some edge cases.
We're using Cachex 3.6.0.
I added the logging you asked for.
I built a way to clear a specific cache at runtime into my app. I'll see if clearing a cache helps anything.
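A runtime clear like the one described can presumably lean on `Cachex.clear/1`, which wipes every entry and reports how many were removed. A minimal sketch, using a hypothetical cache name:

```elixir
Mix.install([{:cachex, "~> 3.6"}])

{:ok, _pid} = Cachex.start_link(:pages)
{:ok, true} = Cachex.put(:pages, "home", "cached page")
{:ok, 1} = Cachex.size(:pages)

# Wipe every entry in the cache; returns how many entries were removed.
{:ok, 1} = Cachex.clear(:pages)
{:ok, 0} = Cachex.size(:pages)
```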
I'm going to try to build a simpler test case, but it might take me a little while.
@ericlathrop got it, so my changes in v3.6 didn't help - I'll spend some more time on it. I'm flying a lot this week, but maybe I can get some time on those flights to take a long look!
No rush, whenever you get a chance just ping me back here!
@ericlathrop I'm pretty sure I've located the cause here - somehow the child processes are being killed, and because the `spawn` is not linked to the Courier, it's not cleaning up properly. I couldn't tell you why it's dying; it's probably something specific to your app (a network issue to the database, maybe?).
I wrote a test case for this, and it hangs. I made some changes (which will probably be in `main` by the time you read this), and it no longer hangs. I'm not sure it's fully cleaned up yet, but you should be able to test it all out and see if you can still reproduce the issue. For reference, the PR is https://github.com/whitfin/cachex/pull/320. I'm also considering adding logging throughout, so you can toggle on logs inside Cachex itself to help debug stuff like this in future.
Just to set expectations properly: this should stop the hanging issue. It could be that fixing this gives you back an error that still needs follow-up investigation on my part (depending on why the child process is dying). This is why I'm not releasing these changes just yet, in case they need further work based on what we uncover.
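The failure mode described above (an unlinked worker dying and leaving callers blocked) can be sketched with a toy server. `MiniCourier` is a hypothetical stand-in for Cachex's Courier, not its real implementation; it uses `spawn_monitor/1` so a dead worker produces an error reply instead of a hang, whereas a bare `spawn` would leave the caller stuck in `GenServer.call` until the timeout.

```elixir
defmodule MiniCourier do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Callers block here until the worker's result (or death) comes back.
  def dispatch(fun), do: GenServer.call(__MODULE__, {:dispatch, fun}, 5_000)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:dispatch, fun}, from, state) do
    parent = self()
    # spawn_monitor instead of a bare spawn: the worker's death becomes an
    # observable :DOWN message rather than a silent hang for the caller.
    {pid, ref} = spawn_monitor(fn -> send(parent, {:result, self(), fun.()}) end)
    {:noreply, Map.put(state, pid, {from, ref})}
  end

  @impl true
  def handle_info({:result, pid, value}, state) do
    {{from, ref}, state} = Map.pop(state, pid)
    # Flush the pending :normal :DOWN for a worker that finished cleanly.
    Process.demonitor(ref, [:flush])
    GenServer.reply(from, {:ok, value})
    {:noreply, state}
  end

  def handle_info({:DOWN, _ref, :process, pid, reason}, state) do
    case Map.pop(state, pid) do
      {nil, state} ->
        {:noreply, state}

      {{from, _ref}, state} ->
        # Worker died before sending a result: reply with the exit reason.
        GenServer.reply(from, {:error, reason})
        {:noreply, state}
    end
  end
end

{:ok, _pid} = MiniCourier.start_link()
{:ok, 42} = MiniCourier.dispatch(fn -> 21 * 2 end)
{:error, :boom} = MiniCourier.dispatch(fn -> exit(:boom) end)
```

Without the monitor, the second `dispatch` would simply sit there until the call timeout, which matches the "pages hang forever, no errors in the logs" symptom.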
Let me know whenever you get some time, happy holidays!
@whitfin thanks for investigating. I spent a day last week creating a sample project to try to reproduce this. I wasn't able to, but I got some weird pauses when I got to around 40 worker processes. I'll post the repo later when I'm at my computer.
I do suspect the problem was related to crashes, because we get periodic Ecto crashes related to running out of Postgres connections. I didn't get time to test that in my example repo.
@ericlathrop awesome, no rush :) If it is due to crashes, and you have some other crashes happening, then yeah this is likely the cause.
In that case, if you can check whether the errors that are raised are actually useful, I'd appreciate it (it'll probably just bubble up the Ecto error, but it would be nice to verify). If we can confirm this, I'm happy to get it published ASAP!
(reopening just due to GitHub auto close, as we still need to verify)
Just following up @ericlathrop - did you need to check anything on your side, or are we happy that this is solved in `main`? If the latter, I'll close this out and it'll be included in the 3.7 release.
@whitfin Hey, sorry for the delay. My sample project where I tried to reproduce the error (but wasn't able to) is here: https://gitlab.com/ericlathrop/debug_cachex. It basically spins up a bunch of GenServers that try to read random stuff from the database through `Cachex.fetch`, with random waits in between. If you're familiar with docker-compose, you can spin it up with a database and everything pretty easily.
Sorry I wasn't able to reproduce the issue, and sorry that I wasn't able to try out the new version. I switched our app to a different caching library, which works for our usage, and it would be problematic to switch back and redeploy just to see if it hangs again. Feel free to close this issue if you want.
We've got a basic Phoenix app. In our controllers we use `Cachex.fetch` for a basic read-through scenario to load data from the database with Ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex or just hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages load again. There are no errors in my logs.

I suspect I'm running into the unfixed scenario mentioned here: https://github.com/whitfin/cachex/issues/287#issuecomment-1180564987

Sorry I don't have any more specific details on how to reproduce.