Closed: ericlathrop closed this issue 9 months ago
We've got a basic Phoenix app. In our controllers we use `Cachex.fetch` for a basic read-through scenario to load data from the database with Ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex or just hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages load again.
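For context, the read-through pattern described here might look roughly like the following sketch. The cache name (`:pages`) and the `load_page` loader are hypothetical; the loader stands in for the Ecto query a controller would run.

```elixir
Mix.install([{:cachex, "~> 3.6"}])

# Start a cache under a hypothetical name.
{:ok, _pid} = Cachex.start_link(:pages)

# Stand-in for the Ecto-backed loader.
load_page = fn id -> %{id: id, title: "Page #{id}"} end

# Miss: the fallback runs and its result is written to the cache ({:commit, _}).
{:commit, page} = Cachex.fetch(:pages, "home", fn id -> {:commit, load_page.(id)} end)

# Hit: the same key now comes straight from the cache ({:ok, _}).
{:ok, ^page} = Cachex.fetch(:pages, "home", fn id -> {:commit, load_page.(id)} end)
```

The `{:commit, value}` tuple tells Cachex to store the fallback's result; returning `{:ignore, value}` instead would serve the value without caching it.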
Next time, rather than toggle it off, I suggest you check several things about the state of the cache:

- The key you're trying to read/write
- The size of the cache (the keyspace)
- The memory consumption of the cache
- The memory consumption of your application

Rather than build a toggle to turn it on and off, I would suggest that you enable some sort of logging so we can see exactly where things are getting stuck. It would also be helpful to know the version you are using.
I suspect I'm running into the unfixed scenario mentioned here: https://github.com/whitfin/cachex/issues/287#issuecomment-1180564987
Maybe, but I was given no reproduction case in that issue, and I still have no reproduction case in this issue, so I'm not entirely sure how I'm expected to magically fix these things!
Since it seems like Cachex is unreliable, we're going to have to rip it out and build our own simple caching with ETS.
There was no real need for this comment... 🙂 I understand you're probably frustrated because your app isn't working, but filing this for help in this tone isn't particularly helpful.
Cachex is proven in production use in hundreds of systems (including several government systems). It's either something specific to your implementation, or a bug which can be fixed. I would recommend against building your own implementation until you understand exactly what is happening, because you'll likely just hit the same problem.
Sorry, you're right, I apologize for the last part. I was frustrated with debugging production issues for the last couple days.
The key is just the ID taken from the page URL, just a simple string.
I don't know the exact size of the caches, but the entire app is using around 300 MB total.
I think there's usually less than a hundred keys.
I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.
Are there any other things I can log to debug this?
I edited the description to remove my offensive comment. Sorry again.
> Sorry, you're right, I apologize for the last part. I was frustrated with debugging production issues for the last couple days.
No problem, totally get it - hopefully I can do at least something here to help!
> The key is just the ID taken from the page URL, just a simple string.
I mean more to see if there's a trend in the keys triggering the issue; there shouldn't be, but I'm really curious about the "after about a day" part. That implies something specific sets it off, so starting with the keys (or maybe the data behind them) could be a good place to begin.
> I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.
Yeah, makes perfect sense! Definitely would have done the same.
> I was just toggling the cache off in order to narrow the possibilities since I didn't have a stack trace to go off of.
Yeah, this is the most interesting thing here; it seems like nothing crashes, it just gets stuck. The only thing that can do this is the background process handling `fetch` calls, but I'm kinda at a loss as to why that would happen.
> Are there any other things I can log to debug this?
I think if you log before you call `fetch`, log inside your `fetch` function, and log afterwards, it might give me somewhere to start. It'll at least tell me where it's getting stuck, which is likely going to be after the first two logs but before the third.
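The suggested log placement might look roughly like this minimal sketch; `fake_fetch` is a hypothetical stand-in for the real `Cachex.fetch/3` call (it always misses, so the fallback runs), and the key is made up.

```elixir
require Logger

# Stand-in for Cachex.fetch/3: always a miss, so the fallback runs.
fake_fetch = fn key, fallback -> fallback.(key) end

fetch_page = fn id ->
  # Log 1: before entering the fetch call.
  Logger.info("before fetch: #{id}")

  result =
    fake_fetch.(id, fn key ->
      # Log 2: inside the fallback (i.e. the DB load is running).
      Logger.info("inside fetch fallback: #{key}")
      {:commit, "data for #{key}"}
    end)

  # Log 3: after the fetch call returned.
  Logger.info("after fetch: #{id} -> #{inspect(result)}")
  result
end

{:commit, "data for home"} = fetch_page.("home")
```

If the hang happens between logs 2 and 3, the fallback ran but the result never came back, which is exactly the behaviour being discussed here.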
What version of Cachex are you using? And is this totally reproducible for you? If we e.g. found a fix, would you be able to tell for sure if it helped? I'm also more than happy to try your exact code flow; if you don't want to expose this on GitHub, I'm reachable on the Elixir Slack or via the email on my GitHub profile.
In the other issue you linked, I spent some time staring at the `Courier` service inside Cachex, which is where this is likely to be caused - but I couldn't find anything specific. I rewrote it a bit to hopefully clean it up, but it seems there may still be a problem. I'm definitely willing to put the time in to figure this out; I'm just lacking an environment where I can reproduce the issue!
Edit: even if I can't get a reproduction case, I might try to clean up the Courier further. There are definitely a few weird cases of behaviour in there I can make more robust - it should be effectively the same, but it might help clear up some edge cases.
We're using Cachex 3.6.0.
I added the logging you asked for.
I built a way to clear a specific cache at runtime into my app. I'll see if clearing a cache helps anything.
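A runtime clear like the one described can presumably lean on `Cachex.clear/1`, which wipes every entry and reports how many were removed. A minimal sketch, using a hypothetical cache name:

```elixir
Mix.install([{:cachex, "~> 3.6"}])

{:ok, _pid} = Cachex.start_link(:pages)
{:ok, true} = Cachex.put(:pages, "home", "cached page")
{:ok, 1} = Cachex.size(:pages)

# Wipe every entry in the cache; returns how many entries were removed.
{:ok, 1} = Cachex.clear(:pages)
{:ok, 0} = Cachex.size(:pages)
```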
I'm going to try to build a simpler test case, but it might take me a little while.
@ericlathrop got it, so my changes in v3.6 didn't help - I'll spend some more time on it. I'm flying a lot this week, but maybe I can get some time on those flights to take a long look!
No rush, whenever you get a chance just ping me back here!
@ericlathrop I'm pretty sure I've located the cause here - somehow the child processes are being killed, and because the `spawn` is not linked to the Courier, it's not cleaning up properly. I couldn't tell you why it's dying; it's probably something specific to your app (a network issue to the database, maybe?).
I wrote a test case for this, and it hangs. I made some changes (which will probably be in `main` by the time you read this), and it no longer hangs. I'm not sure it's fully cleaned up yet, but you should be able to test it all out and see if you can still reproduce the issue. For reference, the PR is https://github.com/whitfin/cachex/pull/320. I'm also considering adding logging throughout, so you can toggle on logs inside Cachex itself to help debug stuff like this in future.
Just to set expectations properly: this should stop the hanging issue. It could be that fixing this gives you back an error that still needs follow-up investigation on my part (depending on why the child process is dying). This is why I'm not releasing these changes just yet, in case they need further work based on what we uncover.
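The failure mode described above (an unlinked worker dying and leaving callers blocked) can be sketched with a toy server. `MiniCourier` is a hypothetical stand-in for Cachex's Courier, not its real implementation; it uses `spawn_monitor/1` so a dead worker produces an error reply instead of a hang, whereas a bare `spawn` would leave the caller stuck in `GenServer.call` until the timeout.

```elixir
defmodule MiniCourier do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Callers block here until the worker's result (or death) comes back.
  def dispatch(fun), do: GenServer.call(__MODULE__, {:dispatch, fun}, 5_000)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:dispatch, fun}, from, state) do
    parent = self()
    # spawn_monitor instead of a bare spawn: the worker's death becomes an
    # observable :DOWN message rather than a silent hang for the caller.
    {pid, ref} = spawn_monitor(fn -> send(parent, {:result, self(), fun.()}) end)
    {:noreply, Map.put(state, pid, {from, ref})}
  end

  @impl true
  def handle_info({:result, pid, value}, state) do
    {{from, ref}, state} = Map.pop(state, pid)
    # Flush the pending :normal :DOWN for a worker that finished cleanly.
    Process.demonitor(ref, [:flush])
    GenServer.reply(from, {:ok, value})
    {:noreply, state}
  end

  def handle_info({:DOWN, _ref, :process, pid, reason}, state) do
    case Map.pop(state, pid) do
      {nil, state} ->
        {:noreply, state}

      {{from, _ref}, state} ->
        # Worker died before sending a result: reply with the exit reason.
        GenServer.reply(from, {:error, reason})
        {:noreply, state}
    end
  end
end

{:ok, _pid} = MiniCourier.start_link()
{:ok, 42} = MiniCourier.dispatch(fn -> 21 * 2 end)
{:error, :boom} = MiniCourier.dispatch(fn -> exit(:boom) end)
```

Without the monitor, the second `dispatch` would simply sit there until the call timeout, which matches the "pages hang forever, no errors in the logs" symptom.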
Let me know whenever you get some time, happy holidays!
@whitfin thanks for investigating. I spent a day last week creating a sample project to try to reproduce this. I wasn't able to, but I got some weird pauses when I got to around 40 worker processes. I'll post the repo later when I'm at my computer.
I do suspect the problem was related to crashes, because we get periodic Ecto crashes related to running out of Postgres connections. I didn't get time to test that in my example repo.
@ericlathrop awesome, no rush :) If it is due to crashes, and you have some other crashes happening, then yeah this is likely the cause.
In that case, if you can check whether the errors that are raised are actually useful, I'd appreciate it (it'll probably just bubble up the Ecto error, but it would be nice to verify). If we can confirm this, I'm happy to get it published ASAP!
(reopening just due to GitHub auto close, as we still need to verify)
Just following up @ericlathrop - did you need to check anything on your side, or are we happy that this is solved in `main`? If the latter, I'll close this out and it'll be included in the 3.7 release.
@whitfin Hey, sorry for the delay. My sample project where I tried to reproduce the error (but wasn't able to) is here: https://gitlab.com/ericlathrop/debug_cachex. It basically spins up a bunch of GenServers that try to read random stuff from the database through `Cachex.fetch`, with random waits in between. If you're familiar with docker-compose, you can spin it up with a database and everything pretty easily.
Sorry I wasn't able to reproduce the issue, and sorry that I wasn't able to try out the new version. I switched our app to a different caching library, which works for our usage, and it would be problematic to switch back and redeploy just to see if it hangs again. Feel free to close this issue if you want.
We've got a basic Phoenix app. In our controllers we use `Cachex.fetch` for a basic read-through scenario to load data from the database with Ecto. After a day or so, pages just hang forever. I built a toggle into my app to switch between using Cachex or just hitting the database directly. The next time the app started hanging, I switched off Cachex and my pages load again. There are no errors in my logs.

I suspect I'm running into the unfixed scenario mentioned here: https://github.com/whitfin/cachex/issues/287#issuecomment-1180564987

Sorry I don't have any more specific details on how to reproduce.