Hello,

We're tracking down a bug and we would appreciate any ideas you have. We opened #281 in the past.

This issue happens with a wrapper we wrote around `fetch`: we want to set a TTL when committing the value to the cache. In production, we see that sometimes (very rarely) keys are written to the cache without a TTL set and are therefore never cleaned out. We do not currently understand how this can happen.

Our code looks like this: https://github.com/etalab/transport-site/blob/487455d886d4055a443c1a8ce4f8c3f80a0fbd8f/apps/transport/lib/transport/cache.ex#L34-L60 (feel free to take a look at the whole file). It is inspired by pieces of code you've shared in the past.

Any ideas would be appreciated!
Hi!
This should be impossible. How rarely does this happen? Is there a minimal case I can use to reproduce? I looked into the repository you linked but it's quite large so it's probably not something I can play around with.
For any of the keys which do not have a TTL set, can you confirm that you see this log? https://github.com/etalab/transport-site/blob/487455d886d4055a443c1a8ce4f8c3f80a0fbd8f/apps/transport/lib/transport/cache.ex#L59
This will help me know if it's something with setting `expire`, or if it's something with the result of the `fetch` call.
I tried to reproduce it locally with a minimal example (high number of operations, random keys, random values, etc.) but so far no luck.
We're adding additional logging to gather clues: https://github.com/etalab/transport-site/pull/2900
@whitfin let me know if you want me to execute some code locally or try to modify our module to pinpoint the issue.

Versions:

```
elixir 1.14.1-otp-24
erlang 24.3.4.6
```
```elixir
Mix.install([:cachex])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def test do
    # hammer a single key from many concurrent tasks
    1..100_000
    |> Task.async_stream(
      fn _ ->
        value_fn = fn key ->
          Logger.info("Generating cached value for key #{key}")
          :timer.sleep(Enum.random(500..1_000))
          42
        end

        :timer.sleep(Enum.random(100..5_000))
        fetch("key", value_fn, :timer.seconds(5))
      end,
      max_concurrency: 50,
      timeout: :infinity
    )
    |> Enum.each(&IO.inspect/1)
  end

  def fetch(cache_key, comp_fn, expire_value) do
    {operation, result} = Cachex.fetch(cache_name(), cache_key, comp_fn)

    return_value =
      case operation do
        :ok ->
          Logger.info("Value for key #{cache_key} served from cache")
          result

        :commit ->
          # we committed the value, so set the TTL ourselves
          {:ok, true} = Cachex.expire(cache_name(), cache_key, expire_value)
          Logger.info("Value for key #{cache_key} regenerated")
          result
      end

    # the key should always have a TTL at this point
    with {:ok, nil} <- Cachex.ttl(cache_name(), cache_key) do
      extra = %{cache_key: cache_key, operation: operation, result: inspect(result)}
      IO.inspect(extra)
      raise "ttl is not set"
    end

    return_value
  end
end

Cachex.start_link(Test.cache_name(), [])
Test.test()
```
When running the script (`elixir cachex.exs`) a few times, I sometimes get:
```
10:08:29.542 [info] Value for key key served from cache
10:08:29.542 [info] Value for key key served from cache
10:08:29.542 [info] Value for key key served from cache
10:08:29.542 [info] Value for key key served from cache
10:08:29.543 [info] Value for key key served from cache
10:08:29.543 [info] Value for key key served from cache
10:08:29.543 [info] Value for key key served from cache
10:08:29.548 [info] Value for key key served from cache
%{cache_key: "key", operation: :ok, result: "42"}
%{cache_key: "key", operation: :ok, result: "42"}
%{cache_key: "key", operation: :ok, result: "42"}
10:08:29.727 [info] Value for key key served from cache
10:08:29.727 [info] Value for key key served from cache
10:08:29.748 [info] Value for key key served from cache
10:08:29.729 [error] Task #PID<0.2056.0> started from #PID<0.93.0> terminating
** (RuntimeError) ttl is not set
    cachex.exs:46: Test.fetch/3
    (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
    Args: [#Function<0.68188488/1 in Test.test/0>, [1842]]
10:08:29.729 [error] Task #PID<0.2074.0> started from #PID<0.93.0> terminating
** (RuntimeError) ttl is not set
    cachex.exs:46: Test.fetch/3
    (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
    Args: [#Function<0.68188488/1 in Test.test/0>, [1860]]
10:08:29.729 [error] Task #PID<0.2119.0> started from #PID<0.93.0> terminating
** (RuntimeError) ttl is not set
    cachex.exs:46: Test.fetch/3
    (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
    Args: [#Function<0.68188488/1 in Test.test/0>, [1905]]
** (EXIT from #PID<0.93.0>) an exception was raised:
    ** (RuntimeError) ttl is not set
        cachex.exs:46: Test.fetch/3
        (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
        (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
        (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
```
Hi @AntoineAugusti !
I was able to reproduce on the setup you provided, but I think this is actually already fixed in the master branch (I can't reproduce on this branch). Are you able to confirm that you can reproduce this on that branch?
This is the commit which I think resolves your issue: https://github.com/whitfin/cachex/commit/1ed5bcfdb5192c4bbb15b58cf4b4eb0b614625f2
If so, I can maybe find some time to get this pushed out as a 3.5.0 this week to get it resolved for you. I'm traveling over the next couple of days, so LMK!
Edit: I am planning on pushing a 3.5.0 this weekend to get this change out as well as some other unreleased features. Hopefully your problem is resolved by this!
@whitfin Hi 👋

I changed my script to use the `main` branch:

```elixir
Mix.install([{:cachex, github: "whitfin/cachex", branch: "main"}])
```

I saw a few `ttl is not set` log lines, but at the subsequent execution a task gets back `:commit` and therefore sets the TTL. It's no longer stuck forever without a TTL with everybody getting only `:ok` back from `fetch`.

This recovery didn't happen before your fix: when `fetch` ran very close to the moment the key expired, everybody got an `:ok` and the TTL was never set.
We'll be able to try out 3.5.0 next week and see what our monitoring says afterwards!
Thanks for your work and help, very much appreciated.
Hi @AntoineAugusti!
I just wanted to follow up and let you know that I have published v3.5.0 with the change I think should resolve your issue. If you want to test it out, feel free to let me know if you still see issues and we can look further into it - or if everything looks good, we can close this!
Thanks for reporting!
@whitfin Thanks, will try to deploy it in production on https://transport.data.gouv.fr today and follow up afterwards!
@whitfin Hey, we deployed it today and we still see some errors on our Sentry and cache debug dashboard 😢 Same symptom: key is set, but the TTL is not set.
I'm wondering how a `fetch` call where the calling process set the value could get back an `:ok` and not a `:commit` 🤔

We also never saw this error triggered.
@AntoineAugusti ah, okay, back to debugging! 😅
If we can resolve this, I can push it out in a v3.5.1 when we figure it out. I don't understand how it's happening at all, especially because this is the only time this has been reported and this pattern has been common for several years.
Is there a lot of contention on your cache key? Trying to figure out if there's some condition with locking somewhere causing this.
@whitfin I'll describe our main use case.
We have a proxy/cache for real-time data for multiple feeds. Each feed has a defined TTL. We serve the data from the cache; if we don't have it, we send a request to the origin and save the response using Cachex with the defined TTL.

It's possible that when the feed key expires we perform multiple requests to the origin and have concurrent processes trying to commit the new value. It shouldn't be that frequent because TTLs are around 30s and we're at < 10 requests per second per feed.
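For context, the hot path looks roughly like this simplified sketch (the `FeedProxy` module and `request_origin/1` are hypothetical stand-ins, reusing the `fetch/3` wrapper from the scripts above):

```elixir
defmodule FeedProxy do
  # One cache key per feed, a per-feed TTL of ~30s, and the origin
  # request as the fallback computation when the key is missing.
  def serve_feed(feed_id) do
    Test.fetch(
      "feed:#{feed_id}",
      fn _key -> request_origin(feed_id) end,
      :timer.seconds(30)
    )
  end

  # stand-in for the HTTP call to the upstream feed
  defp request_origin(_feed_id), do: %{vehicles: []}
end
```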
I've tried to replicate the issue with parameters approaching the situation I'm describing (or even worse), and I can't produce a scenario where we update the value in the cache without the TTL ever being set afterwards 😭
```elixir
Mix.install([{:cachex, github: "whitfin/cachex", branch: "main"}])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def test do
    1..100_000_000
    |> Task.async_stream(
      fn _ ->
        value_fn = fn key ->
          Logger.info("Generating cached value for key #{key}")
          :timer.sleep(Enum.random(500..1_000))
          42
        end

        :timer.sleep(Enum.random(100..500))
        fetch("key", value_fn, :timer.seconds(2))
      end,
      max_concurrency: 100,
      timeout: :infinity
    )
    |> Enum.each(fn _ -> :ok end)
  end

  def fetch(cache_key, comp_fn, expire_value) do
    {operation, result} = Cachex.fetch(cache_name(), cache_key, comp_fn)

    return_value =
      case operation do
        :ok ->
          result

        :commit ->
          {:ok, true} = Cachex.expire(cache_name(), cache_key, expire_value)
          Logger.info("Value for key #{cache_key} regenerated")
          result
      end

    with {:ok, nil} <- Cachex.ttl(cache_name(), cache_key) do
      extra = %{cache_key: cache_key, operation: operation, result: inspect(result)}
      IO.inspect(extra)
      Logger.error("ttl is not set")
    end

    return_value
  end
end

Cachex.start_link(Test.cache_name(), [])
Test.test()
```
Let me know if you've got ideas/things to try.
Hi! I'm a colleague of @AntoineAugusti and I did another bit of investigation. I am not reproducing exactly what we mentioned, but I noticed something related to the way expire is handled with fetch (https://github.com/whitfin/cachex/issues/288#issuecomment-1138801417).
With the following code:
```elixir
Mix.install([{:cachex, github: "whitfin/cachex", branch: "main"}])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def func(_x) do
    IO.puts("#{self() |> inspect()} I'm computing...")
    :timer.sleep(5_000)
    {:commit, 42}
  end
end

Cachex.start_link(Test.cache_name(), [])

cache_key = "hello-world"

1..2
|> Enum.each(fn _ ->
  IO.puts("\nRound...\n")
  :timer.sleep(1_000)

  1..3
  |> Task.async_stream(
    fn _ ->
      IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} calling fetch...")

      result =
        with {:commit, val} <- Cachex.fetch(Test.cache_name(), cache_key, &Test.func/1) do
          IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} commit - setting expire")
          {:ok, true} = Cachex.expire(Test.cache_name(), cache_key, :timer.seconds(100))
          IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} commit - expire set")
          {:commit, val}
        else
          {:ok, val} ->
            {:ok, val}

          x ->
            IO.inspect(x)
            []
        end

      {:ok, ttl} = Cachex.ttl(Test.cache_name(), cache_key)

      if is_nil(ttl) do
        IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have nil TTL (return is #{result |> inspect})")
      else
        IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have TTL #{ttl} (return is #{result |> inspect})")
      end
    end,
    max_concurrency: 100,
    timeout: :infinity
  )
  |> Enum.each(fn _ -> :ok end)
end)
```
It looks like the 2 other fetch calls do wait for the first one to complete, but they are released before the expire call has run:
```
❯ elixir other-repro.exs

Round...

#PID<0.203.0> - ~T[10:53:21.024554] calling fetch...
#PID<0.204.0> - ~T[10:53:21.024621] calling fetch...
#PID<0.205.0> - ~T[10:53:21.024641] calling fetch...
#PID<0.206.0> I'm computing...
#PID<0.203.0> - ~T[10:53:26.037074] commit - setting expire
#PID<0.203.0> - ~T[10:53:26.039463] commit - expire set
#PID<0.204.0> - ~T[10:53:26.038401] we have nil TTL (return is {:ok, 42})
#PID<0.205.0> - ~T[10:53:26.038444] we have nil TTL (return is {:ok, 42})
#PID<0.203.0> - ~T[10:53:26.039543] we have TTL 100000 (return is {:commit, 42})

Round...

#PID<0.208.0> - ~T[10:53:27.046076] calling fetch...
#PID<0.208.0> - ~T[10:53:27.046262] we have TTL 98993 (return is {:ok, 42})
#PID<0.209.0> - ~T[10:53:27.046323] calling fetch...
#PID<0.210.0> - ~T[10:53:27.046380] calling fetch...
#PID<0.209.0> - ~T[10:53:27.046418] we have TTL 98993 (return is {:ok, 42})
#PID<0.210.0> - ~T[10:53:27.046468] we have TTL 98993 (return is {:ok, 42})
```
This means that at times we can get a nil TTL. It does not yet explain, though, how we can get the "long-lasting nil TTL" we observe (unless something has crashed somewhere that we don't know about yet!).
While we will continue investigating as time permits, @whitfin we are considering moving to a transaction-based pattern instead, like mentioned here:
Thanks for your insights - a transaction looks acceptable to us in terms of throughput (since a fetch already waits for other fetches today).
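Here is a minimal sketch of the pattern we have in mind (the `TransactionalCache` module, the `fetch_with_ttl/3` name, and the `compute_value/1` fallback are all hypothetical):

```elixir
defmodule TransactionalCache do
  # Get-or-compute under a per-key lock, so the value and its TTL are
  # written together and no reader can observe one without the other.
  def fetch_with_ttl(cache, key, expire_value) do
    Cachex.transaction(cache, [key], fn worker ->
      case Cachex.get(worker, key) do
        {:ok, nil} ->
          value = compute_value(key)
          {:ok, true} = Cachex.put(worker, key, value, ttl: expire_value)
          value

        {:ok, value} ->
          value
      end
    end)
  end

  # placeholder for the expensive computation
  defp compute_value(_key), do: 42
end
```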
Hi @thbar!
This is great, thank you for your additional work on this. You're correct; it is definitely just a race condition between the tasks - the two other tasks are receiving their completion before the expire call is made. Transactions are definitely the way to support this, and should work for this use case (but at a fairly large cost).
A transaction does make sense, but I'm wondering again about an expiration option or some sort of expiration interface for `fetch/4`, maybe returning `{:commit, value, expiration}`. Not sure how I feel about that, but maybe it will help here.

I want to ask if this is something that matters in practice, though; if I understand it correctly, there will only be a few milliseconds where a TTL is not yet set. Is the concern having to protect against a `nil` TTL rather than being able to assume that one is always set? Technically this would always be possible unless `fetch/4` handles the expiration itself.
Hello, here is a third colleague from the same team :laughing: The problem is that sometimes the TTL is not set at all and we end up with cache values that are never refreshed. This happens every day on our application and it is a real problem for us, as it is difficult to detect.
Hi @fchabouis,
Okay, this makes sense (totally forgot the context of the thread).
There are a couple of options that I can think of right now:

1. Setting a default TTL on the cache itself at startup (sketched below); values written by `fetch/4` would then have a TTL and it should be guaranteed. I'm not sure if this is an option for you, as this is a single expiration value.
2. Wrapping the fetch and expire calls in a transaction, as discussed above.
3. Adding expiration support to `fetch/4`. I didn't really want to do this for many reasons, but I think allowing `{:commit, result, expiration}` is a viable approach which solves some of the problems I initially had with adding this.

What are your thoughts on these options?
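For illustration, a cache-wide default TTL (option 1) is configured at startup via `Cachex.Spec` (a sketch; the 30-second value is just an example):

```elixir
import Cachex.Spec

# every entry written to this cache, including by fetch/4, gets a
# default TTL of 30 seconds unless one is explicitly provided
Cachex.start_link(:cache,
  expiration: expiration(default: :timer.seconds(30))
)
```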
Sorry this is taking up so much of your team's time to get resolved 😅
Responding to these options: adding expiration support to `fetch/4` would be useful to a lot of users, I believe, but I understand it raises concerns as a maintainer. It would be appreciated, but we also do not want to put too much of a burden on you! If it's reasonably easy to do, though, we will definitely try it out & report back.

Thanks for those exchanges in all cases, much appreciated. We are confident we'll find a way to fix this!
Hi @thbar + @fchabouis + @AntoineAugusti!
While I agree transactions would work, I really hate that they would be required for this to work correctly. Instead, I have gone for option #3 described above. By providing options from inside the return tuple, it maintains consistency and is pretty easy to maintain (much better than adding options to `fetch/4`).

I have pushed my changes to https://github.com/whitfin/cachex/tree/issue-304, if you would like to test them out and get back to me when you have some time. The change is simply this (from the examples above):
```elixir
defmodule Test do
  require Logger

  def cache_name, do: :cache

  def func(_x) do
    IO.puts("#{self() |> inspect()} I'm computing...")
    :timer.sleep(5_000)
    # the TTL now travels with the committed value
    {:commit, 42, ttl: :timer.seconds(100)}
  end
end
```
Internally, `fetch/4` uses `put/4` to insert values, so this third element in the returned tuple is passed directly through - so `:ttl` becomes automatically supported, and it helps keep the API in line.

I am still unsure about adding a `:ttl` option directly to the `fetch/4` API, because two places in the code calling `fetch/4` on the same key with a different TTL would be very confusing. I think doing it inside the return type makes it very clear, even if it does require a little more documentation on my side.
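To illustrate with the wrapper from earlier in the thread, this is roughly what the change enables (a sketch under my reading of the new return shapes; the separate `Cachex.expire/3` call disappears):

```elixir
defmodule Test do
  def cache_name, do: :cache

  # Revised wrapper: the TTL is attached to the committed value, so the
  # value and its TTL are written together and no follow-up expire call
  # is needed.
  def fetch(cache_key, comp_fn, expire_value) do
    wrapped = fn key -> {:commit, comp_fn.(key), ttl: expire_value} end

    case Cachex.fetch(cache_name(), cache_key, wrapped) do
      {:ok, result} -> result
      {:commit, result, _options} -> result
    end
  end
end
```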
What do you think about these changes? If you test them out and all works well and solves your issue, I am happy to publish a Cachex v3.6.0 including this!
Thanks a lot @whitfin! We'll try your proposal and get back to you shortly.
I gave #308 a first try and noticed something I wanted to report. What surprises me is that the `:ok` return does not always have the same shape: callers that were waiting on the initial commit get the options back (`{:ok, 42, [ttl: 100000]}`), while subsequent fetches get a plain `{:ok, 42}`.

Here is the script output:
```
❯ elixir cache.exs

Round...

#PID<0.203.0> - ~T[13:45:14.565694] calling fetch...
#PID<0.204.0> - ~T[13:45:14.565772] calling fetch...
#PID<0.205.0> - ~T[13:45:14.565793] calling fetch...
#PID<0.206.0> I'm computing...
#PID<0.203.0> - ~T[13:45:19.578011] :commit with options [ttl: 100000] returned, val=42
#PID<0.204.0> - ~T[13:45:19.578083] :ok with options [ttl: 100000] returned, val=42
#PID<0.205.0> - ~T[13:45:19.578116] :ok with options [ttl: 100000] returned, val=42
#PID<0.203.0> - ~T[13:45:19.582858] we have TTL 99995 (return is {:commit, 42, [ttl: 100000]})
#PID<0.204.0> - ~T[13:45:19.582885] we have TTL 99995 (return is {:ok, 42, [ttl: 100000]})
#PID<0.205.0> - ~T[13:45:19.582901] we have TTL 99995 (return is {:ok, 42, [ttl: 100000]})

Round...

#PID<0.208.0> - ~T[13:45:20.588129] calling fetch...
#PID<0.208.0> - ~T[13:45:20.588378] :ok without options returned, val=42
#PID<0.209.0> - ~T[13:45:20.588424] calling fetch...
#PID<0.210.0> - ~T[13:45:20.588534] calling fetch...
#PID<0.208.0> - ~T[13:45:20.588584] we have TTL 98989 (return is {:ok, 42})
#PID<0.209.0> - ~T[13:45:20.588673] :ok without options returned, val=42
#PID<0.210.0> - ~T[13:45:20.588743] :ok without options returned, val=42
#PID<0.209.0> - ~T[13:45:20.588801] we have TTL 98989 (return is {:ok, 42})
#PID<0.210.0> - ~T[13:45:20.588893] we have TTL 98989 (return is {:ok, 42})
```
I wonder if the "ok" output should be normalised in some way (e.g. the TTL options never included), to avoid putting the burden of normalising on the caller?
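Something like this is what each caller would otherwise need (a sketch of the caller-side normalisation):

```elixir
# Collapse both tuple shapes down to {status, value} so downstream
# code does not have to care whether options were included.
normalise = fn
  {status, value} -> {status, value}
  {status, value, _options} -> {status, value}
end

{:ok, 42} = normalise.({:ok, 42, [ttl: 100_000]})
{:ok, 42} = normalise.({:ok, 42})
```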
Anyway, that was a first round of feedback. I've not yet taken the time to try this out in our application, because I wanted your feedback on this behaviour first.
@thbar you are absolutely right, this was an oversight! I've made the change to stop this from happening going forward.
Behaviour is consistent on "ok" now.
I'm going to start integrating your branch into a test branch here and see if we can get somewhere, thanks!
@whitfin We've deployed your fix in production and we are currently evaluating it. It has been running fine for the last 7 hours and we have seen 0 TTL errors in the meantime, this looks promising!
@AntoineAugusti that's great! Would you usually have seen issues in this amount of time?
I will add your Dialyzer change to the PR before merging, then wrap it into a v3.6.0 and publish it (probably tomorrow). Unless I hear more from you that the issue is not fixed :D
I don't want to shout victory too soon but this looks very promising!
A quick look at our previous exceptions:
We deployed it ~2023-02-06 09h00 UTC. Will keep it running and report in a few hours.
Thank you for the Dialyzer change, we had to fork and do this commit to make sure our CI was 🟢 before 🚢.
@whitfin I can add my own empirical testing from the previous week (in addition to the great feedback by @AntoineAugusti): I had a way to get 100% reproduction, but only on the production app. The level of traffic & cache contention made it easy to trigger the bug via a "home page refresh" while another component of the app (a slightly crowded proxy using Cachex as well) was running, and with the new branch I never managed to trigger it. So I believe that moving the expiration inside your own transaction block removed a race condition that only happened with the previously advised way of setting expiry, which was the culprit.
So I'm equally confident it is going to work quite well!
Thanks everyone involved :-)
@thbar okay, perfect! This makes sense to me also, I understand the race involved and it should definitely be removed, I agree. I just didn't want to break anything else with the changes 😅.
Okay, then I will move ahead with a v3.6.0, expect it within a day or so!
Hi all! I have released these changes in v3.6: https://github.com/whitfin/cachex/releases/tag/v3.6.0
If there are any issues, we can address them in a patch release - but I am happy to ship this initial minor based on the positive results so far (as I have time to do so currently).
Thank you for all your help getting this addressed and sharing your notes and findings!! 🎉
@whitfin Hi Isaac! I've just pushed a commit to use v3.6.0 in our production branch.
On behalf of the tech team @ transport.data.gouv.fr / French Ministry of Transport, thanks a lot for your help with this issue and your work. This is much appreciated.