whitfin / cachex

A powerful caching library for Elixir with support for transactions, fallbacks and expirations
https://hexdocs.pm/cachex/
MIT License
1.6k stars · 104 forks

fetch and expire not setting TTL #304

Closed AntoineAugusti closed 1 year ago

AntoineAugusti commented 1 year ago

Hello,

We're tracking down a bug and we would appreciate any idea you have. We opened #281 in the past.

This issue happens with a wrapper we wrote around fetch: we want to set a TTL when committing the value to the cache.

In production, we see that sometimes (very rarely) keys are written in the cache but without a TTL set and therefore are never cleaned out.

We do not currently understand how this can happen.

Our code looks like this https://github.com/etalab/transport-site/blob/487455d886d4055a443c1a8ce4f8c3f80a0fbd8f/apps/transport/lib/transport/cache.ex#L34-L60 (feel free to take a look at the whole file). It is inspired by pieces of code you've shared in the past.

Any ideas would be appreciated!

whitfin commented 1 year ago

Hi!

This should be impossible. How rarely does this happen? Is there a minimal case I can use to reproduce? I looked into the repository you linked but it's quite large so it's probably not something I can play around with.

For any of the keys which do not have a TTL set, can you confirm that you see this log? https://github.com/etalab/transport-site/blob/487455d886d4055a443c1a8ce4f8c3f80a0fbd8f/apps/transport/lib/transport/cache.ex#L59

This will help me know if it's something with setting expire, or if it's something with the result of the fetch call.

AntoineAugusti commented 1 year ago

I tried to reproduce it locally with a minimal example (high number of operations, random keys, random values etc) but so far no luck.

We're adding additional logging to gather clues https://github.com/etalab/transport-site/pull/2900

AntoineAugusti commented 1 year ago

@whitfin let me know if you want me to execute some code locally/try to modify our class to pinpoint the issue

AntoineAugusti commented 1 year ago

Versions: elixir 1.14.1-otp-24, erlang 24.3.4.6

Mix.install([:cachex])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def test do
    1..100_000
    |> Task.async_stream(
      fn _ ->
        value_fn = fn key ->
          Logger.info("Generating cached value for key #{key}")

          :timer.sleep(Enum.random(500..1_000))
          42
        end

        :timer.sleep(Enum.random(100..5_000))
        fetch("key", value_fn, :timer.seconds(5))
      end,
      max_concurrency: 50,
      timeout: :infinity
    )
    |> Enum.each(&IO.inspect/1)
  end

  def fetch(cache_key, comp_fn, expire_value) do
    {operation, result} = Cachex.fetch(cache_name(), cache_key, comp_fn)

    return_value =
      case operation do
        :ok ->
          Logger.info("Value for key #{cache_key} served from cache")
          result

        :commit ->
          {:ok, true} = Cachex.expire(cache_name(), cache_key, expire_value)
          Logger.info("Value for key #{cache_key} regenerated")
          result
      end

    with {:ok, nil} <- Cachex.ttl(cache_name(), cache_key) do
      extra = %{cache_key: cache_key, operation: operation, result: inspect(result)}
      IO.inspect(extra)
      raise "ttl is not set"
    end

    return_value
  end
end

Cachex.start_link(Test.cache_name(), [])
Test.test()

When running the script (elixir cachex.exs) a few times, I sometimes get:

10:08:29.542 [info] Value for key key served from cache

10:08:29.542 [info] Value for key key served from cache

10:08:29.542 [info] Value for key key served from cache

10:08:29.542 [info] Value for key key served from cache

10:08:29.543 [info] Value for key key served from cache

10:08:29.543 [info] Value for key key served from cache

10:08:29.543 [info] Value for key key served from cache

10:08:29.548 [info] Value for key key served from cache
%{cache_key: "key", operation: :ok, result: "42"}
%{cache_key: "key", operation: :ok, result: "42"}
%{cache_key: "key", operation: :ok, result: "42"}

10:08:29.727 [info] Value for key key served from cache

10:08:29.727 [info] Value for key key served from cache

10:08:29.748 [info] Value for key key served from cache

10:08:29.729 [error] Task #PID<0.2056.0> started from #PID<0.93.0> terminating
** (RuntimeError) ttl is not set
    cachex.exs:46: Test.fetch/3
    (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
    Args: [#Function<0.68188488/1 in Test.test/0>, [1842]]

10:08:29.729 [error] Task #PID<0.2074.0> started from #PID<0.93.0> terminating
** (RuntimeError) ttl is not set
    cachex.exs:46: Test.fetch/3
    (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
    Args: [#Function<0.68188488/1 in Test.test/0>, [1860]]

10:08:29.729 [error] Task #PID<0.2119.0> started from #PID<0.93.0> terminating
** (RuntimeError) ttl is not set
    cachex.exs:46: Test.fetch/3
    (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
    (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
    Args: [#Function<0.68188488/1 in Test.test/0>, [1905]]
** (EXIT from #PID<0.93.0>) an exception was raised:
    ** (RuntimeError) ttl is not set
        cachex.exs:46: Test.fetch/3
        (elixir 1.14.1) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
        (elixir 1.14.1) lib/task/supervised.ex:34: Task.Supervised.reply/4
        (stdlib 3.17.2.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
whitfin commented 1 year ago

Hi @AntoineAugusti !

I was able to reproduce on the setup you provided, but I think this is actually already fixed in the master branch (I can't reproduce on this branch). Are you able to confirm that you can reproduce this on that branch?

This is the commit which I think resolves your issue: https://github.com/whitfin/cachex/commit/1ed5bcfdb5192c4bbb15b58cf4b4eb0b614625f2

If so, I can maybe find some time to get this pushed out as a 3.5.0 this week to get it resolved for you. I'm traveling over the next couple of days, so LMK!

Edit: I am planning on pushing a 3.5.0 this weekend to get this change out as well as some other unreleased features. Hopefully your problem is resolved by this!

AntoineAugusti commented 1 year ago

@whitfin Hi 👋

Changed my script to use the main branch

Mix.install([{:cachex, github: "whitfin/cachex", branch: "main"}])

I saw a few log lines with ttl is not set, but on the subsequent execution a task gets :commit back and therefore sets the TTL. The key is no longer stuck forever without a TTL, with everybody getting only :ok back from fetch.

This wasn't the case before your fix: when fetch ran very close to the moment the key expired, everybody got an :ok and the TTL was never set.

We'll be able to try out 3.5.0 next week and see what our monitoring says afterwards!

Thanks for your work and help, very much appreciated.

whitfin commented 1 year ago

Hi @AntoineAugusti!

I just wanted to follow up and let you know that I have published v3.5.0 with the change I think should resolve your issue. If you want to test it out, feel free to let me know if you still see issues and we can look further into it - or if everything looks good, we can close this!

Thanks for reporting!

AntoineAugusti commented 1 year ago

@whitfin Thanks, will try to deploy it in production on https://transport.data.gouv.fr today and follow up afterwards!

AntoineAugusti commented 1 year ago

@whitfin Hey, we deployed it today and we still see some errors on our Sentry and cache debug dashboard 😢 Same symptom: key is set, but the TTL is not set.

I'm wondering how a fetch call, where the process set the value, could get an :ok and not a :commit 🤔

We also never saw this error triggered

https://github.com/etalab/transport-site/blob/541ccdfa7ced0541220cd054f5f79bbc48e036a9/apps/transport/lib/transport/cache.ex#L57-L62

whitfin commented 1 year ago

@AntoineAugusti ah, okay, back to debugging! 😅

If we can resolve this, I can push it out in a v3.5.1 when we figure it out. I don't understand how it's happening at all; especially because this is the only time this has been reported and this pattern has been common for several years.

Is there a lot of contention on your cache key? Trying to figure out if there's some condition with locking somewhere causing this.

AntoineAugusti commented 1 year ago

@whitfin I'll describe our main use case.

We have a proxy/cache for real-time data for multiple feeds. Each feed has a defined TTL. We serve the data, if we don't have it in cache, we send a request to the origin and save it using Cachex with the defined TTL.

It's possible that when the feed key expires we will perform multiple requests to the origin and have a concurrent process trying to commit the new value. It shouldn't be that frequent because TTLs are around 30s and we're < 10 requests per second per feed.

I've tried to replicate the issue with parameters approaching the situation I'm describing (or even harsher ones), but I can't produce a scenario where we update the value in the cache and the TTL is left unset forever after 😭

Mix.install([{:cachex, github: "whitfin/cachex", branch: "main"}])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def test do
    1..100_000_000
    |> Task.async_stream(
      fn _ ->
        value_fn = fn key ->
          Logger.info("Generating cached value for key #{key}")

          :timer.sleep(Enum.random(500..1_000))
          42
        end

        :timer.sleep(Enum.random(100..500))
        fetch("key", value_fn, :timer.seconds(2))
      end,
      max_concurrency: 100,
      timeout: :infinity
    )
    |> Enum.each(fn _ -> :ok end)
  end

  def fetch(cache_key, comp_fn, expire_value) do
    {operation, result} = Cachex.fetch(cache_name(), cache_key, comp_fn)

    return_value =
      case operation do
        :ok ->
          result

        :commit ->
          {:ok, true} = Cachex.expire(cache_name(), cache_key, expire_value)
          Logger.info("Value for key #{cache_key} regenerated")
          result
      end

    with {:ok, nil} <- Cachex.ttl(cache_name(), cache_key) do
      extra = %{cache_key: cache_key, operation: operation, result: inspect(result)}
      IO.inspect(extra)
      Logger.error("ttl is not set")
    end

    return_value
  end
end

Cachex.start_link(Test.cache_name(), [])
Test.test()

Let me know if you've got ideas/things to try.

thbar commented 1 year ago

Hi! I'm @AntoineAugusti's colleague and I did another bit of investigation. I am not reproducing exactly what we mentioned, but I noticed something related to the way expire is handled together with fetch (https://github.com/whitfin/cachex/issues/288#issuecomment-1138801417).

With the following code:

Mix.install([{:cachex, github: "whitfin/cachex", branch: "main"}])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def func(_x) do
    IO.puts("#{self() |> inspect()} I'm computing...")
    :timer.sleep(5_000)
    {:commit, 42}
  end
end

Cachex.start_link(Test.cache_name(), [])

cache_key = "hello-world"

1..2
|> Enum.each(fn _ ->
  IO.puts("\nRound...\n")
  :timer.sleep(1_000)

  1..3
  |> Task.async_stream(
    fn _ ->
      IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} calling fetch...")

      result =
        with {:commit, val} <- Cachex.fetch(Test.cache_name(), cache_key, &Test.func/1) do
          IO.puts(
            "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} commit - setting expire"
          )

          {:ok, true} = Cachex.expire(Test.cache_name(), cache_key, :timer.seconds(100))
          IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} commit - expire set")
          {:commit, val}
        else
          {:ok, val} ->
            {:ok, val}

          x ->
            IO.inspect(x)
            []
        end

      {:ok, ttl} = Cachex.ttl(Test.cache_name(), cache_key)

      if is_nil(ttl) do
        IO.puts(
          "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have nil TTL (return is #{result |> inspect})"
        )
      else
        IO.puts(
          "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have TTL #{ttl} (return is #{result |> inspect})"
        )
      end
    end,
    max_concurrency: 100,
    timeout: :infinity
  )
  |> Enum.each(fn _ -> :ok end)
end)

It looks like the 2 other fetch calls wait for the first to complete, but the expire only runs after all 3 fetch calls have returned:

❯ elixir other-repro.exs

Round...

#PID<0.203.0> - ~T[10:53:21.024554] calling fetch...
#PID<0.204.0> - ~T[10:53:21.024621] calling fetch...
#PID<0.205.0> - ~T[10:53:21.024641] calling fetch...
#PID<0.206.0> I'm computing...
#PID<0.203.0> - ~T[10:53:26.037074] commit - setting expire
#PID<0.203.0> - ~T[10:53:26.039463] commit - expire set
#PID<0.204.0> - ~T[10:53:26.038401] we have nil TTL (return is {:ok, 42})
#PID<0.205.0> - ~T[10:53:26.038444] we have nil TTL (return is {:ok, 42})
#PID<0.203.0> - ~T[10:53:26.039543] we have TTL 100000 (return is {:commit, 42})

Round...

#PID<0.208.0> - ~T[10:53:27.046076] calling fetch...
#PID<0.208.0> - ~T[10:53:27.046262] we have TTL 98993 (return is {:ok, 42})
#PID<0.209.0> - ~T[10:53:27.046323] calling fetch...
#PID<0.210.0> - ~T[10:53:27.046380] calling fetch...
#PID<0.209.0> - ~T[10:53:27.046418] we have TTL 98993 (return is {:ok, 42})
#PID<0.210.0> - ~T[10:53:27.046468] we have TTL 98993 (return is {:ok, 42})

This means that at times we can get a nil TTL. It does not yet explain, though, how we can get the long-lasting nil TTLs we observe (unless something has crashed somewhere that we don't know about yet!).

While we will continue investigating as time permits, @whitfin we are considering moving to a transaction-based pattern instead, like mentioned here:

Thanks for your insights - a transaction looks acceptable for us (since fetch wait for other fetches anyway today) in terms of throughput.
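
The transaction-based pattern we're considering could look roughly like this (a minimal, untested sketch; fetch_with_ttl/4 is a name we made up for illustration):

```elixir
# Hypothetical sketch of the transaction-based pattern. Holding the lock on
# the key serialises the read, the write and the TTL in a single step, so no
# other process can ever observe the value without its TTL.
def fetch_with_ttl(cache, key, comp_fn, ttl) do
  Cachex.transaction(cache, [key], fn worker ->
    case Cachex.get(worker, key) do
      {:ok, nil} ->
        value = comp_fn.(key)
        # put/4 applies the :ttl atomically with the write
        {:ok, true} = Cachex.put(worker, key, value, ttl: ttl)
        value

      {:ok, value} ->
        value
    end
  end)
end
```

The obvious cost is that comp_fn runs while the key is locked, blocking concurrent writers, which matches the throughput trade-off mentioned in the quote above.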

whitfin commented 1 year ago

Hi @thbar!

This is great, thank you for your additional work on this. You're correct; it is definitely just a race condition between the tasks - the two other tasks are receiving their completion before the expire call is made. Transactions are definitely the way to support this, and should work for this use case (but at a fairly large cost).

A transaction does make sense, but I'm wondering again about an expiration option or some sort of expiration interface for fetch/4. Maybe returning {:commit, value, expiration}. Not sure how I feel about that, but maybe it will help here.

I do want to ask whether this is something that matters in practice, though; if I understand correctly, there are only a few milliseconds during which a TTL is not yet set. Is the concern having to protect against a nil TTL, rather than being able to assume one is always set? Technically this would always be possible unless fetch/4 handles the expiration itself.

fchabouis commented 1 year ago

Hello, here is a third colleague from the same team :laughing: The problem is that sometimes TTLs are not set at all, so we end up with cache values that are never refreshed. This happens every day in our application and it is a real problem for us, as it is difficult to detect.

whitfin commented 1 year ago

Hi @fchabouis,

Okay, this makes sense (totally forgot the context of the thread).

There are a couple of options that I can think of right now:

  1. Use a cache transaction and see if that helps.
  2. Setting a default expiration on your cache should move the expiration inside of fetch/4 and it should be guaranteed. I'm not sure if this is an option for you, as this is a single expiration value.
  3. I can look at adding support for expiration from directly inside fetch/4. I didn't really want to do this for many reasons, but I think allowing {:commit, result, expiration} is a viable approach which solves some of the problems I initially had with adding this.

What are your thoughts on these options?

Sorry this is taking up so much of your team's time to get resolved 😅
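
For reference, option 2 would look something like this (a sketch; the cache name and the 30-second default are illustrative):

```elixir
# Sketch of a cache-wide default expiration (option 2). With a default set,
# fetch/4 applies the TTL itself when committing values, so no separate
# expire call is needed - but it is a single value for every key.
import Cachex.Spec

Cachex.start_link(:my_cache,
  expiration: expiration(default: :timer.seconds(30))
)
```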

thbar commented 1 year ago

Responding to these options:

  1. "Use a cache transaction and see if that helps" -> this is very likely what we're going to try first! I believe it should work well enough for us performance-wise, but I also hope we won't run into trouble with the expiry inside the transaction (re https://github.com/whitfin/cachex/issues/277#issuecomment-1141577645)
  2. Setting a default expiration on the cache is already done, but sadly we have per-key TTLs (set by customers), so sometimes 10 seconds, sometimes a bit more, etc. Customisation is part of our service in this case!
  3. Expiration support directly on fetch/4 would be useful to a lot of users, I believe, but I understand it raises concerns as a maintainer. It would be appreciated, but we also do not want to put too much of a burden on you! If it's reasonably easy to do, though, we will definitely try it out and report back.

Thanks for those exchanges in all cases, much appreciated. We are confident we'll find a way to fix this!

whitfin commented 1 year ago

Hi @thbar + @fchabouis + @AntoineAugusti!

While I agree transactions would work, I really hate that it would be required for this to work correctly. Instead, I have gone for option #3 described above. By providing options from inside the return tuple, it maintains consistency and is pretty easy to maintain (much better than adding options to fetch/4).

I have pushed my changes into https://github.com/whitfin/cachex/tree/issue-304, if you would like to test it out and get back to me when you have some time. The change is simply this (from the examples above):

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def func(_x) do
    IO.puts("#{self() |> inspect()} I'm computing...")
    :timer.sleep(5_000)
    {:commit, 42, ttl: :timer.seconds(100)}
  end
end

Internally, fetch/4 uses put/4 to insert values, so this third element of the returned tuple is passed straight through; :ttl is therefore automatically supported, which helps keep the API consistent.

I still am unsure about adding a :ttl option directly to the fetch/4 API, because two places in the code calling fetch/4 on the same key with a different TTL would be very confusing. I think doing it inside the return type makes it very clear, even if it does require a little more documentation on my side.
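
For the examples above, the earlier wrapper could then drop the separate expire call entirely (a hypothetical sketch; fetch/3 and cache_name/0 mirror the earlier scripts):

```elixir
# Hypothetical rewrite of the earlier fetch wrapper: the TTL travels in the
# {:commit, value, options} tuple, so it is applied inside Cachex.fetch/4
# itself and no follow-up Cachex.expire/3 call (or race window) remains.
def fetch(cache_key, comp_fn, expire_value) do
  Cachex.fetch(cache_name(), cache_key, fn key ->
    {:commit, comp_fn.(key), ttl: expire_value}
  end)
end
```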

What do you think about these changes? If you test them out and all works well and solves your issue, I am happy to publish a Cachex v3.6.0 including this!

fchabouis commented 1 year ago

Thanks a lot @whitfin! We'll try your proposal and get back to you shortly.

thbar commented 1 year ago

I gave #308 a first try and noticed something I wanted to report:

Code:

```elixir
Mix.install([{:cachex, github: "whitfin/cachex", branch: "issue-304"}])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def func(_x) do
    IO.puts("#{self() |> inspect()} I'm computing...")
    :timer.sleep(5_000)
    {:commit, 42, ttl: :timer.seconds(100)}
  end
end

Cachex.start_link(Test.cache_name(), [])

cache_key = "hello-world"

1..2
|> Enum.each(fn _ ->
  IO.puts("\nRound...\n")
  :timer.sleep(1_000)

  1..3
  |> Task.async_stream(
    fn _ ->
      IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} calling fetch...")

      result = Cachex.fetch(Test.cache_name(), cache_key, &Test.func/1)

      case result do
        {:commit, val, options} ->
          IO.puts(
            "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} :commit with options #{options |> inspect} returned, val=#{val |> inspect}"
          )

        {:ok, val, options} ->
          IO.puts(
            "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} :ok with options #{options |> inspect} returned, val=#{val |> inspect}"
          )

        {:ok, val} ->
          IO.puts(
            "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} :ok without options returned, val=#{val |> inspect}"
          )

        x ->
          IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} bogus #{x |> inspect}")
      end

      {:ok, ttl} = Cachex.ttl(Test.cache_name(), cache_key)

      if is_nil(ttl) do
        IO.puts(
          "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have nil TTL (return is #{result |> inspect})"
        )
      else
        IO.puts(
          "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have TTL #{ttl} (return is #{result |> inspect})"
        )
      end
    end,
    max_concurrency: 100,
    timeout: :infinity
  )
  |> Enum.each(fn _ -> :ok end)
end)
```

What surprises me is that in the first round the waiting tasks get back {:ok, val, options} (including the TTL options), while in the second round they get a plain {:ok, val}.

Here is the script output:

❯ elixir cache.exs

Round...

#PID<0.203.0> - ~T[13:45:14.565694] calling fetch...
#PID<0.204.0> - ~T[13:45:14.565772] calling fetch...
#PID<0.205.0> - ~T[13:45:14.565793] calling fetch...
#PID<0.206.0> I'm computing...
#PID<0.203.0> - ~T[13:45:19.578011] :commit with options [ttl: 100000] returned, val=42
#PID<0.204.0> - ~T[13:45:19.578083] :ok with options [ttl: 100000] returned, val=42
#PID<0.205.0> - ~T[13:45:19.578116] :ok with options [ttl: 100000] returned, val=42
#PID<0.203.0> - ~T[13:45:19.582858] we have TTL 99995 (return is {:commit, 42, [ttl: 100000]})
#PID<0.204.0> - ~T[13:45:19.582885] we have TTL 99995 (return is {:ok, 42, [ttl: 100000]})
#PID<0.205.0> - ~T[13:45:19.582901] we have TTL 99995 (return is {:ok, 42, [ttl: 100000]})

Round...

#PID<0.208.0> - ~T[13:45:20.588129] calling fetch...
#PID<0.208.0> - ~T[13:45:20.588378] :ok without options returned, val=42
#PID<0.209.0> - ~T[13:45:20.588424] calling fetch...
#PID<0.210.0> - ~T[13:45:20.588534] calling fetch...
#PID<0.208.0> - ~T[13:45:20.588584] we have TTL 98989 (return is {:ok, 42})
#PID<0.209.0> - ~T[13:45:20.588673] :ok without options returned, val=42
#PID<0.210.0> - ~T[13:45:20.588743] :ok without options returned, val=42
#PID<0.209.0> - ~T[13:45:20.588801] we have TTL 98989 (return is {:ok, 42})
#PID<0.210.0> - ~T[13:45:20.588893] we have TTL 98989 (return is {:ok, 42})

I wonder if the :ok output should be normalised in some way (e.g. never including the TTL options), to avoid putting the burden of normalising on the caller?

Anyway that was a first feedback. I've not yet taken the time to try that out in our application, because I wanted to have your feedback on this behaviour before.

whitfin commented 1 year ago

@thbar you are absolutely right, this was an oversight! I've made the change to stop this from happening going forward.

thbar commented 1 year ago

Behaviour is consistent on "ok" now:

Code:

```elixir
Mix.install([{:cachex, github: "whitfin/cachex", branch: "issue-304"}])

defmodule Test do
  require Logger

  def cache_name, do: :cache

  def func(_x) do
    IO.puts("#{self() |> inspect()} I'm computing...")
    :timer.sleep(5_000)
    {:commit, 42, ttl: :timer.seconds(100)}
  end
end

Cachex.start_link(Test.cache_name(), [])

cache_key = "hello-world"

1..2
|> Enum.each(fn _ ->
  IO.puts("\nRound...\n")
  :timer.sleep(1_000)

  1..3
  |> Task.async_stream(
    fn _ ->
      IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} calling fetch...")

      result = Cachex.fetch(Test.cache_name(), cache_key, &Test.func/1)

      case result do
        {:commit, val, options} ->
          IO.puts(
            "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} :commit with options #{options |> inspect} returned, val=#{val |> inspect}"
          )

        {:ok, val, options} ->
          raise "Should not happen"

        {:ok, val} ->
          IO.puts(
            "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} :ok without options returned, val=#{val |> inspect}"
          )

        x ->
          IO.puts("#{self() |> inspect()} - #{Time.utc_now() |> inspect()} bogus #{x |> inspect}")
      end

      {:ok, ttl} = Cachex.ttl(Test.cache_name(), cache_key)

      if is_nil(ttl) do
        IO.puts(
          "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have nil TTL (return is #{result |> inspect})"
        )
      else
        IO.puts(
          "#{self() |> inspect()} - #{Time.utc_now() |> inspect()} we have TTL #{ttl} (return is #{result |> inspect})"
        )
      end
    end,
    max_concurrency: 100,
    timeout: :infinity
  )
  |> Enum.each(fn _ -> :ok end)
end)
```

Logs:

```
Round...

#PID<0.203.0> - ~T[13:19:58.654184] calling fetch...
#PID<0.204.0> - ~T[13:19:58.654184] calling fetch...
#PID<0.205.0> - ~T[13:19:58.654212] calling fetch...
#PID<0.206.0> I'm computing...
#PID<0.204.0> - ~T[13:20:03.663629] :ok without options returned, val=42
#PID<0.205.0> - ~T[13:20:03.663694] :ok without options returned, val=42
#PID<0.203.0> - ~T[13:20:03.663629] :commit with options [ttl: 100000] returned, val=42
#PID<0.204.0> - ~T[13:20:03.667328] we have TTL 99996 (return is {:ok, 42})
#PID<0.205.0> - ~T[13:20:03.667356] we have TTL 99996 (return is {:ok, 42})
#PID<0.203.0> - ~T[13:20:03.667375] we have TTL 99996 (return is {:commit, 42, [ttl: 100000]})

Round...

#PID<0.208.0> - ~T[13:20:04.672129] calling fetch...
#PID<0.208.0> - ~T[13:20:04.672338] :ok without options returned, val=42
#PID<0.209.0> - ~T[13:20:04.672379] calling fetch...
#PID<0.208.0> - ~T[13:20:04.672483] we have TTL 98991 (return is {:ok, 42})
#PID<0.209.0> - ~T[13:20:04.672580] :ok without options returned, val=42
#PID<0.210.0> - ~T[13:20:04.672629] calling fetch...
#PID<0.209.0> - ~T[13:20:04.672722] we have TTL 98991 (return is {:ok, 42})
#PID<0.210.0> - ~T[13:20:04.672785] :ok without options returned, val=42
#PID<0.210.0> - ~T[13:20:04.672876] we have TTL 98991 (return is {:ok, 42})
```

thbar commented 1 year ago

I'm going to start integrating your branch into a test branch here and see if we can get somewhere, thanks!

AntoineAugusti commented 1 year ago

@whitfin We've deployed your fix in production and we are currently evaluating it. It has been running fine for the last 7 hours and we have seen 0 TTL errors in the meantime, this looks promising!

whitfin commented 1 year ago

@AntoineAugusti that's great! Would you usually have seen issues in this amount of time?

I will add your Dialyzer change to the PR before merging, then wrap it into a v3.6.0 and publish it (probably tomorrow). Unless I hear more from you that the issue is not fixed :D

AntoineAugusti commented 1 year ago

I don't want to shout victory too soon but this looks very promising!

A quick look at our previous exceptions:

[screenshot of our previous exceptions]

We deployed it ~2023-02-06 09h00 UTC. Will keep it running and report in a few hours.

Thank you for the Dialyzer change, we had to fork and do this commit to make sure our CI was 🟢 before 🚢.

thbar commented 1 year ago

@whitfin I can add my own empirical testing from the previous week (in addition to the great feedback by @AntoineAugusti): I had a way to reproduce the bug 100% of the time, but only on the production app. The level of traffic and cache contention made it easy to trigger via a "home page refresh" while another component of the app (a slightly crowded proxy, also using Cachex) was under load, and with the new branch I never managed to trigger it. So I believe that moving the expiration inside your own transaction block removed the race condition that existed with the previously advised pattern of setting the expiry separately, which was the culprit.

So I'm equally confident it is going to work quite well!

Thanks everyone involved :-)

whitfin commented 1 year ago

@thbar okay, perfect! This makes sense to me too; I understand the race involved and agree it should definitely be removed. I just didn't want to break anything else with the changes 😅.

Okay, then I will move ahead with a v3.6.0, expect it within a day or so!

whitfin commented 1 year ago

Hi all! I have released these changes in v3.6: https://github.com/whitfin/cachex/releases/tag/v3.6.0

If there are any issues, we can address them in a patch release - but I am happy to ship this initial minor based on the positive results so far (as I have time to do so currently).

Thank you for all your help getting this addressed and sharing your notes and findings!! 🎉

AntoineAugusti commented 1 year ago

@whitfin Hi Isaac! I've just pushed a commit to use the v3.6.0 in our production branch.

On behalf of the tech team @ transport.data.gouv.fr / French Ministry of Transport, thanks a lot for your help with this issue and your work. This is much appreciated.