wowica / xogmios

An Elixir client for Ogmios
9 stars 0 forks source link

Incorrect behavior with not fully synced Ogmios #42

Closed arjaz closed 1 month ago

arjaz commented 1 month ago

The problem manifests when you try to establish a connection with Ogmios when it's not fully synced (in my case when you start the application and Ogmios simultaneously). The connection seems to be established because I get logs from the handle_connect/1 callback, but when the client tries to send frames to Ogmios it fails with the following error:

[info] Connected to Ogmios
[error] :gen_statem OgmiosChainSyncClient terminating
** (FunctionClauseError) no function clause matching in :websocket_client.disconnected/3
    (websocket_client 1.5.0) /home/arjaz/code/xxxxxx/deps/websocket_client/src/websocket_client.erl:313: :websocket_client.disconnected(:cast, {:cast_frame, {:text, "{\n  \"jsonrpc\": \"2.0\",\n  \"method\": \"nextBlock\",\n  \"id\": \"initial_sync\"\n}\n"}}, {:context, {:websocket_req, :ws, ~c"0.0.0.0", 1337, ~c"/", 5000, #Reference<0.2003193081.4165206023.17349>, 1, #Port<0.21>, {:transport, :gen_tcp, :tcp, :tcp_closed, :tcp_error, [mode: :binary, active: true, packet: 0]}, "xxxxxx", :undefined, 1, :undefined, :undefined, :undefined}, {:transport, :gen_tcp, :tcp, :tcp_closed, :tcp_error, [mode: :binary, active: true, packet: 0]}, [], {:ws, ~c"0.0.0.0", 1337, ~c"/"}, {OgmiosChainSyncClient, %{restart: :permanent, handler: OgmiosChainSyncClient, start_slot: 63877794, start_block_id: "3ada6acb25b72f4540a1047fe0ad14744bdea505106863a5d7dc655e96386f05", sync_from: %{point: %{id: "6cdb572be812026c342108c6e9856ccf36e7c971d548305cc3362c43762e749d", slot: 70870715}}, ws_pid: #PID<0.455.0>}}, "", true, 0})
    (stdlib 6.0.1) gen_statem.erl:3113: :gen_statem.loop_state_callback/11
    (stdlib 6.0.1) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Queue: [cast: {:cast_frame, {:text, "{\n  \"jsonrpc\": \"2.0\",\n  \"method\": \"nextBlock\",\n  \"id\": \"initial_sync\"\n}\n"}}]
Postponed: []
State: {:disconnected, {:context, {:websocket_req, :ws, ~c"0.0.0.0", 1337, ~c"/", 5000, #Reference<0.2003193081.4165206023.17349>, 1, #Port<0.21>, {:transport, :gen_tcp, :tcp, :tcp_closed, :tcp_error, [mode: :binary, active: true, packet: 0]}, "xxxxxx", :undefined, 1, :undefined, :undefined, :undefined}, {:transport, :gen_tcp, :tcp, :tcp_closed, :tcp_error, [mode: :binary, active: true, packet: 0]}, [], {:ws, ~c"0.0.0.0", 1337, ~c"/"}, {OgmiosChainSyncClient, %{restart: :permanent, handler: OgmiosChainSyncClient, start_slot: 63877794, start_block_id: "3ada6acb25b72f4540a1047fe0ad14744bdea505106863a5d7dc655e96386f05", sync_from: %{point: %{id: "6cdb572be812026c342108c6e9856ccf36e7c971d548305cc3362c43762e749d", slot: 70870715}}, ws_pid: #PID<0.455.0>}}, "", true, 0}}
Callback mode: :state_functions, state_enter: false

The connection doesn't get restored after that, but it doesn't get disconnected either as the handle_disconnect/2 callback isn't called. Here are the relevant parts of the client I have:

defmodule OgmiosChainSyncClient do
  use Xogmios, :chain_sync

  def start_link(opts) do
    last_processed_block = Query.last_block()
    initial_state =
        [
          sync_from: %{
            point: %{
              slot: last_processed_block.slot,
              id: last_processed_block.hash
            }
          }
        ]
    opts = Keyword.merge(opts, initial_state)
    Xogmios.start_chain_sync_link(__MODULE__, opts)
  end

  @impl true
  def handle_connect(state) do
    Logger.info("Connected to Ogmios")
    {:ok, state}
  end

  @impl true
  def handle_disconnect(_reason, state) do
    Logger.warning("Ogmios connection died, closing the connection")
    {:close, :disconnect, state}
  end

  @impl true
  def handle_block(block, state) do
    {:ok, :next_block, state}
  end
end
caike commented 1 month ago

I was able to replicate this error by starting Ogmios using an invalid (inexistent) socket file as argument. Perhaps that's what's causing your issue as well.

Either way, this tells me a more sophisticated error handling logic is needed in Xogmios. I'll look into this as soon as I have some free time. Thanks for reporting!

caike commented 1 month ago

@arjaz Mind testing the following branch when you get a chance ?

defp deps do
  [ 
    #...
    {:xogmios, github: "wowica/xogmios", ref: "5f7c7cc"},
  ]
end

This should address the issue with the scenario I described. Other than that I'm not familiar with Ogmios having a "sync" period. Let me know if this is still an issue.

arjaz commented 1 month ago

What I was referring to with Ogmios syncing is that it has a period where it pushes new blocks or something:

[2024-09-19 09:14:52.17 UTC] Pushing ledger state for block 12421234b585d69736437d525d5245c6358ba6be364f30c9c5352fd64fa9857e at slot 70913787. Progress: 16.82%

The 5f7c7cc commit calls the handle_disconnect callback which is nice, but it seems it doesn't handle the :reconnect option gracefully 🤔 With the following callback:

  @impl true
  def handle_disconnect(_reason, state) do
    Logger.warning("Ogmios connection died, reconnecting in 5s")
    {:reconnect, 5_000, state}
  end

I get this log

[warning] Ogmios connection died, reconnecting in 5s

But no actual blocks after that nor any following handle_connect calls

caike commented 1 month ago

Oh I see. This log message is coming from the Cardano node itself while it syncs with other nodes in the network.

While Ogmios is able to start and receive connections from clients, I don't believe it is in usable state until the underlying Cardano node is fully synced. The node, for example, once started takes a while to create its socket file and accept connections. This is likely the reason why Xogmios cannot fully establish the connection to Ogmios.

That being said, I'll look into improving the reconnection logic to keep trying to reconnect in this particular scenario until the node is fully synced and a socket file is available for Ogmios. Just know that it might take a long time, though, depending on how the node is configured... like many hours.

caike commented 1 month ago

💡 Found some useful information reading through the TypeScript client that relates to this exact scenario you brought up.

This is great. Should be able to use Ogmios' /health endpoint as part of Xogmios' internal connection process to check whether it can start sending messages.

caike commented 1 month ago

@arjaz Pushed a few more changes. It should now display a more informative warning message and then keep trying to reconnect indefinitely until Ogmios is ready to respond with data.

Mind trying it again ?

{:xogmios, github: "wowica/xogmios", ref: "da31c26"},

Update: Actually, looking back at the TS client I'm realizing this fix might not yet fully address your use case. I'm adding a few more checks.

caike commented 1 month ago

Should include much better error/warning messages now:

{:xogmios, github: "wowica/xogmios", ref: "efedfd1"}
arjaz commented 1 month ago

Yes, I can confirm it. Thank you so much for your work!