Error when running `LocalCluster.start/0`

jeroenbourgois commented 5 years ago

When running LocalCluster.start/0 I get the following error:

{:error, 
  {{:shutdown, {:failed_to_start_child, :net_kernel, {:EXIT, :nodistribution}}},  
   {:child, :undefined, :net_sup_dynamic,   
    {:erl_distribution, :start_link, [[:"manager@127.0.0.1"], false]},   
    :permanent, 1000, :supervisor, [:erl_distribution]}}}

I followed the getting started guide. Other than that, I have a pretty simple Phoenix app with some other deps.

This is my Elixir and Erlang/OTP version:

Erlang/OTP 21 [erts-10.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe] [dtrace]
Elixir 1.8.1 (compiled with Erlang/OTP 21)

Any clue?

keathley commented 5 years ago

Are you seeing this when running tests? I've seen these kinds of issues before if you have firewall issues that are stopping epmd from opening the ports it needs. You can try running iex --sname gold to ensure that you can start nodes with distribution.
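A non-interactive version of that check could look like the sketch below. It assumes `elixir` is on the PATH and that the VM exits with a non-zero status when distribution cannot start; `check_distribution` is a hypothetical helper name, and `gold` is just the node name from the comment above.

```shell
#!/bin/sh
# Sketch only: check_distribution is a hypothetical helper, not part of
# LocalCluster or Elixir. It starts a short-lived node with a short name
# and prints that node's name if distribution came up.
check_distribution() {
  name="${1:-gold}"
  # Assumption: a VM that fails to start distribution exits non-zero.
  if elixir --sname "$name" -e 'IO.inspect(Node.self())'; then
    echo "distribution ok"
  else
    echo "distribution failed" >&2
    return 1
  fi
}

# Usage:
#   check_distribution gold && mix test
```

If the check fails, the problem is probably in the environment (epmd, firewall, hostname resolution) rather than in LocalCluster itself.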

jeroenbourgois commented 5 years ago

@keathley yes, after running mix test. Running the iex command works, and apparently after doing that the error message has gone away!

Having some other issues now, with every :ets.lookup going wrong, for example the ones in Phoenix itself:

 :ets.lookup(MyApp.Endpoint, :secret_key_base)

That is probably not too surprising, though. I have zero experience with multiple nodes. We are using Cachex as a cache layer in our project, and we are planning for massive scale. To prepare for that, we wanted to run a test with several nodes running the app.

jeroenbourgois commented 5 years ago

@keathley never mind, I got around it. I don't need any Phoenix-related tests for this, so I just start the LocalCluster from my test file. After that I can pass the nodes to Cachex, and it seems to work!

keathley commented 5 years ago

Glad you got it working. Not sure what operating system you're using, but on macOS you typically need to run iex with distribution and click a prompt to allow epmd to open port connections. After that things work. If you only run tests, then it doesn't present the prompt (or presents it so quickly that you don't notice it). In any case, glad it's working.

toranb commented 5 years ago

@keathley I see this same error from a cold machine boot, and I'm looking for suggestions on how to avoid having to run iex once just so that mix test works without failure. Your comment above mentions clicking a prompt. Could this be done programmatically with some API, as part of my build script for example?

... on macos you typically need to run iex with distribution and click a prompt to allow epmd to open port connections

Note: this isn't life ending and my workaround is mostly fine. I'm just curious to learn about alternatives :)

Full working example, in case you need to see the exact failure (or if anyone who follows is interested):

https://github.com/toranb/elixir-budget/commit/78c72dbff81468ffc80b8d29f3e56030cac771bc

keathley commented 5 years ago

Generally when you see this error it's because epmd isn't starting or hasn't started. I mostly see this in CI or other Linux environments. My solution is to explicitly run epmd -daemon prior to running tests and whatnot. That seems to sort out the problem.
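As a rough sketch of how that could look in a build or CI script: the helper below only starts epmd when it is not already answering. `ensure_epmd` is a hypothetical name, and the script assumes `epmd` is on the PATH and that `epmd -names` exits with a non-zero status when no daemon is listening.

```shell
#!/bin/sh
# Sketch only: ensure_epmd is a hypothetical helper, not part of
# LocalCluster. It makes sure an epmd daemon is running before the
# test suite tries to start distributed nodes.
ensure_epmd() {
  if ! command -v epmd >/dev/null 2>&1; then
    echo "epmd not found on PATH" >&2
    return 1
  fi
  # Assumption: `epmd -names` fails when it cannot reach a local epmd.
  if ! epmd -names >/dev/null 2>&1; then
    epmd -daemon
  fi
}

# Usage in a build script:
#   ensure_epmd && mix test
```

Running `epmd -daemon` unconditionally, as described above, is simpler. The extra check just avoids a second start attempt when a daemon is already up.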

toranb commented 5 years ago

@keathley that worked perfectly! Thanks for the quick reply Chris!

jedschneider commented 4 years ago

I ran into this and figured out that the coc-elixir language server was preventing the node from coming up.

I was getting this when running the tests:

PingPongTest
  * test producer sends pings to each connected nodes consumer (2.4ms)

  1) test producer sends pings to each connected nodes consumer (PingPongTest)
     test/ping_pong_test.exs:27
     ** (exit) :not_alive
     stacktrace:
       (stdlib) slave.erl:197: :slave.start/5
       (local_cluster) lib/local_cluster.ex:50: anonymous fn/2 in LocalCluster.start_nodes/3
       (elixir) lib/enum.ex:1340: anonymous fn/3 in Enum.map/2
       (elixir) lib/enum.ex:3011: Enum.reduce_range_inc/4
       (elixir) lib/enum.ex:1953: Enum.map/2
       (local_cluster) lib/local_cluster.ex:49: LocalCluster.start_nodes/3
       test/ping_pong_test.exs:16: PingPongTest.__ex_unit_setup_0/1
       test/ping_pong_test.exs:1: PingPongTest.__ex_unit__/2

--max-failures reached, aborting test suite

Finished in 0.07 seconds
1 test, 1 failure

╰─ iex -S mix
Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe]

Interactive Elixir (1.9.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> LocalCluster.start()

11:19:30.286 [info]  Protocol 'inet_tcp': register/listen error: econnrefused

{:error,
 {{:shutdown, {:failed_to_start_child, :net_kernel, {:EXIT, :nodistribution}}},
  {:child, :undefined, :net_sup_dynamic,
   {:erl_distribution, :start_link, [[:"manager@127.0.0.1"], false]},
   :permanent, 1000, :supervisor, [:erl_distribution]}}}
iex(2)>

Debugging, to check that the gold node can come up:

╰─ iex --sname gold
Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe]

Interactive Elixir (1.9.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(gold@jeds-mbp)1>

I found an existing beam process and killed it:

╰─ ps aux | grep beam
jed              10700   0.0  0.6  6075676 193584   ??  S    11:18AM   7:52.07 /Users/jed/.asdf/installs/erlang/22.2.8/erts-10.6.4/bin/beam.smp -- -root /Users/jed/.asdf/installs/erlang/22.2.8 -progname erl -- -home /Users/jed -- -kernel shell_history enabled -- -pa /Users/jed/.asdf/installs/elixir/1.9.4/bin/../lib/eex/ebin /Users/jed/.asdf/installs/elixir/1.9.4/bin/../lib/elixir/ebin /Users/jed/.asdf/installs/elixir/1.9.4/bin/../lib/ex_unit/ebin /Users/jed/.asdf/installs/elixir/1.9.4/bin/../lib/iex/ebin /Users/jed/.asdf/installs/elixir/1.9.4/bin/../lib/logger/ebin /Users/jed/.asdf/installs/elixir/1.9.4/bin/../lib/mix/ebin -noshell -s elixir start_cli -- -extra -e ElixirLS.LanguageServer.CLI.main()

After that, the mix tests ran. But of course, it killed my language server:

[coc.nvim] Did not receive workspace/didChangeConfiguration notification after 5 seconds. Using default settings.

I don't see anything that seems like it would prevent LocalCluster from coming up, other than the fact that the language server is likely coming up first and must be securing whatever is connecting on inet_tcp. Could we have the language server and LocalCluster running at the same time if they shared an Erlang cookie to connect? I also hope this helps someone who runs into the same stack trace and can't figure out how to get the tests running, as the conflict seems pretty far removed from the act of running the tests.

whitfin/local-cluster, issue #7