will / crystal-pg

a postgres driver for crystal
BSD 3-Clause "New" or "Revised" License
462 stars 77 forks source link

Error handling for connect_listen #195

Closed straight-shoota closed 2 years ago

straight-shoota commented 4 years ago

When closing the connection returned by connect_listen, an exception is raised inside the listen fiber:

require "pg"
connection = PG.connect_listen ENV["DATABASE_URL"], "foo" {  }
sleep 1
connection.close
Unhandled exception in spawn: Closed stream (IO::Error)
  from /usr/share/crystal/src/io.cr:128:5 in 'check_open'
  from /usr/share/crystal/src/io/buffered.cr:69:5 in 'read'
  from /usr/share/crystal/src/io.cr:530:20 in 'read_fully?'
  from /usr/share/crystal/src/io.cr:513:5 in 'read_fully'
  from /usr/share/crystal/src/io/byte_format.cr:131:3 in 'decode'
  from /usr/share/crystal/src/int.cr:520:5 in 'from_io'
  from /usr/share/crystal/src/io.cr:889:5 in 'read_bytes'
  from lib/pg/src/pq/connection.cr:115:7 in 'read_i32'
  from lib/pg/src/pq/connection.cr:177:7 in 'read_one_frame'
  from lib/pg/src/pq/connection.cr:169:31 in 'read_async_frame_loop'
  from lib/pg/src/pg/connection.cr:50:15 in '->'
  from /usr/share/crystal/src/fiber.cr:255:3 in 'run'
  from /usr/share/crystal/src/fiber.cr:48:34 in '->'
  from ???

PQ::Connection#read_async_frame_loop only swallows Errno, it should probably be extended to cover more errors, including IO::Error. Probably any error could be ignored if the connection is closed?

m-o-e commented 4 years ago

+1

I'm now using this little monkey-patch to be able to handle errors in connect_listen:

class PG::Connection
  protected def listen
    @connection.read_async_frame_loop
  end
end

But needless to say, patching low-level shards in this way doesn't feel good. ;)

Perhaps connect_listen could just be a blocking call to begin with?

We can always wrap it in spawn ourselves where needed and it gives much better control over the error handling.

haarts commented 3 years ago

I'm currently struggling a bit to actually make connect_listen blocking. I was somewhat surprised to learn it didn't block. But perhaps that makes sense too.

carlhoerberg commented 2 years ago

Agree, PG.connect_listen should be blocking, currently the fiber will die if eg. the server is restarted, and there's no obovious way to rescue it without monkey patching

will commented 2 years ago

Thanks to @carlhoerberg for the patch. Making it block does make sense I think, but I don’t myself use listen/notify. Since this will be a breaking change to anyone using it now, I'd appreciate it if anyone using listen/notify could comment on making it blocking.

straight-shoota commented 2 years ago

I think the change makes sense.

But I'd strongly advise to avoid breaking implicit behaviour without breaking the explicit API. If you're using PG.connect_listen right now, your code continues to compile with this change. But it won't work anymore because the call blocks.

This new behaviour should be introduced with a different method signature. The easiest solution is probably to rename the methods. PG.connect_listen could become .listen_connect - or .connect_listen! (the exclamation mark isn't very self-explanatory, though). PG::Connection#listen is more complicated. #listen! might be an acceptable option. Maybe .blocking_connect_listen and #blocking_listen could be a last solution.

The old methods could either be deprecated or even kept alongside the new, blocking ones. Not sure if that makes much sense though, when the difference is just wrapping spawn.

straight-shoota commented 2 years ago

This doesn't resolve the original issue, though. It's an improvement for handling the error, but not an actual fix.

The example would become something like this:

require "pg"
connection = spawn { PG.blocking_connect_listen ENV["DATABASE_URL"], "foo" {  } }
sleep 1
connection.close

Closing the connection would still raise IO::Error in Connection#blocking_listen. Of course, the situation is better because I can rescue in user code. But I'd argue that this exception shouldn't be leaking here in the first place. Closing the connection should not be an error in the listening fiber.

m-o-e commented 2 years ago

I'd appreciate it if anyone using listen/notify could comment on making it blocking.

As one of these users I approve of the change. Monkey-patching should not be required to use listen/notify or deal with PG connections in a reliable fashion in general.

Closing the connection should not be an error in the listening fiber.

I agree with this as well, but not sure how much of that can/should be abstracted away from the user.

I think the interface that most users probably want is a blocking call that raises all connection-exceptions except when #close was previously called.

In that case it should either just return, or raise a known Exception such as ConnectionClosed.

straight-shoota commented 2 years ago

@m-o-e Yeah, I'd expect a blocking method to simply return when the connection is closed.

haarts commented 2 years ago

I, too, think it makes sense making this blocking.

FYI I've worked around it currently with:


channel = Channel(Exception).new
PG.connect_listen() do
 ... 
rescue e : Exception
  channel.send(e)
end
channel.receive
will commented 2 years ago

Sorry everyone, I got very busy right around the last comments here, then this issue slipped out of my mind. I'm having trouble gathering context on it again. There are a good number of unreleased changes, so I'm planning on doing a release soon. Can someone make a PR with the best way forward for this please so we can get this solved in the next release?

straight-shoota commented 2 years ago

I think the way forward is twofold:

will commented 2 years ago

@straight-shoota does the code you gave in the first post error for you every time? I haven't yet been able to reproduce getting any sort of error on closing the connection. I haven't tried linux yet, maybe it's a mac/linux difference?

crystal eval 'require "./src/pg"; 1000.times { c = PG.connect_listen("postgres:///", "foo") {}; c.close }; puts "done"'
done
straight-shoota commented 2 years ago

Strangely, the error doesn't reproduce anymore for me either. I even tried with 0.22.1, which I believe was the version used in the initial report. Maybe it was really a platform thing on WSL1 🤷 Or some issue in an older stdlib version.

will commented 2 years ago

Ok, I think #243 covers everything, let me know.

will commented 2 years ago

The above pr has been merged and released with v0.25.0