will / crystal-pg

a postgres driver for crystal
BSD 3-Clause "New" or "Revised" License
463 stars 78 forks source link

Linux broken pipe #110

Closed cptaffe closed 7 years ago

cptaffe commented 7 years ago

I don't get this issue when running my server on my local machine (OS X), but I do get it when running it in a docker image on my local machine, and in ECR (docker images use crystallang/crystal:0.23.1 and install libpq-dev).

I think this is a race condition, it doesn't seem to show up until there are two concurrent requests sending queries. Once it errors each subsequent request errors and the only fix is to restart the process.

The backtrace I get looks like:

0x53151b: unbuffered_write at /opt/crystal/src/io/file_descriptor.cr 276:11
0x531158: flush at /opt/crystal/src/io/buffered.cr 186:5
0x533881: write at /opt/crystal/src/io/buffered.cr 134:7
0x7f86449afb5c: BIO_write at ??
0x7f8644d08bf2: ??? at ??
0x7f8644d09404: ??? at ??
0x6474c4: unbuffered_write at /opt/crystal/src/openssl/ssl/socket.cr 111:5
0x647558: flush at /opt/crystal/src/io/buffered.cr 186:5
0x648ef4: write_byte at /opt/crystal/src/io/buffered.cr 151:7
0x642c4a: write_chr at /usr/src/app/lib/pg/src/pq/connection.cr 100:7
0x644a60: send_parse_message at /usr/src/app/lib/pg/src/pq/connection.cr 315:7
0x63cf0e: perform_query at /usr/src/app/lib/pg/src/pg/statement.cr 13:5
0x641c63: perform_query_with_rescue at /usr/src/app/lib/db/src/db/statement.cr 103:14
0x641c48: query at /usr/src/app/lib/db/src/db/statement.cr 84:7
0x68477d: query at /usr/src/app/lib/db/src/db/pool_statement.cr 29:30
0x627955: query at /usr/src/app/lib/db/src/db/query_methods.cr 38:7
0x681e81: paginate_all at /usr/src/app/src/models/url.cr 45:5
0x673fff: call at /usr/src/app/src/handlers/app.cr 41:26
0x673a23: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x67385f: call at /opt/crystal/src/http/server/handlers/compress_handler.cr 12:5
0x67331f: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x6728d9: call at /opt/crystal/src/http/server/handlers/log_handler.cr 12:5
0x672608: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x6722ed: call at /usr/src/app/src/reporter/handler.cr 13:7
0x67215a: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x671ce0: call at /opt/crystal/src/http/server/handlers/error_handler.cr 15:7
0x675307: process at /opt/crystal/src/http/server/request_processor.cr 39:11
0x674f79: process at /opt/crystal/src/http/server/request_processor.cr 16:3
0x674e80: handle_client at /opt/crystal/src/http/server.cr 191:5
0x4ea7b3: ~procProc(Nil) at /usr/src/app/src/server.cr 14:5
0x50a07e: run at /opt/crystal/src/fiber.cr 255:3
0x4e20a6: ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) at /opt/crystal/src/concurrent.cr 61:3
0x0: ??? at ??

Or

GET /urls? - Unhandled exception:
SSL_read: Unexpected EOF (OpenSSL::SSL::Error)
0x64804b: unbuffered_read at /opt/crystal/src/openssl/ssl/socket.cr 105:5
0x647ed6: fill_buffer at /opt/crystal/src/io/buffered.cr 208:5
0x647e11: peek at /opt/crystal/src/io/buffered.cr 84:7
0x647bd8: read_char_with_bytesize at /opt/crystal/src/io.cr 324:28
0x647ad3: read_char at /opt/crystal/src/io.cr 317:12
0x6435ad: read at /usr/src/app/lib/pg/src/pq/connection.cr 146:12
0x644f5f: expect_frame at /usr/src/app/lib/pg/src/pq/connection.cr 289:31
0x644f3b: expect_frame at /usr/src/app/lib/pg/src/pq/connection.cr 288:5
0x63cfb3: perform_query at /usr/src/app/lib/pg/src/pg/statement.cr 18:5
0x641c83: perform_query_with_rescue at /usr/src/app/lib/db/src/db/statement.cr 103:14
0x641c68: query at /usr/src/app/lib/db/src/db/statement.cr 84:7
0x68479d: query at /usr/src/app/lib/db/src/db/pool_statement.cr 29:30
0x627975: query at /usr/src/app/lib/db/src/db/query_methods.cr 38:7
0x681ea1: paginate_all at /usr/src/app/src/models/url.cr 45:5
0x673bcf: call at /usr/src/app/src/handlers/app.cr 41:26
0x6735f3: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x67342f: call at /opt/crystal/src/http/server/handlers/compress_handler.cr 12:5
0x672eef: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x6724a9: call at /opt/crystal/src/http/server/handlers/log_handler.cr 12:5
0x6721d8: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x671ebd: call at /usr/src/app/src/reporter/handler.cr 13:7
0x671d2a: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x6718b0: call at /opt/crystal/src/http/server/handlers/error_handler.cr 15:7
0x674ed7: process at /opt/crystal/src/http/server/request_processor.cr 39:11
0x674b49: process at /opt/crystal/src/http/server/request_processor.cr 16:3
0x674a50: handle_client at /opt/crystal/src/http/server.cr 191:5
0x4ea7d3: ~procProc(Nil) at /usr/src/app/src/server.cr 21:5
0x50a09e: run at /opt/crystal/src/fiber.cr 255:3
0x4e20c6: ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) at /opt/crystal/src/concurrent.cr 61:3
0x0: ??? at ??

Or

GET /urls? - Unhandled exception:
SSL_read: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac (OpenSSL::SSL::Error)
0x4f07b7: *CallStack::unwind:Array(Pointer(Void)) at ??
0x64804b: unbuffered_read at /opt/crystal/src/openssl/ssl/socket.cr 105:5
0x647ed6: fill_buffer at /opt/crystal/src/io/buffered.cr 208:5
0x647e11: peek at /opt/crystal/src/io/buffered.cr 84:7
0x647bd8: read_char_with_bytesize at /opt/crystal/src/io.cr 324:28
0x647ad3: read_char at /opt/crystal/src/io.cr 317:12
0x6435ad: read at /usr/src/app/lib/pg/src/pq/connection.cr 146:12
0x644f5f: expect_frame at /usr/src/app/lib/pg/src/pq/connection.cr 289:31
0x644f3b: expect_frame at /usr/src/app/lib/pg/src/pq/connection.cr 288:5
0x63cfb3: perform_query at /usr/src/app/lib/pg/src/pg/statement.cr 18:5
0x641c83: perform_query_with_rescue at /usr/src/app/lib/db/src/db/statement.cr 103:14
0x641c68: query at /usr/src/app/lib/db/src/db/statement.cr 84:7
0x68479d: query at /usr/src/app/lib/db/src/db/pool_statement.cr 29:30
0x627975: query at /usr/src/app/lib/db/src/db/query_methods.cr 38:7
0x681ea1: paginate_all at /usr/src/app/src/models/url.cr 45:5
0x673bcf: call at /usr/src/app/src/handlers/app.cr 41:26
0x6735f3: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x67342f: call at /opt/crystal/src/http/server/handlers/compress_handler.cr 12:5
0x672eef: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x6724a9: call at /opt/crystal/src/http/server/handlers/log_handler.cr 12:5
0x6721d8: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x671ebd: call at /usr/src/app/src/reporter/handler.cr 13:7
0x671d2a: call_next at /opt/crystal/src/http/server/handler.cr 24:7
0x6718b0: call at /opt/crystal/src/http/server/handlers/error_handler.cr 15:7
0x674ed7: process at /opt/crystal/src/http/server/request_processor.cr 39:11
0x674b49: process at /opt/crystal/src/http/server/request_processor.cr 16:3
0x674a50: handle_client at /opt/crystal/src/http/server.cr 191:5
0x4ea7d3: ~procProc(Nil) at /usr/src/app/src/server.cr 21:5
0x50a09e: run at /opt/crystal/src/fiber.cr 255:3
0x4e20c6: ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) at /opt/crystal/src/concurrent.cr 61:3
0x0: ??? at ??

Stress testing a fresh process on Linux looks like:

$ echo GET https://shortenr.thrillshare.io/urls | vegeta attack -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.20
Duration      [total, attack, wait]    5.502569887s, 4.979999966s, 522.569921ms
Latencies     [mean, 50, 95, 99, max]  525.492236ms, 164.552039ms, 1.38650682s, 5.295669316s, 5.331346839s
Bytes In      [total, mean]            6452, 25.81
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  0.80%
Status Codes  [code:count]             200:2  500:248
Error Set:
500 Internal Server Error

And on OS X (no issue):

$ echo "GET http://localhost:5000/urls" | vegeta attack -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.20
Duration      [total, attack, wait]    5.076995124s, 4.979999943s, 96.995181ms
Latencies     [mean, 50, 95, 99, max]  192.940717ms, 62.679728ms, 507.618503ms, 555.567446ms, 703.366638ms
Bytes In      [total, mean]            500, 2.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:250
Error Set:
cptaffe commented 7 years ago

This was my fault. Not sure why the inconsistency between OS X and Linux, but the problem is that I initialize the database object in one process, then fork workers. Since the workers are using the same socket this obviously leads to a race condition Ultimately the fix was to call close on the database object once forked

will commented 7 years ago

Yeah also not sure why there'd be a difference between os x and linux here, but thanks for posting, it should help others find the answer when they run into it.