will / crystal-pg

a postgres driver for crystal
BSD 3-Clause "New" or "Revised" License
462 stars 77 forks source link

Possible Memory Leak in exec #191

Closed haffla closed 4 years ago

haffla commented 4 years ago

I am currently writing a crystal script that performs an ETL process. I read data from one table in a MySQL database, perform some transformations and write the result into a Postgres database. The workload is quite big (9 million rows). I stream results from the source db, collect them in an array and after every 500 rows I insert them into the target DB.

# simplified code...

BATCH_SIZE = 500
results = Array(Array(T)).new(BATCH_SIZE) { Array(T).new }
# T is some union type
index = 0

DB.open "mysql://root:root@localhost:3306/some_db" do |db|
  DB.open "postgres://postgres:postgres@localhost:5432/some_db" do |pg_db|
    db.query "select one, two from some_table" do |rs|
      the_values = .... # read values from rs using rs.read(T)
      # perform some transformations
      results[index] = the_values
      if index == (BATCH_SIZE - 1)
        # something's leaking here
        pg_db.exec "insert into some_table (one, two) values #{generate_sql_from(values)}"
        index = 0
      else
        index += 1
      end
    end
  end
end

So the problem now is that the above works but over time memory consumption grows and grows and grows, on my machine finishing with like 600 MB. So initially I thought maybe it's my code or maybe it's just Crystal (I am new to Crystal). I realise that I am creating a lot of huge strings and all but when I just skip the insertion into the Postgres DB (and still generate all that SQL) this runs with a very very low and stable memory profile.

I am opening the issue in crystal-pg and not in crystal-db because I also tested the same script with two MySQL connections, transferring data from one MySQL db into another and there memory consumption is quite stable at around 180 MB.

asterite commented 4 years ago

Interesting!

Could you provide reproducible code? A snippet that we can copy, paste and run on our machines that show the problem. That will make it so much easier to debug it and solve it. Otherwise we have no idea where to start.

straight-shoota commented 4 years ago

Reproducible code:

require "pg"

def generate_values(i)
  String.build do |io|
    500.times do |j|
      io << ',' unless j == 0
      io << "(#{i}, #{j})"
    end
  end
end

DB.open ENV["DATABASE_URL"] do |db|
  db.exec "CREATE TABLE foo (a int, b int)"
  1000.times do |i|
    puts "Iteration #{i}"
    db.exec "INSERT INTO foo (a, b) VALUES #{generate_values(i)}"
    p! GC.stats
  end
  db.exec "DROP TABLE foo"
end

Heap seems to grow by about 140K per iteration.

asterite commented 4 years ago

This is a problem in crystal-db: the string passed to exec is never released. Apparently crystal-db caches query strings but never releases them. So if you keep passing different strings you'll get more and more memory consumed.

This problem is solved by using parameter placeholders, like "$1". But I still think this is an issue in crystal-db: there should be a pool of prepared statements, not an unbounded cache.

Please someone open this issue in crystal-db (I don't have more time right now, sorry).

asterite commented 4 years ago

Actually: https://github.com/crystal-lang/crystal-db/issues/114

bcardiff commented 4 years ago

Answered in crystal-db we can keep the discussion there. I agree is something that belongs in crystal-db.