Open · leonjoyce opened this issue 5 years ago
I tried the 'Cursors at server side' solution, and it works.
Hi Leon, if you look at SynchronousBootstrapper.java:173, we're already doing exactly that. Were you able to reliably reproduce the net-write-timeout error?
Hi @osheroff, I have a few questions related to bootstrapping large tables (more than 100 million rows).
Is there any reason we're streaming all the results in a single query? By my calculation, for a table as large as that, it would take at least 10-12 hours to fetch the full result set. There are two problems here: A) queries/connections will time out, and B) all progress is lost if the query is killed, so the bootstrap can't be stopped partway through. Also, the cursor Maxwell currently uses fetches one row at a time, which incurs a lot of network round-trip time and reduces throughput.
I'm wondering whether there's any problem with using MySQL queries in a LIMIT/OFFSET pattern (we could also store the offset in the DB to mark progress; a rough sketch follows below). That would give us several advantages: A) bootstrapping can be resumed at any point, B) shorter-running queries, and C) better throughput.
Would really appreciate it if you could shed some light on the above points. I can contribute a fix for point 2 if it makes sense. Thanks in advance.
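Here is a rough sketch of what I have in mind, just to make the idea concrete. Everything in it — `big_table`, the `bootstrap_progress` checkpoint table, the connection details — is hypothetical, not existing Maxwell code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class OffsetBootstrap {
    static final int PAGE_SIZE = 10_000;

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "maxwell", "secret")) {
            // Resume from the last offset recorded in the progress table.
            long offset = loadOffset(conn);
            while (true) {
                int rows = 0;
                // A stable ORDER BY is required so pages don't shift between queries.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT * FROM big_table ORDER BY id LIMIT ? OFFSET ?")) {
                    ps.setInt(1, PAGE_SIZE);
                    ps.setLong(2, offset);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++;
                            // emit the row downstream here
                        }
                    }
                }
                if (rows == 0) break;     // table fully read
                offset += rows;
                saveOffset(conn, offset); // checkpoint so a killed bootstrap can resume
            }
        }
    }

    static long loadOffset(Connection conn) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT last_offset FROM bootstrap_progress WHERE table_name = 'big_table'");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0L;
        }
    }

    static void saveOffset(Connection conn, long offset) throws SQLException {
        // Assumes a row for 'big_table' was seeded in bootstrap_progress beforehand.
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE bootstrap_progress SET last_offset = ? WHERE table_name = 'big_table'")) {
            ps.setLong(1, offset);
            ps.executeUpdate();
        }
    }
}
```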
hi @asimali246, your points in (1) are valid. Using OFFSET/LIMIT isn't really all that great, though; once you're doing something like SELECT * FROM table LIMIT 10000 OFFSET 1000000 (for example), you're asking mysql to read and skip a million rows in order to return ten thousand rows. On the next query you ask mysql to read 1,010,000 rows in order to get the next ten thousand, and so on; the performance of LIMIT/OFFSET gets worse and worse the deeper you page into the table.
A better way to read all the data in a table in chunks is to paginate via an auto-increment or other primary key, something like:
SELECT * FROM table ORDER BY id LIMIT 100000
# get back a result set with a max id of 101011
SELECT * FROM table WHERE id > 101011 ORDER BY id LIMIT 100000
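If it helps, here's a minimal JDBC sketch of that keyset loop (table and column names are hypothetical; this is not Maxwell's actual bootstrap code):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class KeysetBootstrap {
    static final int CHUNK_SIZE = 100_000;

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "maxwell", "secret")) {
            long lastId = Long.MIN_VALUE; // start before any possible id
            while (true) {
                long maxIdInChunk = lastId;
                int rows = 0;
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT * FROM big_table WHERE id > ? ORDER BY id LIMIT ?")) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, CHUNK_SIZE);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++;
                            // rows come back ordered, so this ends as the chunk's max id
                            maxIdInChunk = rs.getLong("id");
                            // emit the row downstream here
                        }
                    }
                }
                if (rows == 0) break;  // no rows past lastId: done
                lastId = maxIdInChunk; // next chunk starts after this id
            }
        }
    }
}
```

Because each query seeks directly into the primary-key index, the cost per chunk stays roughly constant no matter how deep into the table you are, and lastId doubles as a natural resume checkpoint.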
If you'd like to contribute a solution like that, it'd be much appreciated. Thanks!
Hi @osheroff, thanks for your prompt response. I agree with the suggestion to use primary-key pagination instead of LIMIT/OFFSET for faster traversal of the table. It won't work on tables without a primary key, but I think it should suffice for most use cases. I'll try to get this done and raise a PR. Thank you.
When bootstrapping large tables, the following error occurs: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout' on the server
Searching for the error on Google, I found the following solution: https://github.com/ontop/ontop/wiki/WorkingWithMySQL — use cursors at the server side:
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(FETCH_AMOUNT);
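For completeness, here is a self-contained sketch of that cursor approach with MySQL Connector/J (database name, table, and credentials are made up). Note that Connector/J only uses a server-side cursor when useCursorFetch=true is set on the connection URL, which the linked wiki page also describes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CursorFetchExample {
    static final int FETCH_AMOUNT = 50_000;

    public static void main(String[] args) throws SQLException {
        // useCursorFetch=true tells Connector/J to open a server-side cursor
        // and pull rows in batches of setFetchSize() instead of all at once.
        String url = "jdbc:mysql://localhost:3306/shop?useCursorFetch=true";
        try (Connection conn = DriverManager.getConnection(url, "maxwell", "secret");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(FETCH_AMOUNT);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
                while (rs.next()) {
                    // process each row; the driver fetches the next batch as needed
                }
            }
        }
    }
}
```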
This is different from the current Maxwell design, which uses the 'stream the ResultSets one row at a time' approach.
Should the current code be changed?