weecology / retriever

Quickly download, clean up, and install public datasets into a database management system
http://data-retriever.org
Other
306 stars 134 forks source link

Improve insert performance #939

Closed henrykironde closed 6 years ago

henrykironde commented 7 years ago

Add to table function in engine def add_to_table(self, data_source):uses a fixed number of data linesinsert_limit = . This is currently set to 400. For each engine, this number could be increased. We can set the values in the respective engines and initializing the number in respect to the used engine.

badamakashprasad commented 7 years ago

Hi I want to look in this issue.We can set values in the respective engines by adding a variable in the function argument itself. def add_to_table(self, data_source,insert_limit = 400) where default is 400

ethanwhite commented 7 years ago

Hi @badamakashprasad - welcome and thanks for your interesting in helping out with the retriever. A couple of things to consider as you work on this:

  1. We recently merged b25071e5a9896e67dacb6538b0344f20043943d3, which changes the way we handle inserts. We should check and make sure that the new implementation using executemany and parameter binding doesn't automatically handle the previous issues we were having with insert limits. You can probably test this out by increasing insert_limit and testing it on all of the different database management systems. @henrykironde will know more about what database is mostly likely to break and how it will error if this is still an issue.

  2. I think setting an insert_limit member variable in each of the database engines that require it and then calling it in engine.py using self.insert_limit might be a cleaner and easier way to do than adding a new argument to add_to_table. You can see a similar implementation in the commit mentioned above for how we handle place_holder.

Thanks again for getting involved. Let us know what we can do to help as you proceed.

shreyneil commented 6 years ago

@ethanwhite I'll to try to work something out. I think all the engines except init.py would require it?

What are the desired values of insert_limit required for each engine?

henrykironde commented 6 years ago

Depending on the database, for example sqlite https://www.sqlite.org/limits.html

shreyneil commented 6 years ago

Just to be clear, So is the insert limit the number of values in an insert statement or the number of insert statements itself?

if it is the number of values then I guess SQLITE_MAX_COLUMN would work.

henrykironde commented 6 years ago

Thanks for this @shreyneil. The limit is on the number of rows to insert.

Given a multi row insert statement

INSERT INTO tableName
    (column1,column2,column3,column4)
VALUES
    ('value1' , 'value2', 'value3','value4'),
    ('value1' , 'value2', 'value3','value4'),
    ('value1' , 'value2', 'value3','value4');

How many of the rows can you insert

    ('value1' , 'value2', 'value3','value4'),
    ('value1' , 'value2', 'value3','value4'),
    ('value1' , 'value2', 'value3','value4');

SQL insert statements are limited by the size, which could be the number of columns or the rows (Incase of a composite/multi insert statements) or just the total size of bytes.

In our case, we assume the number is more dependent on the rows . So you could try to do some research to see how may rows you could insert.

It would be nice to know how you came up with the numbers or some references.

After success: Please run retriever install [name of engine] vertnet-birds and let us know what happens Note: vertnet-birds is 885mbs(compressed) and 5934807 line/rows

shreyneil commented 6 years ago

For sqlite : https://www.sqlite.org/limits.html For mysql: https://dev.mysql.com/doc/refman/5.7/en/database-count-limit.html For postgresqll: https://www.postgresql.org/message-id/42C3C382.5020108@cineca.it For Msaccess : https://support.office.com/en-us/article/Access-2016-specifications-0cf3c66f-9cf2-4e32-9568-98c1025bb47c For Xml : https://www.ibm.com/support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0001029.html For csvengine: I stuck to 400 due to lack of information about it. For jsonengine was going hand in hand with xml , so i provided the same insert limit.

shreyneil commented 6 years ago

Will run the tests and will get back.

henrykironde commented 6 years ago

@shreyneil, This is a good start. I have found some references too.

Using Bulk Insert Statement example:

Mysql

Let me know if you get the idea, and look for other engine references.

I think in your case, you are considering the length of the statement. But we want to find out the number of row that we can insert at a time when bulk inserting.

( We assume that some magic happens and the length of the query does not go beyond what can be handled by the server)

shreyneil commented 6 years ago

This is great , now i get the proper idea will work out the same for all other engines.

shreyneil commented 6 years ago

Please tell me something if you find something about csv,xml,json and postgresql. The following limits were found for some of the engines: Mysql : 1000 sqlite : 500

henrykironde commented 6 years ago

How have you arrived to these number? They look unreal. Based on the links I sent you, seems that for Mysql, ~1000 rows is a good number. Which I believe could be increased.

Some Rules of Thumb:
* A million rows will have a BTree depth of 3; a trillion rows, 6. That is, the "fan out" is 100, and the 'log' is base 100. (Of course there are extremes of perhaps 2 to 1000 instead of 100.)

* A batched, multi-row INSERT with 100 rows will run 10 times as fast as 100 single-row INSERTs. (The 10x is from real experiences, and it somewhat approximates the 'proportions' given on this page.)

* Don't go past 1000 rows in a batched insert -- especially if there is contention for the table and/or replication is involved.

* Another optimization is to pre-sort the rows to insert (via LOAD DATA or batched INSERT) according to some index, preferably the PRIMARY KEY or some UNIQUE key. This helps the flow of blocks through cache and helps avoid deadlocks. 

We need to find very good sources and test them out.

Then we could use the vertnet-birds dataset to test the speed before and after.

shreyneil commented 6 years ago

Ok will try this out. I'll try running the test for mysql and msaccess for 1000 rows for now as the articles say it is a good number. Will go with 500 for sqlite.

shreyneil commented 6 years ago

@henrykironde I ran the tests and the changes seem fine to work with all the engines except i could not test msaccess as i am using linux environment and for mysql and postgresql i was getting the following error:

retriever install mysql vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...

Couldn't create database (No module named pymysql). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table (No module named pymysql). Trying to continue anyway. No module named pymysql Done!

retriever install postgres vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...

Couldn't create database (No module named psycopg2). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table (No module named psycopg2). Trying to continue anyway. No module named psycopg2 Done!

PS: I used the insert_limit as 1000 for all the engines an even for an insert limit of 400 the command was failing for mysql and postgres engines.

henrykironde commented 6 years ago

No module named psycopg2, No module named pymysql Install pymysql and psycopg2

shreyneil commented 6 years ago

@henrykironde , Sir installed the modules , now getting the following errors: For mysql

$ retriever install mysql vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...

Couldn't create database ((2003, "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)")). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table ((2003, "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)")). Trying to continue anyway. (2003, "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)") Done!

For postgres

$ retriever install postgres vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...

Couldn't create database (could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432? ). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table (could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432? ). Trying to continue anyway. could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432?

Done!

henrykironde commented 6 years ago

Your MySQL server looks like it is not set to port 5432 or may not be running

shreyneil commented 6 years ago

Done, these all seem to work fine with an insert limit of 1000,.

henrykironde commented 6 years ago

okey