Closed henrykironde closed 6 years ago
Hi I want to look in this issue.We can set values in the respective engines by adding a variable in the function argument itself. def add_to_table(self, data_source,insert_limit = 400) where default is 400
Hi @badamakashprasad - welcome and thanks for your interesting in helping out with the retriever. A couple of things to consider as you work on this:
We recently merged b25071e5a9896e67dacb6538b0344f20043943d3, which changes the way we handle inserts. We should check and make sure that the new implementation using executemany
and parameter binding doesn't automatically handle the previous issues we were having with insert limits. You can probably test this out by increasing insert_limit
and testing it on all of the different database management systems. @henrykironde will know more about what database is mostly likely to break and how it will error if this is still an issue.
I think setting an insert_limit
member variable in each of the database engines that require it and then calling it in engine.py
using self.insert_limit
might be a cleaner and easier way to do than adding a new argument to add_to_table
. You can see a similar implementation in the commit mentioned above for how we handle place_holder
.
Thanks again for getting involved. Let us know what we can do to help as you proceed.
@ethanwhite I'll to try to work something out. I think all the engines except init.py would require it?
What are the desired values of insert_limit required for each engine?
Depending on the database, for example sqlite https://www.sqlite.org/limits.html
Just to be clear, So is the insert limit the number of values in an insert statement or the number of insert statements itself?
if it is the number of values then I guess SQLITE_MAX_COLUMN would work.
Thanks for this @shreyneil. The limit is on the number of rows to insert.
Given a multi row insert statement
INSERT INTO tableName
(column1,column2,column3,column4)
VALUES
('value1' , 'value2', 'value3','value4'),
('value1' , 'value2', 'value3','value4'),
('value1' , 'value2', 'value3','value4');
How many of the rows can you insert
('value1' , 'value2', 'value3','value4'),
('value1' , 'value2', 'value3','value4'),
('value1' , 'value2', 'value3','value4');
SQL insert statements are limited by the size, which could be the number of columns or the rows (Incase of a composite/multi insert statements) or just the total size of bytes.
In our case, we assume the number is more dependent on the rows . So you could try to do some research to see how may rows you could insert.
It would be nice to know how you came up with the numbers or some references.
After success:
Please run retriever install [name of engine] vertnet-birds
and let us know what happens
Note: vertnet-birds is 885mbs(compressed) and 5934807 line/rows
For sqlite : https://www.sqlite.org/limits.html For mysql: https://dev.mysql.com/doc/refman/5.7/en/database-count-limit.html For postgresqll: https://www.postgresql.org/message-id/42C3C382.5020108@cineca.it For Msaccess : https://support.office.com/en-us/article/Access-2016-specifications-0cf3c66f-9cf2-4e32-9568-98c1025bb47c For Xml : https://www.ibm.com/support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0001029.html For csvengine: I stuck to 400 due to lack of information about it. For jsonengine was going hand in hand with xml , so i provided the same insert limit.
Will run the tests and will get back.
@shreyneil, This is a good start. I have found some references too.
Using Bulk Insert Statement example:
Mysql
Let me know if you get the idea, and look for other engine references.
I think in your case, you are considering the length of the statement. But we want to find out the number of row that we can insert at a time when bulk inserting.
( We assume that some magic happens and the length of the query does not go beyond what can be handled by the server)
This is great , now i get the proper idea will work out the same for all other engines.
Please tell me something if you find something about csv,xml,json and postgresql. The following limits were found for some of the engines: Mysql : 1000 sqlite : 500
How have you arrived to these number? They look unreal. Based on the links I sent you, seems that for Mysql, ~1000 rows is a good number. Which I believe could be increased.
Some Rules of Thumb:
* A million rows will have a BTree depth of 3; a trillion rows, 6. That is, the "fan out" is 100, and the 'log' is base 100. (Of course there are extremes of perhaps 2 to 1000 instead of 100.)
* A batched, multi-row INSERT with 100 rows will run 10 times as fast as 100 single-row INSERTs. (The 10x is from real experiences, and it somewhat approximates the 'proportions' given on this page.)
* Don't go past 1000 rows in a batched insert -- especially if there is contention for the table and/or replication is involved.
* Another optimization is to pre-sort the rows to insert (via LOAD DATA or batched INSERT) according to some index, preferably the PRIMARY KEY or some UNIQUE key. This helps the flow of blocks through cache and helps avoid deadlocks.
We need to find very good sources and test them out.
Then we could use the vertnet-birds dataset to test the speed before and after.
Ok will try this out. I'll try running the test for mysql and msaccess for 1000 rows for now as the articles say it is a good number. Will go with 500 for sqlite.
@henrykironde I ran the tests and the changes seem fine to work with all the engines except i could not test msaccess as i am using linux environment and for mysql and postgresql i was getting the following error:
retriever install mysql vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...
Couldn't create database (No module named pymysql). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table (No module named pymysql). Trying to continue anyway. No module named pymysql Done!
retriever install postgres vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...
Couldn't create database (No module named psycopg2). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table (No module named psycopg2). Trying to continue anyway. No module named psycopg2 Done!
PS: I used the insert_limit as 1000 for all the engines an even for an insert limit of 400 the command was failing for mysql and postgres engines.
No module named psycopg2, No module named pymysql Install pymysql and psycopg2
@henrykironde , Sir installed the modules , now getting the following errors: For mysql
$ retriever install mysql vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...
Couldn't create database ((2003, "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)")). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table ((2003, "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)")). Trying to continue anyway. (2003, "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)") Done!
For postgres
$ retriever install postgres vertnet-birds => Installing vertnet-birds Creating database vertnet_birds...
Couldn't create database (could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432? ). Trying to continue anyway. Creating table vertnet_birds.birds... Couldn't create table (could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432? ). Trying to continue anyway. could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432?
Done!
Your MySQL server looks like it is not set to port 5432 or may not be running
Done, these all seem to work fine with an insert limit of 1000,.
okey
Add to table function in engine
def add_to_table(self, data_source):
uses a fixed number of data linesinsert_limit =
. This is currently set to 400. For each engine, this number could be increased. We can set the values in the respective engines and initializing the number in respect to the used engine.