worldveil / dejavu

Audio fingerprinting and recognition in Python
MIT License
6.36k stars 1.43k forks source link

significant performance gain when fingerprinting #251

Open craigjmidwinter opened 3 years ago

craigjmidwinter commented 3 years ago

So I did some poking around while getting this set up using a Postgres RDS instance and I noticed that the biggest bottleneck when fingerprinting an entire directory was inserting the fingerprints into the database. The current implementation using cur.executemany performs individual insert operations for each fingerprint, but a bulk insert would be way more performant. I'm using postgres so I just patched my local version to use psycopg2.extras.execute_values to test this and haven't looked at the mysql implementation, but if it's doing the same thing, then the same gain could be had there.

The results from an empty database, to a completed directory fingerprint which contained two albums and resulted in just over 2.8 million records in the fingerprint table were

________________________________________________________
Executed in  327.66 secs   fish           external 
   usr time  219.85 secs    0.00 micros  219.85 secs 
   sys time  104.46 secs  1315.00 micros  104.46 secs 

I didn't get a benchmark on the current implementation because it was taking hours to run the inserts. The small change that I made in my code was in common_database.py around line 170.

        # I was using 1000 for batch_size, but depending on your infrastructure you might want to fine tune it. Maybe 10000 would even be ok, or maybe only 100
        with self.cursor() as cur:
            execute_values(cur, self.INSERT_FINGERPRINT_V2, values, "(%s, decode(%s, 'hex'), %s)", batch_size) 
            # for index in range(0, len(hashes), batch_size):
            #     cur.executemany(self.INSERT_FINGERPRINT, values[index: index + batch_size])

INSERT_FINGERPRINT_V2 looks like:

    INSERT_FINGERPRINT_V2 = f"""
        INSERT INTO "{FINGERPRINTS_TABLENAME}" (
                "{FIELD_SONG_ID}"
            ,   "{FIELD_HASH}"
            ,   "{FIELD_OFFSET}")
        VALUES %s ON CONFLICT DO NOTHING;

If I've got time, I'll try to polish up that solution a bit and make a PR, but I don't know how soon I'll have a chance to do that, so if someone else wants to go for it, have at 'er