worldveil / dejavu

Audio fingerprinting and recognition in Python
MIT License

Dejavu on python 3.6.6 #205

Closed mauricio-repetto closed 4 years ago

mauricio-repetto commented 4 years ago

Since support for python 2.7 is coming to an end, I've decided to migrate the code to python 3.6.6. In the process I've refactored the solution a bit to make it simpler (at least the sql part). I've also refactored some code to improve it by using numpy in a better way and, in places, removing unnecessary steps when working with lists.

I've also updated all of the libraries being used, that is, to the latest versions available at this moment.

The solution now works with mysql 8 by default.

UPDATED:

Now I've added support for Postgresql as well.

Dongbox commented 4 years ago

win10, python3.6.5: ModuleNotFoundError: No module named 'dejavu.config'

Dongbox commented 4 years ago

I guess you're missing the __init__.py for the config.

mauricio-repetto commented 4 years ago

Hi @topbobo, sorry, I missed it. I've now added an __init__.py in the config folder, thanks for pointing that out.

gitteraz commented 4 years ago

Hi,

I have installed your version without any problems, but: when running example_script.py, on the recognition-via-mp3-file part, both .wav files are recognized. Why is that?

mauricio-repetto commented 4 years ago

@gitteraz while it's true that both files are being recognized, the second one has a pretty low confidence, so you should just discard it, or set the TOPN parameter in the settings file to return just one result, which will always be the most probable one (since it has the most hashes matched against the fingerprinted data in the db). I added this feature for other use cases where you need to get or use several results, not just the most "confident" one.

I'm also preparing more changes to add to this pr, like splitting the time in the results to get a better idea of the impact of changing several parameters in the config file, and thus changing the returned json a little bit. I will also add a true measure of confidence based on the hashes matched over the total hashes.
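
As a sketch of that planned confidence measure (the field names are my assumptions, mirroring the json fields discussed later in this thread):

# Hedged sketch of a "true" confidence measure: the fraction of the
# query's hashes that matched fingerprints in the db. The field names
# hashes_matched_in_input / input_total_hashes are assumptions.
def input_confidence(hashes_matched_in_input: int, input_total_hashes: int) -> float:
    return round(hashes_matched_in_input / input_total_hashes, 2)

print(input_confidence(1820, 2084))  # -> 0.87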

mauricio-repetto commented 4 years ago

@JustSomeHack I've found a bug in the align_matches method: the code assumes there won't be any collisions between hashes and offsets, which is not true at all. When a collision happens, only one offset is kept, and several matches are therefore wrongly counted as part of that offset. A second bug I've found is that the code behaves differently when it fingerprints songs versus when it queries them. In the first case it puts all hashes and offsets of both channels into a set (which gets rid of duplicates), but when it queries a song it handles both channels separately, producing duplicate matches (you get more hashes matched than you actually have in the database). I also played around with some conf parameters, which cut the query times in half without hurting the accuracy.
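
A minimal sketch of the fix for that second bug (unique_hashes and fingerprint_fn are illustrative names of mine, not this branch's API): collect the (hash, offset) pairs of all channels into a single set before querying, just like the fingerprinting path already does.

# Hedged sketch: dedupe (hash, offset) pairs across channels at query
# time, mirroring what fingerprinting already does.
def unique_hashes(channels, fingerprint_fn):
    pairs = set()
    for channel in channels:
        # the set drops the cross-channel duplicates that previously
        # inflated the match counts
        pairs |= set(fingerprint_fn(channel))
    return pairs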

I will update this pr soon with those changes, together with more robust docstrings for several methods. So I will probably need your review once more, thanks for your approval!

mauricio-repetto commented 4 years ago

Well, I think I have addressed most of the things I was concerned about, so I hope this is helpful to anyone who needs it. Please let me know about any doubts you have regarding the changes made here.

Thanks, Mauricio Repetto.

mauricio-repetto commented 4 years ago

Guys, I've added a new change to the maximum filter mask that the current dejavu code uses; it helped me reduce execution times by ~3x without hurting prediction accuracy in all my local tests. If you are not sure about this change, you can set the CONNECTIVITY_MASK parameter back to 1 in settings.py.
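
For reference, a sketch of how that parameter feeds the peak-finding maximum filter (paraphrased from dejavu's fingerprinting code; PEAK_NEIGHBORHOOD_SIZE and the exact wiring are approximations of this branch's settings.py):

import numpy as np
from scipy.ndimage import generate_binary_structure, iterate_structure, maximum_filter

CONNECTIVITY_MASK = 2        # this branch's new default; set back to 1 if unsure
PEAK_NEIGHBORHOOD_SIZE = 10  # neighborhood half-width, in spectrogram cells

def local_maxima(spectrogram: np.ndarray) -> np.ndarray:
    # connectivity 1 builds a diamond-shaped footprint, 2 a full square;
    # scipy can run the all-ones square footprint as a faster separable
    # filter, which is where the speedup comes from
    struct = generate_binary_structure(2, CONNECTIVITY_MASK)
    neighborhood = iterate_structure(struct, PEAK_NEIGHBORHOOD_SIZE)
    return maximum_filter(spectrogram, footprint=neighborhood) == spectrogram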

omertoptas commented 4 years ago

First of all, thanks for this branch. Please add a guide to the Readme file on how to use Postgresql and initialize it for use in the code; I really want to try your branch (I am going to finish testing tonight). Also, if you have tested this branch with a large database (I mean at least 1000 songs), I would really like to see the results, because with Dejavu's master branch I have created a database of about 900 songs, it has 150 million fingerprint hashes, and recognizing from the microphone takes unacceptably long (for example, for 5 seconds of recording from the microphone, the program returns in 30 seconds).

mauricio-repetto commented 4 years ago

@omertoptas hi, thanks for your comments. In the case of postgresql it's just the same as with mysql: you have to create a database in postgres and give the proper credentials to the dejavu instance, in this case switching the database type from mysql to postgres.

For example:

config = {
    "database": {
        "host": "127.0.0.1",
        "user": "postgres",
        "password": "yourpass",
        "database": "dejavu"
    },
    "database_type": "postgres"
}

The code will do the rest.
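
And then, as a minimal usage sketch (the import path follows this branch's layout):

from dejavu import Dejavu

djv = Dejavu(config)                        # connects and creates the tables if needed
djv.fingerprint_directory("mp3", [".mp3"])  # fingerprint a folder of audio files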

Regarding tests, I've just checked: I fingerprinted 1000 audios of about 20 seconds each, and it took 7 minutes to fingerprint all of them, with 5,049,219 hashes in the db. Then I sent a couple of audios to recognize, and on average the process took 3 seconds. I've not used the microphone recognizer yet, but I guess times should be similar.

gitteraz commented 4 years ago

Hi @omertoptas, I have used this branch version (prior to the latest fixes) to fingerprint 1.8K songs, 4 min on average each, and it takes me about 25 seconds to recognize a song. I have identified that most of the time in the recognition process is spent in the return_matches method located in common_database.py. I will try the latest fixes by @mauriciorepetto and get back with results.

mauricio-repetto commented 4 years ago

Great, thanks! :) By the way @gitteraz, I'm not sure how old your copy of the code is (I've introduced several changes over the last few days), but you may need to drop the tables, since the hashes changed due to config modifications.

omertoptas commented 4 years ago

Hi @gitteraz thanks for the answer, it gave me a good idea about the process.

> I have identified that most of the time in the recognition process is spent in the return_matches method located in common_database.py.

I am pretty sure that you are right; most of the time is spent retrieving data from the database. I faced this problem while using @worldveil 's project: once you have completed fingerprinting and started recognizing from mic or from file, most of the time is spent fetching data from the database. Here is my stackoverflow question investigating the issue, with more details about the problem: https://stackoverflow.com/questions/58058304/python-mysqldb-execute-took-so-much-time

@mauriciorepetto hi, thanks for the reply. I hope I can help you improve your branch; today I used it, and the first thing I noticed is that your fingerprinting process is much faster than the old one.

Here is the result of inserting 100 songs of approximately 3.5 min on average: [screenshot: mauricio_insert1] Most of the time is spent inserting the fingerprint hashes into the database; fingerprinting the songs is done in seconds.

Here is the result of inserting 100 songs of approximately 3.3 min on average: [screenshot: mauricio_insert2]

Here is the result of inserting 200 songs of approximately 3.7 min on average: [screenshot: mauricio_insert3]

Here is the result of inserting 100 songs of approximately 3.5 min on average: [screenshot: mauricio_insert4]

Before my test results, here are my database specs: [screenshots: mauricio_database_1, mauricio_database_2, mauricio_database_3] In total there are 502 songs in the database, which holds 63,754,555 fingerprints; its size is 6.7GB, of which 2.2GB is index. Also, this is a MySQL database.

Firstly, I changed the TOPN value in settings.py to increase the number of returned results. Then I started testing by recording 5 seconds from the microphone; on most of the trials the song I was trying to recognize via mic did not show up in the top 10 results: [screenshot: mauricio_result] In this image the song I was trying to recognize is returned as the 3rd result, where the results are ordered by hashes_matched_in_input. The return time was 16 secs, of which 5 secs is the recording time, so I can say the result was returned in 11 secs, and most of that time was spent fetching the data from the database. I know you have changed the align matches algorithm, but there could be a problem. Could you please check? I cannot completely understand that part of your code.

> Regarding tests, I've just checked: I fingerprinted 1000 audios of about 20 seconds each, and it took 7 minutes to fingerprint all of them, with 5,049,219 hashes in the db. Then I sent a couple of audios to recognize, and on average the process took 3 seconds.

These are great results, but I would like to see your hardware specs too; for example, do you use an SSD or HDD to store your database? Also, did you reboot your PC before testing the code? While inserting data into the database, MySQL uses RAM to store values temporarily, so it can return data almost instantly if you do not reset your RAM. Thanks for the help, I would love to contribute to your code.

mauricio-repetto commented 4 years ago

@omertoptas sure, in my case all my tests were run with these characteristics: [screenshot of machine specs]

The ssd is probably a boost here, for sure. Regarding times, because of mysql's cache I'm not seeing much difference after rebooting my machine.

omertoptas commented 4 years ago

> The ssd is probably a boost here, for sure. Regarding times, because of mysql's cache I'm not seeing much difference after rebooting my machine.

Thank you for the answer.

> Could you please check? I cannot completely understand that part of your code.

What do you think about this part?

mauricio-repetto commented 4 years ago

Sorry @omertoptas, I forgot that part... well, in the previous code you had this:

# align by diffs
diff_counter = {}
largest = 0
largest_count = 0
song_id = -1
for tup in matches:
    sid, diff = tup
    if diff not in diff_counter:
        diff_counter[diff] = {}
    if sid not in diff_counter[diff]:
        diff_counter[diff][sid] = 0
    diff_counter[diff][sid] += 1

    if diff_counter[diff][sid] > largest_count:
        largest = diff
        largest_count = diff_counter[diff][sid]
        song_id = sid

So here above, it uses a dictionary to store time and song, and for each (time, song) pair it counts how many times it appears; while iterating, it keeps a maximum variable (largest_count) to save the time and song that repeated the most.

In order to return more than one result, I just calculate the count of each time for each song. What I used for that is the itertools.groupby function, which requires the input to be sorted the same way you are going to group:

from itertools import groupby

# count offset occurrences per song and keep only the maximum ones.
sorted_matches = sorted(matches, key=lambda m: (m[0], m[1]))
counts = [(*key, len(list(group))) for key, group in groupby(sorted_matches, key=lambda m: (m[0], m[1]))]

Where m[0] is the song_id and m[1] is the time. Also notice that I'm using the length of the group because that gives me the number of times it is repeated. If you are familiar with sql, this is just a

SELECT song_id, time, count(*) FROM <all_matches> GROUP BY song_id, time;

Once we have all the counts for each pair of song_id and time, I just keep the maximum count for each song_id (with its corresponding time; again, the one that has that maximum count). In this case I don't need to sort the results before the group by, because the list is already sorted by song_id from the previous step. To get the maximum count I group by song_id (count[0]) and take the max based on the count attribute (g[2]). Then the results are sorted in descending order to put the songs with more matches first:

songs_matches = sorted(
    [max(list(group), key=lambda g: g[2]) for key, group in groupby(counts, key=lambda count: count[0])],
    key=lambda count: count[2], reverse=True
)

and that's it :) is it clearer now?

One thing about this particular code, and it was made on purpose, is that a song will appear only once on the final results list.
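
To make it concrete, here is a toy run of the two steps above on made-up matches:

from itertools import groupby

matches = [(1, 10), (1, 10), (1, 12), (2, 5), (2, 5), (2, 5)]  # (song_id, offset_diff)

sorted_matches = sorted(matches, key=lambda m: (m[0], m[1]))
counts = [(*key, len(list(group)))
          for key, group in groupby(sorted_matches, key=lambda m: (m[0], m[1]))]
# counts == [(1, 10, 2), (1, 12, 1), (2, 5, 3)]

songs_matches = sorted(
    [max(list(group), key=lambda g: g[2])
     for key, group in groupby(counts, key=lambda count: count[0])],
    key=lambda count: count[2], reverse=True)
# songs_matches == [(2, 5, 3), (1, 10, 2)] -> song 2 wins, with 3 aligned matches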

gitteraz commented 4 years ago

@mauriciorepetto I have used the most updated version. When working with few or short songs, everything runs fast all the time. But with 1.8K songs (growing every day), about 4 min on average each, the database query in the return_matches function takes about 20 to 60 seconds.

I am hosting it on Digital Ocean with 2 vCPUs 4GB Memory / 25GB SSD Disk Plus a 1TB High Availability Volume for the database.

I am not a database expert maybe the following could give you a hint on how to move on the right direction:

When I first query a song, it takes about 20-60 sec. Every time I query the same song in the next few minutes, it just takes 4-5 sec. Is this something to do with indexes? How can I make this behavior last forever? How can we improve the database query part?

Thanks

mauricio-repetto commented 4 years ago

@gitteraz the thing is, the more hashes you have, the more time it will take, and things will get worse as you add more songs... remember that what dejavu does is search a certain number of hashes (the ones generated for the input song) against ALL the hashes in the database. So there is not much to do here; SQL databases are not strong at query operations over large amounts of data. You may try sql databases with an in-memory implementation (but this will of course be more expensive because of hardware resources). This is probably a great opportunity to move things to a NoSQL database like mongodb, or a search engine like elasticsearch, where you can improve processing times through distributed computing.

Regarding your question about why times are better the second time you run it: most databases have a cache where they keep the queried pages in memory, so when a second query comes, the data is already in memory and there is no need to look for it on disk.

So, as I said, if you are suffering this kind of slowness it is because you have passed the point where the database can handle such a query (I'm talking about the match against everything). The options are to move to an in-memory implementation (this won't scale past some point either, and it is expensive) or use a distributed database like mongodb or elasticsearch (which is actually a search engine rather than a database), or anything else that lets you distribute the computation and scale horizontally.
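
For instance, a rough sketch of what the lookup could become on Elasticsearch (the index layout and field names are hypothetical, nothing here is part of this pr; client calls follow the elasticsearch-py API):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def return_matches_es(hashes):
    # hashes: iterable of (hash_hex, query_offset) pairs from the input audio
    offsets = dict(hashes)
    res = es.search(index="fingerprints", size=10000,
                    query={"terms": {"hash": list(offsets)}})
    for hit in res["hits"]["hits"]:
        doc = hit["_source"]
        # yield the same (song_id, db_offset - query_offset) pairs
        # that align_matches consumes
        yield doc["song_id"], doc["offset"] - offsets[doc["hash"]]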

gitteraz commented 4 years ago

@mauriciorepetto thank you so much for the clarifications. I think I can implement elasticsearch without any problems; I will come back with my results. Thank you.

omertoptas commented 4 years ago

> that's it :) is it clearer now?

Thanks for the answer. Firstly, my problem was getting wrong results while recognizing a song via microphone: the recorded song was not returned in the first 10 songs (I sorted them by their hashes_matched_in_input value). I have one more question: do you think your way of finding and aligning hash matches could be used to measure the similarity of two songs? Or could it help to find songs that are similar to a given one in a big song database?

Can we conclude that if two songs have a lot of matching hashes they could be similar songs? I mean, maybe their genre or melody is the same or similar. What do you think?

mauricio-repetto commented 4 years ago

@omertoptas actually that's a good question... but I do not know the answer, and if you ask me, I think it's no, because what dejavu does is look for songs that have a similar peak footprint. I don't think this approach lets you implement some sort of recommendation system by genre, for example, because what you're always looking for here is songs that sound "similar" from a purely acoustic point of view. But again, I'm not a subject matter expert here, so I could be wrong.

outhud commented 4 years ago

Thanks very much for the improvements here @mauriciorepetto!

I have a project to identify about 100 unknown tracks from a database of about 10,000 tracks. The fingerprinting operation using your changes and postgresql is running much faster and more reliably.

I will try to make a change also to add a recognize_directory to run the recognition of multiple files in parallel. I don't think this is currently implemented.
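
A possible sketch of that feature (recognize_directory and its internals are my assumptions, not existing code; the recognizer import path follows this branch's layout):

import os
from multiprocessing import Pool

from dejavu import Dejavu
from dejavu.logic.recognizer.file_recognizer import FileRecognizer

def _recognize_one(args):
    config, file_path = args
    djv = Dejavu(config)  # each worker process needs its own db connection
    return file_path, djv.recognize(FileRecognizer, file_path)

def recognize_directory(config, path, extension=".mp3", nprocesses=4):
    files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith(extension)]
    with Pool(nprocesses) as pool:
        return dict(pool.map(_recognize_one, [(config, f) for f in files]))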

mauricio-repetto commented 4 years ago

@outhud you are welcome. And yes, you are right, there is no implementation for parallel recognition; that would be a great feature, thanks in advance.

outhud commented 4 years ago

I have been testing with different values of nprocess when calling fingerprint_directory.

With nprocess = 1, it takes about 25 minutes to fingerprint 50 tracks. About 30 seconds per track. (4 minutes of audio in each track on average).

But when I increase nprocess to 2, I can see that 2 cores are being used in htop, but it still takes about 25 minutes to fingerprint 50 tracks, not about half that time as I would have expected.

As I increase nprocess, I see more cores being utilized, but it is not improving the fingerprinting time for a directory of 50 tracks. I'm using an 8-core, 16-thread processor. Does anyone else see the same?

mauricio-repetto commented 4 years ago

@outhud in my case I didn't try different nprocess values, I just used the defaults, but the tests you made are good input.

endybits commented 4 years ago

@mauriciorepetto, I want to implement this branch in my django project, but I have problems with the modules when I add it to my django project. I don't know if it is because of django or if I must take some additional action regarding dejavu on python 3.6.6.

mauricio-repetto commented 4 years ago

Hi @endyleon, sorry I've never worked with django before but what issues are you facing?

endybits commented 4 years ago

It reads the file, but there's an error. When the code is executed in django it errors out, but if I execute it alone it works fine.

mauricio-repetto commented 4 years ago

@endyleon and what is the error saying exactly?

endybits commented 4 years ago

"No module named dejavu.logic" ... Don't import it.

mauricio-repetto commented 4 years ago

I thought it could be a missing __init__.py file, but it is there for the logic folder, so I guess it must be something else; probably you are missing some kind of registration on django's side, or you have a bad path to the module. Sorry I can't help more :(

endybits commented 4 years ago

@mauriciorepetto Before this, I verified that all the folders had the __init__.py. Could I install pydejavu in the environment? How do I install it for python 3?

endybits commented 4 years ago

If you have a chance review this https://stackoverflow.com/questions/59231592/when-i-run-my-script-all-is-fine-but-when-i-import-it-in-a-django-project-it-ha/59243991#59243991

mauricio-repetto commented 4 years ago

@endyleon in my case I never tried to install it as a regular module, since I just need to run the script. I hope you can find help with django! I took a look at your question on stackoverflow but I have no clue.

endybits commented 4 years ago

Hi... I already managed to run the scripts inside the django server. If someone runs into the same problem, I solved it by putting the full path of the application in each import of dejavu. For example, for the following folders:

myappdjango/
    /dejavu

The import of the dejavu module in the example script would be like this: from myappdjango.dejavu import Dejavu

This should be done on all imports within each script in the dejavu folder. Also go to /dejavu/config/settings.py and do the same in this section:

DATABASES = {
    'mysql': ("myappdjango.dejavu.database_handler.mysql_database", "MySQLDatabase"),
    'postgres': ("myappdjango.dejavu.database_handler.postgres_database", "PostgreSQLDatabase")
}

and you're ready!!!

mauricio-repetto commented 4 years ago

Awesome @endyleon! thanks for sharing your solution :raised_hands:

rambaro commented 4 years ago

hello, I want to share something that works for me. It's not an elegant solution, but I'm new to the programming world. What I did was use the DB MEMORY engine only for the fingerprints table; then I make a backup of the table, and when I restart the computer I reload the data into the table with a sql script. The queries used to take 100 to 150 seconds; now they only take between 0.02 and 0.5. The size of the table in memory is 2GB. I think there is another way to store the table in memory, with memcache, but I don't know how to modify the queries to look things up there. Does someone know how to do it, or whether it can't be done? Sorry for my bad English.

omertoptas commented 4 years ago

> hello, I want to share something that works for me. It's not an elegant solution, but I'm new to the programming world. What I did was use the DB MEMORY engine only for the fingerprints table; then I make a backup of the table, and when I restart the computer I reload the data into the table with a sql script. The queries used to take 100 to 150 seconds; now they only take between 0.02 and 0.5. The size of the table in memory is 2GB.

DB MEMORY is a good way to accelerate db-related jobs; however, if you have a large database of fingerprints (fingerprints of 1000 or more songs) then you will not be able to use DB MEMORY, since its size will exceed the RAM capacity.

mauricio-repetto commented 4 years ago

@rambaro thanks for your comments. As @omertoptas says, in-memory databases are a good choice if you have a relatively small database; once it starts growing, the in-memory option turns out to be expensive because of the hardware requirements. We already had a discussion about this earlier in the thread; in my opinion the best option would be an implementation on a NoSQL db, but of course it would require some extra effort to migrate the current logic.

endybits commented 4 years ago

Hello again. I have two questions to ask.

  1. If I query a song that is not in the database, dejavu returns a match to one of the existing ones. That is, it is wrong. How can I tell that the song it returns is wrong?
  2. In which file can I grab the output (json) so I can manipulate it and adapt it to the requirements of my app?

mauricio-repetto commented 4 years ago

@endyleon,

  1. You can use the number of matched hashes as a measure; you may want to set a threshold based on it (see the sketch below).
  2. You will find that most of the json generation logic is within FileRecognizer.py and the __init__.py of the dejavu module.
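
For instance, a minimal sketch of such a threshold (MIN_HASHES is a hypothetical cutoff you would tune on your own data; the result keys mirror the json discussed earlier in this thread):

from dejavu import Dejavu
from dejavu.logic.recognizer.file_recognizer import FileRecognizer

MIN_HASHES = 50  # hypothetical cutoff; tune against known in/out-of-db queries

djv = Dejavu(config)  # config as in the postgres example earlier in the thread
matches = djv.recognize(FileRecognizer, "query.mp3")
confident = [r for r in matches["results"]
             if r["hashes_matched_in_input"] >= MIN_HASHES]
if not confident:
    print("No reliable match; the song is probably not in the database.")
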
rambaro commented 4 years ago

@mauriciorepetto and @omertoptas, you are right. I will try to migrate to a NoSQL db, although I don't know much about it; it will be good practice for a beginner.

renegadeandy commented 4 years ago

I am trying this - when I run get_num_fingerprints, I get this error:

    print(djv.db.get_num_fingerprints())
  File "/usr/local/lib/python3.7/site-packages/dejavu/base_classes/common_database.py", line 76, in get_num_fingerprints
    count = cur.fetchone()[0] if cur.rowcount != 0 else 0
  File "/usr/local/lib/python3.7/site-packages/dejavu/database_handler/mysql_database.py", line 196, in __exit__
    self.cursor.close()
  File "/usr/local/lib/python3.7/site-packages/mysql/connector/cursor_cext.py", line 402, in close
    self._cnx.handle_unread_result()
  File "/usr/local/lib/python3.7/site-packages/mysql/connector/connection_cext.py", line 695, in handle_unread_result
    raise errors.InternalError("Unread result found")
mysql.connector.errors.InternalError: Unread result found

What may cause this?

I actually managed to code what I consider a fix by changing line 189 of mysql_database.py to:

self.cursor = self.conn.cursor(dictionary=self.dictionary, buffered=True)

mauricio-repetto commented 4 years ago

Hi @renegadeandy! I'll check the error later and get back to you, thanks also for a possible solution.

mauricio-repetto commented 4 years ago

@renegadeandy well, I checked, and for some odd reason the rowcount of the query is 0, which causes the else branch to execute; because of that no results are fetched, and the connector raises an Unread result found exception on cursor close. There is no need to set the dictionary parameter; the buffered param is enough, since buffered=True makes the connector fetch the whole result set immediately, so the cursor can be closed without unread rows pending.

I've already pushed this change (and the same for other places where the rowcount is used). Thanks for pointing this out.

zaptrem commented 4 years ago

@gitteraz Did you make any progress implementing this in elasticsearch?

Coltin-dev commented 4 years ago

Has anyone had any issues installing pyAudio with python3.6? I can post the error if anyone would like to help! It looks like I can install for python2.7, but not for 3.6

mauricio-repetto commented 4 years ago

@Coltin-dev hi! No, I did not have any trouble; you are installing 0.2.11, right?

Coltin-dev commented 4 years ago

> @Coltin-dev hi! No, I did not have any trouble; you are installing 0.2.11, right?

If anyone else runs into this issue, it ended up being a simple fix. When attempting to install pyAudio I got an error that, in part, said it was missing the header file Python.h. Installing python-dev did the trick.

Here's the link I used: https://www.cyberciti.biz/faq/debian-ubuntu-linux-python-h-file-not-found-error-solution/

worldveil commented 4 years ago

@mauriciorepetto this is a truly wonderful new branch! I have not been the best maintainer all these years - it was just a fun proof of concept at the start, and so many people using it now is always amazing to me. I don't always check in - but having tested it, this is a wonderful addition.