worldveil / dejavu

Audio fingerprinting and recognition in Python
MIT License

How does alignment work? #162

Closed pk97 closed 4 years ago

pk97 commented 5 years ago

Hi, Dejavu is a great project, but I am unable to understand how it calculates the relative offset for a segment of sound. According to the documentation, the following formula is used:

difference = database offset from original track - sample offset from recording

To my understanding, the offset is the difference in time between the actual start of the song and the start of the segment of the song. I know something is wrong with my understanding.

How is the relative offset calculated?

jean72human commented 5 years ago

Hello @pk97 Your understanding is quite right. The offset difference is calculated for each pair of matching fingerprints. Each fingerprint has its own offset, which is the time at which it was found. A fingerprint from the segment of the song is matched to one in the database that has exactly the same hash value, and then their times are subtracted. So the offset difference is the difference between the time at which a fingerprint was found in the segment of the song and the time its matching fingerprint was found in the actual song.

pk97 commented 5 years ago

@jean72human Thanks for the reply.

From mic with 5 seconds we recognized: {'song_id': 1, 'song_name': 'Brad-Sucks--Total-Breakdown', 'file_sha1': '02A83F248EFDA76A46C8B2AC97798D2CE9BC1FBE', 'confidence': 32, 'offset_seconds': 36.50177, 'offset': 786}

Above is a sample output of Dejavu.

To my understanding, 'offset_seconds' means the time corresponding to a particular fingerprint of the song which got matched to a fingerprint of the sample input given to Dejavu.

'offset': the relative fingerprint number.

Is this correct?

Note: I tried to verify this by playing the sample input on my phone and noting the time. The output of offset_seconds varies by as much as 7 seconds.

jean72human commented 5 years ago

The offset_seconds is actually an offset difference: the difference between the time corresponding to a particular fingerprint of the song and the time corresponding to its matching fingerprint in the sample input given to Dejavu.

Let's take an example. You have a song A in the database, and you are recognizing a recording by comparing it to song A. In song A, between the 8th and the 15th millisecond, there are 5 fingerprints found at times 8, 10, 11, 14, 15 (in milliseconds). In the recording, those same fingerprints were found, but at times 0, 2, 3, 6, 7. The hash values are the same, so they match; their times differ because the recording does not start from the beginning of song A. But since it is the same song, the sequence is the same, so for each matching peak the difference in offsets is the same. For the first peak: 8 - 0 = 8. For the second peak: 10 - 2 = 8. And for all the following peaks it is still 8.

That value, 8, is the time in song A from which the system started picking up the recording, meaning the recording starts from the 8th millisecond of song A. That is the offset difference, which is then converted to seconds and displayed as offset_seconds. Hope this example helps.
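The subtraction-and-vote idea in that example can be sketched in a few lines of Python. The hash names and offset tables below are made up to mirror the example; this is not Dejavu's actual matching code:

```python
from collections import Counter

# Made-up hash -> offset tables mirroring the example above:
# the times at which each hash was found in database song A vs. the recording.
db_offsets = {"h1": 8, "h2": 10, "h3": 11, "h4": 14, "h5": 15}
rec_offsets = {"h1": 0, "h2": 2, "h3": 3, "h4": 6, "h5": 7}

# Every matching hash votes for an offset difference
# (database offset - recording offset).
votes = Counter(
    db_offsets[h] - rec_offsets[h]
    for h in rec_offsets
    if h in db_offsets
)

# The most common difference is the alignment; its count is the number
# of fingerprints that agree on it (roughly, the match confidence).
best_diff, n_matches = votes.most_common(1)[0]
print(best_diff, n_matches)  # -> 8 5
```

With real audio some hashes collide and vote for other differences, which is why the most common difference is taken rather than requiring every pair to agree.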

alexanderkladov commented 5 years ago

@jean72human Thanks for your explanation. You seem to understand the way this works very well. So I was wondering whether you might be able to help me understand a few things better as well:

  1. Based on what you said, in order to find the exact timestamp of the match relative to the fingerprinted song, we simply need to refer to the offset variable, correct? And since it's represented in milliseconds, we just need to divide it by 1000 to get the seconds. For example, in the sample below, the match happened 83.306 seconds into the song (or 1m 23.306s). Is that correct?

     {'song_id': 211, 'song_name': 'Aerosmith - Walk This Way', 'confidence': 9, 'offset': 83306, 'offset_seconds': 1740.92539, 'file_sha1': '3BA7DAF426916E8FDB7C1CF37F64F0EBEF5F1530', 'match_time': 0.2773129940032959}

  2. Do you know how to fine-tune the fingerprinting to work with extremely small samples? For example, I have a collection of songs, each at least a few minutes long, and I want to be able to find them from snippets as short as 200-300 ms. Is that possible? I am OK with sacrificing processing speed and storage to achieve this, since I understand that I will need millions of fingerprints per song.
jean72human commented 5 years ago

Hello @alexanderkladov

  1. The offset itself is not in milliseconds. Its unit depends on the window size, the sampling rate, and the overlap ratio used when generating the fingerprints. The conversion to seconds has already been done for you, and that is what gives you the offset_seconds value. So offset and offset_seconds are the same thing, just that offset_seconds is in seconds.
  2. I don't really know, but my guess would be to increase the number of fingerprints generated, so that even shorter audio produces enough fingerprints to compare. At the beginning of the fingerprint.py file there is a list of values that can be tuned, with a description of how changing each one affects performance. My suggestion would be to start by reducing the neighborhood size and then do some tests.

```python
######################################################################
# Sampling rate, related to the Nyquist conditions, which affects
# the range of frequencies we can detect.
DEFAULT_FS = 44100

######################################################################
# Size of the FFT window, affects frequency granularity.
DEFAULT_WINDOW_SIZE = 4096

######################################################################
# Ratio by which each sequential window overlaps the last and the
# next window. Higher overlap will allow a higher granularity of offset
# matching, but potentially more fingerprints.
DEFAULT_OVERLAP_RATIO = 0.5

######################################################################
# Degree to which a fingerprint can be paired with its neighbors --
# higher will cause more fingerprints, but potentially better accuracy.
DEFAULT_FAN_VALUE = 15

######################################################################
# Minimum amplitude in the spectrogram in order to be considered a peak.
# This can be raised to reduce the number of fingerprints, but can
# negatively affect accuracy.
DEFAULT_AMP_MIN = 10

######################################################################
# Number of cells around an amplitude peak in the spectrogram in order
# for Dejavu to consider it a spectral peak. Higher values mean fewer
# fingerprints and faster matching, but can potentially affect accuracy.
PEAK_NEIGHBORHOOD_SIZE = 20

######################################################################
# Thresholds on how close or far fingerprints can be in time in order
# to be paired as a fingerprint. If your max is too low, higher values of
# DEFAULT_FAN_VALUE may not perform as expected.
MIN_HASH_TIME_DELTA = 0
MAX_HASH_TIME_DELTA = 200

######################################################################
# If True, will sort peaks temporally for fingerprinting;
# not sorting will cut down the number of fingerprints, but can
# potentially affect performance.
PEAK_SORT = True

######################################################################
# Number of bits to throw away from the front of the SHA1 hash in the
# fingerprint calculation. The more you throw away, the less storage,
# but potentially higher collisions and misclassifications when
# identifying songs.
FINGERPRINT_REDUCTION = 20
```
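On point 1, the offset-to-seconds conversion can be sketched as follows. The function name here is mine; in dejavu, as I read the source, the multiplication is done inline when building the result dictionary:

```python
DEFAULT_FS = 44100           # sampling rate (Hz)
DEFAULT_WINDOW_SIZE = 4096   # FFT window size (samples)
DEFAULT_OVERLAP_RATIO = 0.5  # overlap between consecutive FFT windows

def offset_to_seconds(offset):
    """Convert a raw offset (a spectrogram frame index) to seconds.

    Note that the conversion multiplies by the overlap ratio; with the
    default ratio of 0.5 this happens to equal the hop fraction
    (1 - overlap), but for other ratios the two diverge, which may be
    worth keeping in mind when tuning DEFAULT_OVERLAP_RATIO.
    """
    return round(offset / DEFAULT_FS * DEFAULT_WINDOW_SIZE * DEFAULT_OVERLAP_RATIO, 5)

# Reproduces the sample output from earlier in this thread:
# {'song_name': 'Brad-Sucks--Total-Breakdown', ..., 'offset': 786,
#  'offset_seconds': 36.50177}
print(offset_to_seconds(786))  # -> 36.50177
```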

alexanderkladov commented 5 years ago

Thanks for the write-up. Unfortunately, it's not 100% clear from the fingerprint.py comments what exactly I need to do to achieve the best results. I have experimented with various configs and received wildly different results, which is quite strange. So far the best results were with:

```python
DEFAULT_FS = 44100
DEFAULT_WINDOW_SIZE = 1024
DEFAULT_OVERLAP_RATIO = 0.9
DEFAULT_FAN_VALUE = 30
DEFAULT_AMP_MIN = 10
PEAK_NEIGHBORHOOD_SIZE = 15
MIN_HASH_TIME_DELTA = 0
MAX_HASH_TIME_DELTA = 200
PEAK_SORT = True
FINGERPRINT_REDUCTION = 20
```

> So offset and offset_seconds are the same thing, just that offset_seconds is in seconds.

How can that be? My fingerprinted tracks are anywhere from 2.5 to 8 minutes long, but the offset_seconds results can sometimes be in the thousands, meaning 30 minutes or more. How is that being calculated? Based on my config above, the unit size should be 0.02089795918 s, or ~21 ms (1 s / 44100 * 1024 * 0.9).

Here is an example of a result I get:

{'song_id': 91, 'song_name': 'Depeche Mode - Enjoy The Silence - Single Mix', 'confidence': 28, 'offset': 61373, 'offset_seconds': 1282.57045, 'file_sha1': 'BAC931E2C1404B8C6C8A665B84746932BABA513E', 'match_time': 2.8039259910583496}

Could you please help me work out where on earth the match happened in that song? The snippet I am feeding it is less than a second long.

jean72human commented 5 years ago

Yes, it is 21 ms, and when multiplied by the offset it gives the same value as offset_seconds. As for why offset_seconds is so big, I don't really have a clear idea; I can only suggest that you reduce the overlap ratio. The last time I tried to tweak these values I would sometimes get results that did not make sense. You can also try the initial values.
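For what it's worth, the reported numbers are internally consistent with that ~21 ms unit. This quick check, using the config values posted above, reproduces the reported offset_seconds:

```python
# Config values from alexanderkladov's comment above.
FS = 44100
WINDOW_SIZE = 1024
OVERLAP_RATIO = 0.9

# Seconds per offset step under dejavu's conversion.
seconds_per_offset = WINDOW_SIZE * OVERLAP_RATIO / FS
print(round(seconds_per_offset, 11))         # -> 0.02089795918 (~21 ms)

# Multiplying by the reported offset reproduces the reported offset_seconds.
print(round(61373 * seconds_per_offset, 5))  # -> 1282.57045
```

One thing that stands out is that the conversion multiplies by the overlap ratio (0.9 here) rather than the hop fraction (1 - 0.9 = 0.1). Under the hop fraction, the same offset would come out to roughly 142.5 s, which would fit inside a 2.5-8 minute track; that discrepancy may be worth investigating.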

alvis-do commented 4 years ago

I am new to this project and I am having the same problem as @alexanderkladov: the offset_seconds is very big while my tracks are 2 to 3 minutes in duration. @worldveil, can you clarify this for me?