worldveil / dejavu

Audio fingerprinting and recognition in Python
MIT License

Recognize TV shows using an HLS playlist #305

Open nathandebalthasar opened 2 months ago

nathandebalthasar commented 2 months ago

Hello, I've been trying to recognize TV shows as well as ads ingested using DejaVu in real time using an HLS playlist. The shows last from a few minutes to hours and the ads generally last for a few dozen seconds.

The main problem lies in the fact that when doing the recognition on a TS segment that should match an audio file ingested by DejaVu, the input_confidence attribute, depending on the length of the segment, is really low, or not close enough to 1.

When using 60-second TS segments, the input confidence value tends towards 0. Often, the value is <= 0.1 using the default settings and can grow to <= 0.2 using these settings (https://github.com/denis-stepanov/advent?tab=readme-ov-file#dejavu-tuning). Using 6-second segments, the value is closer to 1, around 0.5 to 0.9 most of the time. However, the second result returned by the program is often closer to 1, and that result is the wrong audio.

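The files ingested are WMV files, and the audio specs are the following:

  • 3 audio tracks
  • Codec WMA 9.2
  • Constant bit rate mode at 96kbps
  • 2 channels
  • 48 kHz sample rate

I converted these WMV files to TS files using ffmpeg to match the characteristics of the TS segments, which are the following:

  • Single audio track
  • Codec AAC LC Version 4
  • Muxing Mode: ADTS
  • 2 channels
  • 48 kHz sample rate
  • Lossy compression mode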

Also, something weird I noticed: when I take part of a TS file that I converted from a WMV file ingested by DejaVu, the input_confidence is 1 or close to 1 most of the time. But when I take the same part of the audio from a TS segment of my HLS playlist, the result is poor: close to 0 for 60-second segments, or closer to 1 but still not high enough for 6-second segments. How can one explain that?
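For reference, the WMV-to-TS conversion described above could be scripted roughly as follows. This is only a sketch: the exact ffmpeg options used in the original post are not given, so the track selection, the dropped video stream, and the helper names `build_ffmpeg_cmd` and `convert` are assumptions for illustration.

```python
import subprocess
from typing import List


def build_ffmpeg_cmd(src: str, dst: str, audio_track: int = 0) -> List[str]:
    """Build an ffmpeg command that keeps one audio track as 48 kHz stereo AAC in MPEG-TS.

    Assumption: only the audio matters for fingerprinting, so video is dropped.
    """
    return [
        "ffmpeg",
        "-y",                          # overwrite dst if it already exists
        "-i", src,
        "-map", f"0:a:{audio_track}",  # keep a single audio track out of the three
        "-vn",                         # drop the video stream
        "-c:a", "aac",                 # ffmpeg's native AAC encoder (LC profile by default)
        "-ar", "48000",                # 48 kHz sample rate
        "-ac", "2",                    # 2 channels
        "-f", "mpegts",                # MPEG-TS container, like the HLS segments
        dst,
    ]


def convert(src: str, dst: str, audio_track: int = 0) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, audio_track), check=True)
```

Calling `convert("ad_1.wmv", "ad_1.ts")` would then produce a TS file whose audio parameters match the list above. It would not reproduce whatever encoder or bitrate the broadcaster used for the live segments.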

How can you get more relevant results?

mkommar commented 2 months ago

To be honest, you do want short clips. Google uses this method. Is there a design reason that requires 60 seconds? Think of the algorithm: any noise or variation will make the "beat" vary. In my attempt, I used 2 to 4 seconds.
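A minimal sketch of the short-clip idea: split each segment into overlapping windows of a few seconds and run recognition on each window separately. The 3-second window, the 1.5-second hop, and the helper name `sliding_windows` are assumptions for illustration; cutting the audio and calling the recognizer are left out.

```python
from typing import Iterator, Tuple


def sliding_windows(
    total_s: float, window_s: float = 3.0, hop_s: float = 1.5
) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times in seconds for overlapping windows inside a segment.

    Only full-length windows are produced; a trailing remainder shorter than
    window_s is skipped.
    """
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window_s and hop_s must be positive")
    start = 0.0
    while start + window_s <= total_s:
        yield (start, start + window_s)
        start += hop_s


if __name__ == "__main__":
    # A 6-second TS segment split into 3-second windows with 50% overlap.
    for start, end in sliding_windows(6.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Each (start, end) pair could then be cut out of the segment, for example with ffmpeg's `-ss` and `-t` options, and passed to the recognizer.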

On Thu, Mar 14, 2024, 1:41 PM nathandebalthasar @.***> wrote:

Hello, I've been trying to recognize TV shows as well as ads ingested using DejaVu in real time using an HLS playlist. The shows last from a few minutes to hours and the ads generally last for a few dozen seconds.

The main problem lies in the fact that when doing the recognition on a TS segment that should match an audio file ingested by DejaVu, the input_confidence attribute, depending on the length of the segment, is really low, or not close enough to 1.

When using 60-second TS segments, the input confidence value tends towards 0. Often, the value is <= 0.1 using the default settings and can grow to <= 0.2 using these https://github.com/denis-stepanov/advent?tab=readme-ov-file#dejavu-tuning settings. Using 6-second segments, the value is closer to 1, around 0.5 to 0.9 most of the time. However, the second result returned by the program is often closer to 1, which will be a wrong audio.

The files ingested are WMV files, and the audio specs are the following:

  • 3 audio tracks
  • Codec WMA 9.2
  • Constant bit rate mode at 96kbps
  • 2 channels
  • 48 kHz sample rate

What I did is transform these WMV files into ts files using ffmpeg to match the ts segments characteristics, which are the following:

  • Single audio track
  • Codec AAC LC Version 4
  • Muxing Mode: ADTS
  • 2 channels
  • 48 kHz sample rate
  • Lossy compression mode

Also, something weird I noticed is that when taking a part of a TS file that I transformed from a WMV file which is ingested by DejaVu, the input_confidence will most of the time be 1 or close to 1. But when taking the same part of audio from a ts segment of my HLS playlist, the result will not be good, close to 0 for 60-second segments or close to 1 but not enough using 6-second segments. How can one explain that?

How can you get more relevant results?


quannabe commented 2 months ago

Agree with the above. Shorter clips should yield better results.

Curious about your use case. Can you share more?

nathandebalthasar commented 2 months ago

To be honest, you do want short clips. Google uses this method. Is there a design reason that requires 60 seconds? Think of the algorithm: any noise or variation will make the "beat" vary. In my attempt, I used 2 to 4 seconds.

There is no particular reason to use 60-second segments. I was using 6-second segments at the beginning, and at some point I noticed fewer false positives with longer segments, at the cost of a loss of precision.
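One way to keep short clips while limiting false positives is to accept a match only when several consecutive short windows agree on it. This technique is not from the thread and is not part of DejaVu; the function name `confirmed_match` and the threshold of 3 windows are assumptions for illustration.

```python
from typing import List, Optional


def confirmed_match(
    window_matches: List[Optional[str]], min_run: int = 3
) -> Optional[str]:
    """Return the first name that is the top match in min_run consecutive windows.

    window_matches holds the top result name for each window in time order,
    or None when a window produced no usable match.
    """
    run_name: Optional[str] = None
    run_len = 0
    for name in window_matches:
        if name is not None and name == run_name:
            run_len += 1
        else:
            # A different name (or no match) starts a new run.
            run_name = name
            run_len = 1 if name is not None else 0
        if run_name is not None and run_len >= min_run:
            return run_name
    return None
```

An isolated wrong match in a single window is then ignored, while an ad that spans several windows is still reported.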

To be honest, you do want short clips.

Does it apply only to the files used during recognition? Or also the files that DejaVu ingests?

Agree with the above. Shorter clips should yield better results.

Curious about your use case. Can you share more?

I'm building a solution that aims to recognize a given television program, series, or ad in real time using TS segments from an HLS playlist.

nathanagez commented 2 months ago

Hi @quannabe @mkommar, we tested with shorter clips but we ended up with low confidence results as well.

This is how we proceeded. We have TV ads that last between 10 and 20 seconds, which we ingested in DejaVu. If I take the exact same file and compare it with what DejaVu fingerprinted, we obtain a very good confidence level (close to or equal to 1).

Let's assume we have the following:

We ingest all of them in DejaVu. Then, if we provide ad_1.wmv for recognition, it matches what we have in the database and we end up with an input_confidence close to or equal to 1.

Now let's do the same, we ingest:

The start and end of our .ts segment contain audio unknown to DejaVu, but the middle contains our ad_2, which we ingested previously.

If we run the recognition on this segment, this is where we end up with very low confidence.
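A possible explanation, sketched under two assumptions: that input_confidence is roughly the share of the query's hashes that matched the stored track, and that hashes are spread evenly over time. Under those assumptions, audio in the segment that DejaVu does not know dilutes the score, so the ratio of known audio to segment length is an upper bound. The helper name `max_input_confidence` is hypothetical.

```python
def max_input_confidence(segment_s: float, known_overlap_s: float) -> float:
    """Upper bound on input_confidence when only part of the segment is known audio.

    Assumes input_confidence ~ matched hashes / hashes in the query and a
    uniform hash density per second of audio.
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    # The known part cannot be negative or longer than the segment itself.
    overlap = min(max(known_overlap_s, 0.0), segment_s)
    return overlap / segment_s


if __name__ == "__main__":
    # A 15-second ad inside a 60-second segment.
    print(max_input_confidence(60, 15))
    # A 6-second segment that lies entirely inside the ad.
    print(max_input_confidence(6, 6))
```

Under these assumptions, a 15-second ad inside a 60-second segment cannot score above 0.25 even with a perfect match, which would be consistent with the low values reported for long segments.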

quannabe commented 2 months ago

Interesting use case!

I've had issues with query times greatly increasing as the audio library size increases. Have you run into this?

mkommar commented 2 months ago

Got it. Are you requiring passive listening or is it from a direct recording that this identification will happen? Meaning is the use case always going to have a direct recording from a source stream? Or will it pick up audio from the background on a phone or Alexa device?

Mahesh

On Wed, Mar 20, 2024, 10:32 AM William Sell @.***> wrote:

Interesting use case!

I've had issues with query times greatly increasing as the audio library size increases. Have you run into this?
