Python 2 - Exact movie title matches not working well with Unicode titles

apo86 commented 3 years ago

There is a function is_best in tmdb.py that is looking for exact title matches in order to boost them to the top of the results list (because TMDB itself fails to do so):

    def is_best(item):
        return item['title'].lower() == title and (
            not year or item.get('release_date', '').startswith(year))

This does not work well with Unicode titles, as far as I can tell because one of the strings in the comparison is not correctly encoded.

For example, scraping the German title "Verschwörung" ("The Girl in the Spider's Web") with the release year 2018 added and the search language set to de-DE returns "Die UFO-Verschwörung" and "Die Damaskus Verschwörung" (both of which were also released in 2018) above the exact match.
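A minimal sketch of the failing comparison, assuming (as the fixes later in this thread suggest) that the API title is text while the search title reaches `is_best` as UTF-8 bytes. The variable names are made up for illustration and are not taken from tmdb.py:

```python
# -*- coding: utf-8 -*-
# Text title, as it would come out of a decoded JSON response.
api_title = u'Verschwörung'
# Byte-string title, standing in for a Python 2 `str` search title (assumption).
search_title = api_title.encode('utf-8')

# Comparing text with bytes is never equal when non-ASCII characters are involved.
naive_match = api_title.lower() == search_title

# Bringing both sides to text first makes the exact match visible.
decoded_match = api_title.lower() == search_title.decode('utf-8').lower()

print(naive_match, decoded_match)
```

With a pure ASCII title, Python 2 converts implicitly and the comparison succeeds, which would explain why only titles with umlauts or accents are affected.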

In my attempts to get it to work, I ended up with:

    def is_best(item):
        return item['title'].lower().encode('utf-8') == title and (
            not year or item.get('release_date', '').startswith(year))

And with this, the exact match is correctly returned at the top of the list. I'm not sure if this is a "good" fix and I have not tested this beyond the one title. I would appreciate it if someone who knows what they're doing could have a look. Thanks!

apo86 commented 3 years ago

When I first opened this issue I didn't really think it could be system related, but just in case: this is on OSMC 2020.11-1, running Kodi 18.9.

What I also didn't realize is that it actually produces an error message, which either wasn't there the first time or I completely overlooked:

2021-07-13 20:59:50.201 T:3341398240 ERROR: /home/osmc/.kodi/addons/metadata.themoviedb.org.python/python/lib/tmdbscraper/tmdb.py:39: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return item['title'].lower() == title and (

Lastly, this may come as no surprise, but the following also fixes the issue:

    def is_best(item):
        return item['title'].lower() == title.decode('utf-8') and (
            not year or item.get('release_date', '').startswith(year))

I don't know whether encoding the left side or decoding the right side is the "correct" way to do it, or whether there is any functional difference at all (my guess would be no). I do know that the current implementation is not correct.
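On the encode-versus-decode question, one case where the two can differ is lowercasing. The sketch below assumes the search title is lowercased while it is still a byte string; the thread does not show how `title` is actually prepared, so this is an illustration under that assumption, with made-up variable names:

```python
# -*- coding: utf-8 -*-
api_title = u'Les Éternels'
# Hypothetical upper-case search title, held as UTF-8 bytes.
search_bytes = u'LES ÉTERNELS'.encode('utf-8')

# Option A: encode the left side and lowercase the bytes on the right.
# Lowercasing bytes only touches ASCII letters, so the bytes of "É" stay as they are.
encode_left = api_title.lower().encode('utf-8') == search_bytes.lower()

# Option B: decode the right side first, then lowercase it as text.
decode_right = api_title.lower() == search_bytes.decode('utf-8').lower()

print(encode_left, decode_right)
```

For a title such as "Verschwörung", where the only non-ASCII letter is already lower case, both options give the same answer, which matches what was observed above.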

rmrector commented 3 years ago

This is due to the goofy string handling in Python 2 - only Kodi Leia is affected by this.

We'll need a simple solution that works for both versions of Python.

apo86 commented 3 years ago

How do you feel about something like this:

    def is_best(item):
        itemtitle = item['title'].lower()
        if str(type(itemtitle)) == "<type 'unicode'>":
            itemtitle = itemtitle.encode('utf-8')
        return itemtitle == title and (
            not year or item.get('release_date', '').startswith(year))

In my testing this worked with both Kodi 18.9 and Kodi 19.1.

I don't know a lot of Python, and everything I know about Python data types I learned in the last few hours, so maybe there is a simpler or more elegant solution. But this is the only thing I could come up with.
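One possible alternative to comparing the type name as a string is an `isinstance` check against `bytes`, which exists on both Python 2 (as an alias of `str`) and Python 3. This is only a sketch: `to_text` and `make_is_best` are hypothetical names, and it is not necessarily how the fix was eventually implemented in the add-on.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return `value` as text, decoding it first if it is a byte string."""
    # `bytes` is an alias of `str` on Python 2, so one check covers both versions.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def make_is_best(title, year=None):
    """Build an is_best(item) predicate for a search title and optional year."""
    wanted = to_text(title).lower()  # decode first, then lowercase as text
    wanted_year = to_text(year) if year else None

    def is_best(item):
        return to_text(item['title']).lower() == wanted and (
            not wanted_year
            or to_text(item.get('release_date', '')).startswith(wanted_year))

    return is_best
```

Decoding toward text, rather than encoding toward bytes, keeps `lower()` working on non-ASCII letters in either Python version.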

WeetA34 commented 2 years ago

Hello,

Also, it would be nice to remove accents for some languages, because the is_best function doesn't work properly if the folder/file title doesn't contain accents and the result title does. For example, Les Eternels (2021)\Eternals.2021.xxxxxxx.mkv returns two movies in the following order:

Les Éternels (2018): original_title="江湖儿女", release_date="2018-09-21"
Les Éternels (2021): original_title="Eternals", release_date="2021-11-03"

So the first one is used instead of the second one.

I tested the following modification in kodi_19.3_android_tv:python\lib\tmdbscraper\tmdb.py. It works fine with these French titles:

--- tmdb.py.orig    2022-08-08 14:40:18.000000000 +0200
+++ tmdb.py 2022-08-08 14:40:17.000000000 +0200
@@ -1,5 +1,6 @@
 from datetime import datetime, timedelta
 from . import tmdbapi
+import unicodedata

 class TMDBMovieScraper(object):
     def __init__(self, url_settings, language, certification_country, search_language=""):
@@ -39,8 +40,10 @@
         urls = self.urls

         def is_best(item):
-            return item['title'].lower() == title and (
-                not year or item.get('release_date', '').startswith(year))
+            normalized_item_title = u''.join([c for c in unicodedata.normalize('NFKD', item['title'].lower()) if not unicodedata.combining(c)])
+            normalized_title = u''.join([c for c in unicodedata.normalize('NFKD', title.lower()) if not unicodedata.combining(c)])
+            return normalized_item_title == normalized_title and (not year or item.get('release_date', '').startswith(year))
+
         if result:
             # move all `is_best` results at the beginning of the list, sort them by popularity (if found):
             bests_first = sorted([item for item in result if is_best(item)], key=lambda k: k.get('popularity',0), reverse=True)
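The accent handling in the patch above can be pulled into a small helper, which removes the duplicated expression. A sketch, assuming both titles are already text; `strip_accents` and `titles_match` are hypothetical names that do not appear in tmdb.py:

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Drop combining marks (accents, umlauts) after NFKD decomposition."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(c for c in decomposed if not unicodedata.combining(c))


def titles_match(a, b):
    """Compare two text titles, ignoring case and accents."""
    return strip_accents(a.lower()) == strip_accents(b.lower())


print(strip_accents(u'Les Éternels'))
print(titles_match(u'Les Éternels', u'Les Eternels'))
```

With this comparison, both "Les Éternels" results match the folder title "Les Eternels", so the year check is what separates the 2018 entry from the 2021 one. Letters that have no decomposition, such as "ø", pass through unchanged. On Python 2, `unicodedata.normalize` expects a `unicode` argument, so a byte-string title would likely need decoding first.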

JimmyS83 commented 1 year ago

@rmrector

"This is due to the goofy string handling in Python 2 - only Kodi Leia is affected by this.

We'll need a simple solution that works for both versions of Python."

I am not sure whether it wouldn't be easier, for maintenance, to actually split master into py2 and py3 branches, with the UTF-8 conversions needed only in the py2 branch, along with any other py2/py3 differences that pop up. That approach works fairly well for quite a number of add-ons.

rmrector commented 1 year ago

It is not easier to maintain two copies of a codebase.

However, Kodi 18 and Python 2 are about to lose parallel support, so I can see putting together one last Kodi 18 release with a Python 2 specific fix for this.

rmrector commented 1 year ago

Released in 1.6.3 for Leia

xbmc / metadata.themoviedb.org.python

Python 2 - Exact movie title matches not working well with Unicode titles #74