zorchenhimer / MoviePolls

Voting to decide on a movie to watch with MovieNight
https://discord.gg/F2VSgjJ

Duplicate check is broken in certain cases of api autofill #72

Open CptPie opened 4 years ago

CptPie commented 4 years ago

https://moviepolls.zorchenhimer.com/movie/82 is available for voting, but https://moviepolls.zorchenhimer.com/movie/41 was watched thirteen months and two days ago? - abridgewater on Discord

In the reported case, the movie "Princess Mononoke" was added twice and got past the duplicate check. This happened because one entry was added using MAL autofill and the other using IMDB autofill. Since the autofilled title is formatted differently depending on the API used, the titles didn't match and the duplicate check did not trigger.

Possible solutions:

zorchenhimer commented 4 years ago

Are multiple titles returned for either IMDB or MAL? I know AniDB returns multiple titles. Maybe we can store those and use them for the duplicate check?

Other than that, there would need to be some normalization of characters or a similarity check for titles. Maybe some sort of string metric for checking similar strings? (see https://en.wikipedia.org/wiki/String_metric). If a title is close enough to an existing one, we could prompt for confirmation or require mod/admin approval. This would probably break with sequels though, e.g. "Deadpool" and "Deadpool 2" differ by only one character (two including the space).
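To make the normalization and string metric idea concrete, here is a minimal sketch. It assumes Levenshtein distance as the metric (the comment above does not pick one) and the helper names `normalizeTitle` and `levenshtein` are hypothetical, not part of MoviePolls.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeTitle lowercases a title, drops punctuation and collapses
// whitespace, so that pure formatting differences do not matter.
// (Hypothetical helper, not existing MoviePolls code.)
func normalizeTitle(title string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(title) {
		switch {
		case unicode.IsLetter(r), unicode.IsDigit(r):
			b.WriteRune(r)
		case unicode.IsSpace(r):
			b.WriteRune(' ')
		}
	}
	return strings.Join(strings.Fields(b.String()), " ")
}

// levenshtein returns the edit distance between a and b, counted in runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	a := normalizeTitle("Deadpool")
	b := normalizeTitle("Deadpool 2")
	// A small distance flags the pair as "similar", which is exactly the
	// sequel problem: these are different movies.
	fmt.Println(levenshtein(a, b))
}
```

A distance threshold alone cannot tell a sequel from a misspelling, so a check like this would only be suitable for prompting a confirmation, not for rejecting a submission outright.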

CptPie commented 4 years ago

Regarding the API results:

Jikan (there is no guarantee that both title and title_english are filled; when the original title is already English, the "title_english" field is null): (image)

TMDb: (image)
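The null `title_english` case could be handled with a simple fallback. The sketch below is an assumption about how that might look: only the field names `title` and `title_english` come from the comment above, while the struct, the helper names, and the sample JSON are made up for illustration and are not a real Jikan response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jikanTitles holds only the two fields discussed above.
// (Hypothetical type, not existing MoviePolls code.)
type jikanTitles struct {
	Title        string  `json:"title"`
	TitleEnglish *string `json:"title_english"` // nil when the JSON value is null
}

// parseTitles decodes the two title fields from a JSON object.
func parseTitles(raw []byte) (jikanTitles, error) {
	var t jikanTitles
	err := json.Unmarshal(raw, &t)
	return t, err
}

// preferredTitle returns the English title when it is present and
// non-empty, and falls back to the plain title otherwise.
func preferredTitle(t jikanTitles) string {
	if t.TitleEnglish != nil && *t.TitleEnglish != "" {
		return *t.TitleEnglish
	}
	return t.Title
}

func main() {
	// Hand-written sample fragments, not real API output.
	samples := []string{
		`{"title": "Mononoke Hime", "title_english": "Princess Mononoke"}`,
		`{"title": "Some English Title", "title_english": null}`,
	}
	for _, s := range samples {
		t, err := parseTitles([]byte(s))
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(preferredTitle(t))
	}
}
```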

zorchenhimer commented 4 years ago

So storing a single title for display plus a bunch of alt titles is plausible then.

CptPie commented 4 years ago

In theory, yes. But I am worried about the data quality of TMDb, seeing the original title in kanji (?) while the MAL title is in Latin script, so that won't help us much.

Regardless, I think it would be nice to have an "improved" movie struct with

Title string
Org_Title string
Year string (or int)

and then use a common title format for both APIs (i.e. "Movie.Title (Movie.Org_Title) (Movie.Year)" ).

With this struct we could assume that Movie.Title is always the English title (whenever possible) and use that field for the duplicate check.
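A rough sketch of the proposed struct and a title-based duplicate check follows. It is only an illustration under assumptions: `OrgTitle` stands in for the proposed `Org_Title` (Go naming style), `Year` is taken as an int, and `DisplayName` and `isDuplicate` are hypothetical helpers rather than existing MoviePolls code. Skipping an empty or identical original title in the display format is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Movie is a sketch of the "improved" struct proposed above.
type Movie struct {
	Title    string // English title whenever the API provides one
	OrgTitle string // original title (Org_Title in the proposal); may be empty
	Year     int
}

// DisplayName renders "Title (OrgTitle) (Year)". The original title is
// left out when it is empty or identical to Title (an assumption).
func (m Movie) DisplayName() string {
	if m.OrgTitle == "" || m.OrgTitle == m.Title {
		return fmt.Sprintf("%s (%d)", m.Title, m.Year)
	}
	return fmt.Sprintf("%s (%s) (%d)", m.Title, m.OrgTitle, m.Year)
}

// isDuplicate reports whether candidate.Title matches the Title of any
// existing movie, ignoring case and surrounding whitespace.
func isDuplicate(existing []Movie, candidate Movie) bool {
	want := strings.TrimSpace(candidate.Title)
	for _, m := range existing {
		if strings.EqualFold(strings.TrimSpace(m.Title), want) {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative values: the same movie as it might arrive from two APIs.
	fromMAL := Movie{Title: "Princess Mononoke", OrgTitle: "Mononoke Hime", Year: 1997}
	fromTMDb := Movie{Title: "Princess Mononoke", OrgTitle: "もののけ姫", Year: 1997}

	fmt.Println(fromTMDb.DisplayName())
	fmt.Println(isDuplicate([]Movie{fromMAL}, fromTMDb))
}
```

Because the check looks only at `Title`, the differing original titles from the two APIs no longer matter, provided both autofill paths really do fill `Title` with the English title.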