yairm210 / Unciv

Open-source Android/Desktop remake of Civ V
Mozilla Public License 2.0

Feature request: Translate into all languages using GPT #10592

Closed awhillas closed 7 months ago

awhillas commented 11 months ago

Problem Description

Should be in all major languages

Related Issue Links

No response

Desired Solution

It's trivial to get GPT to translate from any language into any other language while preserving the variables/formatting, with a template like:

You are a helpful language translator. You are to translate from the source language into the target language. You should not translate things between square brackets `[` and `]`, these should be copied as is.

source: en
---
Hello stranger! Welcome to UnCiv. This is a game about something or other! Blah blah blah etc
---
target: ja
---

etc. All that's needed is a CSV of all the strings to be translated and a simple script to do it (if some of the strings are short and ambiguous, there might be a need for some context as to what the string is for). Also an OpenAI account (but I might be able to help out there). I'm also guessing that once you have translated into many languages the UI is going to need some updates, as some strings will be too long in some languages.
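As a rough illustration of the template above, here is a minimal Python sketch that assembles one prompt per string. The function name `build_prompt` and the optional `context` line are assumptions made for this example (the context line reflects the remark about short, ambiguous strings); nothing here calls an actual model.

```python
# Instruction text taken from the template in this issue.
SYSTEM_PROMPT = (
    "You are a helpful language translator. You are to translate from the "
    "source language into the target language. You should not translate "
    "things between square brackets `[` and `]`, these should be copied as is."
)


def build_prompt(text: str, source: str, target: str, context: str = "") -> str:
    """Assemble a prompt in the source/---/text/---/target/--- layout.

    `context` is a hypothetical extra field for short or ambiguous strings.
    """
    parts = [SYSTEM_PROMPT, ""]
    if context:
        parts += [f"context: {context}", ""]
    parts += [f"source: {source}", "---", text, "---", f"target: {target}", "---"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Welcome, [playerName]!", "en", "ja"))
```

The model's reply would then be read as the text following the final `---`; how reliably that holds is exactly what would need testing.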

Alternative Approaches

Use a different language model, e.g. Llama 2 or Bloom (which has a focus on many languages).

Additional Context

Would be pretty cool to have it in 20 languages in a very short space of time!

SomeTroglodyte commented 11 months ago

What's a GPT? A Ghastly Precocious Teen? Or is it something to eat?

Caballero-Arepa commented 11 months ago

❌️Nope. I say no to it.

Translating Unciv is not translating a text, even if you were to format it as such.

You can't just use a translator to get a good result because the computer does not have the level of comprehension a human does.

Why? Because there are game terms, unit formatting, bracket variables, and all of that. You need to stay consistent with yourself, and in some cases you need to change the phrase entirely to convey the message, not just translate literally.

A Machine Learning program won't be able to understand the usage of conditionals on uniques and their context.

Cwpute commented 11 months ago

[…] A Machine Learning program won't be able to understand the usage of conditionals on uniques and their context.

…yet. But until then I very much agree with all that you said.

Did you have a particular language in mind @awhillas ? Because if you happen to know that language, you might want to join in and translate the game yourself (here for more details).

SomeTroglodyte commented 11 months ago

Look - it's true an AI approach may add value to machine translation, specifically with patterns across "translatable units". But

The idea to take machine translations is not new. I did a version of Unciv once that even machine-translated on-the-fly, using the one open source MT engine! Massive delays of course, but a cute experiment. Yup, still accessible: Apertium. (I never converted that into a run-once tool to "seed" a new language, though, as most languages I deemed interesting to get this way have missing to superficial support from that project.)

awhillas commented 11 months ago

@SomeTroglodyte By GPT I mean ChatGPT, which does a very good job of translating (depending on how rare the language is).

@Caballero-Arepa

the computer does not have the level of comprehension a human does.

Oh yes it does. I guess you haven't played with ChatGPT? But I don't have to convince you, just give it a try.

Why? Because there are game terms, unit formatting, bracket variables, and all of that.

Yeah, the model is smart enough to handle that. I've been using it to do markdown and it translates the formatting there flawlessly. If you have some special format you can include the rules in the prompt and it will get it right nearly all the time. The small number of times it doesn't it will be obvious and easily fixed.

It would certainly give a good first draft. Speakers of a given language who think the translation needs work can then chime in, but they won't have to do all the translation, just correct small details, which is less intimidating to start with than translating everything.

Anyway, it might be an interesting experiment and if it's a failure you can just remove the language files. Low risk, high reward.

SomeTroglodyte commented 11 months ago

"ChatGPT" - Crappy hack abusing trust Gruesomely (while) Perusing Tetrahydrocannabinol"?

Low risk high threshold - I'm not going to open an account with a commercial entity I don't trust and where I see no good reason to evaluate that trust. So - you go ahead. I'm out.

SomeTroglodyte commented 11 months ago

Actually, the argument may be - how will Goblin Partisan Terror output be redistributable under an undisputable open license? I guess not at all.

That makes this a feature request to support adding languages as mod.

yairm210 commented 11 months ago

IF someone can get ChatGPT to do the work of reading the file and translating line-by-line while retaining placeholders, this sounds like a possible time-saver. If this requires manually copy-pasting lines to and from ChatGPT, it's a waste of time. @awhillas if you can make it work for one language and guide us through the steps, we can consider this. But this sounds to me more like "I bet GPT can solve this" than an actual solution.
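Whether placeholders were retained is something a script can check mechanically, independent of which tool produced the translation. The sketch below is one possible check, assuming placeholders are simple, non-nested `[...]` spans that must appear in the translation exactly as often as in the source (order may differ). The function names are made up for this example.

```python
import re
from collections import Counter

# Matches an innermost [...] span; nested brackets are not handled (assumption).
PLACEHOLDER = re.compile(r"\[[^\[\]]*\]")


def placeholders(text: str) -> Counter:
    """Count each bracketed placeholder occurring in the text."""
    return Counter(PLACEHOLDER.findall(text))


def placeholders_preserved(source: str, translation: str) -> bool:
    """True if both strings contain the same placeholders the same number of times."""
    return placeholders(source) == placeholders(translation)


if __name__ == "__main__":
    # Reordered placeholders are accepted, translated ones are flagged.
    print(placeholders_preserved("[unitName] attacks [cityName]",
                                 "[cityName] wird von [unitName] angegriffen"))
    print(placeholders_preserved("Gain [amount] gold", "Erhalte [Menge] Gold"))
```

A check like this only catches mangled placeholders; it says nothing about whether the surrounding translation is any good.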

Regarding licensing for GPT output in the game - that is a GOOD POINT that I had not even considered!

I would definitely evaluate this under "high risk low reward" rather than the opposite. I'm keeping this open for, say, another week so others can gather data and comment; if it's not actionable by then, I'll close it. Because currently, this is NOT actionable.

(What are the limits of input you can send to GPT? Will it be able to eat 5000 line files? What are the limits of output? More questions than answers here)
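The actual input and output limits depend on the model and are not stated in this thread, but a large file can be split into smaller requests either way. Below is a minimal sketch that groups lines under a character budget; `max_chars` is a hypothetical stand-in for whatever the real limit turns out to be (real limits are usually counted in tokens, not characters).

```python
from typing import Iterable, Iterator, List


def batch_lines(lines: Iterable[str], max_chars: int = 4000) -> Iterator[List[str]]:
    """Yield consecutive groups of lines whose combined size stays within max_chars.

    Each line counts as len(line) + 1 (for the newline). A single line longer
    than the budget is still emitted, alone in its own batch.
    """
    batch: List[str] = []
    size = 0
    for line in lines:
        cost = len(line) + 1
        if batch and size + cost > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += cost
    if batch:
        yield batch


if __name__ == "__main__":
    demo = ["a" * 10] * 5
    print([len(b) for b in batch_lines(demo, max_chars=25)])
```

Batching by whole lines keeps each entry intact, so the output can be matched back to the input line by line.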

SomeTroglodyte commented 11 months ago

Actually, implementing Language-adding mods might be a nice step, even streamlining the new-lang process. Since not part of Ruleset, would need some special-casing, but doable. A ModOptions unique triggering language table extension or somesuch.

awhillas commented 11 months ago

The licence on the output of ChatGPT, or any LLM, or any AI model (language, image, audio, whatever), is ambiguous at the moment. Considering that it is trained on a lot of open content, I would say they don't have a leg to stand on if they try to claim copyright on any of its generated output. Also, considering how many enterprises have built products on it, that can't be the case. Anyway, if you can find any information about copyright of LLM output, please let me know! (My job depends on it.) But just thinking about it, if OpenAI did try to claim copyright on anything it generated, its whole business model would collapse, for who would continue to use them? So there is that.

But if that is still a concern, there are plenty of open source models, such as Bloom (haven't tried it for translation, as hosting a 70-billion-param model is tricky, but there is a way) or Llama 2 (again, haven't tried it for translation, but Google Vertex hosts this; just need a GCP account). But I haven't used any of these (yet) as I pay for OpenAI and it's easy.

So I don't really see the "high risk low reward" side of it? Translating to 59 languages on the fly for each update also doesn't seem like a "low reward" to me. I'm sure once the industry cottons on to this every game will do it, why wouldn't they?

To get started I'd generally just send one related block of text at a time, with related context and perhaps some examples of how to handle tricky formatting stuff. But it's pretty smart and can figure out most things from patterns/examples.

Give me something like a CSV with source text in one column and I'll write a script to pass it through OpenAI and output in another column of the same CSV (or another one with the same text ID or whatever you're using). Should take me 30 minutes.
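For illustration only, such a script could look roughly like the sketch below. The column names `id` and `source` are assumptions, and the `translate` argument is a caller-supplied function standing in for the model call, so the example runs without any account or network access.

```python
import csv
import io
from typing import Callable


def translate_csv(src_text: str, translate: Callable[[str, str], str], target: str) -> str:
    """Read CSV text with 'id' and 'source' columns (hypothetical layout) and
    return CSV text with an extra column named after the target language."""
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "source", target],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "id": row["id"],
            "source": row["source"],
            target: translate(row["source"], target),
        })
    return out.getvalue()


if __name__ == "__main__":
    # Stand-in "translator" that just upper-cases the text.
    fake = lambda text, lang: text.upper()
    print(translate_csv("id,source\n1,Hello [name]\n", fake, "ja"))
```

Passing the translator in as a function keeps the file handling separate from the choice of model, which also makes it easy to swap OpenAI for one of the open models mentioned above.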

yairm210 commented 11 months ago

https://github.com/yairm210/Unciv/blob/master/android/assets/jsons/translations/template.properties
Not exactly a CSV, delimited by =, but close enough.
Example translated file: https://github.com/yairm210/Unciv/blob/master/android/assets/jsons/translations/Russian.properties
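A file delimited by `=` can be read with a few lines of Python. The sketch below makes several assumptions that should be checked against the real files: one entry per line, the source text to the left of the first `=`, the translation (possibly empty) to the right, and lines starting with `#` treated as comments. The function name is invented for this example.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse 'source = translation' lines into a dict.

    Assumptions: blank lines and lines starting with '#' are skipped, and the
    source text itself never contains '='.
    """
    entries: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        entries[key.strip()] = value.strip()
    return entries


if __name__ == "__main__":
    sample = "# General\nHello = \nGain [amount] gold = Erhalte [amount] Gold\n"
    for source, translation in parse_properties(sample).items():
        print(repr(source), "->", repr(translation))
```

Entries with an empty right-hand side are the ones still waiting for a translation.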

awhillas commented 11 months ago

cool, I'll do a Russian one so we can compare. And a language you don't have?

SomeTroglodyte commented 11 months ago

language you don't have

Bork! Or Klingon in a pinch. Problem is, the Klingon alphabet :tm: is defined in unicode, but outside the UCS-16 range, so not trivial to include with a libGdx leg to stand on.

SomeTroglodyte commented 11 months ago

Or just look in https://github.com/yairm210/Unciv/blob/master/android/assets/jsons/translations/ and try to think of one missing. Or Greek, it's at 14% coverage, and not been updated in a while, and in a sense Civ-relevant..

yairm210 commented 11 months ago

language you don't have

Bork! Or Klingon in a pinch. Problem is, the Klingon alphabet ™️ is defined in unicode, but outside the UCS-16 range, so not trivial to include with a libGdx leg to stand on.

...no

Greek is a good pick, though :)

SomeTroglodyte commented 11 months ago

...no

:sob: - the Monty Python dialect of Hungarian, then? Volapük? Loglan? Dravidian? Blissymbols? Anything that would have stumped Sapir-Whorf?

yairm210 commented 11 months ago

Ah yes, the famous *checks notes* Whorf effect. Actually, never heard of Blissymbols before. "My [unitName] is full of [unitName]s"

awhillas commented 11 months ago

ok, so Greek? I guess you don't have anyone who can check it? I could also do English in a pirate voice or make every longish bit of text a rap :D

yairm210 commented 11 months ago

Greek is good, pirate would be acceptable as a POC but Greek would be better, we can push it to 'prod' and see what people say

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 15 days.

yairm210 commented 7 months ago

Closing - the attempt that we have seen (Czech) was deemed unworkable by a native speaker,