yairm210 / Unciv

Open-source Android/Desktop remake of Civ V
Mozilla Public License 2.0

Feature request: support non-ASCII character filenames #9548

Closed GrimPixel closed 1 year ago

GrimPixel commented 1 year ago

Is your feature request related to a problem? Please describe.
Non-ASCII characters become question marks.

Describe the solution you'd like
Support non-ASCII characters in filenames.

yairm210 commented 1 year ago

We absolutely do support non-ASCII; this sounds like a problem with your default font. Can you show an example of characters that the font displays correctly, but that fail in filenames?

SomeTroglodyte commented 1 year ago

Screenshot, exact Unicode codepoints affected, some System info might help.

GrimPixel commented 1 year ago

In https://github.com/NoviceCoder2000/UnCiv-Vanilla-Music-Pack, some file names contain non-ASCII characters. Here is what the folder looks like after installing it in the game: [image: filename]

SomeTroglodyte commented 1 year ago

It's probably annoyed by that misspelling of an Irish town and wreaks revenge insidiously...

Jokes aside: [image]

So - it's not Unciv mangling that, it's your platform. We have learned recently that some Java installations default to trusting the system for file encoding settings and end up with something other than UTF-8; that may be what's at work here. Remember me asking for "some System info"? Right. If that's a desktop and you can control the Java command line, try adding -Dfile.encoding=UTF-8 before the -jar part. Or install a Temurin >= 18 and use that. Some OS setting tweak may also solve it - some old Linuxes didn't do UTF-8 by default either. Windows installations may default to cp1252, which is not good either.
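For anyone wanting to supply the "System info" asked for above, a minimal sketch that prints the encodings the running JVM picked up. The class name EncodingReport is hypothetical and not part of Unciv. The sun.jnu.encoding property is JDK-internal; to my understanding OpenJDK-based JVMs use it for file names, but it may be missing elsewhere, so treat it as an assumption.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Unciv): collects encoding-related settings.
class EncodingReport {
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        // Charset used whenever text is converted to bytes without an explicit charset
        report.put("defaultCharset", Charset.defaultCharset().name());
        // Can be overridden on the command line with -Dfile.encoding=...
        report.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        // JDK-internal property; assumed to be present on OpenJDK-based JVMs only
        report.put("sun.jnu.encoding", String.valueOf(System.getProperty("sun.jnu.encoding")));
        // Locale environment variables as seen by the JVM ("null" when unset)
        report.put("LANG", String.valueOf(System.getenv("LANG")));
        report.put("LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        report.put("java.version", String.valueOf(System.getProperty("java.version")));
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it once plainly and once with the -Dfile.encoding=UTF-8 argument shows whether the argument changes anything on a given installation.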

SomeTroglodyte commented 1 year ago

... that is a desktop, right?

SomeTroglodyte commented 1 year ago

@yairm210 - can we get that setting into the distribution wrappers? linuxFilesForJar is only a little side aspect. It is regrettable that a System.setProperty("file.encoding", "UTF-8") in fun main is already too late - and shoehorning it in via reflection won't work either: Unable to make field private static volatile java.nio.charset.Charset java.nio.charset.Charset.defaultCharset accessible: module java.base does not "opens java.nio.charset" to unnamed module @458c1321.
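The "too late" part can be checked with a small sketch: once the JVM has resolved its default charset, changing the file.encoding property at runtime no longer affects Charset.defaultCharset(). The class name is hypothetical; the sketch restores the property afterwards so it has no lasting side effect.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: shows that setting file.encoding at runtime is too late.
class LateEncodingProperty {
    // Returns true only if changing the property altered the default charset.
    static boolean runtimeChangeTakesEffect() {
        Charset before = Charset.defaultCharset();
        String original = System.getProperty("file.encoding");
        // Pick a value that differs from whatever the JVM is using right now.
        String other = before.name().equals("UTF-8") ? "ISO-8859-1" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            return !Charset.defaultCharset().equals(before);
        } finally {
            // Put the property back the way it was.
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Runtime change took effect: " + runtimeChangeTakesEffect());
    }
}
```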

One could try the Mac main-thread thing approach - check setting, and if wrong, relaunch oneself with fixed settings? I'd say overkill. Messagebox "your system is broken" and quit :smiling_imp:
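For illustration only, the "check setting, relaunch oneself" idea might look roughly like the sketch below. Everything here is an assumption: the class name, the placeholder jar path "game.jar", and the choice to compare against Charset.defaultCharset(). The actual relaunch is left commented out, and whether this is worth doing at all is exactly what is doubted above.

```java
import java.io.File;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a self-relaunch with a forced UTF-8 default encoding.
class Utf8Relauncher {
    static boolean needsRelaunch() {
        return !StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    // Builds the command line; the encoding argument must come before -jar.
    static List<String> relaunchCommand(String javaHome, String jarPath, String[] args) {
        List<String> command = new ArrayList<>();
        command.add(javaHome + File.separator + "bin" + File.separator + "java");
        command.add("-Dfile.encoding=UTF-8");
        command.add("-jar");
        command.add(jarPath);
        for (String arg : args) {
            command.add(arg);
        }
        return command;
    }

    public static void main(String[] args) {
        if (!needsRelaunch()) {
            System.out.println("Default charset is already UTF-8, nothing to do");
            return;
        }
        // "game.jar" is a placeholder, not the real distribution layout
        List<String> command = relaunchCommand(System.getProperty("java.home"), "game.jar", args);
        System.out.println("Would relaunch with: " + command);
        // To actually relaunch: new ProcessBuilder(command).inheritIO().start()
    }
}
```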

GrimPixel commented 1 year ago

It's MATE on Arch Linux.

SomeTroglodyte commented 1 year ago

Uuuuh, accusing Arch of being broken - Ouch. But you can break it for some apps nevertheless... What are your env settings for the LC_* and LANG* variables? Anything unusual? LC_COLLATE="C" is OK, everything else should be UTF-8 based. And - java --version?

GrimPixel commented 1 year ago

LANG=C, and no LC_* variables are set.

$ echo $LC_COLLATE
$ java -version
openjdk version "1.8.0_372"
OpenJDK Runtime Environment (build 1.8.0_372-b07)
OpenJDK 64-Bit Server VM (build 25.372-b07, mixed mode)

SomeTroglodyte commented 1 year ago

Yeah, it may well be that such an old Java (OK, 372 is young but the language level 1.8 is dark ages) will interpret LANG=C as ISO-8859-1 default encoding -> everything explained.
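A small sketch of why a single-byte default encoding would produce the reported question marks: Java replaces characters a charset cannot represent with '?' when converting a String to bytes. Which charset the JVM really selects under LANG=C is not confirmed in this thread, so the sketch simply tries ISO-8859-1 (the guess above) and US-ASCII as assumptions. The sample text is the first word of the Korean track title mentioned later in the thread, written with Unicode escapes so the source compiles under any encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encodes a name to bytes and decodes it again.
class QuestionMarkDemo {
    static String roundTrip(String name, Charset charset) {
        // Unmappable characters are replaced by the charset's replacement byte
        byte[] bytes = name.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String name = "\uB300\uD55C\uC81C\uAD6D"; // four Hangul syllables
        System.out.println("ISO-8859-1: " + roundTrip(name, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII:   " + roundTrip(name, StandardCharsets.US_ASCII));
        System.out.println("UTF-8 intact: " + roundTrip(name, StandardCharsets.UTF_8).equals(name));
    }
}
```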

Is that Arch standard or did you fiddle? I'd understand - in Mint, using the system settings to control locale-related stuff leads to unacceptable date formats and file manager sorting - those would be better under LANG=C but that's the wrong solution.

As for Java, I can't really recommend a clear path, as there's no Java yet that is both an LTS version and "properly" ignores the system, always using UTF-8 as default encoding instead... I use Java 17 LTS, and encoding improves with Java 18+ as we've heard.

If you fix that (whichever way, including the java encoding argument), you'd likely have to redownload the affected mods.

GrimPixel commented 1 year ago

Good to learn those details. My Arch Linux is only customised in a few aspects.

yairm210 commented 1 year ago

@SomeTroglodyte I would vote against the various "fixes" for encoding that mess with the system. We could have some sort of warning, but I'm not sure where would be the correct place to put it.
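One possible shape for such a warning, as a sketch only: test whether a file name can be encoded in a given charset before writing it, and produce a message if not. The class and method names are hypothetical, and passing Charset.defaultCharset() is an assumption, since the thread has not established which setting the platform really uses for file names. Where to surface the message remains open.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not Unciv code): detects names a charset cannot represent.
class FileNameEncodingWarning {
    static boolean canRepresent(String fileName, Charset charset) {
        return charset.newEncoder().canEncode(fileName);
    }

    // Returns null when the name is fine, otherwise a human-readable warning.
    static String warningFor(String fileName, Charset charset) {
        if (canRepresent(fileName, charset)) {
            return null;
        }
        return "File name cannot be represented in " + charset.name()
                + "; it may end up on disk with '?' placeholders";
    }

    public static void main(String[] args) {
        String name = "\uB300\uD55C\uC81C\uAD6D"; // sample non-ASCII name
        // Assumption: the default charset is a reasonable stand-in for the check
        String warning = warningFor(name, Charset.defaultCharset());
        System.out.println(warning == null ? "No warning needed" : warning);
        System.out.println(warningFor(name, StandardCharsets.US_ASCII));
    }
}
```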

As for the e2e flow:

A. It looks like the Zip uses UTF-8 by default: [image] so I assume this is a problem in reading the file name and not in writing it.

B. @GrimPixel What do the filenames look like to you outside of Unciv?

C. Maybe when reading filenames we can specify encodings? I see that in fromJsonFile, for example, we have file.readString(Charsets.UTF_8.name()) - but I never even thought about the name of the file being encoded differently, and I can't find a place where LibGDX allows stating the encoding of the filesystem.

GrimPixel commented 1 year ago

That image is from the file manager. The names appear there exactly as shown.

yairm210 commented 1 year ago

In that case, I was wrong, and it is about writing the file name 😞

SomeTroglodyte commented 1 year ago

Not necessarily. As I said elsewhere, I can't manage to debug because I cannot set an encoding locally that is not UTF-8 and still allows Unciv to run at all - so it would be critical to know which it actually is, but I keep asking and nobody knows how to answer... As for C - yes, where the stream is within our reach we can enforce UTF-8, but in atlas read/write we can't -> look on the Gdx issue tracker, I forgot the number... No, #7155 it is.

SomeTroglodyte commented 1 year ago

Found a Windows Box. file.encoding is "windows-1252", and that mod downloads fine, all file names are written as they are in the Zip, viewed with an older 7-zip. All except one ((Ambience) 대한제국 애국가 - Ambient) display fine in Unciv - but it can play, ergo font issue.

This issue here, however, must be still another constellation. I would guess wrong encoding mapping during the actual writing, memory to disk not zip to memory... For one, the zip format has a per-file flag for UTF-8, it's set in that mod's zip, and the zip library sources show it's respected.
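The per-file UTF-8 flag can be observed with java.util.zip alone. In the sketch below (class and method names are hypothetical, and the archive is built in memory rather than taken from the mod), an entry with a non-ASCII name is written with UTF-8, bit 11 of the general purpose flags in the first local header is inspected, and the archive is then read back with a deliberately wrong fallback charset. If the flag is respected, the name should still come back intact.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: zip entry names and the per-entry UTF-8 flag.
class ZipUtf8FlagDemo {
    static byte[] zipWithEntry(String entryName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(new byte[] {1, 2, 3});
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Local file header: bytes 6-7 hold the general purpose flags (little endian).
    static boolean hasUtf8Flag(byte[] zipBytes) {
        int flags = (zipBytes[6] & 0xFF) | ((zipBytes[7] & 0xFF) << 8);
        return (flags & 0x0800) != 0; // bit 11: names are UTF-8
    }

    // Reads the first entry name, using 'fallback' for entries without the flag.
    static String firstEntryName(byte[] zipBytes, Charset fallback) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes), fallback)) {
            return zip.getNextEntry().getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\uB300\uD55C\uC81C\uAD6D.txt"; // sample non-ASCII entry name
        byte[] zipBytes = zipWithEntry(name);
        System.out.println("UTF-8 flag set: " + hasUtf8Flag(zipBytes));
        String read = firstEntryName(zipBytes, StandardCharsets.ISO_8859_1);
        System.out.println("Name survives wrong fallback charset: " + name.equals(read));
    }
}
```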

Once the file is available as stream in our code, however, we go to java.io for a FileOutputStream, and there it ends - control of encoding when streaming text to bytes is easy to find, but control of file name encoding - nothing. Java's library seems intentionally obtuse about that, no mention at all in docs (Nice g00gling that method name and "encoding"). Except if we go to java.nio.Path maybe - but mention that and the IDE demands an API level 26 flag - on crossplatform code.

OP's issue, I'm sure, would be solved by environment settings other than that LANG=C.

yairm210 commented 1 year ago

I'm closing this as 'outside the scope of the game and solvable by other means'