dsl dictionary description parse issue on Windows

xiaoyifang commented 1 day ago

JMDict Furigana, JMDict+: https://jd4gd.com/jmdictplus.html

Originally posted by @darlopvil in https://github.com/xiaoyifang/goldendict-ng/issues/1875#issuecomment-2490265261

The description seems to have an encoding issue.

shenlebantongying commented 1 day ago

Not reproducible on my machine (Linux) 😅

xiaoyifang commented 1 day ago

Not reproducible on my machine (Linux) 😅

restrict it to Windows

xiaoyifang commented 1 day ago

The DSL code uses QTextStream to read the description from the .ann file. QTextStream will use the system's default codec, so we need to set the encoding explicitly.

Example (generated with AI):

#include <QByteArray>
#include <QDebug>
#include <QFile>
#include <QString>
#include <QTextCodec>
#include <QTextStream>

// Try to detect the encoding: keep UTF-8 if the data decodes cleanly,
// otherwise fall back to Latin-1.
QTextCodec* detectEncoding(const QByteArray& data) {
    QTextCodec::ConverterState state;
    QTextCodec* codec = QTextCodec::codecForName("UTF-8");
    codec->toUnicode(data.constData(), data.size(), &state);

    if (state.invalidChars > 0) {
        // Invalid characters were found, so try another encoding.
        codec = QTextCodec::codecForName("ISO-8859-1");
    }

    return codec;
}

int main() {
    QFile annFile("path/to/your/file.txt");

    if (!annFile.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qDebug() << "Failed to open file.";
        return -1;
    }

    // Read the raw bytes for detection, then rewind instead of closing,
    // so the stream below reads the still-open file from the start.
    QByteArray data = annFile.readAll();
    annFile.seek(0);

    QTextCodec* detectedCodec = detectEncoding(data);

    QTextStream annStream(&annFile);
    annStream.setCodec(detectedCodec);

    QString content = annStream.readAll();

    qDebug() << "File content:" << content;

    annFile.close();

    return 0;
}

readAll() can be replaced with readLine().

shenlebantongying commented 1 day ago

By default, QTextStream tries to use one of the Unicode encodings.

By default, UTF-8 is used for reading and writing, but you can also set the encoding by calling setEncoding(). Automatic Unicode detection is also supported. When this feature is enabled (the default behavior), QTextStream will detect the UTF-8, UTF-16 or the UTF-32 BOM (Byte Order Mark) and switch to the appropriate UTF encoding when reading.

https://doc.qt.io/qt-6/qtextstream.html#details
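
To make the BOM detection described in that quote concrete, here is a minimal sketch in plain C++. It is not Qt's actual implementation; `BomEncoding` and `detectBom` are hypothetical names used only for illustration. It shows that detection depends entirely on the first two to four bytes, so a file without a BOM yields no information.

```cpp
#include <cstddef>
#include <string>

// Hypothetical enum for this sketch; not a Qt type.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Report which Unicode BOM, if any, the buffer starts with.
BomEncoding detectBom(const std::string& bytes) {
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    const std::size_t n = bytes.size();

    // UTF-32 is tested first: its little-endian BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    if (n >= 4 && at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00)
        return BomEncoding::Utf32LE;
    if (n >= 4 && at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF)
        return BomEncoding::Utf32BE;
    if (n >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomEncoding::Utf8;
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomEncoding::Utf16LE;
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomEncoding::Utf16BE;

    // No BOM: the caller has to fall back to a default or a heuristic.
    return BomEncoding::None;
}
```

One caveat of this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, which is a known ambiguity of BOM sniffing in general.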

shenlebantongying commented 1 day ago

The file is UTF-16 without a BOM, so we cannot reliably detect the byte order.

On my Linux system, the encoding detected for annStream is UTF-8, and the text only displays more or less correctly by accident.

The file is wrong. This is not fixable.

shenlebantongying commented 1 day ago

I sent a short message to the dict author.

I don't think we can do anything here. The original code worked by accident in the original GD because QTextStream in Qt4/5 doesn't try to detect UTF-8.

xiaoyifang commented 15 hours ago

The file is UTF-16 without a BOM, so we cannot reliably detect the byte order.

I think maybe we can.

https://github.com/xiaoyifang/goldendict-ng/blob/5b70a7e081b655f3d7a90d1aa2fc0a65c16daff0/src/dict/dsl_details.cc#L875-L898
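
For illustration only, and not necessarily what the linked code does: one common heuristic for BOM-less UTF-16 is to count NUL bytes at even and odd offsets. ASCII characters such as spaces, newlines, and Latin letters have a zero high byte, so zeros cluster at odd offsets in little-endian data and at even offsets in big-endian data. The names `Utf16Order` and `guessUtf16Order` and the 2:1 threshold are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Utf16Order { Unknown, LittleEndian, BigEndian };

// Guess the byte order of BOM-less UTF-16 data from where NUL bytes fall.
Utf16Order guessUtf16Order(const std::string& bytes) {
    std::size_t zerosAtEven = 0;
    std::size_t zerosAtOdd = 0;

    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\0') {
            if (i % 2 == 0)
                ++zerosAtEven;
            else
                ++zerosAtOdd;
        }
    }

    // Require a clear majority (assumed threshold: more than 2:1).
    // With no NUL bytes at all, both tests fail and the result is Unknown.
    if (zerosAtOdd > 2 * zerosAtEven)
        return Utf16Order::LittleEndian;  // low byte first, zero high byte second
    if (zerosAtEven > 2 * zerosAtOdd)
        return Utf16Order::BigEndian;     // zero high byte first
    return Utf16Order::Unknown;
}
```

The heuristic is only as good as the amount of ASCII in the text: a description consisting purely of CJK characters contains no zero high bytes and comes back as `Unknown`, which is the unreliability raised earlier in the thread.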

dsl dictionary description parse issue on Windows #1974