yodalee / mozc

Mozc - a Japanese Input Method Editor designed for multi-platform
BSD 3-Clause "New" or "Revised" License

[Critical] Some bad words #88

Open yodalee opened 6 years ago

yodalee commented 6 years ago

I searched mozc-2.20.2677.102 with The Silver Searcher (ag):

$ cd mozc-2.20.2677.102/src/data/dictionary_oss/
$ ag チョン

dictionary02.txt
123014:あさだまおちょん 1896    1827    4003    浅田真央チョン
123015:あさだまおちょんざまあ  1896    1827    4643    浅田真央チョンザマア
123016:あべちょん    1898    1827    5846    安倍チョン
123031:しなちょん    1897    1827    6003    支那チョン

dictionary01.txt
105331:きかちょん    1817    1827    7908    帰化チョン
105332:きたちょん    1839    1827    3323    北チョン
105361:くそちょん    1827    1827    7099    糞チョン
105404:ざいにちちょん  1817    1827    6886    在日チョン
105435:ぜんいんちょん  1884    1827    7589    全員チョン
105478:ばかちょん    1817    1827    7618    馬鹿チョン
105479:ばかちょん    1906    1827    6602    馬鹿チョン
117139:ちょんしね    1827    687 7679    チョン死ね
$ ag 死ね

dictionary01.txt
117044:おおかわしね   1827    687 7457    大川死ね
117103:ざいにちしね   1817    687 6857    在日死ね
117105:しのだけしね   1827    687 7953    篠竹死ね
117139:ちょんしね    1827    687 7679    チョン死ね
117157:ねとうよしね   1827    687 7182    ネトウヨ死ね

dictionary06.txt
32107:ちょうせんのしょくふんぶんか    1817    1817    9348    朝鮮の食糞文化
Typing "ちょうせんの" causes "朝鮮の食糞文化" to be suggested.

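For anyone without ag installed, the same search can be sketched in Python. This is only a rough equivalent, not part of Mozc: `find_entries` and `Hit` are hypothetical names, and it assumes each row has five whitespace-separated fields (reading, left ID, right ID, cost, surface form), as the rows above appear to.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    file: str
    line_no: int
    reading: str
    surface: str


def find_entries(dict_dir: str, term: str) -> Iterator[Hit]:
    """Yield dictionary rows whose surface form contains `term`.

    Assumption: rows look like "reading lid rid cost surface",
    separated by tabs or spaces. Rows that do not fit are skipped.
    """
    for path in sorted(Path(dict_dir).glob("dictionary*.txt")):
        with path.open(encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                fields = line.split(None, 4)
                if len(fields) != 5:
                    continue  # not a five-field dictionary row
                reading, _lid, _rid, _cost, surface = fields
                surface = surface.strip()
                if term in surface:
                    yield Hit(path.name, line_no, reading, surface)
```

Calling `list(find_entries("src/data/dictionary_oss", term))` with the term of interest should list the same rows that ag reports, with the file name and line number for each.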
yodalee commented 6 years ago

Searching Google for "浅田真央チョンザマア" returns only 8 hits. I don't know where you got this word from, but I think you should remove some of the web sources used to build Mozc's dictionary. Sad and disappointed.

yodalee commented 6 years ago

This should definitely be fixed.

@gary816, you should make a pull request. It seems simple enough: just delete the lines in question, push to your GitHub fork, and click pull request.
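
Deleting the lines by hand works for a few entries; for a longer list, the deletion could be scripted. A minimal sketch, assuming the same five-field row layout seen in the search output (reading, left ID, right ID, cost, surface form). `drop_entries` is a hypothetical helper, not something that exists in the Mozc tree.

```python
from typing import Iterable, List, Tuple


def drop_entries(
    lines: Iterable[str], blocked_terms: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split dictionary lines into (kept, removed).

    A line is removed when its surface form (assumed to be the fifth
    whitespace-separated field) contains any blocked term. Lines that
    do not have five fields are kept unchanged.
    """
    blocked = [term for term in blocked_terms if term]
    kept: List[str] = []
    removed: List[str] = []
    for line in lines:
        fields = line.split(None, 4)
        surface = fields[4].strip() if len(fields) == 5 else ""
        if any(term in surface for term in blocked):
            removed.append(line)
        else:
            kept.append(line)
    return kept, removed
```

Returning the removed lines as well makes it easy to review exactly what would be deleted before writing the kept lines back to the file.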

yodalee commented 6 years ago

Unfortunately, the Mozc team is not accepting pull requests at the moment: https://github.com/google/mozc/blob/master/PULL_REQUEST_TEMPLATE.md

I agree that this problem should be fixed, so I will try to contact a maintainer.

yodalee commented 6 years ago

Thank you for the report. We will check the logic of the automatic data generation to reduce unnecessary data.

These files are internal data and are not intended to be used directly. They may contain arbitrary character sequences.

Thank you,