willbasky / hibet

Tibetan-English translator for CLI
BSD 3-Clause "New" or "Revised" License

Add dictionaries from the tibetan-dictionary repo (resolves #75 and #51) #83

Closed dignissimus closed 4 years ago

dignissimus commented 5 years ago

Ran the following script as ./add_dictionaries.sh ../tibetan-dictionary/_input/dictionaries/public/ dicts/

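#!/usr/bin/env bash
# Copy every dictionary from the source directory to the target directory,
# stripping the numeric prefix (e.g. "23-") and appending ".txt".
SOURCE_LOCATION="$1"
TARGET_LOCATION="$2"

for file in "$SOURCE_LOCATION"/*
do
    # Names without a numeric prefix are left unchanged by sed.
    new_name="$(basename "$file" | sed -E 's/[0-9]+-(.+)/\1.txt/')"
    # -n: do not overwrite a file that already exists in the target.
    cp -n "$file" "$TARGET_LOCATION/$new_name"
done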
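
To see what the `sed` expression in the script above does to a single file name, the renaming step can be tried on its own. This is only a minimal sketch: `strip_prefix` is a hypothetical helper name that is not part of the PR, and `23-GatewayToKnowledge` is used as a sample upstream file name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): apply the same sed expression
# as the copy script to one file name, dropping the numeric prefix and
# appending ".txt".
strip_prefix() {
    printf '%s\n' "$1" | sed -E 's/[0-9]+-(.+)/\1.txt/'
}

# Sample upstream name; prints the name the copy would be given.
strip_prefix "23-GatewayToKnowledge"
```

A name that has no `digits-` prefix does not match the pattern, so it passes through unchanged and does not get the `.txt` extension.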

Resolves #75 and #51



willbasky commented 5 years ago

Thank you!

I have a couple of questions.

  1. Why run this script if you have already added dictionaries to the main folder?
  2. What about other dictionaries from https://github.com/christiansteinert/tibetan-dictionary/tree/master/_input/dictionaries/public_en that are from English to Tibetan (Sanskrit)?

I see that some of the English dictionaries have the same names as the Tibetan ones, so I will extend the meta description to handle this.


dignissimus commented 5 years ago

Why run this script if you have already added dictionaries to the main folder?

I wrote the script to automate copying the dictionaries over, then ran it to add the dictionaries to the main folder.

What about other dictionaries from https://github.com/christiansteinert/tibetan-dictionary/tree/master/_input/dictionaries/public_en that are from English to Tibetan (Sanskrit)?

All the dictionaries in the public_en folder are inside the public folder, so there was no need to do both

willbasky commented 5 years ago

All the dictionaries in the public_en folder are inside the public folder, so there was no need to do both

Some of them or all of them have the same name with different content.

dignissimus commented 5 years ago

Some of them or all of them have the same name with different content.

Ah, which ones should be kept? The public_en ones?

dignissimus commented 5 years ago

I ran ./check_duplicates.sh tibetan-dictionary/_input/dictionaries/public tibetan-dictionary/_input/dictionaries/public_en with

#!/usr/bin/env bash
# Report files that exist under the same name in both directories and
# whether their contents match.
SOURCE_DIRECTORY="$1"
TARGET_DIRECTORY="$2"

for source_file in "$SOURCE_DIRECTORY"/*
do
    b_name=$(basename "$source_file")
    target_file="$TARGET_DIRECTORY/$b_name"
    if [ -f "$target_file" ]; then
        # Compare contents by MD5 hash.
        source_hash=$(md5sum "$source_file" | cut -d ' ' -f 1)
        target_hash=$(md5sum "$target_file" | cut -d ' ' -f 1)
        if [ "$source_hash" == "$target_hash" ]; then
            echo "$b_name is a duplicate in both directories ($source_hash)"
        else
            # Read from stdin so that wc prints only the byte count.
            source_size=$(wc -c < "$source_file")
            target_size=$(wc -c < "$target_file")
            echo "$b_name is not a duplicate in both directories ($source_hash, $target_hash)"
            if ((source_size > target_size)); then
                echo " - The file from the first directory is larger"
            elif ((source_size < target_size)); then
                echo " - The file from the second directory is larger"
            else
                echo " - Both files are the same size"
            fi
        fi
    fi
done

All the files from the public_en directory except for 23-GatewayToKnowledge are larger than their counterparts in the public directory. Should I keep the larger files?

willbasky commented 5 years ago

Wait a bit. I am reworking the meta so that new dictionaries can be added.

should I keep the larger files?

They are different, so we need to keep all of them by renaming. We need to take stock of them carefully.
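
Keeping both versions means that same-named files must not overwrite each other. A rough sketch of one way to do that is below. It rests on assumptions: `copy_with_suffix` is a hypothetical helper and the `-en` suffix is made up for illustration. Neither is hibet's actual naming scheme.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: copy the files of one directory into a target
# directory, appending a suffix to each name so that same-named files
# from different sources can coexist.
copy_with_suffix() {
    local source_dir="$1" target_dir="$2" suffix="$3"
    local file dest
    for file in "$source_dir"/*; do
        # Skip anything that is not a regular file (also covers an empty glob).
        [ -f "$file" ] || continue
        dest="$target_dir/$(basename "$file")$suffix"
        # Never overwrite an existing dictionary; report the collision instead.
        if [ -e "$dest" ]; then
            echo "skipping $file: $dest already exists" >&2
            continue
        fi
        cp "$file" "$dest"
    done
}

# Example calls (directory names and the "-en" suffix are assumptions):
# copy_with_suffix public    dicts ""
# copy_with_suffix public_en dicts "-en"
```

The existence check is explicit so that a collision is reported on stderr. `cp -n` would skip the file without saying so.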

willbasky commented 5 years ago

Hey! I have just pushed some changes to the titles file in #84. Paths now have an explicit format, so we can add new dictionaries with similar names.

This may be useful for issue #75.

I am not sure about #51 if the dictionaries are added by simply replacing the existing files, because I fixed some wrong formatting in some of them.

dignissimus commented 5 years ago

Hey! What do "T|E" and "T|S" stand for in the file names? Edit: Also, what changes should I add to this?

willbasky commented 5 years ago

What do "T|E" and "T|S" stand for in the file names?

Tibetan | English and Tibetan | Sanskrit

Also, what changes should I add to this?

When you add a new dictionary, information about it should be added to titles. When you add a renewed version of an existing dictionary, it is more complex if there are changes on both the hibet side and the importing side. The differences must be taken into account.