zorbaTheRainy / autotranslate

A script/docker that automatically translates PDFs using the DeepL API
MIT License

AutoTranslate

AutoTranslate is a small, lightweight application that uses the DeepL API to translate PDF documents from one language to another.

Features

AutoTranslate itself is free, and the DeepL API has a free option. That option has a monthly usage limit, which you can pay DeepL to increase.

Requirements

You must have a DeepL API account!

DeepL offers a free account. DeepL measures the quota in characters (500,000 per month for the free account), but each document, regardless of size, seems to consume 50,000 characters. Assuming you only translate documents, that means you can translate 10 documents a month for free.

DeepL also has paid accounts that let you translate millions of characters per month. At 50,000 characters per document, each million characters of quota covers 20 documents.
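The quota arithmetic above can be checked with a few lines of Python. This is only a sketch: the 50,000-character cost per document is the observation reported in this README rather than a documented guarantee, and the function name `documents_per_month` is made up for illustration.

```python
# Assumption taken from the text above: every document is billed as
# 50,000 characters, whatever its real size.
CHARS_PER_DOCUMENT = 50_000


def documents_per_month(character_quota: int,
                        chars_per_document: int = CHARS_PER_DOCUMENT) -> int:
    """Return how many whole documents fit into a monthly character quota."""
    if character_quota < 0 or chars_per_document <= 0:
        raise ValueError("quota must be >= 0 and chars_per_document must be > 0")
    return character_quota // chars_per_document


if __name__ == "__main__":
    print(documents_per_month(500_000))    # free account quota
    print(documents_per_month(1_000_000))  # one million characters
```

If you also translate plain text through the same account, the number of documents you can translate will be lower, because both draw on the same character quota.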

Please sign up for an account on the DeepL website.

Getting your DeepL API/Authentication Key

Getting AutoTranslate

AutoTranslate is a small Docker container. Pull the latest release from the Docker Hub:

$ docker pull zorbatherainy/autotranslate:latest

If you want to run the Python script outside of Docker, just copy it from the GitHub repository.

Using AutoTranslate

The simplest way to use AutoTranslate is to run the docker container.

Before we get to the Docker instructions, you should make sure everything is set up.

You will need:

Once you have created these you are ready for the Docker command.

I use Docker Compose. Converting the sample below into an equivalent docker run command is "left as an exercise for the reader".

Sample Docker compose

---
version: "2.1"

services:
    autotranslate:
        image: zorbatherainy/autotranslate
        container_name: autotranslate
        environment:
            - DEEPL_AUTH_KEY=aaa11111-a1aa-1a11-a1a1-a1a1aa11aaa1:aa              # (mandatory) get your key at https://www.deepl.com/account/summary
            # - DEEPL_TARGET_LANG=EN-US                                           # (near mandatory)  The language isn't really changeable on the fly.  Assumes you want all your files translated to a common language
            # - DEEPL_USAGE_RENEWAL_DAY=1                                         # (optional) If you put the day of the month your DeepL allowance resets, then the expired usage sleeping will be more accurate.  Otherwise, it just waits 7 days before trying again.
            # - CHECK_EVERY_X_MINUTES=15                                          # (optional) How often you want the inputDir scanned for new files
            # - ORIGINAL_BEFORE_TRANSLATION=0                                     # (optional) When appending the original and translated files, which should go first?
            # - TRANSLATE_FILENAME=1                                              # (optional) Should the filename also be translated?
        volumes:
            - /etc/localtime:/etc/localtime:ro                                    # (optional) Sync to host time
            - /volume1/translate/:/inputDir                                       # (mandatory) The directory where you put the un-translated file
            - /volume1/consume/:/outputDir                                        # (mandatory) The directory where AutoTranslate will put the translated file
            - /volume1/autotranslate_logs/:/logDir                                # (near mandatory) The directory where log files are stored, 1 per input file and a master log file

Environment variables and configuration

AutoTranslate variables and configurations are passed via environment variables.

The only mandatory variable is the DEEPL_AUTH_KEY.
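To make the defaults concrete, here is a minimal sketch of how a script could read these settings from the environment. It is not taken from autotranslate.py: the `Settings` class, the `load_settings` function, and the accepted spellings of "true" are all assumptions made for illustration. Only the variable names and default values come from this README.

```python
import os
from dataclasses import dataclass

# Assumption: which strings count as "true" (the sample compose file uses 0/1).
TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass
class Settings:
    auth_key: str
    target_lang: str = "EN-US"
    usage_renewal_day: int = 0              # 0 means "not set"
    check_every_x_minutes: int = 15
    original_before_translation: bool = False
    translate_filename: bool = True


def _as_bool(raw, default):
    """Interpret an environment string as a boolean, or return the default."""
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUE_VALUES


def _as_int(raw, default):
    """Interpret an environment string as an integer, or return the default."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (os.environ unless another is given)."""
    env = os.environ if env is None else env

    auth_key = env.get("DEEPL_AUTH_KEY", "").strip()
    if not auth_key:
        raise ValueError("DEEPL_AUTH_KEY is mandatory")

    renewal_day = _as_int(env.get("DEEPL_USAGE_RENEWAL_DAY"), 0)
    if not 1 <= renewal_day <= 31:
        renewal_day = 0  # values outside 1-31 are treated as not set

    minutes = _as_int(env.get("CHECK_EVERY_X_MINUTES"), 15)
    if minutes < 1:
        minutes = 15  # assumption: fall back to the default on nonsense values

    return Settings(
        auth_key=auth_key,
        target_lang=env.get("DEEPL_TARGET_LANG", "EN-US"),
        usage_renewal_day=renewal_day,
        check_every_x_minutes=minutes,
        original_before_translation=_as_bool(
            env.get("ORIGINAL_BEFORE_TRANSLATION"), False),
        translate_filename=_as_bool(env.get("TRANSLATE_FILENAME"), True),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to try out, for example `load_settings({"DEEPL_AUTH_KEY": "your-key"})`.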

| Env Variable | Default | Purpose |
| --- | --- | --- |
| DEEPL_AUTH_KEY | None | The Authentication Key from DeepL that allows you to use their API server. |
| DEEPL_TARGET_LANG | EN-US | The target language to translate the documents into. The original language will be auto-detected by DeepL. A list of language codes may be found in the DeepL API documentation under the target_lang parameter. |
| DEEPL_USAGE_RENEWAL_DAY | 0 | Eventually, you will try to translate more documents than your DeepL quota allows. At that point, the script stops translating and sleeps until your quota is renewed (the start of your new month). This variable tells the script which day of the month your quota renews, so the program can wake up and resume translating. For example, if your DeepL subscription renews on the 5th of the month, put "5" in this variable. If this variable is not set, the program wakes up every 7 days and checks whether your quota has been renewed. Acceptable values are 1-31. Values outside that range are treated as if the variable was not set. |
| CHECK_EVERY_X_MINUTES | 15 | The frequency at which the input directory will be scanned for any new files. |
| ORIGINAL_BEFORE_TRANSLATION | false | (> v2.1.3) When appending the original and translated files, which should go first? |
| TRANSLATE_FILENAME | true | (> v2.2.0) Should the filename also be translated? This is a bit experimental. |
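The sleep behaviour described for DEEPL_USAGE_RENEWAL_DAY can be sketched as a small date calculation. This is an illustration, not the code in autotranslate.py. The function name `next_wake_up` is hypothetical, and two details are assumptions: a renewal day that does not exist in a short month (for example 31 in February) is clamped to the last day of that month, and a renewal day equal to today rolls over to the following month.

```python
import calendar
from datetime import date, timedelta

FALLBACK_SLEEP_DAYS = 7  # used when no valid renewal day is configured


def next_wake_up(today: date, renewal_day: int = 0) -> date:
    """Return the date on which to check the DeepL quota again."""
    if not 1 <= renewal_day <= 31:
        # Variable not set (or out of range): just try again in 7 days.
        return today + timedelta(days=FALLBACK_SLEEP_DAYS)

    def renewal_in(year: int, month: int) -> date:
        # Assumption: clamp to the last day of months that are too short.
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, min(renewal_day, last_day))

    candidate = renewal_in(today.year, today.month)
    if candidate > today:
        return candidate

    # The renewal day has already passed this month, so use next month.
    if today.month == 12:
        return renewal_in(today.year + 1, 1)
    return renewal_in(today.year, today.month + 1)


if __name__ == "__main__":
    print(next_wake_up(date.today(), renewal_day=5))
    print(next_wake_up(date.today()))  # falls back to 7 days from today
```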

Volumes

| Volume | Purpose |
| --- | --- |
| inputDir | The directory where you put the untranslated PDF files. Non-PDF files will be ignored. |
| outputDir | The directory where AutoTranslate will put the translated PDF files, which have been appended to the original untranslated file. |
| logDir | The directory where log files are stored, 1 per input file and a master log file (_autotranslate.log). |
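As an illustration of the "non-PDF files will be ignored" rule, one scan of the input directory might look like the sketch below. The function name `find_pdfs` is made up, and matching on a case-insensitive `.pdf` extension (rather than inspecting file contents) is an assumption. The real script may decide differently.

```python
import tempfile
from pathlib import Path


def find_pdfs(input_dir):
    """Return the PDF files directly inside input_dir, sorted by name."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        # Skip sub-directories and anything without a .pdf extension.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # Demonstrate with a throwaway directory instead of a real inputDir.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("letter.pdf", "scan.PDF", "notes.txt"):
            (Path(tmp) / name).write_bytes(b"")
        for pdf in find_pdfs(tmp):
            print(pdf.name)
```

A long-running container would call something like this once every CHECK_EVERY_X_MINUTES minutes and hand each new file to the translation step.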

Other random quirks you may be interested in

Thoughts on use with Paperless

This project started because I was tired of feeding documents one at a time into the Google Translate website. The Google API costs money; the DeepL API does not. I decided to use the DeepL API.

At about the same time, I started to look into Paperless, "a document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper."

AutoTranslate works really well with Paperless.

Auto-Tagging and Consumption by Paperless

Like AutoTranslate, Paperless works on a system where there is a monitored directory (named "consume") and files placed in that directory are moved to the Paperless database. Chaining the AutoTranslate output directory to the Paperless input/consume directory works very well.

To turbo-charge this, make use of Paperless's auto-tagging feature. Paperless can be set up to have sub-directories in the input/consume directory and to automatically tag any files in there with the sub-directory's name. By placing the output of AutoTranslate into a Paperless sub-directory, you can tag (in Paperless) those files with the original language or another tag.

For example, a directory named:

/paperless_data/consume/german

would result in Paperless importing files (in /consume/german) tagged with "german". This allows you to separate those from non-translated files in Paperless.

AutoTranslate doesn't use separate output directories for each source language, but I am assuming you really only translate from one language (probably that of the foreign country you live in). If you have multiple languages, you could run multiple instances of the container, each pointing to a different sub-directory.

To make this work, you need to set some Paperless ENV variables (via Paperless's docker compose file). Specifically:

PAPERLESS_CONSUMER_RECURSIVE: 1
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: 1

For details, see the consume configuration section of the Paperless documentation.

Pre-Consumption Script

You could use the autotranslate.py Python script directly as a Paperless Pre-Consumption script. The Pre-Consumption script function is described in the Paperless documentation.

I don't recommend this, because it would run every file imported/consumed into Paperless through the DeepL API, eating up your monthly quota. But it is possible.

Is it Abandoned?

If this project has not been updated in a while, everyone asks, "Is it abandoned?"

No, probably not.

It is probably just done.

When I get this to a state where it doesn't need improvement, I won't improve it. I am not making improvements just to show "progress". I use this too much to just ignore it. If it works and I am happy, I won't change it. If it breaks and I need it, I'll fix it.

But if it works well enough, I'll do something else (the urge to tinker aside).

License

MIT
