Saturday, July 10, 2021

Improving the Tellsis language translator: Update 1

I mentioned that I want to improve my Tellsis language translator app to include an image picker and OCR. So I spent last night trying to train Tesseract to recognise Tellsis alphabets.

First, why Tesseract? Because I read somewhere that Tesseract can be trained to recognise a new font, so I figured it would be easy to train it to recognise Tellsis. Or so I thought. The usual method for teaching Tesseract a new font starts from an existing trained model: for example, taking the model already trained to recognise English alphabets, a set of training data is created in the new font and used to fine-tune the existing English model. Using the script here as a hint, this was what I tried to do.
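
In outline, that fine-tuning route looks something like this (paths follow the layout of my full script further down; the iteration count is just illustrative):

# Extract the LSTM network from the stock English model
combine_tessdata -e ~/github/tesseract/tessdata/eng.traineddata ~/github/tesseract/eng.lstm

# Continue training from the English network on lines rendered in the new font
./lstmtraining \
  --continue_from ~/github/tesseract/eng.lstm \
  --old_traineddata ~/github/tesseract/tessdata/eng.traineddata \
  --traineddata ~/github/tesseract/train/eng/eng.traineddata \
  --train_listfile ~/github/tesseract/train/eng.training_files.txt \
  --model_output ~/github/tesseract/output/telsis_tuned \
  --max_iterations 400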

But it didn't work that well, because the model is already trained to recognise English, and the character for "U" in Tellsis looks like an "O" in English. So the fine-tuned model will forever read a Tellsis "U" as "O". Not good.

So I need to train a model from scratch. I used the instructions here to tweak my script to train from scratch. This took some time, but I was able to get satisfactory results... except for some characters, which end up capitalised when they should remain as small letters. This is supposedly the hallucination effect, which I think is caused by my using English text to create the training data. The better way is to find text in Tamil script, convert that to unaccented English alphabets, then carry out the substitution cipher to obtain text in Tellsis. The Tellsis text can then be used to create training data, and the resulting model should be able to avoid the hallucination effect.

Problem is, I don't know how to convert Tamil to unaccented English alphabets.
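
One thing I should try: ICU's transliterator can supposedly romanize Indic scripts and then fold the result down to plain ASCII in one pass. Something like this, using the uconv tool that ships with ICU (untested on my side, so treat it as a guess):

echo 'தமிழ்' | uconv -f utf-8 -t utf-8 -x 'Any-Latin; Latin-ASCII'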

As for the English-to-Tellsis step, one way is to adapt the existing Tellsis language translator to read an entire file of English text, translate it into Tellsis, and save the result, then use that new file to generate the training data. This sounds like yet more work... and I was up into the wee hours last night trying to figure out how to train Tesseract, so I am a bit sleepy now...
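
Then again, once the text is in plain English letters, the substitution itself is something the shell can already do over a whole file. A rough sketch with a placeholder permutation (I am not reproducing the actual Tellsis substitution table here):

# Placeholder permutation -- swap in the real Tellsis cipher table
PLAIN='abcdefghijklmnopqrstuvwxyz'
CIPHER='zyxwvutsrqponmlkjihgfedcba'
tr "$PLAIN" "$CIPHER" < english.txt > telsis.txt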

Anyway, as for selecting an image or capturing one with the camera, image_picker does not work on Flutter Desktop, so I will need to use the universal_platform package to identify the platform, then run image_picker on Android and file_selector on Linux and Windows. Another challenge for another day...

By the way, I don't own any Apple products, so I cannot develop for iOS or MacOS. The only time I have used an Apple product was many years ago in school when we found an old Apple IIe in a corner of the computer club room and plugged it in to find that it could actually still boot up.

Back to the topic of Tesseract. I am trying to train a model from scratch using two fonts. This is the script. (I am still playing around with the script, so some parameters may change over time.)

# Remove the previously generated training data
rm -rf train/*

# Generate training data
MAX_PAGES=100
NUM_ITERATIONS=10000
cd src/training
./tesstrain.sh --fonts_dir ~/github/tesseract/fonts \
  --fontlist "Automemoryfont" "TellsisTyped" \
  --lang eng --linedata_only \
  --langdata_dir ~/github/tesseract/langdata_lstm \
  --tessdata_dir ~/github/tesseract/tessdata \
  --maxpages $MAX_PAGES \
  --output_dir ~/github/tesseract/train
cd ../..

# Train the model from scratch
rm -rf output/*
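# Note: eval_listfile reuses the training list below, so the reported error rate is optimistic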
OMP_THREAD_LIMIT=8 ./lstmtraining --debug_interval -1 \
  --traineddata ~/github/tesseract/train/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/github/tesseract/output/telsis --learning_rate 20e-4 \
  --train_listfile ~/github/tesseract/train/eng.training_files.txt \
  --eval_listfile ~/github/tesseract/train/eng.training_files.txt \
  --max_iterations $NUM_ITERATIONS
#  --max_iterations $NUM_ITERATIONS &>~/github/tesseract/output/basetrain_typed.log

# Combine the checkpoints and create the final model
./lstmtraining --stop_training \
  --continue_from ~/github/tesseract/output/telsis_checkpoint \
  --traineddata ~/github/tesseract/train/eng/eng.traineddata \
  --model_output ~/github/tesseract/output/telsis.traineddata

cp ~/github/tesseract/output/telsis.traineddata ~/github/tesseract/tessdata/telsis_typed.traineddata
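
Once the model is copied into tessdata, a quick sanity check from the command line looks like this (telsis_sample.png is just a stand-in for whatever test image you have):

tesseract telsis_sample.png stdout --tessdata-dir ~/github/tesseract/tessdata -l telsis_typed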


Update July 26, 2021: I have updated the app to include this trained model, so right now the app (v0.1.4_alpha) has OCR capabilities. But the trained model is based on the font I made, so it performs very poorly on actual text. If anyone wants to work on training data for the model using text found in the anime, please feel free to use the trained model here to improve it. As the tesseract_ocr package only works on mobile devices, this feature has only been tested on Android (I don't have an Apple product).
