Fixing Romanian Text Files

2021-02-13

Problem

You have a text file containing strings in Romanian, with the following defects:

  • the file is encoded as ISO 8859-2 (probably produced on a legacy operating system)
  • the diacritics are wrong (i.e. with cedilla instead of comma: Ş/ş instead of Ș/ș and Ţ/ţ instead of Ț/ț)

As a result, a program that decodes the file as ISO 8859-1 (Latin-1) displays ª/º and Þ/þ in place of these letters, while a program that expects UTF-8 sees invalid byte sequences.

You want to fix the file encoding and to use the correct diacritics.
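
To see where the ª/º and Þ/þ symptoms come from, here is a small sketch that decodes the same four bytes twice, once as ISO 8859-2 and once as ISO 8859-1 (Latin-1). The variable names are made up for illustration, and the sketch assumes an iconv that knows both charset names.

```shell
# The bytes AA BA DE FE, written as octal escapes so that any printf accepts them.
# In ISO 8859-2 they are the cedilla letters S/s and T/t.
bytes='\252\272\336\376'

# Decoded with the charset the file was really written in.
as_latin2=$(printf "$bytes" | iconv -f ISO-8859-2 -t UTF-8)

# Decoded with the wrong (Western European) charset.
as_latin1=$(printf "$bytes" | iconv -f ISO-8859-1 -t UTF-8)

printf 'ISO 8859-2: %s\nISO 8859-1: %s\n' "$as_latin2" "$as_latin1"
# ISO 8859-2: ŞşŢţ
# ISO 8859-1: ªºÞþ
```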

Solution

TL;DR

$ iconv -f ISO-8859-2 -t UTF-8 < bad | sed 'y/ŞşŢţÃã/ȘșȚțĂă/' > good

Step by Step

Let's take the letter Ș as an example. The following table shows the byte representation of the majuscule and the minuscule for both wrong and correct versions.

Character            ISO 8859-2   UTF-8
Ş (cedilla, wrong)   AA           C5 9E
ş (cedilla, wrong)   BA           C5 9F
Ș (comma, correct)   n/a          C8 98
ș (comma, correct)   n/a          C8 99

The plan is to start from 0xAA, convert the entire file to UTF-8 in order to obtain 0xC5 0x9E and then replace S-cedilla with S-comma, ending up with 0xC8 0x98 (and the same for the other possibly wrong characters).

We could also try to replace 0xAA with 0xC8 0x98 directly. However, we would have to take care of all non-ASCII characters, otherwise we would get a file with mixed (broken) encoding. Therefore it's safer to convert the entire file from ISO 8859-2 to UTF-8 as a first step.
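
The mixed-encoding risk can be illustrated with a short sketch. It patches only the ş byte in a hypothetical sample ("şi în" in ISO 8859-2) and then asks iconv whether the result is valid UTF-8. The sample text and variable names are assumptions chosen for the demonstration.

```shell
# "şi în" in ISO 8859-2: ş is 0xBA (octal 272), î is 0xEE (octal 356).
sample='\272i \356n'

old=$(printf '\272')       # ş as a single ISO 8859-2 byte
new=$(printf '\310\231')   # ș as UTF-8 (C8 99)

# Byte patch: replace only the ş byte and leave every other byte alone.
# LC_ALL=C makes sed treat its input as raw bytes.
if printf "$sample" | LC_ALL=C sed "s/$old/$new/g" |
   iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
then naive=valid; else naive=invalid; fi

# Full conversion: let iconv re-encode every byte first.
if printf "$sample" | iconv -f ISO-8859-2 -t UTF-8 |
   iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
then full=valid; else full=invalid; fi

echo "byte patch: $naive, full conversion: $full"
```

The byte patch leaves the î as a lone 0xEE, which is not a valid UTF-8 sequence, so the check rejects it. The full conversion passes.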

Let's prepare a file with the wrong encoding and the wrong characters, for example uppercase S-cedilla (0xAA) and lowercase S-cedilla (0xBA). We're going to name the file sh1.

$ printf "\xaa\xba" > sh1

Make sure the file is detected as ISO-8859.

$ file sh1
sh1: ISO-8859 text, with no line terminators

Look at the binary content to make sure the 2 bytes got there as intended.

$ xxd sh1
00000000: aaba                                     ..

Convert the sh1 file to UTF-8 using iconv and save the result to a new file, sh2.

$ iconv -f "ISO-8859-2" -t "UTF-8" sh1 -o sh2

Make sure the new file is detected as UTF-8.

$ file sh2
sh2: UTF-8 Unicode text, with no line terminators

Look at the binary content of the new file to make sure we now have 4 bytes for the UTF-8 representation of the 2 characters.

$ xxd sh2
00000000: c59e c59f                                ....

Use sed to read from the intermediate file sh2 and write to a new file, sh3. The y command translates each character in the first list to its counterpart in the second.

$ sed 'y/ŞşŢţÃã/ȘșȚțĂă/' < sh2 > sh3
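
The y command works on characters as the current locale defines them, so it needs sed to run in a UTF-8 locale in order to treat each two-byte letter as one character. A possible variant that does not depend on the locale is to use one s command per letter, which matches the exact UTF-8 byte sequence. The sketch below covers only the four cedilla letters named in the Problem section and pipes the result through od instead of writing sh3. Both choices are for illustration.

```shell
# Same content as sh2: the cedilla letters in UTF-8 (C5 9E C5 9F), as octal escapes.
fixed=$(printf '\305\236\305\237' |
  LC_ALL=C sed -e 's/Ş/Ș/g' -e 's/ş/ș/g' -e 's/Ţ/Ț/g' -e 's/ţ/ț/g' |
  od -An -tx1 | tr -d ' \n')

echo "$fixed"   # c898c899
```

Under LC_ALL=C sed compares plain bytes. That is safe here because a valid UTF-8 stream contains the byte pair C5 9E only where the letter Ş occurs, and the same holds for the other three letters.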

Look at the binary content of the final file to make sure we still have 4 bytes for the UTF-8 representation of the 2 characters, this time with the new values.

$ xxd sh3
00000000: c898 c899                                ....

Now that we know what works, we can combine everything in one step and skip the intermediary file.

$ iconv -f ISO-8859-2 -t UTF-8 < bad | sed 'y/ŞşŢţÃã/ȘșȚțĂă/' > good
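
After running the one-liner on real files, a quick check can confirm the result. The sketch below defines check_ro, a hypothetical helper (not part of any tool), that succeeds only if a file is valid UTF-8 and contains none of the four cedilla letters. The two sample files are created only for the demonstration.

```shell
# check_ro FILE: exit status 0 if FILE is valid UTF-8 and free of cedilla letters.
check_ro() {
  # iconv fails on any byte sequence that is not valid UTF-8.
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1 || return 1
  # grep -q succeeds if a cedilla letter is found, so invert its status.
  ! LC_ALL=C grep -q -e 'Ş' -e 'ş' -e 'Ţ' -e 'ţ' "$1"
}

tmp=$(mktemp -d)
printf '\310\230\310\231\n' > "$tmp/good"   # comma letters, C8 98 C8 99
printf '\305\236\305\237\n' > "$tmp/bad"    # cedilla letters, C5 9E C5 9F

check_ro "$tmp/good" && good_ok=yes || good_ok=no
check_ro "$tmp/bad"  && bad_ok=yes  || bad_ok=no
echo "good: $good_ok, bad: $bad_ok"

rm -r "$tmp"
```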