2021-02-13
Problem
You have a text file containing strings in Romanian, with the following defects:
- the file is encoded as ISO 8859-2 (probably produced on a legacy operating system)
- the diacritics are wrong (i.e. with cedilla instead of comma: Ş/ş instead of Ș/ș and Ţ/ţ instead of Ț/ț)
You want to fix the file encoding and to use the correct diacritics.
Solution
TL;DR
$ iconv -f ISO-8859-2 -t UTF-8 < bad | sed 'y/ŞşŢţÃã/ȘșȚțĂă/' > good
Step by Step
Let's take the letter Ș as an example.
The following table shows the byte representation
of the majuscule and the minuscule for both wrong and correct versions.
Character | ISO 8859-2 | UTF-8 |
---|---|---|
Ş (cedilla, wrong) | AA | C5 9E |
ş (cedilla, wrong) | BA | C5 9F |
Ș (comma, correct) | n/a | C8 98 |
ș (comma, correct) | n/a | C8 99 |
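The UTF-8 column for the two cedilla forms can be reproduced with iconv itself. The following is only a sketch: the helper name `latin2_to_utf8_hex` is made up for this example, the byte is passed as three octal digits (252 is 0xAA, 272 is 0xBA), and `od` is used instead of `xxd` so that the output is a bare hex string.

```shell
# Hypothetical helper: print the UTF-8 bytes that iconv produces
# for a single ISO 8859-2 byte, given as three octal digits.
latin2_to_utf8_hex() {
    printf "\\$1" | iconv -f ISO-8859-2 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

latin2_to_utf8_hex 252; echo   # 0xAA, S-cedilla uppercase
latin2_to_utf8_hex 272; echo   # 0xBA, s-cedilla lowercase
```

The two lines printed should match the UTF-8 column of the first two table rows.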
The plan is to start from 0xAA, convert the entire file to UTF-8 in order to obtain 0xC5 0x9E, and then replace S-cedilla with S-comma, ending up with 0xC8 0x98 (and the same for the other possibly wrong characters).
We could also try to replace 0xAA with 0xC8 0x98 directly.
However, we would have to take care of all non-ASCII characters,
otherwise we would get a file with mixed (broken) encoding.
Therefore it's safer to convert the entire file from ISO 8859-2 to UTF-8 as a first step.
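To illustrate the mixed-encoding risk, here is a small sketch of a file in which only the S byte was replaced by hand while another ISO 8859-2 byte (0xE2, the letter â) was left as it was. The helper name `is_valid_utf8` and the file name `mixed` are made up for this example, and the check assumes that iconv exits with a non-zero status when it meets an invalid input sequence.

```shell
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# C8 98 is a proper UTF-8 S-comma, but the lone E2 byte after it is not valid UTF-8.
printf '\310\230\342\n' > mixed
is_valid_utf8 mixed || echo "mixed: not valid UTF-8"
```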
Let's prepare a file with the wrong encoding and the wrong characters.
For example, uppercase S-cedilla (0xAA) and lowercase S-cedilla (0xBA).
We're going to name the file sh1.
$ printf "\xaa\xba" > sh1
Make sure the file is detected as ISO-8859.
$ file sh1
sh1: ISO-8859 text, with no line terminators
Look at the binary content, make sure the 2 bytes got there as intended.
$ xxd sh1
00000000: aaba ..
Convert the sh1 file to UTF-8 using iconv and save the result to a new file, sh2.
$ iconv -f "ISO-8859-2" -t "UTF-8" sh1 -o sh2
Make sure the new file is detected as UTF-8.
$ file sh2
sh2: UTF-8 Unicode text, with no line terminators
Look at the binary content of the new file, make sure we now have 4 bytes for the UTF-8 representation of the 2 characters.
$ xxd sh2
00000000: c59e c59f ....
Use sed to read from the intermediary file sh2 and write to a new file, sh3.
The y command translates each character to its counterpart.
$ sed 'y/ŞşŢţÃã/ȘșȚțĂă/' < sh2 > sh3
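The y command maps single characters, so it relies on sed decoding its script and its input as UTF-8, which usually means running under a UTF-8 locale. If that cannot be assumed, one possible alternative is a series of s commands, which match the literal byte sequences of each letter. This is a sketch only, and the function name `fix_diacritics` is made up for this example.

```shell
# Hypothetical helper: same six replacements as the y command,
# written as one s command per letter.
fix_diacritics() {
    sed -e 's/Ş/Ș/g' -e 's/ş/ș/g' \
        -e 's/Ţ/Ț/g' -e 's/ţ/ț/g' \
        -e 's/Ã/Ă/g' -e 's/ã/ă/g'
}

# Feed it the same two characters as sh2 (C5 9E, C5 9F) and dump the result.
printf '\305\236\305\237\n' | fix_diacritics | od -An -tx1
```

The dump should show the comma-below bytes from the table, followed by the newline byte.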
Look at the binary content of the final file, make sure we still have 4 bytes for the UTF-8 representation of the 2 characters, but this time with the new values.
$ xxd sh3
00000000: c898 c899 ....
Now that we know what works, we can combine everything in one step and skip the intermediary file.
$ iconv -f ISO-8859-2 -t UTF-8 < bad | sed 'y/ŞşŢţÃã/ȘșȚțĂă/' > good
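After the conversion it may be worth checking that no cedilla forms are left in the output. A minimal sketch, with a made-up function name and a made-up sample file, that counts the lines still containing one of the four cedilla letters:

```shell
# Hypothetical check: count the lines of a UTF-8 file that still contain
# a cedilla form. grep -c prints 0 (and exits with status 1) for a clean file.
count_cedilla_lines() {
    grep -c -e 'Ş' -e 'ş' -e 'Ţ' -e 'ţ' "$1"
}

# Sample: line 1 ends in the correct s-comma (C8 99), line 2 still ends in s-cedilla (C5 9F).
printf 'co\310\231\nco\305\237\n' > sample
count_cedilla_lines sample
```

Running the same function on the converted file (good, in the pipeline above) should print 0 if every cedilla form was replaced.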