ASCII to Unicode Converter for Assamese scripts
Problem Statement:
Convert Assamese text written in ASCII-based fonts, using the Ramdhenu DTP software or Adarsha Ratne Lipi, into Unicode text.
Why Unicode:
Assamese text typed with Ramdhenu or Adarsha Ratne fonts works well for print, because these fonts have better coverage of complex juktakhors and support designer, professional-looking glyphs. However, these tools store the text as ASCII characters: essentially English characters, with the font attaching the glyph of an Assamese letter or number to each one. This has a few disadvantages:
- The text shares the ASCII character values of the Roman script, so a font or software with a specific mapping is required to display the Assamese characters. This mapping is not standardized, so text written in Ramdhenu will not display properly in Adarsha Ratne. This is a major blocker for publishing content on the Internet, as end users may not have this software because of technological or licensing restrictions.
- Because the text shares character codes with the Roman script, writing software to manipulate it is not easy: code cannot tell an Assamese character from an English one, since both have the same character code or identity.
On the other hand, Unicode allocates an individual code, or identity, to each character for a wide variety of languages, including all major Indian scripts (e.g. Devanagari, Tamil, Bengali). This lets any modern computing system correctly and uniquely identify each character of a script.
So when content written in ASCII-based fonts is converted to the Unicode character set, users can view it with freely distributed Unicode fonts. All newer releases of Windows, Linux, Mac and iPhone OS come bundled with Unicode-enabled fonts at no extra cost, and all major Internet browsers such as Firefox, Safari, Internet Explorer, Opera and Chrome support UTF-8 encoded websites. The only issue with Unicode is that the display is generic and may not be the ideal representation of Assamese characters. Also, the end-user experience may differ between a user visiting a website from Windows and one visiting from Linux.
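As a small illustration of this difference, the Python sketch below checks a character's code point. Python is used only for illustration here. The sketch relies on the fact that Unicode encodes Assamese letters in the Bengali block (U+0980 to U+09FF); the helper name is_assamese_char is made up for this example.

```python
import unicodedata

BENGALI_BLOCK = range(0x0980, 0x0A00)  # Unicode block shared by Bengali and Assamese

def is_assamese_char(ch: str) -> bool:
    """True if ch lies in the Unicode block used for Assamese text."""
    return ord(ch) in BENGALI_BLOCK

for ch in ("\u0995", "k"):  # the letter "ko" and the Latin letter k
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_assamese_char(ch))
```

With ASCII-based fonts no such check is possible, because the stored character is the Latin one and only the font makes it look Assamese.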
Intended users:
There is a huge amount of Assamese text residing with the publishing houses and news media houses in Assam, written using ASCII-based DTP tools. Such content is very useful for printing but useless for sharing on the Internet. Unless these texts are converted into Unicode, they will remain locked away from a wide audience that can be reached only over the web.
The owners of this content can use this tool to convert it into Unicode and publish it on the web.
The Design:
The tool can be designed in two ways:
- Online version: A lightweight, preferably web-based utility that takes plain text as input and converts it into Unicode. This model is useful when the content is small and the formatting does not need to be preserved.
- Full-fledged converter (or the desktop version): This should be able to read binary documents such as Word documents or PDFs containing multiple ASCII fonts, convert the text into Unicode and preserve the formatting at the same time. This would require installing the software on the user's desktop. It might be a VB macro embedded in MS Word that reads the document within Word and runs the conversion.
The prototypes:
As a prototype or proof of concept, I took up developing the online version of this tool. It is developed in PHP and deployed on an Apache web server. It was developed for two different ASCII sources as input.
One is for converting from Adarsha Ratne Lipi; a functional program is running here: http://www.xophura.org/a2u/a2u.php. This is the original version of the program, and I used it to convert most of the legacy texts of www.xophura.org into Unicode. It does not have a fully developed converter for the Ramdhenu fonts.
The other works with Ramdhenu fonts; a semi-functional model is available here: http://www.xophura.org/a2u2/a2u.php
These prototypes are free to use, and one can provide ASCII text as input to get the corresponding Unicode output.
Code walk-through of the converter:
While converting text from Ramdhenu, the following points should be noted:
- Typing in Ramdhenu is position-based, i.e. when somebody wants to write “kiba eta (কিবা এটা)”, they type the “hrosso e kar” (ি) first and then “ko” (ক). For the next letter, however, it is “ba” (ব) and then “aa kar” (া). This is the same way one writes by hand on paper, completing the symbols from left to right. In Unicode, the text is typed the way it is spelled: every “hrosso e kar” (ি) or “aa kar” (া) is typed after the letter. The Unicode font carries the grammar that decides the display position of these “kar”s. So in Unicode, “ko” + “hrosso e kar” results in “ki”, but in Ramdhenu it has to be typed in the reverse order. The converter therefore has to shuffle the position of such characters (e.g. “hrosso e kar” ি) while converting from ASCII to Unicode.
- Most of the “juktakhor”s in Ramdhenu are single ASCII characters with a monolithic glyph for display. This enhances the display or print quality of the content but causes complexities when the text is presented in HTML. A “juktakhor” in Unicode is represented as a combination of multiple characters, including the “byonjon sinho”. So the “koi toi akto (ক্ত)” in Unicode is actually a combination of three Unicode characters: “ko (ক)” + “byonjon sinho (্)” + “to (ত)”.
- Ramdhenu also uses specific symbols to build specifically shaped letters or juktakhors, e.g. it has one symbol for “toi toi atto (ত্ত)” and another symbol for the hook of “ko”. When the two are typed and placed together, they show up as “koi toi akto (ক্ত)” in print.
- Ramdhenu typically uses three sets of fonts together to display different characters using the same ASCII codes. So when the symbol for 1 is formatted with the “Geetanjali Lite” font it shows up as “ro”, and when the same symbol is formatted with the “Geetanjali P” font it shows up as the Assamese number “ek (১)”. This makes it impossible to display Assamese content written in Ramdhenu as plain text.
- For “o kar” and “aou kar”, Ramdhenu uses the “e kar” and “aa kar” symbols typed before and after the Assamese letter. Unicode, on the other hand, has specific characters to represent them, and the font's grammar rules internally place them around the letter when displayed. So the converter needs to look for the combination of “e kar” and “aa kar” around a letter or compound letter (juktakhor) and convert it to a single “o kar” or “aou kar”.
Due to the positional differences described above and the many-to-many mapping of ASCII codes to Unicode characters, the conversion has to be done in multiple passes. For each source (i.e. Ramdhenu or Adarsha Ratne), I have created a mapping table (either in XML or stored in a DB) that maps each ASCII character in the source to a Unicode character or a combination of Unicode characters (symbol to juktakhor mapping). The mapping table also stores the Unicode character code along with the Unicode character value. Because of the many-to-many relationship, some characters have to be converted before others, so a conversion preference of 1 to 10 is defined for each mapping. Mappings with a smaller preference number are converted before those with a higher one.
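To make the Unicode side of these points concrete, the following Python sketch (an illustration, not part of the PHP prototype) lists the code points that Unicode actually stores for “ki”, for the juktakhor “koi toi akto” and for “ko” with “o kar”. The function name code_points is made up for this example.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """List the stored code points of text, in logical (spelling) order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

KI = "\u0995\u09BF"          # ko + hrosso e kar: the sign is stored after the letter
KTA = "\u0995\u09CD\u09A4"   # ko + byonjon sinho + to: three characters, one juktakhor
E_AA = "\u0995\u09C7\u09BE"  # ko + e kar + aa kar, as a naive conversion would leave it
KO = "\u0995\u09CB"          # ko + o kar: the single-sign form

for text in (KI, KTA, KO):
    print(text, code_points(text))

# Unicode normalization (NFC) treats "e kar + aa kar" as equivalent to "o kar".
print(unicodedata.normalize("NFC", E_AA) == KO)
```

The last line shows that the Unicode standard itself regards the two-part spelling as equivalent to the single “o kar”, which is the same merge the converter has to perform.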
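A minimal Python sketch of such a preference-ordered mapping table is shown below. It is not the prototype's PHP code: the ASCII keys "W", "+" and "k" are hypothetical placeholders rather than real Ramdhenu or Adarsha Ratne codes, the names Mapping and apply_mappings are made up, and the separate Unicode code column is left out for brevity.

```python
from typing import NamedTuple

class Mapping(NamedTuple):
    source: str      # ASCII character(s) as they appear in the legacy text
    target: str      # Unicode character or character combination
    preference: int  # 1 to 10; smaller numbers are converted first

# Hypothetical entries: the ASCII keys are placeholders, not real font codes.
MAPPINGS = [
    Mapping("W", "\u09A4\u09CD\u09A4", 5),   # symbol for "toi toi atto"
    Mapping("W+", "\u0995\u09CD\u09A4", 1),  # same symbol + hook of "ko" -> "koi toi akto"
    Mapping("k", "\u0995", 5),               # "ko"
]

def apply_mappings(text: str, mappings: list[Mapping]) -> str:
    """Apply every mapping to the whole text, lowest preference number first."""
    for m in sorted(mappings, key=lambda m: m.preference):
        text = text.replace(m.source, m.target)
    return text

print(apply_mappings("W+kW", MAPPINGS))
```

If the single-symbol mapping for "W" ran first, the "W" inside "W+" would already be consumed and the combined juktakhor could never be matched, which is why the two-symbol entry carries the smaller preference number.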
Here is the flow of the converter prototype written for plain text conversion (the input is plain text of ASCII characters, the output is Unicode Assamese characters):
- Read the full input text into a variable
- Load the ASCII to Unicode mapping table into an array
- Convert each character of the ASCII input into the corresponding Unicode code (not value) as per the mapping, in the order of mapping preference.
- Look for the “ae kar”, “hrosso e kar” and “oi kar” and move each of them one position later, after the letter that follows it, i.e. [ “hrosso e kar” + “ko” ] -> [ “ko” + “hrosso e kar” ]
- Similarly, move a “sondro bindu” that comes before an “aa kar” or “u kar” to after that “kar”
- Convert all “ae kar” + “aa kar” to “o kar”
- Convert all the Unicode Codes into actual Unicode Values.
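The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the prototype's PHP code: the ASCII keys in ASCII_TO_CODE are hypothetical, every key is a single character (so preference ordering is not needed here), and a sign is only moved past a single letter, not past a whole juktakhor.

```python
# Code points of the signs that need repositioning.
I_KAR, E_KAR, OI_KAR = 0x09BF, 0x09C7, 0x09C8  # typed before the letter in the ASCII fonts
AA_KAR, U_KAR, O_KAR = 0x09BE, 0x09C1, 0x09CB
SONDRO_BINDU = 0x0981
PRE_BASE = {I_KAR, E_KAR, OI_KAR}

# Hypothetical ASCII keys, for illustration only.
ASCII_TO_CODE = {
    "k": 0x0995, "b": 0x09AC,
    "i": I_KAR, "e": E_KAR, "a": AA_KAR, "~": SONDRO_BINDU,
}

def to_codes(text):
    """Step 3: ASCII characters -> Unicode codes; unmapped characters keep their own code."""
    return [ASCII_TO_CODE.get(ch, ord(ch)) for ch in text]

def swap_pairs(codes, should_swap):
    """Swap each adjacent pair (a, b) for which should_swap(a, b) is true."""
    out = list(codes)
    i = 0
    while i < len(out) - 1:
        if should_swap(out[i], out[i + 1]):
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return out

def merge_o_kar(codes):
    """Step 6: "ae kar" followed by "aa kar" becomes a single "o kar"."""
    out = []
    for code in codes:
        if code == AA_KAR and out and out[-1] == E_KAR:
            out[-1] = O_KAR
        else:
            out.append(code)
    return out

def convert(text):
    codes = to_codes(text)                                   # step 3
    codes = swap_pairs(codes, lambda a, b: a in PRE_BASE)    # step 4
    codes = swap_pairs(                                      # step 5
        codes, lambda a, b: a == SONDRO_BINDU and b in (AA_KAR, U_KAR)
    )
    codes = merge_o_kar(codes)                               # step 6
    return "".join(chr(c) for c in codes)                    # step 7

print(convert("ikba"))  # "kiba" typed in positional order
```

Working on a list of numeric codes until the last step keeps the reordering passes independent of how the final characters are encoded, which appears to be the reason the flow distinguishes Unicode codes from Unicode values.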
This process of converting plain text with a web-based tool has a couple of limitations:
- Some ASCII characters, specifically the numeric characters, are reused in Ramdhenu to display different Assamese characters using different fonts. This tool works on plain text, so no font information is present in the input, and hence it is technically impossible to always convert correctly to Unicode. The user has to review the content manually to fix such issues.
- Being web-based, there is a limit on the amount of text it can take as input and process.
- It is plain text, so the formatting may have to be done again.
- It is not always 100% correct. I would rate it at 90%+ accuracy at this point. It improves with further tuning of the mapping table. In fact, the mapping table is one of the most critical parts.
Future direction:
- There is scope for improving the mapping table for the web based version.
- The UI can be made more efficient and resilient by setting limits on the input and output text.
- This web-based program can be ported to a desktop edition and then extended to retain formatting and, most importantly, to automatically recognize the font being used, so that it can correctly identify which Assamese character is represented in the text.
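The font recognition in the last point could start from a lookup keyed on both the font and the ASCII character. The Python sketch below is only a rough idea under assumptions: it supposes that a desktop edition can extract the text as (font name, text) runs, it reads “ro” as the Assamese letter U+09F0, and the names FONT_AWARE_MAP, convert_runs and candidates are made up. The two entries follow the “1” example given earlier.

```python
# (font name, ASCII character) -> Unicode text.
# Reading "ro" as U+09F0 is an assumption of this sketch.
FONT_AWARE_MAP = {
    ("Geetanjali Lite", "1"): "\u09F0",  # "ro"
    ("Geetanjali P", "1"): "\u09E7",     # Assamese digit "ek"
}

def convert_runs(runs):
    """Convert (font, text) runs; characters without a mapping pass through unchanged."""
    out = []
    for font, text in runs:
        for ch in text:
            out.append(FONT_AWARE_MAP.get((font, ch), ch))
    return "".join(out)

def candidates(ch):
    """All Unicode readings of one ASCII character when the font is unknown."""
    return sorted({u for (_, c), u in FONT_AWARE_MAP.items() if c == ch})

print(convert_runs([("Geetanjali Lite", "1"), ("Geetanjali P", "1")]))
print(candidates("1"))  # more than one reading: plain text alone is ambiguous
```

The candidates helper shows why the plain text version cannot decide on its own: without the font name, the same "1" has two valid readings.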