1 of 35

Universal Acceptance (UA) Micro-Learning Module: Module 2 - Unicode Advanced Programming in Python.

Instructor Guide

1st Edition.

© 2024 Creative Commons License - Attribution 4.0 International (CC BY 4.0).

Universal Acceptance

2 of 35

Unicode Advanced Programming - Micro-Learning Module Objectives.

  • This module is designed to expand your understanding and proficiency in working with Unicode by covering key aspects such as the character-glyph model, Unicode normalization, accessing the Unicode character database, and comparing and sorting Unicode strings.
  • When relevant, instructors can cover optional topics as indicated in the student guide.

  • At the end of this module, students should be able to:
    • Gain a clear understanding of the character-glyph model and how it relates to Unicode, including the concepts of code points, character encoding, and glyph representations;
    • Explore the functionality and capabilities of text engines that handle complex text layouts;
    • Acquire knowledge and practical skills in Unicode normalization techniques;
    • Understand the importance of normalization for consistent text processing and comparison;
    • Learn how to access and utilize the Unicode character database; and
    • Understand the challenges and best practices for accurate string comparison in a multilingual and diverse context.

3 of 35

Note About the Utilization of Unicode String Literals:

  • The Unicode string literals used in the code examples require input methods (virtual or physical keyboards) suited to the script from which the literals are drawn. If such input methods are unavailable, alternatives such as language translation tools can be used.
  • Throughout the module, the example code uses Unicode string literals from a variety of scripts, ranging from lesser-known to widely used ones. This approach aims to broaden learners' exposure to a diverse range of scripts.
  • The module provides the English meanings of the Unicode string literals used in the example code, along with their transliterations, to help learners pronounce them accurately.
  • Instructors are free to use Unicode string literals from a script of their choice instead of the ones included in the example code.

4 of 35

Character-glyph Model

  • Overview of the Character-glyph Model:
    • Clear separation between characters and glyphs.
    • Characters represent the abstract units of written language.
    • Glyphs are the visual representations or renditions of characters.
  • Key difference between processing and displaying textual information:
    • The distinction between characters and glyphs.
    • Processing operations are performed on the characters' Unicode code points, independent of their visual appearance.
    • Displaying text involves converting characters into their corresponding glyphs for visual presentation.

5 of 35

How Character-Glyph Model is Utilized.

  • Python's support for Unicode allows you to perform operations on characters based on their abstract representation, without being concerned about their visual appearance.

unicode_string = "Hello, 你好, नमस्ते"

# Iterate through the characters (code points) in the string
for char in unicode_string:
    print(char)
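
As an illustrative sketch (not part of the original slides), the loop above can be extended to show that Python iterates over code points, which do not always correspond one-to-one with what a reader perceives on screen. The variable names are chosen for illustration only, and the word is written with escape sequences so the code point sequence is unambiguous.

```python
import unicodedata

# 'Namaste' in Devanagari, spelled out as explicit code points
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

# Each item is one code point; combining marks appear as separate items
for char in word:
    print(f"U+{ord(char):04X}", unicodedata.category(char), unicodedata.name(char))

# Count code points and the combining marks among them (categories Mn, Mc, Me)
code_point_count = len(word)
mark_count = sum(1 for char in word if unicodedata.category(char).startswith("M"))
print("Code points:", code_point_count)
print("Combining marks:", mark_count)
```

The rendered word occupies fewer visual units than it has code points, because the marks are drawn attached to their base characters by the text engine.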

6 of 35

Comparing Unicode Strings:

  • Compare two Unicode characters

  • In this code snippet, the ord() function is used to retrieve the Unicode code point of each character, enabling a comparison based on their abstract representation rather than their appearance as glyphs.

char1 = 'ሀ'
char2 = 'ለ'

# Compare the two characters by their Unicode code points
if ord(char1) < ord(char2):
    print(f"{char1} comes before {char2}")
else:
    print(f"{char1} comes after {char2}")
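
A short sketch of a caveat instructors may want to raise here: code point order is not the same as alphabetical order in a given language. The word list below is a hypothetical example, not taken from the module.

```python
# Hypothetical word list; "\u00e9clair" is "éclair" with a precomposed é
words = ["apple", "Zebra", "\u00e9clair"]

# Python's default string ordering compares code points one by one
code_point_order = sorted(words)
print(code_point_order)
```

Uppercase "Z" (U+005A) sorts before lowercase "a" (U+0061), and "é" (U+00E9) sorts after every unaccented Latin letter, which is why language-aware collation is covered later in the module.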

7 of 35

Text Rendering: Engines, Fonts, and Glyph Shapers.

  • Text rendering:
    • Text engines, fonts, and glyph shapers play crucial roles.
  • Text engines:
    • Are responsible for handling the layout and rendering of text.
    • Analyze the Unicode text, apply appropriate layout algorithms, and determine the correct positioning and shaping.
  • Fonts:
    • Provide the necessary glyphs (visual representations) for each character.
  • Glyph shapers:
    • Shape glyphs to produce visually connected and contextually appropriate representations of characters.

8 of 35

How do Glyph Shapers Handle the Shaping of Characters with Diacritical Marks or Vowel Signs?

  • Glyph shapers:
    • Handle the shaping of characters with diacritical marks or vowel signs.
  • Glyph shapers perform the following tasks:
    • Base Character and Diacritical Mark Identification.
    • Positioning and Placement.
    • Mark Positioning Rules.
    • Contextual Adjustments.
    • Ligatures and Special Forms.
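
The first task above, identifying base characters and their marks, relies on character data that Python also exposes. The following is only a sketch of that data lookup using the standard unicodedata module; the actual positioning is done by the font and the shaping engine, not by this code.

```python
import unicodedata

# Decompose the precomposed letter U+1EC7 into a base letter and its marks
text = unicodedata.normalize("NFD", "\u1ec7")

for char in text:
    # Canonical combining class 0 marks a base character; non-zero marks a combining mark
    ccc = unicodedata.combining(char)
    role = "base character" if ccc == 0 else f"combining mark (class {ccc})"
    print(f"U+{ord(char):04X} {unicodedata.name(char)}: {role}")

combining_classes = [unicodedata.combining(char) for char in text]
```

The combining class tells a shaper where a mark attaches: here the dot below has class 220 and the circumflex has class 230 (above).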

9 of 35

Normalization of Unicode Strings - NFC and NFD (1/2)

  • NFC (Normalization Form C):
    • Composes characters and combining marks whenever possible.
    • Represents characters as precomposed forms, where a single code point represents a composed character.
    • Useful for compatibility with older systems and applications that rely on precomposed characters.
    • It is generally preferred for interoperability and compatibility purposes.
    • It allows for efficient storage and comparison of strings.
    • It can be beneficial when working with most modern software and protocols.

10 of 35

Normalization of Unicode Strings - NFC and NFD (2/2)

  • NFD (Normalization Form D):
    • Decomposes characters and combining marks.
    • Represents characters as a sequence of separate code points, with combining marks following their base characters.
    • Useful for applications that require working with decomposed characters and need to accurately process combining marks.
    • NFD can be beneficial for applications that perform in-depth text processing or linguistic analysis.
    • It provides a more granular representation of characters, making it suitable for certain text manipulation tasks.
    • NFD can help identify and handle specific linguistic and orthographic variations.
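
As one hedged illustration of a text manipulation task that NFD makes easier, the sketch below removes nonspacing marks after decomposition. The helper name strip_marks is hypothetical. Removing marks changes the meaning of words in many languages, so this is shown only to make the decomposed representation concrete.

```python
import unicodedata

def strip_marks(text):
    """Return text with nonspacing combining marks (category Mn) removed."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(char for char in decomposed if unicodedata.category(char) != "Mn")

# "Caf\u00e9" is "Café" with a precomposed é
print(strip_marks("Caf\u00e9"))
```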

11 of 35

Unicode Text Normalization with NFC and NFD: Example Using Python (1/3):

  • Normalization Examples Using Python:

import unicodedata

# Example text
text = "Café"

# Normalize to NFC
nfc_text = unicodedata.normalize('NFC', text)
print("NFC normalized text:", nfc_text)
# Output: NFC normalized text: Café

# Print the individual Unicode code points of the NFC-normalized text
for char in nfc_text:
    print(char, f"U+{ord(char):04X}")
# In NFC, the accented character "é" is the single code point U+00E9.

12 of 35

Unicode Text Normalization with NFC and NFD: Example Using Python (2/3):

  • Normalization Examples Using Python:

# Normalize to NFD (continues the previous example: unicodedata and text are already defined)
nfd_text = unicodedata.normalize('NFD', text)
print("NFD normalized text:", nfd_text)
# Output: NFD normalized text: Café

# Print the individual Unicode code points of the NFD-normalized text
for char in nfd_text:
    print(char, f"U+{ord(char):04X}")
# In NFD, the accented character is "e" (U+0065) followed by the combining acute accent (U+0301).

13 of 35

Unicode Text Normalization with NFC and NFD: Example Using Python (3/3):

  • Both NFC and NFD normalization produce identical Unicode code points for unaccented characters such as C, a, and f.
  • However, the representation of the accented character differs.
  • In NFC, "é" is the single code point U+00E9, whereas in NFD it is represented as the base character "e" (U+0065) followed by the combining acute accent (U+0301).
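
The practical consequence of this difference can be sketched as follows: two strings that look identical may compare as unequal until both are normalized to the same form. The helper name nfc_equal is hypothetical and used only for illustration.

```python
import unicodedata

composed = "Caf\u00e9"      # é as one code point (U+00E9)
decomposed = "Cafe\u0301"   # e followed by the combining acute accent (U+0301)

# A plain comparison looks at code points, so these differ
raw_equal = composed == decomposed
print("Equal without normalization:", raw_equal)

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print("Equal after NFC normalization:", nfc_equal(composed, decomposed))
```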

14 of 35

Exploring the Unicode Character Database (UCD) for Better Text Processing:

  • UCD features for Unicode-aware applications:
    • Character Properties and Metadata.
    • Collation and Sorting.
    • Normalization and Composition.
    • Bidirectional Text Handling.
    • Emoji and Emoji Presentation.
    • Character Set Support and Block Ranges.
    • Case Mapping.

15 of 35

How to Access the Unicode Character Database: Some of the Methods and Tools:

  • Commonly used methods and tools:
    • Programming Language APIs.
    • Web-based APIs
    • Unicode Database Tools and Utilities
    • Offline Unicode Database Dumps.
    • Unicode Consortium Resources.

16 of 35

Accessing UCD Using Programming Languages:

  • Python code using the unicodedata module:

import unicodedata

char = 'ሀ'

# Look up the general category of the character in the UCD
category = unicodedata.category(char)
print(f"Character: {char}")
print(f"Category: {category}")
# Output: Character: ሀ
# Output: Category: Lo
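
Beyond the general category, the unicodedata module exposes several other UCD properties. The sketch below gathers a few of them; the describe helper and its dictionary keys are hypothetical names chosen for this illustration.

```python
import unicodedata

def describe(char):
    """Collect a few UCD properties of a single character."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "name": unicodedata.name(char, "<unnamed>"),
        "category": unicodedata.category(char),
        "bidi_class": unicodedata.bidirectional(char),
        "numeric": unicodedata.numeric(char, None),  # None when there is no numeric value
    }

# Ethiopic syllable HA, Arabic letter BEH, Ethiopic digit three
for char in ["\u1200", "\u0628", "\u136b"]:
    print(describe(char))
```

The property values come from the UCD version bundled with the Python interpreter, which can be checked with unicodedata.unidata_version.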

17 of 35

Accessing UCD Using Command Line Utilities and Complementary Command Line Utilities:

  • Using a Command Line Utility or Library:
    • curl -X GET "https://unicode.org/cldr/utility/character/properties?&character=ሀ"
    • uconv -f utf-8 -t ascii -o output.txt input.txt
    • The curl command sends a GET request to the Unicode CLDR utility, which provides information about Unicode characters.
    • The uconv command (an ICU utility) converts the text in input.txt from UTF-8 to ASCII and writes the result to output.txt.

18 of 35

Accessing UCD Using Web Interfaces:

  • Using web interfaces:
    • Open the website: https://util.unicode.org/UnicodeJsps/character.jsp
    • In the "Character" field, enter the character whose properties you want to query.
    • Click on the code point of interest.
    • The page displays the properties of the character you entered.
    • Scroll down to explore additional details and properties.

19 of 35

Comparing Unicode strings: Case Insensitive Comparisons:

string1 = "Café"
string2 = "café"

# casefold() removes case distinctions more aggressively than lower()
if string1.casefold() == string2.casefold():
    print("The strings are equal (case-insensitive comparison)")
else:
    print("The strings are not equal (case-insensitive comparison)")
# Output: The strings are equal (case-insensitive comparison)
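
Case folding alone does not reconcile composed and decomposed forms of the same text. A minimal sketch that combines case folding with normalization, following the pattern described in the Python documentation for caseless matching, is shown below; the helper name caseless_equal is hypothetical.

```python
import unicodedata

def caseless_equal(a, b):
    """Compare two strings ignoring case and normalization differences."""
    def fold(s):
        # Normalize, case-fold, then normalize again because folding can denormalize
        return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())
    return fold(a) == fold(b)

# Precomposed é versus e + combining acute accent
print(caseless_equal("Caf\u00e9", "cafe\u0301"))
# German sharp s folds to "ss"
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```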

20 of 35

Comparing Unicode strings: Locale-Based Comparisons:

import locale

# Define the strings to compare
string1 = "café"
string2 = "cafe"

# Perform a locale-based comparison
def compare_strings(locale_name):
    # Set the locale before comparing, so that strcoll uses its collation rules.
    # setlocale raises locale.Error if the locale is not installed on the system.
    locale.setlocale(locale.LC_ALL, locale_name)
    result = locale.strcoll(string1, string2)
    if result < 0:
        print(f"{string1} comes before {string2} in the {locale_name} locale.")
    elif result > 0:
        print(f"{string1} comes after {string2} in the {locale_name} locale.")
    else:
        print(f"{string1} and {string2} are equivalent in the {locale_name} locale.")

compare_strings('en_US.UTF-8')
compare_strings('fr_FR.UTF-8')

# The output depends on the collation data installed on the system.
# Typically, café comes after cafe in both locales.

21 of 35

Bidirectional Scripts and Shaped Scripts:

  • Bidirectional Scripts:
    • Text can flow from right to left (RTL).
    • Text can flow from left to right (LTR).
    • Both directions can occur in the same text, for example when numbers or Latin words are embedded in Arabic or Hebrew text.
  • Shaped Scripts, also known as Complex Scripts:
    • Require additional shaping and contextual transformation of characters to achieve the correct visual representation.
    • Examples of shaped scripts include:
      • Arabic, Hebrew, Indic scripts (e.g., Devanagari, Bengali), and
      • Southeast Asian scripts (e.g., Thai, Khmer).
  • Shaping is the process of transforming input text into a sequence of glyphs that visually represent the characters.

22 of 35

Bidirectional Display Format:

  • Ensures that the text flows correctly from right to left (RTL) for RTL scripts and from left to right (LTR) for LTR scripts:
    • Text directionality.
    • Bi-directional control characters.
    • Contextual analysis.
  • Shaped Display Format:
    • Character joining.
    • Glyph substitution.
    • Glyph positioning.
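
Text directionality is derived from the bidirectional class of each character in the UCD. The sketch below is a deliberately simplified version of the "first strong character" heuristic; it ignores embeddings, isolates, and the rest of the Unicode Bidirectional Algorithm, and the function name is hypothetical.

```python
import unicodedata

def first_strong_direction(text):
    """Guess the base direction from the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi_class == "L":           # Left-to-right letters
            return "ltr"
    return "neutral"                    # Only digits, punctuation, or spaces

# 'Marhaban' in Arabic, written as explicit code points
print(first_strong_direction("\u0645\u0631\u062d\u0628\u0627"))
print(first_strong_direction("Hello"))
print(first_strong_direction("123"))
```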

23 of 35

Reshaping Text Using the python-bidi Library or Package:

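  • Python code example: reordering Arabic text for display:

from bidi.algorithm import get_display

def reshape_arabic_text(text):
    # get_display applies the Unicode Bidirectional Algorithm, converting
    # logical (key-press) order into visual order for left-to-right output.
    # It does not join letters; contextual shaping needs a separate shaping step.
    reshaped_text = get_display(text)
    return reshaped_text

# Example usage
text = 'مرحبا بكم'  # 'Marhaban bikum' in Arabic
reshaped_text = reshape_arabic_text(text)
print(reshaped_text)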

24 of 35

File Storage in Key-press Order in Bidirectional and Shaped Scripts:

  • Considerations when storing text in a file in key-press order for bidirectional and shaped scripts:
    • Character Encoding.
    • Logical Order.
    • Shaping and Ligatures.
    • Bi-directional Control Characters.
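
The sketch below illustrates the logical-order point under simple assumptions: mixed-direction text is written to a UTF-8 file exactly as typed and read back unchanged, leaving reordering and shaping to the display layer. The file name and sample text are hypothetical.

```python
import os
import tempfile

# "Hello", then 'Marhaban' in Arabic (as explicit code points), then digits
text = "Hello \u0645\u0631\u062d\u0628\u0627 123"

# Write the text in logical (key-press) order using UTF-8
path = os.path.join(tempfile.mkdtemp(), "bidi_sample.txt")
with open(path, "w", encoding="utf-8") as file:
    file.write(text)

# Read it back; the code point sequence is unchanged
with open(path, "r", encoding="utf-8") as file:
    loaded = file.read()

print(loaded == text)
print([f"U+{ord(char):04X}" for char in loaded])
```

Storing the logical order keeps searching, sorting, and editing independent of how a particular renderer lays the text out.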

25 of 35

Glyph Shapers in Bidirectional and Shaped Scripts:

  • Glyph Shaping in Bidirectional Scripts:
    • Contextual Analysis.
    • Ligatures and Joining.
    • Positional Variants.
  • Glyph Shaping in Shaped Scripts:
    • Contextual Analysis.
    • Ligatures and Substitutions.
    • Complex Shaping Rules.

26 of 35

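ICU Examples for Complex Script Shaping and Rendering:

  • Transliterating Arabic text in Python (shaping itself is left to the rendering engine):

import icu

# Create an ICU Transliterator that converts Arabic script to Latin script.
# Note: a transliterator converts text between scripts; it does not shape glyphs.
# Glyph shaping is performed by the text rendering engine when the text is displayed.
transliterator = icu.Transliterator.createInstance("Arabic-Latin")

# Define the Arabic text to transliterate
arabic_text = "السلام عليكم"  # 'Assalamu alaikum' in Arabic

# Transliterate the Arabic text
latin_text = transliterator.transliterate(arabic_text)

# Display the transliterated text
print(latin_text)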

27 of 35

Unicode in Other File Formats and their Handling - JSON File Unicode Handling:

  • Considerations for Unicode handling in JSON file handling:
    • Encoding
    • Escaping Special Characters
    • Unicode Support in JSON Libraries
    • String Handling
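
The encoding and escaping considerations above can be sketched with the standard json module. By default json.dumps escapes non-ASCII characters; passing ensure_ascii=False keeps them readable. Both forms decode to the same data. The sample dictionary is illustrative only.

```python
import json

# 'Yonas' in Ethiopic, written as explicit code points
data = {"name": "\u12ee\u1293\u1235"}

# Default: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(data)
print(escaped)

# ensure_ascii=False: characters are written as-is (the file must then be saved as UTF-8)
readable = json.dumps(data, ensure_ascii=False)
print(readable)

# Both representations load back to the same Python data
print(json.loads(escaped) == json.loads(readable))
```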

28 of 35

Unicode in JSON File Format using Python:

import json

contact_info = {
    "name": {
        "Ethiopic": "ዮናስ ተስፋዬ",
        "Arabic": "يونس تسفاي",
        "Sinhala": "යුනෝස් ටෙස්ෆායි",
        "Japanese": "ユナス・テスファイ",
        "Latin": "Yonas Tesfaye"
    },
    "email_address": {
        "Ethiopic": "ዮናስ.ተስፋዬ@ድርጅት.ኢትዮጵያ",
        "Arabic": "younes.tesfai@domain.sa",
        "Sinhala": "yonas.tesfaye@domain.lk",
        "Japanese": "yonas.tesfaye@domain.jp",
        "Latin": "yonas.tesfaye@domain.com"
    },

29 of 35

Unicode in JSON File Format using Python:

    "job_title": {
        "Ethiopic": "ሶፍትዌር አልሚ",
        "Arabic": "مهندس برمجيات",
        "Sinhala": "වෘත්තීය ගැටළුවක්",
        "Japanese": "ソフトウェアエンジニア",
        "Latin": "Software Engineer"
    }
}

# Save the data to a JSON file named contactinfo.json
# (ensure_ascii=False writes the characters as-is instead of \uXXXX escapes)
with open("contactinfo.json", "w", encoding="utf-8") as file:
    json.dump(contact_info, file, ensure_ascii=False)

# Open the file named "contactinfo.json" and load the data
with open("contactinfo.json", "r", encoding="utf-8") as file:
    loaded_data = json.load(file)

# Access and print the loaded data
print(loaded_data["name"])           # prints the names in all five scripts
print(loaded_data["email_address"])  # prints the email addresses in all five scripts
print(loaded_data["job_title"])      # prints the job titles in all five scripts

30 of 35

Unicode in JSON File Format using Python:

# Accessing the information in different scripts
print("Name (Ethiopic):", contact_info["name"]["Ethiopic"])
print("Email (Arabic):", contact_info["email_address"]["Arabic"])
print("Job Title (Sinhala):", contact_info["job_title"]["Sinhala"])
print("Name (Japanese):", contact_info["name"]["Japanese"])
# The dictionary has no "Chinese" key, so an existing key is used here
print("Email (Ethiopic):", contact_info["email_address"]["Ethiopic"])

31 of 35

Unicode String Manipulations Using Programming Language Specific Libraries and ICU: Python Code(1/2)

import icu

# Define the Japanese text to transliterate
japanese_text = "こんにちは世界"  # 'Hello, World' in Japanese

# Create an ICU Transliterator object for Han-to-Latin transliteration
han_latin_transliterator = icu.Transliterator.createInstance("Han-Latin")

# Transliterate the Japanese text (a transliterator converts scripts; it does not shape glyphs)
transliterated_text = han_latin_transliterator.transliterate(japanese_text)

# Display the transliterated text
print(transliterated_text)

# Character count: length() counts UTF-16 code units, which equals the number
# of characters here because all of them are in the Basic Multilingual Plane
character_count = icu.UnicodeString(japanese_text).length()
print("Character count:", character_count)

# Character iteration and properties
unicode_string = icu.UnicodeString(japanese_text)

32 of 35

Unicode String Manipulations Using Programming Language Specific Libraries and ICU: Python Code(2/2)

# Continues the previous example: character_count and unicode_string are already defined
for i in range(character_count):
    code_point = unicode_string.charAt(i)
    char = chr(code_point)
    char_name = icu.Char.charName(code_point)
    is_digit = str(char).isdigit()
    is_uppercase = str(char).isupper()
    print("Character:", char)
    print("Character code point:", code_point)
    print("Character name:", char_name)
    print("Is character a digit?", is_digit)
    print("Is character uppercase?", is_uppercase)
    print("--------------------")

print("--------------------")

33 of 35

Examples on How to Use generic ICU Library in Python for Text Collation: Python Code

import icu

def compare_strings(text1, text2, locale='ar'):
    # Create a collator that applies the sorting rules of the given locale
    collator = icu.Collator.createInstance(icu.Locale(locale))
    result = collator.compare(text1, text2)
    if result < 0:
        return f"{text1} comes before {text2}"
    elif result > 0:
        return f"{text1} comes after {text2}"
    else:
        return f"{text1} is equal to {text2}"

string1 = "تفاحة"  # 'apple' in Arabic
string2 = "موز"  # 'banana' in Arabic
comparison = compare_strings(string1, string2)
print("Comparison Result:", comparison)

  • The above code produces the output: Comparison Result: تفاحة comes before موز

34 of 35

Reference:

  • Unicode Consortium. "Glyph Shaping." In The Unicode Standard, Version 15.0. Unicode Consortium. 2022. Retrieved October 21, 2023, from https://www.unicode.org/versions/Unicode15.0.0/.
  • International Business Machines Corporation (IBM). (2023). ICU Documentation. Retrieved October 21, 2023, from https://unicode-org.github.io/icu/.
  • Python Software Foundation. (2023). Python 3.12 Documentation. Retrieved October 21, 2023, from https://docs.python.org/3/.

35 of 35

Author:

  • Dessalegn Mequanint Yehuala, dessalegn.mequanint@aau.edu.et.
