Universal Acceptance (UA) Micro-Learning Module: Module 2 - Unicode Advanced Programming in Python.
Instructor Guide
1st Edition.
© 2024 Creative Commons License - Attribution 4.0 International (CC BY 4.0).
Universal Acceptance
Unicode Advanced Programming - Micro-Learning Module Objectives.
Note About the Use of Unicode String Literals:
Character-Glyph Model
How the Character-Glyph Model Is Used:
unicode_string = "Hello, 你好, नमस्ते"
# Iterate through the string one code point at a time.
# A code point is a character, not a glyph: combining marks are printed separately.
for char in unicode_string:
    print(char)
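The loop above visits code points, which are not always what a reader perceives as one character. The following is a minimal sketch, using only the standard `unicodedata` module, of how the Devanagari word from that example breaks down into base letters and combining marks. The variable names are illustrative.

```python
import unicodedata

word = "नमस्ते"  # the Devanagari word from the example above

# len() counts code points, not the glyphs a reader sees on screen
code_points = [f"U+{ord(ch):04X}" for ch in word]
print(len(word), code_points)

# Combining marks (general category starting with "M") attach to the
# preceding base letter when the text is rendered
marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]
print("Combining marks:", [f"U+{ord(m):04X}" for m in marks])
```

The glyph shaper, not the string, decides how these code points are combined into the shapes that appear on screen.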
Comparing Unicode Strings:
char1 = 'ሀ'
char2 = 'ለ'
# ord() returns the code point, so this compares the characters in code point order
if ord(char1) < ord(char2):
    print(f"{char1} comes before {char2}")
else:
    print(f"{char1} comes after {char2}")
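Code point order is simple, but it is not the order a dictionary would use. As a small illustration (the word list is made up for this sketch), Python's built-in `sorted()` compares strings code point by code point:

```python
words = ["zebra", "Apple", "\u00e9clair"]  # "\u00e9" is the precomposed "é"

# sorted() compares strings code point by code point, the same ordering
# that comparing ord() values gives for single characters
by_code_point = sorted(words)
print(by_code_point)

# "é" (U+00E9) has a higher code point than "z" (U+007A), so "éclair"
# lands after "zebra", which is not dictionary order in most languages
print(ord("z"), ord("\u00e9"))
```

Language-aware ordering needs collation rules, such as those provided by the `locale` module or ICU.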
Text Rendering: Engines, Fonts, and Glyph Shaper.
How Do Glyph Shapers Handle the Shaping of Characters with Diacritical Marks or Vowel Signs?
Normalization of Unicode Strings - NFC and NFD (1/2)
Normalization of Unicode Strings - NFC and NFD (2/2)
Unicode Text Normalization with NFC and NFD: Example Using Python (1/3):
import unicodedata
# Example text
text = "Café"
# Normalize to NFC (composed form)
nfc_text = unicodedata.normalize('NFC', text)
print("NFC normalized text:", nfc_text)
# Output: NFC normalized text: Café
# Print the individual Unicode code points of the NFC normalized text
for char in nfc_text:
    print(char, f"U+{ord(char):04X}")
# In NFC, the accented character "é" is the single code point U+00E9.
Unicode Text Normalization with NFC and NFD: Example Using Python (2/3):
# Normalize to NFD (decomposed form)
nfd_text = unicodedata.normalize('NFD', text)
print("NFD normalized text:", nfd_text)
# Output: NFD normalized text: Café
# Print the individual Unicode code points of the NFD normalized text
for char in nfd_text:
    print(char, f"U+{ord(char):04X}")
# In NFD, "é" is decomposed into "e" (U+0065) followed by
# the combining acute accent "´" (U+0301).
Unicode Text Normalization with NFC and NFD: Example Using Python (3/3):
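One practical consequence of the two forms is that visually identical strings can compare as unequal. The following is a minimal sketch with illustrative variable names; it builds both forms explicitly with escape sequences so the difference does not depend on how the source file was typed.

```python
import unicodedata

composed = "Caf\u00e9"      # "é" as one code point (NFC form)
decomposed = "Cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT (NFD form)

# The two strings render identically but differ as code point sequences
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing both sides to the same form makes the comparison reliable
same = (unicodedata.normalize("NFC", composed)
        == unicodedata.normalize("NFC", decomposed))
print(same)
```

Normalizing at the boundaries of a program (user input, file reads, database writes) is one common way to avoid such mismatches.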
Exploring the Unicode Character Database (UCD) for Better Text Processing:
How to Access the Unicode Character Database: Some of the Methods and Tools:
Accessing UCD Using Programming Languages:
import unicodedata
char = 'ሀ'
# Look up the general category of the character in the UCD
category = unicodedata.category(char)
print(f"Character: {char}")
print(f"Category: {category}")
# Output: Character: ሀ
# Output: Category: Lo
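The general category is only one of the UCD properties that Python exposes. As a brief sketch, the standard `unicodedata` module can also report a character's name, bidirectional class, and canonical decomposition; the sample characters below are chosen only for illustration.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python build
print("UCD version:", unicodedata.unidata_version)

# Ethiopic syllable, Latin letter with acute, Arabic-Indic digit
for ch in ["\u1200", "\u00e9", "\u0663"]:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )

# decomposition() returns the canonical decomposition as hex code points
decomp = unicodedata.decomposition("\u00e9")
print(decomp)
```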
Accessing UCD Using Command Line Utilities and Complementary Command Line Utilities:
Accessing UCD Using Web Interfaces:
Comparing Unicode Strings: Case-Insensitive Comparisons:
string1 = "Café"
string2 = "café"
# casefold() is a more aggressive form of lower() intended for caseless matching
if string1.casefold() == string2.casefold():
    print("The strings are equal (case-insensitive comparison)")
else:
    print("The strings are not equal (case-insensitive comparison)")
# Output: The strings are equal (case-insensitive comparison)
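`casefold()` alone does not help when the two strings use different normalization forms. The sketch below combines case folding with NFD normalization, following the pattern shown in the Python `unicodedata` documentation; the helper names `nfd` and `caseless_equal` are illustrative, not part of any library.

```python
import unicodedata

def nfd(s):
    """Return s in canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", s)

def caseless_equal(a, b):
    """Compare two strings ignoring case and composed/decomposed differences."""
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

# lower() keeps the German sharp s, casefold() expands it to "ss"
print("Stra\u00dfe".lower() == "STRASSE".lower())
print(caseless_equal("Stra\u00dfe", "STRASSE"))

# Composed "é" versus "E" plus a combining acute accent
print(caseless_equal("Caf\u00e9", "CAFE\u0301"))
```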
Comparing Unicode Strings: Locale-Based Comparisons:
import locale
# Define the strings to compare
string1 = "café"
string2 = "cafe"
# Perform a locale-based comparison
def compare_strings(locale_name):
    # Set the locale first, so that strcoll() uses its collation rules
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        print(f"The {locale_name} locale is not available on this system.")
        return
    result = locale.strcoll(string1, string2)
    if result < 0:
        print(f"{string1} comes before {string2} in the {locale_name} locale.")
    elif result > 0:
        print(f"{string1} comes after {string2} in the {locale_name} locale.")
    else:
        print(f"{string1} and {string2} are equivalent in the {locale_name} locale.")
compare_strings('en_US.UTF-8')
compare_strings('fr_FR.utf8')
# The output depends on the locale data installed on the system.
# Under the default Unicode collation rules, unaccented "cafe" sorts before "café".
Bidirectional Scripts and Shaped Scripts:
Bidirectional Display Format:
Reordering Text for Display Using the python-bidi Library or Package:
from bidi.algorithm import get_display
def reorder_for_display(text):
    # get_display() applies the Unicode Bidirectional Algorithm and returns the
    # text in visual (display) order. It reorders characters; it does not shape glyphs.
    return get_display(text)
# Example usage
text = 'مرحبا بكم'  # 'Marhaban bikum' in Arabic
display_text = reorder_for_display(text)
print(display_text)
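The Bidirectional Algorithm decides display order from the bidirectional class that the UCD assigns to each character. As a small standard-library sketch (the sample string is made up for illustration), these classes can be inspected with `unicodedata.bidirectional()`:

```python
import unicodedata

# Latin letters, the Arabic word "مرحبا", and European digits
text = "abc \u0645\u0631\u062d\u0628\u0627 123"

for ch in text:
    if not ch.isspace():
        # L = left-to-right, AL = Arabic letter (right-to-left), EN = European number
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))

# The set of bidirectional classes present in the string, including WS for spaces
classes = {unicodedata.bidirectional(ch) for ch in text}
print(classes)
```

The string itself stays in logical (typing) order; only the display order is derived from these classes.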
File Storage in Key-press Order in Bidirectional and Shaped Scripts:
Glyph Shapers in Bidirectional and Shaped Scripts:
ICU Examples for Complex Script Shaping and Rendering:
import icu
# Note: an ICU Transliterator converts text from one script to another.
# It does not shape glyphs; shaping is done by the text rendering engine.
# Create an ICU Transliterator object for Arabic-to-Latin conversion
transliterator = icu.Transliterator.createInstance("Arabic-Latin")
# Define the Arabic text to transliterate
arabic_text = "السلام عليكم"  # 'Assalamu alaikum' in Arabic
# Transliterate the Arabic text
latin_text = transliterator.transliterate(arabic_text)
# Display the transliterated text
print(latin_text)
Unicode in Other File Formats and Their Handling - JSON File Unicode Handling:
Unicode in JSON File Format using Python:
import json
contact_info = {
    "name": {
        "Ethiopic": "ዮናስ ተስፋዬ",
        "Arabic": "يونس تسفاي",
        "Sinhala": "යුනෝස් ටෙස්ෆායි",
        "Japanese": "ユナス・テスファイ",
        "Latin": "Yonas Tesfaye"
    },
    "email_address": {
        "Ethiopic": "ዮናስ.ተስፋዬ@ድርጅት.ኢትዮጵያ",
        "Arabic": "younes.tesfai@domain.sa",
        "Sinhala": "yonas.tesfaye@domain.lk",
        "Japanese": "yonas.tesfaye@domain.jp",
        "Latin": "yonas.tesfaye@domain.com"
    },
    "job_title": {
        "Ethiopic": "ሶፍትዌር አልሚ",
        "Arabic": "مهندس برمجيات",
        "Sinhala": "වෘත්තීය ගැටළුවක්",
        "Japanese": "ソフトウェアエンジニア",
        "Latin": "Software Engineer"
    }
}
# Save the data to a JSON file named contactinfo.json.
# ensure_ascii=False writes the characters as-is instead of \uXXXX escapes.
with open("contactinfo.json", "w", encoding="utf-8") as file:
    json.dump(contact_info, file, ensure_ascii=False)
# Open the file named "contactinfo.json" and load the data back
with open("contactinfo.json", "r", encoding="utf-8") as file:
    loaded_data = json.load(file)
# Access and print the loaded data
print(loaded_data["name"])           # names in each script
print(loaded_data["email_address"])  # email addresses in each script
print(loaded_data["job_title"])      # job titles in each script
Unicode in JSON File Format using Python (continued):
# Accessing the information in different scripts
print("Name (Ethiopic):", contact_info["name"]["Ethiopic"])
print("Email (Arabic):", contact_info["email_address"]["Arabic"])
print("Job Title (Sinhala):", contact_info["job_title"]["Sinhala"])
print("Name (Japanese):", contact_info["name"]["Japanese"])
print("Email (Latin):", contact_info["email_address"]["Latin"])
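The effect of `ensure_ascii` is easiest to see on a single value. The following is a minimal sketch with an illustrative one-field record; both outputs are valid JSON and decode to the same data.

```python
import json

record = {"name": "ዮናስ"}  # illustrative one-field record

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep the characters as-is

print(escaped)   # non-ASCII characters written as \uXXXX escapes
print(readable)  # Ethiopic characters written directly

# Both forms decode back to the original data
round_trip_ok = json.loads(escaped) == json.loads(readable) == record
print(round_trip_ok)
```

The readable form is easier for people to inspect, but it requires the file to be written and read with an explicit Unicode encoding such as UTF-8.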
Unicode String Manipulations Using Programming Language Specific Libraries and ICU: Python Code (1/2)
import icu
# Define the Japanese text to transliterate
japanese_text = "こんにちは世界"  # 'Hello, World' in Japanese
# Create an ICU Transliterator object for Han-to-Latin transliteration
han_latin_transliterator = icu.Transliterator.createInstance("Han-Latin")
# Transliterate the Japanese text
latin_text = han_latin_transliterator.transliterate(japanese_text)
# Display the transliterated text
print(latin_text)
# Character count (UnicodeString.length() counts UTF-16 code units)
character_count = icu.UnicodeString(japanese_text).length()
print("Character count:", character_count)
# Character iteration and properties
unicode_string = icu.UnicodeString(japanese_text)
Unicode String Manipulations Using Programming Language Specific Libraries and ICU: Python Code (2/2)
for i in range(character_count):
    # charAt() returns the UTF-16 code unit at index i
    code_point = unicode_string.charAt(i)
    char = chr(code_point)
    char_name = icu.Char.charName(code_point)
    is_digit = char.isdigit()
    is_uppercase = char.isupper()
    print("Character:", char)
    print("Character code point:", code_point)
    print("Character name:", char_name)
    print("Is character a digit?", is_digit)
    print("Is character uppercase?", is_uppercase)
    print("--------------------")
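ICU's `UnicodeString` stores text as UTF-16, so its length counts code units, while Python's `len()` counts code points. The two differ for characters outside the Basic Multilingual Plane. A standard-library sketch of the difference, using an illustrative string whose first character is a supplementary-plane ideograph:

```python
# "𠮷野家": U+20BB7 lies outside the Basic Multilingual Plane
text = "\U00020BB7\u91ce\u5bb6"

code_points = len(text)                           # what Python's len() counts
utf16_units = len(text.encode("utf-16-le")) // 2  # what a UTF-16 length counts
utf8_bytes = len(text.encode("utf-8"))            # size when stored as UTF-8

print("Code points:", code_points)
print("UTF-16 code units:", utf16_units)
print("UTF-8 bytes:", utf8_bytes)
```

For text such as this, a loop that indexes by UTF-16 code unit would visit the two halves of a surrogate pair separately rather than the whole character.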
Examples of How to Use the Generic ICU Library in Python for Text Collation: Python Code
import icu
def compare_strings(text1, text2, locale='ar'):
    # Create a collator that follows the sorting rules of the given locale
    collator = icu.Collator.createInstance(icu.Locale(locale))
    result = collator.compare(text1, text2)
    if result < 0:
        return f"{text1} comes before {text2}"
    elif result > 0:
        return f"{text1} comes after {text2}"
    else:
        return f"{text1} is equal to {text2}"
string1 = "تفاحة"
string2 = "موز"
comparison = compare_strings(string1, string2)
print("Comparison Result:", comparison)
Reference:
Author: