1 of 36

Universal Acceptance (UA) Micro-Learning Module: Module 2- Unicode Advanced Programming in Java.

Instructor Guide

1st Edition.

© 2024 Creative Commons License - Attribution 4.0 International (CC BY 4.0).

Universal Acceptance

2 of 36

Unicode Advanced Programming- Micro-Learning Module Objectives.

  • This module is designed to expand your understanding and proficiency in working with Unicode by covering key aspects such as the character-glyph model, Unicode normalization, accessing the Unicode character database, and comparing and sorting Unicode strings.
  • When relevant, instructors can cover optional topics as indicated in the student guide.

  • At the end of this module, students should be able to:
    • Gain a clear understanding of the character-glyph model and how it relates to Unicode, including the concepts of code points, character encoding, and glyph representations;
    • Explore the functionality and capabilities of text engines that handle complex text layouts;
    • Acquire knowledge and practical skills in Unicode normalization techniques;
    • Understand the importance of normalization for consistent text processing and comparison;
    • Learn how to access and utilize the Unicode character database; and
    • Understand the challenges and best practices for accurate string comparison in a multilingual and diverse context.


3 of 36

Note About the Use of Unicode String Literals:

  • Typing the Unicode string literals used in the code examples requires an input method, whether a virtual or a physical keyboard, suited to the script of each literal. If no such input method is available, alternatives such as language translation tools can be used.
  • Throughout the module, the example code uses Unicode string literals from a variety of scripts, ranging from lesser-known to widely used ones. This approach aims to broaden learners' exposure to a diverse range of scripts.
  • The module provides the English meanings of the Unicode string literals used in the example code, along with their transliterations, to help learners pronounce them accurately.
  • Instructors are free to use Unicode string literals from a script of their choice rather than the ones included in the example code.


4 of 36

Character-glyph Model

  • Overview of the Character-glyph Model.
    • Clear separation between characters and glyphs.
    • Characters represent the abstract units of written language.
    • Glyphs are the visual representations or renditions of characters.
  • Key difference between processing textual information and displaying it:
    • The distinction between characters and glyphs.
    • Processing operations are performed on the characters' Unicode code points, independent of their visual appearance.
    • Displaying text involves converting characters into their corresponding glyphs for visual presentation.


5 of 36

Examples of How the Character-Glyph Model Is Utilized.

  • Java’s support for Unicode allows you to perform operations on characters based on their abstract representation, without being concerned about their visual appearance.

public class UnicodeStringIteration {
    public static void main(String[] args) {
        // Declare a Unicode string.
        // "你好" (nǐ hǎo) is a Chinese greeting, and "नमस्ते" (namaste) is a greeting in Hindi.
        String unicodeString = "Hello, 你好, नमस्ते";

        // Convert the string to an array of characters (UTF-16 code units)
        char[] charArray = unicodeString.toCharArray();

        // Iterate through the characters in the string
        for (char c : charArray) {
            System.out.println(c);
        }
    }
}
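The example above iterates over `char` values, which are UTF-16 code units. The following is a minimal sketch of the difference between code units and code points for a character outside the Basic Multilingual Plane; the class and method names are illustrative only and are not part of the module.

```java
class CodePointIteration {
    // Number of UTF-16 code units, which is what length() and toCharArray() see.
    static int codeUnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (abstract characters).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Formats every code point of the string in U+XXXX notation.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "A" followed by U+1F600, written as a surrogate pair
        String text = "A\uD83D\uDE00";
        System.out.println("Code units:  " + codeUnitCount(text));  // 3
        System.out.println("Code points: " + codePointCount(text)); // 2
        System.out.println(describe(text));                         // U+0041 U+1F600
    }
}
```

Iterating with `codePoints()` keeps a supplementary character in one piece, whereas a `char` loop would print its two surrogates separately.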


6 of 36

Comparing Unicode Strings:

  • Compare two Unicode characters.

  • In character comparison operations like the one in the code example below, the comparison is based solely on the characters' Unicode code points (their abstract representation), not on their glyphs (their visual representation).

public class UnicodeComparison {
    public static void main(String[] args) {
        // Define two Unicode characters using character literals
        char char1 = 'ሀ'; // the first character (ha) in the Ethiopic script
        char char2 = 'ለ'; // the character "le" in the Ethiopic script

        // Compare the characters directly (by code point value)
        if (char1 < char2) {
            System.out.println(char1 + " comes before " + char2);
        } else {
            System.out.println(char1 + " comes after " + char2);
        }
    }
}


7 of 36

Text Rendering: Engines, Fonts, and Glyph Shapers.

  • Text rendering
    • Text engines, fonts, and glyph shapers play crucial roles.
  • Text Engines:
    • are responsible for handling the layout and rendering of text.
    • analyze the Unicode text, apply appropriate layout algorithms, and determine the correct positioning and shaping.
  • Fonts:
    • provide the necessary glyphs (visual representations) for each character.
  • Glyph Shapers:
    • shape glyphs to produce visually connected and contextually appropriate representations of characters.

8 of 36

How Do Glyph Shapers Handle the Shaping of Characters with Diacritical Marks or Vowel Signs?

  • Glyph shapers
    • handle the shaping of characters with diacritical marks or vowel signs.
  • Glyph shapers perform the following tasks:
    • Base Character and Diacritical Mark Identification.
    • Positioning and Placement.
    • Mark Positioning Rules.
    • Contextual Adjustments.
    • Ligatures and Special Forms.


9 of 36

Normalization of Unicode Strings - NFC and NFD (1/2)

  • NFC (Normalization Form C)
    • Composes characters and combining marks whenever possible.
    • Represents characters as precomposed forms, where a single code point represents a composed character.
    • Useful for compatibility with older systems and applications that rely on precomposed characters.
    • It is generally preferred for interoperability and compatibility purposes.
    • It allows for efficient storage and comparison of strings.
    • It can be beneficial when working with most modern software and protocols.


10 of 36

Normalization of Unicode Strings - NFC and NFD (2/2)

  • NFD (Normalization Form D)
    • Decomposes characters and combining marks.
    • Represents characters as a sequence of separate code points, with combining marks following their base characters.
    • Useful for applications that require working with decomposed characters and need to accurately process combining marks.
    • NFD can be beneficial for applications that perform in-depth text processing or linguistic analysis.
    • It provides a more granular representation of characters, making it suitable for certain text manipulation tasks.
    • NFD can help identify and handle specific linguistic and orthographic variations.
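As one possible illustration of the more granular view that NFD provides, the sketch below decomposes a string and counts its nonspacing combining marks. The class and method names are hypothetical, and counting only the `Mn` category is an assumption made to keep the example short.

```java
import java.text.Normalizer;

class CombiningMarkCount {
    // Decomposes the input to NFD and counts nonspacing marks (general category Mn).
    static long countCombiningMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.codePoints()
                  .filter(cp -> Character.getType(cp) == Character.NON_SPACING_MARK)
                  .count();
    }

    public static void main(String[] args) {
        // U+00E9 is the precomposed "e with acute"
        System.out.println(countCombiningMarks("Caf\u00E9")); // 1
        System.out.println(countCombiningMarks("Cafe"));      // 0
        // U+1EC7 decomposes into "e" plus two combining marks
        System.out.println(countCombiningMarks("\u1EC7"));    // 2
    }
}
```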


11 of 36

Unicode Text Normalization with NFC and NFD: Example Using Java (1/3):

  • Normalization Examples Using Java:

import java.text.Normalizer;

public class UnicodeNormalization {
    public static void main(String[] args) {
        // Example text
        String text = "Café";

        // Normalize to NFC
        String nfcText = Normalizer.normalize(text, Normalizer.Form.NFC);
        System.out.println("NFC normalized text: " + nfcText);

        // Print each character of the NFC text with its Unicode code point
        for (int i = 0; i < nfcText.length(); i++) {
            char ch = nfcText.charAt(i);
            System.out.println(ch + " " + String.format("U+%04X", (int) ch));
        }
        // In NFC, the accented character "é" is the single code point U+00E9.


12 of 36

Unicode Text Normalization with NFC and NFD: Example Using Java (2/3):

  • Normalization Examples Using Java:

        // Normalize to NFD
        String nfdText = Normalizer.normalize(text, Normalizer.Form.NFD);
        System.out.println("NFD normalized text: " + nfdText);

        // Print each character of the NFD text with its Unicode code point.
        // The NFD text can be longer than the original, so it is not indexed against it.
        for (int i = 0; i < nfdText.length(); i++) {
            char ch = nfdText.charAt(i);
            System.out.println(ch + " " + String.format("U+%04X", (int) ch));
        }
        // In NFD, "é" becomes "e" (U+0065) followed by
        // the combining acute accent (U+0301).
    }
}


13 of 36

Unicode Text Normalization with NFC and NFD: Example Using Java (3/3):

  • NFC and NFD normalization produce identical Unicode code points for characters such as C, a, and f, which have no decomposition.
  • However, the two forms differ in how they represent the accented character.
  • In NFC, "é" is the single code point U+00E9, whereas in NFD it is represented as the base character "e" (U+0065) followed by the combining acute accent (U+0301).
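Because the two forms encode the same text differently, a plain `equals` call treats them as different strings. The following sketch, with hypothetical class and method names, normalizes both operands to NFC before comparing; choosing NFC rather than NFD here is an assumption, and either form works as long as both sides use the same one.

```java
import java.text.Normalizer;

class NormalizedEquality {
    // Compares two strings after bringing both to the same normalization form.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "Caf\u00E9";    // precomposed form
        String decomposed = "Cafe\u0301"; // base letter plus combining acute accent

        System.out.println(composed.equals(decomposed));           // false
        System.out.println(equalsNormalized(composed, decomposed)); // true
    }
}
```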


14 of 36

Exploring the Unicode Character Database (UCD) for Better Text Processing:

  • UCD features for Unicode-aware applications:
    • Character Properties and Metadata
    • Collation and Sorting
    • Normalization and Composition
    • Bidirectional Text Handling
    • Emoji and Emoji Presentation
    • Character Set Support and Block Ranges
    • Case Mapping


15 of 36

How to Access the Unicode Character Database: Some of the Methods and Tools:

  • Commonly used methods and tools:
    • Programming Language APIs
    • Web-based APIs
    • Unicode Database Tools and Utilities
    • Offline Unicode Database Dumps
    • Unicode Consortium Resources


16 of 36

Accessing UCD Using Programming Languages:

  • Java using the java.lang.Character class:

public class UCDExample2 {
    public static void main(String[] args) {
        // Retrieve character properties
        char character = 'ሀ'; // Ethiopic Syllable Ha (ሀ)
        int codePoint = character;

        // getType returns an int constant, not the two-letter category abbreviation
        int category = Character.getType(codePoint);

        System.out.println("Character: " + character);
        System.out.println("Category: " + category);
        // Output: Character: ሀ
        // Output: Category: 5 (Character.OTHER_LETTER, the general category "Lo")
    }
}
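Since `Character.getType` returns a numeric constant, a small mapping helps when the two-letter category abbreviation is wanted. The sketch below covers only a handful of categories and also reads the character name and script; the class name and the `categoryName` helper are hypothetical.

```java
class CharacterProperties {
    // Maps a few java.lang.Character type constants to UCD general category abbreviations.
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:     return "Lu";
            case Character.LOWERCASE_LETTER:     return "Ll";
            case Character.OTHER_LETTER:         return "Lo";
            case Character.NON_SPACING_MARK:     return "Mn";
            case Character.DECIMAL_DIGIT_NUMBER: return "Nd";
            default: return "other (" + Character.getType(codePoint) + ")";
        }
    }

    public static void main(String[] args) {
        int codePoint = 0x1200; // Ethiopic Syllable Ha
        System.out.println("Name:     " + Character.getName(codePoint));
        System.out.println("Category: " + categoryName(codePoint));
        System.out.println("Script:   " + Character.UnicodeScript.of(codePoint));
        System.out.println("Block:    " + Character.UnicodeBlock.of(codePoint));
    }
}
```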


17 of 36

Accessing UCD Using Command Line Utilities and Complementary Command Line Utilities:

  • Using a Command Line Utility or Library:
    • curl -X GET "https://unicode.org/cldr/utility/character/properties?&character=ሀ"
    • uconv -f utf-8 -t ascii -o output.txt input.txt
    • The curl command sends a GET request to the Unicode CLDR utility, which provides information about Unicode characters.
    • The uconv command, which ships with ICU, converts the text in input.txt from UTF-8 to ASCII and writes the result to output.txt.


18 of 36

Accessing UCD Using Web Interfaces

  • Using Web interfaces:
    • Website: https://util.unicode.org/UnicodeJsps/character.jsp
    • In the "Character" field, enter the character whose properties you want to query.
    • Click on the code point of interest.
    • The page displays the properties of the character you entered.
    • Scroll down to explore additional details and properties.


19 of 36

Comparing Unicode Strings: Case Insensitive and Locale-Based Comparisons (1/2):

  • Case-insensitive comparisons in Java:

public class CaseInsensitiveComparison {
    public static void main(String[] args) {
        String string1 = "Café";
        String string2 = "café";

        if (string1.equalsIgnoreCase(string2)) {
            System.out.println("The strings are equal (case-insensitive comparison)");
        } else {
            System.out.println("The strings are not equal (case-insensitive comparison)");
        }
    }
}
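`equalsIgnoreCase` compares strings character by character and does not take a locale into account, while case mapping itself can be locale dependent. The sketch below shows the well-known Turkish case, where the lowercase of "I" is the dotless "ı" (U+0131). The class and method names are hypothetical.

```java
import java.util.Locale;

class LocaleCaseMapping {
    // Lowercases a string using the case-mapping rules of the given language tag.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        System.out.println(lower("TITLE", "en")); // title
        System.out.println(lower("TITLE", "tr")); // the two I letters become U+0131 (dotless i)

        // Locale.ROOT gives locale-independent results, which suits identifiers and protocol data
        System.out.println("TITLE".toLowerCase(Locale.ROOT)); // title
    }
}
```

For this reason, passing an explicit locale (or `Locale.ROOT`) to `toLowerCase` and `toUpperCase` is generally safer than relying on the default locale of the machine running the code.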


20 of 36

Comparing Unicode Strings: Case Insensitive and Locale-Based Comparisons (2/2):

  • Locale-based comparisons in Java (code snippet):

import java.text.Collator;
import java.util.Locale;

Locale locale = Locale.US; // Example: English (United States)

// Create a Collator object with the specified locale
Collator collator = Collator.getInstance(locale);

// Define the strings to compare
String string1 = "café";
String string2 = "cafe";

// Perform a locale-based comparison
int result = collator.compare(string1, string2);

if (result < 0) {
    System.out.println(string1 + " comes before " + string2 + " in the specified locale.");
} else if (result > 0) {
    System.out.println(string1 + " comes after " + string2 + " in the specified locale.");
} else {
    System.out.println(string1 + " and " + string2 + " are equivalent in the specified locale.");
}
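A `Collator` can also be passed to a sort, and its strength setting controls which differences count. The sketch below, with hypothetical class and method names, sorts a short word list with a locale-aware collator and uses `Collator.PRIMARY` strength to treat accent and case differences as insignificant.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollatorSorting {
    // Returns a copy of the words sorted by the collation rules of the locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    // Compares at primary strength, so accent and case differences are ignored.
    static boolean sameBaseLetters(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "\u00C9clair", "apple");

        // Code point order places the accented capital after "zebra"
        List<String> byCodePoint = new ArrayList<>(words);
        byCodePoint.sort(null);
        System.out.println(byCodePoint);

        // Collation order places it between "apple" and "zebra"
        System.out.println(sortFor(Locale.ENGLISH, words));

        System.out.println(sameBaseLetters(Locale.ENGLISH, "caf\u00E9", "Cafe")); // true
    }
}
```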


21 of 36

Bidirectional Scripts and Shaped Scripts:

  • Bidirectional Scripts:
    • text flows mainly from right to left (RTL)
    • embedded numbers and text from other scripts flow from left to right (LTR)
  • Shaped Scripts also known as Complex Scripts:
    • additional shaping and contextual transformation of characters to achieve the correct visual representation.
    • Examples of shaped scripts include:
      • Arabic, Hebrew, Indic scripts (e.g., Devanagari, Bengali), and
      • Southeast Asian scripts (e.g., Thai, Khmer)
  • Shaping is the process of transforming input text into a sequence of glyphs that visually represent the characters.


22 of 36

Bidirectional Display Format:

  • Ensures that the text flows correctly from right to left (RTL) for RTL scripts and from left to right (LTR) for LTR scripts:
    • Text directionality.
    • Bi-directional control characters.
    • Contextual analysis.
  • Shaped Display Format:
    • Character joining.
    • Glyph substitution.
    • Glyph positioning.
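Text directionality can be inspected programmatically. The sketch below uses the JDK's own `java.text.Bidi` class (not the ICU4J class of the same name) to find the directional runs in a mixed Latin and Arabic string; the class and method names are hypothetical, and the default-left-to-right flag is an assumption about how the paragraph direction should be chosen.

```java
import java.text.Bidi;

class BidiRuns {
    // Analyzes the text, taking the base direction from its first strong character.
    static Bidi analyze(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    public static void main(String[] args) {
        // "Hello " followed by the Arabic word "marhaban"
        String mixed = "Hello \u0645\u0631\u062D\u0628\u0627";
        Bidi bidi = analyze(mixed);

        System.out.println("Mixed directions: " + bidi.isMixed());
        System.out.println("Base is LTR:      " + bidi.baseIsLeftToRight());

        // Even levels are left-to-right runs, odd levels are right-to-left runs
        for (int run = 0; run < bidi.getRunCount(); run++) {
            System.out.println("Run " + run + ": [" + bidi.getRunStart(run) + ", "
                    + bidi.getRunLimit(run) + ") level " + bidi.getRunLevel(run));
        }
    }
}
```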


23 of 36

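Reshaping Text Using the ICU Library:

  • Java code example for preparing Arabic text for display: The Java code below uses the ICU4J library's com.ibm.icu.text.Bidi class to reorder Arabic text from logical (storage) order into visual (display) order:

import com.ibm.icu.text.Bidi;

public class ReshapeArabicText {
    // Applies the Unicode Bidirectional Algorithm and returns the text in visual order.
    // Contextual letter shaping is a separate step (see com.ibm.icu.text.ArabicShaping).
    public static String reshapeArabicText(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        // writeReordered takes output options such as DO_MIRRORING,
        // not a reordering mode such as REORDER_DEFAULT
        return bidi.writeReordered(Bidi.DO_MIRRORING);
    }

    public static void main(String[] args) {
        String text = "مرحبا بكم"; // 'Marhaban bikum' in Arabic
        String reshapedText = reshapeArabicText(text);
        System.out.println(reshapedText);
    }
}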


24 of 36

File Storage in Key-press Order in Bidirectional and Shaped Scripts:

  • Consideration in storing text in a file in key-press order for bidirectional and shaped scripts:
    • Character Encoding.
    • Logical Order.
    • Shaping and Ligatures.
    • Bi-directional Control Characters.
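The encoding and logical-order considerations can be sketched as follows: the text is written to a file exactly as typed (logical order) with an explicit UTF-8 encoding, and reordering or shaping is left to the display layer. The class name, method name, and temporary file are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LogicalOrderStorage {
    // Writes the text to a temporary UTF-8 file and reads it back unchanged.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("logical-order", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Arabic greeting stored in key-press (logical) order, with no reordering or shaping applied
        String logical = "\u0645\u0631\u062D\u0628\u0627 \u0628\u0643\u0645";
        String loaded = roundTrip(logical);
        System.out.println("Unchanged after round trip: " + logical.equals(loaded));
    }
}
```

Naming the charset explicitly avoids depending on the platform default encoding, which differs between systems and Java versions.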


25 of 36

Glyph Shapers in Bidirectional and Shaped Scripts:

  • Glyph Shaping in Bidirectional Scripts:
    • Contextual Analysis.
    • Ligatures and Joining.
    • Positional Variants.
  • Glyph Shaping in Shaped Scripts:
    • Contextual Analysis.
    • Ligatures and Substitutions.
    • Complex Shaping Rules.


26 of 36

ICU Examples for Complex Script Shaping and Rendering Using Java:

  • Shaping and Rendering Arabic Text in Java:

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;

public class ArabicTextShaping {
    public static void main(String[] args) throws ArabicShapingException {
        // Create an ICU Arabic text shaping object that converts letters
        // to their contextual presentation forms (the constructor requires options)
        ArabicShaping arabicShaper = new ArabicShaping(ArabicShaping.LETTERS_SHAPE);

        // Define the Arabic text to shape and render
        String arabicText = "السلام عليكم"; // 'Assalamu alaikum' in Arabic

        // Shape the Arabic text (shape() declares the checked ArabicShapingException)
        String shapedText = arabicShaper.shape(arabicText);

        // Reorder the shaped text using ICU's BiDi (Bi-Directional) algorithm
        Bidi bidi = new Bidi(shapedText, Bidi.DIRECTION_LEFT_TO_RIGHT);

        // Display the text in visual order
        System.out.println(bidi.writeReordered(Bidi.DO_MIRRORING));
    }
}


27 of 36

Unicode in Other File Formats and Their Handling - JSON File Unicode Handling:

  • Considerations for Unicode handling in JSON files:
    • Encoding.
    • Escaping Special Characters.
    • Unicode Support in JSON Libraries.
    • String Handling
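The escaping consideration can be illustrated with a small helper that rewrites non-ASCII characters as JSON backslash-u escapes, which keeps a JSON document pure ASCII. This is a simplified sketch with hypothetical names; a JSON library would normally handle escaping, and the helper below covers only quotes, backslashes, control characters, and non-ASCII characters.

```java
class JsonUnicodeEscape {
    // Escapes a string for use inside a JSON string literal using ASCII characters only.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                // Quotes and backslashes are prefixed with a backslash
                sb.append('\\').append(c);
            } else if (c < 0x20 || c > 0x7E) {
                // Control and non-ASCII code units become four-digit hex escapes;
                // supplementary characters are emitted as two escaped surrogates
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "Caf\u00E9";
        System.out.println("{\"name\": \"" + escapeNonAscii(name) + "\"}");
    }
}
```

A JSON parser decodes such escapes back to the original characters, so escaped and unescaped UTF-8 forms of the same string are equivalent after parsing.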


28 of 36

Unicode in JSON File Format using Java (1/3):

import java.io.FileWriter;
import java.io.FileReader;
import java.io.IOException;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

public class ContactInfo {
    public static void main(String[] args) {
        JSONObject contactInfo = new JSONObject();

        // Define the contact information
        JSONObject name = new JSONObject();
        name.put("Ethiopic", "ዮናስ ተስፋዬ");
        name.put("Arabic", "يونس تسفاي");
        name.put("Sinhala", "යුනෝස් ටෙස්ෆායි");
        name.put("Japanese", "ユナス・テスファイ");
        name.put("Chinese", "尤纳斯·特斯法伊");
        name.put("Latin", "Yonas Tesfaye");
        contactInfo.put("name", name);


29 of 36

Unicode in JSON File Format using Java (2/3):

        JSONObject emailAddress = new JSONObject();
        emailAddress.put("Ethiopic", "ኢሜይል-ሙከራ@ሁለንአቀፍ-ተቀባይነት-ሙከራ.com");
        emailAddress.put("Arabic", "تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا");
        emailAddress.put("Sinhala", "-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා");
        emailAddress.put("Japanese", "ユナス・テスファイ@ユナス・テスファイ");
        // The address is stored without a "mailto:" URI prefix, like the other entries
        emailAddress.put("Chinese", "電子郵件測試@普遍適用測試.台灣");
        emailAddress.put("Latin", "yonas.tesfaye@domain.com");
        contactInfo.put("email_address", emailAddress);

        JSONObject jobTitle = new JSONObject();
        jobTitle.put("Ethiopic", "ሶፍትዌር አልሚ");
        jobTitle.put("Arabic", "مهندس برمجيات");
        jobTitle.put("Sinhala", "වෘත්තීය ගැටළුවක්");
        jobTitle.put("Japanese", "ソフトウェアエンジニア");
        jobTitle.put("Chinese", "软件工程师");
        jobTitle.put("Latin", "Software Engineer");
        contactInfo.put("job_title", jobTitle);


30 of 36

Unicode in JSON File Format using Java (3/3):

        // Saving data to a JSON file named contactinfo.json.
        // UTF-8 is requested explicitly (Java 11+) instead of relying on the platform default encoding.
        try (FileWriter file = new FileWriter("contactinfo.json", java.nio.charset.StandardCharsets.UTF_8)) {
            file.write(contactInfo.toJSONString());
            System.out.println("Successfully wrote JSON object to file.");
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Opening the file named "contactinfo.json" with the same encoding and accessing the loaded data
        try (FileReader reader = new FileReader("contactinfo.json", java.nio.charset.StandardCharsets.UTF_8)) {
            JSONParser jsonParser = new JSONParser();
            JSONObject loadedData = (JSONObject) jsonParser.parse(reader);

            // Accessing and printing the loaded data
            System.out.println("Name: " + loadedData.get("name"));
            System.out.println("Email Address: " + loadedData.get("email_address"));
            System.out.println("Job Title: " + loadedData.get("job_title"));
        } catch (IOException | ParseException e) {
            e.printStackTrace();
        }
    }
}


31 of 36

Unicode String Manipulations Using Programming Language Specific Libraries and ICU: Java Code Snippet.

import java.util.Iterator;

String text = "こんにちは世界"; // Japanese greeting "Hello, World"

// Character count (in code points, not UTF-16 code units)
int characterCount = text.codePointCount(0, text.length());
System.out.println("Character count: " + characterCount);

// Character iteration and properties
Iterator<Integer> codePointIterator = text.codePoints().iterator();
while (codePointIterator.hasNext()) {
    int codePoint = codePointIterator.next();
    // Character.toChars also handles code points above U+FFFF, which a (char) cast would truncate
    System.out.println("Character: " + new String(Character.toChars(codePoint)));
    System.out.println("Character code point: " + String.format("U+%04X", codePoint));
    System.out.println("Character name: " + Character.getName(codePoint));
    System.out.println("--------------------");
}


32 of 36

Java - Using the java.text.Normalizer Class for Unicode Normalization:

import java.text.Normalizer;

public class UnicodeStringManipulationJava {
    public static void main(String[] args) {
        String input = "Café";

        // Normalize the string to NFC form
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);

        // Remove diacritical marks: decompose to NFD first so that the marks
        // become separate code points, then strip them. In NFC, "é" is a single
        // precomposed code point and \p{M} would not match anything.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String withoutDiacritics = decomposed.replaceAll("\\p{M}", "");

        System.out.println("Normalized: " + normalized);
        System.out.println("Without Diacritics: " + withoutDiacritics);
    }
}


33 of 36

Java - Using ICU (icu4j) for Unicode Normalization and Case Folding:

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;

public class UnicodeStringManipulationICUJava {
    public static void main(String[] args) {
        String input = "Café";

        // Normalize the string to NFC form using ICU
        Normalizer2 normalizer = Normalizer2.getNFCInstance();
        String normalized = normalizer.normalize(input);

        // Case-fold the normalized string for caseless matching
        String caseFolded = UCharacter.foldCase(normalized, true);

        // Remove diacritical marks using ICU: decompose, drop marks, recompose
        Transliterator diacriticRemover = Transliterator.getInstance("Any-NFD; [:M:] Remove; NFC");
        String withoutDiacritics = diacriticRemover.transform(normalized);

        System.out.println("Normalized: " + normalized);
        System.out.println("Case Folded: " + caseFolded);
        System.out.println("Without Diacritics: " + withoutDiacritics);
    }
}


34 of 36

ICU Library in Java for Text Collation:

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class StringComparison {
    // Compares two strings using the collation rules of the given locale
    public static String compareStrings(String text1, String text2, String locale) {
        Collator collator = Collator.getInstance(new ULocale(locale));
        int result = collator.compare(text1, text2);

        if (result < 0) {
            return text1 + " comes before " + text2;
        } else if (result > 0) {
            return text1 + " comes after " + text2;
        } else {
            return text1 + " is equal to " + text2;
        }
    }

    public static void main(String[] args) {
        String string1 = "تفاحة"; // 'tuffaha' in Arabic, Apple in English
        String string2 = "موز"; // 'mawz' in Arabic, Banana in English

        String comparison = compareStrings(string1, string2, "ar");
        System.out.println("Comparison Result: " + comparison);
    }
}


35 of 36

Reference:

  • Unicode Consortium. "Glyph Shaping." In The Unicode Standard, Version 15.0. Unicode Consortium. 2022. Retrieved October 21, 2023, from https://www.unicode.org/versions/Unicode15.0.0/.
  • Python Software Foundation. (2023). Python 3.12 Documentation. Retrieved October 21, 2023, from https://docs.python.org/3/.
  • International Business Machines Corporation (IBM). (2023). ICU Documentation. Retrieved October 21, 2023, from https://unicode-org.github.io/icu/.


36 of 36

Author:

  • Dessalegn Mequanint Yehuala, dessalegn.mequanint@aau.edu.et.
