1 of 23

Universal Acceptance (UA) Micro-Learning Module: Module 12 - Unicode Support in Operating Systems.

Instructor Guide

1st Edition.

© 2024 Creative Commons License - Attribution 4.0 International (CC BY 4.0).

Universal Acceptance

Unicode Support in Operating Systems UA Micro-Learning Module Objectives:

  • In this module, we will explore the fundamental concepts and mechanisms that underpin Unicode support in contemporary operating systems. Understanding how operating systems handle Unicode is essential for ensuring proper handling, storage, and display of multilingual text.
  • At the end of this module, students should be able to:
    • Understand the significance of Unicode in operating systems;
    • Identify examples of operating systems that offer Unicode support and handle multilingual text consistently;
    • Explore Unicode string manipulation functions, collation, and sorting mechanisms in operating systems and understand their impact on efficient text processing and comparison;
    • Gain knowledge of the operating system APIs and libraries available for converting and handling Internationalized Domain Names (IDNs); and
    • Examine file system support for Unicode, understanding how different file systems handle Unicode file names and metadata.


Note About the Use of Unicode String Literals:

  • Typing the Unicode string literals used in the examples requires an input method (a virtual or physical keyboard) suited to the script they are written in. If no such input method is available, alternatives such as language translation tools can be used to produce the literals.
  • Instructors are free to replace the literals in the examples with strings from a script of their choice.


Why do we need Unicode in Operating Systems?

  • The following are some of the reasons why we need Unicode in operating systems:
    • Multilingual Support.
    • Character Encoding Standard.
    • File System Management.
    • Global Compatibility.
    • Text Rendering and Display.
    • Localization and Internationalization.
    • Future-Proofing and Scalability.


Some Examples of Operating Systems that Offer Unicode Support:

  • The following are some examples of operating systems that offer Unicode support:
    • Linux.
    • FreeBSD.
    • Windows.
    • macOS.
    • Android.
    • iOS.
    • Chrome OS.


Unicode String Manipulation Functions, Collation and Sorting in Operating Systems:

  • Unicode String Manipulation Functions, Collation, and Sorting are essential features in operating systems that enable efficient handling and processing of text in different languages and scripts:
    • Unicode String Manipulation Functions.
    • Collation.
    • Sorting.


Example on Setting the Locale of a Linux Operating System:

  • The following are the general steps to set the locale on a Linux system:
    • Check Available Locales:
      • Before setting the locale, you may want to check which locales are available on your system.
      • You can do this by running the command:

locale -a

    • Set the Locale:
      • Export the locale variable in the current shell session, for example to select Arabic (Morocco) with UTF-8 encoding:

export LC_ALL=ar_MA.UTF-8

    • Permanently Set the Locale:
      • To make the change permanent, add the export command to your shell profile file (for example, ~/.bashrc or ~/.profile).
    • Verify the Locale:
      • You can verify that the locale has been set by running:

locale
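The steps above rely on the precedence that POSIX defines among the locale environment variables: LC_ALL overrides the per-category variables (such as LC_CTYPE or LC_COLLATE), which in turn override LANG. The following Python sketch illustrates that lookup order. The function name `effective_locale` is hypothetical and the code only models the rule; it does not change any system setting.

```python
import os


def effective_locale(env, category="LC_CTYPE"):
    """Return the locale name a POSIX program would pick for one category."""
    # POSIX precedence: LC_ALL first, then the category variable, then LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # fallback when nothing is set: the default "C" (POSIX) locale


if __name__ == "__main__":
    # Show what the current environment selects for character handling.
    print(effective_locale(os.environ))
```

This is why `export LC_ALL=ar_MA.UTF-8` takes effect even when LANG is set to another locale.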


Unicode-aware Collation in Operating Systems:

  • Key considerations for Unicode-aware Collation in Operating Systems:
    • Unicode Collation Algorithm (UCA).
    • Tailoring and Locale-Specific Rules.
    • Linguistic Data and Libraries.
    • Language-Specific Collation Tables.
    • Customization.
    • Collation Tailoring for Complex Scripts.
    • Default Unicode Collation Element Table (DUCET).
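To make the idea of locale-specific tailoring concrete, the following Python sketch sorts a few strings first by code point and then with a toy tailoring for Swedish, where å, ä and ö follow z. This is only an illustration under simplifying assumptions: the names `SWEDISH_ALPHABET` and `swedish_key` are hypothetical, and the code is not an implementation of the Unicode Collation Algorithm, which uses multi-level collation elements.

```python
import unicodedata

# Toy tailoring: the Swedish alphabet order, in which å, ä, ö come after z.
SWEDISH_ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")
RANK = {ch: index for index, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Build a sort key from the rank of each letter in the Swedish alphabet."""
    normalized = unicodedata.normalize("NFC", word).casefold()
    # Characters outside the alphabet sort after all known letters.
    return [RANK.get(ch, len(RANK)) for ch in normalized]


if __name__ == "__main__":
    words = ["ö", "z", "a", "ä", "å"]
    print(sorted(words))                   # code point order: a z ä å ö
    print(sorted(words, key=swedish_key))  # tailored order:   a z å ä ö
```

Production code would normally rely on a collation library or the system locale rather than a hand-written table.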


Examples of Unicode String Manipulation Functions Performed by Operating Systems (1/5):

  • Unicode String Manipulation Functions Performed by Operating Systems:
    • String Length:
      • Functions to determine the length of a Unicode string.
      • Example:

$ echo -n "تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا" | wc -m

        • The `wc -m` command counts characters rather than bytes, so in a UTF-8 locale each Unicode character is counted once regardless of how many bytes encode it.
        • Output: 49
    • Concatenation:
      • Concatenate or join multiple Unicode strings together.
      • Example:

$ echo "Arabic Email Address: " > file1

$ echo "تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا" >> file1

$ cat file1

Output: Arabic Email Address:

تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا
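The difference between counting characters and counting bytes can also be shown in Python. This is a minimal sketch, with the hypothetical helper `text_lengths`, that reports the number of code points and the number of UTF-8 bytes for the same address and then concatenates it with a label.

```python
def text_lengths(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


address = "تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا"

if __name__ == "__main__":
    code_points, utf8_bytes = text_lengths(address)
    # Arabic letters take two bytes each in UTF-8, so the byte count is larger.
    print(code_points, utf8_bytes)
    # Concatenation joins the two strings code point by code point.
    print("Arabic Email Address: " + address)
```

The code point count corresponds to what `wc -m` reports in a UTF-8 locale, while the byte count corresponds to `wc -c`.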


Examples of Unicode String Manipulation Functions Performed by Operating Systems (2/5):

  • Unicode String Manipulation Functions Performed by Operating Systems:
    • Substring Extraction (String Slicing):
      • Extract a substring from a Unicode string.
      • Example:

$ echo "تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا" | cut -d '@' -f2

cut: This command cuts the text into fields based on a delimiter.

-d '@': This specifies the delimiter as "@" (at symbol).

-f2: This tells cut to print the second field (everything after the "@").

Output: تجربة-القبول-الشامل.موريتانيا

    • Case Conversion:
      • Convert a Unicode string to uppercase or lowercase.
      • Example (ASCII letters):

$ echo "Hello, 世界" | tr '[:lower:]' '[:upper:]'

Output: HELLO, 世界

      • GNU `tr` processes text byte by byte, so it converts only single-byte (ASCII) letters. For other scripts, such as Greek, a Unicode-aware tool is needed; the following example uses Python 3:

$ echo "αγορί" | python3 -c "import sys; print(sys.stdin.read().upper(), end='')"

Output: ΑΓΟΡΊ
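The same two operations can be sketched in Python, whose string methods are Unicode-aware. The helper names `domain_part` and `same_ignoring_case` are hypothetical. Note that `domain_part` behaves like `cut -d '@' -f2` only for addresses that contain exactly one "@".

```python
def domain_part(address):
    """Return the text after the first "@" (empty if there is no "@")."""
    _local, _separator, domain = address.partition("@")
    return domain


def same_ignoring_case(first, second):
    """Compare two strings without regard to case, using Unicode case folding."""
    return first.casefold() == second.casefold()


if __name__ == "__main__":
    print(domain_part("تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا"))
    print("αγορί".upper())                        # Greek letters are converted too
    print("straße".upper())                       # one letter may become two: STRASSE
    print(same_ignoring_case("Straße", "STRASSE"))
```

Case folding (`casefold`) is generally a better choice than `lower` or `upper` for comparisons, because case mappings are not always one-to-one.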


Examples of Unicode String Manipulation Functions Performed by Operating Systems (3/5):

  • Unicode String Manipulation Functions Performed by Operating Systems:
    • Character Encoding Conversion:
      • Convert Unicode strings between different character encodings.
      • Example:

$ echo -n "تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتاني" | iconv -f UTF-8 -t UTF-16 | hexdump -C

        • The `iconv` command converts the string from UTF-8 to UTF-16, and `hexdump -C` displays the resulting byte sequence. The leading bytes `ff fe` are the byte order mark (BOM), which indicates little-endian byte order.
        • Output:

00000000 ff fe 2a 06 2c 06 31 06 28 06 29 06 2d 00 28 06 |..*.,.1.(.).-.(.|

00000010 31 06 4a 06 2f 06 2d 00 27 06 44 06 43 06 2a 06 |1.J./.-.'.D.C.*.|

00000020 31 06 48 06 46 06 4a 06 40 00 2a 06 2c 06 31 06 |1.H.F.J.@.*.,.1.|

00000030 28 06 29 06 2d 00 27 06 44 06 42 06 28 06 48 06 |(.).-.'.D.B.(.H.|

00000040 44 06 2d 00 27 06 44 06 34 06 27 06 45 06 44 06 |D.-.'.D.4.'.E.D.|

00000050 2e 00 45 06 48 06 31 06 4a 06 2a 06 27 06 46 06 |..E.H.1.J.*.'.F.|

00000060 4a 06 20 00 |J. .|

00000064
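The structure of the dump above can be reproduced in Python. This sketch, with the hypothetical helpers `to_utf16_le_with_bom` and `hex_bytes`, encodes a short string as little-endian UTF-16 and prepends the BOM explicitly, so the result does not depend on the byte order of the machine running it.

```python
import codecs


def to_utf16_le_with_bom(text):
    """Encode text as little-endian UTF-16, preceded by the byte order mark."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def hex_bytes(data):
    """Format bytes as space-separated two-digit hex values, like hexdump."""
    return " ".join(f"{byte:02x}" for byte in data)


if __name__ == "__main__":
    # First word of the address used in the slides.
    print(hex_bytes(to_utf16_le_with_bom("تجربة")))
    # prints: ff fe 2a 06 2c 06 31 06 28 06 29 06
```

Each Arabic letter occupies two bytes, with the low byte first; for example, U+062A appears as `2a 06`.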


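Examples of Unicode String Manipulation Functions Performed by Operating Systems (4/5):

  • Unicode String Manipulation Functions Performed by Operating Systems:
    • Pattern Matching:
      • Using Regular Expressions.
      • Example:

$ echo "Hello, 世界" | grep -oP "\p{Han}+"

      • The `grep -P` command performs pattern matching using Unicode character properties. Here `\p{Han}+` matches a run of Han characters, and `-o` prints only the matched text.
      • Output: 世界
    • String Comparison:
      • Compares two Unicode strings. The `=` operator inside `[[ ]]` tests for exact equality, so the two spellings below, which differ only by a diacritic (fathatan), are not equal. The `<` and `>` operators inside `[[ ]]` order strings according to the collation rules of the current locale.
      • Example: [Output: Not equal]

#!/bin/bash
string1="مرحبا"
string2="مرحبًا"
if [[ "$string1" = "$string2" ]]; then
  echo "Equal"
else
  echo "Not equal"
fi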
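When an application needs the two spellings in the comparison example to match, one common approach is to remove combining marks before comparing. The following Python sketch shows this under the assumption that ignoring diacritics is acceptable for the use case; the helper name `strip_marks` is hypothetical, and collation libraries offer the same effect through comparison strength levels.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (such as Arabic fathatan or Latin accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    string1 = "مرحبا"
    string2 = "مرحبًا"
    print(string1 == string2)                            # exact comparison
    print(strip_marks(string1) == strip_marks(string2))  # diacritics ignored
```

Whether diacritics should be ignored depends on the language and the task, so this kind of folding is best applied to search and matching rather than to stored data.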


Examples of Unicode String Manipulation Functions Performed by Operating Systems (5/5):

  • Unicode String Manipulation Functions Performed by Operating Systems:
    • Normalization:
      • Normalization ensures that equivalent Unicode strings with different representations are transformed into a single canonical form.
      • Example:

$ echo -n "é" | python3 -c "import sys, unicodedata; sys.stdout.write(unicodedata.normalize('NFC', sys.stdin.read()))" | hexdump -C

        • Encoding converters such as `iconv` do not normalize text, so this example uses Python's `unicodedata.normalize`. Whether "é" is entered as the single character U+00E9 or as "e" followed by the combining acute accent U+0301, NFC normalization produces the same precomposed form.
        • Output: 00000000 c3 a9
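To see why normalization matters for comparison, the following Python sketch builds the two canonically equivalent representations of "é" explicitly and compares them before and after normalization. The helper name `same_text` is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent


def same_text(first, second, form="NFC"):
    """Compare two strings after converting both to one normalization form."""
    return unicodedata.normalize(form, first) == unicodedata.normalize(form, second)


if __name__ == "__main__":
    print(precomposed == decomposed)           # False: different code points
    print(same_text(precomposed, decomposed))  # True: equal after normalization
    nfc_bytes = unicodedata.normalize("NFC", decomposed).encode("utf-8")
    print(" ".join(f"{byte:02x}" for byte in nfc_bytes))  # c3 a9
```

NFC yields the composed form and NFD the decomposed form; either works for comparison as long as both strings use the same one.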


Operating Systems APIs or System Calls for Converting and Handling IDNs:

  • Some examples of libraries and operating system APIs for handling IDNs on popular operating systems:
    • libidn (GNU Libidn):
      • GNU Libidn is a widely used library for handling IDNs; it implements the original IDNA 2003 specification.
    • libidn2 (GNU Libidn2):
      • Libidn2 is another popular library for handling IDNs; it implements IDNA 2008 and provides the idn2 command-line utility.
    • Windows API (Windows):
      • The Windows Internationalized Domain Names API (IDNA), which includes the IdnToAscii and IdnToUnicode functions.
    • Core Foundation (macOS):
      • macOS provides IDN support through the Core Foundation framework.


Examples on Handling IDNs in Linux:

  • The following examples use the idn2 command-line utility to convert IDNs on Linux:
    • Prerequisites:
      • Install IDN2 Utility (Debian/Ubuntu):

sudo apt install idn2

      • Example: Encoding.

idn2 "सार्वभौमिक-स्वीकृति-परीक्षण.संगठन"

        • Output: xn--lnfbb8fe3cvkui0de0bcg5hxagsg7d5lwail.xn--i1b6b1a6a2e
      • Example: Decoding.

idn2 -d "xn--lnfbb8fe3cvkui0de0bcg5hxagsg7d5lwail.xn--i1b6b1a6a2e"

        • Output: सार्वभौमिक-स्वीकृति-परीक्षण.संगठन
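The same round trip between the Unicode form (U-label) and the ASCII form (A-label) can be sketched in Python using its built-in `idna` codec. One caveat: that codec implements the older IDNA 2003 rules, whereas idn2 implements IDNA 2008, so the two can disagree for some names. The helper names `to_ascii` and `to_unicode` are hypothetical.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to its ASCII-compatible (xn--) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an ASCII-compatible (xn--) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    name = "सार्वभौमिक-स्वीकृति-परीक्षण.संगठन"
    ascii_name = to_ascii(name)
    print(ascii_name)              # each label is encoded separately
    print(to_unicode(ascii_name))  # back to the Unicode form
```

Applications that need IDNA 2008 behavior in Python typically use a dedicated third-party package instead of the built-in codec.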


EAI Compatibility of Basic Mail Command Line Utilities:

  • Various mail command-line utilities offer essential functionality for sending and managing emails from terminals or character-based screens.
  • In Linux, notable among these utilities are "mail" and "sendmail".
  • The mail command in Linux is a command-line utility that allows users to send and receive email from the command line interface.
  • It is a basic mail user agent (MUA) that provides a simple way to send and read emails without the need for a graphical email client.
  • It's important to note that the mail command relies on a properly configured mail transfer agent (MTA) on your system, such as Sendmail or Postfix.
  • MTAs such as Postfix have introduced EAI support; Postfix supports it from version 3.0 onwards.
  • Postfix implements the SMTPUTF8 extension (RFC 6531) that EAI requires; the feature is controlled by the smtputf8_enable configuration parameter.
  • Whichever MTA you use, make sure that its version includes EAI support.
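One way to check whether a mail server includes EAI support is to look for the SMTPUTF8 keyword in its reply to the EHLO command. The following Python sketch parses such a reply; the function name `advertises_smtputf8` and the sample reply (including the host name mail.example.org) are hypothetical and only serve as an illustration.

```python
def advertises_smtputf8(ehlo_reply):
    """Return True if an EHLO reply lists the SMTPUTF8 keyword (RFC 6531)."""
    for line in ehlo_reply.splitlines():
        # Reply lines look like "250-KEYWORD ..." or, for the last one, "250 KEYWORD ...".
        if not line.startswith("250") or len(line) < 5:
            continue
        words = line[4:].split()
        if words and words[0].upper() == "SMTPUTF8":
            return True
    return False


# Hypothetical reply from an EAI-capable server.
SAMPLE_REPLY = "250-mail.example.org\r\n250-PIPELINING\r\n250-8BITMIME\r\n250 SMTPUTF8\r\n"

if __name__ == "__main__":
    print(advertises_smtputf8(SAMPLE_REPLY))
```

When connecting to a real server with Python's `smtplib`, calling `ehlo()` followed by `has_extn("smtputf8")` on the `SMTP` object performs the equivalent check.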


Operating Systems APIs or System Calls for Handling EAI in Email Clients:

  • The following are some key aspects of EAI integration in operating systems:
    • Unicode Support.
    • Email Client Support.
    • Input Methods.
    • DNS Support.
    • Validation and Security.
    • Interoperability


File Systems Support for Unicode:

  • The following are some examples of file systems that support Unicode:
    • Ext4 (Fourth Extended File System).
    • NTFS (New Technology File System).
    • APFS (Apple File System).
    • FAT32 (File Allocation Table).
    • exFAT (Extended File Allocation Table).


Example on Unicode Support in File Systems:

  • Example: Creating a File with a Unicode Name in Linux:

touch "日本語ファイル.txt"

    • The touch command creates a file named "日本語ファイル.txt", whose name includes Japanese characters.
  • Example: Accessing a File with a Unicode Name:

cat "日本語ファイル.txt"

    • This command displays the content of the file named "日本語ファイル.txt".
  • Example: Creating a Directory with a Unicode Name in Linux:

mkdir "مجلد عربي"

    • This command creates a directory named "مجلد عربي", whose name uses Arabic characters.
  • Example: Accessing a Directory with a Unicode Name:

cd "مجلد عربي"

    • This command changes the current directory to "مجلد عربي".
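The same file and directory operations can be performed from a program. The following Python sketch, with the hypothetical helper `create_unicode_entries`, creates a file and a directory with Unicode names inside a temporary directory and lists them. It assumes the file system and the Python runtime both use UTF-8 for file names, which is the usual case on current Linux systems.

```python
import tempfile
from pathlib import Path


def create_unicode_entries(base):
    """Create a file and a directory with Unicode names; return the listing."""
    base = Path(base)
    file_path = base / "日本語ファイル.txt"
    file_path.write_text("こんにちは\n", encoding="utf-8")  # like touch, plus content
    dir_path = base / "مجلد عربي"
    dir_path.mkdir()                                        # like mkdir
    return sorted(entry.name for entry in base.iterdir())   # like ls


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in create_unicode_entries(tmp):
            print(name)
```

Working inside a temporary directory keeps the example from leaving files behind.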


Working with Unicode: Case-sensitive vs Case-insensitive vs Case-insensitive but Case-preserving Filename Handling:

  • The three common approaches:
    • Case-Sensitive Filename Handling.
    • Case-Insensitive Filename Handling.
    • Case-Insensitive but Case-Preserving Filename Handling.
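For orientation, ext4 is case-sensitive by default, while NTFS as used by Windows and APFS in its default configuration are case-insensitive but case-preserving. The third approach can be illustrated with a toy Python model: names are matched through a case-folded key, but the spelling used at creation time is what gets stored and listed. The class name `CasePreservingDirectory` is hypothetical, and real file systems use their own case-folding tables and may also normalize names.

```python
class CasePreservingDirectory:
    """Toy directory: case-insensitive matching, original spelling preserved."""

    def __init__(self):
        self._entries = {}  # case-folded name -> name as originally created

    def create(self, name):
        key = name.casefold()
        if key in self._entries:
            # A name differing only in case counts as the same entry.
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name):
        """Return the stored spelling for a name, or None if it is absent."""
        return self._entries.get(name.casefold())

    def listing(self):
        return sorted(self._entries.values())


if __name__ == "__main__":
    directory = CasePreservingDirectory()
    directory.create("Résumé.txt")
    print(directory.lookup("RÉSUMÉ.TXT"))  # found, with the original spelling
    print(directory.listing())
```

A case-sensitive file system would instead use the name itself as the key, so "Résumé.txt" and "RÉSUMÉ.TXT" could coexist as separate files.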




Author:

  • Dessalegn Mequnint Yehuala, dessalegn.mequanint@aau.edu.et
