1 of 46

Universal Acceptance (UA) Micro-Learning Module: Module 3- Unicode in Data Structures and Algorithms in Java.

Instructor Guide

1st Edition.

���© 2024 Creative Commons License - Attribution 4.0 International (CC BY 4.0).

Universal Acceptance

2 of 46

Unicode In Data Structures and Algorithms in Java- Micro-Learning Module Objectives.

  • This module aims to enhance students' understanding and proficiency in working with Unicode in data structures and algorithms. The module covers key aspects such as lists, dictionaries (maps), sets, trees, graphs, Unicode Collation Algorithm for Sorting of Unicode Strings, Techniques for Sorting and Searching Local Language Strings within Data Structures, Commonly Used Sorting Algorithms in Data Structures and Algorithms, Binary Searches using Unicode Strings, and Searching through Unicode Text by combining techniques such as Normalization and Collation.

  • At the end of this module, students should be able to:
    • Understand the fundamentals of Unicode and its significance in data structures and algorithms;
    • Learn how to effectively store and manipulate Unicode data within data structures;
    • Develop techniques to perform searching, and sorting operations on Unicode strings; and
    • Acquire knowledge of Unicode-aware algorithms for text parsing, matching, and comparison.

| 2

3 of 46

Note About the Utilization of Unicode String Literals :

  • The Unicode string literals utilized in the code examples necessitate input methods, whether virtual or physical keyboards, suited to the script from which the literals are derived. In the event that these input methods are unavailable, alternatives such as language translation tools can be used.
  • Throughout the module, Unicode string literals derived from various scripts have been used in the examples code, spanning from lesser-known to more widely used ones. This approach aims to broaden learners' exposure to a diverse range of scripts..
  • The module provides the meanings of Unicode string literals used in the example codes in English, along with their transliterations, to aid learners in accurately pronouncing them.
  • Instructors are free to use Unicode literal strings derived from a script of of their choice rather than the ones included in the examples code.

| 3

4 of 46

Implementing Data Structures with Local Language Strings(1/6):

  • Lists:
    • Lists are ordered collections of elements.
    • Example: Creating Lists with Unicode Strings in Java- code snippet:

import java.util.ArrayList;

import java.util.List;

// Create a list of Unicode strings

List<String> unicodeList = new ArrayList<>();

unicodeList.add("ሰላም"); // Ethiopic (Amharic)

unicodeList.add("مرحبا"); // ‘Marhaba’ in Arabic, hello or welcome in English

unicodeList.add("සුබ උදෑසනක්"); // Suba Udaesank’ in Sinhala

unicodeList.add("你好"); // ‘nǐ hǎo’ in Chinese

| 4

5 of 46

Implementing Data Structures with Local Language Strings(2/6):

  • Dictionaries:
    • Store key-value pairs.
    • Example : Creating Dictionaries(maps) with Unicode strings in Java-Code Snippet:

,

import java.util.HashMap;

import java.util.Map;

// Create a dictionary (map) with Unicode strings

Map<String, String> unicodeDictionary = new HashMap<>();

unicodeDictionary.put("ሰላም", "Hello"); // Ethiopic (Amharic)

unicodeDictionary.put("مرحبا", "Bonjour"); // Arabic

unicodeDictionary.put("සුබ උදෑසනක්", "Good morning"); // Sinhala

unicodeDictionary.put("你好", "Ni hao"); // Chinese

| 5

6 of 46

Implementing Data Structures with Local Language Strings(3/6):

  • Sets:
    • Sets are unordered collections of unique elements.
    • Example: Creating Sets with Unicode Strings in Java- Code Snippet:

,

import java.util.HashSet;

import java.util.Set;

// Create a set of Unicode strings

Set<String> unicodeSet = new HashSet<>();

unicodeSet.add("ሰላም"); // Ethiopic (Amharic)

unicodeSet.add("مرحبا"); // Arabic

unicodeSet.add("සුබ උදෑසනක්"); // Sinhala

unicodeSet.add("你好"); // Chinese

| 6

7 of 46

Implementing Data Structures with Local Language Strings(4/6):

  • Trees:
    • Trees are hierarchical data structures where each node can have child nodes. Unicode enables you to use local language strings as labels for nodes in a tree
    • Example: Creating Trees with Unicode Strings in Java-Code Snippet:

,

public static class TreeNode {

String data;

TreeNode left;

TreeNode right;

public TreeNode(String data) {

this.data = data;

this.left = null;

this.right = null;

}

}

// Create a tree with Unicode strings

TreeNode root = new TreeNode("ሰላም"); // Ethiopic (Amharic)

root.left = new TreeNode("مرحبا"); // Arabic

root.right = new TreeNode("සුබ උදෑසනක්"); // Sinhala

root.left.left = new TreeNode("你好"); // Chinese

| 7

8 of 46

Implementing Data Structures with Local Language Strings(5/6):

  • Graphs:
    • Graphs consist of vertices and edges that represent relationships or connections. Unicode supports local language strings as labels for vertices and edges in a graph.
    • This enables you to represent relationships or connections labeled in multiple languages.
    • For example, you might create a social network graph where the vertices represent individuals with names in different languages.

,

| 8

9 of 46

Implementing Data Structures with Local Language Strings(6/6):

  • Graphs:
    • Example: Creating Graphs with Unicode Strings in Java-Code Snippet:

,

import java.util.ArrayList;

import java.util.HashMap;

import java.util.List;

import java.util.Map;

// Create a graph with Unicode strings

Map<String, List<String>> graph = new HashMap<>();

// Add vertices

addVertex(graph, "ሰላም"); // Ethiopic (Amharic)

addVertex(graph, "مرحبا"); // Arabic

addVertex(graph, "සුබ උදෑසනක්"); // Sinhala

addVertex(graph, "你好"); // Chinese

// Add edges

addEdge(graph, "ሰላም", "مرحبا");

addEdge(graph, "مرحبا", "සුබ උදෑසනක්");

addEdge(graph, "مرحبا", "你好");

| 9

10 of 46

Unicode Collation Algorithm for Sorting of Unicode Strings:

  • The Unicode Collation Algorithm (UCA) is a set of rules defined by the Unicode Consortium for sorting Unicode strings.
    • Code Point Comparison
    • Diacritic Sensitivity
    • Case Sensitivity
    • Linguistic Context
    • Collation Element
    • Collation Rules

| 10

11 of 46

How to Sort Strings Using the Unicode Collation Algorithm (UCA)(1/2):

Example: Sorting of Strings Using UCA in Java:

,

import java.text.Collator;

import java.util.Arrays;

import java.util.Locale;

public class UnicodeCollationExample {

public static void main(String[] args) {

// Create a Collator instance for the desired locale

Collator collator = Collator.getInstance(Locale.getDefault());

// Array of Unicode strings to be sorted

String[] strings = {

"अनुच्छेद", // Hindi: Paragraph

"Éclair", // French: Pastry

"Österreich", // German: Austria

"中国", // Chinese: China

"Россия", // Russian: Russia

"日本", // Japanese: Japan

"ሉሲ" //Ethiopian: Lucy

};

// Sort the strings using the collator

Arrays.sort(strings, collator);

// Print the sorted strings

for (String str : strings) {

System.out.println(str);

}

}

}

| 11

12 of 46

How to Sort Strings Using the Unicode Collation Algorithm (UCA)(2/2):

Example: Sorting of Strings Using UCA in Java -Output:

,

Output:

Éclair

Österreich

Россия

ሉሲ

अनुच्छेद

中国

日本

| 12

13 of 46

Some of Commonly Used Sorting Algorithms in Data Structures and Algorithms(1/4):

  • The following are some of the sorting algorithms commonly used in data structures:
    • Bubble Sort
    • Selection Sort
    • Insertion Sort:
    • Merge Sort
    • Quick Sort
    • Heap Sort
    • Tim Sort

| 13

14 of 46

Some of Commonly Used Sorting Algorithms in Data Structures and Algorithms(2/4):

  • Bubble Sort:
    • For Unicode strings, you can use Unicode-aware comparison functions to compare and swap strings correctly based on their Unicode code points or collation order.
    • Additionally, consider using Unicode normalization to ensure consistent handling of composed and decomposed characters during the comparison process.
  • Selection Sort:
    • Similar to Bubble Sort, use Unicode-aware comparison functions to compare and select the appropriate string based on Unicode code points or collation order.
    • Apply Unicode normalization to ensure consistent sorting results.

| 14

15 of 46

Some of Commonly Used Sorting Algorithms in Data Structures and Algorithms(3/4):

  • Insertion Sort:
    • Utilize Unicode-aware comparison functions to compare and insert strings into their correct positions within the sorted portion of the list.
    • Normalize Unicode strings using Unicode normalization to maintain consistency during the insertion process.
  • Merge Sort:
    • When merging Unicode string lists during the merge step, employ Unicode-aware comparison functions to merge the lists based on Unicode code points or collation order.
    • Normalize Unicode strings to ensure consistent results during the merging process.

| 15

16 of 46

Some of Commonly Used Sorting Algorithms in Data Structures and Algorithms(4/4):

  • Heap Sort:
    • Use Unicode-aware comparison functions or collation keys when building the heap and extracting elements to maintain the correct sorting order.
    • Normalize Unicode strings during the heap construction phase to ensure consistent results.
  • Tim Sort:
    • Tim Sort is a hybrid sorting algorithm that works well with various types of data, including Unicode strings.
    • Employ Unicode-aware comparison functions or collation keys during the sorting process to ensure accurate sorting of Unicode strings.
    • Normalize Unicode strings for consistent sorting results.

| 16

17 of 46

Example: Unicode-aware Bubble Sort in Java(1/3):

,

import java.text.Collator;

import java.util.Arrays;

import java.util.Locale;

public class UnicodeAwareBubbleSort {

public static void main(String[] args) {

String[] strings = {

"সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangla (Bengali)

"универсальное-принятие-тест.москва", // Cyrillic

"सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन", // Devanagari

"უნივერსალური-თავსობადობის-ტესტი.გე", // Georgian

"καθολική-αποδοχή-δοκιμή.ευ", // Greek

"تجربة-القبول-الشامل.موريتانيا", // Arabic

"普遍适用测试.我爱你", // Simplified Chinese

"ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා" // Sinhala

};

unicodeAwareBubbleSort(strings);

// Print the sorted strings

for (String string : strings) {

System.out.println(string);

}

}

| 17

18 of 46

Example: Unicode-aware Bubble Sort in Java(2/3):

,

public static void unicodeAwareBubbleSort(String[] strings) {

int n = strings.length;

for (int i = 0; i < n; i++) {

// Flag to check if any swaps occurred in this pass

boolean swapped = false;

for (int j = 0; j < n - i - 1; j++) {

// Compare and swap if necessary using Unicode-aware comparison

if (compareStrings(strings[j], strings[j + 1]) > 0) {

String temp = strings[j];

strings[j] = strings[j + 1];

strings[j + 1] = temp;

swapped = true;

}

}

// If no swaps occurred in this pass, the array is already sorted

if (!swapped) {

break;

}

}

}

| 18

19 of 46

Example: Unicode-aware Bubble Sort in Java(3/3):

,

public static int compareStrings(String str1, String str2) {

// Compare the strings using the default locale-specific collator

Collator collator = Collator.getInstance(Locale.getDefault());

return collator.compare(str1, str2);

}

}

Output:

καθολική-αποδοχή-δοκιμή.ευ

универсальное-принятие-тест.москва

تجربة-القبول-الشامل.موريتانيا

सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन

সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি

ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා

უნივერსალური-თავსობადობის-ტესტი.გე

普遍适用测试.我爱你

| 19

20 of 46

Example: Unicode-aware Selection Sort in Java(1/3):

,

import java.text.Normalizer;

import java.util.Arrays;

public class UnicodeAwareSelectionSort {

public static void main(String[] args) {

String[] strings = {

"সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangla (Bengali)

"универсальное-принятие-тест.москва", // Cyrillic

"सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन", // Devanagari

"უნივერსალური-თავსობადობის-ტესტი.გე", // Georgian

"καθολική-αποδοχή-δοκιμή.ευ", // Greek

"تجربة-القبول-الشامل.موريتانيا", // Arabic

"普遍适用测试.我爱你", // Simplified Chinese

"ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා" // Sinhala

};

selectionSort(strings);

// Print the sorted strings

for (String string : strings) {

System.out.println(string);

}

}

| 20

21 of 46

Example: Unicode-aware Selection Sort in Java(2/3):

,

public static void selectionSort(String[] arr) {

int n = arr.length;

for (int i = 0; i < n; i++) {

int minIndex = i;

for (int j = i + 1; j < n; j++) {

// Compare using Unicode-aware comparison

if (compareStrings(arr[j], arr[minIndex]) < 0) {

minIndex = j;

}

}

if (minIndex != i) {

// Swap elements

String temp = arr[i];

arr[i] = arr[minIndex];

arr[minIndex] = temp;

}

}

}

| 21

22 of 46

Example: Unicode-aware Selection Sort in Java(3/3):

,

public static int compareStrings(String str1, String str2) {

// Normalize strings for consistent comparison

String str1Norm = Normalizer.normalize(str1, Normalizer.Form.NFD);

String str2Norm = Normalizer.normalize(str2, Normalizer.Form.NFD);

// Use case-insensitive comparison using Unicode code points

return str1Norm.compareToIgnoreCase(str2Norm);

}

}

Output:

καθολική-αποδοχή-δοκιμή.ευ

универсальное-принятие-тест.москва

تجربة-القبول-الشامل.موريتانيا

सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन

সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি

ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා

უნივერსალური-თავსობადობის-ტესტი.გე

普遍适用测试.我爱你

| 22

23 of 46

Example: Unicode-aware Insertion Sort in Java(1/3):

,

import java.text.Normalizer;

import java.util.Arrays;

import java.util.Comparator;

public class StringComparator {

public static String[] insertionSort(String[] arr) {

int n = arr.length;

for (int i = 1; i < n; i++) {

String key = arr[i];

int j = i - 1;

while (j >= 0 && compareStrings(arr[j], key) > 0) {

arr[j + 1] = arr[j];

j--;

}

arr[j + 1] = key;

}

return arr;

}

| 23

24 of 46

Example: Unicode-aware Insertion Sort in Java(2/3):

,

public static int compareStrings(String str1, String str2) {

// Normalize strings for consistent comparison

String str1Norm = Normalizer.normalize(str1, Normalizer.Form.NFD);

String str2Norm = Normalizer.normalize(str2, Normalizer.Form.NFD);

// Use case-insensitive comparison using Unicode code points

return Character.codePointAt(str1Norm, 0) - Character.codePointAt(str2Norm, 0);

}

| 24

25 of 46

Example: Unicode-aware Insertion Sort in Java(3/3):

,

public static void main(String[] args) {

String[] strings = {

"সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangla (Bengali)

"универсальное-принятие-тест.москва", // Cyrillic

"सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन", // Devanagari

"უნივერსალური-თავსობადობის-ტესტი.გე", // Georgian

"καθολική-αποδοχή-δοκιμή.ευ", // Greek

"تجربة-القبول-الشامل.موريتانيا", // Arabic

"普遍适用测试.我爱你", // Simplified Chinese

"ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා" // Sinhala

};

Arrays.sort(strings, Comparator.comparing(String::new, (str1, str2) -> compareStrings(str1, str2)));

for (String item : strings) {

System.out.println(item);

}

}

}

| 25

26 of 46

Example: Unicode-aware Merge Sort in Java(1/3):

,

import java.text.Normalizer;

import java.util.Arrays;

public class UStringMergeSorter {

public static int compareStrings(String str1, String str2) {

// Normalize strings for consistent comparison

String str1Norm = Normalizer.normalize(str1, Normalizer.Form.NFD);

String str2Norm = Normalizer.normalize(str2, Normalizer.Form.NFD);

// Use case-insensitive comparison using Unicode code points

return Character.compare(str1Norm.charAt(0), str2Norm.charAt(0));

}

public static String[] mergeSort(String[] arr) {

if (arr.length <= 1) {

return arr;

}

int mid = arr.length / 2;

String[] leftHalf = Arrays.copyOfRange(arr, 0, mid);

String[] rightHalf = Arrays.copyOfRange(arr, mid, arr.length);

leftHalf = mergeSort(leftHalf);

rightHalf = mergeSort(rightHalf);

return merge(leftHalf, rightHalf);

}

| 26

27 of 46

Example: Unicode-aware Merge Sort in Java(2/3):

,

public static String[] merge(String[] left, String[] right) {

String[] merged = new String[left.length + right.length];

int i = 0, j = 0, k = 0;

while (i < left.length && j < right.length) {

if (compareStrings(left[i], right[j]) <= 0) {

merged[k++] = left[i++];

} else {

merged[k++] = right[j++];

}

}

while (i < left.length) {

merged[k++] = left[i++];

}

while (j < right.length) {

merged[k++] = right[j++];

}

return merged;

}

| 27

28 of 46

Example: Unicode-aware Merge Sort in Java(3/3):

,

public static void main(String[] args) {

String[] strings = {

"সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangla (Bengali)

"универсальное-принятие-тест.москва", // Cyrillic

"सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन", // Devanagari

"უნივერსალური-თავსობადობის-ტესტი.გე", // Georgian

"καθολική-αποδοχή-δοκιμή.ευ", // Greek

"تجربة-القبول-الشامل.موريتانيا", // Arabic

"普遍适用测试.我爱你", // Simplified Chinese

"ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා" // Sinhala

};

String[] sortedStrings = mergeSort(strings);

for (String item : sortedStrings) {

System.out.println(item);

}

}

}

| 28

29 of 46

Example: Unicode-aware Quick Sort in Java(1/3):

,

import java.text.Normalizer;

import java.util.ArrayList;

import java.util.List;

public class UStringQuickSorter {

public static int compareStrings(String str1, String str2) {

// Normalize strings for consistent comparison

String str1Norm = Normalizer.normalize(str1, Normalizer.Form.NFD);

String str2Norm = Normalizer.normalize(str2, Normalizer.Form.NFD);

// Use case-insensitive comparison using Unicode code points

return Character.compare(str1Norm.charAt(0), str2Norm.charAt(0));

}

public static List<String> quickSort(List<String> arr) {

if (arr.size() <= 1) {

return arr;

}

String pivot = arr.get(arr.size() / 2);

| 29

30 of 46

Example: Unicode-aware Quick Sort in Java(2/3):

,

List<String> left = new ArrayList<>();

List<String> equal = new ArrayList<>();

List<String> right = new ArrayList<>();

for (String item : arr) {

int comparison = compareStrings(item, pivot);

if (comparison < 0) {

left.add(item);

} else if (comparison == 0) {

equal.add(item);

} else {

right.add(item);

}

}

left = quickSort(left);

right = quickSort(right);

List<String> sortedList = new ArrayList<>();

sortedList.addAll(left);

sortedList.addAll(equal);

sortedList.addAll(right);

return sortedList;

}

| 30

31 of 46

Example: Unicode-aware Quick Sort in Java(3/3):

,

public static void main(String[] args) {

List<String> strings = new ArrayList<>();

strings.add("সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি"); // Bangla (Bengali)

strings.add("универсальное-принятие-тест.москва"); // Cyrillic

strings.add("सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन"); // Devanagari

strings.add("უნივერსალური-თავსობადობის-ტესტი.გე"); // Georgian

strings.add("καθολική-αποδοχή-δοκιμή.ευ"); // Greek

strings.add("تجربة-القبول-الشامل.موريتانيا"); // Arabic

strings.add("普遍适用测试.我爱你"); // Simplified Chinese

strings.add("ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා"); // Sinhala

List<String> sortedStrings = quickSort(strings);

for (String item : sortedStrings) {

System.out.println(item);

}

}

}

| 31

32 of 46

Example: Unicode-aware Heap Sort in Java(1/3):

,

import java.text.Normalizer;

import java.util.Arrays;

public class UStringHeapSorter {

public static int compareStrings(String str1, String str2) {

// Normalize strings for consistent comparison

String str1Norm = Normalizer.normalize(str1, Normalizer.Form.NFD);

String str2Norm = Normalizer.normalize(str2, Normalizer.Form.NFD);

// Use case-insensitive comparison using Unicode code points

return Character.compare(str1Norm.charAt(0), str2Norm.charAt(0));

}

public static void heapify(String[] arr, int n, int i) {

int largest = i;

int left = 2 * i + 1;

int right = 2 * i + 2;

if (left < n && compareStrings(arr[left], arr[largest]) > 0) {

largest = left;

}

if (right < n && compareStrings(arr[right], arr[largest]) > 0) {

largest = right;

}

| 32

33 of 46

Example: Unicode-aware Heap Sort in Java(2/3):

,

if (largest != i) {

String temp = arr[i];

arr[i] = arr[largest];

arr[largest] = temp;

heapify(arr, n, largest);

}

}

public static String[] heapSort(String[] arr) {

int n = arr.length;

// Build max heap

for (int i = n / 2 - 1; i >= 0; i--) {

heapify(arr, n, i);

}

// Heap sort

for (int i = n - 1; i > 0; i--) {

String temp = arr[0];

arr[0] = arr[i];

arr[i] = temp;

heapify(arr, i, 0);

}

return arr;

}

| 33

34 of 46

Example: Unicode-aware Heap Sort in Java(3/3):

,

public static void main(String[] args) {

String[] strings = {

"সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangla (Bengali)

"универсальное-принятие-тест.москва", // Cyrillic

"सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन", // Devanagari

"უნივერსალური-თავსობადობის-ტესტი.გე", // Georgian

"καθολική-αποδοχή-δοκιμή.ευ", // Greek

"تجربة-القبول-الشامل.موريتانيا", // Arabic

"普遍适用测试.我爱你", // Simplified Chinese

"ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා" // Sinhala

};

strings = heapSort(strings);

for (String item : strings) {

System.out.println(item);

}

}

}

| 34

35 of 46

Example: Unicode-aware Tim Sort in Java(1/3):

,

import java.text.Normalizer;

import java.util.Arrays;

public class UStringTimSorter {

public static int compareStrings(String str1, String str2) {

// Normalize strings for consistent comparison

String str1Norm = Normalizer.normalize(str1, Normalizer.Form.NFD);

String str2Norm = Normalizer.normalize(str2, Normalizer.Form.NFD);

// Use case-insensitive comparison using Unicode code points

return Character.compare(str1Norm.charAt(0), str2Norm.charAt(0));

}

public static void insertionSort(String[] arr, int left, int right) {

for (int i = left + 1; i <= right; i++) {

String key = arr[i];

int j = i - 1;

while (j >= left && compareStrings(arr[j], key) > 0) {

arr[j + 1] = arr[j];

j--;

}

arr[j + 1] = key;

}

}

| 35

36 of 46

Example: Unicode-aware Tim Sort in Java(2/3):

,

public static void merge(String[] arr, int left, int mid, int right) {

int len1 = mid - left + 1;

int len2 = right - mid;

String[] leftArr = Arrays.copyOfRange(arr, left, mid + 1);

String[] rightArr = Arrays.copyOfRange(arr, mid + 1, right + 1);

int i = 0, j = 0, k = left;

while (i < len1 && j < len2) {

if (compareStrings(leftArr[i], rightArr[j]) <= 0) {

arr[k] = leftArr[i];

i++;

} else {

arr[k] = rightArr[j];

j++;

}

k++;

}

while (i < len1) {

arr[k] = leftArr[i];

i++;

k++;

}

| 36

37 of 46

Example: Unicode-aware Tim Sort in Java(2/3):

,

while (j < len2) {

arr[k] = rightArr[j];

j++;

k++;

}

}

public static String[] timSort(String[] arr) {

int n = arr.length;

int minRun = 32;

for (int i = 0; i < n; i += minRun) {

insertionSort(arr, i, Math.min(i + minRun - 1, n - 1));

}

int size = minRun;

while (size < n) {

for (int start = 0; start < n; start += size * 2) {

int mid = Math.min(start + size - 1, n - 1);

int end = Math.min(start + size * 2 - 1, n - 1);

merge(arr, start, mid, end);

}

| 37

38 of 46

Example: Unicode-aware Tim Sort in Java(3/3):

,

size *= 2;

}

return arr;

}

public static void main(String[] args) {

String[] strings = {

"সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangla (Bengali)

"универсальное-принятие-тест.москва", // Cyrillic

"सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन", // Devanagari

"უნივერსალური-თავსობადობის-ტესტი.გე", // Georgian

"καθολική-αποδοχή-δοκιμή.ευ", // Greek

"تجربة-القبول-الشامل.موريتانيا", // Arabic

"普遍适用测试.我爱你", // Simplified Chinese

"ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා" // Sinhala

};

strings = timSort(strings);

for (String item : strings) {

System.out.println(item);

}

}

}

| 38

39 of 46

Example: Binary Searches in Unicode Strings in Java:(1/3):

,

import java.util.Arrays;

public class LexicographicBinarySearch {

public static int binarySearchUnicode(String[] arr, String target) {

int low = 0;

int high = arr.length - 1;

while (low <= high) {

int mid = (low + high) / 2;

String midVal = arr[mid];

if (target.equals(midVal)) {

return mid;

} else if (target.compareTo(midVal) < 0) {

high = mid - 1;

} else {

low = mid + 1;

}

}

return -1; // Target not found

}

| 39

40 of 46

Example: Binary Searches in Unicode Strings in Java:(2/3):

,

public static void main(String[] args) {

String[] strings = {

"تجربة-بريد-الكتروني@تجربة-القبول-الشامل.موريتانيا", // Arabic Unicode@IDN.IDN email.

"Էլփոստ-թեստ@համընդհանուր-ընկալում-թեստ.հայ", // Armenian Unicode@IDN.IDN email.

"ই-মেইল-পরীক্ষা@সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangala Unicode@IDN.IDN email.

"ფოსტის-ტესტი@უნივერსალური-თავსობადობისტესტი.გე", // Georgian Unicode@IDN.IDN email.

"מבחן-דואר-אלקטרוני@מבחן-קבלה-אוניברסלי.קום", // Hebrew Unicode@IDN.IDN email.

"மின்னஞ்சல்-சோதனை@பொது-ஏற்பு-சோதனை.சிங்கப்பூர்", // Tamil Unicode@IDN.IDN email.

"อีเมลทดสอบ@ยูเอทดสอบ.ไทย", // Thai Unicode@IDN.IDN email.

"ኢሜይል-ሙከራ@ሁለንአቀፍ-ተቀባይነት-ሙከራ.com", // Ethiopic Unicode@IDN.ASCII email.

"အီးမေးလ်စမ်းသပ်ချက်@အလုံးစုံလက်ခံမှုစမ်းသပ်ချက်.com" // Myanmar Unicode@IDN.ASCII email.

};

| 40

41 of 46

Example: Binary Searches in Unicode Strings in Java:(3/3):

,

String targetString = "ই-মেইল-পরীক্ষা@সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি"; // Target string to search

Arrays.sort(strings); // Sorting in lexicographic order

int result = binarySearchUnicode(strings, targetString);

if (result != -1) {

System.out.println(targetString + " found at index " + result);

} else {

System.out.println(targetString + " not found");

}

}

}

Output:

ই-মেইল-পরীক্ষা@সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি found at index 3

| 41

42 of 46

Searching Through Unicode Text (Normalization + Collation)

  • When searching through Unicode data, it's important to consider both normalization and collation to ensure accurate and effective matching:
    • Normalization: Unicode normalization involves transforming Unicode strings into a standardized form that eliminates any variations in character representation.
    • Collation: Collation is the process of ordering strings based on a predefined set of rules or a collation algorithm.

| 42

43 of 46

Example: Searching Unicode Text in Java:(1/2):

,

import java.text.Collator;

import java.text.Normalizer;

import java.util.Locale;

public class StringSearchExample {

public static void main(String[] args) {

String[] strings = {

"সর্বজনীন-স্বীকৃ তির-পরীক্ষা.ভারি", // Bangla (Bengali)

"универсальное-принятие-тест.москва", // Cyrillic

"सार्वभौमिक-स्र्ीकृति-परीक्षण.संगठन", // Devanagari

"უნივერსალური-თავსობადობის-ტესტი.გე", // Georgian

"καθολική-αποδοχή-δοκιμή.ευ", // Greek

"تجربة-القبول-الشامل.موريتانيا", // Arabic

"普遍适用测试.我爱你", // Simplified Chinese

"ඉ-තැපැල්-පිරික්සුම@විශ්ව-සම්මුති-පිරික්සුම.ලංකා" // Sinhala

};

String searchQuery = "καθολικη-αποδοχη-δοκιμη.ευ";

// Normalize the search query using Unicode normalization form NFD

String normalizedSearchQuery = Normalizer.normalize(searchQuery,

Normalizer.Form.NFD);

| 43

44 of 46

Example: Searching Unicode Text in Java:(2/2):

,

// Create a Collator instance for the default locale with strength PRIMARY

Collator collator = Collator.getInstance(Locale.getDefault());

collator.setStrength(Collator.PRIMARY);

boolean found = false;

for (String str : strings) {

String normalizedStr = Normalizer.normalize(str, Normalizer.Form.NFD);

if (collator.compare(normalizedSearchQuery, normalizedStr) == 0) {

found = true;

break;

}

}

if (found) {

System.out.println("Search text” + searchQuery + “ is found in the array.");

} else {

System.out.println("Search text” + searchQuery + “ is not found in the array.");

}

}

}

Output: Search text καθολικη-αποδοχη-δοκιμη.ευ is found in the array.

| 44

45 of 46

Reference:

  • Unicode Consortium. (n.d.). Home. Retrieved October 21, 2023, from https://home.unicode.org/.
  • Stack Overflow. (n.d.). Where Developers Learn, Share, & Build Careers. Retrieved October 21, 2023, from https://stackoverflow.com/.
  • Python Software Foundation. (n.d.). Unicode HOWTO. Retrieved October 21, 2023, from https://docs.python.org/3/howto/unicode.html.

| 45

46 of 46

Author:

  • Dessalegn Mequanint Yehuala, dessalegn.mequanint@aau.edu.et.

| 46