1 of 37

Universal Acceptance (UA) Micro-Learning Module: Module 8 - Advanced Topics in Internationalized Domain Names (IDNs).

Instructor Guide

1st Edition.

© 2024 Creative Commons License - Attribution 4.0 International (CC BY 4.0).

Universal Acceptance

2 of 37

Advanced Topics in IDNs UA Micro-Learning Module Objectives:

  • In this module, we will delve into various advanced concepts and techniques related to IDNs, with a focus on understanding and addressing the limitations of the IDNA2008 protocol.
  • At the end of this module, students should be able to:
    • Understand the limitations of the IDNA2008 protocol in practical terms;
    • Articulate key components of Label Generation Rules (LGRs);
    • Trace the evolution of LGR formats;
    • Grasp the concept of variant labels and their definition and use in LGRs; and
    • Demonstrate proficiency in applying these concepts to real-world scenarios by writing and executing code that performs label verification and variant identification.

3 of 37

Note About the Use of Unicode String Literals:

  • The Unicode string literals in the code examples require an input method (a virtual or physical keyboard) suited to the script they come from. If no such input method is available, alternatives such as language translation tools can be used to produce the strings.
  • The example code throughout the module uses Unicode string literals from a variety of scripts, ranging from lesser-known to widely used ones, to broaden learners' exposure to diverse scripts.
  • The module gives the English meaning of each Unicode string literal used in the example code, along with its transliteration, to help learners pronounce it accurately.
  • Instructors are free to replace the literals in the example code with Unicode string literals from a script of their choice.

4 of 37

Understanding and Addressing IDNA2008 Limitations:

  • While IDNA2008 has made significant improvements over its predecessor, IDNA2003, there are still some limitations to be aware of:
    • Contextual Variants.
      • Uppercase and lowercase letters,
      • Different representations of diacritics,
      • Ligatures (characters that are formed by combining two or more letters).
      • These limitations can lead to confusion and homograph attacks.
    • Limited Character Set.
    • Complexity of Implementation.
    • Compatibility Challenges.
    • Limited Validation.
    • Limited Support by DNS Infrastructure.
    • Language-Specific Rules.
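
The representation issues listed above can be observed directly with Python's standard `unicodedata` module. The sketch below is illustrative only: the helper names (`code_points`, `same_after_nfc`) and the sample strings are assumptions chosen for the demonstration and are not part of IDNA2008 itself.

```python
import unicodedata

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

def same_after_nfc(a, b):
    """Compare two strings after Unicode NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Two representations of the same accented letter:
# precomposed U+00E9 versus "e" followed by combining acute U+0301.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(composed == decomposed)                # False: different code points
print(same_after_nfc(composed, decomposed))  # True: equal once normalized

# Latin "a" (U+0061) and Cyrillic "a" (U+0430) look alike on screen but
# remain different characters even after normalization. This kind of
# look-alike substitution is what a homograph attack relies on.
latin = "pay"
mixed = "p\u0430y"
print(code_points(latin))
print(code_points(mixed))
print(same_after_nfc(latin, mixed))          # False
```

Normalization resolves the first pair but not the second, which is one reason additional rule sets such as LGRs are needed on top of the protocol.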

5 of 37

Introduction to Label Generation Rules (LGRs):

  • The primary objective of LGRs is to provide a comprehensive and well-defined set of guidelines.
  • These guidelines help determine the validity of labels and their variants within a specific script or language context.
  • To achieve this, LGRs encompass a range of rules that address:
    • character variants;
    • context-dependent variants;
    • script-specific rules; and
    • other considerations unique to the language or script in question.
  • By establishing these rules, LGRs ensure consistency and predictability in the handling of IDNs across different DNS systems and applications.
  • The primary goals of LGRs are to:
    • Ensure the stability and predictability of IDNs.
    • Promote linguistic diversity and inclusivity.
    • Enhance security and prevent abuse.

6 of 37

Evolution of LGR Formats: From Text-Based to XML-Based Standardization (RFCs 3743, 4690, and 7940):

  • The transition from text-based to XML-based standardization marks a significant milestone:
    • Before this evolution, variant and code point tables were defined mainly in plain-text formats.
    • The XML-based format is structured and machine-readable.
  • The following RFCs trace this evolution:
    • RFC 3743 (2004): The Joint Engineering Team (JET) guidelines for registering Chinese, Japanese, and Korean IDNs. It introduced language variant tables, expressed in a plain-text format.
    • RFC 4690 (2006): An Internet Architecture Board (IAB) review of IDN issues, with recommendations. It does not define a table format.
    • RFC 7940 (2016): "Representing Label Generation Rulesets Using XML". This specification defines the XML-based format for LGRs.
      • It defines the schema and elements for describing the repertoire, rules, metadata, and other relevant information within an LGR.

7 of 37

Advantages of XML-based LGR format:

  • Structure and Organization: It allows for the hierarchical organization of rules, classes, variants, and other elements, making it easier to navigate and process the LGR information.
  • Machine-Readable: It allows for automated parsing, validation, and transformation of LGR data using existing XML processing tools and libraries.
  • Extensibility: Its extensibility enables the representation of more complex rules and language-specific considerations that may be necessary for specific scripts or languages.
  • Internationalization Support: supports Unicode characters and encoding schemes, making it suitable for representing LGR data that encompasses a wide range of scripts and languages.
  • Interoperability: It enables the exchange of LGR data between registries, registrars, and other entities involved in IDN management, fostering consistency and compatibility across implementations.
  • Versioning and Updates: It facilitates the tracking of changes, additions, and modifications to LGRs over time, ensuring that the most up-to-date rules and guidelines are used.
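
The "Machine-Readable" and "Structure and Organization" points can be illustrated with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the `SAMPLE` document uses the simplified element names from this module's examples, and the helper names `is_well_formed` and `outline` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A small document in the simplified, illustrative format used in this module.
SAMPLE = """<lgr>
  <metadata><version>1.0</version><language>am</language></metadata>
  <rules><class id="vowel"><rule>[አኡኢኣኤእኦ]</rule></class></rules>
</lgr>"""

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def outline(element, depth=0):
    """Return indented tag names that show the document hierarchy."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

print(is_well_formed(SAMPLE))                # True
print(is_well_formed("<lgr><rules></lgr>"))  # False: <rules> is never closed
print("\n".join(outline(ET.fromstring(SAMPLE))))
```

A plain-text table would need a custom parser for the same checks; with XML, a general-purpose library performs them.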

8 of 37

Example: Section-by-Section Illustration of an LGR Definition(1/3):

  • XML Declaration:

<?xml version="1.0" encoding="UTF-8"?>

    • This line indicates that the document is an XML file with version 1.0 and encoded in UTF-8. When present, it must come first in the file, before the LGR element.
  • LGR Element:

<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="urn:ietf:params:xml:ns:lgr-1.0 https://www.iana.org/assignments/lgr/lgr-1.0.xsd"
     repertoire="urn:ietf:params:xml:ns:unicode-1.0">

    • 'xmlns': Specifies the XML namespace for LGR elements.
    • 'xmlns:xsi': Declares the XML Schema Instance namespace.
    • 'xsi:schemaLocation': Points to the schema location for validation.
    • 'repertoire': Indicates the XML namespace for Unicode.

9 of 37

Example: Section-by-Section Illustration of an LGR Definition(2/3):

  • Metadata Section:

    • 'version': Specifies the version of the LGR.
    • 'language': Indicates the language (Amharic in this case).
    • 'description': Provides a description of the Label Generation Rules.
    • 'author': Identifies the author of the LGR.

<!-- Metadata Element -->
<metadata>
  <version>1.0</version>
  <language>am</language>
  <description>Label Generation Rules for Ethiopic script (Amharic)</description>
  <author>አበበ ከበደ</author>
</metadata>
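
As a rough sketch of how an application might read this metadata, the Python snippet below parses a shortened copy of the example document. The function name `read_metadata` and the `NS` prefix mapping are assumptions made for the illustration. Because the root element declares a default namespace, every child element must be looked up with that namespace.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the default namespace declared on the <lgr> element.
NS = {"lgr": "urn:ietf:params:xml:ns:lgr-1.0"}

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
  <metadata>
    <version>1.0</version>
    <language>am</language>
    <description>Label Generation Rules for Ethiopic script (Amharic)</description>
    <author>አበበ ከበደ</author>
  </metadata>
</lgr>"""

def read_metadata(xml_bytes):
    """Return the metadata children as a {name: text} dictionary."""
    root = ET.fromstring(xml_bytes)
    meta = root.find("lgr:metadata", NS)
    if meta is None:
        return {}
    # Tags arrive as "{namespace}name"; keep only the local name.
    return {child.tag.split("}")[-1]: (child.text or "").strip() for child in meta}

metadata = read_metadata(DOC.encode("utf-8"))
for name, value in metadata.items():
    print(f"{name}: {value}")
```

Passing UTF-8 bytes rather than a string lets the parser honor the encoding stated in the XML declaration.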

10 of 37

Example: Section-by-Section Illustration of an LGR Definition(3/3):

  • Rules Section:

    • This section defines various classes of characters (consonants, vowels, punctuation, digits, etc.) with corresponding rules specifying the character ranges.

<rules>
  <!-- Classes for different character types -->
  <class id="consonant">
    <description>Ethiopic consonants</description>
    <rule>[ሀ-ፖ]</rule>
  </class>
  <!-- Additional classes for vowels, punctuation, digits, etc. -->
</rules>
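
One way to apply such a class rule is to treat the text of the rule element as a regular-expression character class. That interpretation is an assumption of this sketch, as are the helper names `in_class` and `label_in_class`; the range shown is the one used for consonants in the full definition.

```python
import re

# Character range from ሀ (U+1200) to ፖ (U+1356), written as a regex class.
CONSONANT_RULE = "[ሀ-ፖ]"

def in_class(rule, char):
    """Return True if the single character matches the class rule."""
    return re.fullmatch(rule, char) is not None

def label_in_class(rule, label):
    """Return True if the label is non-empty and every character matches."""
    return bool(label) and all(in_class(rule, ch) for ch in label)

print(in_class(CONSONANT_RULE, "ማ"))            # True: U+121B is inside the range
print(in_class(CONSONANT_RULE, "a"))            # False: Latin letter
print(label_in_class(CONSONANT_RULE, "አማርኛ"))   # True
print(label_in_class(CONSONANT_RULE, "አማርኛ1"))  # False: ASCII digit
```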

11 of 37

Example: Full LGR Definition for Ethiopic Script(1/2):

  • Rules Section:

<?xml version="1.0" encoding="UTF-8"?>
<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="urn:ietf:params:xml:ns:lgr-1.0 https://www.iana.org/assignments/lgr/lgr-1.0.xsd"
     repertoire="urn:ietf:params:xml:ns:unicode-1.0">
  <metadata>
    <version>1.0</version>
    <language>am</language>
    <description>Label Generation Rules for Ethiopic script (Amharic)</description>
    <author>አበበ ከበደ</author>
  </metadata>
  <rules>
    <class id="consonant">
      <description>Ethiopic consonants</description>
      <rule>[ሀ-ፖ]</rule>
    </class>
    <class id="vowel">
      <description>Ethiopic vowels</description>
      <rule>[አኡኢኣኤእኦ]</rule>

12 of 37

Example: Full LGR Definition for Ethiopic Script(2/2):

  • Rules Section:

    </class>
    <class id="punctuation">
      <description>Ethiopic punctuation marks</description>
      <rule>[፠-፧]</rule>
    </class>
    <class id="digit">
      <description>Ethiopic digits</description>
      <rule>[፩-፼]</rule>
    </class>
    <class id="joiner">
      <description>Joiner character</description>
      <rule>፡</rule>
    </class>
    <class id="diacritic">
      <description>Ethiopic diacritic characters</description>
      <rule>[፡-፦]</rule>
    </class>
  </rules>
</lgr>
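
To show how the complete definition might be consumed, the sketch below loads the classes from a shortened copy of the document and reports which classes each character of a label belongs to. Treating each rule as a regular expression is an assumption, and `load_classes` and `classify` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

NS = {"lgr": "urn:ietf:params:xml:ns:lgr-1.0"}

# Shortened copy of the full definition (three of the six classes).
LGR_XML = """<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
  <rules>
    <class id="consonant"><rule>[ሀ-ፖ]</rule></class>
    <class id="vowel"><rule>[አኡኢኣኤእኦ]</rule></class>
    <class id="digit"><rule>[፩-፼]</rule></class>
  </rules>
</lgr>"""

def load_classes(xml_text):
    """Map each class id to a compiled pattern built from its rule text."""
    root = ET.fromstring(xml_text)
    classes = {}
    for cls in root.findall("lgr:rules/lgr:class", NS):
        rule = cls.find("lgr:rule", NS).text.strip()
        classes[cls.get("id")] = re.compile(rule)
    return classes

def classify(label, classes):
    """Return (character, [matching class ids]) for every character."""
    return [
        (ch, sorted(name for name, pattern in classes.items() if pattern.fullmatch(ch)))
        for ch in label
    ]

classes = load_classes(LGR_XML)
for ch, names in classify("አ፩x", classes):
    print(f"U+{ord(ch):04X} {names}")
```

In this example the character አ falls into both the consonant and the vowel class because the two ranges overlap, and the Latin letter x falls into none. Running a check like this can help reveal overlapping or missing ranges in a draft rule set.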

13 of 37

Variant Labels and their Definition in LGRs:

  • Base Label: The base label is the original label for which variant labels are defined. It serves as the starting point for generating the variants.
  • Variant Type: This specifies the type or category of the variant label, such as "variant with diacritics," "script variant," or "transliteration variant."
  • Variant Rules: Variant rules define the transformations or modifications applied to the base label to generate the variant labels.
  • Contextual Rules: Contextual rules define variant transformations based on the surrounding context or neighboring characters.
  • Actions: Actions specify the operations applied to the base label to generate variant labels.
  • Mapping or Mapping Tables: Mapping tables are used to explicitly define the mapping between a base label and its variant labels.
  • Validity and Stability: Variant labels defined in LGRs may have specific validity and stability criteria.
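
The relationship between a base label, a mapping, and the resulting variant labels can be sketched in a few lines of Python. The mapping from አ to ዐ is the substitution used in this module's examples. The function name `generate_variants` and the choice to substitute each occurrence independently are assumptions of this sketch.

```python
from itertools import product

# Illustrative character-level mapping: አ may be replaced by ዐ.
VARIANT_MAP = {"አ": ["ዐ"]}

def generate_variants(base_label, variant_map):
    """Return all variant labels of base_label, excluding the base itself."""
    # For each position, the choices are the original character plus its variants.
    choices = [[ch] + variant_map.get(ch, []) for ch in base_label]
    labels = {"".join(combo) for combo in product(*choices)}
    labels.discard(base_label)
    return sorted(labels)

print(generate_variants("አማርኛ", VARIANT_MAP))  # ['ዐማርኛ']
print(generate_variants("ማር", VARIANT_MAP))    # []: no mapped characters
```

The number of variants grows multiplicatively with the number of mapped characters in a label, which is one reason rule sets attach validity and stability criteria to variants.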

14 of 37

Supplementary/Complementary Rules to Variant Labels Definition in LGRs:

  • Diacritic Rule: This rule type involves adding or removing diacritics from the base label.
  • Transliteration Rule: Transliteration rules define the transformation of a label from one script or character set to another.
  • Contextual Rule: Contextual rules consider the context or neighboring characters to determine the generation of variant labels.
  • Ligature Rule: Ligature rules define the transformation of a sequence of characters into a ligature or a single character representation.
  • Case Rule: Case rules specify the transformation of a label's case, such as changing between uppercase and lowercase letters.
  • Abbreviation Rule: Abbreviation rules define the transformation of a label into an abbreviated form.
  • Phonetic Rule: Phonetic rules involve the transformation of a label based on phonetic representations.
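
Two of these rule types, the case rule and the diacritic rule, can be approximated with Python's standard library. This is a simplified sketch under stated assumptions: the function names are hypothetical, and real rule sets define such transformations per script rather than globally.

```python
import unicodedata

def case_variant(label):
    """Case rule: fold the label to a caseless form."""
    return label.casefold()

def strip_diacritics(label):
    """Diacritic rule: remove combining marks from the label."""
    decomposed = unicodedata.normalize("NFD", label)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(case_variant("Example"))       # example
print(case_variant("Straße"))        # strasse
print(strip_diacritics("café"))      # cafe
print(strip_diacritics("አማርኛ"))      # unchanged: no combining marks
```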

15 of 37

Example: Section-by-Section Illustration of a Variant Label Definition(1/7):

  • Base Label Element:

    • Specifies the base label, which is the primary representation of the label.

<BaseLabel>አማርኛ</BaseLabel>

16 of 37

Example: Section-by-Section Illustration of a Variant Label Definition(2/7):

  • Variant Labels Element - Diacritics Element:

<VariantLabel>
  <Type>Diacritics</Type>
  <Rules>
    <AddDiacritics>
      <!-- No Diacritic Specified -->
      <!-- Additional diacritic specifications are empty -->
    </AddDiacritics>
  </Rules>
</VariantLabel>

17 of 37

Example: Section-by-Section Illustration of a Variant Label Definition(3/7):

  • Variant Labels Element - Script Variant Element:

<VariantLabel>
  <Type>ScriptVariant</Type>
  <Rules>
    <ScriptVariation>ዐማርኛ</ScriptVariation>
  </Rules>
</VariantLabel>

18 of 37

Example: Section-by-Section Illustration of a Variant Label Definition(4/7):

  • Variant Labels Element - Transliteration Variant Element:

    • Together with the fragments on the two preceding slides, the XML fragment below defines variant labels of three types: diacritics, script variation, and transliteration.

<VariantLabel>
  <Type>Transliteration</Type>
  <Rules>
    <Transliteration>Amharic</Transliteration>
  </Rules>
</VariantLabel>

19 of 37

Example: Section-by-Section Illustration of a Variant Label Definition(5/7):

  • Contextual Rules Element:

<ContextualRules>
  <ContextRule>
    <Condition>አ</Condition>
    <Action>
      <AddCharacter>ዐ</AddCharacter>
    </Action>
  </ContextRule>
</ContextualRules>

  • The above XML code fragment defines contextual rules where the addition of a character depends on a specific condition (አ in this case).

20 of 37

Example: Section-by-Section Illustration of a Variant Label Definition(6/7):

  • Mapping Tables Element:

    • The XML fragment below defines explicit mappings, associating the base label with its variants.

<MappingTables>
  <Mapping>
    <Base>አማርኛ</Base>
    <Variants>
      <Variant>ዐማርኛ</Variant>
      <!-- Additional variant mappings -->
    </Variants>
  </Mapping>
</MappingTables>
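
A mapping table of this shape can be loaded into a dictionary and then used in both directions: from a base label to its variants, and from a variant back to its base. The sketch below assumes the illustrative element names shown in this module; `load_mappings` and `base_of` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

MAPPING_XML = """<MappingTables>
  <Mapping>
    <Base>አማርኛ</Base>
    <Variants>
      <Variant>ዐማርኛ</Variant>
    </Variants>
  </Mapping>
</MappingTables>"""

def load_mappings(xml_text):
    """Return {base label: [variant labels]} from a MappingTables element."""
    root = ET.fromstring(xml_text)
    table = {}
    for mapping in root.findall("Mapping"):
        base = mapping.findtext("Base", default="").strip()
        variants = [v.text.strip() for v in mapping.findall("Variants/Variant") if v.text]
        table[base] = variants
    return table

def base_of(label, table):
    """Return the base label that label belongs to, or None if unknown."""
    if label in table:
        return label
    for base, variants in table.items():
        if label in variants:
            return base
    return None

table = load_mappings(MAPPING_XML)
print(table)
print(base_of("ዐማርኛ", table))  # አማርኛ
print(base_of("xyz", table))    # None
```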

21 of 37

Example: Section-by-Section Illustration of a Variant Label Definition(7/7):

  • Validity and Stability Criteria Elements:

    • Specifies validity criteria for variants, indicating stability (High in this case).

<ValidityCriteria>
  <ValidVariant>
    <Variant>ዐማርኛ</Variant>
    <Stability>High</Stability>
  </ValidVariant>
</ValidityCriteria>

22 of 37

Example: Full XML Code for Defining Label Variants in Ethiopic Script:(1/4):

<LGR>
  <!-- Base Label -->
  <BaseLabel>አማርኛ</BaseLabel>
  <!-- Variant Label with Diacritics -->
  <VariantLabel>
    <Type>Diacritics</Type>
    <Rules>
      <!-- Define rules for adding diacritics -->
      <AddDiacritics>
        <!-- No Diacritic Specified -->
        <!-- Additional diacritic specifications are empty -->
      </AddDiacritics>
    </Rules>
  </VariantLabel>
  <!-- Variant Label with Script Variation -->
  <VariantLabel>
    <Type>ScriptVariant</Type>
    <Rules>
      <!-- Define rules for script variation -->
      <ScriptVariation>ዐማርኛ</ScriptVariation>

23 of 37

Example: Full XML Code for Defining Label Variants in Ethiopic Script:(2/4):

    </Rules>
  </VariantLabel>
  <!-- Variant Label with Transliteration -->
  <VariantLabel>
    <Type>Transliteration</Type>
    <Rules>
      <!-- Define rules for transliteration -->
      <Transliteration>Amharic</Transliteration>
    </Rules>
  </VariantLabel>
  <!-- Contextual Rules -->
  <ContextualRules>
    <!-- Define rules that depend on context -->
    <ContextRule>
      <Condition>አ</Condition>
      <Action>
        <!-- Define actions specific to the context -->
        <AddCharacter>ዐ</AddCharacter>
      </Action>
    </ContextRule>

24 of 37

Example: Full XML Code for Defining Label Variants in Ethiopic Script:(3/4):

  </ContextualRules>
  <!-- Actions -->
  <Actions>
    <!-- Define global actions -->
    <GlobalAction>
      <!-- Define global transformations -->
      <SubstituteCharacters>
        <!-- Define character substitutions -->
        <Substitution>
          <Original>አ</Original>
          <Replacement>ዐ</Replacement>
        </Substitution>
      </SubstituteCharacters>
    </GlobalAction>
  </Actions>
  <!-- Mapping Tables -->
  <MappingTables>
    <!-- Define explicit mappings -->

25 of 37

Example: Full XML Code for Defining Label Variants in Ethiopic Script:(4/4):

    <Mapping>
      <Base>አማርኛ</Base>
      <Variants>
        <Variant>ዐማርኛ</Variant>
        <!-- Additional variant mappings -->
      </Variants>
    </Mapping>
  </MappingTables>
  <!-- Validity and Stability Criteria -->
  <ValidityCriteria>
    <!-- Define validity criteria for variants -->
    <ValidVariant>
      <Variant>ዐማርኛ</Variant>
      <Stability>High</Stability>
    </ValidVariant>
  </ValidityCriteria>
</LGR>

26 of 37

Application of LGRs: Python Example on Label Verification and Variant Identification(1/3):

import xml.etree.ElementTree as ET


class LGRValidator:
    """Checks a label against the base label and collects its variant labels."""

    def __init__(self, lgr_data):
        # Parse the LGR XML text, then cache the base label and its variants.
        self.lgr_tree = ET.ElementTree(ET.fromstring(lgr_data))
        self.root = self.lgr_tree.getroot()
        self.base_label = self.root.find("BaseLabel").text
        self.variant_labels = self._get_variant_labels()

    def validate_label(self, label):
        # A label is considered valid only if it equals the base label.
        if not isinstance(label, str):
            raise TypeError("Label must be a string")
        return label == self.base_label

    def get_variant_labels(self):
        return self.variant_labels

    def _get_variant_labels(self):
        # Collect one variant per <VariantLabel> element, keyed on its <Type>.
        variant_labels = set()
        for variant_label in self.root.findall("VariantLabel"):
            variant_type_element = variant_label.find("Type")
            if variant_type_element is not None and variant_type_element.text:
                variant_type = variant_type_element.text
                variant = None

27 of 37

Application of LGRs: Python Example on Label Verification and Variant Identification(2/3):

                if variant_type == "Diacritics":
                    diacritics_element = variant_label.find("Rules/AddDiacritics/Diacritic")
                    if diacritics_element is not None and diacritics_element.text:
                        diacritics = [d.text for d in variant_label.findall("Rules/AddDiacritics/Diacritic")]
                        variant = self._apply_diacritics(label=self.base_label, diacritics=diacritics)
                elif variant_type == "ScriptVariant":
                    script_variation_element = variant_label.find("Rules/ScriptVariation")
                    if script_variation_element is not None and script_variation_element.text:
                        script_variation = script_variation_element.text
                        variant = self._apply_script_variation(label=self.base_label, script_variation=script_variation)
                elif variant_type == "Transliteration":
                    transliteration_element = variant_label.find("Rules/Transliteration")
                    if transliteration_element is not None and transliteration_element.text:
                        transliteration = transliteration_element.text
                        variant = self._apply_transliteration(label=self.base_label, transliteration=transliteration)
                if variant and variant != self.base_label:
                    variant_labels.add(variant)
        return variant_labels

28 of 37

Application of LGRs: Python Example on Label Verification and Variant Identification(3/3):

    def _apply_diacritics(self, label, diacritics):
        # Append the listed diacritics to the end of the label.
        return label + ''.join(diacritics)

    def _apply_script_variation(self, label, script_variation):
        # The LGR file already spells out the script variant in full.
        return script_variation

    def _apply_transliteration(self, label, transliteration):
        # The LGR file already spells out the transliteration in full.
        return transliteration


def read_lgr_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lgr_data = file.read()
    return lgr_data


# Specify the path to your LGR file
lgr_file_path = 'lgr_ethiopic_variant.xml'
lgr_data = read_lgr_from_file(lgr_file_path)
validator = LGRValidator(lgr_data)

label = "አማርኛ"
is_valid = validator.validate_label(label)
print(f"Is '{label}' valid? {is_valid}")

variant_labels = validator.get_variant_labels()
print(f"Variant labels for '{label}':")
for variant_label in variant_labels:
    if variant_label != label:  # Exclude the base label from the variant labels
        print(variant_label)

29 of 37

Application of LGRs: Java Example on Label Verification and Variant Identification(1/5):

import org.w3c.dom.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;

class LGRValidator {
    private Document lgrDocument;
    private String baseLabel;
    private Set<String> variantLabels;

    public LGRValidator(String lgrData) throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

30 of 37

Application of LGRs: Java Example on Label Verification and Variant Identification(2/5):

        // Treat the argument as a file path only when it does not look like XML text;
        // passing raw XML to Paths.get() can throw InvalidPathException on some platforms.
        if (!lgrData.trim().startsWith("<") && Files.isRegularFile(Paths.get(lgrData))) {
            this.lgrDocument = builder.parse(Paths.get(lgrData).toFile());
        } else {
            this.lgrDocument = builder.parse(new InputSource(new StringReader(lgrData)));
        }
        this.lgrDocument.getDocumentElement().normalize();
        this.baseLabel = this.getNodeTextContent("BaseLabel");
        this.variantLabels = this.extractVariantLabels();
    }

    public boolean validateLabel(String label) {
        return label != null && label.equals(baseLabel);
    }

    public Set<String> getVariantLabels() {
        return variantLabels;
    }

    private Set<String> extractVariantLabels() {
        Set<String> variantLabels = new HashSet<>();
        NodeList variantLabelNodes = lgrDocument.getElementsByTagName("VariantLabel");

31 of 37

Application of LGRs: Java Example on Label Verification and Variant Identification(3/5):

        for (int i = 0; i < variantLabelNodes.getLength(); i++) {
            Element variantLabel = (Element) variantLabelNodes.item(i);
            String variantType = this.getNodeTextContent(variantLabel, "Type");
            String variant = null;
            if ("Diacritics".equals(variantType)) {
                variant = applyDiacritics(baseLabel, variantLabel.getElementsByTagName("Diacritic"));
            } else if ("ScriptVariant".equals(variantType)) {
                // getElementsByTagName matches tag names, not paths such as "Rules/ScriptVariation".
                variant = getNodeTextContent(variantLabel, "ScriptVariation");
            } else if ("Transliteration".equals(variantType)) {
                variant = getNodeTextContent(variantLabel, "Transliteration");
            }
            if (variant != null && !variant.equals(baseLabel) && !variantLabels.contains(variant)) {
                variantLabels.add(variant);
            }
        }
        return variantLabels;
    }

32 of 37

Application of LGRs: Java Example on Label Verification and Variant Identification(4/5):

    private String applyDiacritics(String label, NodeList diacriticNodes) {
        StringBuilder result = new StringBuilder(label);
        for (int i = 0; i < diacriticNodes.getLength(); i++) {
            result.append(getNodeTextContent((Element) diacriticNodes.item(i)));
        }
        return result.toString();
    }

    private String getNodeTextContent(String tagName) {
        NodeList nodeList = lgrDocument.getElementsByTagName(tagName);
        return nodeList.getLength() > 0 ? getNodeTextContent((Element) nodeList.item(0)) : null;
    }

    private String getNodeTextContent(Element element, String tagName) {
        NodeList nodeList = element.getElementsByTagName(tagName);
        return nodeList.getLength() > 0 ? getNodeTextContent((Element) nodeList.item(0)) : null;
    }

    private String getNodeTextContent(Element element) {
        return element != null && element.hasChildNodes() ?

33 of 37

Application of LGRs: Java Example on Label Verification and Variant Identification(5/5):

                element.getFirstChild().getTextContent() : null;
    }
}

public class LGRValidatorMain {
    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException {
        String lgrFilePath = "lgr_ethiopic_variant.xml";
        // Decode explicitly as UTF-8 so the Ethiopic text does not depend on the platform default charset.
        String lgrData = new String(Files.readAllBytes(Paths.get(lgrFilePath)), java.nio.charset.StandardCharsets.UTF_8);
        LGRValidator validator = new LGRValidator(lgrData);

        String label = "አማርኛ";
        boolean isValid = validator.validateLabel(label);
        System.out.println("Is '" + label + "' valid? " + isValid);

        Set<String> variantLabels = validator.getVariantLabels();
        System.out.println("Variant labels for '" + label + "':");
        for (String variantLabel : variantLabels) {
            if (!variantLabel.equals(label)) {
                System.out.println(variantLabel);
            }
        }
    }
}

34 of 37

Application of LGRs: Output of the Python and Java Example Code:

Both the Python and Java examples produce the output below. Because the variants are stored in an unordered set, the order of the last two lines may differ.

Is 'አማርኛ' valid? True

Variant labels for 'አማርኛ':

ዐማርኛ

Amharic

35 of 37

Reference:

[1]. Klensin, J. (2010, August). Internationalized Domain Names in Applications (IDNA): Protocol (RFC 5891). Retrieved from [https://www.ietf.org/rfc/] on 2023-12-06.

[2]. Internet Engineering Task Force (IETF). (2023). IDNA2008 - Internationalized Domain Names in Applications. Retrieved from [https://www.ietf.org/] on 2023-12-06.

[3]. Cloudflare, Inc. (2023). Understanding and Addressing IDNA2008 Limitations. Retrieved from [https://developers.cloudflare.com/cloudflare-one/account-limits/] on 2023-12-06.

[4]. Internet Engineering Task Force (IETF). (2018, July). Label Generation Rules (LGR) for the ASCII Scripts (RFC 8195). Retrieved from [https://www.ietf.org/rfc/] on 2023-12-06.

[5]. Unicode Consortium. (2023, December 5). Unicode Technical Standard #46 (UTS 46): IDNA Compatibility Charts. Retrieved from [https://www.unicode.org/] on 2023-12-06.

[6]. Konishi, K., Huang, K., Qian, H., & Ko, Y. (2004). Joint Engineering Team (JET) Guidelines for Internationalized Domain Names (IDN) Registration and Administration for Chinese, Japanese, and Korean (RFC 3743). Retrieved from https://www.ietf.org/rfc/ on 2023-12-06.

36 of 37

Reference:

[7]. Klensin, J., Fältström, P., Karp, C., & IAB. (2006). Review and Recommendations for Internationalized Domain Names (IDNs) (RFC 4690). Retrieved from https://www.ietf.org/rfc/ on 2023-12-06.

[8]. Davies, K., & Freytag, A. (2016). Representing Label Generation Rulesets Using XML (RFC 7940). Retrieved from https://www.ietf.org/rfc/ on 2023-12-06.

[9]. Unicode Consortium. (2023, December 5). Unicode Technical Standard #46 (UTS 46): Unicode IDNA Compatibility Processing. Retrieved from https://www.unicode.org/ on 2023-12-06.

[10]. Internet Corporation for Assigned Names and Numbers (ICANN). (2013, March 20). Procedure to Develop and Maintain the Label Generation Rules for the Root Zone in Respect of IDNA Labels. Retrieved from https://www.icann.org/en/system/files/files/lgr-procedure-20mar13-en.pdf on 2023-12-26.

37 of 37

Author:

  • Dessalegn Mequanint Yehuala, dessalegn.mequanint@aau.edu.et.
