Add RegExp pattern syntax and semantics for these set operations:

    • difference/subtraction (in A but not in B)
    • intersection (in both A and B)
    • nested character classes (needed to enable the above)

Note that union (in A or in B) is already supported in limited form (only within a single character class).
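
As an informal illustration of the intended semantics only (not of the proposed syntax), a character class can be modeled as a set of code points. The helper names below (`classOf`, `show`, `union`, `intersection`, `difference`) are hypothetical and exist just for this sketch.

```javascript
// Hypothetical model: a character class is a Set of code points.
const classOf = (chars) => new Set([...chars].map((ch) => ch.codePointAt(0)));

// Render a set back to a string, sorted by code point, for easy inspection.
const show = (set) => String.fromCodePoint(...[...set].sort((x, y) => x - y));

// union: in A or in B
const union = (a, b) => new Set([...a, ...b]);
// intersection: in both A and B
const intersection = (a, b) => new Set([...a].filter((cp) => b.has(cp)));
// difference/subtraction: in A but not in B
const difference = (a, b) => new Set([...a].filter((cp) => !b.has(cp)));

const A = classOf('abcd');
const B = classOf('cdef');

console.log(show(union(A, B)));
console.log(show(intersection(A, B)));
console.log(show(difference(A, B)));
```

Nesting then corresponds to using the result of one operation as an operand of another, for example `difference(A, intersection(A, B))`.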

RegExp set notation

// Matching non-ASCII digits, to convert them to ASCII digits:

[\p{Decimal_Number}--[0-9]]

// → difference/subtraction + nested character class
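
For comparison, here is a sketch of how the same set could be approximated without set notation, assuming only the existing `u` flag: a negative lookahead excludes the ASCII digits before a single `\p{Decimal_Number}` character is consumed. The function name `findNonAsciiDigits` is hypothetical.

```javascript
// Emulates [\p{Decimal_Number}--[0-9]] using a negative lookahead.
// Assumption: only the 'u' flag is available, so no "--" operator.
const nonAsciiDigit = /(?![0-9])\p{Decimal_Number}/gu;

// Hypothetical helper: list every non-ASCII decimal digit in the text.
function findNonAsciiDigits(text) {
  return text.match(nonAsciiDigit) || [];
}

// U+0663 is ARABIC-INDIC DIGIT THREE, U+096D is DEVANAGARI DIGIT SEVEN.
const found = findNonAsciiDigits('a1\u0663b\u096D9');
console.log(found.length); // 2: the ASCII digits 1 and 9 are skipped
```

The lookahead form works, but it hides the intent (a set minus another set) inside control-flow syntax, which is the readability gap the difference operator is meant to close.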

// Matching spans of “word/identifier letters” of specific scripts:

[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]

// → intersection + nested character class

// Matching non-script-specific combining marks:

[\p{Nonspacing_Mark}&&[\p{Script=Inherited}\p{Script=Common}]]

// → intersection + nested character class
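
The two intersection examples above can likewise be approximated today with a positive lookahead, again assuming only the `u` flag: assert membership in one set, then consume a character from the other. This is a sketch of a workaround, not the proposed syntax; `allMatches` is a hypothetical helper.

```javascript
// Emulates [\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]+
// (spans of Khmer "word/identifier" characters).
const khmerWordSpan =
  /(?:(?=\p{Script=Khmer})[\p{Letter}\p{Mark}\p{Number}])+/gu;

// Emulates [\p{Nonspacing_Mark}&&[\p{Script=Inherited}\p{Script=Common}]]
// (combining marks that are not tied to one specific script).
const genericCombiningMark =
  /(?=[\p{Script=Inherited}\p{Script=Common}])\p{Nonspacing_Mark}/gu;

// Hypothetical helper: return all matches of a global regex as an array.
function allMatches(regex, text) {
  return text.match(regex) || [];
}

// U+1780 KHMER LETTER KA, U+17E1 KHMER DIGIT ONE, U+17D4 KHMER SIGN KHAN (punctuation).
const khmerSpans = allMatches(khmerWordSpan, 'x\u1780\u17E1\u17D4y');

// U+0301 COMBINING ACUTE ACCENT (Script=Inherited), U+05B0 HEBREW POINT SHEVA (Script=Hebrew).
const marks = allMatches(genericCombiningMark, 'e\u0301\u05D1\u05B0');

console.log(khmerSpans.length, marks.length);
```

In both patterns the lookahead and the consuming class each test the same single character, so the match succeeds only when the character belongs to both sets.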

/…\UnicodeSet{}…/u

ICU, Java, Perl (experimental), Python regex module, .NET, XML Schema, Xerces, Ruby

  • union & nested classes – all
  • intersection – ICU, Java, Perl, Python regex, Xerces, Ruby
  • subtraction/difference – ICU, Perl, Python regex, .NET, XML Schema, Xerces
  • symmetric difference – Perl, Python regex

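Comparison table (cell values not preserved, apart from one cell marked 🤷 *):

  columns: language/implementation | union | subtraction | intersection | nested classes | symmetric difference
  rows: Python regex | Ruby | ECMAScript prior to this proposal | ECMAScript with this proposal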
