1 of 17

RegExp set notation + properties of strings

for Stage 2

https://github.com/tc39/proposal-regexp-set-notation

(∋ https://github.com/tc39/proposal-regexp-unicode-sequence-properties)

2 of 17

Add RegExp pattern syntax and semantics for these set operations:

difference/subtraction (in A but not in B)
intersection (in both A and B)
nested character classes (needed to enable the above)

Note that union (in A or in B) is already supported in limited form (only within a single character class).

RegExp set notation

3 of 17

New flag 'v' builds on/implies flag 'u'

New flag for difference/subtraction, intersection, nested classes

const re = /[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]/v;

Same new flag for properties of strings

const re = /\p{RGI_Emoji_ZWJ_Sequence}/v;

https://github.com/tc39/proposal-regexp-set-notation/issues/2

https://github.com/tc39/proposal-regexp-set-notation/issues/14

New flag for new syntax & semantics

Mathias:

We're changing the meaning of `[]`, and this must be explicitly clear, so we need either a flag or a modifier. Flag is the simplest solution. It also addresses the concern w.r.t. properties of strings in character classes.

From discussion with Waldemar:

\Usomething{...} looks like it should be enough, but if someone omits the /u flag, then these are just characters to be matched. It would also be nice to have the outer scope be [...] not {...}. And a prefix on each character class is verbose.

A flag or modifier is a possibility, and may be less likely to be forgotten when the character classes otherwise just use [square brackets]. ES has no regex modifiers yet.

If we want to avoid letters that are modifiers somewhere (see above) or flags in ES regex, then we could pick one from [A-IK-TVWYZafhj-lorvz]. For example, “v is the next u”, and should imply it/build on it.

(Some people also like ‘w’.)

4 of 17

assert(/…/d.hasIndices);

assert(/…/g.global);

assert(/…/i.ignoreCase);

assert(/…/m.multiline);

assert(/…/s.dotAll);

assert(/…/u.unicode);

assert(/…/y.sticky);

// Proposed:

assert(/…/v.unicodeSet);

https://github.com/tc39/proposal-regexp-set-notation/issues/14

New flag + RegExp.prototype getter
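
A minimal sketch of how code might detect the flag at run time. `supportsFlagV` and `hasFlagV` are hypothetical helper names, not part of the proposal. The sketch reads the `flags` string instead of relying on the getter name proposed above, since that name was still a proposal at this stage (engines that later shipped the flag spell the getter `unicodeSets`).

```javascript
// Feature-detect the `v` flag. The pattern is compiled at run time,
// so engines without support end up in `catch` instead of failing to parse.
function supportsFlagV() {
  try {
    new RegExp('[\\p{Letter}--[a-z]]', 'v');
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: reports whether a RegExp was created with `v`,
// using only the `flags` string rather than a specific getter name.
function hasFlagV(re) {
  return re.flags.includes('v');
}

if (supportsFlagV()) {
  console.log(hasFlagV(new RegExp('[a-z]', 'v'))); // true
}
console.log(hasFlagV(/[a-z]/u)); // false
```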

5 of 17

// Matching non-ASCII digits, to convert them to ASCII digits:

[\p{Decimal_Number}--[0-9]]

// → difference/subtraction + nested character class

// Matching spans of “word/identifier letters” of specific scripts:

[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]

// → intersection + nested character class

// Matching non-script-specific combining marks:

[\p{Nonspacing_Mark}&&[\p{Script=Inherited}\p{Script=Common}]]

// → intersection + nested character class

https://github.com/tc39/proposal-regexp-set-notation/issues/4

Double punctuation

6 of 17

Reserve syntax for possible future extensions, so that we can hopefully avoid another flag. Example: ~~ is increasingly used for symmetric difference.

Other ASCII double punctuation:

!! ## $$ %% ** ++ ... ~~

Curly braces:

{ }

https://github.com/tc39/proposal-regexp-set-notation/issues/4

Reserved syntax

7 of 17

Single dash for range, double for subtraction

[\p{Other}--\p{Format}--\p{Control}]

Brackets for range with subtraction or intersection

[\p{Decimal_Number}--[0-9]]

No literal dash; always escape

[[\-\p{Symbol}]--\p{Currency_Symbol}]

[[$-\-]&&\p{Punctuation}]

https://github.com/tc39/proposal-regexp-set-notation/issues/12

Dashes

8 of 17

No issue: single type of operator (left-binding)

[\p{Letter}\p{Mark}\p{Decimal_Number}]

[\p{Other}--\p{Format}--\p{Control}]

Proposed: require brackets when mixing different types of operators (complement does not need them)

[\p{Script=Greek}&&[\p{Letter}\p{Mark}\p{Decimal_Number}]]

[[\p{Any}--\p{Other}]\p{Control}]

[[\p{Bidi_Class=R}\p{Bidi_Class=AL}]--\p{Unassigned}]

https://github.com/tc39/proposal-regexp-set-notation/issues/6

Operator precedence

9 of 17

[\p{Basic_Emoji}&&\p{Symbol}] valid

[\p{RGI_Emoji}--\p{RGI_Emoji_Flag_Sequence}] 👧🏿 not 🇫🇷

[^\p{Emoji_Keycap_Sequence}\p{Symbol}] SyntaxError

[^\p{Emoji_Keycap_Sequence}&&\p{Symbol}] valid

const reHashtag =

/[#﹟＃][[\p{XID_Continue}\p{RGI_Emoji}[\-+_]]--[#﹟＃]]+/v;

https://github.com/tc39/proposal-regexp-set-notation/issues/3

Operators + properties of strings (both require flag)

Especially Markus's reply of 2021-02-02:

... In other words, because the character encoding model does not limit what users think of as "characters" to single code points, implementations that limit a "set of characters" to just single code points are incomplete.
Most characters are encoded as single code points, and thus most character classes, and most UnicodeSet instances, contain only those. However, when you need a way to handle any and all "characters", having to put some of them into an auxiliary structure is very awkward.

JIS X 0213 characters:

\N{LATIN SMALL LETTER OPEN O WITH GRAVE}=0254 0300 → ɔ̀
\N{HIRAGANA LETTER BIDAKUON NGA}=304B 309A → か゚

CLDR exemplar character data

Yoruba: ẹ́ ẹ̀ gb m̀ ...
Slovak: dz dž ch

A property of strings, or a character class containing properties of strings, behaves like an alternation; see https://github.com/tc39/proposal-regexp-unicode-sequence-properties

Complement is ok if it is easy to prove just via stable metadata that the contents cannot contain multi-character strings.

See “Validation Algorithm” in https://github.com/tc39/proposal-regexp-set-notation/issues/7

10 of 17

[\p{RGI_Emoji}--(🇧🇪)] 👧🏿 and 🇫🇷 but not 🇧🇪

[a-zA-Z(ch)(m̀)(か゚)(🇦🇺)(🇧🇪)(🇫🇷)] ≍ [a-zA-Z(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)]

https://github.com/tc39/proposal-regexp-set-notation/issues/17

Syntax for string literals

We’re also adding string literal syntax to the proposal. String literals can be wrapped in parentheses. This allows you to do things like “match all RGI_Emoji except for the Belgian flag”, or “match all ASCII letters as well as a few additional strings”.

The abbreviated form, with the pipe symbol as a separator, makes the syntax more readable when multiple strings are used. With parentheses and the pipe symbol, string literals look and work a lot like simple alternations outside of character classes.

There was some discussion about alternative syntax, such as with a prefix or with curly braces instead of parentheses. The curly braces follow the long-standing practice of ICU and CLDR. Parentheses look more like a regex alternation (outside of character classes). See https://github.com/tc39/proposal-regexp-set-notation/issues/17

Previously, we had kept string literal syntax out of the initial proposal in an attempt to keep it small, but since then we’ve received strong feedback, including during the TC39 Incubator Call in April 2021, that this functionality is required to make the proposal minimally useful. It would be too limiting to only allow set operators with pre-defined Unicode properties and character classes.

11 of 17

...

CharacterClass[U, V] ::

[ [lookahead ≠ ^] ClassRanges[?U, ?V] ]

[ ^ ClassRanges[?U, ?V] ]

ClassRanges[U, V] ::

[empty]

[~V] NonemptyClassRanges[?U]

[+V] ClassContents

ClassContents ::

ClassUnion

ClassIntersection

ClassSubtraction

...

https://github.com/tc39/proposal-regexp-set-notation/issues/12

Draft spec changes

12 of 17

(all Stage 1 criteria) ✅
initial spec text ✅

Advance to Stage 2?

(and if so: who wants to be a Stage 3 reviewer?)

Stage 2 entrance criteria

13 of 17

Background slides

14 of 17

/…\USet{…}…/u

Verbose, esp. with multiple character classes
Unfamiliar syntax
{…} for character class
Looks like the \USet{…} should suffice without /u

https://github.com/tc39/proposal-regexp-set-notation/issues/2

Alternatives to new flag

/(?u)…[…]…[…]…/u

Unfamiliar syntax for ES

15 of 17

[\p{Decimal_Number}-[0-9]]

[\p{Script=Khmer}&[\p{Letter}\p{Mark}\p{Number}]]

[\p{Nonspacing_Mark}&[\p{Script=Inherited}\p{Script=Common}]]

Less distinctive, especially with dash
Awkward reserving double punctuation for future syntax if we start with single punctuation now

https://github.com/tc39/proposal-regexp-set-notation/issues/4

Alternative to double punctuation

16 of 17

Always left binding, even with mixed operators; may be surprising at first

[[[\p{scx=Grek}\p{scx=Latn}]--\p{M}]\p{Nd}]

Implicit union tighter than intersection & difference; unclear/confusing precedence among other operators

[[\p{scx=Grek}\p{scx=Latn}]--[\p{M}\p{Nd}]]

Etc.

https://github.com/tc39/proposal-regexp-set-notation/issues/6

Operator precedence: alternatives

17 of 17

// mental model for developers:

\p{Foo}

refers to the Unicode property Foo

with what users perceive as its “characters” such as €, 🚲, 👧🏿, or 🇫🇷