1 of 17

2 of 17

Add RegExp pattern syntax and semantics for these set operations:

    • difference/subtraction (in A but not in B)
    • intersection (in both A and B)
    • nested character classes (needed to enable the above)

Note that union (in A or in B) is already supported in limited form (only within a single character class).

RegExp set notation

3 of 17

New flag 'v' builds on/implies flag 'u'

New flag for intersection, difference/subtraction, nested classes

const re = /[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]/v;

Same new flag for properties of strings

const re = /\p{RGI_Emoji_ZWJ_Sequence}/v;

https://github.com/tc39/proposal-regexp-set-notation/issues/2

https://github.com/tc39/proposal-regexp-set-notation/issues/14

New flag for new syntax & semantics

4 of 17

assert(/…/d.hasIndices);

assert(/…/g.global);

assert(/…/i.ignoreCase);

assert(/…/m.multiline);

assert(/…/s.dotAll);

assert(/…/u.unicode);

assert(/…/y.sticky);

// Proposed:

assert(/…/v.unicodeSet);

https://github.com/tc39/proposal-regexp-set-notation/issues/14

New flag + RegExp.prototype getter

5 of 17

// Matching non-ASCII digits, to convert them to ASCII digits:

[\p{Decimal_Number}--[0-9]]

// → difference/subtraction + nested character class

// Matching spans of “word/identifier letters” of specific scripts:

[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]

// → intersection + nested character class

// Matching non-script-specific combining marks:

[\p{Nonspacing_Mark}&&[\p{Script=Inherited}\p{Script=Common}]]

// → intersection + nested character class

https://github.com/tc39/proposal-regexp-set-notation/issues/4

Double punctuation

6 of 17

Reserve syntax for possible future extensions, so that we can hopefully avoid yet another flag. Example: ~~ is increasingly used for symmetric difference.

  • Other ASCII double punctuation:

!! ## $$ %% ** ++ ... ~~

  • Curly braces:

{ }

https://github.com/tc39/proposal-regexp-set-notation/issues/4

Reserved syntax

7 of 17

Single dash for range, double for subtraction

[\p{Other}--\p{Format}--\p{Control}]

Brackets for range with subtraction or intersection

[\p{Decimal_Number}--[0-9]]

No literal dash; always escape

[[\-\p{Symbol}]--\p{Currency_Symbol}]

[[$-\-]&&\p{Punctuation}]

https://github.com/tc39/proposal-regexp-set-notation/issues/12

Dashes

8 of 17

No issue: single type of operator (left-binding)

[\p{Letter}\p{Mark}\p{Decimal_Number}]

[\p{Other}--\p{Format}--\p{Control}]

Proposed: require brackets when mixing types of operators (but not for complement)

[\p{Greek}&&[\p{Letter}\p{Mark}\p{Decimal_Number}]]

[[\p{Any}--\p{Other}]\p{Control}]

[[\p{Bidi_Class=R}\p{Bidi_Class=AL}]--\p{Unassigned}]

https://github.com/tc39/proposal-regexp-set-notation/issues/6

Operator precedence

9 of 17

[\p{Basic_Emoji}&&\p{Symbol}] valid

[\p{RGI_Emoji}--\p{RGI_Emoji_Flag_Sequence}] 👧🏿 but not 🇫🇷

[^\p{Emoji_Keycap_Sequence}\p{Symbol}] SyntaxError

[^\p{Emoji_Keycap_Sequence}&&\p{Symbol}] valid

const reHashtag =

/[#﹟#][[\p{XID_Continue}\p{RGI_Emoji}[\-+_]]--[#﹟#]]+/v;

https://github.com/tc39/proposal-regexp-set-notation/issues/3

Operators + properties of strings (both require flag)

10 of 17

[\p{RGI_Emoji}--(🇧🇪)] 👧🏿 and 🇫🇷 but not 🇧🇪

[a-zA-Z(ch)()(゚)(🇦🇺)(🇧🇪)(🇫🇷)] ≍ [a-zA-Z(ch||゚|🇦🇺|🇧🇪|🇫🇷)]

https://github.com/tc39/proposal-regexp-set-notation/issues/17

Syntax for string literals

11 of 17

...

CharacterClass[U, V] ::
    [ [lookahead ≠ ^] ClassRanges[?U, ?V] ]
    [ ^ ClassRanges[?U, ?V] ]

ClassRanges[U, V] ::
    [empty]
    [~V] NonemptyClassRanges[?U]
    [+V] ClassContents

ClassContents ::
    ClassUnion
    ClassIntersection
    ClassSubtraction

...

https://github.com/tc39/proposal-regexp-set-notation/issues/12

Draft spec changes

12 of 17

  • (all Stage 1 criteria) ✅
  • initial spec text ✅

Advance to Stage 2?

(and if so: who wants to be a Stage 3 reviewer?)

Stage 2 entrance criteria

13 of 17

Background slides

14 of 17

/…\USet{}…/u

  • Verbose, esp. with multiple character classes
  • Unfamiliar syntax
  • {…} for character class
  • Looks like the \USet{…} should suffice without /u

https://github.com/tc39/proposal-regexp-set-notation/issues/2

Alternatives to new flag

/(?u)[][]…/u

  • Unfamiliar syntax for ES

15 of 17

[\p{Decimal_Number}-[0-9]]

[\p{Script=Khmer}&[\p{Letter}\p{Mark}\p{Number}]]

[\p{Nonspacing_Mark}&[\p{Script=Inherited}\p{Script=Common}]]

  • Less distinctive, especially with dash
  • Awkward reserving double punctuation for future syntax if we start with single punctuation now

https://github.com/tc39/proposal-regexp-set-notation/issues/4

Alternative to double punctuation

16 of 17

  • Always left binding, even with mixed operators; may be surprising at first

[[[\p{scx=Grek}\p{scx=Latn}]--\p{M}]\p{Nd}]

  • Implicit union tighter than intersection & difference; unclear/confusing precedence among other operators

[[\p{scx=Grek}\p{scx=Latn}]--[\p{M}\p{Nd}]]

  • Etc.

https://github.com/tc39/proposal-regexp-set-notation/issues/6

Operator precedence: alternatives

17 of 17

// mental model for developers:

\p{Foo}

refers to the Unicode property Foo

with what users perceive as its “characters” such as €, 🚲, 👧🏿, or 🇫🇷