RegExp set notation + properties of strings
for Stage 2
https://github.com/tc39/proposal-regexp-set-notation
(∋ https://github.com/tc39/proposal-regexp-unicode-sequence-properties)
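Add RegExp pattern syntax and semantics for these set operations: difference/subtraction (in A but not in B), intersection (in both A and B), and nested character classes.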
Note that union (in A or in B) is already supported in limited form (only within a single character class).
RegExp set notation
New flag 'v' builds on/implies flag 'u'
New flag for intersection, difference/subtraction, nested classes
const re = /[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]/v;
Same new flag for properties of strings
const re = /\p{RGI_Emoji_ZWJ_Sequence}/v;
https://github.com/tc39/proposal-regexp-set-notation/issues/2
https://github.com/tc39/proposal-regexp-set-notation/issues/14
New flag for new syntax & semantics
assert(/…/d.hasIndices);
assert(/…/g.global);
assert(/…/i.ignoreCase);
assert(/…/m.multiline);
assert(/…/s.dotAll);
assert(/…/u.unicode);
assert(/…/y.sticky);
// Proposed:
assert(/…/v.unicodeSet);
https://github.com/tc39/proposal-regexp-set-notation/issues/14
New flag + RegExp.prototype getter
// Matching non-ASCII digits, to convert them to ASCII digits:
[\p{Decimal_Number}--[0-9]]
// → difference/subtraction + nested character class
// Matching spans of “word/identifier letters” of specific scripts:
[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]
// → intersection + nested character class
// Matching non-script-specific combining marks:
[\p{Nonspacing_Mark}&&[\p{Script=Inherited}\p{Script=Common}]]
// → intersection + nested character class
https://github.com/tc39/proposal-regexp-set-notation/issues/4
Double punctuation
Reserve syntax for possible future extensions, so that hopefully we can avoid another flag. Example: ~~ is in growing use for symmetric difference
!! ## $$ %% ** ++ ... ~~
{ }
https://github.com/tc39/proposal-regexp-set-notation/issues/4
Reserved syntax
Single dash for range, double for subtraction
[\p{Other}--\p{Format}--\p{Control}]
Brackets for range with subtraction or intersection
[\p{Decimal_Number}--[0-9]]
No literal dash; always escape
[[\-\p{Symbol}]--\p{Currency_Symbol}]
[[$-\-]&&\p{Punctuation}]
https://github.com/tc39/proposal-regexp-set-notation/issues/12
Dashes
No issue: single type of operator (left-binding)
[\p{Letter}\p{Mark}\p{Decimal_Number}]
[\p{Other}--\p{Format}--\p{Control}]
Proposed: require brackets when mixing different types of operators (but not for complement)
[\p{Greek}&&[\p{Letter}\p{Mark}\p{Decimal_Number}]]
[[\p{Any}--\p{Other}]\p{Control}]
[[\p{Bidi_Class=R}\p{Bidi_Class=AL}]--\p{Unassigned}]
https://github.com/tc39/proposal-regexp-set-notation/issues/6
Operator precedence
[\p{Basic_Emoji}&&\p{Symbol}] valid
[\p{RGI_Emoji}--\p{RGI_Emoji_Flag_Sequence}] 👧🏿 not 🇫🇷
[^\p{Emoji_Keycap_Sequence}\p{Symbol}] SyntaxError
[^\p{Emoji_Keycap_Sequence}&&\p{Symbol}] valid
const reHashtag =
/[#﹟＃][[\p{XID_Continue}\p{RGI_Emoji}[\-+_]]--[#﹟＃]]+/v;
https://github.com/tc39/proposal-regexp-set-notation/issues/3
Operators + properties of strings (both require flag)
[\p{RGI_Emoji}--(🇧🇪)] 👧🏿 and 🇫🇷 but not 🇧🇪
[a-zA-Z(ch)(m̀)(か゚)(🇦🇺)(🇧🇪)(🇫🇷)] ≍ [a-zA-Z(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)]
https://github.com/tc39/proposal-regexp-set-notation/issues/17
Syntax for string literals
...
CharacterClass[U, V] ::
    [ [lookahead ≠ ^] ClassRanges[?U, ?V] ]
    [ ^ ClassRanges[?U, ?V] ]
ClassRanges[U, V] ::
    [empty]
    [~V] NonemptyClassRanges[?U]
    [+V] ClassContents
ClassContents ::
    ClassUnion
    ClassIntersection
    ClassSubtraction
...
https://github.com/tc39/proposal-regexp-set-notation/issues/12
Draft spec changes
Advance to Stage 2?
(and if so: who wants to be a Stage 3 reviewer?)
Stage 2 entrance criteria
Background slides
/…\USet{…}…/u
https://github.com/tc39/proposal-regexp-set-notation/issues/2
Alternatives to new flag
/(?u)…[…]…[…]…/u
[\p{Decimal_Number}-[0-9]]
[\p{Script=Khmer}&[\p{Letter}\p{Mark}\p{Number}]]
[\p{Nonspacing_Mark}&[\p{Script=Inherited}\p{Script=Common}]]
https://github.com/tc39/proposal-regexp-set-notation/issues/4
Alternative to double punctuation
[[[\p{scx=Grek}\p{scx=Latn}]--\p{M}]\p{Nd}]
[[\p{scx=Grek}\p{scx=Latn}]--[\p{M}\p{Nd}]]
https://github.com/tc39/proposal-regexp-set-notation/issues/6
Operator precedence: alternatives
// mental model for developers:
\p{Foo}
refers to the Unicode property Foo
with what users perceive as its “characters” such as €, 🚲, 👧🏿, or 🇫🇷