Intl.Segmenter

Background

Breaking down strings

There are multiple concepts that can be treated as the elements of which strings are sequences. In order of increasing size:

  • code unit: low-level “WTF-16” (potentially ill-formed UTF-16) 16-bit value
  • code point: low-level Unicode codespace value not bound to representation details, representable in ECMAScript as one or two code units
  • grapheme (cluster): potentially locale-dependent “perceived character” (basic unit of written language), consisting of one or more code points
  • word: locale-dependent composition of grapheme clusters into a semantic unit
  • sentence: locale-dependent composition of related words
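
To make the two lowest levels concrete, here is a minimal sketch using only standard String methods. The sample string is an assumption chosen for illustration and does not come from the slides.

```javascript
// Sample string (an assumption for illustration):
// "e" + U+0301 COMBINING ACUTE ACCENT + U+1F600 GRINNING FACE.
const sample = "e\u0301\u{1F600}";

// Code units: .length counts 16-bit values, so the emoji contributes two
// (a surrogate pair).
const codeUnitCount = sample.length;

// Code points: the string iterator joins well-formed surrogate pairs.
const codePoints = Array.from(sample, ch => ch.codePointAt(0).toString(16));

// "WTF-16": slicing between the surrogates leaves a lone surrogate, which is
// still a legal string value even though it is ill-formed UTF-16.
const loneSurrogate = sample.slice(2, 3);

console.log(codeUnitCount);                               // 4
console.log(codePoints);                                  // ["65", "301", "1f600"]
console.log(loneSurrogate.charCodeAt(0).toString(16));    // "d83d"
```

A reader would see two "characters" here (an accented e and an emoji), which is the grapheme level that neither count reflects.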

Breaking down strings programmatically

There are use cases for each kind of breakdown, and browsers ship with libraries to support them, but only the lowest levels are exposed to ECMAScript source:

  • code unit: string.matchAll(/./sg), [].slice.call(string)
  • code point: Array.from(string)
  • grapheme (cluster): ✗
  • word: ✗
  • sentence: ✗
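
The expressions in the list above can be run side by side. This is a small sketch, and the sample input is an assumption for illustration: a flag emoji that readers perceive as a single character.

```javascript
// Sample input (an assumption for illustration): U+1F1FA U+1F1F8,
// two regional indicator code points rendered as one flag.
const string = "\u{1F1FA}\u{1F1F8}";

// Code units: both expressions from the list yield four 16-bit values.
const unitsViaSlice = [].slice.call(string);
const unitsViaMatchAll = Array.from(string.matchAll(/./sg), match => match[0]);

// Code points: the string iterator pairs up surrogates, yielding two values.
const codePoints = Array.from(string);

console.log(unitsViaSlice.length, unitsViaMatchAll.length, codePoints.length); // 4 4 2
```

Neither result is the single perceived character, which is the gap at the grapheme level and above.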

Breaking down strings with Intl.Segmenter

Taking guidance from UAX #29 (Unicode Text Segmentation), we expose the higher-level breakdowns by constructing locale-dependent iterators.

  • code unit: string.matchAll(/./sg), [].slice.call(string)
  • code point: Array.from(string)
  • grapheme (cluster): Array.from(new Intl.Segmenter(locale, {granularity: "grapheme"}).segment(string), data => data.segment)
  • word: Array.from(new Intl.Segmenter(locale, {granularity: "word"}).segment(string), data => data.segment)
  • sentence: Array.from(new Intl.Segmenter(locale, {granularity: "sentence"}).segment(string), data => data.segment)
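
As a rough sketch of how the three higher-level expressions behave, the following applies them to sample inputs. The locale, the strings, and the `split` helper are assumptions introduced for illustration, and exact boundaries depend on the engine's locale data.

```javascript
// Hypothetical inputs chosen for illustration.
const locale = "en";
const string = "Hello, world! How are you?";

// Helper (not part of the API): collect the `segment` field at a granularity.
const split = (granularity, input) =>
  Array.from(new Intl.Segmenter(locale, {granularity}).segment(input), data => data.segment);

// Grapheme: the combining accent stays attached to its base letter.
const graphemes = split("grapheme", "e\u0301!");

// Word: the result also contains punctuation and whitespace segments.
const words = split("word", string);

// Sentence: each segment keeps its trailing whitespace.
const sentences = split("sentence", string);

// At word granularity, segment data also carries `isWordLike`,
// which can be used to drop the punctuation and whitespace segments.
const wordLike = Array.from(new Intl.Segmenter(locale, {granularity: "word"}).segment(string))
  .filter(data => data.isWordLike)
  .map(data => data.segment);

console.log(graphemes.length, wordLike, sentences);
```

At every granularity the segments concatenate back to the input, so no characters are dropped by segmentation.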

Current Status

What’s new?

  • Moved input string recovery from get %SegmentsPrototype%.string to segment data input
    • Before: segmenter.segment(str).string
    • After: [...segmenter.segment(str)][0].input
    • Better matches RegExp (@@matchAll and match objects) #116
    • Sidesteps non-primitive internal slot accessor concern #96
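
A brief sketch of the new access pattern, next to the RegExp match objects it is modeled on. The sample string and locale are assumptions for illustration.

```javascript
// Hypothetical input and locale chosen for illustration.
const str = "Hello world";
const segmenter = new Intl.Segmenter("en", {granularity: "word"});

// Each segment data object carries the original string as `input`,
// alongside `segment` and `index`.
const [first] = segmenter.segment(str);
console.log(first.segment, first.index, first.input === str); // Hello 0 true

// RegExp match objects expose the source string the same way.
const [match] = str.matchAll(/\w+/g);
console.log(match[0], match.index, match.input === str);      // Hello 0 true
```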

Open issues

  • #120 (non-blocking) Should Intl.Segmenter.prototype.segment(string) return an iterator directly?

Staging

  • 2017 March - Caridy and “TBD from PayPal” volunteer for Stage 3 review
  • 2017 May - stays at Stage 2 because of under-specification
  • 2017 July - stays at Stage 2 because of under-specification
  • 2017 September - promoted to Stage 3
  • 2018 November - “High-level concerns have also been raised.”
  • 2019 January - major API changes announced
  • 2019 March - demoted to Stage 2
  • 2019 October - Philip Dunkel and Mark Cohen volunteered for Stage 3 review
  • 2020 June - stays at Stage 2 due to unresolved internal slot accessor concern

Next steps

Questions?

Feedback?

Stage 3?