Number Parsing Strategies for ICU 61

Ticket: http://bugs.icu-project.org/trac/ticket/13513

Branch: http://source.icu-project.org/repos/icu/branches/shane/numberformat3/

Rietveld: https://codereview.appspot.com/335150043/

Introduction

Current State of Number Parsers

ICU4J 58 (Fast but Buggy)

ICU4J 59 (Robust but Slow)

Proposals for Number Parsing in ICU 61

Modular Greedy

Modular Non-Greedy

Optimization: Lead Chars

How should the "lead chars" optimization be used?

Comparison

Known Issues / Behavior Changes (Java)

Currency Leniency

Modifier Symbols Are Always Required

Modifier Symbols Are Always Accepted

Return Types

Multiplier / MathContext Exception

Examples of Difficult Strings to Parse

Separators

Greedy vs. Non-Greedy

Wide Symbols

Introduction

Parsing arbitrary user-input number strings has many challenges, some of which include:

Current State of Number Parsers

ICU4J 58 (Fast but Buggy)

The old implementation (ICU4J prior to 59) of number parsing is what I call "greedy ad-hoc".  Using a large if statement, it considers several options for tokens at each offset into the string.  It saves state between parse sessions, may backtrack, and has behavior that is difficult to understand and debug.  The main benefit of this implementation is that it is reasonably fast.

ICU4J 59 (Robust but Slow)

The current implementation (ICU4J 59 and 60) uses a state table.  The states include: BEFORE_PREFIX, AFTER_PREFIX, AFTER_INTEGER_DIGIT, AFTER_FRACTION_DIGIT, AFTER_EXPONENT_SEPARATOR, and so forth (see the code file for the full list).  The current state determines what tokens can appear next.  The main advantage of this approach is that it is very resilient to ambiguity, with multiple parse paths taken when there is more than one possibility.  It is also easy to understand and debug.  The disadvantages are that the state table structure is not as easy to extend, and overall the state table approach is much slower than the greedy ad-hoc approach by an average of one order of magnitude.

Proposals for Number Parsing in ICU 61

Modular Greedy

I am proposing a new implementation to launch in ICU4J and ICU4C 61.  I call this one "modular greedy".  The basic approach starts with defining an ordered list of matchers.  At each point in the string, ask each matcher, in order, to attempt to consume the string starting at the current offset.  When a match succeeds, update the offset and go back to the top of the ordered list of matchers.  When no matchers can match anything else, you are done.

The pseudocode for "modular greedy" is something like:

def parse(string, results) -> void:

        return if string is empty  # base case

        for matcher in matchers:

                num_chars_consumed, _ = matcher.match(string, results)

                if num_chars_consumed > 0:

                        parse(string[num_chars_consumed:], results)

                        break

In this pseudo-code, "results" is the output variable.

Here is a bare bones naive implementation of a simple minus sign matcher to illustrate how matchers work:

class MinusSignMatcher implements MatcherInterface:

        def match(string, results) -> (int, bool):

                # Minus sign options: "-" and "**"

                if string[:1] == "-":

                        results.is_negative = true

                        return 1, false  # num_chars_consumed, want_more

elif string[:2] == "**":

                        results.is_negative = true

                        return 2, false

                elif string == "*":

                        # Partial prefix match: ask for another char.

                        return 0, true

                return 0, false  # no match

Modular Non-Greedy

In addition, with the ordered list of matchers, it is easy to supplement a slower, "modular non-greedy" algorithm that handles a few extra corner cases with very high code reuse.  This algorithm no longer lets matchers greedily eat up as many chars as they want.  It essentially tries giving the matcher every possible length, checking each time to see if any of the other matchers are interested in what comes next.  For slightly better efficiency, the matcher is allowed to say whether it is interested in seeing more of the string beyond the given segment.

def parse_non_greedy(string, results) -> void:

  return if string is empty  # base case

  initial = results.clone()

  for matcher in matchers:

    for num_chars in 1 through string.length:

      candidate = initial.clone()

      num_chars_consumed, want_more = matcher.match(string[:num_chars], candidate)

      if num_chars == num_chars_consumed:

        parse_non_greedy(string[num_chars_consumed:], candidate)

        if candidate is better than best:

          results <= candidate

      if not want_more:

        break  # go to next matcher

Since the vast majority of users do not need the non-greedy code path, however, I suggest using greedy as the default.  If users run into trouble with the greedy algorithm for some reason, they can discover the non-greedy option by reading the documentation, and they can flip the switch at that time to see if it solves their problem.

Optimization: Lead Chars

To achieve a modest boost to parsing runtime, each matcher can have a method to compute a set of "lead chars".  The algorithm can look at that set and determine if it needs to call the slower "match" method.  The contract is that the "lead chars" needs to contain a superset of the chars that can actually be accepted by "match".

The downside of this optimization is that it adds to the startup time (constructing your parsing object), so it should be used only when necessary.

How should the "lead chars" optimization be used?

A few options:

  1. Make it come online after the parser is used several times in a row like the self-regulation used in NumberFormatter (how many times to wait?)
  2. Keep it as a knob that the user can manually toggle (probably bad: users will know how to use it)
  3. Remove it entirely from the code base (somewhat simpler code, easier maintenance, etc.)

Comparison

The "modular greedy" algorithm described above should be better for users than either ICU4J 58 or ICU4J 59/60:

  1. The code is intended to be more modular and easier to debug.
  2. Allows for both a fast greedy and a slow non-greedy algorithm on top of the same ordered list of matchers.
  3. Easier to customize: to fulfill a user's request to tweak behavior at runtime, we can add or remove matchers from the ordered list, alleviating the web of "if" statements sprinkled throughout the code.

Here are graphs comparing the runtime and allocations for the new implementation (with and without the "lead chars" optimization) and JDK, ICU4J 58, and ICU4J 59/60.  These benchmarks are computed with Google Caliper and represent the resources required to parse the string "123,456.789" in locale en-US.  The benchmarks were taken on an x86-64 Linux workstation.  The JDK being used is OpenJDK.

NOTE: For the ICU 61 implementation, these benchmarks represent the runtime required to compute the correct binary-coded-decimal internal representation of the number.  Converting from BCD to an output data type like Long or BigDecimal incurs an additional cost.

Known Issues / Behavior Changes (Java)

In my branch, I have the DecimalFormat parsing API hooked up to my new parsing backend.  This section contains all behavior changes that are exhibited via ICU4J's unit test suite.

Currency Leniency

Old behavior: Accept all forms of the locale's currency, but do not accept non-locale currencies unless parseCurrency() is being used.

New behavior: Always accept all currencies.  The performance cost is small because the currency parsing code is very well optimized.  The cost of maintaining a separate structure for the locale's currencies, as in previous versions, would add to construction cost.

Modifier Symbols Are Always Required

Consider the following unit test (note that this is a unit test I added in 59):

In other words, if the affix is empty, I don't automatically assume that the number is negative.  I require some sort of evidence, like a minus sign or accounting-style parentheses.  This also applies to other symbols like percent signs: unless I see the percent sign, I don't scale by 100.  Related: #13114

Modifier Symbols Are Always Accepted

In lenient mode, most symbols are accepted whether or not they appear in the pattern.  This change primarily affects percent signs:

This change does not affect strict mode.

Return Types

The parse method API docs make only one guarantee for the return type: if the number fits inside a Long, then a Long is returned.  Previously, a BigInteger might have been returned when possible.  Now I always unconditionally return a BigDecimal, which I have found to be faster and easier to deal with than BigInteger.  The API spec makes absolutely no guarantees about the return type other than Long, but some users may be expecting certain behavior cases with the BigInteger.

Multiplier / MathContext Exception

You set a MathContext with unlimited precision in conjunction with a multiplier that when inverted produces a nonterminating decimal.

Old behavior: The exception is thrown in .parse()

New behavior: The exception is thrown immediately in .setMultiplier()/.setMathContext()

Examples of Difficult Strings to Parse

Separators

String: 123,456

Is the comma a grouping or decimal separator? (Take hints from the locale)

String: 123,456.789

How about now? (If we are in Europe, do we accept the comma as an American-style grouping separator?)

Greedy vs. Non-Greedy

Pattern: 0'0N'

String: 40N

Greedy parses this as 40 and stops.  Non-greedy can discover the suffix 0N and parse this as 4.  This string is parseable in ICU4J 59/60 and in the non-greedy algorithm for ICU 61.  It is not pareseable in ICU4J 58 or greedy 61.

Wide Symbols

Digit Strings: (0), (1), (2), (3), ...

String: (1)(2)(3)

Backwards compatibility dictates that we should accept symbols like these that are more than one code point in length.