MapVector Design

Prepared by:

Bohdan Kazydub

DRILL-7096

Motivation

Drill currently has a MAP type that corresponds to Hive’s STRUCT. To read Hive MAP columns[1] (as part of the Hive complex types support implementation), Drill needs a new type: a (canonical) MAP that stores key-value pairs. The existing MAP type will be renamed to STRUCT (in the scope of [1]).

MapVector Structure[1]

The new map vector will contain three value vectors: keys, values, and offsets.

  • The keys and values vectors store the keys and the values respectively; the offsets vector stores the row offsets. The keys and values vectors are created when the MapVector is initialized, based on the key and value types passed in.
  • The keys vector must be of a required primitive type.
  • The values vector can be of either a primitive or a complex type.
  • MapVector.Mutator adds no new methods; it is responsible for setting the valueCount of the map itself and the actual valueCounts of its child value vectors (keys, values, and offsets).
  • MapVector.Accessor does not introduce new methods either.
  • MapWriter and MapReader are used to modify and read a MapVector.

[1] The issue is tracked in https://issues.apache.org/jira/browse/DRILL-7096

The element count of a map is obtained as the difference between the next offset value and the current one:

size[i] = offsets.get(i + 1) - offsets.get(i)

For example, consider the following simple maps (each map corresponds to a row):

{1, ‘a’, 2, ‘b’, 5, ‘e’}

{2, ‘b’, 5, ‘e’, 7, ‘g’, 3, ‘c’}

{} (empty map)

{4, ‘d’}

...
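The size formula can be checked against the four example rows with a small sketch. It assumes the offsets vector holds one more entry than there are rows and starts at 0, which is what size[i] = offsets.get(i + 1) - offsets.get(i) implies; the class and method names are hypothetical, and plain int arrays stand in for Drill's value vectors. Under that assumption the keys of the four rows would be stored flattened as 1, 2, 5, 2, 5, 7, 3, 4.

```java
import java.util.Arrays;

// Hypothetical illustration only; plain arrays replace Drill's value vectors.
final class MapOffsetsSketch {

    // Builds offsets from per-row entry counts:
    // offsets[0] = 0 and offsets[i + 1] = offsets[i] + sizes[i].
    static int[] offsetsFromSizes(int[] sizes) {
        int[] offsets = new int[sizes.length + 1];
        for (int i = 0; i < sizes.length; i++) {
            offsets[i + 1] = offsets[i] + sizes[i];
        }
        return offsets;
    }

    // Number of key-value pairs in the given row.
    static int size(int[] offsets, int row) {
        return offsets[row + 1] - offsets[row];
    }

    public static void main(String[] args) {
        // The four example rows hold 3, 4, 0 and 1 entries.
        int[] offsets = offsetsFromSizes(new int[] {3, 4, 0, 1});
        System.out.println(Arrays.toString(offsets)); // [0, 3, 7, 7, 8]
        System.out.println(size(offsets, 2));         // 0 (the empty map)
    }
}
```

Note that an empty map simply repeats the previous offset, so no separate marker is needed for it.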

MapWriter

A MapVector is modified through MapWriter. MapWriter encapsulates the logic that separates rows from the individual records (map elements) within each row.

Fields description:

  • container: the MapVector to be modified.
  • keyWriter/valueWriter: writers that modify the container’s keys and values vectors respectively. They are created on top of the keys and values vectors when the MapWriter is created.
  • currentRow/length: designate the current row and the number of elements in its map. Used to calculate the writer position and the offsets.

Methods description:

  • start(): starts a new row. Should be invoked before writing any data into the map.
  • end(): finalizes the row and updates MapVector#offsets.
  • startKeyValuePair(): updates the writer position for keyWriter and valueWriter. Should be invoked after start() but before writing a key or value using keyWriter or valueWriter.
  • endKeyValuePair(): finalizes writing the entry and increments the MapWriter#length counter. Should be invoked after startKeyValuePair() and after writing values (if any) using the key and value writers.
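The call order described above can be illustrated with a toy writer. This is only a sketch of the protocol, not Drill's MapWriter: the class ToyMapWriter is hypothetical, plain lists stand in for the keys, values and offsets vectors, writeKey/writeValue stand in for keyWriter and valueWriter, and the key and value types are fixed to int and String for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for MapWriter (hypothetical); lists replace Drill's value vectors.
final class ToyMapWriter {
    final List<Integer> keys = new ArrayList<>();
    final List<String> values = new ArrayList<>();
    final List<Integer> offsets = new ArrayList<>();

    private int length;      // entries completed in the current row
    private boolean inRow;   // true between start() and end()
    private boolean inPair;  // true between startKeyValuePair() and endKeyValuePair()

    ToyMapWriter() {
        offsets.add(0);      // assumed convention: the first row starts at offset 0
    }

    void start() {
        if (inRow) throw new IllegalStateException("previous row was not ended");
        inRow = true;
        length = 0;
    }

    void startKeyValuePair() {
        if (!inRow || inPair) throw new IllegalStateException("unexpected startKeyValuePair()");
        inPair = true;
        keys.add(null);      // reserve the slot at the current writer position
        values.add(null);
    }

    void writeKey(int key) {
        requirePair();
        keys.set(position(), key);
    }

    void writeValue(String value) {
        requirePair();
        values.set(position(), value);
    }

    void endKeyValuePair() {
        requirePair();
        inPair = false;
        length++;            // one more entry in the current row
    }

    void end() {
        if (!inRow || inPair) throw new IllegalStateException("unexpected end()");
        offsets.add(position()); // row start + length = start of the next row
        inRow = false;
    }

    // Index of the entry being written: start of the current row plus completed entries.
    private int position() {
        return offsets.get(offsets.size() - 1) + length;
    }

    private void requirePair() {
        if (!inPair) throw new IllegalStateException("call startKeyValuePair() first");
    }
}
```

Writing the row {4, ‘d’} with this sketch would be start(), startKeyValuePair(), writeKey(4), writeValue("d"), endKeyValuePair(), end(); an empty map is just start() followed by end().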

MapReader

To access values by key, the reader introduces the read(Object key, ValueHolder holder) and read(Object key, ComplexHolder holder) methods, which set the value into the given holder.

These methods are going to be used in generated code, similarly to how values are obtained from arrays by index. To get a value by key, use the square bracket syntax:

SELECT mapcol_with_int_key[25] FROM hive.table_name;

SELECT mapcol_with_string_key['abc'] FROM hive.table_name;

(UDFs such as CAST, LOWER, etc. can be used inside the square brackets.)
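A look-up by key over the keys/values/offsets layout could be sketched as a plain scan of one row's slice. This is an illustration under assumptions, not the MapReader implementation: the class and method names are hypothetical, arrays replace the value vectors, and the value is returned directly instead of being set into a holder.

```java
// Hypothetical illustration of a by-key look-up within one row.
final class MapLookupSketch {

    // Scans the entries of the given row and returns the value stored
    // under the key, or null if the row has no such key.
    // With duplicate keys, the first match in storage order is returned.
    static String read(int[] keys, String[] values, int[] offsets, int row, int key) {
        for (int i = offsets[row]; i < offsets[row + 1]; i++) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null;
    }
}
```

The scan is linear in the number of entries of the row, which is the cost that sorted keys would reduce.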

Issues

  • Calcite does not allow keys (i.e. the type of the key passed in square brackets) other than int and string (varchar).
  • Hive currently supports null keys only for ORC tables. For other table types, an entry with a null key is either skipped or causes an error (NPE). Whether NULL keys should be supported is debatable.

Future improvements

  • Keys should be sorted for better performance, since sorted keys allow binary search instead of a plain linear scan. Depending on the map size, the sorting itself can be an overhead. The best solution is to provide an option that enables or disables key-value sorting for better look-up performance.
  • Keys should be unique. Hive keeps only the last inserted value for a given key in its tables (so querying Hive, or files generated by Hive, is safe with respect to key uniqueness). However, it is desirable to disallow duplicates on Drill’s side too. Keeping duplicate keys raises the following concerns (apart from the result not being a “good” map):
    1. a wrong value may be retrieved by key;
    2. redundant key-value pairs are kept in memory;
    3. the number of elements computed from the offsets is no longer accurate (e.g. a request for the map’s size returns the wrong size).
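Both improvements can be sketched together. The example below is a hedged illustration, not a proposed Drill API: the class and method names are hypothetical, arrays replace the value vectors, and the de-duplication rule ("last inserted value wins") is borrowed from the Hive behavior described above.

```java
import java.util.Arrays;
import java.util.TreeMap;

// Hypothetical illustration of sorted, unique keys within a row.
final class SortedMapRowSketch {

    // Normalizes one row: later entries overwrite earlier ones with the
    // same key, and TreeMap keeps the keys in ascending order.
    static TreeMap<Integer, String> sortedUnique(int[] rowKeys, String[] rowValues) {
        TreeMap<Integer, String> normalized = new TreeMap<>();
        for (int i = 0; i < rowKeys.length; i++) {
            normalized.put(rowKeys[i], rowValues[i]);
        }
        return normalized;
    }

    // Binary search restricted to the slice of one row.
    // Correct only if the keys of every row are stored in ascending order.
    static String lookupSorted(int[] keys, String[] values, int[] offsets, int row, int key) {
        int idx = Arrays.binarySearch(keys, offsets[row], offsets[row + 1], key);
        return idx >= 0 ? values[idx] : null;
    }
}
```

Normalizing on write costs a sort per row, which is the overhead that a sorting option would let users avoid for small maps.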

Q&A

Thanks for your attention!