Regular expression tree dictionary layout
Overview
The regexp_tree dictionary lets you map keys to values based on hierarchical regular-expression patterns.
It's optimized for pattern-match lookups (e.g. classifying strings like user agent strings by matching regex patterns) rather than exact key matching.
Use the regular expression tree dictionary with YAMLRegExpTree source
Regular expression tree dictionaries are defined in ClickHouse open-source using the YAMLRegExpTree source which is provided the path to a YAML file containing the regular expression tree.
The dictionary source YAMLRegExpTree represents the structure of a regexp tree. For example:
This config consists of a list of regular expression tree nodes. Each node has the following structure:
- regexp: the regular expression of the node.
- attributes: a list of user-defined dictionary attributes. In this example, there are two attributes:
nameandversion. The first node defines both attributes. The second node only defines attributename. Attributeversionis provided by the child nodes of the second node.- The value of an attribute may contain back references, referring to capture groups of the matched regular expression. In the example, the value of attribute
versionin the first node consists of a back-reference\1to capture group(\d+[\.\d]*)in the regular expression. Back-reference numbers range from 1 to 9 and are written as$1or\1(for number 1). The back reference is replaced by the matched capture group during query execution.
- The value of an attribute may contain back references, referring to capture groups of the matched regular expression. In the example, the value of attribute
- child nodes: a list of children of a regexp tree node, each of which has its own attributes and (potentially) children nodes. String matching proceeds in a depth-first fashion. If a string matches a regexp node, the dictionary checks if it also matches the nodes' child nodes. If that is the case, the attributes of the deepest matching node are assigned. Attributes of a child node overwrite equally named attributes of parent nodes. The name of child nodes in YAML files can be arbitrary, e.g.
versionsin above example.
Regexp tree dictionaries only allow access using the functions dictGet, dictGetOrDefault, and dictGetAll. For example:
In this case, we first match the regular expression \d+/tclwebkit(?:\d+[\.\d]*) in the top layer's second node.
The dictionary then continues to look into the child nodes and finds that the string also matches 3[12]/tclwebkit.
As a result, the value of attribute name is Android (defined in the first layer) and the value of attribute version is 12 (defined in the child node).
With a sophisticated YAML configuration file, you can use a regexp tree dictionaries as a user agent string parser. ClickHouse supports uap-core and you can see how to use it in the functional test 02504_regexp_dictionary_ua_parser
Collecting attribute values
Sometimes it is useful to return values from multiple regular expressions that matched, rather than just the value of a leaf node. In these cases, the specialized dictGetAll function can be used. If a node has an attribute value of type T, dictGetAll will return an Array(T) containing zero or more values.
By default, the number of matches returned per key is unbounded. A bound can be passed as an optional fourth argument to dictGetAll. The array is populated in topological order, meaning that child nodes come before parent nodes, and sibling nodes follow the ordering in the source.
Example:
Result:
Matching modes
Pattern matching behavior can be modified with certain dictionary settings:
regexp_dict_flag_case_insensitive: Use case-insensitive matching (defaults tofalse). Can be overridden in individual expressions with(?i)and(?-i).regexp_dict_flag_dotall: Allow '.' to match newline characters (defaults tofalse).
Use regular expression tree dictionary in ClickHouse Cloud
The YAMLRegExpTree source works in ClickHouse Open Source but not in ClickHouse Cloud.
To use regexp tree dictionaries in ClickHouse Cloud, first create a regexp tree dictionary from a YAML file locally in ClickHouse Open Source, then dump this dictionary into a CSV file using the dictionary table function and the INTO OUTFILE clause.
The content of csv file is:
The schema of dumped file is:
id UInt64: the id of the RegexpTree node.parent_id UInt64: the id of the parent of a node.regexp String: the regular expression string.keys Array(String): the names of user-defined attributes.values Array(String): the values of user-defined attributes.
To create the dictionary in ClickHouse Cloud, first create a table regexp_dictionary_source_table with below table structure:
Then update the local CSV by
You can see how to Insert Local Files for more details. After we initialize the source table, we can create a RegexpTree by table source: