Regular Expressions

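DuckDB offers pattern matching operators (LIKE, SIMILAR TO, GLOB), as well as support for regular expressions via functions.

Regular Expression Syntax

DuckDB uses the RE2 library as its regular expression engine. For the regular expression syntax, see the RE2 docs.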

Functions

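All functions accept an optional set of options.

Name	Description
`regexp_extract(string, pattern[, group = 0][, options])`	If `string` contains the regular expression `pattern`, returns the capture group specified by the optional parameter `group`; otherwise, returns the empty string. The `group` must be a constant value. If no `group` is given, it defaults to 0. A set of optional `options` can be set.
`regexp_extract(string, pattern, name_list[, options])`	If `string` contains the regular expression `pattern`, returns the capture groups as a struct with the corresponding names from `name_list`; otherwise, returns a struct with the same keys and empty strings as values.
`regexp_extract_all(string, regex[, group = 0][, options])`	Finds non-overlapping occurrences of `regex` in `string` and returns the corresponding values of `group`.
`regexp_full_match(string, regex[, options])`	Returns `true` if the entire `string` matches the `regex`.
`regexp_matches(string, pattern[, options])`	Returns `true` if `string` contains the regular expression `pattern`, `false` otherwise.
`regexp_replace(string, pattern, replacement[, options])`	If `string` contains the regular expression `pattern`, replaces the matching part with `replacement`.
`regexp_split_to_array(string, regex[, options])`	Alias of `string_split_regex`. Splits the `string` along the `regex`.
`regexp_split_to_table(string, regex[, options])`	Splits the `string` along the `regex` and returns a row for each part.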

`regexp_extract(string, pattern[, group = 0][, options])`

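Description	If `string` contains the regular expression `pattern`, returns the capture group specified by the optional parameter `group`; otherwise, returns the empty string. The `group` must be a constant value. If no `group` is given, it defaults to 0. A set of optional `options` can be set.
Example	`regexp_extract('abc', '([a-z])(b)', 1)`
Result	`a`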

`regexp_extract(string, pattern, name_list[, options])`

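Description	If `string` contains the regular expression `pattern`, returns the capture groups as a struct with the corresponding names from `name_list`; otherwise, returns a struct with the same keys and empty strings as values. A set of optional `options` can be set.
Example	`regexp_extract('2023-04-15', '(\d+)-(\d+)-(\d+)', ['y', 'm', 'd'])`
Result	`{'y':'2023', 'm':'04', 'd':'15'}`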

`regexp_extract_all(string, regex[, group = 0][, options])`

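Description	Finds non-overlapping occurrences of `regex` in `string` and returns the corresponding values of `group`. A set of optional `options` can be set.
Example	`regexp_extract_all('Peter: 33, Paul:14', '(\w+):\s*(\d+)', 2)`
Result	`[33, 14]`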
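
To illustrate how the `group` argument changes what `regexp_extract_all` returns, the following sketch reuses the input from the example above and varies only the group number. The comments describe the expected list contents; the exact rendering of string lists may differ between clients and versions.

```sql
-- Group 0 (the default) returns each whole match:
-- two elements, 'Peter: 33' and 'Paul:14'.
SELECT regexp_extract_all('Peter: 33, Paul:14', '(\w+):\s*(\d+)');

-- Group 1 returns the first capture group of each match: [Peter, Paul]
SELECT regexp_extract_all('Peter: 33, Paul:14', '(\w+):\s*(\d+)', 1);
```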

`regexp_full_match(string, regex[, options])`

Description	Returns `true` if the entire `string` matches the `regex`. A set of optional `options` can be set.
Example	`regexp_full_match('anabanana', '(an)*')`
Result	`false`

`regexp_matches(string, pattern[, options])`

Description	Returns `true` if `string` contains the regular expression `pattern`, `false` otherwise. A set of optional `options` can be set.
Example	`regexp_matches('anabanana', '(an)*')`
Result	`true`

`regexp_replace(string, pattern, replacement[, options])`

Description	If `string` contains the regular expression `pattern`, replaces the matching part with `replacement`. A set of optional `options` can be set.
Example	`regexp_replace('hello', '[lo]', '-')`
Result	`he-lo`

`regexp_split_to_array(string, regex[, options])`

Description	Alias of `string_split_regex`. Splits the `string` along the `regex`. A set of optional `options` can be set.
Example	`regexp_split_to_array('hello world; 42', ';? ')`
Result	`['hello', 'world', '42']`

`regexp_split_to_table(string, regex[, options])`

Description	Splits the `string` along the `regex` and returns a row for each part. A set of optional `options` can be set.
Example	`regexp_split_to_table('hello world; 42', ';? ')`
Result	Three rows: `'hello'`, `'world'`, `'42'`
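
As a small sketch of the two splitting functions side by side, the queries below split a comma-separated string whose separators carry inconsistent whitespace. The input string is made up for illustration.

```sql
-- The separator is a comma with any surrounding whitespace.
SELECT regexp_split_to_array('a, b,c ,  d', '\s*,\s*');      -- [a, b, c, d]

-- The array form returns a single LIST value, so list functions apply to it.
SELECT len(regexp_split_to_array('a, b,c ,  d', '\s*,\s*')); -- 4

-- The table form returns one row per part instead of a single list.
SELECT regexp_split_to_table('a, b,c ,  d', '\s*,\s*') AS part;
```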

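The regexp_matches function is similar to the SIMILAR TO operator, but it does not require the entire string to match. Instead, regexp_matches returns true if the string merely contains the pattern, unless the special tokens ^ and $ are used to anchor the regular expression to the start and end of the string. Some examples: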

SELECT regexp_matches('abc', 'abc');       -- true
SELECT regexp_matches('abc', '^abc$');     -- true
SELECT regexp_matches('abc', 'a');         -- true
SELECT regexp_matches('abc', '^a$');       -- false
SELECT regexp_matches('abc', '.*(b|d).*'); -- true
SELECT regexp_matches('abc', '(b|c).*');   -- true
SELECT regexp_matches('abc', '^(b|c).*');  -- false
SELECT regexp_matches('abc', '(?i)A');     -- true
SELECT regexp_matches('abc', 'A', 'i');    -- true
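
The difference between whole-string and substring matching can be seen by running the same pattern through the three forms discussed in this page. This is a minimal sketch using literal inputs only.

```sql
-- SIMILAR TO and regexp_full_match require the whole string to match.
SELECT 'abc' SIMILAR TO 'a';             -- false
SELECT regexp_full_match('abc', 'a');    -- false

-- regexp_matches only requires the pattern to occur somewhere in the string.
SELECT regexp_matches('abc', 'a');       -- true

-- Anchoring with ^ and $ makes regexp_matches behave like a full match.
SELECT regexp_matches('abc', '^a$');     -- false
SELECT regexp_full_match('abc', 'a.*');  -- true
```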

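Options for Regular Expression Functions

The regular expression functions support the following options.

Option	Description
`'c'`	Case-sensitive matching
`'i'`	Case-insensitive matching
`'l'`	Match literals instead of regular expression tokens
`'m'`, `'n'`, `'p'`	Newline-sensitive matching
`'g'`	Global replace, only available for `regexp_replace`
`'s'`	Non-newline-sensitive matching

For example: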

SELECT regexp_matches('abcd', 'ABC', 'c'); -- false
SELECT regexp_matches('abcd', 'ABC', 'i'); -- true
SELECT regexp_matches('ab^/$cd', '^/$', 'l'); -- true
SELECT regexp_matches(E'hello\nworld', 'hello.world', 'p'); -- false
SELECT regexp_matches(E'hello\nworld', 'hello.world', 's'); -- true
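
Two situations where the options tend to matter in practice are case-insensitive filtering and searching for text that contains regular expression metacharacters. The sketch below uses an inline list of made-up names and a made-up price string.

```sql
-- 'i': keep rows whose name starts with "al", regardless of case.
-- Returns Alice and ALBERT.
SELECT name
FROM (VALUES ('Alice'), ('ALBERT'), ('bob')) AS t(name)
WHERE regexp_matches(name, '^al', 'i');

-- 'l': treat the pattern as literal text, so $ and . lose their special meaning.
SELECT regexp_matches('price: $5.00', '$5.00', 'l'); -- true

-- Without 'l', $ anchors to the end of the string, so nothing can follow it.
SELECT regexp_matches('price: $5.00', '$5.00');      -- false
```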

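Using `regexp_matches`

The regexp_matches function is optimized to the LIKE operator when possible. For best performance, pass the 'c' option (case-sensitive matching) where applicable. Note that by default the RE2 library does not match the . character to a newline.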

Original	Optimized equivalent
`regexp_matches('hello world', '^hello', 'c')`	`prefix('hello world', 'hello')`
`regexp_matches('hello world', 'world$', 'c')`	`suffix('hello world', 'world')`
`regexp_matches('hello world', 'hello.world', 'c')`	`LIKE 'hello_world'`
`regexp_matches('hello world', 'he.*rld', 'c')`	`LIKE '%he%rld%'`
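
One way to check whether a given call was rewritten is to inspect the query plan with EXPLAIN. The sketch below assumes a hypothetical table named `greetings` with a single column `s`; the plan text itself depends on the DuckDB version, so no output is shown here.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE greetings AS
    SELECT * FROM (VALUES ('hello world'), ('goodbye world')) AS t(s);

-- Look at the filter expression in the plan to see which form the optimizer chose.
EXPLAIN
SELECT s
FROM greetings
WHERE regexp_matches(s, '^hello', 'c');
```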

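Using `regexp_replace`

The regexp_replace function can be used to replace the part of a string that matches the regular expression pattern with a replacement string. The notation \d (where d is a digit denoting the group) can be used in the replacement string to refer to groups captured in the regular expression. Note that by default, regexp_replace only replaces the first occurrence of the regular expression. To replace all occurrences, use the global replace (g) flag.

Some examples of using regexp_replace: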

SELECT regexp_replace('abc', '(b|c)', 'X');        -- aXc
SELECT regexp_replace('abc', '(b|c)', 'X', 'g');   -- aXX
SELECT regexp_replace('abc', '(b|c)', '\1\1\1\1'); -- abbbbc
SELECT regexp_replace('abc', '(.*)c', '\1e');      -- abe
SELECT regexp_replace('abc', '(a)(b)', '\2\1');    -- bac
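
Group references and the global flag can be combined with more realistic inputs. The following sketch reorders a date string and contrasts first-match with global replacement; the input strings are made up for illustration.

```sql
-- Reorder the captured year, month, and day into day/month/year.
SELECT regexp_replace('2023-04-15', '(\d+)-(\d+)-(\d+)', '\3/\2/\1'); -- 15/04/2023

-- Without 'g', only the first run of digits is replaced.
SELECT regexp_replace('a1b22c333', '\d+', '#');      -- a#b22c333

-- With 'g', every run of digits is replaced.
SELECT regexp_replace('a1b22c333', '\d+', '#', 'g'); -- a#b#c#
```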

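Using `regexp_extract`

The regexp_extract function is used to extract the part of a string that matches the regular expression pattern. A specific capture group within the pattern can be extracted using the group parameter. If group is not specified, it defaults to 0, which extracts the first match of the whole pattern.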

SELECT regexp_extract('abc', '.b.');           -- abc
SELECT regexp_extract('abc', '.b.', 0);        -- abc
SELECT regexp_extract('abc', '.b.', 1);        -- (empty)
SELECT regexp_extract('abc', '([a-z])(b)', 1); -- a
SELECT regexp_extract('abc', '([a-z])(b)', 2); -- b
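
As a further sketch, the queries below pull the domain out of an email-like string and show what comes back when the pattern does not match. The input strings are made up for illustration.

```sql
-- Group 1 is everything after the @ sign.
SELECT regexp_extract('user@example.com', '@(.+)$', 1); -- example.com

-- Group 0 (the default) is the whole match, including the @ sign.
SELECT regexp_extract('user@example.com', '@(.+)$');    -- @example.com

-- No match yields the empty string, not NULL.
SELECT regexp_extract('no-at-sign', '@(.+)$', 1) = '';  -- true
```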

The regexp_extract function also supports a name_list argument, which is a LIST of strings. Using name_list, regexp_extract returns the corresponding capture groups as fields of a STRUCT:

SELECT regexp_extract('2023-04-15', '(\d+)-(\d+)-(\d+)', ['y', 'm', 'd']);

{'y': 2023, 'm': 04, 'd': 15}

SELECT regexp_extract('2023-04-15 07:59:56', '^(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)', ['y', 'm', 'd']);

{'y': 2023, 'm': 04, 'd': 15}

SELECT regexp_extract('duckdb_0_7_1', '^(\w+)_(\d+)_(\d+)', ['tool', 'major', 'minor', 'fix']);

Binder Error: Not enough group names in regexp_extract

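If the number of column names is less than the number of capture groups, only the first groups are returned. If the number of column names is greater than the number of capture groups, an error is generated.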
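
Because the name_list form returns a STRUCT whose values are strings, a typical follow-up step is to pick out individual fields and cast them. This is a minimal sketch; the column aliases `parts`, `year`, and `month` are chosen here for illustration.

```sql
SELECT
    struct_extract(parts, 'y') AS year,                  -- '2023' (still a string)
    CAST(struct_extract(parts, 'm') AS INTEGER) AS month -- 4
FROM (
    SELECT regexp_extract('2023-04-15', '(\d+)-(\d+)-(\d+)', ['y', 'm', 'd']) AS parts
) AS t;
```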

Limitations

Regular expressions only support 9 capture groups: \1, \2, \3, …, \9. Capture groups with two or more digits are not supported.