XPath and XQuery Functions and Operators 4.0

1 Introduction

Changes in 4.0

Sections with significant changes are marked Δ in the table of contents. New functions introduced in this version are marked ➕ in the table of contents.

The purpose of this document is to define functions and operators for inclusion in XPath 4.0, XQuery 4.0, and XSLT 4.0. The exact syntax used to call these functions and operators is specified in [XML Path Language (XPath) 4.0], [XQuery 4.0: An XML Query Language] and [XSL Transformations (XSLT) Version 4.0].

This document defines three classes of functions:

General purpose functions, available for direct use in user-written queries, stylesheets, and XPath expressions, whose arguments and results are values defined by the [XQuery and XPath Data Model (XDM) 3.1].
Constructor functions, used for creating instances of a datatype from values of (in general) a different datatype. These functions are also available for general use; they are named after the datatype that they return, and they always take a single argument.
Functions that specify the semantics of operators defined in [XML Path Language (XPath) 4.0] and [XQuery 4.0: An XML Query Language]. These exist for specification purposes only, and are not intended for direct calling from user-written code.

[XML Schema Part 2: Datatypes Second Edition] defines a number of primitive and derived datatypes, collectively known as built-in datatypes. This document defines functions and operations on these datatypes as well as the other types (for example, nodes and sequences of nodes) defined in Section 2.7 Schema Information ^DM31 of the [XQuery and XPath Data Model (XDM) 3.1]. These functions and operations are available for use in [XML Path Language (XPath) 4.0], [XQuery 4.0: An XML Query Language] and any other host language that chooses to reference them. In particular, they may be referenced in future versions of XSLT and related XML standards.

[XSD 1.1 Part 2] adds to the datatypes defined in [XML Schema Part 2: Datatypes Second Edition]. It introduces a new derived type xs:dateTimeStamp, and it incorporates as built-in types the two types xs:yearMonthDuration and xs:dayTimeDuration which were previously XDM additions to the type system. In addition, XSD 1.1 clarifies and updates many aspects of the definitions of the existing datatypes: for example, it extends the value space of xs:double to allow both positive and negative zero, and extends the lexical space to allow +INF; it modifies the value space of xs:Name to permit additional Unicode characters; it allows year zero and disallows leap seconds in xs:dateTime values; and it allows any character string to appear as the value of an xs:anyURI item. Implementations of this specification may support either XSD 1.0 or XSD 1.1 or both.

In some cases, this specification references XSD for the semantics of operations such as the effect of matching using regular expressions, or conversion of atomic items to strings. In most such cases there is no intended technical difference between the XSD 1.0 and XSD 1.1 specifications, but the 1.1 version often provides clearer explanations and sometimes also corrects technical errors. In such cases this specification often chooses to reference the XSD 1.1 specification. This should not be taken as implying that it is necessary to invoke an XSD 1.1 processor.

References to specific sections of some of the above documents are indicated by cross-document links in this document. Each such link consists of a pointer to a specific section followed by a superscript specifying the linked document. The superscripts have the following meanings: XQ [XQuery 4.0: An XML Query Language], XT [XSL Transformations (XSLT) Version 4.0], XP [XML Path Language (XPath) 4.0], and DM [XQuery and XPath Data Model (XDM) 4.0].

1.9 Terminology

Changes in 4.0

The term atomic value has been replaced by atomic item. [Issue 1337 PR 1361 2 August 2024]

The terminology used to describe the functions and operators on types defined in [XML Schema Part 2: Datatypes Second Edition] is defined in the body of this specification. The terms defined in this section are used in building those definitions.

Note:

Following in the tradition of [XML Schema Part 2: Datatypes Second Edition], the terms type and datatype are used interchangeably.

1.9.5 Properties of functions

This section is concerned with the question of whether two calls on a function, with the same arguments, may produce different results.

In this section the term function, unless otherwise specified, applies equally to function definitions^XP (which can be the target of a static function call) and function items^DM (which can be the target of a dynamic function call).

[Definition] An execution scope is a sequence of calls to the function library during which certain aspects of the state are required to remain invariant. For example, two calls to fn:current-dateTime within the same execution scope will return the same result. The execution scope is defined by the host language that invokes the function library. In XSLT, for example, any two function calls executed during the same transformation are in the same execution scope (except that static expressions, such as those used in use-when attributes, are in a separate execution scope).
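
As an informal illustration of the invariance described above, the following Python sketch models an execution scope as an object that captures the current date and time once and returns the same value on every call. The class name `ExecutionScope` and its method are hypothetical and are not part of this specification; a host language is free to realize execution scopes in any way.

```python
from datetime import datetime, timezone


class ExecutionScope:
    """Hypothetical model: state that must stay invariant is captured once."""

    def __init__(self):
        # Captured when the scope starts, never refreshed afterwards.
        self._current_datetime = datetime.now(timezone.utc)

    def current_datetime(self):
        # Every call within the same scope sees the same instant.
        return self._current_datetime


scope = ExecutionScope()
first = scope.current_datetime()
second = scope.current_datetime()
```

Under this model, two calls made through the same `scope` object return the same value, while a second `ExecutionScope` instance (for example, one used for static expressions) may return a different one.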

The following definition explains more precisely what it means for two function calls to return the same result:

[Definition] Two values $V1 and $V2 are defined to be identical if they contain the same number of items and the items are pairwise identical. Two items are identical if and only if one of the following conditions applies:

Both items are atomic items, of precisely the same type, and the values are equal as defined using the eq operator, using the Unicode codepoint collation when comparing strings.
Both items are nodes, and represent the same node.
Both items are maps, both maps have the same number of entries, and for every entry E₁ in the first map there is an entry E₂ in the second map such that the keys of E₁ and E₂ are the same key, and the corresponding values V₁ and V₂ are identical.
Both items are arrays, both arrays have the same number of members, and the members are pairwise identical.
Both items are function items, neither item is a map or array, and the two function items have the same function identity. The concept of function identity is explained in Section 7.1 Function Items^DM.

Some functions produce results that depend not only on their explicit arguments, but also on the static and dynamic context.

[Definition] A function definition^XP may have the property of being context-dependent: the result of such a function depends on the values of properties in the static and dynamic evaluation context of the caller as well as on the actual supplied arguments (if any). A function definition may be context-dependent for some arities in its arity range, and context-independent for others: for example fn:name#0 is context-dependent while fn:name#1 is context-independent.

[Definition] A function definition^XP that is not context-dependent is called context-independent.

The main categories of context-dependent functions are:

Functions that explicitly deliver the value of a component of the static or dynamic context, for example fn:static-base-uri, fn:default-collation, fn:position, or fn:last.
Functions with an optional parameter whose default value is taken from the static or dynamic context of the caller, usually either the context value (for example, fn:node-name) or the default collation (for example, fn:index-of).
Functions that use the static context of the caller to expand or disambiguate the values of supplied arguments: for example fn:doc expands its first argument using the static base URI of the caller, and xs:QName expands its first argument using the in-scope namespaces of the caller.

[Definition] A function is focus-dependent if its result depends on the focus^XP31 (that is, the context item, position, or size) of the caller.

[Definition] A function that is not focus-dependent is called focus-independent.

Note:

Some functions depend on aspects of the dynamic context that remain invariant within an execution scope, such as the implicit timezone. Formally this is treated in the same way as any other context dependency, but internally, the implementation may be able to take advantage of the fact that the value is invariant.

Note:

User-defined functions in XQuery and XSLT may depend on the static context of the function definition (for example, the in-scope namespaces) and also in a limited way on the dynamic context (for example, the values of global variables). However, the only way they can depend on the static or dynamic context of the caller — which is what concerns us here — is by defining optional parameters whose default values are context-dependent.

Note:

Because the focus is a specific part of the dynamic context, all focus-dependent functions are also context-dependent. A context-dependent function, however, may be either focus-dependent or focus-independent.

A function definition that is context-dependent can be used as the target of a named function reference, can be partially applied, and can be found using fn:function-lookup. The principle in such cases is that the static context used for the function evaluation is taken from the static context of the named function reference, partial function application, or the call on fn:function-lookup; and the dynamic context for the function evaluation is taken from the dynamic context of the evaluation of the named function reference, partial function application, or the call of fn:function-lookup. These constructs all deliver a function item^DM having a captured context based on the static and dynamic context of the construct that created the function item. This captured context forms part of the closure of the function item.

The result of a dynamic call to a function item never depends on the static or dynamic context of the dynamic function call, only (where relevant) on the captured context held within the function item itself.

The fn:function-lookup function is a special case because it is potentially dependent on everything in the static and dynamic context. This is because the static and dynamic context of the call to fn:function-lookup form the captured context of the function item that fn:function-lookup returns.

[Definition] A function that is guaranteed to produce identical results from repeated calls within a single execution scope if the explicit and implicit arguments are identical is referred to as deterministic.

[Definition] A function that is not deterministic is referred to as nondeterministic.

All functions defined in this specification are deterministic unless otherwise stated. Exceptions include the following:

[Definition] Some functions (such as fn:distinct-values, fn:unordered, map:keys, and map:for-each) produce results in an implementation-defined or implementation-dependent order. In such cases two calls with the same arguments are not guaranteed to produce the results in the same order. These functions are said to be nondeterministic with respect to ordering.
Some functions (such as fn:analyze-string, fn:parse-xml, fn:parse-xml-fragment, fn:parse-html, and fn:json-to-xml) construct a tree of nodes to represent their results. There is no guarantee that repeated calls with the same arguments will return the same identical node (in the sense of the is operator). However, if non-identical nodes are returned, their content will be the same in the sense of the fn:deep-equal function. Such a function is said to be nondeterministic with respect to node identity.
Some functions (such as fn:doc and fn:collection) create new nodes by reading external documents. Such functions are guaranteed to be deterministic with the exception that an implementation is allowed to make them nondeterministic as a user option.

Where the results of a function are described as being (to a greater or lesser extent) implementation-defined or implementation-dependent, this does not by itself remove the requirement that the results should be deterministic: that is, that repeated calls with the same explicit and implicit arguments must return identical results.

[Definition] The function fn:concat is defined to be variadic: it accepts any number of arguments. No other function has this property.

6 Regular expressions

The functions described in this section make use of a regular expression syntax for pattern matching. The syntax and semantics of regular expressions are defined in this section.

6.1 Regular expression syntax

Changes in 4.0

Regular expressions can include comments (starting and ending with #) if the c flag is set. [Issue 999 PR 1022 20 February 2024]
Word boundaries can be matched. Lookahead and lookbehind assertions are supported. Assertions (including ^ and $) can no longer be followed by a quantifier. [Issues 998 1006 PR 1856]

The regular expression syntax used by these functions is defined in terms of the regular expression syntax specified in XSD 1.1 (see [XSD 1.1 Part 2]), which in turn is based on the established conventions of languages such as Perl. However, because XML Schema uses regular expressions only for validity checking, it omits some facilities that are widely used with other languages. XPath, therefore, extends the XML Schema regular expression syntax to reinstate some of these capabilities.

Note:

Implementers should consult [UTS #18] for information on using regular expression processing on Unicode characters.

The regular expression syntax and semantics are identical to those defined in [XSD 1.1 Part 2] with the additions described in the following subsections.

Note:

In [XSD 1.1 Part 2] there are no substantive technical changes to the syntax or semantics of regular expressions relative to [XML Schema Part 2: Datatypes Second Edition], but a number of errors and ambiguities have been resolved. For example, the rules for the interpretation of hyphens within square brackets in a regular expression have been clarified; and the semantics of regular expressions are no longer tied to a specific version of Unicode.

XSD 1.1 is therefore used as the specification baseline, even for processors that only support XSD 1.0.

6.1.1 Processing model for regular expressions

As well as extending the XSD 1.1 syntax for regular expressions, this specification also extends the processing model.

In XSD, a regular expression is defined to denote a set of strings, and the only functionality offered is to test whether a string matches a regular expression: that is, whether it is a member of the set of strings denoted by the regular expression.

In this specification, matching a string S against a regular expression delivers a more complex outcome.

First some terminology:

[Definition] A string of length N has N+1 character positions: one immediately before each character in the string, and one after the last character. In interfaces where character positions are exposed, they are numbered from 1 to N+1.
[Definition] A segment of a string S is a sequence of zero or more contiguous characters starting at a given character position within S. Segments of a string are uniquely identified by their start position and length. The sequence of characters making up a segment is referred to as the string value of the segment.
[Definition] The end position of a segment is the start position of the segment plus its length.

The operation of matching a string S against a regular expression delivers:

A set of matching segments. The string S as a whole is said to match the regular expression if the set of matching segments is non-empty.
For each matching segment M, a collection of captured groups. This is a mapping from positive integers to segments. The integer is called the group number, and corresponds to the ordinal sequence of opening parentheses of capturing subexpressions within the regular expression, as explained below. The corresponding segment is always a segment of S, but in the case of capturing expressions within lookahead assertions, it is not necessarily a segment of M.
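
The outcome described above can be pictured with a small Python sketch. It assumes Python's `re` module as a stand-in for a conforming regular expression engine (its dialect differs from the one defined here), and the names `Segment` and `match_outcome` are hypothetical. Segments are identified by a 1-based start position and a length, and the end position is derived as start plus length.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int   # 1-based character position
    length: int  # number of characters, may be zero

    @property
    def end(self) -> int:
        # End position = start position + length.
        return self.start + self.length

    def string_value(self, s: str) -> str:
        return s[self.start - 1 : self.end - 1]


def match_outcome(pattern: str, s: str):
    """Return a list of (matching segment, captured groups) pairs."""
    outcome = []
    for m in re.finditer(pattern, s):
        segment = Segment(m.start() + 1, m.end() - m.start())
        groups = {}
        for n in range(1, m.re.groups + 1):
            if m.start(n) != -1:  # -1 means group n captured nothing
                groups[n] = Segment(m.start(n) + 1, m.end(n) - m.start(n))
        outcome.append((segment, groups))
    return outcome


outcome = match_outcome(r"(\d+)-(\d+)", "10-20 and 3-4")
lookahead = match_outcome(r"a(?=(b))", "ab")
```

In the second call the captured group lies at position 2, outside the matching segment at position 1, which illustrates why a group captured within a lookahead assertion is a segment of S but not necessarily of M.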

The semantics of particular constructs in a regular expression are affected by a set of flags. The available flags and their effect are defined in 6.2 Flags.

The different functions available, such as fn:replace and fn:tokenize, are defined in terms of this outcome. For example:

The function fn:matches returns true if the set of matching segments is non-empty.
The function fn:replace replaces matching segments of the input string with a replacement string.
The function fn:tokenize returns the segments of the input string that appear between the matching segments.
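
The following Python sketch shows how the three functions can be expressed in terms of the list of matching segments. It is a simplification under stated assumptions: Python's `re.finditer` supplies the segments, the replacement is treated as a literal string, and the helper names are hypothetical. The actual functions have additional rules that are not modelled here.

```python
import re


def disjoint_segments(pattern, s):
    """(start, length) pairs with 1-based start, as found by re.finditer."""
    return [(m.start() + 1, m.end() - m.start()) for m in re.finditer(pattern, s)]


def matches(s, pattern):
    # True if the set of matching segments is non-empty.
    return len(disjoint_segments(pattern, s)) > 0


def replace(s, pattern, replacement):
    # Copy the text between matching segments, substituting each segment.
    out, pos = [], 1
    for start, length in disjoint_segments(pattern, s):
        out.append(s[pos - 1 : start - 1])
        out.append(replacement)  # literal replacement only (assumption)
        pos = start + length
    out.append(s[pos - 1 :])
    return "".join(out)


def tokenize(s, pattern):
    # Return the text that appears between the matching segments.
    tokens, pos = [], 1
    for start, length in disjoint_segments(pattern, s):
        tokens.append(s[pos - 1 : start - 1])
        pos = start + length
    tokens.append(s[pos - 1 :])
    return tokens
```

For example, `replace("abracadabra", "bra", "*")` yields `"a*cada*"` and `tokenize("a, b,c", r",\s*")` yields `["a", "b", "c"]` under this sketch.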

In principle the set of segments that match a regular expression can be determined by enumerating all the segments of the input string and examining each one independently to establish whether it matches. In practice, however:

If several matching segments have the same starting position, then only one of them is returned. This is chosen as follows:
- In the case of a choice (operator "|") the first matching branch is chosen.
- In the case of a repetition with a greedy quantifier (for example "+" or "*") the longest matching segment is chosen.
- In the case of a repetition with a reluctant quantifier (for example "+?" or "*?") the shortest matching segment is chosen.
A matching segment is not included in the result if it overlaps an earlier matching segment: specifically, a segment with start position S₁ is excluded if there is a segment that has start position S₀ and length L₀, where S₀ < S₁ < S₀+L₀.

Note:

Two segments can be adjacent: that is, the start position of one segment can be equal to the end position of the previous segment. This is true even when the second segment is zero-length (the two segments are not considered to be overlapping, even though they have the same end position). This means, for example, that the regular expression a*(?=x) has two non-overlapping matches against the string aaax, one at position 1 and the other at position 4.
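
The selection rules and the adjacency case can be observed with Python's `re` module, used here only as an approximation of the processing model (it assumes Python 3.7 or later, where an empty match may be adjacent to a previous non-empty match). The helper `segments` is hypothetical and reports 1-based start positions with lengths.

```python
import re


def segments(pattern, s):
    """(start, length) pairs with 1-based start positions."""
    return [(m.start() + 1, m.end() - m.start()) for m in re.finditer(pattern, s)]


# Choice: the first matching branch wins, even though "ab" is longer.
choice = segments(r"a|ab", "ab")

# Greedy repetition: the longest segment at position 1 is chosen,
# and the overlapping candidates at positions 2 and 3 are eliminated.
greedy = segments(r"a+", "aaa")

# Reluctant repetition: the shortest segment at each start position.
reluctant = segments(r"a+?", "aaa")

# Adjacent segments: a zero-length match directly after a non-empty one.
adjacent = segments(r"a*(?=x)", "aaax")
```

The last call returns one segment at position 1 with length 3 and one at position 4 with length 0, matching the example in the note above.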

[Definition] The disjoint matching segments obtained by applying a regular expression R to a string S in the presence of a set of flags F are the segments of S that match R (using flags F), after elimination of overlapping segments.

The semantics of a regular expression are thus defined by stating which segments of an input string it matches, and what the captured groups corresponding to this match are. This is defined recursively for each construct that may appear within a regular expression, in terms of the outcome of applying its subexpressions.

For constructs defined in XSD 1.1 (branch, piece, NormalChar, charClass), XSD defines a set of strings denoted by the construct. The corresponding semantics for this specification are that the segments matched by such a construct are the segments whose string value is contained in this set.

For constructs added to the XSD 1.1 baseline by this specification, the semantics are defined in the sections that follow.

6.1.3 Regular expression grammar

The grammar for regular expressions is summarized here. Rules that differ from their definition in XSD 1.1 are marked with the character § against their names.

In these rules the notation 【abc】 matches any of the characters 'a', 'b', or 'c', while 【0➜9】 matches any character whose Unicode codepoint is within a given range, and ¬【abc】 matches any character other than 'a', 'b', or 'c'. These symbols are used in place of the more conventional notation to allow special characters such as square brackets and hyphens to appear directly without escaping. Within the lenticular brackets, all characters other than ➜ (including hyphen and backslash) represent themselves.

regExp              ::= branch ( '|' branch )*
branch              ::= piece*
piece               ::= (atom quantifier?) | assertion
§quantifier         ::= ( 【?*+】 | ( '{' quantity '}' ) ) '?'?
quantity            ::= quantRange | quantMin | QuantExact
quantRange          ::= QuantExact ',' QuantExact
quantMin            ::= QuantExact ','
QuantExact          ::= 【0➜9】+
§atom               ::= NormalChar | charClass | ( '(' '?:'? regExp ')' ) | backReference 
NormalChar          ::= ¬【.\?*+{}()|[]】	
charClass           ::= SingleCharEsc | charClassEsc | charClassExpr | WildcardEsc
charClassExpr       ::= '[' charGroup ']'
charGroup           ::= ( posCharGroup | negCharGroup ) ( '-' charClassExpr )?
posCharGroup        ::= ( charGroupPart )+
negCharGroup        ::= '^' posCharGroup  
charGroupPart       ::= singleChar | charRange | charClassEsc
singleChar          ::= SingleCharEsc | SingleCharNoEsc
charRange           ::= singleChar '-' singleChar
SingleCharNoEsc     ::= ¬【\[]】
charClassEsc        ::= ( MultiCharEsc | catEsc | complEsc )
§SingleCharEsc      ::= '\' 【nrt\|.?*+(){}$-[]^#】	
catEsc              ::= '\p{' charProp '}'
complEsc            ::= '\P{' charProp '}'
charProp            ::= IsCategory | IsBlock
IsCategory          ::= Letters | Marks | Numbers | Punctuation 
                        | Separators | Symbols | Others
Letters             ::= 'L' 【ultmo】?
Marks               ::= 'M' 【nce】?
Numbers             ::= 'N' 【dlo】?
Punctuation         ::= 'P' 【cdseifo】?
Separators          ::= 'Z' 【slp】?
Symbols             ::= 'S' 【mcko】?
Others              ::= 'C' 【cfon】?
IsBlock             ::= 'Is' 【a➜zA➜Z0➜9-】+
MultiCharEsc        ::= '\' 【sSiIcCdDwW】
WildcardEsc         ::= '.'
§assertion          ::= startOfString | endOfString | wordBoundary 
                        | positiveLookahead | negativeLookahead 
                        | positiveLookbehind | negativeLookbehind
§startOfString      ::= '^'
§endOfString        ::= '$'
§wordBoundary       ::= '\b' | '\B'
§backReference      ::= '\' 【1➜9】【0➜9】*
§positiveLookahead  ::= '(?=' regExp ')' 
                        | '(*positive_lookahead:' regExp ')'
§negativeLookahead  ::= '(?!' regExp ')' 
                        | '(*negative_lookahead:' regExp ')'
§positiveLookbehind ::= '(?<=' simpleRegExp ')' 
                        | '(*positive_lookbehind:' simpleRegExp ')'
§negativeLookbehind ::= '(?<!' simpleRegExp ')' 
                        | '(*negative_lookbehind:' simpleRegExp ')'
§simpleRegExp       ::= simplePiece ( '|' simplePiece )*
§simplePiece        ::= (NormalChar | charClass)*

This grammar applies to the regular expression after removal of whitespace and comments if enabled by the x and c flags respectively: see 6.2 Flags.

XSD 1.1 defines additional rules to disambiguate this grammar.

6.1.4 Reluctant quantifiers

Reluctant quantifiers are supported. They are indicated by a ? following a quantifier. Specifically:

X?? matches X, once or not at all
X*? matches X, zero or more times
X+? matches X, one or more times
X{n}? matches X, exactly n times
X{n,}? matches X, at least n times
X{n,m}? matches X, at least n times, but not more than m times

Quantifiers that are not reluctant are referred to as greedy.

When a quantifier appears at the outermost level of a regular expression, the distinction between greedy and reluctant quantifiers affects the set of matching segments delivered by the matching operation. With a greedy quantifier, the longest matching segment at a given start position is returned; with a reluctant quantifier, the shortest matching segment at a given start position is returned.

When a quantifier appears within a subexpression, the quantified subexpression matches the shortest possible substring consistent with the match as a whole succeeding if the quantifier is reluctant, or the longest possible substring consistent with the match as a whole succeeding if the quantifier is greedy.

Note:

Reluctant quantifiers have no effect on the results of the boolean fn:matches function, since this function is only interested in discovering whether a matching segment exists, regardless of its start position and length.
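
A brief Python illustration of the difference follows. Python's `re` module supports the same `?` suffix for reluctant quantifiers, so it is used here as an approximation; the variable names are illustrative only.

```python
import re

text = "<a><b>"

# Outermost level: longest versus shortest segment at start position 1.
greedy_outer = re.search(r"<.+>", text).group()
reluctant_outer = re.search(r"<.+?>", text).group()

# Within a subexpression: the quantified group takes as much (greedy)
# or as little (reluctant) as is consistent with the whole match succeeding.
greedy_inner = re.fullmatch(r"(a+)(a*)", "aaa").groups()
reluctant_inner = re.fullmatch(r"(a+?)(a*)", "aaa").groups()

# The boolean outcome is the same either way.
same_boolean = bool(re.search(r"<.+>", text)) == bool(re.search(r"<.+?>", text))
```

Here the greedy form matches the whole of `<a><b>` while the reluctant form stops at `<a>`, yet both report that a match exists.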

6.1.5 Captured groups

The regular expression syntax defined by [XML Schema Part 2: Datatypes Second Edition] allows a regular expression to contain parenthesized subexpressions, but attaches no special significance to them. Some operations associated with regular expressions (for example, back-references, and the fn:replace function) allow access to the parts of the input string that matched a parenthesized subexpression (called captured groups).

[Definition] A left parenthesis is recognized as a capturing left parenthesis provided it is not immediately followed by ? or * (see below), is not within a character group (square brackets), and is not escaped with a backslash. The sub-expression enclosed by a capturing left parenthesis and its matching right parenthesis is referred to as a capturing subexpression.

More specifically, the capturing subexpression enclosed by the Nth capturing left parenthesis within the regular expression (determined by its character position in left-to-right order, and counting from one) is referred to as the Nth capturing subexpression.

For example, in the regular expression A(BC(?:D(EF(GH[()])))), the subexpression BC(?:D(EF(GH[()]))) is capturing subexpression 1, the subexpression EF(GH[()]) is capturing subexpression 2, and the subexpression GH[()] is capturing subexpression 3.
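
The rule for recognizing capturing left parentheses can be sketched as a simple left-to-right scan. The function below is a hypothetical helper written in Python, not part of this specification; it assumes the regular expression is syntactically valid and has already had whitespace and comments removed.

```python
import re
from typing import List


def capturing_paren_positions(regex: str) -> List[int]:
    """Return the 1-based positions of capturing left parentheses."""
    positions = []
    class_depth = 0  # nesting depth of square brackets (character groups)
    i = 0
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "[":
            class_depth += 1
        elif ch == "]" and class_depth > 0:
            class_depth -= 1
        elif ch == "(" and class_depth == 0:
            # Not capturing if immediately followed by "?" or "*".
            if regex[i + 1 : i + 2] not in ("?", "*"):
                positions.append(i + 1)
        i += 1
    return positions


example = "A(BC(?:D(EF(GH[()]))))"
positions = capturing_paren_positions(example)
```

For the example expression the scan finds three capturing left parentheses, at character positions 2, 9 and 12; the parenthesis followed by `?:` and the one inside square brackets are not counted.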

When, in the course of evaluating a regular expression, a particular segment of the input matches a capturing subexpression, that segment becomes available as a captured group. The segment matched by the Nth capturing subexpression is referred to as the Nth captured group. By convention, the segment captured by the entire regular expression is treated as captured group 0 (zero).

When a capturing subexpression is matched more than once (because it is within a construct that allows repetition), then only the last substring that it matched will be captured. Note that this rule is not sufficient in all cases to ensure an unambiguous result, especially in cases where (a) the regular expression contains nested repeating constructs, and/or (b) the repeating construct matches a zero-length string. In such cases it is implementation-dependent which substring is captured. For example given the regular expression (a*)+ and the input string "aaaa", an implementation might legitimately capture either "aaaa" or a zero length string as the content of the captured subgroup.

Parentheses that are required to group terms within the regular expression, but which are not required for capturing of substrings, can be represented using the syntax (?:xxxx).

In the absence of back-references (see below), the presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be counted by operations (such as fn:replace and back-references) that number the capturing sub-expressions within a regular expression.
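
The behaviour of captured groups can be illustrated with Python's `re` module, which follows the same numbering convention for the cases shown. This is an approximation for illustration; the implementation-dependent cases mentioned above are deliberately not exercised.

```python
import re

# Group 0 is the segment matched by the entire regular expression.
m = re.fullmatch(r"(\d{4})-(\d{2})", "2024-08")
whole, year, month = m.group(0), m.group(1), m.group(2)

# A capturing subexpression matched repeatedly keeps only its last match.
last = re.fullmatch(r"([a-z])+", "xyz").group(1)

# (?:...) groups terms without being counted as a capturing subexpression.
non_capturing = re.compile(r"(?:ab)+(c)")
group_count = non_capturing.groups
captured = non_capturing.fullmatch("ababc").group(1)
```

In the last expression the only numbered group is `(c)`, so it is group 1 even though it is the second parenthesized subexpression.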

6.1.6 Back-references

Back-references are allowed outside a character class expression. A back-reference is an additional kind of atom. The construct \N where N is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the resulting number NN is such that the back-reference is preceded by the opening parenthesis of the NNth capturing left parenthesis. The regular expression is invalid if a back-reference refers to a capturing sub-expression that does not exist or whose closing right parenthesis occurs after the back-reference.

A back-reference with number N matches a string that is the same as the value of the Nth captured substring.

For example, the regular expression ('|").*\1 matches a sequence of characters delimited either by an apostrophe at the start and end, or by a quotation mark at the start and end.

If no string has been matched by the Nth capturing sub-expression, the back-reference is interpreted as matching a zero-length string.
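
The quotation example above can be tried with Python's `re` module, which uses the same `\N` syntax. This is an illustrative approximation only: in particular, Python treats a back-reference to a group that has not matched as a failure rather than as a zero-length match, so that rule is not demonstrated here.

```python
import re

# A sequence of characters delimited by matching quote characters.
quoted = re.compile(r"""('|").*\1""")

single = quoted.fullmatch("'hello'")
double = quoted.fullmatch('"hello"')
mismatched = quoted.search("'hello\"")  # opening and closing quotes differ

# A back-reference can also detect an immediately repeated word.
repeated = re.search(r"(\w+) \1", "only on the the captured context")
```

The first two inputs match because the closing delimiter equals the captured opening delimiter; the third does not match anywhere in the string.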

Note:

Within a character class expression, \ followed by a digit is invalid. Some other regular expression languages interpret this as an octal character reference.

6.1.7 Unicode block names

A regular expression that uses a Unicode block name that is not defined in the version(s) of Unicode supported by the processor (for example \p{IsBadBlockName}) is deemed to be invalid [err:FORX0002].

Note:

XSD 1.0 does not say how this situation should be handled; XSD 1.1 says that it should be handled by treating all characters as matching.

6.1.8 Assertions

Assertions (sometimes called zero-width assertions) test whether a particular condition applies at the current position in the input string (resulting in either a match or a no-match), but they do not cause any change to the current position.

Assertions fall into the following categories:

The startOfString assertion ^ tests whether the current position is at the start of the string.
The endOfString assertion $ tests whether the current position is at the end of the string.
The boundary assertions \b and \B test whether the current position is at the start or end of a word.
The positive and negative lookahead assertions test whether there is (or is not) a substring starting at the current position that matches a given regular expression.
The positive and negative lookbehind assertions test whether there is (or is not) a substring ending at the current position that matches a given regular expression.

An assertion must not be followed by a quantifier.

Note:

Previous versions of this specification allowed a quantifier to follow the startOfString and endOfString assertions, though this served no practical purpose. Processors may provide an option to allow quantifiers to be used in this situation in order to preserve backward compatibility.

6.1.8.2 Boundary assertions

The assertion \b matches at any position where one of the following conditions is true:

The current position is the start of the string, the string is not empty, and the first character in the string matches \w.
The current position is the end of the string, the string is not empty, and the last character in the string matches \w.
The character before the current position matches \w and the character after the current position matches \W.
The character before the current position matches \W and the character after the current position matches \w.

Informally, \b matches if the current position is the start or end of a word, where a word is defined as a sequence of consecutive characters other than codepoints in Unicode groups P (punctuation), Z (separator), or C (other).

The assertion \B matches at any position where \b does not match.

Note:

\b can be rewritten to an equivalent form in terms of lookbehind and lookahead assertions:

( (*positive_lookbehind:\w)(*positive_lookahead:\W) ) | ( (*positive_lookbehind:\W)(*positive_lookahead:\w) )

A similar rewrite is possible for \B.
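The four conditions for \b can be sketched directly in Python. This is a minimal illustration, assuming that \w is approximated as "any character whose Unicode general category is not in groups P, Z, or C", as in the informal description above; the function names is_word_char, is_boundary, and boundaries are hypothetical, and positions are counted as the number of characters before the current position.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    # Approximates \w: any character outside Unicode groups P, Z, and C.
    return unicodedata.category(ch)[0] not in "PZC"


def is_boundary(s: str, pos: int) -> bool:
    # pos is the number of characters before the current position:
    # 0 is the start of the string and len(s) is the end.
    before = is_word_char(s[pos - 1]) if pos > 0 else None
    after = is_word_char(s[pos]) if pos < len(s) else None
    if before is None and after is None:
        return False          # empty string: no boundary anywhere
    if before is None:
        return after          # start of string: first character must match \w
    if after is None:
        return before         # end of string: last character must match \w
    return before != after    # one side matches \w, the other matches \W


def boundaries(s: str) -> list[int]:
    """Return every position in s at which the \\b assertion would match."""
    return [pos for pos in range(len(s) + 1) if is_boundary(s, pos)]


print(boundaries("In the"))
```

Under the same assumptions, \B corresponds to `not is_boundary(s, pos)`.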

6.3 Functions using regular expressions

Function	Meaning
`fn:matches`	Returns `true` if the supplied string matches a given regular expression.
`fn:replace`	Returns a string produced from the input string by replacing any segments that match a given regular expression with a supplied replacement string, provided either literally, or by invoking a supplied function.
`fn:tokenize`	Returns a sequence of strings constructed by splitting the input wherever a separator is found; the separator is any substring that matches a given regular expression.
`fn:analyze-string`	Analyzes a string using a regular expression, returning an XML structure that identifies which parts of the input string matched or failed to match the regular expression, and in the case of matched substrings, which substrings matched each capturing group in the regular expression.

6.3.2 fn:replace

Changes in 4.0 ⬇ ⬆

The $action argument is new in 4.0. [ 18 July 2023]
It is now permitted for the regular expression to match a zero-length string. [ PR 1856]

Summary

Returns a string produced from the input string by replacing any segments that match a given regular expression with a supplied replacement string, provided either literally, or by invoking a supplied function.

Signature

`fn:replace`(
`$value`	`as` `xs:string?`,
`$pattern`	`as` `xs:string`,
`$replacement`	`as` `xs:string?`	`:=` `()`,
`$flags`	`as` `xs:string?`	`:=` `''`,
`$action`	`as` `(fn(xs:untypedAtomic, xs:untypedAtomic*) as item()?)?`	`:=` `()`
) `as` `xs:string`

Properties

This function is deterministic, context-independent, and focus-independent.

Rules

If $value is the empty sequence, it is interpreted as the zero-length string.

If the $flags argument is omitted or if it is an empty sequence, the effect is the same as setting $flags to a zero-length string. Flags are defined in 6.2 Flags.

The string $value is matched against the regular expression $pattern, using the supplied $flags, to obtain a set of disjoint matching segments. A replacement string R for each of these segments (say M) is determined by the values of the $replacement and/or $action arguments, by applying the first of the following rules that applies:

If the $action argument is present and is not an empty sequence, R is obtained by calling the $action function.
The first argument to the $action function is the string to be replaced, provided as xs:untypedAtomic.
The second argument to the $action function provides the captured groups as an xs:untypedAtomic sequence. The Nth item in this sequence is the string value of the segment captured by the Nth capturing subexpression. If the Nth capturing subexpression was not matched, the Nth item will be the zero-length string.
Note that the rules for function coercion mean that the function actually supplied for the $action parameter may be an arity-1 function: the second argument does not need to be declared if it is not used.
The replacement string R is obtained by applying the fn:string function to the result of the function call.
If $replacement is absent or empty, R is a zero-length string.
If the q flag is present, R is the value of $replacement.
Otherwise, the value of $replacement is processed as follows.
Within the supplied $replacement string, a variable marker $N (where N is an unsigned integer) may be used to refer to the Nth captured group associated with M. The replacement string R is obtained by replacing each of these variable markers with the string value of the relevant captured group. The variable marker $0 refers to the substring captured by the regular expression as a whole.
A literal $ character within the replacement string must be written as \$, and a literal \ character must be written as \\.
More specifically, the rules are as follows, where S is the number of capturing subexpressions in the regular expression, and N is the decimal number formed by taking all the digits that consecutively follow the $ character in $replacement:
1. If N=0, then the variable is replaced by the string value of M.
2. If 1<=N<=S, then the variable marker is replaced by the string value of the Nth captured group associated with M. If the Nth parenthesized sub-expression was not matched, then the variable marker is replaced by the zero-length string.
3. If S<N<=9, then the variable marker is replaced by the zero-length string.
4. Otherwise (if N>S and N>9), the last digit of N is taken to be a literal character to be included “as is” in the replacement string, and the rules are reapplied using the number N formed by stripping off this last digit.
  For example, if the replacement string is "$23" and there are 5 substrings, the result contains the value of the substring that matches the second capturing subexpression, followed by the digit 3.

The function returns the xs:string that is obtained by replacing each of the disjoint matching segments of $value with the corresponding value of R.
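The expansion of variable markers in $replacement (rules 1 to 4 above, together with the \$ and \\ escapes) can be sketched as follows. This is an illustrative Python sketch, not a conformant implementation: the name expand_replacement and its parameters are hypothetical, unmatched groups are represented by None, and ValueError stands in for the error err:FORX0004.

```python
def expand_replacement(replacement, match_text, groups):
    """Expand $N markers in replacement for one matching segment.

    match_text is the matched segment M; groups holds the S captured
    groups, with None for a group that was not matched.
    """
    s = len(groups)
    out = []
    i = 0
    while i < len(replacement):
        ch = replacement[i]
        if ch == "\\":
            nxt = replacement[i + 1:i + 2]
            if nxt not in ("\\", "$"):
                raise ValueError("FORX0004: invalid backslash in replacement")
            out.append(nxt)               # \\ gives \, and \$ gives $
            i += 2
        elif ch == "$":
            j = i + 1
            while j < len(replacement) and replacement[j] in "0123456789":
                j += 1
            digits = replacement[i + 1:j]
            if not digits:
                raise ValueError("FORX0004: $ not followed by a digit")
            literal_tail = ""
            # Rule 4: while N > S and N > 9, the last digit becomes a literal.
            while int(digits) > s and int(digits) > 9:
                literal_tail = digits[-1] + literal_tail
                digits = digits[:-1]
            n = int(digits)
            if n == 0:
                out.append(match_text)            # rule 1
            elif n <= s:
                out.append(groups[n - 1] or "")   # rule 2
            # Rule 3 (S < N <= 9): the marker contributes nothing.
            out.append(literal_tail)
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# "$23" with five captured groups: group 2 followed by the digit 3.
print(expand_replacement("$23", "abcde", ("a", "b", "c", "d", "e")))
```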

Error Conditions

A dynamic error is raised [err:FORX0002] if the value of $pattern is invalid according to the rules described in section 6.1 Regular expression syntax.

A dynamic error is raised [err:FORX0001] if the value of $flags is invalid according to the rules described in section 6.2 Flags.

In the absence of the q flag, a dynamic error is raised [err:FORX0004] if the value of $replacement contains a dollar sign ($) character that is not immediately followed by a digit 0-9 and not immediately preceded by a backslash (\).

In the absence of the q flag, a dynamic error is raised [err:FORX0004] if the value of $replacement contains a backslash (\) character that is not part of a \\ pair, unless it is immediately followed by a dollar sign ($) character.

A dynamic error is raised [err:FORX0005] if both the $replacement and $action arguments are supplied, and neither is an empty sequence.

Notes

If the input string contains no substring that matches the regular expression, the result of the function is a single string identical to the input string.

If two overlapping substrings of $value both match the $pattern, then only the first one (that is, the one whose first character comes first in the $value string) is replaced.

If two alternatives within the pattern both match at the same position in $value, then the match that is chosen is the one matched by the first alternative. For example:

 replace("abcd", "(ab)|(a)", "[1=$1][2=$2]") returns "[1=ab][2=]cd"

The rules for disjoint matching segments allow a zero-length matching segment to immediately follow a non-zero-length matching segment (they are not considered to overlap). This means, for example, that the regular expression .* will typically produce two matches: one matching segment containing all the characters in the input string, and a second zero-length matching segment at the end position of the string.
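Two of these notes can be observed with Python's re module, which happens to behave the same way in these particular cases (Python 3.7 or later is assumed for the zero-length match). Python's dialect differs from the one defined here, for example in writing group references in the replacement as \1 rather than $1, so this is only an illustration; the variable names are hypothetical.

```python
import re

# The first alternative wins at a given position; the unmatched
# second group expands to a zero-length string.
first_alternative = re.sub(r"(ab)|(a)", r"[1=\1][2=\2]", "abcd")
print(first_alternative)

# .* matches the whole string, then matches again (zero-length) at the end.
spans = [m.span() for m in re.finditer(r".*", "abc")]
print(spans)
```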

Examples

Expression:	`replace("abracadabra", "bra", "*")`
Result:	"a*cada*"
Expression:	`replace("abracadabra", "a.*a", "*")`
Result:	"*"
Expression:	`replace("abracadabra", "a.*?a", "*")`
Result:	"*c*bra"
Expression:	`replace("abracadabra", "a", "")`
Result:	"brcdbr"
Expression:	`replace("abracadabra", "a(.)", "a$1$1")`
Result:	"abbraccaddabbra"
Expression:	`replace("AAAA", "A+", "b")`
Result:	"b"
Expression:	`replace("AAAA", "A+?", "b")`
Result:	"bbbb"
Expression:	`replace("In the beginning was the Word", "\b", "|")`
Result:	"|In| |the| |beginning| |was| |the| |Word|"
Expression:	`replace("abcd!", "[a-z](?=.*(.)$)", "$0$1")`
Result:	"a!b!c!d!"
Expression:	`replace("darted", "^(.*?)d(.*)$", "$1c$2")`
Result:	"carted" (The first `d` is replaced.)
Expression:	replace("abracadabra", "bra", action := fn { "*" })
Result:	"a*cada*"
Expression:	replace( "abracadabra", "bra", action := upper-case#1 )
Result:	"aBRAcadaBRA"
Expression:	replace("Chapter 9", "[0-9]+", action := fn { . + 1 })
Result:	"Chapter 10"
Expression:	replace( "LHR to LAX", "\b[A-Z]{3}\b", action := { 'LAX': 'Los Angeles', 'LHR': 'London' } )
Result:	"London to Los Angeles"
Expression:	replace( "57°43′30″", "([0-9]+)°([0-9]+)′([0-9]+)″", action := fn($s, $groups) { string($groups[1] + $groups[2] ÷ 60 + $groups[3] ÷ 3600) || '°' } )
Result:	"57.725°"
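The way the $action function is called can be mimicked in Python, as a rough sketch. The helper replace_with_action is hypothetical: it passes the matched string and the captured groups (with the zero-length string for unmatched groups) to the callback and converts the result to a string, which loosely mirrors the rules above. Python strings are not xs:untypedAtomic, so arithmetic needs an explicit int() conversion.

```python
import re


def replace_with_action(value, pattern, action):
    """Replace each match of pattern in value with str(action(match, groups))."""
    def substitute(m):
        # Unmatched groups are passed as zero-length strings.
        groups = m.groups(default="")
        return str(action(m.group(0), groups))
    return re.sub(pattern, substitute, value)


airports = {"LAX": "Los Angeles", "LHR": "London"}

print(replace_with_action("abracadabra", "bra", lambda m, g: m.upper()))
print(replace_with_action("Chapter 9", "[0-9]+", lambda m, g: int(m) + 1))
print(replace_with_action("LHR to LAX", r"\b[A-Z]{3}\b", lambda m, g: airports[m]))
```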

6.3.4 fn:analyze-string

Changes in 4.0 ⬇ ⬆

The output of the function is extended to allow the representation of captured groups found within lookahead assertions. [ PR 1856]
It is now permitted for the regular expression to match a zero-length string. [ PR 1856]

Summary

Analyzes a string using a regular expression, returning an XML structure that identifies which parts of the input string matched or failed to match the regular expression, and in the case of matched substrings, which substrings matched each capturing group in the regular expression.

Signature

`fn:analyze-string`(
`$value`	`as` `xs:string?`,
`$pattern`	`as` `xs:string`,
`$flags`	`as` `xs:string?`	`:=` `""`
) `as` `element(fn:analyze-string-result)`

Properties

This function is nondeterministic, context-independent, and focus-independent.

Rules

If the $flags argument is omitted or if it is an empty sequence, the effect is the same as setting $flags to a zero-length string. Flags are defined in 6.2 Flags.

If $value is the empty sequence the function behaves as if $value were the zero-length string.

The function returns an element node whose local name is analyze-string-result. This element and all its descendant elements have the namespace URI http://www.w3.org/2005/xpath-functions. The namespace prefix is implementation-dependent. The children of this element are a sequence of fn:match and fn:non-match elements. This sequence is formed by breaking the $value string into a sequence of strings, returning any substring that matches $pattern as the content of an fn:match element, and any intervening substring as the content of an fn:non-match element.

More specifically, the function starts by matching the regular expression against the string, using the supplied $flags, to obtain the disjoint matching segments. For each such segment it constructs an fn:match child, whose string value is the string value of the segment. Before, between, or after these fn:match elements, as required to ensure that the string value of the fn:analyze-string-result element is the same as $value, it inserts fn:non-match elements. The content of an fn:non-match element is always a single (non-empty) text node, and two fn:non-match elements never appear as adjacent siblings.
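The splitting into matching and non-matching segments can be sketched in Python as follows. This is an illustration only: it uses Python's re dialect rather than the one defined by this specification, the function name segment is hypothetical, and the result is a list of (kind, text) pairs rather than an element tree.

```python
import re


def segment(value, pattern, flags=0):
    """Split value into ("match", text) and ("non-match", text) pairs."""
    segments = []
    pos = 0
    for m in re.finditer(pattern, value, flags):
        if m.start() > pos:
            # Text between two matches becomes one non-empty non-match.
            segments.append(("non-match", value[pos:m.start()]))
        segments.append(("match", m.group(0)))
        pos = m.end()
    if pos < len(value):
        segments.append(("non-match", value[pos:]))
    return segments


result = segment("The cat sat on the mat.", r"\w+")
print(result)
# Concatenating the segment texts reproduces the input string.
print("".join(text for _, text in result) == "The cat sat on the mat.")
```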

The captured groups for each disjoint matching segment are represented using fn:group or fn:lookahead-group children of the corresponding fn:match element. Groups captured by a subexpression within a lookahead assertion are referred to as lookahead groups; those not within a lookahead assertion are called ordinary groups.

The content of an fn:match element is in general:

A sequence of text nodes and fn:group element children, whose string-values when concatenated comprise the string value of the matching segment, followed by
A sequence of zero or more fn:lookahead-group elements, representing the lookahead groups

The string value of an fn:match element may be empty.

An fn:group element with a nr attribute having the integer value N identifies the substring captured by an ordinary group, specifically the string value of the Nth captured group. For each ordinary capturing subexpression there will be at most one corresponding fn:group element in each fn:match element in the result.

By contrast, lookahead groups are represented by fn:lookahead-group elements, which (if they appear at all) must follow all text node and fn:group element children of the fn:match element. These groups may overlap the matching and non-matching substrings, and indeed may overlap each other. They must appear in ascending numerical order of group number. The attributes of the fn:lookahead-group element are as follows:

nr: the group number, based on the position of the capturing subexpression that captured the group;
value: the string value of the segment that was captured;
position: the one-based start position of the segment within the input string.
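The three attributes can be illustrated with a small Python sketch. Python's re module does not report which groups lie inside a lookahead assertion, so this sketch assumes the caller supplies their numbers in the hypothetical parameter lookahead_nrs; the function name lookahead_groups is likewise hypothetical, and groups that captured nothing are simply omitted.

```python
import re


def lookahead_groups(value, pattern, lookahead_nrs):
    """For each match, list nr/value/position for the given lookahead groups."""
    result = []
    for m in re.finditer(pattern, value):
        attrs = [
            # position is one-based, so add 1 to Python's zero-based offset.
            {"nr": nr, "value": m.group(nr), "position": m.start(nr) + 1}
            for nr in sorted(lookahead_nrs)      # ascending group number
            if m.group(nr) is not None
        ]
        result.append((m.group(0), attrs))
    return result


print(lookahead_groups("Chapter 5", r"(Chapter|Appendix)(?=\s+([0-9]+))", {2}))
```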

If the function is called twice with the same arguments, it is implementation-dependent whether the two calls return the same element node or distinct (but deep equal) element nodes. In this respect it is nondeterministic with respect to node identity.

The base URI of the element nodes in the result is implementation-dependent.

A schema is defined for the structure of the returned element: see C.1 Schema for the result of fn:analyze-string.

The result of the function will always be such that validation against this schema would succeed. However, it is implementation-defined whether the result is typed or untyped, that is, whether the elements and attributes in the returned tree have type annotations that reflect the result of validating against this schema.

Error Conditions

A dynamic error is raised [err:FORX0002] if the value of $pattern is invalid according to the rules described in section 6.1 Regular expression syntax.

A dynamic error is raised [err:FORX0001] if the value of $flags is invalid according to the rules described in section 6.2 Flags.

Notes

It is recommended that a processor that implements schema awareness should return typed nodes. The concept of “schema awareness”, however, is a matter for host languages to define and is outside the scope of the function library specification.

The declarations and definitions in the schema are not automatically available in the static context of the fn:analyze-string call (or of any other expression). The contents of the static context are host-language defined, and in some host languages are implementation-defined.

The schema defines the outermost element, analyze-string-result, in such a way that mixed content is permitted. In fact the element will only have element nodes (match and non-match) as its children, never text nodes. Although this might have originally been an oversight, defining the analyze-string-result element with mixed="true" allows it to be atomized, which is potentially useful (the atomized value will be the original input string), and the capability has therefore been retained for compatibility with the 3.0 version of this specification.

Examples

In the following examples, the result document is shown in serialized form, with whitespace between the element nodes. This whitespace is not actually present in the result.
Expression:	`analyze-string("The cat sat on the mat.", "\w+")`
Result:	<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions"> <match>The</match> <non-match> </non-match> <match>cat</match> <non-match> </non-match> <match>sat</match> <non-match> </non-match> <match>on</match> <non-match> </non-match> <match>the</match> <non-match> </non-match> <match>mat</match> <non-match>.</non-match> </analyze-string-result> (with whitespace added for legibility)
Expression:	analyze-string("08-12-03", "^(\d+)\-(\d+)\-(\d+)$")
Result:	<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions"> <match> <group nr="1">08</group>-<group nr="2">12</group>-<group nr="3">03</group> </match> </analyze-string-result> (with whitespace added for legibility)
Expression:	analyze-string("A1,C15,,D24, X50,", "([A-Z])([0-9]+)")
Result:	<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions"> <match> <group nr="1">A</group> <group nr="2">1</group> </match> <non-match>,</non-match> <match> <group nr="1">C</group> <group nr="2">15</group> </match> <non-match>,,</non-match> <match> <group nr="1">D</group> <group nr="2">24</group> </match> <non-match>, </non-match> <match> <group nr="1">X</group> <group nr="2">50</group> </match> <non-match>,</non-match> </analyze-string-result> (with whitespace added for legibility)
Expression:	analyze-string("Chapter 5", "(Chapter|Appendix)(?=\s+([0-9]+))")
Result:	<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions"> <match> <group nr="1">Chapter</group> <lookahead-group nr="2" value="5" position="9"/> </match> <non-match> 5</non-match> </analyze-string-result> (with whitespace added for legibility)
Expression:	analyze-string("There we go", "\b(?=(\w+))")
Result:	<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions"> <match><lookahead-group nr="1" value="There" position="1"/></match> <non-match>There </non-match> <match><lookahead-group nr="1" value="we" position="7"/></match> <non-match>we </non-match> <match><lookahead-group nr="1" value="go" position="10"/></match> <non-match>go</non-match> </analyze-string-result> (with whitespace added for legibility)

17 Higher-order functions

17.1 Processing function items

The functions included in this section operate on function items, that is, values referring to a function.

[Definition] Functions that accept functions among their arguments, or that return functions in their result, are described in this specification as higher-order functions.

Note:

Some functions such as fn:parse-json allow the option of supplying a callback function for example to define exception behavior. Where this is not essential to the use of the function, the function has not been classified as higher-order for this purpose; in applications where function items cannot be created, these particular options will not be available.

Function	Meaning
`fn:function-lookup`	Returns a function item having a given name and arity, if there is one.
`fn:function-name`	Returns the name of the function identified by a function item.
`fn:function-arity`	Returns the arity of the function identified by a function item.
`fn:function-identity`	Returns a string representing the identity of a function item.
`fn:function-annotations`	Returns the annotations of the function item.

17.1.5 fn:function-annotations

Changes in 4.0 ⬇ ⬆

Changes the function to return a sequence of key-value pairs rather than a map. [Issue 36 PR 710 17 September 2023]
Changes the function to return a sequence of key-value pairs rather than a map. [Issue 1391 PR 1393 19 August 2024]

Summary

Returns the annotations of the function item.

Signature

`fn:function-annotations`(
`$function`	`as` `fn(*)`
) `as` `map(xs:QName, xs:anyAtomicType*)*`

Properties

This function is deterministic, context-independent, and focus-independent.

Rules

The fn:function-annotations function returns the annotations of $function as a sequence of single-entry maps, each associating the name of a function annotation with the value of the annotation. Note that several annotations on a function can share the same name. The order of the annotations is retained.

The result is a sequence of single-entry maps, each being an instance of map(xs:QName, xs:anyAtomicType*). If a function (for example, a built-in function) has no annotations, the result of the function is an empty sequence.

For each annotation, a map is returned, with a single entry. The key of the map entry is the name of the annotation as an xs:QName. The value of the entry is the value of the annotation as a sequence of atomic items. If the annotation has no values, the associated value is an empty sequence.

Notes

In the common case where the annotation names are all unique, the result of the function can readily be converted into a single map by applying the function map:merge.
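The shape of the result, and the effect of merging it, can be pictured with a Python analogy. In this sketch each single-entry map is a one-entry dict, plain strings stand in for xs:QName keys, and tuples stand in for sequences of atomic items; the function name merge_annotations is hypothetical, and raising an error on a repeated name is a choice made for this sketch, not a statement about how map:merge treats duplicates.

```python
def merge_annotations(annotations):
    """Merge a list of single-entry dicts into one dict, rejecting repeated names."""
    merged = {}
    for entry in annotations:
        (name, values), = entry.items()   # each entry must hold exactly one pair
        if name in merged:
            raise ValueError(f"duplicate annotation name: {name}")
        merged[name] = values
    return merged


annotations = [{"private": ()}, {"deprecated": ("0.1", "0.2")}]
print(merge_annotations(annotations))
```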

Examples

Expression:	function-annotations(true#0)
Result:	()
Expression:	declare %private function local:inc($c) { $c + 1 }; function-annotations(local:inc#1)
Result:	{ QName("http://www.w3.org/2012/xquery", "private"): () }
Expression:	let $old := %local:deprecated('0.1', '0.2') fn() {} let $ann := function-annotations($old) return map:of-pairs($ann)
Result:	{ QName("http://www.w3.org/2005/xquery-local-functions", "deprecated"): ("0.1", "0.2") }

XPath and XQuery Functions and Operators 4.0

W3C Editor's Draft 23 February 2026

Abstract

Status of this Document

Dedication

1 Introduction

1.9 Terminology

1.9.5 Properties of functions

6 Regular expressions

6.1 Regular expression syntax

6.1.1 Processing model for regular expressions

6.1.3 Regular expression grammar

6.1.4 Reluctant quantifiers

6.1.5 Captured groups

6.1.6 Back-references

6.1.7 Unicode block names

6.1.8 Assertions

6.1.8.2 Boundary assertions

6.3 Functions using regular expressions

6.3.2 fn:replace

6.3.4 fn:analyze-string

17 Higher-order functions

17.1 Processing function items

17.1.5 fn:function-annotations

D Glossary (Non-Normative)
