XML Path Language (XPath) 4.0 WG Review Draft

A XPath 4.0 Grammar

A.3 Lexical structure

Changes in 4.0

The rules for tokenization have been largely rewritten. In some cases the revised specification may affect edge cases that were handled in different ways by different 3.1 processors, which could lead to incompatible behavior. [Issue 327 PR 519 30 May 2023]

This section describes how an XPath 4.0 text is tokenized prior to parsing.

All keywords are case sensitive. Keywords are not reserved—that is, any lexical QName may duplicate a keyword except as noted in A.4 Reserved Function Names.

Tokenization of an input string follows these rules:

[Definition: An ordinary production rule is a production rule in A.1 EBNF that is not annotated ws:explicit.]
[Definition: A literal terminal is a token appearing as a string in quotation marks on the right-hand side of an ordinary production rule.]
Note:
Strings that appear in other production rules do not qualify. For example, BracedURILiteral does not qualify because it appears only in URIQualifiedName, and "0x" does not qualify because it appears only in HexIntegerLiteral.
The literal terminals in XPath 4.0 are: ! != # $ ( ) * + , . .. / // : :: := < << <= = =!> => =?> > >= >> ? ?? ?[ @ [ ] { | || } × ÷ - -> ancestor ancestor-or-self and array as at attribute cast castable child comment descendant descendant-or-self div document-node element else empty-sequence enum eq every except fn following following-or-self following-sibling following-sibling-or-self for function ge gt idiv if in instance intersect is item items key keys le let lt map member mod namespace namespace-node ne node of or otherwise pairs parent preceding preceding-or-self preceding-sibling-or-self processing-instruction record return satisfies schema-attribute schema-element self some text then to treat union value values
[Definition: A variable terminal is an instance of a production rule that is not itself an ordinary production rule but that is named (directly) on the right-hand side of an ordinary production rule.]
The variable terminals in XPath 4.0 are: BinaryIntegerLiteral DecimalLiteral DoubleLiteral HexIntegerLiteral IntegerLiteral NCName QName StringLiteral StringTemplate URIQualifiedName Wildcard
[Definition: A complex terminal is a variable terminal whose production rule references, directly or indirectly, an ordinary production rule.]
The complex terminals in XPath 4.0 are: StringTemplate
Note:
The significance of complex terminals is that at one level, a complex terminal is treated as a single token, but internally it may contain arbitrary expressions that must be parsed using the full EBNF grammar.
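The three definitions above can be read as set computations over a grammar. The following Python sketch applies them to a small toy grammar. The `Rule` class, the `RULES` table, and the rule bodies are assumptions made for illustration and are not the A.1 EBNF; in particular, the sketch models "not an ordinary production rule" with a single `ws_explicit` flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    rhs: tuple                 # items are ("lit", text) or ("ref", rule_name)
    ws_explicit: bool = False  # True models the ws:explicit annotation


# Toy grammar for illustration only (hypothetical, not the real EBNF).
RULES = {r.name: r for r in [
    Rule("Expr", (("ref", "Literal"), ("lit", "+"), ("ref", "Literal"))),
    Rule("Literal", (("ref", "HexIntegerLiteral"), ("ref", "StringTemplate"))),
    Rule("HexIntegerLiteral", (("lit", "0x"), ("ref", "HexDigits")), ws_explicit=True),
    Rule("HexDigits", (), ws_explicit=True),
    Rule("StringTemplate", (("lit", "`"), ("ref", "Expr"), ("lit", "`")), ws_explicit=True),
]}


def is_ordinary(rule):
    return not rule.ws_explicit


def literal_terminals(rules):
    # Quoted strings on the right-hand side of ordinary rules only.
    return {text for r in rules.values() if is_ordinary(r)
            for kind, text in r.rhs if kind == "lit"}


def variable_terminals(rules):
    # Non-ordinary rules named directly by an ordinary rule.
    return {name for r in rules.values() if is_ordinary(r)
            for kind, name in r.rhs
            if kind == "ref" and not is_ordinary(rules[name])}


def reaches_ordinary(name, rules, seen=None):
    # True if the rule references an ordinary rule, directly or indirectly.
    seen = set() if seen is None else seen
    for kind, target in rules[name].rhs:
        if kind != "ref" or target in seen:
            continue
        seen.add(target)
        if is_ordinary(rules[target]) or reaches_ordinary(target, rules, seen):
            return True
    return False


def complex_terminals(rules):
    return {n for n in variable_terminals(rules) if reaches_ordinary(n, rules)}
```

In this toy grammar, "0x" and the backtick are not literal terminals because they appear only inside non-ordinary rules, and StringTemplate is the only complex terminal because it refers back to Expr.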
Tokenization is the process of splitting the supplied input string into a sequence of terminals, where each terminal is either a literal terminal or a variable terminal (which may itself be a complex terminal). Tokenization is done by repeating the following steps:
1. Starting at the current position, skip any whitespace and comments.
2. If the current position is not the end of the input, then return the longest literal terminal or variable terminal that can be matched starting at the current position, regardless of whether this terminal is valid at this point in the grammar. If no such terminal can be identified starting at the current position, or if the terminal that is identified is not a valid continuation of the grammar rules, then a syntax error is reported.
  Note:
  Here are some examples showing the effect of the longest token rule:
  - The expression map{a:b} is a syntax error. Although there is a tokenization of this string that satisfies the grammar (by treating a and b as separate expressions), this tokenization does not satisfy the longest token rule, which requires that a:b is interpreted as a single QName.
  - The expression 10 div3 is a syntax error. The longest token rule requires that this be interpreted as two tokens ("10" and "div3") even though it would be a valid expression if treated as three tokens ("10", "div", and "3").
  - The expression $x-$y is a syntax error. This is interpreted as four tokens ("$", "x-", "$", and "y"), because the hyphen is a valid name character and therefore extends the name x.
  Note:
  The lexical production rules for variable terminals have been designed so that there is minimal need for backtracking. For example, if the next terminal starts with "0x", then it can only be either a HexIntegerLiteral or an error; if it starts with "`" (and not with "```") then it can only be a StringTemplate or an error.
  This convention, together with the rules for whitespace separation of tokens (see A.3.2 Terminal Delimitation), means that the longest-token rule does not normally result in any need for backtracking. For example, suppose that a variable terminal has been identified as a StringTemplate by examining its first few characters. If the construct turns out not to be a valid StringTemplate, an error can be reported without first considering whether there is some shorter token that might be returned instead.
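The two tokenization steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a conforming tokenizer: `LITERALS` is a small subset of the literal terminals, the name and number patterns are simplified ASCII-only stand-ins for NCName, QName, and IntegerLiteral, and the check that a token is a valid continuation of the grammar is left out. Keywords such as div are matched by the name pattern, since tokenization only fixes boundaries.

```python
import re

# Illustrative subset of the literal terminals (assumption: not the full list).
LITERALS = ["{", "}", "(", ")", "$", ":", "::", ":=", "-", "->", "+", "*", ",", "/", "//"]

# Simplified ASCII-only stand-ins for the variable terminals (assumption).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
VARIABLE = [
    re.compile(rf"{NCNAME}:{NCNAME}"),  # prefixed QName
    re.compile(NCNAME),                 # NCName (also covers keywords)
    re.compile(r"[0-9]+"),              # IntegerLiteral
]


def skip_ws_and_comments(s, pos):
    # Step 1: skip whitespace and (: ... :) comments, which may nest.
    while pos < len(s):
        if s[pos] in " \t\r\n":
            pos += 1
        elif s.startswith("(:", pos):
            depth, pos = 1, pos + 2
            while depth and pos < len(s):
                if s.startswith("(:", pos):
                    depth, pos = depth + 1, pos + 2
                elif s.startswith(":)", pos):
                    depth, pos = depth - 1, pos + 2
                else:
                    pos += 1
            if depth:
                raise SyntaxError("unterminated comment")
        else:
            break
    return pos


def tokenize(s):
    tokens, pos = [], 0
    while True:
        pos = skip_ws_and_comments(s, pos)
        if pos == len(s):
            return tokens
        # Step 2: collect every terminal that matches here, keep the longest.
        candidates = [lit for lit in LITERALS if s.startswith(lit, pos)]
        for pattern in VARIABLE:
            m = pattern.match(s, pos)
            if m:
                candidates.append(m.group())
        if not candidates:
            raise SyntaxError(f"no terminal at offset {pos}")
        longest = max(candidates, key=len)
        tokens.append(longest)
        pos += len(longest)


# tokenize("map{a:b}") returns ["map", "{", "a:b", "}"]
# tokenize("10 div3")  returns ["10", "div3"]
# tokenize("$x-$y")    returns ["$", "x-", "$", "y"]
```

Each of the three token sequences is then rejected by the parser, which is where the syntax errors in the examples above would be reported in this arrangement.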
Tokenization unambiguously identifies the boundaries of the terminals in the input, and this can be achieved without backtracking or lookahead. However, tokenization does not unambiguously classify each terminal. For example, it might identify the string "div" as a terminal, but it does not resolve whether this is the operator symbol div, or an NCName or QName used as a node test or as a variable or function name. Classification of terminals generally requires information about the grammatical context, and in some cases requires lookahead.
Note:
Operationally, classification of terminals may be done either in the tokenizer or the parser, or in some combination of the two. For example, according to the EBNF, the expression "parent::x" is made up of three tokens, "parent", "::", and "x". The name "parent" can be classified as an axis name as soon as the following token "::" is recognized, and this might be done either in the tokenizer or in the parser. (Note that whitespace and comments are allowed both before and after "::".)
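As a rough illustration of classification by context, the Python sketch below inspects the neighbours of a name token. The function name `classify`, the category labels, and the `AXIS_NAMES` subset are hypothetical, and the two rules shown are only a small part of what a real processor needs.

```python
# Illustrative subset of axis names (assumption: not the complete set).
AXIS_NAMES = {"ancestor", "attribute", "child", "descendant", "parent", "self"}


def classify(tokens, i):
    """Classify the name at tokens[i] using one token of context on each side."""
    tok = tokens[i]
    prev = tokens[i - 1] if i > 0 else None
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    if prev == "$":
        return "variable-name"
    if nxt == "::" and tok in AXIS_NAMES:
        return "axis"
    # Further context (operator position, function call, node test) is not modelled.
    return "unresolved"


# classify(["parent", "::", "x"], 0) returns "axis"
# classify(["$", "parent"], 1)       returns "variable-name"
```

Because whitespace and comments have already been skipped by the time tokens are produced, the check on the following token works the same way for "parent::x" and "parent :: x".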
In the case of a complex terminal, identifying the end of the complex terminal typically involves invoking the parser to process any embedded expressions. Tokenization, as described here, is therefore a recursive process, although other implementations are possible.

Note:

Previous versions of this specification included the statement: When tokenizing, the longest possible match that is consistent with the EBNF is used.

Different processors are known to have interpreted this in different ways. One interpretation, for example, was that the expression 10 div-3 should be split into four tokens (10, div, -, 3) on the grounds that any other tokenization would give a result that was inconsistent with the EBNF grammar. Other processors report a syntax error on this example.

This rule has therefore been rewritten in version 4.0. Tokenization is now entirely insensitive to the grammatical context; div-3 is recognized as a single token even though this results in a syntax error. For some implementations this may mean that expressions that were accepted in earlier releases are no longer accepted in 4.0.
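The effect on the div-3 example can be seen with a context-free name match. The Python sketch below uses a simplified ASCII-only pattern as a stand-in for the real NCName rule (an assumption), and the helper `leading_name` is hypothetical.

```python
import re

# Simplified ASCII-only name pattern (assumption: real NCName allows more characters).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def leading_name(text):
    """Return the longest name at the start of text, ignoring grammatical context."""
    m = NAME.match(text)
    return m.group() if m else None


# leading_name("div-3")  returns "div-3"  (one token, so 10 div-3 fails to parse)
# leading_name("div -3") returns "div"    (whitespace ends the name)
```

Writing the expression as 10 div -3 keeps it valid under the 4.0 rule, since the space terminates the name before the minus sign.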

A.3.3 Less-Than and Greater-Than Characters

The operator symbols <, <=, >, >=, <<, >>, =>, ->, =!>, and =?> have alternative representations using the characters U+FF1C (FULL-WIDTH LESS-THAN SIGN, ＜) and U+FF1E (FULL-WIDTH GREATER-THAN SIGN, ＞) in place of U+003C (LESS-THAN SIGN, <) and U+003E (GREATER-THAN SIGN, >). The alternative tokens are respectively ＜, ＜=, ＞, ＞=, ＜＜, ＞＞, =＞, -＞, =!＞, and =?＞. To avoid visual confusion, these alternatives are not shown explicitly in the grammar.

This option is provided to improve the readability of XPath expressions embedded in XML-based host languages such as XSLT; it enables these operators to be depicted using characters that do not require escaping as XML entities or character references.
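One way a tokenizer might handle these alternatives is to map each alternative token to its usual spelling once the token has been recognized. The Python sketch below is an assumption about implementation strategy, not something the specification prescribes; the names `CANONICAL`, `ALTERNATIVES`, and `canonical_operator` are hypothetical. The mapping is applied per token rather than to the whole input, so full-width characters inside string literals are left untouched.

```python
# Translation table from the ASCII signs to their full-width counterparts.
FULLWIDTH = str.maketrans({"<": "\uFF1C", ">": "\uFF1E"})

# The ten operator symbols listed above.
CANONICAL = ["<", "<=", ">", ">=", "<<", ">>", "=>", "->", "=!>", "=?>"]

# Alternative token -> usual token. Mixed spellings are not listed in the
# text above and are therefore left unmapped in this sketch.
ALTERNATIVES = {op.translate(FULLWIDTH): op for op in CANONICAL}


def canonical_operator(token):
    """Return the usual spelling of an operator token, or the token unchanged."""
    return ALTERNATIVES.get(token, token)


# canonical_operator("\uFF1C=") returns "<="
# canonical_operator("eq")      returns "eq"
```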


W3C Editor's Draft 23 February 2026
