
This draft contains only sections that have differences from the version that it modified.

W3C

XML Path Language (XPath) 4.0 WG Review Draft

W3C Editor's Draft 23 February 2026

This version:
https://qt4cg.org/specifications/xpath-40/
Most recent version of XPath:
https://qt4cg.org/specifications/xpath-40/
Most recent Recommendation of XPath:
https://www.w3.org/TR/2017/REC-xpath-31-20170321/
Editor:
Michael Kay, Saxonica <mike@saxonica.com>

Please check the errata for any errors or issues reported since publication.

See also translations.

This document is also available in these non-normative formats: XML.


Abstract

XPath 4.0 is an expression language that allows the processing of values conforming to the data model defined in [XQuery and XPath Data Model (XDM) 4.0]. The name of the language derives from its most distinctive feature, the path expression, which provides a means of hierarchic addressing of the nodes in an XML tree. As well as modeling the tree structure of XML, the data model also includes atomic items, function items, maps, arrays, and sequences. This version of XPath supports JSON as well as XML, and adds many new functions in [XQuery and XPath Functions and Operators 4.0].

XPath 4.0 is a superset of XPath 3.1. A detailed list of changes made since XPath 3.1 can be found in I Change Log.

Status of this Document

This is a draft prepared by the QT4CG (officially registered in W3C as the XSLT Extensions Community Group). Comments are invited.

Dedication

The publications of this community group are dedicated to our co-chair, Michael Sperberg-McQueen (1954–2024).

Michael was central to the development of XML and many related technologies. He brought a polymathic breadth of knowledge and experience to everything he did. This, combined with his indefatigable curiosity and appetite for learning, made him an invaluable contributor to our project, along with many others. We have lost a brilliant thinker, a patient teacher, and a loyal friend.


A XPath 4.0 Grammar

A.3 Lexical structure

Changes in 4.0  

  1. The rules for tokenization have been largely rewritten. In some cases the revised specification may affect edge cases that were handled in different ways by different 3.1 processors, which could lead to incompatible behavior.   [Issue 327 PR 519 30 May 2023]

This section describes how an XPath 4.0 text is tokenized prior to parsing.

All keywords are case sensitive. Keywords are not reserved—that is, any lexical QName may duplicate a keyword except as noted in A.4 Reserved Function Names.

Tokenization of an input string follows these rules:

  • [Definition: An ordinary production rule is a production rule in A.1 EBNF that is not annotated ws:explicit.]

  • [Definition: A literal terminal is a token appearing as a string in quotation marks on the right-hand side of an ordinary production rule.]

    Note:

    Strings that appear in other production rules do not qualify. For example, BracedURILiteral does not qualify because it appears only in URIQualifiedName, and "0x" does not qualify because it appears only in HexIntegerLiteral.

    The literal terminals in XPath 4.0 are: ! != # $ ( ) * + , . .. / // : :: := < << <= = =!> => =?> > >= >> ? ?? ?[ @ [ ] { | || } × ÷ - -> ancestor ancestor-or-self and array as at attribute cast castable child comment descendant descendant-or-self div document-node element else empty-sequence enum eq every except fn following following-or-self following-sibling following-sibling-or-self for function ge gt idiv if in instance intersect is item items key keys le let lt map member mod namespace namespace-node ne node of or otherwise pairs parent preceding preceding-or-self preceding-sibling-or-self processing-instruction record return satisfies schema-attribute schema-element self some text then to treat union value values

  • [Definition: A variable terminal is an instance of a production rule that is not itself an ordinary production rule but that is named (directly) on the right-hand side of an ordinary production rule.]

    The variable terminals in XPath 4.0 are: BinaryIntegerLiteral, DecimalLiteral, DoubleLiteral, HexIntegerLiteral, IntegerLiteral, NCName, QName, StringLiteral, StringTemplate, URIQualifiedName, Wildcard

  • [Definition: A complex terminal is a variable terminal whose production rule references, directly or indirectly, an ordinary production rule.]

    The complex terminals in XPath 4.0 are: StringTemplate

    Note:

    The significance of complex terminals is that at one level, a complex terminal is treated as a single token, but internally it may contain arbitrary expressions that must be parsed using the full EBNF grammar.

  • Tokenization is the process of splitting the supplied input string into a sequence of terminals, where each terminal is either a literal terminal or a variable terminal (which may itself be a complex terminal). Tokenization is done by repeating the following steps:

    1. Starting at the current position, skip any whitespace and comments.

    2. If the current position is not the end of the input, then return the longest literal terminal or variable terminal that can be matched starting at the current position, regardless of whether this terminal is valid at this point in the grammar. If no such terminal can be identified starting at the current position, or if the terminal that is identified is not a valid continuation of the grammar rules, then a syntax error is reported.

      Note:

      Here are some examples showing the effect of the longest token rule:

      • The expression map{a:b} is a syntax error. Although there is a tokenization of this string that satisfies the grammar (by treating a and b as separate expressions), this tokenization does not satisfy the longest token rule, which requires that a:b is interpreted as a single QName.

      • The expression 10 div3 is a syntax error. The longest token rule requires that this be interpreted as two tokens ("10" and "div3") even though it would be a valid expression if treated as three tokens ("10", "div", and "3").

      • The expression $x-$y is a syntax error. Because a hyphen is permitted within a name, this is interpreted as four tokens ("$", "x-", "$", and "y").

      Note:

      The lexical production rules for variable terminals have been designed so that there is minimal need for backtracking. For example, if the next terminal starts with "0x", then it can only be either a HexIntegerLiteral or an error; if it starts with "`" (and not with "```") then it can only be a StringTemplate or an error.

      This convention, together with the rules for whitespace separation of tokens (see A.3.2 Terminal Delimitation), means that the longest-token rule does not normally result in any need for backtracking. For example, suppose that a variable terminal has been identified as a StringTemplate by examining its first few characters. If the construct turns out not to be a valid StringTemplate, an error can be reported without first considering whether there is some shorter token that might be returned instead.

  • Tokenization unambiguously identifies the boundaries of the terminals in the input, and this can be achieved without backtracking or lookahead. However, tokenization does not unambiguously classify each terminal. For example, it might identify the string "div" as a terminal, but it does not resolve whether this is the operator symbol div, or an NCName or QName used as a node test or as a variable or function name. Classification of terminals generally requires information about the grammatical context, and in some cases requires lookahead.

    Note:

    Operationally, classification of terminals may be done either in the tokenizer or the parser, or in some combination of the two. For example, according to the EBNF, the expression "parent::x" is made up of three tokens, "parent", "::", and "x". The name "parent" can be classified as an axis name as soon as the following token "::" is recognized, and this might be done either in the tokenizer or in the parser. (Note that whitespace and comments are allowed both before and after "::".)

  • In the case of a complex terminal, identifying the end of the complex terminal typically involves invoking the parser to process any embedded expressions. Tokenization, as described here, is therefore a recursive process, although other implementations are possible.
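
The longest-token rule described in the steps above can be illustrated with a minimal Python sketch. It is not a conforming tokenizer: the names NCNAME, VARIABLE_TERMINALS, SYMBOLS, and tokenize are hypothetical, the name and number patterns are ASCII-only simplifications, only a subset of the literal terminals is listed, and comments, string literals, and complex terminals are omitted. Keywords such as div are matched by the name pattern, since tokenization identifies boundaries without classifying terminals.

```python
import re

# Simplified, ASCII-only stand-ins for a few variable terminals (assumptions;
# the real NCName and numeric productions accept far more than this).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
VARIABLE_TERMINALS = [
    re.compile(NCNAME + ":" + NCNAME),      # QName with a prefix
    re.compile(NCNAME),                     # NCName; also matches keywords
    re.compile(r"[0-9]+(?:\.[0-9]*)?"),     # integer or decimal, simplified
]

# A subset of the symbolic literal terminals, for illustration only.
SYMBOLS = ["!", "!=", "$", "(", ")", "*", "+", ",", "-", "->", "/", "//",
           ":", "::", ":=", "<", "<=", "=", "=>", ">", ">=", "[", "]",
           "{", "}", "|", "||"]


def tokenize(text):
    """Split text into terminals using context-free longest match."""
    tokens = []
    pos = 0
    while True:
        # Step 1: skip whitespace (comments are not handled in this sketch).
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return tokens
        # Step 2: gather every terminal that matches here, keep the longest.
        candidates = [s for s in SYMBOLS if text.startswith(s, pos)]
        for pattern in VARIABLE_TERMINALS:
            match = pattern.match(text, pos)
            if match:
                candidates.append(match.group(0))
        if not candidates:
            raise SyntaxError(f"no terminal matches at offset {pos}")
        token = max(candidates, key=len)
        tokens.append(token)
        pos += len(token)


print(tokenize("map{a:b}"))   # ['map', '{', 'a:b', '}']
print(tokenize("10 div3"))    # ['10', 'div3']
print(tokenize("$x-$y"))      # ['$', 'x-', '$', 'y']
```

In each case the token sequence is produced without consulting the grammar. Rejecting the sequence as a syntax error is left to the parser, which this sketch does not include.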

Note:

Previous versions of this specification included the statement: "When tokenizing, the longest possible match that is consistent with the EBNF is used."

Different processors are known to have interpreted this in different ways. One interpretation, for example, was that the expression 10 div-3 should be split into four tokens (10, div, -, 3) on the grounds that any other tokenization would give a result that was inconsistent with the EBNF grammar. Other processors report a syntax error on this example.

This rule has therefore been rewritten in version 4.0. Tokenization is now entirely insensitive to the grammatical context; div-3 is recognized as a single token even though this results in a syntax error. For some implementations this may mean that expressions that were accepted in earlier releases are no longer accepted in 4.0.
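
As a rough illustration of why div-3 is now a single token, the following Python sketch matches a name at a given position without regard to context. The pattern NCNAME and the function longest_name_at are hypothetical, and the pattern is an ASCII-only simplification of the real NCName production.

```python
import re

# Simplified ASCII-only stand-in for the NCName production (an assumption).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def longest_name_at(text, pos=0):
    """Return the longest name-like terminal starting at pos, or None."""
    match = NCNAME.match(text, pos)
    return match.group(0) if match else None


print(longest_name_at("div-3"))    # div-3
print(longest_name_at("div -3"))   # div
```

Because the hyphen and the digit are both valid name characters, the match runs through to the end of div-3. Placing whitespace after div, as in 10 div -3, ends the name at that point, so the expression tokenizes as 10, div, -, and 3.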

A.3.3 Less-Than and Greater-Than Characters

The operator symbols <, <=, >, >=, <<, >>, =>, ->, =!>, and =?> have alternative representations using the characters U+FF1C (FULL-WIDTH LESS-THAN SIGN, ＜) and U+FF1E (FULL-WIDTH GREATER-THAN SIGN, ＞) in place of U+003C (LESS-THAN SIGN, <) and U+003E (GREATER-THAN SIGN, >). The alternative tokens are respectively ＜, ＜=, ＞, ＞=, ＜＜, ＞＞, =＞, -＞, =!＞, and =?＞. In order to avoid visual confusion these alternatives are not shown explicitly in the grammar.

This option is provided to improve the readability of XPath expressions embedded in XML-based host languages such as XSLT; it enables these operators to be depicted using characters that do not require escaping as XML entities or character references.
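
One way an implementation might handle the alternative representations is to map each operator token to its ASCII form after tokenization. The Python sketch below is illustrative only: the names FULLWIDTH_TO_ASCII, OPERATORS, and canonical_operator are hypothetical, and the sketch assumes that the tokenizer has already isolated the operator token, so that full-width characters inside string literals are never touched.

```python
# Translate the two full-width characters to their ASCII counterparts.
FULLWIDTH_TO_ASCII = str.maketrans({"\uFF1C": "<", "\uFF1E": ">"})

# The operator symbols that have alternative representations.
OPERATORS = {"<", "<=", ">", ">=", "<<", ">>", "=>", "->", "=!>", "=?>"}


def canonical_operator(token):
    """Return the ASCII spelling if token is one of the operators above,
    written with either ASCII or full-width characters; otherwise return
    the token unchanged.

    Assumption: this sketch does not check whether mixing ASCII and
    full-width characters within one token is permitted.
    """
    candidate = token.translate(FULLWIDTH_TO_ASCII)
    return candidate if candidate in OPERATORS else token


print(canonical_operator("\uFF1C="))    # <=
print(canonical_operator("=!\uFF1E"))   # =!>
print(canonical_operator("div"))        # div
```

Normalizing at the token level, rather than rewriting the whole input string, keeps the change confined to operator symbols.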