ASF: XSLTC Predicate Handling

	XSLTC Design
XSLTC Predicate Handling

XSLTC Predicate Handling

Definition

According to Michael Kay's "XSLT Programmer's Reference" page 736, a predicate is "An expression used to filter which nodes are selected by a particular step in a path expression, or to select a subset of the nodes in a node-set. A Boolean expression selects the nodes for which the predicate is true; a numeric expression selects the node at the position given by the value of the expression, for example '[1]' selects the first node.". Note that a predicate containing a boolean expression can return zero, one or more nodes, while a predicate containing a numeric expression can return only zero or one node.

(top)

Examples

I'll list a few examples that I can refer back to later on in this document. All examples will use this XML document:

    <?xml version="1.0"?>
    <doc>
      <foo location="Drumcondra">
        <bar name="Cat and Cage"/>
        <bar name="Fagan's"/>
        <bar name="Gravedigger's"/>
        <bar name="Ivy House"/>
      <foo>
      <foo location="Town">
        <bar name="Peter's Pub"/>
        <bar name="Grogan's"/>
        <bar name="Hogans's"/>
        <bar name="Brogan's"/>
      </foo>
    </doc>

Here are some examples of a predicate with boolean expressions:

    <xsl:for-each select="//bar[contains(@name,'ogan')]">
    <xsl:for-each select="//bar[parent::*/@location = 'Drumcondra']">
    <xsl:for-each select="//bar[@name = 'Cat and Cage']">

The first two select more than one node, while the last selects only one. The last expression could select more nodes if the input document was different. Now, here are a few examples of predicates with numeric expressions:

    <xsl:value-of select="//bar[1]">
    <xsl:value-of select="/doc/foo[2]/bar[1]">
    <xsl:value-of select="/doc/foo[2]/bar">

The last expression will return more than one node, but the step that contains the predicate returns only one (the second <foo> element).

The above are the basic types of predicates. These can be grouped to create a predicate pipeline, where the first predicate reduces the node-set that the second predicate filters, and so on. Here are some examples:

    A: <for-each select="//bar[contains(@name,'ogan')][2]">
    C: <for-each select="//bar[2][contains(@name,'ogan')]">
    B: <for-each select="//bar[position() > 3][2]">

It is easier to figure out which nodes these expressions should return if one goes through the steps and predicates one by one. In expression A: we first get all <bar> elements from the whole document. Then the first predicate selects from that node-set only those elements that have a @name attribute that contains "ogan", and we're left with these elements:

        <bar name="Grogan's">
        <bar name="Hogans's">
        <bar name="Brogan's">

And finally, the last predicate then selects the second of those elements:

        <bar name="Hogans's">

Expression B: contains the same predicates as A:, but the resulting node set if completely different. We start off with the same set of <bar> elements, but we apply the "[2]" predicate first, and end up with this element:

        <bar name="Fagan's">

Fagan's is the bar where the Irish Taoiseach (prime minister) drinks his pints, but its name does not contain the string "ogan", so the resulting node-set is empty.

The third expressions also starts off with all <bar> elements, applies the predicate "[position() > 3]", and reduces the node set to these:

        <bar name="Ivy House">
        <bar name="Peter's Pub">
        <bar name="Grogan's">
        <bar name="Hogans's">
        <bar name="Brogan's">

The last predicate "[2]" is applied to this node-set and set is further reduced to:

        <bar name="Peter's Pub">

(top)

Categories

From the examples in the last chapter we can try to categorize predicate chains/pipelines to simplify our implementation. We can speed up processing significantly if we can avoid using a data-structure (iterator) to represent the intermediate step between predicates. The goal of setting up these categories is to pinpoint those cases where an intermediate iterator has to be used and when it can be avoided.

(top)

Single predicate expressions

Expressions containing just a single predicate have no intermediate step and there is no need for any extra iterator. The expression inside the predicate can be applied directly to the original iterator. We call this category SIMPLE_CONTEXT.

(top)

Expressions containing only non-position predicates

Predicate-order is significant when the predicate-chain contains one or more predicate with an expression similar to "position() > 3" or "2". This is because the position() and last() explicitly refer to the intermediate step between applying each predicate. The expression:

    <xsl:for-each select="//bar[contains(@name,'ogan')][parent::*/@location = 'Town']">

has two predicates that can be applied in any order and still produce the desired node-set. Such predicates can be merged to:

    <xsl:for-each select="//bar[contains(@name,'ogan') & (parent::*/@location = 'Town')]">

We call this category NO_CONTEXT.

(top)

Expressions containing position predicates

A predicate-chain, whose predicates' expressions contain any use of the position() or last() functions require some way of representing the intermediate step in an iterator. The first predicate is applied to the original node-set, and the resulting node-set must then be stored in some other iterator, from which the second predicate can get the current position from the iterator's getPosition() and getLast() methods. We call this category GENERAL_CONTEXT

‌

(top)

Expressions containing one position predicate

There is one expection from the GENERAL_CONTEXT category. If the predicate-chain contains only one position-predicate, and that predicate is the very first one, then that predicate can call the iterator that contains the first node-set directly. Just look:

    <xsl:for-each select="//bar[2][parent::*/@location = 'Drumcondra']">

The [2] predicate can be applied to the original iterator for the //bar step. And so can the [parent::*/@location = 'Drumcondra'] predicate as well. This is only the case when the position predicate is first in the predicate chain. These types of predicate chains belong in the NO_CONTEXT category.

(top)

Design details

Predicates are handled quite differently in step expressions and step patterns. Step expressions are not implemented with the various contexts in mind and use a specialised iterator to wrap the code for each predicate. Step patterns are more complicated and CPU (or should I say JVM?) exhaustive. Step patterns containing predicates are analysed to determine context type and compiled accordingly.

(top)

Predicates and Step expressions

The basic behaviour for a predicate is to compile a filter. This filter is an auxiliary class that implements the org.apache.xalan.xsltc.dom.CurrentNodeListFilter interface. The Step or StepPattern that uses the predicate will create a org.apache.xalan.xsltc.dom.CurrentNodeListFilter. This iterator contains the nodes that pass through the predicate. The compiled filter is used by the iterator to determine which nodes that should be included. The org.apache.xalan.xsltc.dom.CurrentNodeListFilter interface contains only a single method:

    public interface CurrentNodeListFilter {
    public abstract boolean test(int node, int position, int last, int current,
                                 AbstractTranslet translet, NodeIterator iter);
    }

The code that is compiled into the test() method is the code for the predicate's expression. The Predicate class compiles the filter class and a test() method skeleton, while some sub-class of the Expression class compiles the actual code that goes into this method.

The iterator is initialised with a filter that implements this interface:

    public CurrentNodeListIterator(NodeIterator source, 
				   CurrentNodeListFilter filter,
				   int currentNode,
				   AbstractTranslet translet) {
	this(source, !source.isReverse(), filter, currentNode, translet);
    }

The iterator will use its source iterator to provide it with the initial node-set. Each node that is returned from this set is passed through the filter before returned by the next() method. Note that the source iterator can also be a current node-list iterator (if two or more predicates are chained together).

(top)

Optimisations in Step expressions

Node-value iterators

Some simple predicates that test for node values are handled by the NodeValueIterator class at runtime. These are:

    A: foo[@attr = <value>]
    B: foo[bar = <value>]
    C: foo/bar[. = <value>]

The first case is handled by creating an iterator that represents foo/@attr, then passing this iterator and a test-value to a NodeValueIterator. The <value> is an expression that is compiled and passed to the iterator as a string. It does not have to be a literal string as the string value is found at runtime. The last two cases are similarly handled by creating an iterator for foo/bar and passing that and the test-value to a NodeValueIterator.

Nth descendant iterators

The Step class is also optimised for position-predicates that are applied to descendant iterators:

    <xsl:for-each select="//bar[3]">

Such step/predicate combinations are handled by the internal DOM's inner class NthDescendantIterator.

Nth position iterators

Similarly, the Step class is optimised for position-predicates that are applied to basic steps:

    <xsl:for-each select="bar[3]">

Such step/predicate combinations are handled by the internal DOM's inner class NthPositionIterator.

Node test

The predicate class contains a method that tells you if it is a boolean test:

    public boolean isBooleanTest();

This can be, but it currently is not, used by the Step class to compile in optimised code. Some work to be done here!

‌

(top)

Predicates and StepPatterns

Using predicates in patterns is slow on any XSLT processor, and XSLTC is no exception. This is why the predicate context is carefully analysed by the StepPattern class, so that the compiled code is specialised to handle the specific predicate(s) in use. First we should consider the basic step pattern.

Basic pattern handling

All patterns are grouped (by the Mode class) according to their kernel -node type. The kernel node-type is node-type of the last step in a pattern:

    <xsl:template match="foo/bar/baz">  ...  <xsl:template>

In this case the type for elements <baz> is the kernel type. This step is not compiled as a step pattern. The node type is passed to the Mode class and is used to place the remainder of the pattern code inside the big switch() statement in the translet's applyTemplates() method. The whole pattern is then reduced to:

    match="foo/bar"

The StepPattern representing the <bar> element test is compiled under the appropriate case: section of the switch() statement. The code compiled for the step pattern is basically just a call to the DOM's getType() method and a test for the desired node type. There are two special cases for:

    <xsl:template match="foo/*/baz">  ...  <xsl:template>
    <xsl:template match="foo/*@[2]">  ...  <xsl:template>

In the first case we call isElement() and in the second case we call isAttribute(), instead of getType().

Patterns with predicates

The typeCheck() method of the StepPattern invokes a method that analyses the predicates of the step pattern and determines their context. (Note that this method needs to be updated to handle the exception to the GENERAL_CONTEXT metioned in the section Expressions containing one position predicate earlier in this document.) The translate() method of the StepPattern class contains a switch() statement that calls methods that are tailored for compiling code for the various predicate contexts.

(top)