Efficient XPath Queries

This is something I get asked about quite a bit as I had a misspent youth with XSLT…


One of my pet hates is people always using the search identifier ‘//’ in XPath queries.  It’s the SELECT * of XML, and shouldn’t be used unless you actually want to ‘search’ your document.


If you’re performing SQL you’d SELECT fields explicitly rather than SELECT * wouldn’t you? 🙂


because:



  1. If the schema changes (new fields inserted) then your existing code has less chance of breaking
  2. It performs better less server pre-processing and catalog lookups
  3. More declarative and the code is easy to read and maintain

With XML (and the standard DOM-style parsers) you’re working on a document tree, and accessing nodes loaded into that tree. 


Consider the following XML fragment as an example:


Say your car dealership sells new cars and current prices are serialised in an xml document:


<xml>
 <Sales>
  <Cars>
   <Car Make=”Ford” Model=”Territory” />
   <Car Make=”Ford” Model=”Focus” />
  </Cars>
 </Sales>
</xml>


In order to get all cars you can easily use the following XPath: ‘//Car’.  This searches from the root to find all Car elements (and finds 2).


A more efficient way would be ‘/*/*/Car’ as we know Cars only exist at the 3rd level in the document


A yet more efficient way would be ‘/Sales/Cars/Car’ as this makes a direct hit on the depth and parent elements in the document.


You can also mix and match with ‘/*//Car’ to directly access the parts of the DOM you’re certain of and search on the parts you’re not.


Now lets say you go into the used car business and refactor your XML format as follows:


<xml>
 <Sales Type=”New”>
  <Cars>
   <Car Make=”Ford” Model=”Territory” />
   <Car Make=”Ford” Model=”Focus” />
  </Cars>
 </Sales>
 <Sales Type=”Used”>
  <Cars>
   <Car Make=”Honda” Model=”Civic” />
   <Car Make=”Mazda” Model=”3″ />
  </Cars>
 </Sales>
</xml>


If you want to get all Cars (new and used) you could still use any of the XPaths above.  If you want to isolate the New from the used, then you’re going to have to make some changes in any case.


‘//Car’ is obviously going to pick up 4 elements


‘/Sales[@Type=’New’]/Cars/Car’ is probably the most efficient in this case but it will vary based on the complexity of the condition (in []) and the complexity and size of the document.


It’s important to note that the effects of optimising your XPath queries won’t really be felt until you’re operating with:



  • Large documents (n mb+)
  • Deep documents (n levels deep) – n is variable based on the document size 
  • Heavy load and high demand for throughput of requests

This means don’t expect effecient XPaths to solve all your problems, but they shouldn’t be a limiting factor in a business application.  The other thing to say is that if your XPath queries are getting really complicated then your schema is probably in need of some attention as well.