XPath (XML Path Language) is a query language for selecting nodes from XML documents. It provides a powerful, concise syntax to navigate XML structure, extract data, and test conditions. XPath is used in XSLT transformations, XML Schema assertions, web scraping, and many XML processing tasks.
This tutorial covers XPath from the basics to more advanced techniques: path expressions, axes, predicates, functions, and real-world patterns. By the end, you should be able to write XPath queries for the XML documents you work with.
What is XPath?
XPath treats an XML document as a tree of nodes. Each element, attribute, and text value is a node. XPath expressions navigate this tree to select specific nodes.
Sample XML Document
We'll use this XML throughout the tutorial:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book id="1" category="fiction">
<title lang="en">The Great Gatsby</title>
<author>F. Scott Fitzgerald</author>
<year>1925</year>
<price>10.99</price>
</book>
<book id="2" category="fiction">
<title lang="en">1984</title>
<author>George Orwell</author>
<year>1949</year>
<price>8.99</price>
</book>
<book id="3" category="programming">
<title lang="en">Clean Code</title>
<author>Robert C. Martin</author>
<year>2008</year>
<price>45.99</price>
</book>
<book id="4" category="programming">
<title lang="en">Design Patterns</title>
<author>Gang of Four</author>
<year>1994</year>
<price>54.99</price>
</book>
</bookstore>
XPath Node Types:
- Element nodes: <book>, <title>, <author>
- Attribute nodes: id="1", category="fiction"
- Text nodes: "The Great Gatsby", "10.99"
- Comment nodes: <!-- comments -->
- Document node: The root of the entire document
Basic Path Expressions
XPath uses path expressions similar to file system paths. / separates path steps, and you navigate from the document root or current node.
Absolute Paths (from root)
| XPath Expression | Selects |
|---|---|
| /bookstore | Root <bookstore> element |
| /bookstore/book | All <book> elements (4 books) |
| /bookstore/book/title | All <title> elements (4 titles) |
| /bookstore/book[1] | First <book> element only |
Relative Paths
Relative paths start from the current node (no leading /).
| XPath Expression | Selects (from current node) |
|---|---|
| book | All <book> children |
| book/title | All <title> grandchildren |
| . | Current node |
| .. | Parent node |
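As a rough illustration, the relative paths above can be tried with Python's standard-library xml.etree.ElementTree, which supports only a limited subset of XPath. The sketch assumes a trimmed copy of the sample document and treats the <bookstore> element as the current node; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (titles only).
SAMPLE = """<bookstore>
  <book id="1" category="fiction"><title lang="en">The Great Gatsby</title></book>
  <book id="2" category="fiction"><title lang="en">1984</title></book>
  <book id="3" category="programming"><title lang="en">Clean Code</title></book>
  <book id="4" category="programming"><title lang="en">Design Patterns</title></book>
</bookstore>"""

# fromstring() returns the root element, so <bookstore> is the current node.
bookstore = ET.fromstring(SAMPLE)

books = bookstore.findall("book")         # relative path: <book> children
titles = bookstore.findall("book/title")  # relative path: <title> grandchildren
current = bookstore.find(".")             # the current node itself

print(len(books))                # 4
print([t.text for t in titles])  # the four titles in document order
print(current is bookstore)      # True
```

ElementTree cannot walk above the element a search starts from, so `..` is only useful there in the middle of a longer path.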
Descendant Selector //
// selects nodes anywhere in the document, regardless of depth.
| XPath Expression | Selects |
|---|---|
| //book | All <book> elements anywhere |
| //title | All <title> elements at any depth |
| //book/title | All <title> that are children of <book> |
Wildcard *
* matches any element node.
| XPath Expression | Selects |
|---|---|
| /bookstore/* | All children of <bookstore> (all books) |
| //book/* | All children of any <book> (titles, authors, etc.) |
| //* | All elements in the document |
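A minimal sketch of the descendant selector and the wildcard, again with Python's standard-library ElementTree and a two-book excerpt of the sample document. ElementTree does not accept absolute paths on an element, so each expression starts with a leading `.`; that is a property of this library, not of XPath itself.

```python
import xml.etree.ElementTree as ET

# Two-book excerpt of the sample document.
SAMPLE = """<bookstore>
  <book id="1"><title>The Great Gatsby</title><author>F. Scott Fitzgerald</author><year>1925</year><price>10.99</price></book>
  <book id="2"><title>1984</title><author>George Orwell</author><year>1949</year><price>8.99</price></book>
</bookstore>"""

bookstore = ET.fromstring(SAMPLE)

any_title = bookstore.findall(".//title")       # like //title
store_children = bookstore.findall("*")         # like /bookstore/*
book_children = bookstore.findall(".//book/*")  # like //book/*
descendants = bookstore.findall(".//*")         # descendants of <bookstore>

print(len(any_title))       # 2
print(len(store_children))  # 2
print(len(book_children))   # 8 (4 children per book)
print(len(descendants))     # 10 (2 books + 8 children)
```

Note that `.//*` run from <bookstore> leaves out <bookstore> itself, whereas `//*` in a full XPath engine also returns the root element.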
Predicates: Filtering Results
Predicates filter node selections using conditions inside square brackets [].
Position Predicates
| XPath Expression | Selects |
|---|---|
| //book[1] | First book |
| //book[last()] | Last book (4th book) |
| //book[position()<3] | First 2 books |
| //book[position()>2] | Books 3 and 4 |
⚠ Important: XPath uses 1-based indexing
Unlike most programming languages (0-indexed), XPath counts from 1. [1] is the first element, not the second.
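The off-by-one difference is easy to see side by side. The sketch below, which assumes a trimmed copy of the sample document, compares XPath positions with Python list indices using the standard-library ElementTree. ElementTree understands `[1]`, `[last()]`, and `[last()-1]` but not `position()` comparisons, so those are approximated with list slices.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (titles only).
SAMPLE = """<bookstore>
  <book id="1"><title>The Great Gatsby</title></book>
  <book id="2"><title>1984</title></book>
  <book id="3"><title>Clean Code</title></book>
  <book id="4"><title>Design Patterns</title></book>
</bookstore>"""

bookstore = ET.fromstring(SAMPLE)
books = bookstore.findall("book")  # Python list, 0-based

first = bookstore.find("book[1]")      # XPath position 1
last = bookstore.find("book[last()]")  # XPath last()

print(first is books[0])   # True: XPath [1] is Python index 0
print(last is books[-1])   # True

# Slices standing in for position()<3 and position()>2.
first_two = [b.get("id") for b in books[:2]]
from_third = [b.get("id") for b in books[2:]]
print(first_two, from_third)  # ['1', '2'] ['3', '4']
```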
Attribute Predicates
| XPath Expression | Selects |
|---|---|
| //book[@id] | All books with an id attribute |
| //book[@id='1'] | Book with id="1" |
| //book[@category='fiction'] | All fiction books |
| //title[@lang='en'] | All English titles |
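Attribute predicates are among the XPath features that Python's standard-library ElementTree supports directly, so the table above can be checked with a short sketch. It assumes a trimmed copy of the sample document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (titles only).
SAMPLE = """<bookstore>
  <book id="1" category="fiction"><title lang="en">The Great Gatsby</title></book>
  <book id="2" category="fiction"><title lang="en">1984</title></book>
  <book id="3" category="programming"><title lang="en">Clean Code</title></book>
  <book id="4" category="programming"><title lang="en">Design Patterns</title></book>
</bookstore>"""

bookstore = ET.fromstring(SAMPLE)

with_id = bookstore.findall(".//book[@id]")                  # has an id attribute
book_one = bookstore.find(".//book[@id='1']")                # id equals "1"
fiction = bookstore.findall(".//book[@category='fiction']")  # category equals "fiction"
english = bookstore.findall(".//title[@lang='en']")          # lang equals "en"

print(len(with_id))                               # 4
print(book_one.findtext("title"))                 # The Great Gatsby
print([b.findtext("title") for b in fiction])     # ['The Great Gatsby', '1984']
print(len(english))                               # 4
```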
Value Comparisons
| XPath Expression | Selects |
|---|---|
| //book[price>20] | Books with price > 20 |
| //book[price<=10] | Books with price ≤ 10 |
| //book[year>2000] | Books published after 2000 |
| //book[author='George Orwell'] | Books by George Orwell |
Multiple Conditions (AND / OR)
<!-- AND: Both conditions must be true -->
//book[price>20 and @category='programming']
// Result: Books that are programming AND price > 20
// Matches: Clean Code ($45.99), Design Patterns ($54.99)
<!-- OR: Either condition can be true -->
//book[year<1950 or year>2000]
// Result: Books published before 1950 OR after 2000
// Matches: The Great Gatsby (1925), 1984 (1949), Clean Code (2008)
<!-- Complex: Combine multiple conditions -->
//book[@category='fiction' and price<10]
// Result: Fiction books under $10
// Matches: 1984 ($8.99)
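To double-check the matches listed above, the same conditions can be written as plain Python filters. This is a sketch, not an XPath engine: the standard-library ElementTree handles the text-equality predicate `[author='George Orwell']` but has no numeric comparisons or `and`/`or`, so those are expressed with list comprehensions. The helper names `price`, `year`, and `titles` are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<bookstore>
  <book id="1" category="fiction"><title>The Great Gatsby</title><author>F. Scott Fitzgerald</author><year>1925</year><price>10.99</price></book>
  <book id="2" category="fiction"><title>1984</title><author>George Orwell</author><year>1949</year><price>8.99</price></book>
  <book id="3" category="programming"><title>Clean Code</title><author>Robert C. Martin</author><year>2008</year><price>45.99</price></book>
  <book id="4" category="programming"><title>Design Patterns</title><author>Gang of Four</author><year>1994</year><price>54.99</price></book>
</bookstore>"""

books = ET.fromstring(SAMPLE).findall("book")

def price(book):
    return float(book.findtext("price"))

def year(book):
    return int(book.findtext("year"))

def titles(selection):
    return [b.findtext("title") for b in selection]

# //book[author='George Orwell'] is supported by ElementTree as written.
orwell = titles(ET.fromstring(SAMPLE).findall("book[author='George Orwell']"))

# //book[price>20 and @category='programming']
pricey_programming = titles(b for b in books if price(b) > 20 and b.get("category") == "programming")

# //book[year<1950 or year>2000]
old_or_recent = titles(b for b in books if year(b) < 1950 or year(b) > 2000)

# //book[@category='fiction' and price<10]
cheap_fiction = titles(b for b in books if b.get("category") == "fiction" and price(b) < 10)

print(orwell)              # ['1984']
print(pricey_programming)  # ['Clean Code', 'Design Patterns']
print(old_or_recent)       # ['The Great Gatsby', '1984', 'Clean Code']
print(cheap_fiction)       # ['1984']
```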
XPath Axes: Navigating Relationships
Axes define the relationship between the current node and the nodes you want to select. They provide precise control over navigation.
Common Axes
| Axis | Description | Example |
|---|---|---|
| child:: | Direct children (default) | child::book |
| descendant:: | All descendants (children, grandchildren, etc.) | descendant::title |
| parent:: | Parent node | parent::bookstore |
| ancestor:: | All ancestors (parent, grandparent, etc.) | ancestor::* |
| following-sibling:: | Siblings after current node | following-sibling::book |
| preceding-sibling:: | Siblings before current node | preceding-sibling::book |
| attribute:: | Attributes of current node | attribute::id |
Axis Shortcuts
XPath provides shortcuts for commonly used axes:
| Shortcut | Full Form | Description |
|---|---|---|
| book | child::book | child:: is default |
| @id | attribute::id | @ selects attributes |
| //title | /descendant-or-self::node()/child::title | // abbreviates /descendant-or-self::node()/ |
| . | self::node() | Current node |
| .. | parent::node() | Parent node |
<!-- Example: From a <title> node, find the parent book's price -->
<!-- Starting from: <title>The Great Gatsby</title> -->
<!-- Method 1: Using parent axis -->
parent::book/price
// Result: <price>10.99</price>
<!-- Method 2: Using .. shortcut -->
../price
// Result: <price>10.99</price>
<!-- Example: Find all books after the first one -->
//book[1]/following-sibling::book
// Result: books with id 2, 3, 4
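Not every XPath library implements the full set of axes. As one example, Python's standard-library ElementTree has no `parent::` or `following-sibling::` axis, and its elements keep no pointer to their parent. The sketch below shows one way to approximate the two examples above under those limits; `parent_of` is an illustrative name, and the document is a trimmed copy of the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (title and price only).
SAMPLE = """<bookstore>
  <book id="1"><title>The Great Gatsby</title><price>10.99</price></book>
  <book id="2"><title>1984</title><price>8.99</price></book>
  <book id="3"><title>Clean Code</title><price>45.99</price></book>
  <book id="4"><title>Design Patterns</title><price>54.99</price></book>
</bookstore>"""

bookstore = ET.fromstring(SAMPLE)

# Build a child -> parent lookup, since elements do not know their parent.
parent_of = {child: parent for parent in bookstore.iter() for child in parent}

title = bookstore.find("book/title")  # <title>The Great Gatsby</title>

# parent::book/price, via the lookup table.
price_via_map = parent_of[title].findtext("price")

# ../price, using ".." in the middle of a longer path.
price_via_path = bookstore.findtext("book/title/../price")

# //book[1]/following-sibling::book, via list position among the siblings.
siblings = list(bookstore)
first = bookstore.find("book[1]")
following = [b.get("id") for b in siblings[siblings.index(first) + 1:]]

print(price_via_map, price_via_path)  # 10.99 10.99
print(following)                      # ['2', '3', '4']
```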
XPath Functions
XPath includes built-in functions for string manipulation, numeric operations, and boolean logic.
String Functions
<!-- contains(): Check if string contains substring -->
//book[contains(author, 'Orwell')]
// Result: Books by authors containing "Orwell"
<!-- starts-with(): Check string prefix -->
//book[starts-with(title, 'The')]
// Result: "The Great Gatsby"
<!-- string-length(): Get string length -->
//book[string-length(title) > 15]
// Result: Books with titles longer than 15 characters
<!-- substring(): Extract substring -->
substring(//book[1]/title, 1, 3)
// Result: "The" (first 3 characters)
<!-- concat(): Concatenate strings -->
concat(//book[1]/author, ' - ', //book[1]/title)
// Result: "F. Scott Fitzgerald - The Great Gatsby"
<!-- normalize-space(): Remove extra whitespace -->
normalize-space(' Clean Code ')
// Result: "Clean Code"
<!-- translate(): Character replacement -->
translate(//book[1]/title, 'aeiou', 'AEIOU')
// Result: "ThE GrEAt GAtsby" (lowercase vowels to uppercase)
Numeric Functions
<!-- sum(): Add values -->
sum(//book/price)
// Result: 120.96 (10.99 + 8.99 + 45.99 + 54.99)
<!-- count(): Count nodes -->
count(//book)
// Result: 4
<!-- number(): Convert to number -->
//book[number(year) > 2000]
// Result: Books after year 2000
<!-- floor(), ceiling(), round() -->
floor(45.99)
// Result: 45
ceiling(45.99)
// Result: 46
round(45.99)
// Result: 46
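When results from XPath are post-processed in a host language, it helps to know where the string and numeric functions above differ from their nearest equivalents. The Python sketch below mirrors a few of them on a trimmed copy of the sample document. `xpath_round` is a hypothetical helper, and `Decimal` is used only so the sum prints exactly; XPath 1.0 itself works with floating-point numbers.

```python
import math
import xml.etree.ElementTree as ET
from decimal import Decimal

# Trimmed copy of the sample document (title and price only).
SAMPLE = """<bookstore>
  <book id="1"><title>The Great Gatsby</title><price>10.99</price></book>
  <book id="2"><title>1984</title><price>8.99</price></book>
  <book id="3"><title>Clean Code</title><price>45.99</price></book>
  <book id="4"><title>Design Patterns</title><price>54.99</price></book>
</bookstore>"""

bookstore = ET.fromstring(SAMPLE)
title = bookstore.findtext("book[1]/title")

# substring(title, 1, 3): XPath starts at 1, Python slices start at 0.
prefix = title[0:3]

# translate(title, 'aeiou', 'AEIOU'): per-character replacement.
translated = title.translate(str.maketrans("aeiou", "AEIOU"))

# normalize-space(): trim the ends and collapse runs of whitespace.
normalized = " ".join("  Clean   Code ".split())

# sum(//book/price) and count(//book)
total = sum(Decimal(p.text) for p in bookstore.iter("price"))
count = len(bookstore.findall("book"))

def xpath_round(x):
    """Round half toward positive infinity, as XPath round() does."""
    return math.floor(x + 0.5)

print(prefix)                        # The
print(translated)                    # ThE GrEAt GAtsby
print(normalized)                    # Clean Code
print(total, count)                  # 120.96 4
print(xpath_round(2.5), round(2.5))  # 3 2 (Python rounds halves to even)
```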
Boolean Functions
<!-- not(): Logical NOT -->
//book[not(@category='fiction')]
// Result: Programming books (non-fiction)
<!-- true() / false(): Boolean literals -->
//book[price > 20 and true()]
// Result: Books over $20
<!-- boolean(): Convert to boolean -->
//book[boolean(@id)]
// Result: All books with id attribute
Node Functions
<!-- name(): Get element name -->
//book[1]/*[name()='title']
// Result: <title> element
<!-- position(): Current position -->
//book[position() mod 2 = 0]
// Result: Even-positioned books (2nd, 4th)
<!-- last(): Last position -->
//book[position() = last()]
// Result: Last book
<!-- text(): Select text nodes (strictly a node test, not a function) -->
//book[1]/title/text()
// Result: the text node "The Great Gatsby"
Real-World XPath Patterns
Find Most Expensive Books
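<!-- Books more expensive than $50 -->
//book[price > 50]
// Result: Design Patterns ($54.99)
<!-- The single most expensive book (XPath 1.0): no other price is higher -->
//book[not(//book/price > price)]
// Result: Design Patterns ($54.99)
<!-- Top 2 most expensive books (requires XPath 2.0 or later; XPath 1.0 cannot rank or sort) -->
for $b in //book return $b[count(//book[number(price) > number($b/price)]) < 2]
// Result: Clean Code ($45.99), Design Patterns ($54.99)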
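Because XPath 1.0 has no sorting, ranking queries like "top 2 by price" are often finished in the host language instead. A minimal sketch of that approach in Python, assuming a trimmed copy of the sample document and the standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (title and price only).
SAMPLE = """<bookstore>
  <book id="1"><title>The Great Gatsby</title><price>10.99</price></book>
  <book id="2"><title>1984</title><price>8.99</price></book>
  <book id="3"><title>Clean Code</title><price>45.99</price></book>
  <book id="4"><title>Design Patterns</title><price>54.99</price></book>
</bookstore>"""

books = ET.fromstring(SAMPLE).findall("book")

def price(book):
    # Convert the text of <price> to a number before comparing.
    return float(book.findtext("price"))

# Select with a simple path, then sort and slice in Python.
top_two = sorted(books, key=price, reverse=True)[:2]
most_expensive = max(books, key=price)

print([b.findtext("title") for b in top_two])  # ['Design Patterns', 'Clean Code']
print(most_expensive.findtext("title"))        # Design Patterns
```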
Group by Category
<!-- All fiction books -->
//book[@category='fiction']
<!-- All programming books -->
//book[@category='programming']
<!-- Count books per category -->
count(//book[@category='fiction'])
// Result: 2
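The queries above need one expression per category. When the categories are not known in advance, one option is to select every book once and group in the host language. The following Python sketch assumes a trimmed copy of the sample document; `by_category` is an illustrative name.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Trimmed copy of the sample document (titles only).
SAMPLE = """<bookstore>
  <book id="1" category="fiction"><title>The Great Gatsby</title></book>
  <book id="2" category="fiction"><title>1984</title></book>
  <book id="3" category="programming"><title>Clean Code</title></book>
  <book id="4" category="programming"><title>Design Patterns</title></book>
</bookstore>"""

books = ET.fromstring(SAMPLE).findall("book")

# Count books per category in a single pass.
counts = Counter(b.get("category") for b in books)

# Collect titles per category.
by_category = defaultdict(list)
for b in books:
    by_category[b.get("category")].append(b.findtext("title"))

print(dict(counts))       # {'fiction': 2, 'programming': 2}
print(dict(by_category))
```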
Complex Filtering
<!-- Fiction books under $10 -->
//book[@category='fiction' and price < 10]
// Result: 1984 ($8.99)
<!-- Books from 1900s (1900-1999) -->
//book[year >= 1900 and year < 2000]
// Result: The Great Gatsby (1925), 1984 (1949), Design Patterns (1994)
<!-- Books with specific title pattern -->
//book[contains(title, 'Code') or contains(title, 'Pattern')]
// Result: Clean Code, Design Patterns
Web Scraping Pattern
XPath is commonly used with Selenium, lxml, or Scrapy for web scraping. (BeautifulSoup on its own does not support XPath; it uses CSS selectors and its own search methods.)
<!-- Extract all product prices -->
//div[@class='product']//span[@class='price']
<!-- Find "Add to Cart" button for specific product -->
//div[contains(text(),'iPhone 15')]//button[text()='Add to Cart']
<!-- Get all links in navigation menu -->
//nav[@id='main-menu']//a/@href
<!-- Extract table data -->
//table[@id='results']//tr[position()>1]/td[2]
<!-- Find element by partial text -->
//button[contains(text(), 'Submit')]
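A minimal sketch of the first and third patterns, using Python's standard-library ElementTree on a small, hypothetical, well-formed page. Real HTML is rarely well-formed XML, so in practice an HTML-aware parser such as lxml or the scraping tools named above would be used; the markup, class names, and URLs here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; it must be well-formed for ElementTree to parse it.
PAGE = """<html><body>
  <nav id="main-menu"><a href="/home">Home</a><a href="/shop">Shop</a></nav>
  <div class="product"><h2>Widget</h2><p><span class="price">$19.99</span></p></div>
  <div class="product"><h2>Gadget</h2><p><span class="price">$29.50</span></p></div>
</body></html>"""

page = ET.fromstring(PAGE)

# //div[@class='product']//span[@class='price']
prices = [s.text for s in page.findall(".//div[@class='product']//span[@class='price']")]

# //nav[@id='main-menu']//a/@href
# ElementTree cannot select attributes in a path, so read them with get().
links = [a.get("href") for a in page.findall(".//nav[@id='main-menu']//a")]

print(prices)  # ['$19.99', '$29.50']
print(links)   # ['/home', '/shop']
```

Note that `@class='product'` matches only when the attribute is exactly `product`; elements with several classes need a `contains()` test in a full XPath engine.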
XPath Best Practices
Use Specific Paths When Possible
Prefer /bookstore/book/title over //title when the structure is known: // may have to search the whole document, while an explicit path visits only the named steps
Avoid Overly Complex Expressions
Break complex queries into multiple steps or use XSLT variables
Use Predicates to Filter Early
//book[@category='fiction']/title narrows the node set before the next step, which is usually cheaper than selecting every title and filtering afterward
Test XPath in Browser DevTools
Chrome/Firefox console: $x("//book[@category='fiction']")
Handle Namespaces Properly
XML with namespaces requires namespace-aware XPath queries
Don't Rely on Position Alone
//book[3] breaks if document structure changes. Prefer //book[@id='3']
Avoid //* in Production
Selecting all elements is slow on large documents
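The namespace advice above is the one that most often causes empty results, so here is a short sketch of it with Python's standard-library ElementTree. The namespace URI `http://example.com/books` and the prefix `b` are made up for illustration; the prefix used in the query only has to map to the same URI as the document, not match any prefix in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that puts every element in a default namespace.
NS_SAMPLE = """<bookstore xmlns="http://example.com/books">
  <book id="1"><title>The Great Gatsby</title></book>
  <book id="2"><title>1984</title></book>
</bookstore>"""

bookstore = ET.fromstring(NS_SAMPLE)

# Without a namespace mapping, the unprefixed name matches nothing.
unqualified = bookstore.findall("book")

# Map a prefix of our choosing to the document's namespace URI.
ns = {"b": "http://example.com/books"}
qualified = bookstore.findall("b:book", ns)
titles = [t.text for t in bookstore.findall("b:book/b:title", ns)]

print(len(unqualified))  # 0
print(len(qualified))    # 2
print(titles)            # ['The Great Gatsby', '1984']
```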
Conclusion
XPath is an essential tool for working with XML documents. Whether you're parsing data, transforming documents with XSLT, scraping websites, or validating XML schemas, XPath provides a powerful and concise way to navigate and query XML structure.
Start with basic path expressions and gradually master predicates, axes, and functions. Practice on real XML documents to build intuition. Use browser DevTools to test queries interactively. With XPath in your toolkit, you can efficiently extract, transform, and validate XML data in any project.