
Overview

XPath Injection is an attack technique targeting applications that use user-supplied input to construct XPath (XML Path Language) queries for searching or accessing XML documents. If the input is not properly sanitized, attackers can inject special characters (such as ', ", =, and |) to alter the XPath expression. This can allow them to bypass access controls, extract sensitive information from the XML document, or cause denial of service by crafting complex queries. 📜💉

Business Impact

XPath Injection can lead to:
  • Information Disclosure: Attackers can retrieve parts of the XML document they shouldn’t have access to, potentially exposing sensitive configuration data, user details, or business information.
  • Authentication/Authorization Bypass: If XPath queries are used to check credentials or permissions stored in XML, injection can bypass these checks.
  • Data Structure Discovery: Attackers can infer the structure of the underlying XML document.
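
To make the authentication bypass concrete, here is a minimal sketch using lxml. The XML layout, the `password` element, and the `login_vulnerable` function are assumptions invented for illustration; they are not taken from any real application.

```python
from lxml import etree

# Hypothetical credential store; element names are assumptions for this sketch.
USERS_XML = b"""
<users>
  <user><name>alice</name><password>s3cret</password></user>
  <user><name>admin</name><password>hunter2</password></user>
</users>
""".strip()
tree = etree.fromstring(USERS_XML)


def login_vulnerable(name: str, password: str) -> bool:
    # Untrusted values are concatenated straight into the expression.
    query = f"/users/user[name='{name}' and password='{password}']"
    return len(tree.xpath(query)) > 0


# A wrong password is rejected as expected.
print(login_vulnerable("alice", "wrong"))
# The payload turns the predicate into
#   name='admin' and password='' or '1'='1'
# and because "and" binds tighter than "or", the predicate is true for every user.
print(login_vulnerable("admin", "' or '1'='1"))
```

The second call succeeds without knowing any password, which is the bypass described in the list above.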

Reference Details

CWE ID: CWE-643
OWASP Top 10 (2021): A03:2021 - Injection
Severity: High (depending on the data stored in XML)

Framework-Specific Analysis and Remediation

This vulnerability is not tied to a specific web framework but rather to the XML parsing and XPath query libraries used. The core issue is string concatenation to build queries with untrusted input. Defenses include:
  1. Strict Input Validation: Validate user input against an expected format (e.g., allow only alphanumeric characters if searching for a username).
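  2. Escaping/Quoting: Carefully handle quotes in user input before embedding it in a string literal within the XPath query. XPath 1.0 has no escape character inside string literals: a value containing only single quotes can be wrapped in double quotes (and vice versa), while a value containing both must be split and reassembled with concat(). XPath 2.0 and later allow a quote to be escaped by doubling it. This is complex and error-prone.
  3. Parameterized Queries (If Supported): Some libraries support parameterized XPath queries or variable bindings (for example, lxml accepts XPath variables as keyword arguments, and Java's javax.xml.xpath provides XPathVariableResolver). Where available, this is the most robust solution because it separates the query structure from the user data.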
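
As a rough illustration of the third defense, the sketch below binds the user value to an XPath variable with lxml instead of formatting it into the query string. The in-memory document and the `roles_for` helper are assumptions made for this example.

```python
from typing import List

from lxml import etree

# Small in-memory document, assumed for this sketch.
DOC = etree.fromstring(
    b"<users>"
    b"<user><name>alice</name><role>user</role></user>"
    b"<user><name>admin</name><role>admin</role></user>"
    b"</users>"
)


def roles_for(username: str) -> List[str]:
    # lxml binds the keyword argument to $uname. The value is compared as a
    # plain string and is never parsed as XPath syntax.
    found = DOC.xpath("/users/user[name=$uname]/role/text()", uname=username)
    return [str(role) for role in found]


print(roles_for("alice"))          # ['user']
print(roles_for("' or '1'='1"))    # [] (the payload is just an unknown name)
```

Because the expression text is a constant, no input can change the shape of the query; only the compared value varies.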

  • Python
  • Java
  • .NET(C#)
  • PHP
  • Node.js
  • Ruby

Framework Context

The vulnerability arises when code uses libraries such as lxml or Python’s built-in xml.etree.ElementTree and builds XPath query strings by hand.

Vulnerable Scenario 1: User Search

Searching an XML user database based on a username from a GET request.
# views/xml_search.py
from lxml import etree
from django.http import HttpResponse

# Assume tree is loaded from an XML file:
# <users>
#   <user><name>alice</name><role>user</role></user>
#   <user><name>admin</name><role>admin</role></user>
# </users>
tree = etree.parse('users.xml')

def search_user(request):
    username = request.GET.get('user')
    if not username:
        return HttpResponse("Username required", status=400)

    # DANGEROUS: username is directly inserted into the XPath string literal.
    # Input: user = "' or '1'='1"
    # XPath becomes: /users/user[name='' or '1'='1']/role
    # This selects ALL user roles.
    xpath_query = f"/users/user[name='{username}']/role"
    try:
        results = tree.xpath(xpath_query)
        roles = [role.text for role in results]
        return HttpResponse(f"Roles: {', '.join(roles)}")
    except Exception as e:
        return HttpResponse(f"Error: {e}", status=500)

Vulnerable Scenario 2: Product Lookup

Looking up a product by ID, where the ID might contain quotes.
# views/product_lookup.py
from lxml import etree
from django.http import HttpResponse

# Assume product_tree is loaded:
# <products>
#   <product id="A'123"><name>Widget</name><price>10</price></product>
#   <product id="B456"><name>Gadget</name><price>20</price></product>
# </products>
product_tree = etree.parse('products.xml')

def get_product_price(request):
    product_id = request.GET.get('id')
    # DANGEROUS: product_id is inserted into the XPath string literal unescaped.
    # Input: id = "A'123" -> /products/product[@id='A'123']/price (syntax error)
    # Input: id = "' or ''='"
    # XPath: /products/product[@id='' or ''='']/price (always true: returns every price)
    xpath_query = f"/products/product[@id='{product_id}']/price"
    try:
        results = product_tree.xpath(xpath_query)
        prices = [price.text for price in results]
        return HttpResponse(f"Prices: {', '.join(prices)}")
    except Exception as e:
        return HttpResponse(f"Error: {e}", status=500)

Mitigation and Best Practices

Avoid constructing XPath queries via string formatting where possible. If you must, strictly validate the input format first, and quote any value embedded in a string literal. With lxml, the safest approach is usually a parameterized query: pass values as keyword arguments and reference them as $variables in the expression. The limited XPath subset in xml.etree.ElementTree has no variable support, so there it is safer to find elements by tag and filter in Python code.
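
The "find elements by tag and filter in Python" approach can be sketched with the standard library alone. The document contents and the `roles_for` name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Small in-memory document, assumed for this sketch.
USERS = ET.fromstring(
    "<users>"
    "<user><name>alice</name><role>user</role></user>"
    "<user><name>admin</name><role>admin</role></user>"
    "</users>"
)


def roles_for(username):
    """Return the roles of users whose <name> equals username exactly."""
    roles = []
    # Only a fixed tag name reaches the XML library; the untrusted value is
    # used solely in an ordinary Python string comparison.
    for user in USERS.iter("user"):
        if user.findtext("name") == username:
            role = user.findtext("role")
            if role is not None:
                roles.append(role)
    return roles


print(roles_for("admin"))          # ['admin']
print(roles_for("' or '1'='1"))    # []
```

This trades some query expressiveness for the guarantee that no user input is ever interpreted as part of a path expression.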

Secure Code Example

# views/xml_search.py (Secure - Parameterized or Escaping)
from lxml import etree
from django.http import HttpResponse
import re # For validation

tree = etree.parse('users.xml')

# Option 1: Strict Validation (if username format is known)
def search_user_validated(request):
    username = request.GET.get('user')
    # SECURE: Validate input strictly. Only allow alphanumeric.
    if not username or not re.fullmatch(r'[a-zA-Z0-9]+', username):
        return HttpResponse("Invalid username format", status=400)

    xpath_query = f"/users/user[name='{username}']/role" # Now safe due to validation
    results = tree.xpath(xpath_query)
    # ...

# Option 2: Escaping (complex and potentially fragile)
def escape_xpath_literal(text):
    if "'" in text and '"' in text:
        # Contains both, escape using concat() trick
        parts = text.split("'")
        return "concat('" + "', \"'\" , '".join(parts) + "')"
    elif "'" in text:
        return f'"{text}"' # Use double quotes if only single quotes are present
    else:
        return f"'{text}'" # Default to single quotes

def search_user_escaped(request):
    username = request.GET.get('user')
    if not username:
        return HttpResponse("Username required", status=400)

    # SECURE: Escape the username for use within an XPath literal.
    safe_username_literal = escape_xpath_literal(username)
    xpath_query = f"/users/user[name={safe_username_literal}]/role"
    results = tree.xpath(xpath_query)
    # ...

# Option 3: Parameterized XPath (most robust; built into lxml, no extra setup)
# lxml binds keyword arguments passed to xpath() to $variables in the
# expression, so the value is never parsed as XPath syntax.
def search_user_parameterized(request):
    username = request.GET.get('user')
    if not username:
        return HttpResponse("Username required", status=400)

    results = tree.xpath("/users/user[name=$uname]/role", uname=username)
    # ...
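
Because quoting helpers such as escape_xpath_literal in Option 2 are easy to get subtly wrong, it can help to round-trip test them. The sketch below uses a compact, hypothetical helper (`xpath_literal`, written for this example along the same lines) and checks with lxml that evaluating each generated literal yields the original string.

```python
from lxml import etree


def xpath_literal(text: str) -> str:
    """Quote text as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote kinds present: split on ' and rejoin the pieces with a
    # double-quoted "'" literal via concat().
    pieces = [f"'{part}'" for part in text.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


root = etree.fromstring(b"<r/>")
samples = ["alice", "O'Brien", 'say "hi"', "it's \"both\"", "' or '1'='1"]
for sample in samples:
    literal = xpath_literal(sample)
    # string(<literal>) must evaluate back to the exact input.
    assert root.xpath(f"string({literal})") == sample
    print(f"{sample!r} -> {literal}")
```

A check like this only shows the literal is well-formed for the sampled inputs; it does not replace validation or parameterization.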

Testing Strategy

Identify all inputs used in XPath queries. Submit values containing single quotes ('), double quotes ("), pipe (|), equals (=), and spaces, as well as XPath fragments such as ' or '1'='1 and '] | /* | /foo[bar='. Observe whether the query logic changes, unexpected data is returned, or errors occur.
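
The strategy above can be automated with a small harness. This is a sketch under stated assumptions: `check_payloads`, `safe_lookup`, and `leaky_lookup` are hypothetical names, and the two lookups are dictionary-backed stand-ins for whatever function in your application runs the XPath query.

```python
PAYLOADS = ["'", '"', "|", "=", "' or '1'='1", "'] | /* | /foo[bar='"]

# Stand-in data; a real test would call the application's lookup instead.
ROLES = {"alice": ["user"], "admin": ["admin"]}


def safe_lookup(username):
    # Behaves like a correctly parameterized query: exact match only.
    return ROLES.get(username, [])


def leaky_lookup(username):
    # Simulates an injectable query: any quote makes it match everything.
    if "'" in username:
        return [role for roles in ROLES.values() for role in roles]
    return ROLES.get(username, [])


def check_payloads(lookup):
    """Return (payload, reason) pairs for inputs that yield data or errors."""
    suspicious = []
    for payload in PAYLOADS:
        try:
            result = lookup(payload)
        except Exception as exc:
            suspicious.append((payload, f"error: {type(exc).__name__}"))
            continue
        if result:
            suspicious.append((payload, f"returned {len(result)} result(s)"))
    return suspicious


print(check_payloads(safe_lookup))    # []
print(check_payloads(leaky_lookup))   # three payloads flagged
```

None of the payloads is a valid username, so any data returned or any exception raised is treated as a sign that the input reached the query parser.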