Locating Data in DP2 Using XPath

XPath, short for XML Path Language, is a query language for locating information within XML and HTML documents. It navigates a document's elements and attributes, which makes it well suited to precise data retrieval in extraction tasks.

Basics of XPath

XPath expressions select nodes such as elements, attributes, and text. The basic expressions and what they select are:

  • //element: Selects all nodes named element within the document.

  • /element: Selects the top-level element of the document, provided it is named element. A leading / makes the path absolute, starting from the document root.

  • element[@attribute]: Selects all element nodes that have the named attribute, regardless of its value.

  • element[@attribute='value']: Chooses all element nodes where the attribute equals value.

  • element/text(): Retrieves the text content of element nodes.

  • element/child::node(): Selects all child nodes of element, including text and comment nodes as well as child elements.
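
The basic expressions above can be tried outside DP2. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which implements only a subset of XPath: paths are written relative to a starting element (.//p rather than //p), and text is read from the .text attribute instead of text(). The sample document and its element names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = """
<html>
  <body>
    <p id="intro">Hello</p>
    <div>
      <p>World</p>
      <p lang="en" id="note">Note</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# //p : every p element anywhere in the document
all_p = root.findall(".//p")

# p[@lang] : p elements that carry a lang attribute
with_lang = root.findall(".//p[@lang]")

# p[@id='intro'] : p elements whose id equals 'intro'
intro = root.findall(".//p[@id='intro']")

# p/text() : ElementTree exposes an element's leading text as .text
texts = [p.text for p in all_p]

print(len(all_p), len(with_lang), intro[0].text, texts)
```

Results come back in document order, so texts lists the paragraphs as they appear in the sample.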

Advanced Usage

XPath also supports logical operators (and, or), axes (ancestor, descendant, following-sibling), and functions (contains(), starts-with(), not()) for building more complex queries.

  1. Logical Operators:

    //input[@type='submit' or @type='button']
    

    This selects all input elements with a type attribute of either ‘submit’ or ‘button’.

  2. Using Axes:

    //div/ancestor::form
    

    This selects every form element that is an ancestor of at least one div element.

  3. Applying Functions:

    //h2[contains(text(),'News')]
    

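    This selects h2 elements whose text contains ‘News’. In XPath 1.0, contains() examines only the first text node returned by text(), so use contains(., 'News') when the heading text may be split across child elements.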
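
To see what these three queries select, the sketch below reproduces them in Python. The standard xml.etree.ElementTree module does not support or, the ancestor axis, or contains(), so each query is emulated with ordinary Python here; a full XPath engine would accept the expressions directly. The sample markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, used only for illustration.
SAMPLE = """
<body>
  <h2>Latest News</h2>
  <h2>About</h2>
  <form id="search">
    <div><input type="text"/></div>
    <div><input type="submit"/><input type="button"/></div>
  </form>
  <div>outside any form</div>
</body>
"""

root = ET.fromstring(SAMPLE.strip())

# //input[@type='submit' or @type='button']
# Emulated by testing the attribute in Python instead of using `or`.
buttons = [el for el in root.iter("input")
           if el.get("type") in ("submit", "button")]

# //div/ancestor::form
# ElementTree elements have no parent pointer, so build a child -> parent
# map and walk upwards from every div.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(el):
    """Yield the parent, grandparent, and so on up to the root."""
    while el in parent_of:
        el = parent_of[el]
        yield el

forms = []
for div in root.iter("div"):
    for anc in ancestors(div):
        if anc.tag == "form" and anc not in forms:
            forms.append(anc)

# //h2[contains(text(),'News')]
# el.text is the text before the first child element; for headings without
# child elements that is the whole text.
news = [h.text for h in root.iter("h2") if "News" in (h.text or "")]

print(len(buttons), [f.get("id") for f in forms], news)
```

The div that sits outside the form contributes no match, which is the behaviour the ancestor axis is meant to give.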

Applying XPath in DP2

In DP2, XPath expressions serve as data selectors. In the example below, each field under elements gives its XPath expression in col and its data type in type:

{
  "elements": {
    "postTitle": {
      "col": "//div[contains(@class, 'post-title')]/text()",
      "type": "string"
    },
    "link": {
      "col": "//a/@href",
      "type": "string"
    }
  }
}

Here, postTitle extracts the text of div elements whose class attribute contains ‘post-title’, and link extracts the href attribute of every a element in the document.
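
Because a mistyped selector only shows up at extraction time, it can help to sanity-check a configuration before running it. The sketch below is an illustration, not part of DP2: it assumes only the structure shown above (an elements object whose fields each carry col and type), and the helper name check_selectors is hypothetical.

```python
import json

CONFIG = """
{
  "elements": {
    "postTitle": {
      "col": "//div[contains(@class, 'post-title')]/text()",
      "type": "string"
    },
    "link": {
      "col": "//a/@href",
      "type": "string"
    }
  }
}
"""

def check_selectors(config):
    """Return a list of problems found in the 'elements' section (hypothetical helper)."""
    problems = []
    for name, field in config.get("elements", {}).items():
        for key in ("col", "type"):
            if key not in field:
                problems.append(f"{name}: missing '{key}'")
        col = field.get("col", "")
        # Cheap lint: unbalanced brackets usually mean a typo in the XPath.
        if col.count("[") != col.count("]") or col.count("(") != col.count(")"):
            problems.append(f"{name}: unbalanced brackets in '{col}'")
    return problems

config = json.loads(CONFIG)
print(check_selectors(config))  # an empty list means no problems were found
```

This is only a lint: it cannot tell whether an expression matches the intended elements, so testing against a real page is still needed.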

The following examples show XPath expressions for common extraction steps:

Extracting Drug Information in detail_step

  • Select <td> elements with the class ‘drug-name’:

    //td[@class='drug-name']
    
  • Select <td> elements with the class ‘approval-number’:

    //td[@class='approval-number']
    
  • Select <td> elements with the class ‘company-name’:

    //td[@class='company-name']
    
  • Select <a> elements within <div> elements with the class ‘attachments’:

    //div[@class='attachments']/a
    
  • Select <p> elements within <div> elements with the class ‘reference’:

    //div[@class='reference']/p
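
As a rough illustration of how these selectors map onto a record, the sketch below runs several of them against a made-up detail page using Python's xml.etree.ElementTree (its XPath subset needs a leading . and well-formed markup). The sample HTML, the field names, and the values are invented; DP2 itself is not involved.

```python
import xml.etree.ElementTree as ET

# Invented detail page; real pages will differ.
PAGE = """
<html><body>
  <table>
    <tr>
      <td class="drug-name">Examplomycin</td>
      <td class="approval-number">H00000000</td>
      <td class="company-name">Example Pharma Ltd.</td>
    </tr>
  </table>
  <div class="attachments">
    <a href="/files/label.pdf">Label</a>
    <a href="/files/leaflet.pdf">Leaflet</a>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

def first_text(xpath):
    """Text of the first match, or None when nothing matches."""
    el = root.find(xpath)
    return el.text.strip() if el is not None and el.text else None

record = {
    "drug_name": first_text(".//td[@class='drug-name']"),
    "approval_number": first_text(".//td[@class='approval-number']"),
    "company_name": first_text(".//td[@class='company-name']"),
    # //div[@class='attachments']/a : direct <a> children only
    "attachments": [a.get("href")
                    for a in root.findall(".//div[@class='attachments']/a")],
}
print(record)
```

Note that @class='drug-name' compares the whole attribute value, so a cell written as class="drug-name highlight" would not match.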
    

Extracting the Total Number of Pages in totalpage_step

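  • Locate the last numbered link in pagination. a[5] selects the fifth <a> among its siblings, so this works only when the last page link is always the fifth link, ahead of “Next” and “Last”: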

    //div[@class='pagination']//a[5]
    
  • Select the <span> with the class ‘page-info’ inside pagination, whose text holds the total page count:

    //div[@class='pagination']/span[@class='page-info']
    
  • Select the <script> inside pagination, for static pages that embed the total page count in a pagination script:

    //div[@class='pagination']/script
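
The selectors above return elements, so the page count still has to be read out of their text. The sketch below shows one way to do that with Python's xml.etree.ElementTree and a regular expression. The pagination markup and the "Page 1 of 5" wording are assumptions made for the example; real sites format this differently.

```python
import re
import xml.etree.ElementTree as ET

# Invented pagination markup; the "Page 1 of 5" wording is an assumption.
PAGE = """
<body>
  <div class="pagination">
    <a href="?p=1">1</a><a href="?p=2">2</a><a href="?p=3">3</a>
    <a href="?p=4">4</a><a href="?p=5">5</a>
    <a href="?p=2">Next</a><a href="?p=5">Last</a>
    <span class="page-info">Page 1 of 5</span>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# //div[@class='pagination']//a[5] : the fifth <a> among its siblings
fifth = root.find(".//div[@class='pagination']//a[5]")
total_from_link = int(fifth.text)

# //div[@class='pagination']/span[@class='page-info'] : take the last number
info = root.find(".//div[@class='pagination']/span[@class='page-info']")
numbers = re.findall(r"\d+", info.text)
total_from_info = int(numbers[-1])

print(total_from_link, total_from_info)
```

Reading the count from the page-info text is less fragile than relying on a fixed link position, since the number of visible page links often changes with the current page.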
    

Extracting the Category in category_step

  • Extract the category name from the parent box element:

    //div[contains(@class, 'p_parentBox')]/a/text()
    
  • Extract the category name from the form middle content:

    //div[contains(@class, 'formMiddleContent482')]//a/text()
    
  • Extract the category name from the web component menu:

    //div[@class='w-com-menu-in']/ul/li/div/a/text()
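
The difference between an exact class match and contains(@class, ...) matters for these selectors. The sketch below, using Python's xml.etree.ElementTree on invented markup, runs the menu selector as a path and emulates the contains() test in Python, because ElementTree's XPath subset has no contains() or text().

```python
import xml.etree.ElementTree as ET

# Invented menu markup for illustration.
PAGE = """
<body>
  <div class="box p_parentBox active"><a href="/c/1">Antibiotics</a></div>
  <div class="w-com-menu-in">
    <ul>
      <li><div><a href="/c/2">Vaccines</a></div></li>
      <li><div><a href="/c/3">Analgesics</a></div></li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# //div[@class='w-com-menu-in']/ul/li/div/a/text()
menu = [a.text
        for a in root.findall(".//div[@class='w-com-menu-in']/ul/li/div/a")]

# //div[contains(@class, 'p_parentBox')]/a/text()
# The substring test on the class attribute is done in Python.
parent_box = [a.text
              for div in root.iter("div")
              if "p_parentBox" in div.get("class", "")
              for a in div.findall("a")]

# An exact match fails here because the attribute holds several class names.
exact = root.findall(".//div[@class='p_parentBox']")

print(menu, parent_box, len(exact))
```

A substring test also matches longer class names that merely include the text, so it is worth checking the page for such near-duplicates.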
    

Extracting Page Information in list_step

  • Select <li> elements within <ul> elements with the class ‘category-list’:

    //ul[@class='category-list']/li
    
  • Select <div> elements within <div> elements with the class ‘category-grid’:

    //div[@class='category-grid']/div[@class='category-cell']
    
  • Select <a> elements within <div> elements with the class ‘category-sidebar’:

    //div[@class='category-sidebar']//a
    
  • Select <a> elements within <div> elements with the class ‘category-tags’:

    //div[@class='category-tags']//a
    
  • Select <div> elements within <div> elements with the class ‘category-waterfall’:

    //div[@class='category-waterfall']/div[@class='category-item']
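
List selectors like these typically pick one container element per entry, with the individual fields then read relative to each container. The sketch below illustrates that pattern with Python's xml.etree.ElementTree on an invented list page; the title and url field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented list page for illustration.
PAGE = """
<body>
  <ul class="category-list">
    <li><a href="/item/1">First item</a></li>
    <li><a href="/item/2">Second item</a></li>
  </ul>
  <div class="category-sidebar">
    <p><a href="/tag/a">Tag A</a></p>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# //ul[@class='category-list']/li : one element per list entry
rows = []
for li in root.findall(".//ul[@class='category-list']/li"):
    link = li.find("a")  # path relative to the current <li>
    rows.append({"title": link.text, "url": link.get("href")})

# //div[@class='category-sidebar']//a : links at any depth inside the sidebar
sidebar = [a.get("href")
           for a in root.findall(".//div[@class='category-sidebar']//a")]

print(rows, sidebar)
```

The sidebar link is nested inside a <p>, so it is found by //a but would be missed by a single / step.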
    

Summary

XPath gives DP2 configurations a precise way to locate and extract data from document elements. When setting up a DP2 configuration, test each XPath expression against representative pages to confirm that it selects the intended elements and nothing else.