Locating Data in DP2 Using XPath
XPath, short for XML Path Language, is a language for locating information within XML and HTML documents. It navigates a document through its elements and attributes, which makes it well suited to precise data retrieval in extraction tasks.
Basics of XPath
XPath expressions allow for the selection of nodes, including elements, attributes, text, and more. Below are basic XPath expressions and their functions:
//element: Selects all nodes named element within the document.
/element: Targets all element nodes directly under the root node.
element[@attribute]: Finds all element nodes with a specific attribute.
element[@attribute='value']: Chooses all element nodes where the attribute equals value.
element/text(): Retrieves the text content of element nodes.
element/child::node(): Selects the child nodes of element.
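As a rough illustration of the basic expressions, the sketch below runs them against a small made-up document with Python's standard-library xml.etree.ElementTree. That module supports only a limited XPath subset: paths are written relative to an element (".//item" rather than "//item") and there is no text() step, so text is read from the .text attribute instead. The sample markup and variable names are assumptions for illustration and are not part of DP2.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this example.
SAMPLE = """
<catalog>
  <item id="a1" lang="en">First</item>
  <group>
    <item id="b2">Second</item>
    <item>Third</item>
  </group>
</catalog>
"""

root = ET.fromstring(SAMPLE.strip())

# //item : every <item> anywhere in the document (ElementTree spells it .//item)
all_items = root.findall(".//item")

# item, relative to the root element: only <item> nodes that are direct children
direct_items = root.findall("item")

# //item[@id] : <item> nodes that carry an id attribute
with_id = root.findall(".//item[@id]")

# //item[@id='b2'] : <item> nodes whose id equals 'b2'
exact_id = root.findall(".//item[@id='b2']")

# //item/text() : ElementTree has no text() step, so read .text instead
texts = [item.text for item in all_items]

print(len(all_items), len(direct_items), len(with_id))
print(exact_id[0].text)
print(texts)
```

A full XPath 1.0 engine would accept the expressions exactly as listed; the rewritten paths here only work around the standard library's subset.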
Advanced Usage
XPath’s capabilities extend to using logical operators (and, or), axes (ancestor, descendant, following-sibling), and functions (contains(), starts-with(), not()) for crafting complex queries.
Logical Operators:
//input[@type='submit' or @type='button']
This selects all input elements with a type attribute of either ‘submit’ or ‘button’.
Using Axes:
//div/ancestor::form
This expression finds form ancestors of div elements.
Applying Functions:
//h2[contains(text(),'News')]
It selects h2 elements containing the text ‘News’.
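To make the meaning of these three expressions concrete, the sketch below reproduces each one in plain Python over a made-up page fragment. Python's standard-library ElementTree does not implement the or operator, the ancestor axis, or contains(), so this is an emulation of what the expressions select, not an XPath evaluation; a full XPath 1.0 engine would take the expressions as written. The markup and the helper name ancestors are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for this example (well-formed XML).
PAGE = """
<body>
  <form id="login">
    <div><input type="text"/><input type="submit"/></div>
    <input type="button"/>
  </form>
  <div><h2>Latest News</h2><h2>Contact</h2></div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# //input[@type='submit' or @type='button']
buttons = [el for el in root.iter("input")
           if el.get("type") in ("submit", "button")]

# //div/ancestor::form
# ElementTree nodes have no parent pointer, so build a child -> parent map first.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(el):
    """Yield the ancestors of el, nearest first (hypothetical helper)."""
    while el in parent_of:
        el = parent_of[el]
        yield el


form_ancestors = []
for div in root.iter("div"):
    for anc in ancestors(div):
        if anc.tag == "form" and anc not in form_ancestors:
            form_ancestors.append(anc)

# //h2[contains(text(),'News')]
news_headings = [h2 for h2 in root.iter("h2") if "News" in (h2.text or "")]

print([b.get("type") for b in buttons])
print([f.get("id") for f in form_ancestors])
print([h.text for h in news_headings])
```

Only the first div sits inside a form, so the ancestor query yields a single form even though the page contains two div elements.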
Applying XPath in DP2
In DP2, XPath expressions are specified as data selectors for precise data extraction. For instance:
{
"elements": {
"postTitle": {
"col": "//div[contains(@class, 'post-title')]/text()",
"type": "string"
},
"link": {
"col": "//a/@href",
"type": "string"
}
}
}
Here, postTitle is configured to extract the text of div elements whose class attribute contains ‘post-title’, and link extracts the href attribute from every a element in the document.
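Before running an extraction, it can help to list the selectors a configuration defines and confirm that each field is complete. The sketch below does this for the example configuration, assuming only the "elements", "col", and "type" layout shown there; the function list_selectors is a hypothetical helper, not a DP2 API, and DP2's full schema may contain more than this.

```python
import json

# The example configuration, as a JSON string.
CONFIG = """
{
  "elements": {
    "postTitle": {
      "col": "//div[contains(@class, 'post-title')]/text()",
      "type": "string"
    },
    "link": {
      "col": "//a/@href",
      "type": "string"
    }
  }
}
"""


def list_selectors(config_text):
    """Return (field, xpath, type) tuples; raise if a field lacks a key.

    Hypothetical helper: it assumes only the layout of the example,
    not DP2's complete configuration schema.
    """
    config = json.loads(config_text)
    rows = []
    for field, spec in config["elements"].items():
        missing = {"col", "type"} - spec.keys()
        if missing:
            raise ValueError(f"{field}: missing {sorted(missing)}")
        rows.append((field, spec["col"], spec["type"]))
    return rows


rows = list_selectors(CONFIG)
for field, xpath, kind in rows:
    print(f"{field:<10} {kind:<7} {xpath}")
```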
Here are some additional practical examples of using XPath to extract specific types of information:
Extracting Drug Information in detail_step:
Select <td> elements with the class ‘drug-name’:
//td[@class='drug-name']
Select <td> elements with the class ‘approval-number’:
//td[@class='approval-number']
Select <td> elements with the class ‘company-name’:
//td[@class='company-name']
Select <a> elements within <div> elements with the class ‘attachments’:
//div[@class='attachments']/a
Select <p> elements within <div> elements with the class ‘reference’:
//div[@class='reference']/p
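The detail_step selectors can be tried out on a made-up detail page, as in the sketch below. It uses Python's standard-library ElementTree, which handles these attribute-equality paths when they are written relative to the root (".//td[...]"). The page fragment, its placeholder values, and the helper first_text are assumptions for illustration; real pages are often not well-formed XML and would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail page fragment; all values are placeholders.
DETAIL = """
<html><body>
  <table><tr>
    <td class="drug-name">Example Drug</td>
    <td class="approval-number">ABC-0001</td>
    <td class="company-name">Example Pharma</td>
  </tr></table>
  <div class="attachments"><a href="/files/label.pdf">Label</a></div>
  <div class="reference"><p>Reference note</p></div>
</body></html>
"""

root = ET.fromstring(DETAIL.strip())


def first_text(path):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = root.find(path)
    return node.text.strip() if node is not None and node.text else None


record = {
    "drug_name": first_text(".//td[@class='drug-name']"),
    "approval_number": first_text(".//td[@class='approval-number']"),
    "company_name": first_text(".//td[@class='company-name']"),
    # The selector returns <a> elements; the href is read from each one.
    "attachments": [a.get("href")
                    for a in root.findall(".//div[@class='attachments']/a")],
    "references": [p.text
                   for p in root.findall(".//div[@class='reference']/p")],
}
print(record)
```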
Extract the Total Number of Pages in totalpage_step:
Locate the last numbered link in the pagination (excluding “Next” and “Last”); the index 5 assumes the fifth link is the last page number:
//div[@class='pagination']//a[5]
Extract total page count from <span> within pagination info:
//div[@class='pagination']/span[@class='page-info']
Extract total page count from pagination script in static pages:
//div[@class='pagination']/script
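The three totalpage_step selectors return an element, not a number, so some post-processing is usually needed. The sketch below shows one way this could look on a made-up pagination block, using Python's standard-library ElementTree and regular expressions. The markup, the "Page 1 of 5" wording, and the script variable name totalPages are all assumptions; a real site will need its own patterns.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, invented for this example.
PAGE = """
<body>
  <div class="pagination">
    <a href="?p=1">1</a><a href="?p=2">2</a><a href="?p=3">3</a>
    <a href="?p=4">4</a><a href="?p=5">5</a>
    <a href="?p=2">Next</a><a href="?p=5">Last</a>
    <span class="page-info">Page 1 of 5</span>
    <script>var totalPages = 5;</script>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# //div[@class='pagination']//a[5] : the fifth <a> among its siblings
fifth_link = root.find(".//div[@class='pagination']//a[5]")
total_from_link = int(fifth_link.text)

# //div[@class='pagination']/span[@class='page-info']
info = root.find(".//div[@class='pagination']/span[@class='page-info']")
# Assumption: the total is the last number in the span's text.
total_from_span = int(re.findall(r"\d+", info.text)[-1])

# //div[@class='pagination']/script
script = root.find(".//div[@class='pagination']/script")
# Assumption: the script assigns the count to a variable named totalPages.
match = re.search(r"totalPages\s*=\s*(\d+)", script.text)
total_from_script = int(match.group(1)) if match else None

print(total_from_link, total_from_span, total_from_script)
```

The positional selector is the most fragile of the three, because it breaks as soon as the number of page links changes; the span and script variants read the total directly.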
Extracting the Category in category_step:
Extracting Category Name from Parent Box Element:
//div[contains(@class, 'p_parentBox')]/a/text()
Extracting Category Name from Form Middle Content:
//div[contains(@class, 'formMiddleContent482')]//a/text()
Extracting Category Name from Web Component Menu:
//div[@class='w-com-menu-in']/ul/li/div/a/text()
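The category selectors mix contains(@class, ...) with exact @class='...' matches, and the difference matters when an element carries more than one class. The sketch below illustrates this on made-up markup. Python's standard-library ElementTree has no contains(), so the substring test is emulated in plain Python; the markup, the extra "active" class, and the category names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical category markup, invented for this example.
MENU = """
<body>
  <div class="p_parentBox active"><a href="/cat/1">Antibiotics</a></div>
  <div class="w-com-menu-in">
    <ul><li><div><a href="/cat/2">Vaccines</a></div></li></ul>
  </div>
</body>
"""

root = ET.fromstring(MENU.strip())

# //div[@class='p_parentBox'] : exact match, misses class="p_parentBox active"
exact = root.findall(".//div[@class='p_parentBox']")

# //div[contains(@class, 'p_parentBox')]/a/text()
# Substring test on the class attribute, emulated in Python.
substring = [a.text
             for div in root.iter("div")
             if "p_parentBox" in div.get("class", "")
             for a in div.findall("a")]

# //div[@class='w-com-menu-in']/ul/li/div/a/text()
menu = [a.text
        for a in root.findall(".//div[@class='w-com-menu-in']/ul/li/div/a")]

print(len(exact), substring, menu)
```

Note that contains() is a plain substring test, so it would also match a class such as ‘p_parentBox2’; the exact form is stricter but only works when the attribute holds that single value.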
Extract Page Information in list_step:
Select <li> elements within <ul> elements with the class ‘category-list’:
//ul[@class='category-list']/li
Select <div> elements with the class ‘category-cell’ within <div> elements with the class ‘category-grid’:
//div[@class='category-grid']/div[@class='category-cell']
Select <a> elements at any depth within <div> elements with the class ‘category-sidebar’:
//div[@class='category-sidebar']//a
Select <a> elements at any depth within <div> elements with the class ‘category-tags’:
//div[@class='category-tags']//a
Select <div> elements with the class ‘category-item’ within <div> elements with the class ‘category-waterfall’:
//div[@class='category-waterfall']/div[@class='category-item']
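The list_step selectors use both the child step (/) and the descendant step (//), and the two return different sets on nested markup. The sketch below contrasts them on a made-up listing page with Python's standard-library ElementTree; the markup and link targets are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page fragment, invented for this example.
LISTING = """
<body>
  <ul class="category-list">
    <li><a href="/c/1">One</a></li>
    <li><a href="/c/2">Two</a>
      <ul><li><a href="/c/2/a">Nested</a></li></ul>
    </li>
  </ul>
  <div class="category-sidebar">
    <p><a href="/s/1">Side one</a></p>
    <a href="/s/2">Side two</a>
  </div>
</body>
"""

root = ET.fromstring(LISTING.strip())

# //ul[@class='category-list']/li
# Child step: only <li> directly under the matching <ul>, not the nested one.
top_level = root.findall(".//ul[@class='category-list']/li")

# //div[@class='category-sidebar']//a
# Descendant step: every <a> at any depth below the matching <div>.
sidebar_links = [a.get("href")
                 for a in root.findall(".//div[@class='category-sidebar']//a")]

print(len(top_level), sidebar_links)
```

Choosing the child step keeps nested items out of the result, while the descendant step is more forgiving when links are wrapped in extra markup.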
Resource Link
Explore XPath further with this handy resource:
This cheat sheet offers a quick reference for XPath syntax and functions, ideal for quick consultations during practical applications.
Summary
XPath is a robust tool for locating and extracting data in DP2, enabling precise targeting of document elements. When setting up DP2 configurations, test each XPath expression thoroughly to confirm that it matches the intended elements and nothing else.