DP2 Beginner's Guide


Select Drug Category Links

This tutorial shows how to use a Jexter configuration to handle a specific HTML structure, with an emphasis on selecting links. The basic process is introduced in the DP2 for Beginners guide; here we look at a special case: filtering a list so that only the links we need are selected.

Sample Web Page Structure

Assume the target webpage’s HTML structure is as follows:

<div class="com_main bg3">
  <div>
    <div>
      <ul>
        <li>All Categories</li>
        <li>
          <a href="https://www.examplepharma.com/categories/list.html?catid=123">Antiviral Drugs</a>
        </li>
        <li>
          <a href="https://www.examplepharma.com/categories/list.html?catid=456">Cardiac Protection Drugs</a>
        </li>
        <li>Other Categories</li> <!-- Last item: not needed (the first item is not needed either) -->
      </ul>
    </div>
  </div>
</div>

XPath Expression

To extract the category names and links in the middle (excluding the first and last li elements), we use the following XPath expression:

//div[@class='com_main bg3']/div[1]/div[1]/ul/li[position()>1 and position()<last()]
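
The effect of the `position()>1 and position()<last()` predicate can be checked outside DP2. The following is a minimal sketch, not part of DP2 or Jexter: it uses Python's standard-library `xml.etree.ElementTree`, which does not support `position()` comparisons, so the list slice `[1:-1]` stands in for the predicate. The names `SAMPLE_HTML` and `middle_items` are hypothetical, and the sketch assumes the markup is well-formed enough to parse as XML, as the sample above is.

```python
import xml.etree.ElementTree as ET

# The sample structure from this tutorial (must be well-formed XML for ElementTree).
SAMPLE_HTML = """
<div class="com_main bg3">
  <div>
    <div>
      <ul>
        <li>All Categories</li>
        <li><a href="https://www.examplepharma.com/categories/list.html?catid=123">Antiviral Drugs</a></li>
        <li><a href="https://www.examplepharma.com/categories/list.html?catid=456">Cardiac Protection Drugs</a></li>
        <li>Other Categories</li>
      </ul>
    </div>
  </div>
</div>
"""


def middle_items(markup: str) -> list:
    """Return (name, href) pairs for every <li> except the first and the last."""
    root = ET.fromstring(markup.strip())  # root is the outer <div>
    items = root.findall("./div[1]/div[1]/ul/li")
    # Equivalent of li[position()>1 and position()<last()]:
    # drop index 0 (first) and index -1 (last).
    middle = items[1:-1]
    pairs = []
    for li in middle:
        anchor = li.find("a")
        if anchor is not None:  # skip items that carry no link
            pairs.append((anchor.text, anchor.get("href")))
    return pairs


for name, href in middle_items(SAMPLE_HTML):
    print(name, href)
```

If the list had only two items, the slice (like the XPath predicate) would select nothing, which is worth keeping in mind for categories pages with very short lists.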

Data Extraction Configuration

Based on the provided structure, we define the following configuration for data extraction:

{
  "total_rows": "//div[@class='com_main bg3']/div[1]/div[1]/ul/li[position()>1 and position()<last()]",
  "elements": {
    "category": "./a/text()",
    "link": {
      "col": "./a/@href",
      "callback": "absolute_url"
    },
    "category_id": {
      "col": "./a/@href",
      "function": {
        "regexp": "catid=(\\d+)",
        "type": "string"
      },
      "post_process": "return data.match(/catid=(\\d+)/)[1]"
    }
  }
}
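
Note that the pattern is written as `catid=(\\d+)` because backslashes must be escaped inside a JSON string; once the JSON is decoded, the pattern is `catid=(\d+)`. The sketch below illustrates this with Python's `json` and `re` modules. It is an illustration only and makes no claim about how Jexter evaluates the `regexp` or `post_process` fields internally; `extract_category_id` is a hypothetical helper.

```python
import json
import re
from typing import Optional

# The "regexp" value exactly as it appears in the JSON configuration.
raw_fragment = r'{"regexp": "catid=(\\d+)"}'

# After JSON decoding, the doubled backslash becomes a single one.
PATTERN = json.loads(raw_fragment)["regexp"]
print(PATTERN)  # catid=(\d+)


def extract_category_id(href: str, pattern: str = PATTERN) -> Optional[str]:
    """Return the first capture group of the pattern, or None if it does not match."""
    match = re.search(pattern, href)
    return match.group(1) if match else None


print(extract_category_id("https://www.examplepharma.com/categories/list.html?catid=123"))
print(extract_category_id("https://www.examplepharma.com/categories/list.html"))
```

The helper returns `None` for a link without `catid` rather than failing, a case worth testing against real pages before relying on the extracted ID.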

Example Extraction Results

With the above configuration and HTML structure, we expect the following extraction results:

[
  {
    "category": "Antiviral Drugs",
    "link": "https://www.examplepharma.com/categories/list.html?catid=123",
    "category_id": "123"
  },
  {
    "category": "Cardiac Protection Drugs",
    "link": "https://www.examplepharma.com/categories/list.html?catid=456",
    "category_id": "456"
  }
]

Result Explanation

This result is a JSON array, where each object represents a drug category. For each category, we provide three key pieces of information:

  • category: The name of the drug category, such as “Antiviral Drugs” or “Cardiac Protection Drugs”.

  • link: A link to the list of drugs in that category.

  • category_id: The category ID extracted from the link, that is, the value of the catid query parameter.
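
Because category_id is derived from link, the two fields can be cross-checked after extraction. The following is a small sketch of such a check using Python's standard `urllib.parse`; it runs on the result records outside DP2, and the function names `catid_from_link` and `mismatched_records` are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def catid_from_link(link: str):
    """Return the value of the catid query parameter, or None if it is absent."""
    values = parse_qs(urlparse(link).query).get("catid", [])
    return values[0] if values else None


def mismatched_records(records: list) -> list:
    """Return the records whose category_id differs from the catid in their link."""
    return [r for r in records if catid_from_link(r["link"]) != r["category_id"]]


# The example extraction results from this tutorial.
results = [
    {
        "category": "Antiviral Drugs",
        "link": "https://www.examplepharma.com/categories/list.html?catid=123",
        "category_id": "123",
    },
    {
        "category": "Cardiac Protection Drugs",
        "link": "https://www.examplepharma.com/categories/list.html?catid=456",
        "category_id": "456",
    },
]

print(mismatched_records(results))  # an empty list means every record is consistent
```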

Through this tutorial, we demonstrated how to extract drug category information from webpages with specific HTML structures, with a particular emphasis on link selection.


© Copyright 2025, HzaCode.
