Jexter Configuration - Extract Category Information

Extracting the Category in `category_step`

Description: Within the DP2 system, category_step is the step that extracts category information from web pages. This tutorial shows how to extract category links from HTML pages with Jexter. It walks through three example configurations, each targeting a different HTML structure and extraction requirement, so that you can adapt them to your own pages.

Example 1: Extracting Elements with a Specific Class Name

Suppose we have the following HTML fragment, containing two div elements with product information, each with a p_parentBox class:

<div class="p_parentBox">
  <a href="https://www.example.com/product/12345/">Antitumor Medication</a>
</div>
<div class="p_parentBox">
  <a href="https://www.example.com/product/67890/">Cardiovascular Drugs</a>
</div>

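To extract the product category, link, and category ID from these elements, we can use the following Jexter configuration. Here total_rows selects one node per category, each XPath under elements is evaluated relative to that node, and category_id is taken from the link with a regular expression: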

{
  "total_rows": "//div[contains(@class, 'p_parentBox')]",
  "elements": {
    "category": ".//a/text()",
    "link": {
      "col": ".//a/@href",
      "callback": "absolute_url"
    },
    "category_id": {
      "col": ".//a/@href",
      "function": {
        "regexp": "/product/(\\d+)/",
        "type": "string"
      }
    }
  }
}

The expected extraction results are as follows:

[
  {
    "category": "Antitumor Medication",
    "link": "https://www.example.com/product/12345/",
    "category_id": "12345"
  },
  {
    "category": "Cardiovascular Drugs",
    "link": "https://www.example.com/product/67890/",
    "category_id": "67890"
  }
]
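
To see what each field in the configuration selects, the same extraction can be reproduced outside Jexter. The following is a minimal Python sketch using only the standard library; it is not Jexter's own implementation. It assumes that the absolute_url callback resolves links against the page URL (PAGE_URL is a hypothetical value), wraps the fragment in a root element so that it parses as XML, and uses an exact class match because xml.etree.ElementTree does not support contains().

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical page URL, used only to illustrate how links could be resolved.
PAGE_URL = "https://www.example.com/categories/"

# The fragment from Example 1, wrapped in <root> so it is well-formed XML.
FRAGMENT = """<root>
<div class="p_parentBox">
  <a href="https://www.example.com/product/12345/">Antitumor Medication</a>
</div>
<div class="p_parentBox">
  <a href="https://www.example.com/product/67890/">Cardiovascular Drugs</a>
</div>
</root>"""


def extract_categories(markup, page_url=PAGE_URL):
    root = ET.fromstring(markup)
    rows = []
    # Counterpart of "total_rows": one node per category block.
    for row in root.findall(".//div[@class='p_parentBox']"):
        anchor = row.find(".//a")
        href = anchor.get("href")
        # Counterpart of the "regexp" function: capture the digits after /product/.
        match = re.search(r"/product/(\d+)/", href)
        rows.append({
            "category": anchor.text.strip(),
            # Assumed behaviour of "absolute_url": resolve against the page URL.
            "link": urljoin(page_url, href),
            "category_id": match.group(1) if match else None,
        })
    return rows


print(extract_categories(FRAGMENT))
```

Links that are already absolute pass through urljoin unchanged, which is why the results keep the original href values. Under the same assumption, a relative href such as /product/12345/ would resolve to https://www.example.com/product/12345/.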

Example 2: Extracting Elements with a Specific Path Structure in the Link

Consider the following HTML fragment, containing a single category link whose path encodes the category ID:

<div class="formMiddleContent482">
  <a href="https://www.example.com/path/0_482_54321.html">Anti-infection Drugs</a>
</div>

To extract the product category, the link, and the category ID encoded in the link path, we can use the following configuration:

{
  "total_rows": "//div[contains(@class, 'formMiddleContent482')]//a",
  "elements": {
    "category": {
      "col": "./text()",
      "callback": "text"
    },
    "link": {
      "col": "./@href",
      "callback": "absolute_url"
    },
    "category_id": {
      "col": "./@href",
      "function": {
        "regexp": "0_482_(\\d+)\\.html",
        "type": "string"
      }
    }
  }
}

This will produce the following extraction results:

[
  {
    "category": "Anti-infection Drugs",
    "link": "https://www.example.com/path/0_482_54321.html",
    "category_id": "54321"
  }
]
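
One detail is worth checking in patterns of this kind. In a regular expression an unescaped dot matches any character, so a literal .html suffix is normally written as \.html (which becomes \\.html inside a JSON string). The sketch below uses Python's re module to show the difference. It assumes that Jexter applies the pattern with search-style semantics, which is an assumption made for illustration, and the malformed link is a hypothetical input.

```python
import re

LOOSE = r"0_482_(\d+).html"    # "." matches any single character
STRICT = r"0_482_(\d+)\.html"  # "\." matches a literal dot only


def category_id(pattern, href):
    """Return the first captured group, or None when the pattern does not match."""
    match = re.search(pattern, href)
    return match.group(1) if match else None


good = "https://www.example.com/path/0_482_54321.html"
odd = "https://www.example.com/path/0_482_54321xhtml"  # hypothetical malformed link

# Both patterns accept the well-formed link.
print(category_id(LOOSE, good), category_id(STRICT, good))
# Only the loose pattern accepts the malformed one.
print(category_id(LOOSE, odd), category_id(STRICT, odd))
```

Both patterns return the same ID for the link in this example. The stricter form only matters when a page contains links that resemble the target path without actually ending in .html.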

Example 3: Extracting Category ID Using Data Attributes

Finally, consider the following HTML fragment, where elements are marked with a data-cateid attribute identifying the category ID:

<div class="w-com-menu-in">
  <ul>
    <li data-cateid="98765">
      <div><a href="https://www.example.com/product/98765/">Healthcare Products</a></div>
    </li>
    <li data-cateid="54321">
      <div><a href="https://www.example.com/product/54321/">Traditional Chinese Medicine</a></div>
    </li>
  </ul>
</div>

We can extract the product category, link, and category ID through the following Jexter configuration:

{
  "total_rows": "//div[@class='w-com-menu-in']/ul/li",
  "elements": {
    "category": "./div/a/text()",
    "link": {
      "col": "./div/a/@href",
      "callback": "absolute_url"
    },
    "category_id": {
      "col": "./@data-cateid",
      "type": "string"
    }
  }
}
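
The records this configuration should yield can be worked out with a small Python sketch (standard library only, not Jexter itself). Because the ID is read directly from data-cateid, no regular expression is needed for extraction. The loop at the end additionally compares the attribute with the ID embedded in the link; that comparison is an optional sanity check and not part of the configuration.

```python
import re
import xml.etree.ElementTree as ET

# The fragment from Example 3; it has a single root element, so it parses as-is.
FRAGMENT = """<div class="w-com-menu-in">
  <ul>
    <li data-cateid="98765">
      <div><a href="https://www.example.com/product/98765/">Healthcare Products</a></div>
    </li>
    <li data-cateid="54321">
      <div><a href="https://www.example.com/product/54321/">Traditional Chinese Medicine</a></div>
    </li>
  </ul>
</div>"""


def extract_menu(markup):
    root = ET.fromstring(markup)  # root is the div with class w-com-menu-in
    rows = []
    # Counterpart of "total_rows": one node per <li> under the menu.
    for item in root.findall("./ul/li"):
        anchor = item.find("./div/a")
        rows.append({
            "category": anchor.text,
            "link": anchor.get("href"),
            # Counterpart of "./@data-cateid": read the attribute directly.
            "category_id": item.get("data-cateid"),
        })
    return rows


rows = extract_menu(FRAGMENT)
for row in rows:
    # Optional check: does the attribute agree with the ID in the link?
    link_id = re.search(r"/product/(\d+)/", row["link"]).group(1)
    print(row, "id matches link:", link_id == row["category_id"])
```

Each li element yields one record. The first has category Healthcare Products with category_id 98765, and the second has category Traditional Chinese Medicine with category_id 54321.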