Select Attachments

In this tutorial, we demonstrate how to use Jexter to extract attachment data from HTML. The process is introduced briefly in our DP2 for Beginners guide; this tutorial focuses on extracting detailed information from HTML.

Here is a simplified example of an HTML page with a product details section that includes product images, contact information, the official website link, and the company logo.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Product Details Page</title>
</head>
<body>
    <div>
        <h1>Product Details</h1>
        <dl class="pro_detail_op">
            <dt>Product Images</dt>
            <dd>
                <img src="http://www.example.com/image1.jpg" alt="Product Image 1">
                <img src="http://www.example.com/image2.jpg" alt="Product Image 2">
            </dd>
            <dt>Special Image</dt>
            <dd><img src="http://www.example.com/special_image.jpg" alt="Special Image"></dd>
            <dt>Contact Information</dt>
            <dd>
                <a href="tencent://message/?uin=328836088&Site=Songlu&Menu=yes">Contact Us</a>
            </dd>
            <dt>Official Website Link</dt>
            <dd>
                <a href="https://www.example.net">Visit Official Website</a>
            </dd>
            <dt>Company Logo</dt>
            <dd>
                <img src="http://www.example.com/logo.svg" alt="Company Logo">
            </dd>
        </dl>
    </div>
</body>
</html>

Jexter Configuration and Extraction Results

Below are three different Jexter configurations and their corresponding extraction results.

Basic Extraction Configuration

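This configuration extracts every attachment found within <dl class="pro_detail_op">. The extract_attachments object is left empty, so no attachment types are excluded. As the results show, the four image files are returned, while the contact link and the official website link are not listed as attachments.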

{
  "attachments": {
    "innerHtml": "//dl[@class='pro_detail_op']",
    "extract_attachments": {}
  }
}

Extraction Results (Basic Extraction)

{
  "attachments": [
    {
      "task_fp": "...",
      "dp2_id": 32866965,
      "title": "Product Image 1",
      "link": "http://www.example.com/image1.jpg",
      "type": "jpg"
    },
    {
      "task_fp": "...",
      "dp2_id": 32866966,
      "title": "Product Image 2",
      "link": "http://www.example.com/image2.jpg",
      "type": "jpg"
    },
    {
      "task_fp": "...",
      "dp2_id": 32866967,
      "title": "Special Image",
      "link": "http://www.example.com/special_image.jpg",
      "type": "jpg"
    },
    {
      "task_fp": "...",
      "dp2_id": 32866968,
      "title": "Company Logo",
      "link": "http://www.example.com/logo.svg",
      "type": "svg"
    }
  ]
}
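
To make the result above easier to reason about, here is a minimal Python sketch that produces the same title, link, and type fields from a trimmed copy of the sample page. It is not Jexter's implementation: the AttachmentCollector class and collect_attachments function are hypothetical names, the sketch assumes that only <img> elements count as attachments and that the type is taken from the file extension of the URL, and it does not generate task_fp or dp2_id. The image placed outside the <dl> is added here only to show that the scope is respected.

```python
from html.parser import HTMLParser
from os.path import splitext
from urllib.parse import urlparse

# Trimmed copy of the tutorial page; the last <img> is a hypothetical
# addition used to show that content outside the <dl> is ignored.
SAMPLE_HTML = """
<dl class="pro_detail_op">
  <dd>
    <img src="http://www.example.com/image1.jpg" alt="Product Image 1">
    <img src="http://www.example.com/image2.jpg" alt="Product Image 2">
  </dd>
  <dd><img src="http://www.example.com/special_image.jpg" alt="Special Image"></dd>
  <dd><a href="https://www.example.net">Visit Official Website</a></dd>
  <dd><img src="http://www.example.com/logo.svg" alt="Company Logo"></dd>
</dl>
<img src="http://www.example.com/outside.png" alt="Outside the target element">
"""


class AttachmentCollector(HTMLParser):
    """Collects <img> elements nested inside <dl class="pro_detail_op">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside the target <dl>
        self.attachments = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "dl" and (self.depth or attrs.get("class") == "pro_detail_op"):
            self.depth += 1
        elif tag == "img" and self.depth:
            link = attrs.get("src") or ""
            # Assumption: the attachment type is the URL's file extension.
            ext = splitext(urlparse(link).path)[1].lstrip(".").lower()
            self.attachments.append(
                {"title": attrs.get("alt") or "", "link": link, "type": ext}
            )

    def handle_endtag(self, tag):
        if tag == "dl" and self.depth:
            self.depth -= 1


def collect_attachments(html):
    """Parse html and return the attachment records found in the target <dl>."""
    parser = AttachmentCollector()
    parser.feed(html)
    parser.close()
    return parser.attachments


attachments = collect_attachments(SAMPLE_HTML)
for item in attachments:
    print(item["type"], item["title"], item["link"])
```

Running the sketch lists the three JPG images and the SVG logo in document order, matching the titles, links, and types in the result above.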

Configuration to Exclude Specific Attachment Types

This configuration uses types_excluded to exclude SVG attachments, so only the three JPG images are extracted.

{
  "attachments": {
    "innerHtml": "//dl[@class='pro_detail_op']",
    "extract_attachments": {
      "types_excluded": ["svg"]
    }
  }
}

Extraction Results (Exclude SVG)

{
  "attachments": [
    {
      "task_fp": "...",
      "dp2_id": 32866965,
      "title": "Product Image 1",
      "link": "http://www.example.com/image1.jpg",
      "type": "jpg"
    },
    {
      "task_fp": "...",
      "dp2_id": 32866966,
      "title": "Product Image 2",
      "link": "http://www.example.com/image2.jpg",
      "type": "jpg"
    },
    {
      "task_fp": "...",
      "dp2_id": 32866967,
      "title": "Special Image",
      "link": "http://www.example.com/special_image.jpg",
      "type": "jpg"
    }
  ]
}
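
The effect of types_excluded can be pictured as a filter over the basic extraction result. The sketch below is an illustration only, not Jexter's code: exclude_types is a hypothetical helper, the records are trimmed copies of the four results shown earlier, and the case-insensitive comparison is an assumption.

```python
def exclude_types(attachments, types_excluded):
    """Return the attachments whose "type" is not listed in types_excluded."""
    # Assumption: type names are compared case-insensitively.
    excluded = {t.lower() for t in types_excluded}
    return [a for a in attachments if a["type"].lower() not in excluded]


# Trimmed copies of the four records from the basic extraction result.
extracted = [
    {"title": "Product Image 1", "link": "http://www.example.com/image1.jpg", "type": "jpg"},
    {"title": "Product Image 2", "link": "http://www.example.com/image2.jpg", "type": "jpg"},
    {"title": "Special Image", "link": "http://www.example.com/special_image.jpg", "type": "jpg"},
    {"title": "Company Logo", "link": "http://www.example.com/logo.svg", "type": "svg"},
]

kept = exclude_types(extracted, ["svg"])
print([a["title"] for a in kept])
```

With ["svg"] as the exclusion list, the Company Logo record is dropped and the three JPG records remain, as in the result above.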

Extraction Results (Specific Sub-elements)

{
  "attachments": [
    {
      "task_fp": "...",
      "dp2_id": 32866967,
      "title": "Special Image",
      "link": "http://www.example.com/special_image.jpg",
      "type": "jpg"
    }
  ]
}

Conclusion

These examples show how Jexter configurations can be adjusted to customize what is extracted. Choose the configuration that matches the HTML structure of your page and the data you need.