Extracting Drug Information in detail_step
Description
This tutorial walks through a detailed example of configuring Jexter to extract drug names, approval numbers, company names, attachment links, and drug reference information.
Detailed Case Explanation
Let’s assume you have a web page with the following structure, which contains drug information:
<!-- Example HTML structure -->
<table>
  <tr>
    <td class="drug-name">Drug A</td>
    <td class="approval-number">[Batch 123456]</td>
    <td class="company-name">Company X</td>
  </tr>
  <tr>
    <td class="drug-name">Drug B</td>
    <td class="approval-number">[Batch 789012]</td>
    <td class="company-name">Company Y</td>
  </tr>
</table>
<div class="attachments">
  <a href="attachment1.pdf">Attachment 1</a>
  <a href="attachment2.pdf">Attachment 2</a>
</div>
<div class="reference">
  <p class="reference-cell">This is a detailed reference for Drug A.</p>
  <p class="reference-cell">This is a detailed reference for Drug B.</p>
</div>
We will extract the following details:
Drug Name (drug_name)
Approval Number (auth_num)
Company Name (company)
Attachment Links (attachments)
Drug Reference Information (drug_reference)
Jexter Configuration
Here’s a Jexter configuration file that targets the example HTML structure:
{
  "elements": {
    "drug_name": {
      "col": "//td[@class='drug-name']"
    },
    "auth_num": {
      "col": "//td[@class='approval-number']",
      "function": {
        "regexp": "\\[Batch (\\d+)\\]",
        "type": "string",
        "return": [1]
      }
    },
    "company": {
      "col": "//td[@class='company-name']"
    },
    "attachments": {
      "innerHtml": "//div[@class='attachments']/a",
      "extract_attachments": {}
    },
    "drug_reference": {
      "innerHtml": "//div[@class='reference']/p"
    }
  }
}
Explanation of Configuration
drug_name: Uses XPath to select the <td> element with the class drug-name to extract the drug name.
auth_num: Uses XPath to select the <td> element with the class approval-number and applies a regular expression to extract the approval number. The return array specifies the first capturing group, which is the number after “Batch”.
company: Uses XPath to select the <td> element with the class company-name to extract the company name.
attachments: Uses innerHtml to select the <a> elements within the <div> with the class attachments. The extract_attachments field is used to process the extracted links.
drug_reference: Uses innerHtml to select the <p> elements within the <div> with the class reference to extract the drug reference information.
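To see what the XPath expressions and the regular expression above select, it can help to try them outside Jexter. The following is a minimal sketch using only Python's standard library, not Jexter itself: the sample HTML is wrapped in a hypothetical <root> element so it parses as XML, the XPath expressions are written in the relative form that ElementTree accepts, and it assumes that attachment processing yields the href values and that reference extraction yields the paragraph text. Jexter's actual output may be shaped differently.

```python
import re
import xml.etree.ElementTree as ET

# Sample page from this tutorial, wrapped in a <root> element (an assumption
# made here so that the fragment parses as a single XML document).
PAGE = """<root>
<table>
  <tr>
    <td class="drug-name">Drug A</td>
    <td class="approval-number">[Batch 123456]</td>
    <td class="company-name">Company X</td>
  </tr>
  <tr>
    <td class="drug-name">Drug B</td>
    <td class="approval-number">[Batch 789012]</td>
    <td class="company-name">Company Y</td>
  </tr>
</table>
<div class="attachments">
  <a href="attachment1.pdf">Attachment 1</a>
  <a href="attachment2.pdf">Attachment 2</a>
</div>
<div class="reference">
  <p class="reference-cell">This is a detailed reference for Drug A.</p>
  <p class="reference-cell">This is a detailed reference for Drug B.</p>
</div>
</root>"""

root = ET.fromstring(PAGE)


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [element.text for element in root.findall(path)]


# Same selectors as the "col" entries, in ElementTree's relative form.
drug_names = texts(".//td[@class='drug-name']")
raw_auth_nums = texts(".//td[@class='approval-number']")
companies = texts(".//td[@class='company-name']")

# Same pattern as "regexp" once the JSON escaping is removed.
BATCH_PATTERN = re.compile(r"\[Batch (\d+)\]")

# Keep capturing group 1, mirroring "return": [1] in the configuration.
auth_nums = []
for raw in raw_auth_nums:
    match = BATCH_PATTERN.search(raw)
    if match:
        auth_nums.append(match.group(1))

# Assumption: attachment processing yields the link targets.
attachments = [a.get("href") for a in root.findall(".//div[@class='attachments']/a")]

# Assumption: reference extraction yields the paragraph text.
drug_references = texts(".//div[@class='reference']/p")
```

With the sample page, each selector matches two elements, and the capturing group keeps only the digits from each approval-number cell.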
This tutorial has shown how to use Jexter to extract drug information from web pages. After configuring your extraction rules, review the output to confirm that it matches your expectations. If it does not, adjust the XPath expressions and regular expressions so that they match the structure of the target web page more closely, then save your settings. Complete this check before proceeding to the last study step, so that the collected data is accurate.
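The review described above can be partly automated. The sketch below is one possible approach, not a Jexter feature: it assumes the extracted rows have been exported as a list of dictionaries keyed by the field names from the configuration, and the helper name find_problems is hypothetical.

```python
import re

# Fields that every extracted row is expected to fill in.
REQUIRED_FIELDS = ("drug_name", "auth_num", "company")


def find_problems(records):
    """Return (row_index, field, reason) tuples for rows that look wrong."""
    problems = []
    for index, record in enumerate(records):
        # Flag fields that are missing, None, or whitespace only.
        for field in REQUIRED_FIELDS:
            value = (record.get(field) or "").strip()
            if not value:
                problems.append((index, field, "empty"))
        # A correctly extracted approval number contains digits only;
        # anything else suggests the regular expression did not match.
        auth_num = record.get("auth_num") or ""
        if auth_num and not re.fullmatch(r"\d+", auth_num):
            problems.append((index, "auth_num", "not all digits"))
    return problems


# Hypothetical rows: the second one still carries the raw cell text.
rows = [
    {"drug_name": "Drug A", "auth_num": "123456", "company": "Company X"},
    {"drug_name": "Drug B", "auth_num": "[Batch 789012]", "company": ""},
]
issues = find_problems(rows)
```

An empty result means no row tripped these checks. It does not prove the values are correct, so a manual spot check against the source page is still worthwhile.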