Step Configuration ④ - Detail Step
Configuring Detail Steps (detail_step)
The detail_step of a Study collects more in-depth information from the detail page behind each link extracted from the list page. Configuring the detail_step correctly is crucial for building a complete dataset and analysis.
Step Description
The detail_step phase typically follows directly after the list_step phase and is responsible for extracting specific data from the detail page of each list item. This data may include, but is not limited to:
Task ID (dp2_id)
Company name (company)
Product name (drug_name)
Approval number (auth_num)
Product specifications (specification)
Product description (drug_reference)
Attachments (attachments)
Example Configuration
{
  "data_in": {
    "data_for_test": [
      {
        "dp2_id": 12345678,
        "product_name": "Example Medication Name",
        "product_link": "https://www.examplepharm.com/product-detail?id=12345"
      }
    ]
  },
  "project_name": "examplepharm.drugs.detail",
  "url": "{product_link}",
  "type": "one-off",
  "priority": 2,
  "fetch_method": "direct",
  "method": "GET",
  "status": 1,
  "charset": "UTF-8",
  "charact_string_start": "",
  "charact_string_end": "",
  "add_only": 1,
  "excluded_workers": "{excluded_workers}",
  "interval": 5184000,
  "data_out": {
    "jpath": "",
    "api": {
      "url": "http://api2.example.cn/dp2/mongo/save",
      "table": "company.examplepharm.drugs",
      "type": "merge",
      "where": {
        "uniqueId": "{dp2_id}"
      }
    }
  }
}
In this configuration:
data_in contains the product information passed from the list_step phase: dp2_id, product_name, and product_link. Here, 12345678 is the example dp2_id, "Example Medication Name" is the product name, and https://www.examplepharm.com/product-detail?id=12345 is the product link.
project_name defines the name of the current Study, here examplepharm.drugs.detail.
The url field uses the {product_link} placeholder, which stands for the URL of the detail page.
Fields such as type, priority, fetch_method, and method define the request's type, priority, fetch method, and HTTP method.
The data_out field contains the details of the API call that saves the extracted data to the database. Here, the merge type is used, indicating that new data is merged into existing records.
The add_only field is set to 1, meaning that if a record already exists, it will not be updated.
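The placeholder mechanism described above can be illustrated with a small Python sketch. This is not the platform's actual implementation: the function name fill_placeholders and the rule that unknown placeholders are left untouched are assumptions made for illustration. It only shows how {product_link} and {dp2_id} could be resolved from a data_in record.

```python
import re

# Matches tokens such as {product_link} or {dp2_id}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_placeholders(value, record):
    """Recursively replace {name} tokens in strings with values from record.

    Hypothetical helper: tokens with no matching key in record are left
    as they are (an assumption, not documented platform behaviour).
    """
    if isinstance(value, str):
        return PLACEHOLDER.sub(
            lambda m: str(record[m.group(1)]) if m.group(1) in record else m.group(0),
            value,
        )
    if isinstance(value, dict):
        return {key: fill_placeholders(item, record) for key, item in value.items()}
    if isinstance(value, list):
        return [fill_placeholders(item, record) for item in value]
    return value


# One record as it appears under data_in.data_for_test in the example.
record = {
    "dp2_id": 12345678,
    "product_name": "Example Medication Name",
    "product_link": "https://www.examplepharm.com/product-detail?id=12345",
}

# A trimmed copy of the example configuration.
step = {
    "url": "{product_link}",
    "excluded_workers": "{excluded_workers}",
    "data_out": {"api": {"where": {"uniqueId": "{dp2_id}"}}},
}

resolved = fill_placeholders(step, record)
print(resolved["url"])
print(resolved["data_out"]["api"]["where"]["uniqueId"])
```

In this sketch the substituted values are always strings, so the numeric dp2_id becomes "12345678" in the where clause. Whether the real system preserves the original type is not stated in this documentation.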
Considerations
Ensure that the links in data_in are valid and correctly point to the detail pages.
In data_out, use placeholders (e.g., {dp2_id}) to represent the data extracted from the detail page; they are replaced by the actual data during the extraction process.
If the detail page contains dynamically loaded content, you may need to adjust the interval field to give the page enough time to load all content.
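As a rough pre-deployment check for the first consideration, the test links in data_in can be validated locally. The sketch below is an assumption-based helper (invalid_links is a hypothetical name, not part of the platform) and only checks that each product_link is a well-formed absolute http or https URL. It does not request the page, so it cannot confirm that the link really leads to a detail page.

```python
from urllib.parse import urlparse


def invalid_links(config, link_field="product_link"):
    """Return the data_for_test records whose link is not an absolute http(s) URL."""
    bad = []
    for record in config.get("data_in", {}).get("data_for_test", []):
        link = record.get(link_field, "")
        if not isinstance(link, str):
            bad.append(record)
            continue
        parts = urlparse(link)
        # A usable link needs both an http(s) scheme and a host name.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(record)
    return bad


config = {
    "data_in": {
        "data_for_test": [
            {
                "dp2_id": 12345678,
                "product_name": "Example Medication Name",
                "product_link": "https://www.examplepharm.com/product-detail?id=12345",
            },
            # Deliberately broken record to show what gets reported.
            {"dp2_id": 1, "product_name": "Broken", "product_link": "/product-detail?id=1"},
        ]
    }
}

for record in invalid_links(config):
    print("Invalid link for dp2_id", record["dp2_id"])
```

Relative links such as the second record are reported because the url field is filled directly from product_link, so the sketch assumes the list_step is expected to pass absolute URLs.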
Test your configuration thoroughly before deployment to ensure accuracy and prevent errors. Properly configuring the detail_step is crucial for the success of the next extraction phase: it ensures that the details behind each extracted drug link are accurately captured and stored.