PubMed Search Scraper
This tool extracts research papers and academic records from PubMed based on keyword searches. It provides structured article metadata for researchers, analysts, and data engineers. The PubMed Search Scraper streamlines literature gathering and helps users build research-ready datasets with ease.
Created by Bitbash and built to showcase our approach to scraping and automation.
Introduction
PubMed Search Scraper retrieves detailed article information from PubMed search results, enabling automated literature collection for biomedical and academic research.
It replaces the manual gathering of large sets of scientific papers with consistent, structured data.
It is suited to students, academics, analysts, and anyone who tracks research trends or works with scientific datasets.
Research Metadata Extraction
- Retrieves detailed article metadata, abstracts, tags, and citation formats.
- Supports scrolling pagination to capture extensive result sets.
- Handles article authors, identifiers, journal information, and social share links.
- Allows configurable result limits for targeted data extraction.
- Optimizes extraction with built-in handling for large query outputs.
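To illustrate how pagination and a configurable result limit can work together, here is a minimal sketch. It is not the scraper's actual code: `collect_results`, `fetch_page`, `max_items`, and `fake_fetch` are hypothetical names, and the page loader is a stand-in that returns fake records instead of calling PubMed.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, object]


def collect_results(
    fetch_page: Callable[[int], List[Record]],
    max_items: int,
) -> Iterator[Record]:
    """Yield records page by page until max_items is reached or pages run out."""
    page = 1
    yielded = 0
    while yielded < max_items:
        batch = fetch_page(page)
        if not batch:  # an empty page signals the end of the result set
            return
        for record in batch:
            if yielded >= max_items:
                return
            yield record
            yielded += 1
        page += 1


# Stand-in for a real page loader (assumption): 3 pages of 10 fake records each.
def fake_fetch(page: int) -> List[Record]:
    if page > 3:
        return []
    start = (page - 1) * 10
    return [{"pmid": str(start + i)} for i in range(10)]


limited = list(collect_results(fake_fetch, max_items=25))
everything = list(collect_results(fake_fetch, max_items=1000))
```

Writing the collector as a generator means a low limit stops requesting pages as soon as enough records have been gathered.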
Features
| Feature | Description |
|---|---|
| Keyword-based article scraping | Extracts articles based on customized PubMed queries. |
| Complete metadata extraction | Retrieves titles, authors, citations, journal info, PMIDs, tags, abstracts, and share links. |
| Pagination handling | Automatically scrolls and collects more items for large datasets. |
| Anti-blocking techniques | Ensures stable extraction during long or heavy searches. |
| Configurable limits | Control max items to manage performance and dataset size. |
What Data This Scraper Extracts
| Field Name | Field Description |
|---|---|
| title | Full title of the research article. |
| articleId | Unique PubMed article identifier. |
| articleUrl | Direct link to the article page. |
| authors.full | Complete list of article authors. |
| authors.short | Shortened author representation. |
| citation.full | Complete journal citation text. |
| citation.short | Abbreviated citation format. |
| pmid | PubMed ID reference. |
| tags | Article classification tags. |
| abstract.full | Full research abstract content. |
| abstract.short | Truncated preview of the abstract. |
| shareLinks | Social sharing URLs for platforms like Twitter and Facebook. |
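The dotted field names in the table (for example `authors.full`) correspond to nested objects in the JSON output. If a flat layout is more convenient, a small helper along these lines could map one onto the other. The `flatten` function is a hypothetical illustration and not part of the scraper; the record is trimmed from the sample output.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts into a single dict with dotted keys; lists are kept as-is."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


record = {
    "title": "Rheumatoid arthritis.",
    "pmid": "27156434",
    "authors": {
        "full": "Smolen JS, Aletaha D, McInnes IB.",
        "short": "Smolen JS, et al.",
    },
    "tags": ["Free article.", "Review."],
}

flat = flatten(record)
# flat now has the keys "title", "pmid", "authors.full", "authors.short", "tags"
```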
Example Output
[
  {
    "title": "Rheumatoid arthritis.",
    "articleId": "27156434",
    "articleUrl": "https://pubmed.ncbi.nlm.nih.gov/27156434/",
    "authors": {
      "full": "Smolen JS, Aletaha D, McInnes IB.",
      "short": "Smolen JS, et al."
    },
    "citation": {
      "full": "Lancet. 2016 Oct 22;388(10055):2023-2038. doi: 10.1016/S0140-6736(16)30173-8. Epub 2016 May 3.",
      "short": "Lancet. 2016."
    },
    "pmid": "27156434",
    "tags": ["Free article.", "Review."],
    "abstract": {
      "full": "Rheumatoid arthritis is a chronic inflammatory joint disease, which can cause cartilage and bone damage as well as disability...",
      "short": "Rheumatoid arthritis is a chronic inflammatory joint disease..."
    },
    "shareLinks": {
      "twitter": "http://twitter.com/intent/tweet?text=Rheumatoid%20arthritis.%20https%3A//pubmed.ncbi.nlm.nih.gov/27156434/",
      "facebook": "http://www.facebook.com/sharer/sharer.php?u=https%3A//pubmed.ncbi.nlm.nih.gov/27156434/",
      "permalink": "https://pubmed.ncbi.nlm.nih.gov/27156434/"
    }
  }
]
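The `shareLinks` values in the sample follow a recognizable pattern: the title and article URL are percent-encoded and appended to a fixed sharing endpoint. The sketch below reproduces that pattern with the standard library. It is only a reconstruction from the sample record; the scraper may simply read these links from the page, and `build_share_links` is a hypothetical name.

```python
from urllib.parse import quote


def build_share_links(title: str, article_url: str) -> dict:
    """Rebuild share links in the shape shown in the sample output."""
    # quote() keeps "/" unescaped by default, which matches the sample URLs.
    text = quote(f"{title} {article_url}")
    return {
        "twitter": f"http://twitter.com/intent/tweet?text={text}",
        "facebook": f"http://www.facebook.com/sharer/sharer.php?u={quote(article_url)}",
        "permalink": article_url,
    }


links = build_share_links(
    "Rheumatoid arthritis.",
    "https://pubmed.ncbi.nlm.nih.gov/27156434/",
)
```

This can be handy for checking that a scraped record's share links are consistent with its `title` and `articleUrl`.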
Directory Structure Tree
PubMed Search Scraper/
├── src/
│ ├── main.py
│ ├── extractors/
│ │ ├── pubmed_parser.py
│ │ └── utils_formatting.py
│ ├── pagination/
│ │ └── scroll_handler.py
│ └── config/
│ └── settings.example.json
├── data/
│ ├── input.sample.json
│ └── sample_output.json
├── tests/
│ ├── test_parser.py
│ └── test_end_to_end.py
├── requirements.txt
└── README.md
Use Cases
- Medical researchers gather articles for systematic reviews and meta-analyses to accelerate scientific discovery.
- Data analysts track publication trends and build research intelligence dashboards for organizational insights.
- Academic institutions automate literature collection to support course development or research groups.
- Healthcare companies monitor emerging biomedical findings to stay aligned with innovation.
- Students streamline their thesis and dissertation research by automating article retrieval.
FAQs
Q: Can this scraper handle large search result sets?
A: Yes, it includes pagination logic allowing it to scroll through extensive lists while maintaining stable performance.
Q: What format are the results stored in?
A: Output is generated as structured JSON, with flexibility to export to CSV, Excel, HTML, JSONL, or XML.
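If you want to convert the JSON output yourself, the Python standard library is enough for JSONL and CSV. The following is a rough sketch rather than the tool's own export code: `to_jsonl` and `to_csv` are hypothetical helpers, the record is trimmed from the sample output, and joining `tags` with `"; "` is an arbitrary choice made for this example.

```python
import csv
import io
import json

records = [
    {
        "title": "Rheumatoid arthritis.",
        "pmid": "27156434",
        "articleUrl": "https://pubmed.ncbi.nlm.nih.gov/27156434/",
        "tags": ["Free article.", "Review."],
    },
]


def to_jsonl(items) -> str:
    """Serialize each record as one JSON document per line."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


def to_csv(items, columns) -> str:
    """Write the chosen columns as CSV; list-valued tags are joined into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for item in items:
        row = dict(item)
        row["tags"] = "; ".join(item.get("tags", []))
        writer.writerow(row)
    return buffer.getvalue()


jsonl_text = to_jsonl(records)
csv_text = to_csv(records, ["pmid", "title", "tags"])
```

Nested fields such as `authors` or `abstract` would need to be flattened or selected explicitly before they fit into CSV columns.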
Q: Does it support multiple search URLs at once?
A: Yes, you can provide multiple search URLs, and the tool will aggregate results across all queries.
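When several queries overlap, the same article can appear in more than one result set. Whether the tool removes such duplicates is not stated here, so the sketch below is only one way to merge results downstream, keyed on `pmid`. The function name `merge_results` and the sample PMIDs other than `27156434` are made up for illustration.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def merge_results(result_sets: Iterable[List[Record]]) -> List[Record]:
    """Concatenate result sets, keeping the first record seen for each pmid."""
    seen = set()
    merged: List[Record] = []
    for results in result_sets:
        for record in results:
            pmid = record.get("pmid")
            # Records without a pmid cannot be compared, so they are always kept.
            if pmid is not None:
                if pmid in seen:
                    continue
                seen.add(pmid)
            merged.append(record)
    return merged


query_a = [{"pmid": "27156434"}, {"pmid": "111"}]
query_b = [{"pmid": "111"}, {"pmid": "222"}]

merged = merge_results([query_a, query_b])
```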
Q: How accurate is the metadata extraction?
A: It mirrors the structure of PubMed article pages and consistently captures titles, authors, citations, abstracts, and IDs with high precision.
Performance Benchmarks and Results
Primary Metric: Handles up to hundreds of article results per minute under typical conditions, depending on query breadth.
Reliability Metric: Maintains a stable success rate across long paginated searches with minimal failures.
Efficiency Metric: Optimized metadata extraction ensures low overhead even when collecting full abstracts and citations.
Quality Metric: Delivers highly complete and structured metadata, suitable for academic and analytical workflows.
