BioWikiNet: A Multilingual Biodiversity Knowledge Network from Wikipedia
Abstract
BioWikiNet is a comprehensive biodiversity knowledge network constructed from Wikipedia articles across 11 major languages. This repository contains the complete data processing pipeline and analysis code for extracting, linking, and analyzing biodiversity-related content from Wikipedia, integrated with taxonomic data from the Global Biodiversity Information Facility (GBIF) and Wikidata.
Key Features
- Multilingual Coverage: Processes Wikipedia in Arabic, German, English, Spanish, French, Hindi, Indonesian, Japanese, Portuguese, Russian, and Chinese
- Automated Extraction: Identifies biology-related articles using template detection (Taxobox, Speciesbox, etc.)
- Data Integration: Links articles to GBIF backbone taxonomy and Wikidata identifiers
- Network Analysis: Computes network metrics including centrality indices and connectivity measures
- Reproducible Pipeline: Fully automated workflow built on the R {targets} framework
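As a rough illustration of the template-based detection mentioned above, biology-related articles can be identified by scanning wikitext for taxonomy infobox invocations. This is a minimal sketch, not the repository's actual implementation; the function name and template list are illustrative (see `R/json_article2df.R` for the real logic).

```r
# Sketch only: detect biology-related articles via taxonomy templates.
# Template names vary across language editions; this list is illustrative.
library(stringr)

biology_templates <- c("Taxobox", "Speciesbox", "Automatic taxobox")

is_biology_article <- function(wikitext) {
  # An article counts as biology-related if its wikitext invokes one of the
  # taxonomy infobox templates, e.g. "{{Speciesbox" or "{{Taxobox".
  pattern <- str_c("\\{\\{\\s*(", str_c(biology_templates, collapse = "|"), ")")
  str_detect(wikitext, regex(pattern, ignore_case = TRUE))
}

is_biology_article("{{Speciesbox | genus = Panthera | species = leo}}")  # TRUE
```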
System Requirements
- R (>= 4.4.3)
- Git
- Minimum 16 GB RAM (32 GB recommended for processing all languages)
- ~500 GB disk space for Wikipedia dumps and processed data
Installation
```sh
# Clone repository
git clone https://github.com/uribo/BioWikiNet.git
cd BioWikiNet

# Restore R environment
Rscript -e "renv::restore()"
```
Data Availability
This repository contains all code needed to generate the final datasets; the final data products themselves are published on Figshare: https://doi.org/10.6084/m9.figshare.29431577.v2
To reproduce the final datasets, you can either:
- Run the complete targets pipeline (see below), or
- Download intermediate products from the Figshare resources and place them in the `_targets/` directory
Reproducibility
Running the Complete Pipeline
```sh
# Execute full pipeline
Rscript -e "targets::tar_make()"

# Visualize pipeline structure
Rscript -e "targets::tar_visnetwork()"
```
Pipeline Stages
- Data Acquisition: Downloads Wikipedia NDJSON dumps for target languages
- Article Filtering: Identifies biology-related articles using template patterns
- Metadata Extraction: Extracts article metadata, links, and content
- Taxonomic Enrichment: Queries Wikidata for taxonomic identifiers
- GBIF Matching: Links entities to GBIF backbone taxonomy
- Network Construction: Builds article-to-article link networks
- Metric Calculation: Computes network centrality and connectivity indices
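The seven stages above map naturally onto a {targets} dependency graph. The following is a hedged sketch of that structure; all target and function names here are hypothetical placeholders, not the actual definitions in `_targets.R`.

```r
# Illustrative {targets} pipeline skeleton mirroring the seven stages.
# Function names (download_wiki_dumps, filter_biology, ...) are placeholders.
library(targets)

list(
  tar_target(languages, c("ar", "de", "en", "es", "fr", "hi",
                          "id", "ja", "pt", "ru", "zh")),
  tar_target(dump_files, download_wiki_dumps(languages)),      # 1. acquisition
  tar_target(bio_articles, filter_biology(dump_files)),        # 2. filtering
  tar_target(article_meta, extract_metadata(bio_articles)),    # 3. metadata
  tar_target(wikidata_ids, query_wikidata(article_meta)),      # 4. enrichment
  tar_target(gbif_keys, match_gbif_backbone(wikidata_ids)),    # 5. GBIF matching
  tar_target(link_network, build_link_network(article_meta)),  # 6. network
  tar_target(metrics, compute_network_metrics(link_network))   # 7. metrics
)
```

Because {targets} caches each target in the `_targets/` store, re-running `tar_make()` only rebuilds stages whose inputs changed.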
Repository Structure
BioWikiNet/
├── app/
│ ├── app.R # Shiny UI/server (visNetwork, DT, i18n, etc.)
│ ├── _targets.yaml # Points to ../_targets/ store
│ └── translations/translation_ja.json # Multilingual UI strings
├── R/ # Core processing modules
│ ├── combine.R
│ ├── gbif.R
│ ├── json_article2df.R
│ ├── links.R
│ ├── parquet.R
│ ├── valid.R
│ ├── wikidata_query.R
│ └── wikipedia_extra_info.R
├── data/ # Output datasets
├── data-raw/ # Raw input data (GBIF backbone, etc.)
├── renv/ # renv configuration
├── _targets/ # Targets pipeline cache
├── _targets.R # Pipeline definition (1,875+ lines)
└── renv.lock # R package dependencies
Computed Network Metrics
The pipeline calculates the following network-based indicators:
- SCI (Species Connectivity Index): Measures species connectivity patterns
- Core Index: Quantifies centrality within the biodiversity knowledge network
- Excess Focus: Identifies disproportionate concentration on specific topics
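For orientation, standard centrality and connectivity measures on an article-link network can be computed with {igraph}, as sketched below. This toy example only shows generic measures; the repository's SCI, Core Index, and Excess Focus definitions live in the pipeline code and are not reproduced here.

```r
# Toy article-to-article link network with generic centrality measures.
library(igraph)

g <- graph_from_edgelist(
  matrix(c("Lion",     "Panthera",
           "Tiger",    "Panthera",
           "Panthera", "Felidae"),
         ncol = 2, byrow = TRUE),
  directed = TRUE
)

degree(g, mode = "in")   # in-degree: incoming wiki links per article
betweenness(g)           # betweenness centrality
page_rank(g)$vector      # PageRank as a global connectivity measure
```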
Explorer App (Shiny)
An interactive Shiny application for exploring the BioWikiNet network lives in app/. It provides a multilingual UI, taxon search and filtering, network visualization, per-node statistics, and GBIF details.
```sh
# Run the app
Rscript -e "shiny::runApp('app', launch.browser = TRUE)"
```
License
This project is released under the MIT License.
Contact
For questions or collaboration inquiries:
- Email: uryu.shinya@tokushima-u.ac.jp
- GitHub Issues
Acknowledgments
This work acknowledges the contributions of:
- Wikidata community for structured biodiversity information
- Wikipedia editors across all supported languages
- GBIF for providing taxonomic backbone data