BioWikiNet: A Multilingual Biodiversity Knowledge Network from Wikipedia
Abstract
BioWikiNet is a comprehensive biodiversity knowledge network constructed from Wikipedia articles across 11 major languages. This repository contains the complete data processing pipeline and analysis code for extracting, linking, and analyzing biodiversity-related content from Wikipedia, integrated with taxonomic data from the Global Biodiversity Information Facility (GBIF) and Wikidata.
Key Features
- Multilingual Coverage: Processes Wikipedia in Arabic, German, English, Spanish, French, Hindi, Indonesian, Japanese, Portuguese, Russian, and Chinese
- Automated Extraction: Identifies biology-related articles using template detection (Taxobox, Speciesbox, etc.)
- Data Integration: Links articles to GBIF backbone taxonomy and Wikidata identifiers
- Network Analysis: Computes network metrics including centrality indices and connectivity measures
- Reproducible Pipeline: Fully automated workflow built on the R {targets} framework
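As a rough illustration of the template-based detection mentioned above, biology-related articles can be identified by scanning wikitext for taxonomy infobox invocations. This is a minimal sketch, not the repository's actual implementation; the function name and template list are illustrative (see `R/json_article2df.R` for the real logic).

```r
# Sketch only: detect biology-related articles via taxonomy templates.
# Template names vary across language editions; this list is illustrative.
library(stringr)

biology_templates <- c("Taxobox", "Speciesbox", "Automatic taxobox")

is_biology_article <- function(wikitext) {
  # An article counts as biology-related if its wikitext invokes one of the
  # taxonomy infobox templates, e.g. "{{Speciesbox" or "{{Taxobox".
  pattern <- str_c("\\{\\{\\s*(", str_c(biology_templates, collapse = "|"), ")")
  str_detect(wikitext, regex(pattern, ignore_case = TRUE))
}

is_biology_article("{{Speciesbox | genus = Panthera | species = leo}}")  # TRUE
```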
System Requirements
- R (>= 4.4.3)
- Git
- Minimum 16 GB RAM (32 GB recommended for processing all languages)
- ~500 GB disk space for Wikipedia dumps and processed data
Installation
```sh
# Clone repository
git clone https://github.com/uribo/BioWikiNet.git
cd BioWikiNet

# Restore R environment
Rscript -e "renv::restore()"
```
Data Availability
This repository contains all code needed to generate the final datasets; the final data products themselves are published on Figshare: https://doi.org/10.6084/m9.figshare.29431577.v2
To reproduce the final datasets, you can either:
- Run the complete targets pipeline (see below), or
- Download intermediate products from the Figshare resources and place them in the `_targets/` directory
Reproducibility
Running the Complete Pipeline
```sh
# Execute full pipeline
Rscript -e "targets::tar_make()"

# Visualize pipeline structure
Rscript -e "targets::tar_visnetwork()"
```
Pipeline Stages
- Data Acquisition: Downloads Wikipedia NDJSON dumps for target languages
- Article Filtering: Identifies biology-related articles using template patterns
- Metadata Extraction: Extracts article metadata, links, and content
- Taxonomic Enrichment: Queries Wikidata for taxonomic identifiers
- GBIF Matching: Links entities to GBIF backbone taxonomy
- Network Construction: Builds article-to-article link networks
- Metric Calculation: Computes network centrality and connectivity indices
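The seven stages above map naturally onto a {targets} dependency graph. The following is a hedged sketch of that structure; all target and function names here are hypothetical placeholders, not the actual definitions in `_targets.R`.

```r
# Illustrative {targets} pipeline skeleton mirroring the seven stages.
# Function names (download_wiki_dumps, filter_biology, ...) are placeholders.
library(targets)

list(
  tar_target(languages, c("ar", "de", "en", "es", "fr", "hi",
                          "id", "ja", "pt", "ru", "zh")),
  tar_target(dump_files, download_wiki_dumps(languages)),      # 1. acquisition
  tar_target(bio_articles, filter_biology(dump_files)),        # 2. filtering
  tar_target(article_meta, extract_metadata(bio_articles)),    # 3. metadata
  tar_target(wikidata_ids, query_wikidata(article_meta)),      # 4. enrichment
  tar_target(gbif_keys, match_gbif_backbone(wikidata_ids)),    # 5. GBIF matching
  tar_target(link_network, build_link_network(article_meta)),  # 6. network
  tar_target(metrics, compute_network_metrics(link_network))   # 7. metrics
)
```

Because {targets} caches each target in the `_targets/` store, re-running `tar_make()` only rebuilds stages whose inputs changed.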
Repository Structure
BioWikiNet/
├── app/
│ ├── app.R # Shiny UI/server (visNetwork, DT, i18n, etc.)
│ ├── _targets.yaml # Points to ../_targets/ store
│ └── translations/translation_ja.json # Multilingual UI strings
├── R/ # Core processing modules
│ ├── combine.R
│ ├── gbif.R
│ ├── json_article2df.R
│ ├── links.R
│ ├── parquet.R
│ ├── valid.R
│ ├── wikidata_query.R
│ └── wikipedia_extra_info.R
├── data/ # Output datasets
├── data-raw/ # Raw input data (GBIF backbone, etc.)
├── renv/ # renv configuration
├── _targets/ # Targets pipeline cache
├── _targets.R # Pipeline definition (1,875+ lines)
└── renv.lock # R package dependencies
Computed Network Metrics
The pipeline calculates the following network-based indicators:
- SCI (Species Connectivity Index): Measures species connectivity patterns
- Core Index: Quantifies centrality within the biodiversity knowledge network
- Excess Focus: Identifies disproportionate concentration on specific topics
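For orientation, standard centrality and connectivity measures on an article-link network can be computed with {igraph}, as sketched below. This toy example only shows generic measures; the repository's SCI, Core Index, and Excess Focus definitions live in the pipeline code and are not reproduced here.

```r
# Toy article-to-article link network with generic centrality measures.
library(igraph)

g <- graph_from_edgelist(
  matrix(c("Lion",     "Panthera",
           "Tiger",    "Panthera",
           "Panthera", "Felidae"),
         ncol = 2, byrow = TRUE),
  directed = TRUE
)

degree(g, mode = "in")   # in-degree: incoming wiki links per article
betweenness(g)           # betweenness centrality
page_rank(g)$vector      # PageRank as a global connectivity measure
```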
Explorer App (Shiny)
An interactive Shiny application for exploring the BioWikiNet network lives in app/. It provides a multilingual UI, taxon search and filtering, network visualization, per-node statistics, and GBIF details.
```sh
# Run the app
Rscript -e "shiny::runApp('app', launch.browser = TRUE)"
```
License
This project is released under the MIT License.
Contact
For questions or collaboration inquiries:
- Email: uryu.shinya@tokushima-u.ac.jp
- GitHub Issues
Acknowledgments
This work acknowledges the contributions of:
- Wikidata community for structured biodiversity information
- Wikipedia editors across all supported languages
- GBIF for providing taxonomic backbone data