Repos: 29
Stars: 428
Forks: 108
Top Language: Java
Top Repositories
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
SPRUCE is an open-source enrichment platform for GreenOps which helps measure and reduce the environmental impact of cloud computing.
Resources for running StormCrawler with Docker services
Crawl configurations for benchmarking / testing StormCrawler
Use cases for DigitalPebble's TextClassification API
Repositories
29
SPRUCE is an open-source enrichment platform for GreenOps which helps measure and reduce the environmental impact of cloud computing.
Estimate the environmental impact of GitHub Actions for your entire organization.
Resources for the DigitalPebble website
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
Resources for running StormCrawler with Docker services
Java API for querying an N-Grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format
Enrichment pipeline for CUR / FOCUS reports which adds energy and carbon data, making it possible to report and reduce the impact of your cloud usage.
Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
Crawl configurations for benchmarking / testing StormCrawler
Wraps the charset detection logic from StormCrawler as a Tika module
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
StormCrawler topology to evaluate the performance of different backends and configurations
Documentation for Docker Official Images in docker-library
URLFrontier client written in Rust (mostly as a way of learning Rust)
Apache Nutch is an extensible and scalable web crawler
Ansible playbook for deploying a Storm cluster
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
Mirror of Apache Storm
ElasticSearch module for Behemoth
Module for classifying Behemoth documents with a model from our Text Classification API
Support for the old (pre-2013) CommonCrawl dataset in Behemoth
GATE Processing Resource wrapping DigitalPebble's TextClassification API
Setup for crawling tescobank with SC
No description provided.
Use cases for DigitalPebble's TextClassification API
A set of reusable Java components that implement functionality common to any web crawler
WARC resources for StormCrawler
Resources for generating a corpus of docs from CC for Tika
Resources for comparison between versions 1.8 and 2.x of Apache Nutch