Sebastian Nagel
sebastian-nagel
Languages
Repos: 63
Stars: 25
Forks: 7
Top Language: Python
Top Repositories
Repositories: 63
No description provided.
Mirror of Apache Nutch
PyAthena is a Python DB API 2.0 (PEP 249) client for Amazon Athena.
Common web archive utility code.
Java library for reading and writing WARC files with a typed API
Compact Language Detector 2
compact_enc_det - Compact Encoding Detection
Konstanz in Zahlen: annual figures and facts about the city of Konstanz
No description provided.
A set of reusable Java components that implement functionality common to any web crawler
Apache Hadoop docker image
Experiments and metrics about robots.txt captures, presentation at #ossym2022
Sort-friendly URI Reordering Transform (SURT) python module
Create a file tree with the raw data from a zip file in usable format
No description provided.
Declarations of terms of major social media platforms. Maintained by the Platform Governance Archive team, University of Bremen.
No description provided.
No description provided.
Web crawler SDK based on Apache Storm
Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
news-please - an integrated web crawler and information extractor for news that just works.
DuckDB-Web - Source code of duckdb.org
Simhash and near-duplicate detection
A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
No description provided.
No description provided.
No description provided.
Run a high-fidelity browser-based crawler in a single Docker container
This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl