"topic:webarchiving" — Search

49 results for “topic:webarchiving”

An Awesome List for getting started with web archiving

Wayback Machine API interface & a command-line tool

archive-webpage, archive-webpages, cdx-api, internet-archive, internet-archiving, osint, savepagenow, wayback-machine, wayback-machine-api, wayback-machine-python, web-archiving, webarchiving

harvard-lil/warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

Python 27025 Updated 4 weeks ago

ai, rag, warc, webarchiving

N0taN3rd/Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

JavaScript 17425 Updated 2 days ago

browser-automation, chrome, chrome-headless, crawler, crawling, headless-chrome, high-fidelity-preservation, puppeteer, webarchives, webarchiving

ArchiveTeam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

C 13217 Updated 1 week ago

archiveteam, archiving, crawl, crawler, crawlers, crawling, downloader, ftp, lua, scraper, scraping, spider, warc, webarchiving, wget, wget-lua, zstd

machawk1/awesome-memento

A list of things related to software, literature, and other content for 🕣 Memento

1119 Updated 1 day ago

awesome, awesome-list, memento, memento-rfc, webarchiving

N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

JavaScript 10422 Updated 2 weeks ago

chrome-remote-interface, pupeteer, warc, warc-files, web-archives, web-archiving, webarchive, webarchiving

commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

Jupyter Notebook 6411 Updated 1 week ago

aws-athena, common-crawl, commoncrawl, jupyter-notebook, webarchiving, webgraph-framework

peterk/warcworker

A dockerized, queued high fidelity web archiver based on Squidwarc

Python 629 Updated 1 month ago

archiving, high-fidelity-preservation, preservation, webarchives, webarchiving

cipher387/quickcacheandarchivesearch

Quick Cache and Archive search buttons

JavaScript 394 Updated 3 weeks ago

baidu-cache, google-cache, webarchive, webarchiving, yandex-cache

datacoon/metawarc

metawarc: a command-line tool for metadata extraction from files from WARC (Web ARChive)

Python 352 Updated 1 week ago

metadata, osint, osint-python, warc, warc-files, webarchiving

ruarxive/awesome-digital-preservation

Awesome list dedicated to digital and data preservation tools, sources, services and so on.

336 Updated 1 day ago

archival, awesome, awesome-list, crawler, digital-preservation, list, warc, webarchiving

peterk/munin-indexer (Archived)

A social media open post web archiving tool

JavaScript 262 Updated 2 weeks ago

archiving, high-fidelity-preservation, preservation, webarchiving

httpreserve/httpreserve

Digital Preservation of HTTP in documentary heritage.

Go 243 Updated 3 months ago

archives, code4lib, digipres, digital-repositories, digitalpreservation, documentary-heritage, internetarchive, wayback, waybackmachine, webarchiving

ArchiveTeam/WebArchiver

Decentralized web archiving

Python 204 Updated 11 months ago

archiver, archiving, crawler, decentralized, python, warc, web, webarchiving

natliblux/warc-safe

A tool for detecting viruses and NSFW material in WARC files

Python 181 Updated 1 week ago

antivirus, nsfw-classifier, warc, warc-safe, webarchiving

basenana/nanafs

🗄 File-Based Reference Filing System.

Go 184 Updated 2 days ago

fuse-filesystem, gtd-workflow, storage, webarchiving, webdav, workflow-engine

WebarchivCZ/Seeder

Seeder - Czech webarchive curating tool and public site

Python 172 Updated 3 weeks ago

archive, czech, czech-republic, django, government, tools, webarchive, webarchives, webarchiving

renevoorburg/robustify.js

A javascript for fighting link rot and content drift using link decoration and web archives.

HTML 165 Updated 5 months ago

linkrot, webarchiving

toimik/WarcProtocol

Parser for WARC (aka WebArchive) files

C# 154 Updated 4 months ago

warc, warc-files, warc-format, warc-reader, warc-record, webarchive, webarchives, webarchiving

httpreserve/tikalinkextract

Tika based link (URL) extractor for httpreserve

HTML 110 Updated 1 month ago

archives, code4lib, digitalpreservation, httpreserve, iipc, tika, tika-wrapper, url-extractor, webarchiving

oduwsdl/tmvis

An archival thumbnail visualization server

JavaScript 98 Updated 4 months ago

archive, memento, nodejs, timemap, tmvis, visualization, webarchiving, webpage-changes

atomotic/pywb-recorder-tor

pywb recorder over tor, anonymously records the web. (docker image)

CSS 70 Updated 4 years ago

tor, webarchiving, webrecorder

ArchiveBox/community

A wiki of the broader Web Archiving Community: important organizations, alternative projects, blog posts, and more.

60 Updated 4 months ago

archivebox, archiving, digipres, internet, preservation, webarchiving

News-Archiver/news-archiver

News Archiver, Data Aggregation for CNN and Fox News

JavaScript 62 Updated 1 year ago

cnn, foxnews, javascript, mysql, scraping-websites, webarchiving

atomotic/webrecorder-chrome-extension

record current active tab on webrecorder.io

JavaScript 50 Updated 4 years ago

chrome-extension, webarchiving, webrecorder

httpreserve/linkscanner

A helper package to tokenize textual content and retrieve hyperlinks

Go 30 Updated 1 year ago

archives, code4lib, digitalpreservation, documentary-heritage, httpreserve, webarchiving

httpreserve/workbench

Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB

JavaScript 20 Updated 4 years ago

archives, boltdb, code4lib, digital-repositories, digitalpreservation, internetarchive, webarchiving

GurenMashu/pagebinder

Here lies the code for 'pagebinder' - more details in README.

Python 20 Updated 4 months ago

archiver, crawler, firefox, pdf, pdf-document, python, pythontool, webarchiving, website

exponential-decay/moonshine

Given four bytes, download a random file from web archives implementing the UKWA Shine interface

Go 20 Updated 2 years ago

archives, code4lib, digipres, file-formats, glam, ukwa, warclight, webarchiving

Page 1 of 2