vigna/webgraph
WebGraph is a framework for graph compression.
Welcome to WebGraph!
Introduction
WebGraph is a framework for graph compression aimed at studying web
graphs. It provides simple ways to manage very large graphs, exploiting
modern compression techniques. More precisely, it is currently made of:
-
A set of flat codes, called ζ codes, which are particularly
suitable for storing web graphs (or, in general, integers with power-law
distribution in a certain exponent range). The fact that these codes work
well can be easily tested empirically, but we also try to provide a
detailed mathematical
analysis. -
Algorithms for compressing web graphs that exploit gap compression and
referentiation (à la LINK),
intervalisation and ζ codes to provide a high compression ratio (see our
datasets). The algorithms are
controlled by several parameters, which provide different tradeoffs
between access speed and compression ratio. -
Algorithms for accessing a compressed graph without actually
decompressing it, using lazy techniques that delay the decompression until
it is actually necessary. -
Algorithms for analysing very large graphs, such as
HyperBall
which has been used to show that Facebook has just four degrees of
separation. -
An implementation of the algorithms above in Java and
Rust distributed under either the GNU
Lesser General Public License
2.1+ or the
Apache Software License
2.0. Besides a clearly
defined API, we also provide several classes tha modify (e.g., transpose)
or recompress a graph, so to experiment with various settings. -
Datasets for large graph. These are either
gathered from public sources (such as
WebBase), or
produced by UbiCrawler and BUbiNG.
In the end, with WebGraph you can access and analyse very large web
graphs. Using WebGraph is as easy as installing a few jar files and
downloading a dataset. This makes studying phenomena such as PageRank,
distribution of graph properties of the web graph, etc. very easy.
You are welcome to use and improve WebGraph! If you find our software
useful for your research, please quote this
paper.
This version of WebGraph is limited to graphs with at most 2³¹ nodes. For
larger graphs, have a look at the big
version.
JGraphT
JGraphT has a
few WebGraph adapters.
Hadoop
Helge Holzmann has developed an input
format for Hadoop for graphs
in BVGraph format.
WebGraph++
Jacob Ratkievicz has developed a C++ version of
WebGraph that you might
want to try. The library exposes a
BVGraph as an object of the
Boost Graph Library, so
it is easily integrable with other code.
pyWebgraph
Massimo Santini has developed a front-end
that interfaces Jython with
WebGraph. It makes exploring
small portions of very large graphs very easy and interactive.
Papers
-
A detailed description of
the compression algorithms used in WebGraph, published in the proceedings
of the Thirteenth International World–Wide Web
Conference. -
A mathematical analysis
of the performance of γ, δ and ζ codes against power-law distributions. -
Some quite surprising
experiments showing that the
transpose graph reacts very peculiarly to compression after
lexicographical or Gray-code sorting. -
A paper about
HyperBall
(then named HyperANF), our tool for computing an approximation of the
neighbourhood function, reachable nodes and geometric centralities of
massive graphs. More information can be found in this
preprint. -
HyperBall was used to
find out that on average there are just four degrees of
separation on
Facebook, and the experiment was reported by the
New York
Times.
Alas, the degrees were actually 3.74 (one less than the average
distance), but the off-by-one
between graph theory (“distance”) and sociology (“degrees of separation”)
generated a lot of confusion. -
A paper were we
describe our efforts to compress one of the largest social graphs
available—the graph of commits of the Software Heritage archive.