GitHunt
LU

lukasgd/dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package

R package dbscan - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

CRAN version
stream r-universe status
CRAN RStudio mirror downloads

This R package provides a fast C++ (re)implementation of several
density-based algorithms with a focus on the DBSCAN family for
clustering spatial data. The package includes:

Clustering

  • DBSCAN: Density-based spatial clustering of applications with
    noise.
  • HDBSCAN: Hierarchical DBSCAN with simplified hierarchy
    extraction.
  • OPTICS/OPTICSXi: Ordering points to identify the clustering
    structure clustering algorithms.
  • FOSC: Framework for Optimal Selection of Clusters for
    unsupervised and semisupervised clustering of hierarchical cluster
    tree.
  • Jarvis-Patrick clustering
  • SNN Clustering: Shared Nearest Neighbor Clustering.

Outlier Detection

  • LOF: Local outlier factor algorithm.
  • GLOSH: Global-Local Outlier Score from Hierarchies algorithm.

Fast Nearest-Neighbor Search (using kd-trees)

  • kNN search
  • Fixed-radius NN search

The implementations use the kd-tree data structure (from library ANN)
for faster k-nearest neighbor search, and are typically faster than the
native R implementations (e.g., dbscan in package fpc), or the
implementations in WEKA,
ELKI and Python’s
scikit-learn
.

Installation

Stable CRAN version: Install from within R with

install.packages("dbscan")

Current development version: Install from
r-universe.

install.packages("dbscan", repos = "https://mhahsler.r-universe.dev")

Usage

Load the package and use the numeric variables in the iris dataset

library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])

DBSCAN

db <- dbscan(x, eps = 0.4, minPts = 4)
db
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.4, minPts = 4
## The clustering contains 4 cluster(s) and 25 noise points.
## 
##  0  1  2  3  4 
## 25 47 38 36  4 
## 
## Available fields: cluster, eps, minPts

Visualize the resulting clustering (noise points are shown in black).

pairs(x, col = db$cluster + 1L)

OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt
## OPTICS ordering/clustering for 150 objects.
## Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA
## Available fields: order, reachdist, coredist, predecessor, minPts, eps,
##                   eps_cl, xi

Extract DBSCAN-like clustering from OPTICS and create a reachability
plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

opt <- extractDBSCAN(opt, eps_cl = 0.4)
plot(opt)

HDBSCAN

hdb <- hdbscan(x, minPts = 4)
hdb
## HDBSCAN clustering for 150 objects.
## Parameters: minPts = 4
## The clustering contains 2 cluster(s) and 0 noise points.
## 
##   1   2 
## 100  50 
## 
## Available fields: cluster, minPts, coredist, cluster_scores,
##                   membership_prob, outlier_scores, hc

Visualize the hierarchical clustering as a simplified tree. HDBSCAN
finds 2 stable clusters.

plot(hdb, show_flat = TRUE)

License

The dbscan package is licensed under the GNU General Public License
(GPL) Version 3
. The
OPTICSXi R implementation was directly ported from the ELKI
framework’s Java implementation (GNU AGPLv3), with explicit permission
granted by the original author, Erich Schubert.

Changes

References

  • Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast
    Density-Based Clustering with
    R
    . Journal of Statistical
    Software
    , 91(1), 1-30. doi: 10.18637/jss.v091.i01.
  • Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A
    Density-Based Algorithm for Discovering Clusters in Large Spatial
    Databases with Noise. Institute for Computer Science, University of
    Munich. Proceedings of 2nd International Conference on Knowledge
    Discovery and Data Mining (KDD-96), 226-231.
    https://dl.acm.org/doi/10.5555/3001460.3001507
  • Breunig, M., Kriegel, H., Ng, R., and Sander, J. (2000). LOF:
    identifying density-based local outliers. In ACM Int. Conf. on
    Management of Data, pages 93-104. doi:
    https://doi.org/10.1145/335191.335388
  • Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander
    (1999). OPTICS: Ordering Points To Identify the Clustering
    Structure. ACM SIGMOD international conference on Management of
    data. ACM Press. pp. doi: https://doi.org/10.1145/304181.304187
  • Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical
    density estimates for data clustering, visualization, and outlier
    detection. ACM Transactions on Knowledge Discovery from Data (TKDD),
    10(5):1-51. doi: https://doi.org/10.1145/2733381
  • R. A. Jarvis and E. A. Patrick. 1973. Clustering Using a Similarity
    Measure Based on Shared Near Neighbors. IEEE Trans. Comput. 22, 11
    (November 1973), 1025-1034. doi:
    https://doi.org/10.1109/T-C.1973.223640
  • Levent Ertoz, Michael Steinbach, Vipin Kumar, Finding Clusters of
    Different Sizes, Shapes, and Densities in Noisy, High Dimensional
    Data, SIAM International Conference on Data Mining, 2003, 47-59.
    doi: https://doi.org/10.1137/1.9781611972733.5

Languages

C++60.8%R36.4%C2.8%

Contributors

GNU General Public License v3.0
Created September 18, 2022
Updated September 14, 2022
lukasgd/dbscan | GitHunt