Title: | Clustering by Fast Search and Find of Density Peaks |
---|---|
Description: | An improved implementation (based on k-nearest neighbors) of the density peak clustering algorithm, originally described by Alex Rodriguez and Alessandro Laio (Science, 2014 vol. 344). It can handle large datasets (> 100,000 samples) very efficiently. It was initially implemented by Thomas Lin Pedersen, with inputs from Sean Hughes and later improved by Xiaojie Qiu to handle large datasets with kNNs. |
Authors: | Thomas Lin Pedersen [aut, cre], Sean Hughes [aut], Xiaojie Qiu [aut] |
Maintainer: | Thomas Lin Pedersen <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.3.3.9000 |
Built: | 2024-10-25 04:01:59 UTC |
Source: | https://github.com/thomasp85/densityclust |
This function checks whether findClusters() has been performed on the given object and returns a boolean depending on the outcome.
clustered(x)

## S3 method for class 'densityCluster'
clustered(x)
x | A densityCluster object. |
TRUE if findClusters() has been performed, otherwise FALSE.
This function extracts the cluster membership of all the observations in the given densityCluster object. The output can be formatted in two ways, as described below. Halo observations can optionally be removed from the output.
clusters(x, ...)

## S3 method for class 'densityCluster'
clusters(x, as.list = FALSE, halo.rm = TRUE, ...)
x | The densityCluster object. |
... | Currently ignored. |
as.list | Should the output be in the list format? Defaults to FALSE. |
halo.rm | Logical. Should halo observations be removed? Defaults to TRUE. |
Two formats for the output are available. The first is a vector of integers denoting, for each observation, which cluster the observation belongs to. If halo observations are removed, these are set to NA. The second format is a list with a vector for each group containing the indexes of the member observations in the group. If halo observations are removed, their indexes are omitted. The list format corresponds to the following transform of the vector format: split(1:length(clusters), clusters), where clusters is the cluster information in vector format.
A vector or list with cluster memberships for the observations in the initial distance matrix
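The relation between the two formats can be checked in base R without the package. The sketch below uses a made-up membership vector (not output from clusters()) in which NA marks removed halo observations; it relies on split() dropping NA groups, which is why halo indexes disappear from the list format.

```r
# Toy membership vector: observations 3 and 6 are halo points set to NA
clusters <- c(1L, 2L, NA, 1L, 2L, NA, 1L)

# List format: one element per cluster, holding the member indexes.
# split() drops observations whose group is NA, so halo indexes are omitted.
as_list <- split(seq_along(clusters), clusters)
print(as_list)
```

seq_along(clusters) is used here in place of 1:length(clusters); the two agree for any non-empty vector.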
This function takes a distance matrix and optionally a distance cutoff and calculates the values necessary for clustering based on the algorithm proposed by Alex Rodriguez and Alessandro Laio (see references). The actual assignment to clusters is done in a later step, based on user-defined threshold values. If a distance matrix is passed into distance, the original algorithm described in the paper is used. If a matrix or data.frame is passed instead, it is interpreted as point coordinates and rho is estimated from the k-nearest neighbors of each point (rho is estimated as exp(-mean(x)), where x is the vector of distances to the nearest neighbors). This can be useful when the data is so large that calculating the full distance matrix is prohibitive.
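The kNN-based density estimate can be illustrated in base R. This is only a sketch under assumptions: the toy coordinates, the neighbor count k and the brute-force neighbor search are all chosen here for illustration, whereas the package relies on a fast kNN search.

```r
set.seed(1)
coords <- matrix(rnorm(200), ncol = 2)   # 100 toy points in 2-D
k <- 5                                   # hypothetical number of neighbors

# Brute-force kNN distances for illustration only
d <- as.matrix(dist(coords))
knn_dist <- t(apply(d, 1, function(row) sort(row)[2:(k + 1)]))  # skip self (0)

# rho = exp(-mean(x)), with x the distances to the k nearest neighbors
rho <- exp(-rowMeans(knn_dist))
summary(rho)
```

Points in crowded regions have small neighbor distances and therefore rho close to 1, while isolated points get rho close to 0.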
densityClust(distance, dc, gaussian = FALSE, verbose = FALSE, ...)
distance | A distance matrix or a matrix (or data.frame) for the coordinates of the data. If a matrix or data.frame is used, the distances and local density will be estimated using a fast k-nearest neighbor approach. |
dc | A distance cutoff for calculating the local density. If missing, it will be estimated with estimateDc(). |
gaussian | Logical. Should a Gaussian kernel be used to estimate the density? Defaults to FALSE. |
verbose | Logical. Should the running details be reported? |
... | Additional parameters passed on to get.knn. |
The function calculates rho and delta for the observations in the provided distance matrix. If a distance cutoff is not provided, this is first estimated using estimateDc() with default values.
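The two quantities can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: it uses the Gaussian-kernel density described by Rodriguez and Laio, a hand-picked cutoff dc, toy data, and the paper's convention that the densest observation receives its largest distance as delta.

```r
set.seed(42)
# Two toy groups of 20 points each (illustrative data)
pts <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 3, 0.3), ncol = 2))
d  <- as.matrix(dist(pts))
dc <- 0.5   # hypothetical cutoff chosen by hand

# Gaussian-kernel local density; subtract 1 to remove the self term exp(0)
rho <- rowSums(exp(-(d / dc)^2)) - 1

# delta: minimum distance to any observation of higher density;
# the densest observation gets its maximum distance instead
delta <- vapply(seq_along(rho), function(i) {
  higher <- which(rho > rho[i])
  if (length(higher) == 0) max(d[i, ]) else min(d[i, higher])
}, numeric(1))

head(cbind(rho, delta))
```

Observations that combine a high rho with a high delta are the candidates for cluster centers.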
The information kept in the densityCluster object is:
rho | A vector of local density values |
delta | A vector of minimum distances to observations of higher density |
distance | The initial distance matrix |
dc | The distance cutoff used to calculate rho |
threshold | A named vector specifying the threshold values for rho and delta used for cluster detection |
peaks | A vector of indexes specifying the cluster center for each cluster |
clusters | A vector of cluster affiliations for each observation. The clusters are referenced as indexes in the peaks vector |
halo | A logical vector specifying for each observation if it is considered part of the halo |
knn_graph | The kNN graph constructed. It is only applicable when coordinates are used as input. Currently it is set to NA. |
nearest_higher_density_neighbor | Index of the nearest sample with higher density. It is only applicable when coordinates are used as input. |
nn.index | Indices of each observation's k-nearest neighbors. It is only applicable when coordinates are used as input. |
nn.dist | Distances to each observation's k-nearest neighbors. It is only applicable when coordinates are used as input. |
Before findClusters() is run, the threshold, peaks, clusters and halo entries are NA.
A densityCluster object. See details for a description.
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. doi:10.1126/science.1242072
irisDist <- dist(iris[,1:4])
irisClust <- densityClust(irisDist, gaussian=TRUE)
plot(irisClust) # Inspect clustering attributes to define thresholds

irisClust <- findClusters(irisClust, rho=2, delta=2)
plotMDS(irisClust)
split(iris[,5], irisClust$clusters)
This function calculates a distance cutoff value for a specific distance matrix that makes the average neighbor rate (the proportion of points within the distance cutoff value) fall within the provided range. The authors of the algorithm suggest aiming for a neighbor rate between 1 and 2 percent, but also state that the algorithm is quite robust with regard to more extreme cases.
estimateDc(distance, neighborRateLow = 0.01, neighborRateHigh = 0.02)
distance | A distance matrix |
neighborRateLow | The lower bound of the neighbor rate |
neighborRateHigh | The upper bound of the neighbor rate |
A numeric value giving the estimated distance cutoff value
If the number of points is larger than 448 (resulting in 100,128 pairwise distances), 100,128 distance pairs will be randomly selected to speed up computation. Use set.seed() prior to calling estimateDc() in order to ensure reproducible results.
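The pair count and the idea of a neighbor rate can be illustrated in base R. The sketch below is not how estimateDc() searches for the cutoff: it assumes the neighbor rate can be read as the fraction of pairwise distances below the cutoff and simply picks the 1.5% quantile of the distances of some toy data as a shortcut. The helper neighbor_rate() is hypothetical.

```r
# 448 points give choose(448, 2) pairwise distances
choose(448, 2)   # 100128

set.seed(1)                               # fixed seed for reproducible toy data
pts <- matrix(rnorm(200), ncol = 2)       # 100 toy points
d   <- as.vector(dist(pts))               # 4950 pairwise distances

# Hypothetical helper: fraction of pairwise distances below the cutoff
neighbor_rate <- function(d, dc) mean(d < dc)

# Shortcut for illustration: take the 1.5% quantile as the cutoff
dc   <- unname(quantile(d, 0.015))
rate <- neighbor_rate(d, dc)
rate >= 0.01 && rate <= 0.02
```

Any cutoff whose rate lands between neighborRateLow and neighborRateHigh would satisfy the criterion described above; the quantile is just one convenient way to land inside that range on tie-free data.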
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. doi:10.1126/science.1242072
irisDist <- dist(iris[,1:4])
estimateDc(irisDist)
This function uses the supplied rho and delta thresholds to detect cluster peaks and assign the rest of the observations to one of these clusters. Furthermore, core/halo status is calculated. If either the rho or delta threshold is missing, the user is presented with a decision plot and can click on the plot area to set the threshold. If either rho or delta is set, this takes precedence over the value found by clicking.
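Peak detection and assignment can be sketched in base R on a toy example. The code is an illustration under assumptions rather than the package's implementation: the densities are made up, the thresholds are hypothetical, peaks are taken to be observations strictly above both thresholds, each remaining observation inherits the cluster of its nearest higher-density neighbor (as in Rodriguez and Laio), and halo detection is left out.

```r
x   <- c(0, 0.2, 0.5, 5, 5.1, 5.4)   # toy 1-D observations
d   <- as.matrix(dist(x))
rho <- c(3, 5, 2, 4, 6, 1)           # made-up local densities (illustrative)
n   <- length(rho)

# Nearest observation with strictly higher density (NA for the densest one)
nearest_higher <- vapply(seq_len(n), function(i) {
  higher <- which(rho > rho[i])
  if (length(higher) == 0) NA_integer_ else higher[which.min(d[i, higher])]
}, integer(1))

# delta: distance to that neighbor, or the largest distance for the densest point
delta <- ifelse(is.na(nearest_higher),
                apply(d, 1, max),
                d[cbind(seq_len(n), nearest_higher)])

# Peaks exceed both thresholds (hypothetical values read off a decision plot)
rho_thr   <- 4.5
delta_thr <- 2
peaks <- which(rho > rho_thr & delta > delta_thr)

# Visit observations from densest to sparsest so that each one's
# higher-density neighbor already carries a cluster label
clusters <- rep(NA_integer_, n)
clusters[peaks] <- seq_along(peaks)
for (i in order(rho, decreasing = TRUE)) {
  if (is.na(clusters[i])) clusters[i] <- clusters[nearest_higher[i]]
}

peaks      # 2 5
clusters   # 1 1 1 2 2 2
```

Cluster labels are indexes into peaks, matching the description of the clusters entry of the densityCluster object.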
findClusters(x, ...)

## S3 method for class 'densityCluster'
findClusters(x, rho, delta, plot = FALSE, peaks = NULL, verbose = FALSE, ...)
x | A densityCluster object as produced by densityClust() |
... | Additional parameters passed on |
rho | The threshold for local density when detecting cluster peaks |
delta | The threshold for minimum distance to higher density when detecting cluster peaks |
plot | Logical. Should a decision plot be shown after cluster detection? |
peaks | A numeric vector giving the indexes of the density peaks used for clustering. This vector should be retrieved from the decision plot with caution; no checking is performed. |
verbose | Logical. Should the running details be reported? |
A densityCluster object with clusters assigned to all observations
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. doi:10.1126/science.1242072
irisDist <- dist(iris[,1:4])
irisClust <- densityClust(irisDist, gaussian=TRUE)
plot(irisClust) # Inspect clustering attributes to define thresholds

irisClust <- findClusters(irisClust, rho=2, delta=2)
plotMDS(irisClust)
split(iris[,5], irisClust$clusters)
Generate a single panel of up to three diagnostic plots for a densityClust object.
plotDensityClust(
  x,
  type = "all",
  n = 20,
  mds = NULL,
  dim.x = 1,
  dim.y = 2,
  col = NULL,
  alpha = 0.8
)
x | A densityCluster object as produced by densityClust() |
type | A character vector designating which figures to produce. Valid options include |
n | Number of observations to plot in the gamma graph. |
mds | A matrix of scores for observations from a Principal Components Analysis or MDS. If omitted, and an MDS plot has been requested, one will be calculated. |
dim.x, dim.y | The numbers of the dimensions to plot on the x and y axes of the MDS plot. |
col | Vector of colors for clusters. |
alpha | Value in |
A panel of the figures specified in type is produced. If designated, clusters are color-coded and labelled. If present in x, the rho and delta thresholds are designated in the decision graph by a set of solid black lines.
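For orientation, the gamma graph in Rodriguez and Laio ranks observations by gamma = rho * delta, so likely cluster centers stand out at the top of the ranking. The base R sketch below uses made-up rho and delta values and assumes that n simply selects the n largest gamma values; it does not reproduce the plotting code.

```r
rho   <- c(3, 5, 2, 4, 6, 1)               # made-up local densities
delta <- c(0.2, 4.9, 0.3, 0.1, 5.1, 0.3)   # made-up distances to higher density
n     <- 2                                 # number of observations to show

# gamma combines both criteria; large values suggest cluster centers
gamma_vals <- rho * delta
top <- order(gamma_vals, decreasing = TRUE)[seq_len(n)]
top   # 5 2
```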
Eric Archer [email protected]
data(iris)
data.dist <- dist(iris[, 1:4])
pca <- princomp(iris[, 1:4])

# Run initial density clustering
dens.clust <- densityClust(data.dist)

op <- par(ask = TRUE)

# Show the decision graph
plotDensityClust(dens.clust, type = "dg")

# Show the decision graph and the gamma graph
plotDensityClust(dens.clust, type = c("dg", "gg"))

# Cluster based on rho and delta
new.clust <- findClusters(dens.clust, rho = 4, delta = 2)

# Show all graphs with clustering
plotDensityClust(new.clust, mds = pca$scores)

par(op)
This function produces an MDS scatterplot based on the distance matrix of the densityCluster object (if only coordinate information is available, a distance matrix is calculated first) and, if clusters are defined, colours each observation according to cluster affiliation. Observations belonging to a cluster core are plotted with filled circles and observations belonging to the halo with hollow circles. This plot is not suitable for large datasets (for example, datasets with > 1000 samples). For such data, users are advised to visualize their clustering results with other methods, for example t-SNE.
plotMDS(x, ...)
x | A densityCluster object as produced by densityClust() |
... | Additional parameters. Currently ignored. |
densityClust() for creating densityCluster objects, and plotTSNE() for an alternative plotting approach.
irisDist <- dist(iris[,1:4])
irisClust <- densityClust(irisDist, gaussian=TRUE)
plot(irisClust) # Inspect clustering attributes to define thresholds

irisClust <- findClusters(irisClust, rho=2, delta=2)
plotMDS(irisClust)
split(iris[,5], irisClust$clusters)
This function produces a t-SNE scatterplot based on the distance matrix of the densityCluster object (if only coordinate information is available, a distance matrix is calculated first) and, if clusters are defined, colours each observation according to cluster affiliation. Observations belonging to a cluster core are plotted with filled circles and observations belonging to the halo with hollow circles.
plotTSNE(x, ...)
x | A densityCluster object as produced by densityClust() |
... | Additional parameters. Currently ignored. |
densityClust() for creating densityCluster objects, and plotMDS() for an alternative plotting approach.
irisDist <- dist(iris[,1:4])
irisClust <- densityClust(irisDist, gaussian=TRUE)
plot(irisClust) # Inspect clustering attributes to define thresholds

irisClust <- findClusters(irisClust, rho=2, delta=2)
plotTSNE(irisClust)
split(iris[,5], irisClust$clusters)