Clustering in Python
What is Clustering?
Clustering is an unsupervised machine learning technique used in statistical data analysis, image processing, and pattern recognition. A clustering algorithm assigns each data point to a group so that points in the same group have similar properties or features, while points in different groups are dissimilar. Example: given data points described by income, education, profession, age, number of children, and so on, you end up with different clusters, and each cluster contains people with similar socio-economic characteristics.
Interpreting Clusters
After computing the optimal clusters, compute an aggregate measure such as the mean for every variable within each cluster, then compare the resulting values across clusters to interpret what distinguishes each one.
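As a rough illustration of this step, the sketch below computes the mean of every variable per cluster using only the standard library. The rows, the labels, and the helper name cluster_means are made up for the example; in practice the labels would come from whichever clustering algorithm you ran.

```python
from collections import defaultdict

# Hypothetical data: each row is (income in thousands, age, number of children).
rows = [
    (30.0, 25.0, 0.0),
    (34.0, 29.0, 1.0),
    (90.0, 48.0, 2.0),
    (110.0, 52.0, 3.0),
]
# Hypothetical cluster labels, one per row, as a clustering algorithm might return.
labels = [0, 0, 1, 1]


def cluster_means(rows, labels):
    """Return {cluster label: list of per-variable means}."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    means = {}
    for label, members in grouped.items():
        # zip(*members) turns the rows of a cluster into columns, one per variable.
        means[label] = [sum(col) / len(col) for col in zip(*members)]
    return means


profile = cluster_means(rows, labels)
for label in sorted(profile):
    print(label, profile[label])
```

Reading the two profiles side by side (lower income and younger in one cluster, higher income and older in the other) is the kind of comparison described above.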
K-Means Clustering
The k-means algorithm is iterative. First choose K, the number of clusters, and randomly select K points as the initial centroids (center points). Each data point is then assigned to its nearest centroid, which partitions the data set into K clusters, and each centroid is recomputed as the mean of the points assigned to it. This process repeats until the clusters are homogeneous, meaning the points in each cluster are close to each other. The algorithm also tries to maintain enough separation between the clusters. Because the method is unsupervised, the clusters have no labels.
# Import KMeans from sklearn.cluster
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)  # we want 2 clusters
kmeans.fit(X)  # X is the feature matrix
Full Code: Click Here.
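To make the iteration described in this section concrete, here is a minimal from-scratch sketch in plain Python. It is not how scikit-learn implements KMeans; the function name kmeans, its parameters, and the sample points are assumptions chosen for illustration, and empty clusters are handled in the simplest possible way.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 2: assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(col) for col in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Made-up sample: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The label numbers themselves are arbitrary, which is the "no labels" point made above: only the grouping of points is meaningful.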
Hierarchical Clustering
Hierarchical clustering, also known as hierarchical cluster analysis, builds a hierarchy of clusters. There are two types: agglomerative clustering and divisive clustering. Agglomerative clustering uses a bottom-up approach: it starts by treating each data point as a separate cluster, then repeatedly merges the closest clusters until all data points are in one cluster. Divisive clustering uses a top-down approach and works the opposite way: it starts with all data points in one single cluster, then splits it repeatedly.
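The bottom-up procedure can be sketched in a few lines of plain Python. This is a naive version for illustration, not a library implementation: it assumes the distance between two clusters is the distance between their closest pair of points, and the function name agglomerative and the one-dimensional sample points are made up.

```python
import math


def agglomerative(points, n_clusters):
    """Naive bottom-up clustering; closest-pair distance is an assumed choice."""
    # Start with every point in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members of first cluster, members of second, distance)

    def cluster_distance(a, b):
        # Distance between the closest pair of points, one from each cluster.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((list(clusters[x]), list(clusters[y]), d))
        # Merge cluster y into cluster x and drop y.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, merges


# Made-up one-dimensional sample points.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
for first, second, distance in merges:
    print(first, "+", second, "at distance", distance)
```

Setting n_clusters=1 runs the merging all the way to a single cluster. The recorded merge distances are the heights at which a dendrogram would draw each join.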
Dendrogram
In hierarchical clustering, the number of clusters is decided only after looking at the dendrogram.
A dendrogram is used to visualise the clusters and to show how a single cluster splits into multiple clusters.
Linkage
Linkage is the criterion by which the distance between two clusters is computed. Common choices are single, complete, and average linkage.
Single Linkage: the distance between two clusters is defined as the shortest distance between a point in one cluster and a point in the other.
Complete Linkage: the distance between two clusters is defined as the longest distance between a point in one cluster and a point in the other.
Average Linkage: the distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster.
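The three definitions above differ only in how the point-to-point distances are combined, which a small sketch can show. The function names and the two sample clusters are assumptions for illustration, and Euclidean distance is assumed throughout.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every (point in a, point in b) pair."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    # Shortest distance between any pair of points across the two clusters.
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    # Longest distance between any pair of points across the two clusters.
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    # Mean of all pairwise distances across the two clusters.
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Made-up clusters: the four corners of a 4-by-3 rectangle, split left and right.
cluster_a = [(0.0, 0.0), (0.0, 3.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0)]

print("single:  ", single_linkage(cluster_a, cluster_b))
print("complete:", complete_linkage(cluster_a, cluster_b))
print("average: ", average_linkage(cluster_a, cluster_b))
```

For these points the pairwise distances are 4, 5, 5 and 4, so the same two clusters are 4 apart under single linkage, 5 under complete linkage, and 4.5 under average linkage. That difference is why the choice of linkage can change which clusters get merged first.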
# Import AgglomerativeClustering from sklearn.cluster
from sklearn.cluster import AgglomerativeClustering

# Build an agglomerative hierarchical clustering model with complete linkage.
# Note: scikit-learn 1.2 and later use metric=; older releases call this parameter affinity=.
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete', metric='euclidean').fit(X)
Full Code: Click Here.
Conclusion
Scalability matters when choosing a clustering algorithm. In data mining, if the data set is huge, you need a highly scalable clustering algorithm. If the data set is noisy or has missing values, clean or impute it first, otherwise the clustering results will be poor. The algorithm should also handle all kinds of attributes, such as binary, categorical, and numerical (interval-based) data.