
Hierarchical Clustering Using Ward's Algorithm

Description | Pros and Cons | Applications | Details | Usage Hints | References | Acknowledgments


Description

Agglomerative (bottom-up) hierarchical clustering algorithms can be applied to create a hierarchy of clusters grouping similar data items, e.g., documents. Clustering starts with a set of singleton clusters, each containing a single document d_i ∈ D, i = 1, ..., N, where D is the entire set of documents and N is the number of documents. The two most similar clusters over the entire set D are merged to form a new cluster that covers both, and this merging step is then repeated on the remaining clusters. A complete linkage algorithm is applied to determine the overall similarity of document clusters based on the document similarity matrix.

Merging of document clusters continues until a single, all-inclusive cluster remains. At termination, a uniform, binary hierarchy of document clusters is produced.
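
To make the merging loop concrete, the following Java sketch starts from singleton clusters and repeatedly merges the two most similar clusters under complete linkage over a small hard-coded similarity matrix. It is illustrative only; it is not the distributed HCclustering.java, and the class and method names are assumptions.

  import java.util.ArrayList;
  import java.util.List;

  // Illustrative sketch only -- NOT the distributed HCclustering.java.
  // Agglomerative clustering over a document similarity matrix using
  // complete linkage: the similarity of two clusters is the smallest
  // pairwise similarity between their members.
  public class AgglomerativeSketch {

      public static void main(String[] args) {
          // Toy similarity matrix for four documents (1.0 on the diagonal:
          // each document is maximally similar to itself).
          double[][] sim = {
              {1.0, 0.9, 0.1, 0.2},
              {0.9, 1.0, 0.3, 0.1},
              {0.1, 0.3, 1.0, 0.8},
              {0.2, 0.1, 0.8, 1.0}
          };

          // Start with N singleton clusters, one per document index.
          List<List<Integer>> clusters = new ArrayList<>();
          for (int i = 0; i < sim.length; i++) {
              List<Integer> singleton = new ArrayList<>();
              singleton.add(i);
              clusters.add(singleton);
          }

          // N-1 merges until a single, all-inclusive cluster remains.
          while (clusters.size() > 1) {
              int bestA = 0, bestB = 1;
              double bestSim = Double.NEGATIVE_INFINITY;
              for (int a = 0; a < clusters.size(); a++) {
                  for (int b = a + 1; b < clusters.size(); b++) {
                      double s = completeLinkage(clusters.get(a), clusters.get(b), sim);
                      if (s > bestSim) {
                          bestSim = s;
                          bestA = a;
                          bestB = b;
                      }
                  }
              }
              // Merge the two most similar clusters into one.
              clusters.get(bestA).addAll(clusters.get(bestB));
              clusters.remove(bestB);
              System.out.println("Merged at similarity " + bestSim + ": " + clusters);
          }
      }

      // Complete linkage: cluster similarity = minimum document similarity.
      static double completeLinkage(List<Integer> c1, List<Integer> c2, double[][] sim) {
          double min = Double.POSITIVE_INFINITY;
          for (int i : c1) {
              for (int j : c2) {
                  min = Math.min(min, sim[i][j]);
              }
          }
          return min;
      }
  }

The same loop structure applies when the merging criterion is replaced by Ward's objective function described below.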

The figure below shows an example of how eight documents may be clustered hierarchically. Note that each document is maximally similar to itself, so clustering starts from eight singleton clusters.

[Figure: an agglomerative cluster hierarchy of eight documents]

Ward's algorithm (Ward, 1963) is a commonly used procedure for forming hierarchical groups of mutually exclusive subsets. It is particularly useful for large-scale (n > 100) studies in which a precise optimal solution for a specified number of groups is not practical. Given n sets, the procedure reduces them to n - 1 mutually exclusive sets by considering the union of all possible n(n - 1)/2 pairs and selecting the union with the optimal value of the objective function, i.e., the functional relation that reflects the criterion chosen by the investigator. Repeating this process until only one group remains yields the complete hierarchical structure together with a quantitative estimate of the loss associated with each stage of the grouping. With the error sum of squares (ESS) as the objective function, this loss is computed from:

    ESS = sum_{i=1..n} x_i^2 - (1/n) * (sum_{i=1..n} x_i)^2

where x_1, ..., x_n are the values in the group under consideration; the loss of a merge is the resulting increase in total ESS.
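
As a hedged illustration of this objective function, the short Java sketch below (illustrative names, one-dimensional values only, not the distributed code) computes the error sum of squares of a group and the loss incurred by merging two groups, i.e., the increase in total ESS:

  // Illustrative sketch only -- NOT the distributed code; names are assumptions.
  // Ward's objective function for one-dimensional values:
  //   ESS = sum(x_i^2) - (1/n) * (sum(x_i))^2
  public class WardObjective {

      // Error sum of squares of one group of values.
      static double ess(double[] values) {
          double sum = 0.0, sumSq = 0.0;
          for (double x : values) {
              sum += x;
              sumSq += x * x;
          }
          return sumSq - (sum * sum) / values.length;
      }

      // Loss of merging two groups: the increase in total ESS.
      static double mergeLoss(double[] a, double[] b) {
          double[] union = new double[a.length + b.length];
          System.arraycopy(a, 0, union, 0, a.length);
          System.arraycopy(b, 0, union, a.length, b.length);
          return ess(union) - ess(a) - ess(b);
      }

      public static void main(String[] args) {
          double[] groupA = {1.0, 2.0, 3.0};
          double[] groupB = {10.0, 11.0};
          System.out.println("ESS(A) = " + ess(groupA));   // 2.0
          System.out.println("ESS(B) = " + ess(groupB));   // 0.5
          System.out.println("Loss of merging A and B = " + mergeLoss(groupA, groupB));
      }
  }

At each stage, Ward's method merges the pair of groups whose union gives the smallest such loss.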


Pros & Cons

The algorithm can produce an ordering of the objects that may be informative for data display. It also tends to generate small, compact clusters, which can be helpful for discovery.

No provision is made for relocating objects that were 'incorrectly' grouped at an early stage. Different distance metrics are likely to produce different clusterings, so performing multiple experiments and comparing the results is recommended.


Applications

Can be applied to cluster any data set for which a distance or similarity measure is available.


Details

The implemented algorithm scans the data input file and stores it in a linked list. In each clustering step, the linked list is modified to reflect the number and content of currently existing clusters.
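
A minimal sketch of such a linked-list representation is shown below. The class and field names are assumptions, and this is not the distributed HCclustering.java; it only illustrates how each merge moves the members of one cluster node into another and deletes the emptied node, so the list always mirrors the current set of clusters.

  import java.util.LinkedList;

  // Illustrative sketch only -- NOT the distributed HCclustering.java;
  // class and field names are assumptions. Each cluster node holds the
  // indices of its members, and all current clusters live in a LinkedList
  // that shrinks by one node per merge.
  public class ClusterListSketch {

      static class Cluster {
          LinkedList<Integer> members = new LinkedList<>();
          Cluster(int index) { members.add(index); }
      }

      public static void main(String[] args) {
          int n = 5; // e.g., five records read from the input file

          // One singleton cluster per input record.
          LinkedList<Cluster> clusters = new LinkedList<>();
          for (int i = 0; i < n; i++) {
              clusters.add(new Cluster(i));
          }

          // One clustering step: move the members of one chosen node into
          // another and delete the emptied node, so the list reflects the
          // number and content of the currently existing clusters.
          Cluster keep = clusters.get(0);
          Cluster drop = clusters.get(1);
          keep.members.addAll(drop.members);
          clusters.remove(drop);

          System.out.println("Clusters remaining: " + clusters.size()); // prints n - 1
      }
  }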


Usage Hints
  1. Download the zipped files and uncompress them on your computer using WinZip or another utility.
  2. This creates a directory named src inside the directory to which you downloaded the archive.
  3. The directory contains HCclustering.java (the source file) as well as input.txt and output.txt, which are sample input and output files, respectively.
  4. Compile the program with 'javac HCclustering.java' from that directory.
  5. Run it with 'java HCclustering input.txt output.txt'; it reads the data from the input file and writes the clustering result to the output file.
  6. For large data sets more memory is needed: run with the option -mx512m, which allows the maximum memory to grow up to 512 megabytes; this value can be changed as needed (see the example below).
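
For example, to compile the program and then cluster a larger input file with an enlarged heap (file names as in the sample files above):

    javac HCclustering.java
    java -mx512m HCclustering input.txt output.txt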

References

Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236-244.

Acknowledgments

The algorithm was implemented and documented by Vivek Agrawal (vivekagr@iitk.ac.in).

Information Visualization Cyberinfrastructure @ SLIS, Indiana University
Last Modified January 14, 2004