InfoVis CyberInfrastructure- Similarity Flooding Algorithm

Description

Top

Similarity flooding is a matching algorithm based on fix-point computation that is usable across different scenarios [Melnek, S., et. al. 2002]. The algorithm takes two graphs in Pajek format and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using a set of filters. The percentage of accurate match done by the algorithm is a measure of its accuracy.

The data provided as input to the algorithm is first converted into directed labeled graphs. As a first step, an iterative fixpoint computation is performed whose results tell us what nodes in one graph are similar to nodes in the second graph. For computing the similarities, the underlying principle used is that elements of two distinct models are similar when their adjacent elements are similar. A part of the similarity of two elements propagates to their respective neighbors. The spreading of similarities in the matched models is considered similar to the way IP packets flood the network in broadcast communication.

The resulting output is then converted into a Pajek input file, which can then be used to easily visualise the similarities and differences between the datasets using appropriate color coding.

The amount of human intervention required in correcting the results is used as a measure of accuracy.

Pros & Cons

Top

The main advantage of Similarity Flooding lies in its ability to be used across a wide variety of datasets. The algorithm works independent of the complexities of the underlying data and utilizes the relationship between the data points to estimate similarity.

The disadvantage of Similarity Flooding is its dependence on human intervention to compute accurate matches.

Applications

Top

It is general graph matching algorithm that can be used to map systems that can be represented as graphs.

Ontology mapping
Database schema matching
Object recognition

Details

Top

The algorithm is implemented is Java. The code was customized to read the pajek input format. The algorithm performs a similarity match across all the network nodes. It generates a similarity measurement between all nodes of the two network. The resultant pajek file can be generated using a particular threshold value.

A sample visualization of common nodes shared by network 1 (green) and network 2 (red) is given below. The common nodes are connected by black edges.

Usage Hints

Top

The original code is available for download at http://www-db.stanford.edu/~melnik/mm/sfa/

The orginal code has been modified to accept the pajek input format. The modified code can be downloaded here

To run

At the command prompt, type:

java com.interdatanetworking.mm.alg.Match <sample1PajekInput.txt> <sample2PajekInput.txt> <results.txt>

where,

sample1PajekInput.txt - network 1 (nw1) in Pajek-input-file format.
sample2PajekInput.txt - network 2 (nw2) in Pajek-input-file format.
results.txt - similarity output file

From the similarity results file, it can be infered that the resultant similarity between node 1 in the first network and that in the 2nd is 1.0, while that between 1 and 2 is 0.193, etc.

Further filtering can be applied by using threshold values for the similarity measure by using the follwing perl script (simPajekGen.pl).

To run the script, use

perl simPajekGen.pl <sim-results file> <nw1> <nw2> <output file> <threshold value>

References

Top

Sergey Melnik, Hector Garcia-Molina and Erhard Rahm, Similarity Flooding: A Versatile Graph Matching Algorithm , Proceedings of the 18th International Conference on Data Engineering, San Jose, California, p. 117-128, 2002.

Acknowledgments

Top

This documentation was compiled by Lalitha Viswanath, Ketan Mane and Katy Börner.

Information Visualization Cyberinfrastructure @ SLIS, Indiana University
Last Modified March 10, 2005