Burst Detection
The Burst Detection algorithm was developed and provided by Jon Kleinberg (Cornell University). It analyzes documents to find features that have high intensity over limited periods of time. Rather than using plain word frequencies, the algorithm employs a probabilistic automaton whose states correspond to the frequencies of individual words. State transitions correspond to points in time around which the frequency of the word changes significantly. The algorithm is intended to extract meaningful structure from document
streams that arrive continuously over time (e.g., emails or news articles).
It generates a ranked list of the most significant word bursts in the
document stream, together with the intervals of time in which they occurred.
This can serve as a means of identifying topics or concepts that rose
to prominence over the course of the stream, were discussed actively for
a period of time, and then faded away.
Often, one is interested not only in the frequency of word occurrences,
e.g., in publications, but also in sudden increases or decreases in word
usage. The burst algorithm provides a scalable means of identifying bursts
of word frequency activity in text data streams.
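The automaton idea described above can be illustrated with a minimal sketch. It assumes a simplified two-state model over batched counts (a base state and a burst state) and finds the cheapest state sequence with dynamic programming. It is not the compute-bursts C implementation; the function name and the parameters `s` (how much higher the burst rate is than the base rate) and `gamma` (how costly it is to enter the burst state) are illustrative choices.

```python
import math

def detect_bursts(relevant, totals, s=2.0, gamma=1.0):
    """Return (start, end) batch index pairs spent in the burst state.

    relevant[t] = documents in batch t that contain the word
    totals[t]   = all documents in batch t
    Assumed model: two states, binomial fit cost, fixed cost to move up.
    """
    n = len(relevant)
    if n == 0 or sum(totals) == 0:
        return []
    p0 = sum(relevant) / sum(totals)       # base rate over the whole stream
    p1 = min(s * p0, 0.9999)               # elevated rate of the burst state
    if p0 <= 0.0 or p1 <= p0:
        return []
    probs = (p0, p1)
    up_cost = gamma * math.log(n)          # cost of entering the burst state

    def fit_cost(p, r, d):
        # Negative log-likelihood of d hits in r documents at rate p.
        log_binom = math.lgamma(r + 1) - math.lgamma(d + 1) - math.lgamma(r - d + 1)
        return -(log_binom + d * math.log(p) + (r - d) * math.log(1.0 - p))

    cost = [0.0, math.inf]                 # the sequence starts in the base state
    back = []
    for r, d in zip(totals, relevant):
        new_cost, pointers = [], []
        for j in (0, 1):
            # Moving up costs up_cost; staying or moving down is free.
            candidates = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best = 0 if candidates[0] <= candidates[1] else 1
            new_cost.append(candidates[best] + fit_cost(probs[j], r, d))
            pointers.append(best)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest state sequence backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()

    # Collapse consecutive burst-state batches into intervals.
    bursts, start = [], None
    for t, q in enumerate(states):
        if q == 1 and start is None:
            start = t
        elif q == 0 and start is not None:
            bursts.append((start, t - 1))
            start = None
    if start is not None:
        bursts.append((start, n - 1))
    return bursts
```

With made-up counts such as `detect_bursts([2, 2, 2, 2, 20, 22, 21, 2, 2, 2], [100] * 10)`, the three high-count batches form a single burst interval. Raising `gamma` makes the sketch more reluctant to report short bursts.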
Two sample data sets were used by Kleinberg to demonstrate the algorithm:
The animal behavior data set, comprising journal and article records, was downloaded from the Biological Abstracts database. The burst algorithm was then applied to publications from a core set of 11 journals, with the TITLE field of each record providing the text for burst detection. Several results were obtained and are available for download:
The project files are located on the 'iuniverse' server under the '/u4/iv/IVR/src/edu/iu/iv/analysis/burst/' folder and are organized as follows:
The Burst Detection C code can be compiled with the generic 'cc' compiler
at the Unix system prompt. The '-o' option can be used to name the
resulting executable file. Sample data is provided in the 'networking.index' file,
which is constructed from paper titles at the two CS networking conferences
SIGCOMM and INFOCOM, 1988-2001. The sample can be run as follows:
    % cat networking.index | compute-bursts -bin -eps -rel 2 2 1 -trans 1
The '-bin' and '-eps' options are required; the code does not explain them.
Output is in the following format:
    ^ W : a x r y (b1 - b2)
Here, W is the word associated with the burst, a is the number of bins spanned by the burst interval, x is the weight of the burst, r is the reciprocal of the rate associated with the burst state, and b1 and b2 are the names of the starting and ending bins.
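For post-processing, output lines in the format above can be split into their parts with a short script. The following is a minimal sketch, assuming each burst is reported on a single line exactly as shown; the function name is hypothetical, the sample line in the usage note is invented for illustration, and the numeric fields are kept as raw strings in the order they appear because their exact types are not documented here.

```python
import re

# Matches lines of the form: ^ W : <numeric fields> (b1 - b2)
BURST_LINE = re.compile(
    r"^\^\s*(?P<word>\S+)\s*:\s*(?P<fields>[^()]*?)\s*"
    r"\((?P<b1>.+?)\s+-\s+(?P<b2>.+?)\)\s*$"
)

def parse_burst_line(line):
    """Split one output line into word, numeric fields, and bin names.

    Returns None for lines that do not match the assumed format.
    """
    match = BURST_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "word": match.group("word"),
        # Raw field strings, in the order given by the format line.
        "fields": match.group("fields").split(),
        "start_bin": match.group("b1"),
        "end_bin": match.group("b2"),
    }
```

For example, an invented line such as `^ multicast : 3 12.5 0.25 7 (1995 - 1997)` would yield the word `multicast`, four field strings, and the bin names `1995` and `1997`.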
The Burst Detection package was developed by Jon Kleinberg. It was integrated by Todd Holloway (tohollow@cs.indiana.edu) and Sidharth Thakur (sithakur@indiana.edu). Sidharth Thakur and Katy Börner wrote the documentation.