University of Washington - Department of Statistics
Advisors: Alejandro Murua and Werner Stuetzle
The first step in document clustering is converting documents into vectors. This process starts from the term-document matrix, whose (i,j)th entry contains the number of occurrences of word j in document i. The raw word counts are then transformed and weighted, and finally the dimension is reduced by principal component analysis. I present the results of an empirical study exploring the influence of transformation, term weighting, and dimensionality reduction on the performance of document classification and clustering procedures.
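As a rough illustration of the first stages of this pipeline, the sketch below builds a term-document matrix of raw counts and then applies one possible transformation and weighting. The choices here are assumptions made for the example only: whitespace tokenization, a log(1 + count) transformation, and inverse document frequency weighting. The function names `term_document_matrix` and `log_idf` are hypothetical, and the principal component step is left out.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts word j in document i."""
    # Assumption: lowercase whitespace tokenization, no stemming or stop list.
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            matrix[i][index[word]] = count
    return vocab, matrix


def log_idf(matrix):
    """Apply log(1 + count) and multiply by log(n_docs / document frequency)."""
    # Assumption: this particular transformation/weighting pair is only one
    # of many combinations that such a study could compare.
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
    return [[math.log(1 + row[j]) * idf[j] for j in range(n_terms)]
            for row in matrix]


docs = ["stocks fall as markets fall", "markets rally", "storm hits coast"]
vocab, counts = term_document_matrix(docs)
weighted = log_idf(counts)
```

Each row of `weighted` is the vector for one document; a word that occurs in every document receives weight zero under this weighting.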
I then describe a model-based document clustering procedure. It rests on the assumption that the documents referring to a given topic follow a Gaussian distribution, so that the entire document collection can be modeled as a mixture of Gaussians. The parameters of the individual Gaussian mixture components, the mixing weights, and the number of components are all estimated from the data. The ability to estimate the number of mixture components (topics) represents a major advantage of model-based document clustering over nonparametric clustering methods.
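In symbols, the mixture assumption can be written as follows. The notation is introduced here for illustration: x is a document vector, K the number of topics, and the mixing weights, means, and covariances of the components are the parameters to be estimated. The abstract does not say how K is chosen; the Bayesian information criterion shown in the second line is one common choice in model-based clustering and is included only as an assumption.

```latex
% Gaussian mixture density for a document vector x with K components (topics)
p(x) = \sum_{k=1}^{K} \pi_k \, \phi(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% One possible criterion for K (an assumption, not stated in the abstract):
% \hat{L}_K is the maximized likelihood, \nu_K the number of free parameters,
% n the number of documents; under this sign convention larger is better.
\mathrm{BIC}(K) = 2 \log \hat{L}_K - \nu_K \log n
```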
Model-based clustering is computationally feasible for collections containing a few thousand documents, but it breaks down for large collections. In the last part of the talk I describe model-based fractionation as a way of overcoming this limitation. The idea is to break the collection into small subsets, or fractions, cluster those, and then combine the results. I present some preliminary results and discuss potential merits and limitations of model-based fractionation.
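The split, cluster, and combine idea can be sketched as below. This is not the procedure from the talk: a plain k-means routine stands in for the model-based clusterer, fractions are taken as consecutive slices, and the results are combined by clustering the unweighted fraction centroids. The names `kmeans` and `fractionate` and all parameters are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain k-means on lists of floats (a stand-in, not model-based clustering)."""
    # Assumption: the first k points serve as initial centers, which keeps
    # the example deterministic but is a poor choice in general.
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep empty ones fixed.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def fractionate(points, fraction_size, k_per_fraction, k_final):
    """Cluster each fraction separately, then cluster the resulting centroids."""
    centroids = []
    for start in range(0, len(points), fraction_size):
        fraction = points[start:start + fraction_size]
        centroids.extend(kmeans(fraction, min(k_per_fraction, len(fraction))))
    return kmeans(centroids, min(k_final, len(centroids)))


points = [[0, 0], [10, 10], [0, 1], [10, 11],
          [1, 0], [11, 10], [1, 1], [11, 11]]
final_centers = fractionate(points, fraction_size=4, k_per_fraction=2, k_final=2)
```

Because each call to the clusterer sees only one fraction, or the much smaller set of centroids, no single step has to handle the whole collection at once.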
The document collection used in our experiments is the Topic Detection and Tracking corpus, a collection of one year of news stories from CNN and Reuters.