## Document Preprocessing For Naive Bayes Classification and Clustering with Mixture of Multinomials

Industry/Government Track Poster

The Naive Bayes classifier has long been used for text categorization tasks. Its sibling from the unsupervised world, the mixture of multinomials model, has likewise been successfully applied to text clustering problems. Despite the strong independence assumptions that these models make, their appeal comes from low computational cost, relatively low memory consumption, and the ability to handle heterogeneous features and multiple classes. Recently, there have been several attempts to improve the accuracy of Naive Bayes by applying heuristic feature transformations, such as IDF weighting, normalization by document length, and taking the logarithm of the term counts. We justify the use of these techniques and apply them to two problems: classification of products in Yahoo! Shopping and clustering the vectors of collocated terms in user queries to Yahoo! Search. The experimental evaluation allows us to draw conclusions about how far these transformations go toward alleviating the strong assumptions of the multinomial model.
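The three transformations named above can be illustrated with a minimal sketch. The abstract does not give exact formulas, so the choices below are assumptions: `log(1 + count)` for the logarithm, `log(N / df)` for IDF, and the Euclidean norm for length normalization. The function name `transform_counts` and the dictionary-based document representation are hypothetical, chosen only for illustration.

```python
import math


def transform_counts(docs):
    """Apply log, IDF, and length normalization to raw term counts.

    docs: list of dicts mapping term -> raw count (hypothetical format).
    Returns a list of dicts mapping term -> transformed weight.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1

    transformed = []
    for doc in docs:
        # 1) Dampen raw counts with a logarithm (assumed form: log(1 + count)).
        weights = {t: math.log(1.0 + c) for t, c in doc.items()}
        # 2) Scale by inverse document frequency (assumed form: log(N / df)).
        weights = {t: w * math.log(n_docs / df[t]) for t, w in weights.items()}
        # 3) Normalize by document length (assumed: Euclidean norm).
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        transformed.append(weights)
    return transformed


if __name__ == "__main__":
    docs = [{"camera": 3, "lens": 1}, {"camera": 1, "shoe": 2}]
    for weights in transform_counts(docs):
        print(weights)
```

With this particular IDF form, a term that occurs in every document receives weight zero. The transformed weights would then stand in for raw counts when estimating the multinomial parameters, which is one plausible reading of how such preprocessing is combined with the model.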

