## Probabilistic Author-Topic Models for Information Discovery (2004)


### Other Repositories/Bibliography

Venue: | THE TENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING |

Citations: | 134 - 9 self |

### BibTeX

@INPROCEEDINGS{Steyvers04probabilisticauthor-topic,

author = {Mark Steyvers and Padhraic Smyth and Michal Rosen-Zvi and Tom Griffiths},

title = {Probabilistic Author-Topic Models for Information Discovery},

booktitle = {THE TENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING},

year = {2004},

pages = {306--315},

publisher = {}

}


### Abstract

We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. We also discuss an online query interface to the model that allows interactive exploration of author-topic models for corpora such as CiteSeer.
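The two-stage process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `generate_document`, the toy sizes, and the Dirichlet hyperparameters are hypothetical, and the sketch assumes that each word's author is drawn uniformly from the document's author list. It covers only the generative direction; learning the distributions by Markov chain Monte Carlo is not shown.

```python
import numpy as np

def generate_document(author_ids, theta, phi, n_words, rng):
    """Sample one document from an author-topic style generative process.

    author_ids: list of indices of the document's authors
    theta: (A, T) array; row a is author a's distribution over topics
    phi:   (T, V) array; row t is topic t's distribution over words
    """
    n_topics, n_vocab = theta.shape[1], phi.shape[1]
    authors, topics, words = [], [], []
    for _ in range(n_words):
        # Assumption: the responsible author is chosen uniformly per word.
        a = rng.choice(author_ids)
        # Stage 1: draw a topic from that author's topic distribution.
        t = rng.choice(n_topics, p=theta[a])
        # Stage 2: draw a word from that topic's word distribution.
        w = rng.choice(n_vocab, p=phi[t])
        authors.append(a)
        topics.append(t)
        words.append(w)
    return np.array(authors), np.array(topics), np.array(words)

# Toy setup (hypothetical sizes; the paper's model uses 300 topics).
rng = np.random.default_rng(0)
A, T, V = 3, 4, 10
theta = rng.dirichlet(np.full(T, 0.5), size=A)  # author-topic distributions
phi = rng.dirichlet(np.full(V, 0.5), size=T)    # topic-word distributions

# A 50-word document written jointly by authors 0 and 2.
authors, topics, words = generate_document([0, 2], theta, phi, 50, rng)
```

Each word carries a latent author and topic assignment; these are the quantities an MCMC sampler would have to infer when only the words and author lists are observed.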

### Citations

3020 | Indexing by latent semantic analysis - Deerwester, Dumais, et al. - 1990 |

2634 | Latent Dirichlet allocation - Blei, Ng, et al.
Citation Context ...e of the posterior distribution over parameters. However, when applied to probabilistic topic models (Hofmann, 1999), this approach is susceptible to local maxima and computationally inefficient (see Blei, Ng, and Jordan, 2003). We pursue an alternative parameter estimation strategy, outlined by Griffiths and Steyvers (2004), using Gibbs sampling, a Markov chain Monte Carlo algorithm to sample from the posterior distributi... |

874 | Probabilistic Latent Semantic Indexing - Hofmann - 1999 |

689 | Finding scientific topics - Griffiths, Steyvers |

678 | Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections - Cutting - 1992
Citation Context ...and documents. A somewhat different approach is to cluster the documents into groups containing similar semantic content, using any of a variety of well-known document clustering techniques (e.g., Cutting et al., 1992; McCallum, Nigam, and Ungar, 2000; Popescul et al., 2000). Each cluster of documents can then be associated with a latent topic (e.g., as represented by the mean term vector for documents in the clus... |

293 | Digital libraries and autonomous citation indexing - Lawrence, Giles, et al. - 1999 |

273 | Efficient Clustering of High-Dimensional Data Sets with Application to Reference - McCallum, Nigam, et al. - 2000
Citation Context ...what different approach is to cluster the documents into groups containing similar semantic content, using any of a variety of well-known document clustering techniques (e.g., Cutting et al., 1992; McCallum, Nigam, and Ungar, 2000; Popescul et al., 2000). Each cluster of documents can then be associated with a latent topic (e.g., as represented by the mean term vector for documents in the cluster). While clustering can provide... |

259 | The author-topic model for authors and documents - Rosen-Zvi |

253 | Operations for learning with graphical models - Buntine - 1994 |

244 | Referral Web: combining social networks and collaborative filtering - Kautz, Selman, et al. - 1997 |

204 | The missing link - a probabilistic model of document content and hypertext connectivity - Cohn, Hofmann |

106 | Applied Bayesian and classical inference: the case of the Federalist Papers - Mosteller, Wallace - 1984 |

106 | Algorithms for estimating relative importance in networks - White, Smyth - 2003 |

93 | Mapping authors in intellectual space: a technical overview - McCain - 1990 |

67 | Authorship Attribution with Support Vector Machines - Diederich, Kindermann, et al. - 2003
Citation Context ... of a purported poem by Shakespeare (Thisted and Efron, 1987), identifying authors of software programs (Gray, Sallis, and MacDonell, 1997), and the use of techniques such as support vector machines (Diederich et al., 2003) for author identification. This work on author identification emphasizes the use of distinctive stylistic features (such as sentence length) that uniquely characterize a specific author. In contrast... |

48 | Mixed membership models of scientific publications - Erosheva, Fienberg, et al. - 2004 |

34 | Software forensics: Extending authorship analysis techniques to computer programs - Gray, Sallis, et al. - 1997
Citation Context ...f disputed Federalist papers. More recent work of a similar nature includes authorship analysis of a purported poem by Shakespeare (Thisted and Efron, 1987), identifying authors of software programs (Gray, Sallis, and MacDonell, 1997), and the use of techniques such as support vector machines (Diederich et al., 2003) for author identification. This work on author identification emphasizes the use of distinctive stylistic features... |

34 | Clustering and identifying temporal trends in document databases - Popescul, Flake, et al. - 2000
Citation Context ...ter the documents into groups containing similar semantic content, using any of a variety of well-known document clustering techniques (e.g., Cutting et al., 1992; McCallum, Nigam, and Ungar, 2000; Popescul et al., 2000). Each cluster of documents can then be associated with a latent topic (e.g., as represented by the mean term vector for documents in the cluster). While clustering can provide useful broad informati... |

29 | WEBSOM for textual data mining - Lagus - 1999
Citation Context ...s to represent the high-dimensional term vectors in a lower-dimensional space. Local regions in the lower-dimensional space can then be associated with specific topics. For example, the WEBSOM system (Lagus et al., 1999) uses nonlinear dimensionality reduction via self-organizing maps to represent term vectors in a two-dimensional layout. Linear projection techniques, such as latent semantic indexing (LSI), are also... |

25 | The Evolution of Stylometry - Holmes - 1998
Citation Context ...d little or no attention as far as we are aware. The areas of stylometry, authorship attribution, and forensic linguistics focus on the problem of identifying what author wrote a given piece of text (Holmes, 1998). For example, Mosteller and Wallace in their classic 1964 study used Bayesian techniques to infer whether Hamilton or Madison was the more likely author of disputed Federalist papers. More recent wo... |

22 | Exploring the computing literature using temporal graph visualization - Erten, Harding, et al. - 2003 |

8 | Did Shakespeare write a newly discovered poem?, Biometrika - Thisted, Efron - 1987
Citation Context ...es to infer whether Hamilton or Madison was the more likely author of disputed Federalist papers. More recent work of a similar nature includes authorship analysis of a purported poem by Shakespeare (Thisted and Efron, 1987), identifying authors of software programs (Gray, Sallis, and MacDonell, 1997), and the use of techniques such as support vector machines (Diederich et al., 2003) for author identification. This work... |

6 | Mining networks and central entities in digital libraries: a graph theoretic approach applied to co-author networks, Intelligent Data Analysis 2003 - Mutschke - 2003 |

1 | The author-topic model: a generative model for authors and documents, draft paper presented at - Steyvers, Rosen-Zvi, et al. - 2003
Citation Context ...sed on a single sample so that specific topics can be identified and interpreted; in tasks involving prediction of words and authors one can average over topics and use multiple samples when doing so (Steyvers et al., 2003). 3. AUTHOR-TOPICS FOR CITESEER 3.1 Learning the Model Our collection of CiteSeer abstracts contains D = 162,489 abstracts with K = 85,465 authors. We preprocessed the text by removing all punctuati... |