Results 1 - 5 of 5
Exploiting Tag and Word Correlations for Improved Webpage Clustering
Abstract
Automatic clustering of webpages helps a number of information retrieval tasks, such as improving user interfaces, collection clustering, introducing diversity in search results, etc. Typically, webpage clustering algorithms only use features extracted from the page-text. However, the advent of social-bookmarking websites, such as StumbleUpon and Delicious, has led to a huge amount of user-generated content such as the tag information that is associated with the webpages. In this paper, we present a subspace based feature extraction approach which leverages tag information to complement the page-contents of a webpage to extract highly discriminative features, with the goal of improved clustering performance. In our approach, we consider page-text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We compare our subspace based approach with a number of baselines that use tag information in various other ways, and show that the subspace based approach leads to improved performance on the webpage clustering task. Although our results here are on the webpage clustering task, the same approach can be used for webpage classification as well. In the end, we also suggest possible future work for leveraging tag information in webpage clustering, especially when tag information is present for not all, but only for a small number of webpages.
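The two-view idea in this abstract can be illustrated with a small sketch. The abstract does not name the subspace method, so the code below assumes plain linear canonical correlation analysis (CCA) as a stand-in; the function names `fit_cca` and `kmeans`, the regularization value, and the synthetic "text" and "tags" matrices are all hypothetical and not taken from the paper.

```python
import numpy as np

def fit_cca(X, Y, n_components, reg=1e-3):
    """Linear CCA (an assumed stand-in for the paper's subspace method)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    # Whiten each view with its Cholesky factor, then take the SVD of the
    # whitened cross-covariance. The singular values are the canonical
    # correlations of the regularized problem.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Ly, np.linalg.solve(Lx, Cxy).T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return Wx, Wy, s[:n_components]

def kmeans(Z, k, n_iter=50):
    """Plain Lloyd's k-means with farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = Z[assign == j].mean(axis=0)
    return assign

# Synthetic stand-ins for the two views (hypothetical data): both views
# are noisy functions of the same hidden two-topic label.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
z = 2.0 * labels - 1.0
text = z[:, None] * np.ones(6) + 0.5 * rng.standard_normal((200, 6))
tags = z[:, None] * np.ones(4) + 0.5 * rng.standard_normal((200, 4))

Wx, Wy, corr = fit_cca(text, tags, n_components=1)
Z = (text - text.mean(axis=0)) @ Wx   # text view projected into the subspace
assign = kmeans(Z, k=2)               # any clustering algorithm could go here
agreement = max(np.mean(assign == labels), np.mean(assign != labels))
print("top canonical correlation:", corr[0])
print("cluster agreement:", agreement)
```

Clustering on the projected text view alone is one design choice among several; concatenating or averaging the two projected views would be equally consistent with the abstract.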
Attribute Learning Using Joint Human and Machine Computation
2011
Abstract
Human computation is the study of systems where humans perform a major part of the computation or are an integral part of the overall computational process. The ESP Game, for example, is a human computation system that maps images to tags by engaging humans to play a game in which they are rewarded each time they agree on a description for an image. It was shown that these so-called Games with a Purpose are a reliable way to quickly collect millions of accurate image descriptors, which can then be used to index images and facilitate search. However, most existing human computation systems operate without any machine intervention. Likewise, very few supervised learning systems take advantage of these powerful new platforms to elicit help from human teachers. It is therefore largely unknown what more a human computation system can achieve with machines in the loop. This thesis is centered around the problem of attribute learning – using the joint effort of human game players and machine learning algorithms to determine that a piece of music is “soothing”, that the bird in an image “has a red beak”, or that Ernest Hemingway is a “Nobel Prize winning author”. In particular, our work focuses on two aspects of the problem – how to acquire attributes and attribute values from human computers using incentive-compatible game mechanisms, and what active learning strategies to employ for attribute and attribute value acquisition.
Research Statement
Abstract
I am motivated by the prospect of computers that learn, by interacting and collaborating with humans, how to solve problems. Such systems might take several forms. For example, imagine you are an entrepreneur and you want to train a computer to help you analyze what people are saying about your products. You have domain knowledge about your business and the decisions you want the system to make, such as identifying positive vs. negative product reviews across the Internet. You might want to initialize the system with your background knowledge (e.g., the words “wonderful” and “terrible” indicate high and low customer satisfaction, respectively), have it inspect substantial amounts of relevant text data, and then allow it to ask questions to help refine its understanding of your goals (e.g., is “predictable” a positive word for your product? The answer may depend on whether you make kitchen appliances or write novels). Alternatively, imagine you are a biologist with a high-throughput laboratory technique to test hundreds of proteins in tandem. You would like a computer to analyze hundreds (even thousands) of these measurements, induce hypotheses that might explain the data and communicate them to you (which you might want to edit based on your knowledge or intuition), and propose subsequent experiments in order to refine these hypotheses, or potentially discover other proteins with the properties you study.
Teaching Statement
Abstract
Computer and information sciences are among the most relevant disciplines of the twenty-first century. As digital technology is more tightly woven into the fabric of society (for example, via the Web or in mobile devices), real-world problems rely more and more on computational solutions. Teaching offers a tremendous opportunity to shape the future generations of researchers and engineers, as well as computer-literate artists, entrepreneurs, doctors, and policy-makers, who will address these problems. Summary of Experience: I have had the opportunity to teach in several different capacities: giving guest lectures in my areas of expertise during my postdoc at Carnegie Mellon, as an instructor and teaching assistant during graduate school at the University of Wisconsin–Madison, and as an informal mentor at both institutions. As an instructor at UW–Madison, I developed curricula, lectures, and assignments for two very different courses. Artificial Intelligence (CS 540, summer 2003) is the introductory course in AI for graduate students and advanced undergraduates. This course covers intelligent agents, game playing, logic, planning, and basic machine learning (enrollment: 30, student evaluation: 4.3/5.0). Computational Biology & Biostatistics (summer 2008) is a fast-paced seminar course for participants in an undergraduate summer program aimed at attracting women and minority students to research in bioinformatics and biostatistics. This course covers basic molecular biology, dynamic programming for sequence alignment, and statistical models for gene expression and biomedical text analysis (enrollment: 6). I also served as a TA for
Leveraging Social Bookmarks from Partially Tagged Corpus for Improved Webpage Clustering
Abstract
Automatic clustering of webpages helps a number of information retrieval tasks, such as improving user interfaces, collection clustering, introducing diversity in search results, etc. Typically, webpage clustering algorithms only use features extracted from the page-text. However, the advent of social-bookmarking websites, such as StumbleUpon.com and Delicious.com, has led to a huge amount of user-generated content such as the social tag information that is associated with the webpages. In this paper, we present a subspace based feature extraction approach which leverages the social tag information to complement the page-contents of a webpage for extracting better features, with the goal of improved clustering performance. In our approach, we consider page-text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We then present an extension that makes our approach applicable even if the webpage corpus is only partially tagged, i.e., when social tags are present not for all webpages but only for a small number of them. We compare our subspace based approach with a number of baselines that use tag information in various other ways, and show that the subspace based approach leads to improved performance on the webpage clustering task. We also discuss some possible future work, including an active learning extension that can help in choosing which webpages to get tags for when social tags can be obtained for only a small number of webpages.
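One standard way to write down "a shared subspace that maximizes the correlation between the two views" is the canonical correlation analysis objective. The abstract does not state which formulation the authors use, so the following is only an assumed reading, and the symbols are introduced here for illustration: x denotes a page-text feature vector, y a tag feature vector, and C_xx, C_yy, C_xy the within-view and cross-view covariance matrices.

```latex
\[
(w_x^{*}, w_y^{*}) \;=\; \arg\max_{w_x,\, w_y}\;
\frac{w_x^{\top} C_{xy}\, w_y}
     {\sqrt{w_x^{\top} C_{xx}\, w_x}\;\sqrt{w_y^{\top} C_{yy}\, w_y}}
\]
```

Under this reading, a page is represented in the subspace by w_x^T x (and further directions found the same way), and clustering runs on those projected features.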

