New Regularized Algorithms for Transductive Learning
Cited by 7 (4 self)
Abstract. We propose a new graph-based label propagation algorithm for transductive learning. Each example is associated with a vertex in an undirected graph, and a weighted edge between two vertices represents the similarity between the two corresponding examples. We build on Adsorption, a recently proposed algorithm, and analyze its properties. We then state our learning algorithm as a convex optimization problem over multi-label assignments and derive an efficient algorithm to solve this problem. We state the conditions under which our algorithm is guaranteed to converge. We provide experimental evidence on various real-world datasets demonstrating the effectiveness of our algorithm over other algorithms for such problems. We also show that our algorithm can be extended to incorporate additional prior information, and demonstrate it by classifying data where the labels are not mutually exclusive. Key words: label propagation, transductive learning, graph-based semi-supervised learning.
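As a rough illustration of graph-based label propagation in general (not the algorithm proposed in this paper, and not Adsorption), the sketch below repeatedly replaces the label scores of each unlabeled vertex with the weighted average of its neighbors' scores while keeping the seed labels fixed. The function name `propagate_labels`, the dict-based graph format, and the toy data are assumptions made for illustration only.

```python
def propagate_labels(edges, seeds, num_iters=50):
    """Simple iterative label propagation on an undirected weighted graph.

    edges: dict mapping (u, v) -> positive similarity weight (hypothetical format).
    seeds: dict mapping vertex -> known label.
    Returns a dict mapping every vertex -> {label: score}.
    """
    # Build a symmetric adjacency structure from the edge list.
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w

    labels = sorted(set(seeds.values()))
    # Seeds start with all mass on their label; other vertices start at zero.
    scores = {
        node: {lab: 1.0 if seeds.get(node) == lab else 0.0 for lab in labels}
        for node in neighbors
    }

    for _ in range(num_iters):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = scores[node]  # clamp labeled vertices
                continue
            total = sum(nbrs.values())
            # Weighted average of the neighbors' current scores.
            updated[node] = {
                lab: sum(w * scores[n][lab] for n, w in nbrs.items()) / total
                for lab in labels
            }
        scores = updated
    return scores


# Toy chain a - b - c - d with one seed at each end.
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
seeds = {"a": "pos", "d": "neg"}
result = propagate_labels(edges, seeds)
predicted = {node: max(s, key=s.get) for node, s in result.items()}
```

On this toy chain each unlabeled vertex takes the label of the nearer seed. Multi-label settings, where labels are not mutually exclusive, could read off every label whose score exceeds a threshold instead of taking the maximum.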
Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition
Cited by 7 (0 self)
Graph-based semi-supervised learning (SSL) algorithms have been successfully used to extract class-instance pairs from large unstructured and structured text collections. However, a careful comparison of different graph-based SSL algorithms on that task has been lacking. We compare three graph-based SSL algorithms for class-instance acquisition on a variety of graphs constructed from different domains. We find that the recently proposed MAD algorithm is the most effective. We also show that class-instance extraction can be significantly improved by adding semantic information in the form of instance-attribute edges derived from an independently developed knowledge base. All of our code and data will be made publicly available to encourage reproducible research in this area.
Internet Ad Auctions: Insights and Directions
Cited by 5 (1 self)
Abstract. On the Internet, there are advertisements (ads) of different kinds: image, text, video and other specially marked objects that are distinct from the underlying content of the page. There is an industry behind the management of such ads, and it faces a number of algorithmic challenges. This note will present a small selection of such problems, some insights and open research directions.
Automatically Incorporating New Sources in Keyword Search-Based Data Integration
Cited by 5 (2 self)
Scientific data offers some of the most interesting challenges in data integration today. Scientific fields evolve rapidly and accumulate masses of observational and experimental data that needs to be annotated, revised, interlinked, and made available to other scientists. From the perspective of the user, this can be a major headache as the data they seek may initially be spread across many databases in need of integration. Worse, even if users are given a solution that integrates the current state of the source databases, new data sources appear with new data items of interest to the user. Here we build upon recent ideas for creating integrated views over data sources using keyword search techniques, ranked answers, and user feedback [32] to investigate how to automatically discover when a new data source has content relevant to a user’s view — in essence, performing automatic data integration for incoming data sets. The new architecture accommodates a variety of methods to discover related attributes, including label propagation algorithms from the machine learning community [2] and existing schema matchers [11]. The user may provide feedback on the suggested new results, helping the system repair any bad alignments or increase the cost of including a new source that is not useful. We evaluate our approach on actual bioinformatics schemas and data, using state-of-the-art schema matchers as components. We also discuss how our architecture can be adapted to more traditional settings with a mediated schema.
Power Iteration Clustering
Cited by 4 (3 self)
We show that the power iteration, typically used to approximate the dominant eigenvector of a matrix, can be applied to a normalized affinity matrix to create a one-dimensional embedding of the underlying data. This embedding is then used, as in spectral clustering, to cluster the data via k-means. We demonstrate this method’s effectiveness and scalability on several synthetic and real datasets, and conclude that to find a meaningful low-dimensional embedding for clustering, it is not necessary to find any eigenvectors: we just need a linear combination of the top eigenvectors.
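The idea in this abstract can be sketched as follows: run a small, fixed number of power-iteration steps on a row-normalized affinity matrix, stop well before convergence, and cluster the resulting one-dimensional vector with k-means. This is a minimal sketch under stated assumptions, not the authors' implementation: the row normalization, the starting vector, the fixed iteration count, the tiny two-cluster k-means, and the names `power_iteration_embedding` and `two_means_1d` are all choices made here for illustration.

```python
def power_iteration_embedding(affinity, num_iters=10):
    """Run a few power-iteration steps on the row-normalized affinity matrix.

    affinity: square list of lists of non-negative similarities, where
    every row is assumed to have a positive sum.
    Returns a one-dimensional embedding (one float per data point).
    """
    n = len(affinity)
    # Row-normalize so each row sums to 1 (one possible normalization).
    walk = [[a / sum(row) for a in row] for row in affinity]
    # Arbitrary non-uniform start; a uniform vector would never change.
    vec = [(i + 1) / (n * (n + 1) / 2) for i in range(n)]
    for _ in range(num_iters):
        vec = [sum(walk[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in vec)
        vec = [x / norm for x in vec]  # keep the L1 norm at 1
    return vec


def two_means_1d(values, num_iters=20):
    """Tiny k-means for k=2 on scalars, started from the min and max values."""
    centers = [min(values), max(values)]
    assignment = [0] * len(values)
    for _ in range(num_iters):
        assignment = [
            0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in values
        ]
        for c in (0, 1):
            members = [x for x, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment


# Two triangles (points 0-2 and 3-5) joined by one weak edge between 2 and 3.
affinity = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
embedding = power_iteration_embedding(affinity)
clusters = two_means_1d(embedding)
```

Stopping early matters in this sketch: values inside each triangle become nearly equal after a few steps, while the two triangles still sit at different levels. Running the iteration to convergence would flatten the vector to a constant and erase the cluster structure.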
Mining Advertiser-specific User Behavior Using Adfactors
Cited by 4 (0 self)
Consider an online ad campaign run by an advertiser. The ad serving companies that handle such campaigns record users’ behavior that leads to impressions of campaign ads, as well as users’ responses to such impressions. This is summarized and reported to the advertisers to help them evaluate the performance of their campaigns and make better budget allocation decisions. The most popular reporting statistics are the click-through rate and the conversion rate. While these are indicative of the effectiveness of an ad campaign, the advertisers often seek to understand more sophisticated long-term effects of their ads on the brand awareness and the user behavior that leads to the conversion, thus creating a need for the reporting measures that can capture both the duration and the
Semi-Supervised Classification of Network Data Using Very Few Labels, 2009
Cited by 3 (2 self)
The goal of semi-supervised learning methods is to reduce the amount of labeled training data required by learning from both labeled and unlabeled instances. We make contributions towards this goal along several dimensions. Macskassy and Provost [13] proposed the weighted-vote relational neighbor classifier (wvRN) as a simple yet solid baseline for semi-supervised learning on network data. It is shown to be essentially the same as the Gaussian-field classifier proposed by Zhu et al. [22] and proves to be very effective on many benchmark network datasets. First, we describe another simple and intuitive semi-supervised learning method based on random graph walk that outperforms wvRN by a large margin on several benchmark datasets when very few labels are available. Second, we show that using authoritative instances as training seeds (instances that arguably cost much less to label) dramatically reduces the amount of labeled data required to achieve the same classification accuracy. For some existing state-of-the-art semi-supervised learning methods the labeled data needed is reduced by a factor of 50. Third, we offer insights as to why learning methods based on random graph walk are able to more fully exploit the unlabeled data than previous methods. Based on the above observations, we strongly recommend the proposed method as a strong baseline for future research on semi-supervised classification of network data.
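One generic way to classify nodes with a random graph walk is to run a random walk with restart from the seed nodes of each class and assign every node to the class whose walk visits it most. The sketch below illustrates that general idea only and is not necessarily the method proposed in this paper. The damping value of 0.85, the unweighted adjacency-list format, the assumption that every node has at least one neighbor, and the names `random_walk_scores` and `classify` are all assumptions made here.

```python
def random_walk_scores(neighbors, seed_nodes, damping=0.85, num_iters=100):
    """Random walk with restart on an unweighted, undirected graph.

    neighbors: dict mapping node -> list of adjacent nodes (each non-empty).
    seed_nodes: set of nodes the walk teleports back to with
    probability 1 - damping at every step.
    Returns a dict mapping node -> visiting score.
    """
    nodes = list(neighbors)
    restart = {n: (1.0 / len(seed_nodes) if n in seed_nodes else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(num_iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            # Spread the damped score of n evenly over its neighbors.
            share = damping * rank[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        rank = nxt
    return rank


def classify(neighbors, seeds_by_class):
    """Label each node with the class whose seeded walk scores it highest."""
    scores = {
        cls: random_walk_scores(neighbors, seed_set)
        for cls, seed_set in seeds_by_class.items()
    }
    return {n: max(scores, key=lambda cls: scores[cls][n]) for n in neighbors}


# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3, one seed per class.
neighbors = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}
labels = classify(neighbors, {"A": {0}, "B": {5}})
```

With a single seed per class, every node in each triangle still receives the label of the seed in its own triangle, which mirrors the very-few-labels setting the abstract describes.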
Using Word-Sense Disambiguation Methods to Classify Web Queries by Intent
Cited by 2 (1 self)
Three methods are proposed to classify queries by intent (CQI), e.g., navigational, informational, commercial, etc. Following mixed-initiative dialog systems, search engines should distinguish navigational queries where the user is taking the initiative from other queries where there are more opportunities for system initiatives (e.g., suggestions, ads). The query intent problem has a number of useful applications for search engines, affecting how many (if any) advertisements to display, which results to return, and how to arrange the results page. Click logs are used as a substitute for annotation. Clicks on ads are evidence for commercial intent; other types of clicks are evidence for other intents. We start with a simple Naïve Bayes baseline that works well when there is plenty of training data. When training data is less plentiful, we back off to nearby URLs in a click graph, using a method similar to Word-Sense Disambiguation. Thus, we can infer that designer trench is commercial because it is close to www.saksfifthavenue.com, which is known to be commercial. The baseline method was designed for precision and the backoff method was designed for recall. Both methods are fast and do not require crawling webpages. We recommend a third method, a hybrid of the two, that does no harm when there is plenty of training data, and generalizes better when there isn’t, as a strong baseline for the CQI task.
1 Classify Queries By Intent (CQI)
Determining query intent is an important problem for today’s search engines. Queries are short (consisting of 2.2 terms on average (Beitzel et al., 2004)) and contain ambiguous terms. Search engines need to derive what users want from this limited source of information. Users may be searching for a specific page, browsing for information, or trying to buy something. Guessing the correct intent is important for returning relevant items. Someone searching for designer trench is likely to be interested in results or ads for trench coats, while someone searching for world war I trench might be irritated by irrelevant clothing advertisements.
VideoMule: A Consensus Learning Approach to Multi-Label Classification from Noisy User-Generated
Cited by 2 (1 self)
With the growing proliferation of conversational media and devices for generating multimedia content, the Internet has seen an expansion in websites catering to user-generated media. Most of the user-generated content is multimodal in nature as it has videos, audio, text (in the form of tags), comments and so on. Content analysis is a challenging problem on this type of media since it is noisy, unstructured and unreliable. In this paper we propose VideoMule, a consensus learning approach for multi-label video classification from noisy user-generated videos. In our scheme, we train classification and clustering algorithms on individual modes of information such as user comments, tags, video features and so on. We then combine the results of trained classifiers and clustering algorithms using a novel heuristic consensus learning algorithm which as a whole performs better than each individual learning model.
of all traffic on the web. This statistic is expected to grow over the next couple of years [1]. There are several common characteristics of the data in these content-sharing websites. Most of the content-sharing websites allow for seamless uploading of videos in standardized formats; they also allow for tagging and commenting on these videos, sharing of videos between users, and embedding in an HTML page. In addition to this, many websites allow for rating and commenting on videos by the users. In this way, a typical document in a content-sharing website contains not only videos, but also their associated meta-data like video description,
Typed Graph Models for Semi-Supervised Learning of Name Ethnicity
Cited by 2 (0 self)
This paper presents an original approach to semi-supervised learning of personal name ethnicity from typed graphs of morphophonemic features and first/last-name co-occurrence statistics. We frame this as a general solution to an inference problem over typed graphs where the edges represent labeled relations between features that are parameterized by the edge types. We propose a framework for parameter estimation on different constructions of typed graphs for this problem using a gradient-free optimization method based on grid search. Results on both in-domain and out-of-domain data show significant gains, with over 30% accuracy improvement, using the techniques presented in the paper.

