Results 1 - 10 of 16
Mining version histories to guide software changes
- In 26th International Conference on Software Engineering (ICSE 2004), 2004
Cited by 236 (20 self)
We apply data mining to version histories in order to guide programmers along related changes: “Programmers who changed these functions also changed...”. Given a set of existing changes, such rules (a) suggest and predict likely further changes, (b) show up item coupling that is indetectable by program analysis, and (c) prevent errors due to incomplete changes. After an initial change, our ROSE prototype can correctly predict 26% of further files to be changed, and 15% of the precise functions or variables. The topmost three suggestions contain a correct location with a likelihood of 64%.
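The "programmers who changed X also changed Y" idea can be illustrated with a minimal sketch of support/confidence rule mining over commit transactions. This is not the ROSE implementation, which the abstract does not describe; the function names, thresholds, and the toy history below are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def mine_cochange_rules(transactions, min_support=2, min_confidence=0.5):
    """Mine single-antecedent rules a -> b from sets of co-changed items.

    Returns {(a, b): (support, confidence)} where support is the number of
    transactions containing both items and confidence is support / count(a).
    The thresholds are illustrative defaults, not values from the paper.
    """
    item_count = Counter()
    pair_count = Counter()
    for changed in transactions:
        items = set(changed)
        item_count.update(items)
        # Ordered pairs, so that a -> b and b -> a are scored separately.
        pair_count.update(permutations(sorted(items), 2))

    rules = {}
    for (a, b), support in pair_count.items():
        confidence = support / item_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


def suggest(rules, changed_item, top_k=3):
    """Rank further change candidates for one already-changed item."""
    candidates = [
        (conf, sup, b) for (a, b), (sup, conf) in rules.items() if a == changed_item
    ]
    # Highest confidence first, then highest support, then name for stability.
    candidates.sort(key=lambda c: (-c[0], -c[1], c[2]))
    return [b for _, _, b in candidates[:top_k]]


# Hypothetical version history: each set is one commit's changed functions.
history = [
    {"parse()", "lex()"},
    {"parse()", "lex()", "emit()"},
    {"parse()", "emit()"},
    {"lex()"},
]
rules = mine_cochange_rules(history)
print(suggest(rules, "emit()"))
```

In this toy history every commit that touched `emit()` also touched `parse()`, so `parse()` is the only suggestion for `emit()`, with confidence 1.0.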
Learning a model of a web user’s interests
- In the 9th International Conference on User Modeling (UM2003), 2003
Approximate Query Translation Across Heterogeneous Information Sources
, 1999
Cited by 11 (0 self)
In this paper we present a mechanism for approximately translating Boolean query constraints across heterogeneous information sources. Achieving the best translation is challenging because sources support different constraints for formulating queries, and often these constraints cannot be precisely translated. For instance, a query [score > 8] might be "perfectly" translated as [rating > 0.8] at some site, but can only be approximated as [grade = A] at another. Unlike other work, our general framework adopts a customizable "closeness" metric for the translation that combines both precision and recall. Our results show that for query translation we need to handle interdependencies among both query conjuncts as well as disjuncts. As the basis, we identify the essential requirements of a rule system for users to encode the mappings for atomic semantic units. Our algorithm then translates complex queries by rewriting them in terms of the semantic units. We show that, un...
An Effective Complete-Web Recommender System
, 2003
Cited by 11 (7 self)
There are a number of recommendation systems that can suggest the webpages, within a single website, that other (purportedly similar) users have visited. By contrast, our goal is a system that can recommend "information content" (IC) pages --- i.e., pages that contain information relevant to the user --- from anywhere in the web. This paper describes how we addressed this challenge. We first collected a number of annotated user sessions, whose pages each include a bit indicating whether it was IC. Our system, ICPF, then used this collection to learn the characteristics of words that appear in such IC-pages, in terms of the word's "browsing features" (e.g., did the user follow links whose anchor included this word, etc.). This paper
Approximate Query Mapping: Accounting for Translation Closeness
, 2001
Cited by 7 (1 self)
In this paper we present a mechanism for approximately translating Boolean query constraints across heterogeneous information sources. Achieving the best translation is challenging because sources support different constraints for formulating queries, and often these constraints cannot be precisely translated. For instance, a query [score > 8] might be "perfectly" translated as [rating > 0.8] at some site, but can only be approximated as [grade = A] at another. Unlike other work, our general framework adopts a customizable "closeness" metric for the translation that combines both precision and recall. Our results show that for query translation we need to handle interdependencies among both query conjuncts as well as disjuncts. As the basis, we identify the essential requirements of a rule system for users to encode the mappings for atomic semantic units. Our algorithm then translates complex queries by rewriting them in terms of the semantic units. We show that, under practical assumptions, our algorithm generates the best approximate translations with respect to the closeness metric of choice. We also present a case study to show how our technique may be applied in practice.
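To make the idea of a "closeness" metric that combines precision and recall concrete, here is a minimal sketch that scores candidate translations of [score > 8] against a small sample of records. The abstract does not give the metric's formula, so the F-measure used here, the sample-based evaluation, and the record values are all assumptions, not the paper's rule-based algorithm.

```python
def closeness(original, translation, records, beta=1.0):
    """Score a translated constraint against the original on sample records.

    Assumption: closeness is taken to be the F-beta measure of the records
    selected by the translation relative to those selected by the original.
    """
    wanted = {i for i, r in enumerate(records) if original(r)}
    returned = {i for i, r in enumerate(records) if translation(r)}
    hits = len(wanted & returned)
    if hits == 0:
        return 0.0
    precision = hits / len(returned)
    recall = hits / len(wanted)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical sample: the target source only exposes a letter grade.
records = [
    {"score": 9.5, "grade": "A"},
    {"score": 9.0, "grade": "A"},
    {"score": 8.5, "grade": "B"},
    {"score": 8.0, "grade": "B"},
    {"score": 7.0, "grade": "B"},
    {"score": 6.0, "grade": "C"},
]


def original(r):
    return r["score"] > 8


candidates = {
    "grade = A": lambda r: r["grade"] == "A",
    "grade in {A, B}": lambda r: r["grade"] in {"A", "B"},
}

scores = {name: closeness(original, pred, records) for name, pred in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

On this sample, [grade = A] is precise but misses one matching record, while [grade in {A, B}] catches every match but adds two false positives. Raising `beta` weights recall more heavily and can flip the preferred translation, which is the sense in which such a metric is customizable.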
Enhancing Semantic Web Data Access
, 2006
Cited by 3 (3 self)
The Semantic Web was invented by Tim Berners-Lee in 1998 as a web of data for machine consumption. Its applicability in supporting real world applications on the World Wide Web, however, remains unclear to this day because most existing works treat the Semantic Web as one universal RDF graph and ignore the Web aspect. In fact, the Semantic Web is distributed on the Web as a web of belief: each piece of Semantic Web data is independently published on the Web as a certain agent’s belief instead of the universal truth. Therefore, we enhance the current conceptual model of the Semantic Web to characterize both the content and the context of Semantic Web data. A significant sample dataset is harvested to demonstrate the non-trivial presence and the global properties of the Semantic Web on the Web. Based on the enhanced conceptual model, we introduce a novel search and navigation model for the unique behaviors in Web-scale Semantic Web data access, and develop an enabling tool, the Swoogle Semantic Web search engine. To evaluate the data quality of Semantic Web data, we also (i) develop an explainable ranking schema that orders the popularity of Semantic Web documents and terms, and (ii) introduce a new level of granularity of Semantic Web data, the RDF molecule, which supports lossless RDF graph decomposition and effective provenance tracking. This dissertation systematically investigates the Web aspect of the Semantic Web. Its primary contributions are the enhanced conceptual model of the Semantic Web, the novel Semantic Web search and navigation model, and the Swoogle Semantic Web search engine.
Effective Self-Training Author Name Disambiguation in Scholarly Digital Libraries
Cited by 2 (1 self)
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, addressing the issues of (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only few examples are available, are the need of the hour for such systems. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need of any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
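The first step described above, grouping citation records by coauthor similarity to obtain training examples without manual labels, can be sketched roughly as follows. This is not SAND's actual clustering method, which the abstract does not specify. The sketch assumes the simplest possible similarity (two records of the same ambiguous name are linked if they share an exactly matching coauthor name), and all author names are hypothetical.

```python
def cluster_by_shared_coauthor(records):
    """Group record indices that are linked through shared coauthor names.

    `records` is a list of sets; each set holds the coauthor names of one
    citation record for the same ambiguous author name. Exact string match
    stands in for the "similarity among coauthor names" (an assumption).
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # coauthor name -> index of the first record listing it
    for idx, coauthors in enumerate(records):
        for name in coauthors:
            if name in first_seen:
                parent[find(idx)] = find(first_seen[name])
            else:
                first_seen[name] = idx

    clusters = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Hypothetical coauthor sets for five records of one ambiguous author name.
records = [
    {"R. Silva", "M. Costa"},
    {"M. Costa"},
    {"J. Lee"},
    {"J. Lee", "P. Kim"},
    set(),  # single-author record: stays in its own cluster
]
clusters = cluster_by_shared_coauthor(records)
print(clusters)
```

Each resulting cluster could then serve as automatically acquired training examples for one presumed author. Records with no coauthors end up isolated, which is one reason a second, supervised step that can detect unseen authors is needed.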
Predicting Web Information Content
Cited by 1 (1 self)
This paper introduces a novel method for predicting the current information need of a web user from the content of the pages the user has visited and the actions the user has applied to these pages. This inference is based on a parameterized model of how the sequence of actions chosen by the user indicates the degree to which page content satisfies the user’s information need. We show that the model parameters can be estimated using standard methods from a labelled corpus. Data from lab experiments demonstrate that the prediction model can effectively identify the information needs of new users, browsing previously unseen pages. The paper concludes with an overview of our “complete-web” recommendation system, WebIC, which uses the prediction model to recommend useful pages to the user, from anywhere on the Web.
Hyperclique Pattern Discovery
- Data Mining and Knowledge Discovery Journal, 2006
Existing algorithms for mining association patterns often rely on the support-based pruning strategy to prune a combinatorial search space. However, this strategy is not effective for discovering potentially interesting patterns at low levels of support. Also, it tends to generate too many spurious patterns involving items which are from different support levels and are poorly correlated. In this paper, we present a framework for mining highly-correlated association patterns called hyperclique patterns. In this framework, an objective measure called h-confidence is applied to discover hyperclique patterns. We prove that the objects in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another as measured by the cosine similarity (uncentered Pearson's correlation coefficient). Also, we show that the h-confidence measure satisfies a cross-support property which can help efficiently eliminate spurious patterns involving items with substantially different support levels. Indeed, this cross-support property is not limited to h-confidence and can be generalized to some other association measures. In addition, an algorithm called hyperclique miner is proposed to exploit both cross-support and anti-monotone properties of the h-confidence measure for the efficient discovery of hyperclique patterns. Finally, our experimental results show that hyperclique miner can efficiently identify hyperclique patterns, even at extremely low levels of support. Keywords: Association Analysis, Hyperclique Patterns, H-confidence
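The abstract names the h-confidence measure without defining it. A minimal sketch follows, assuming the usual definition: the support of the itemset divided by the largest support of any single item in it. The sketch also computes the cross-support upper bound (smallest item support over largest item support), which holds because an itemset's support cannot exceed that of its rarest item. The helper names and toy transactions are illustrative assumptions, and this is not the hyperclique miner algorithm itself.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def h_confidence(itemset, transactions):
    """supp(itemset) / max single-item support (assumed standard definition)."""
    joint = support(itemset, transactions)
    largest = max(support([i], transactions) for i in itemset)
    return joint / largest if largest else 0.0


def cross_support_bound(itemset, transactions):
    """Upper bound on h-confidence: min item support / max item support.

    If this bound is already below the h-confidence threshold, the itemset
    can be discarded without counting its joint support.
    """
    supports = [support([i], transactions) for i in itemset]
    return min(supports) / max(supports) if max(supports) else 0.0


# Toy data: a and b always occur together; c is frequent but unrelated.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"c"}),
    frozenset({"c"}),
    frozenset({"c"}),
    frozenset({"c"}),
    frozenset({"c"}),
]

print(h_confidence({"a", "b"}, transactions))
print(h_confidence({"a", "c"}, transactions))
print(cross_support_bound({"a", "c"}, transactions))
```

Here {a, b} is a strong pattern despite its modest support, while {a, c} mixes items from different support levels and scores low. With a threshold above 0.5, the bound alone would prune {a, c}.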

