Results 1 - 10
of
83
SPARK: Top-k keyword query in relational databases
- In Proceedings of SIGMOD
, 2007
"... With the increasing amount of text data stored in relational databases, there is a demand for RDBMS to support keyword queries over text data. As a search result is often assembled from multiple relational tables, traditional IR-style ranking and query evaluation methods cannot be applied directly. ..."
Abstract
-
Cited by 36 (4 self)
- Add to MetaCart
With the increasing amount of text data stored in relational databases, there is a demand for RDBMS to support keyword queries over text data. As a search result is often assembled from multiple relational tables, traditional IR-style ranking and query evaluation methods cannot be applied directly. In this paper, we study the effectiveness and the efficiency issues of answering top-k keyword query in relational database systems. We propose a new ranking formula by adapting existing IR techniques based on a natural notion of virtual document. Compared with previous approaches, our new ranking method is simple yet effective, and agrees with human perceptions. We also study efficient query processing methods for the new ranking method, and propose algorithms that have minimal accesses to the database. We have conducted extensive experiments on large-scale real databases using two popular RDBMSs. The experimental results demonstrate significant improvement to the alternative approaches in terms of retrieval effectiveness and efficiency. Categories and Subject Descriptors
Automatic extraction of titles from general documents using machine learning
- In Proc. of ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL
, 2005
"... In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether it could be possible to conduct automatic title extraction from general documents. As a case study, we consider extraction from Office including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. Precision and recall for title extraction from Word is 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint is 0.875 and 0.895 respectively in an experiment on intranet data. Other important new findings in this work include that we can train models in one domain and apply them to another domain, and more surprisingly we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
Web object retrieval
- In Proc. WWW
, 2007
"... The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects embedded in static Web pages and online Web databases. In this paper, we propose a paradigm shift to enable searching at the obje ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects embedded in static Web pages and online Web databases. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context when multiple copies of information about the same object typically exist. These copies may be inconsistent because of diversity of Web site qualities and the limited performance of current information extraction techniques. In this paper, we propose several language models for Web object retrieval. We test these models on our academic search engine called Libra and compare their performances. 1.
Optimisation methods for ranking functions with multiple parameters
- In CIKM ’06: Proceedings of the 15th ACM international conference on Information and knowledge management
, 2006
"... parameters ..."
Adding Geographic Scopes to Web Resources
- CEUS - Computers, Environment and Urban Systems
, 2006
"... Many Web pages are rich in geographic information and primarily relevant to geographically limited communities. However, existing IR systems only recently began to offer local services and largely ignore geo-spatial information. This paper presents our work on automatically identifying the geographi ..."
Abstract
-
Cited by 16 (6 self)
- Add to MetaCart
Many Web pages are rich in geographic information and primarily relevant to geographically limited communities. However, existing IR systems only recently began to offer local services and largely ignore geo-spatial information. This paper presents our work on automatically identifying the geographical scope of Web documents, which provides the means to develop retrieval tools that take the geographical context into consideration. Our approach makes extensive use of an ontology of geographical concepts, and includes a system architecture for extracting geographic information from large collections of Web documents. The proposed method involves recognising geographical references over the documents and assigning geographical scopes through a graph ranking algorithm. Initial evaluation results are encouraging, indicating the viability of this approach.
Online expansion of rare queries for sponsored search
- In Proceedings of the 18th International World Wide Web Conference
, 2009
"... Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user’s query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are larg ..."
Abstract
-
Cited by 16 (6 self)
- Add to MetaCart
Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user’s query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real-time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.
Automatic Feature Selection in the Markov Random Field Model for Information Retrieval
- In Proceedings of CIKM’07
, 2007
"... Previous applications of the Markov random field model for information retrieval have used manually chosen features. However, it is often difficult or impossible to know, a priori, the best set of features to use for a given task or data set. Therefore, there is a need to develop automatic feature s ..."
Abstract
-
Cited by 16 (6 self)
- Add to MetaCart
Previous applications of the Markov random field model for information retrieval have used manually chosen features. However, it is often difficult or impossible to know, a priori, the best set of features to use for a given task or data set. Therefore, there is a need to develop automatic feature selection techniques. In this paper we describe a greedy procedure for automatically selecting features to use within the Markov random field model for information retrieval. We also propose a novel, robust method for describing classes of textual information retrieval features. Experimental results, evaluated on standard TREC test collections, show that our feature selection algorithm produces models that are either significantly more effective than, or equally effective as, models with manually selected features, such as those used in the past.
Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval
"... We describe indexing and retrieval techniques that are suited to perform terabyte-scale information retrieval tasks on a standard desktop PC. Starting from an Okapi-BM25-based default baseline retrieval function, we explore both sides of the effectiveness spectrum. On one side, we show how term prox ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
We describe indexing and retrieval techniques that are suited to perform terabyte-scale information retrieval tasks on a standard desktop PC. Starting from an Okapi-BM25-based default baseline retrieval function, we explore both sides of the effectiveness spectrum. On one side, we show how term proximity can be integrated into the scoring function in order to improve the search results. On the other side, we show how index pruning can be employed to increase retrieval efficiency -- at the cost of reduced retrieval effectiveness. We show that
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
- In Proceedings of the 28th Annual international ACM SIGIR Conference on Research and Development in information Retrieval
, 2005
"... This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML ..."
Abstract
-
Cited by 13 (5 self)
- Add to MetaCart
This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification on HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in largest font size as title (20.9%-32.6 % improvement in F1 score). As application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML documents retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (23.1 %-29.0 % improvements).

