The Importance of Prior Probabilities for Entry Page Search
- PROCEEDINGS OF THE 25TH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL
, 2002
Cited by 114 (16 self)
An important class of searches on the world-wide-web has the goal to find an entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links and URL form. Especially the URL form proved to be a good predictor. Using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability.
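The abstract's last sentence, that a non-content feature can enter a language model as a prior probability, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the URL categories, the `url_form` heuristic, the prior values in `URL_PRIOR`, and the smoothing choices are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical prior P(entry page | URL form); values are illustrative only.
URL_PRIOR = {"root": 0.6, "subroot": 0.25, "path": 0.1, "file": 0.05}


def url_form(url):
    """Crude URL classifier (an assumption, not the paper's definition)."""
    path = url.split("://", 1)[-1].partition("/")[2]
    if path.endswith("index.html"):
        path = path[: -len("index.html")]
    if path == "":
        return "root"
    if path.endswith("/"):
        return "subroot" if path.count("/") == 1 else "path"
    return "file"


def score(query, doc_tokens, url, collection, lam=0.5):
    """Return log P(D) + sum over query terms of log smoothed P(t | D)."""
    tf, n = Counter(doc_tokens), len(doc_tokens)
    total, vocab = sum(collection.values()), len(collection)
    s = math.log(URL_PRIOR[url_form(url)])  # the prior enters as one extra term
    for t in query:
        p_doc = tf[t] / n if n else 0.0
        # Add-one smoothing keeps the collection probability above zero.
        p_col = (collection[t] + 1) / (total + vocab)
        s += math.log((1 - lam) * p_doc + lam * p_col)
    return s
```

With identical page text, a page at a root URL outscores one at a file URL purely because of the prior, which is the effect the abstract describes.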
Query Type Classification for Web Document Retrieval
- IN PROCEEDINGS OF THE 26TH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL
, 2003
Cited by 61 (1 self)
The heterogeneous Web exacerbates IR problems and short user queries make them worse. The contents of web documents are not enough to find good answer documents. Link information and URL information compensates for the insufficiencies of content information. However, static combination of multiple evidences may lower the retrieval performance. We need different strategies to find target documents according to a query type. We can classify user queries as three categories, the topic relevance task, the homepage finding task, and the service finding task. In this paper, a user query classification scheme is proposed. This scheme uses the difference of distribution, mutual information, the usage rate as anchor texts, and the POS information for the classification. After we classified a user query, we apply different algorithms and information for the better results. For the topic relevance task, we emphasize the content information, on the other hand, for the homepage finding task, we emphasize the Link information and the URL information. We could get the best performance when our proposed classification method with the OKAPI scoring algorithm was used.
Improving pseudo-relevance feedback in web information retrieval using web page segmentation
- In Intl. World Wide Web Conf. (WWW)
, 2003
Cited by 56 (10 self)
In contrast to traditional document retrieval, a web page as a whole is not a good information unit to search because it often contains multiple topics and a lot of irrelevant information from navigation, decoration, and interaction part of the page. In this paper, we propose a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page. Compared with simple DOM based segmentation method, our page segmentation scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. By using our VIPS algorithm to assist the selection of query expansion terms in pseudo-relevance feedback in web information retrieval, we achieve 27% performance improvement on Web Track dataset.
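The pseudo-relevance feedback step mentioned in this abstract can be sketched generically: take the top-ranked units from an initial retrieval run and pick frequent non-query terms as expansion terms. This is only a sketch under stated assumptions. It does not implement VIPS; the function name `expansion_terms`, its parameters, and the raw-frequency term selection are hypothetical. The units passed in may be whole pages or blocks produced by some segmentation step.

```python
from collections import Counter


def expansion_terms(query, ranked_units, top_k=2, n_terms=3):
    """Pick the most frequent non-query terms from the top_k ranked units.

    ranked_units is a list of token lists, best-ranked first. Each unit may
    be a whole page or a block from a segmenter (not implemented here).
    """
    query_set = set(query)
    counts = Counter()
    for unit in ranked_units[:top_k]:
        counts.update(t for t in unit if t not in query_set)
    return [t for t, _ in counts.most_common(n_terms)]


# Hypothetical initial ranking for the query ["jaguar"].
ranked = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "speed", "car"],
    ["jaguar", "cat"],
]
expanded_query = ["jaguar"] + expansion_terms(["jaguar"], ranked, n_terms=1)
```

Feeding blocks rather than whole pages into such a step restricts the candidate terms to the parts of a page that matched, which is the motivation the abstract gives for segmentation.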
Extracting content structure for web pages based on visual representation
- Proc. 5th Asia Pacific Web Conference
, 2003
Cited by 37 (6 self)
A new web content structure based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on his visual perception. Comparing to other existing techniques, our approach is independent to underlying documentation representation such as HTML and works well even when the HTML structure is far different from layout structure. Experiments show satisfactory results.
Query Expansion using Associated Queries
- IN PROC. INT. CONF. ON INFORMATION AND KNOWLEDGE MANAGEMENT
, 2003
Cited by 25 (6 self)
Hundreds of millions of users each day use web search engines to meet their information needs. Advances in web search effectiveness are therefore perhaps the most significant public outcomes of IR research. Query expansion is one such method for improving the effectiveness of ranked retrieval by adding additional terms to a query. In previous approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We propose a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection. Our
Overview of the Ninth Text REtrieval Conference (TREC-9)
- In Proceedings of the Ninth Text REtrieval Conference (TREC-9)
, 2000
Cited by 24 (0 self)
This paper serves as an introduction to the research described in detail in the remainder of the volume. The next section provides a summary of the retrieval background knowledge that is assumed in the other papers. Section 3 presents a short description of each track; a more complete description of a track can be found in that track's overview paper in the proceedings. The final section looks forward to future TREC conferences
When are links useful? Experiments in Text Classification.
- In Advances in IR, 25th European Conference on IR research, ECIR
, 2003
Cited by 20 (2 self)
Link analysis methods have become popular for information access tasks, especially information retrieval, where the link information in a document collection is used to complement the traditionally used content information. However, there has been little firm evidence to confirm the utility of link information. We show that link information can be useful when the document collection has a sufficiently high link density and links are of sufficiently high quality. We report experiments on text classification of the Cora and WebKB data sets using Probabilistic Latent Semantic Analysis and Probabilistic Hypertext Induced Topic Selection.
Very Large Scale Retrieval and Web Search
, 2004
Cited by 20 (1 self)
Together, the TREC Very Large Collection (VLC) Track and its successor the Web Track have run for seven years, after an initial VLC pre-track. During that time five new test collections have been created, five different types of retrieval task have been studied, a large number of important issues have been addressed, and new methods have been tried, not only for retrieval, but also for test collection construction. Since the Web Track was a natural evolutionary step from the VLC Track, from here on we will refer to them as a single VLC/Web track. The corpora created in support of the track have been distributed to more than 120 organisations world wide; they are clearly being used for evaluation and research purposes well beyond the confines of TREC. Not only that but the Web Track model has been adopted for similar Japanese language evaluations within the context of NTCIR (NII-NACSIS Test Collection for IR Systems, research.nii.ac.jp/ntcir/index-en.html). Each editio
Block-based web search
- In ACM SIGIR Conference
, 2004
Cited by 19 (7 self)
Multiple-topic and varying-length of web pages are two negative factors significantly affecting the performance of web search. In this paper, we explore the use of page segmentation algorithms to partition web pages into blocks and investigate how to take advantage of block-level evidence to improve retrieval performance in the web context. Because of the special characteristics of web pages, different page segmentation methods will have different impacts on web search performance. We compare four types of methods, including fixed-length page segmentation, DOM-based page segmentation, vision-based page segmentation, and a combined method which integrates both semantic and fixed-length properties. Experiments on block-level query expansion and retrieval are performed. Among the four approaches, the combined method achieves the best performance for web search. Our experimental results also show that such a semantic partitioning of web pages effectively deals with the problem of multiple drifting topics and mixed lengths, and thus has great potential to boost the performance of current web search engines.
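Of the four segmentation methods the abstract lists, fixed-length segmentation is the simplest to illustrate. The sketch below is an assumption-laden toy, not the paper's system: the block size, the `overlap` scoring function, and the choice to rank a page by its best block are all hypothetical and stand in for whatever retrieval model is actually used.

```python
def fixed_length_blocks(tokens, size=4):
    """Split a token list into consecutive blocks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def overlap(query, block):
    """Toy relevance score: fraction of block tokens that are query terms."""
    query_set = set(query)
    return sum(t in query_set for t in block) / len(block) if block else 0.0


def best_block_score(query, tokens, size=4):
    """Score a page by its best-matching block instead of the whole page."""
    blocks = fixed_length_blocks(tokens, size)
    return max((overlap(query, b) for b in blocks), default=0.0)


# Hypothetical page: a navigation block followed by a content block.
page = ["nav", "home", "about", "contact", "web", "search", "blocks", "ranking"]
query = ["web", "search"]
page_level = overlap(query, page)
block_level = best_block_score(query, page)
```

In this toy page the navigation tokens dilute the whole-page score, while the block-level score reflects only the content block, which is the kind of block-level evidence the abstract refers to.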

