An Analysis of Statistical and Syntactic Phrases
1997
Cited by 65 (2 self)
As the amount of textual information available through the World Wide Web grows, there is a growing need for high-precision IR systems that enable a user to find useful information in the mass of available textual data. Phrases have traditionally been regarded as precision-enhancing devices and have proved useful as content identifiers in representing documents. In this study, we compare the usefulness of phrases recognized using linguistic methods with that of phrases recognized by statistical techniques. We focus in particular on high-precision retrieval. We find that once a good basic ranking scheme is in use, phrases do not have a major effect on precision at high ranks. Phrases are more useful at lower ranks, where the connection between documents and relevance is more tenuous. We also find that the syntactic and statistical methods for recognizing phrases yield comparable performance.
Document Expansion for Speech Retrieval
1999
Cited by 42 (1 self)
Advances in automatic speech recognition allow us to search large speech collections using traditional information retrieval methods. The problem of "aboutness" for documents (is a document about a certain concept?) has been at the core of document indexing for the entire history of IR. This problem is more difficult for speech indexing since automatic speech transcriptions often contain mistakes. In this study we show that document expansion can be successfully used to alleviate the effect of transcription mistakes on speech retrieval. The loss
Integrated Search Tools for Newspaper Digital Libraries
In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000)
Cited by 2 (0 self)
Our project aims at the creation of a digital library of newspaper issues dating from 1890. At the moment, all the available source material is the property of Lambrakis Press SA, the largest publisher in Greece. The printed material exceeds 1,200,000 pages, half of which are of A2 size. In order to facilitate access to the digitized material, we have developed a retro-conversion procedure [1,2] according to which articles are traced and catalogued. For the time being, the full text of the articles is only partially available. The reasons for this are that in some cases we have encountered low-quality originals and rare old fonts (in our case old Greek fonts), as well as the absence of a suitable dictionary that could correct the OCR outcome and make the recognition reliable. In addition, an integrated set of search tools is provided to the users so that they can easily find the information they are looking for. Finally, when the user
Type of proposal: Long paper
Proc. of the Content-Based Multimedia Information Access International Conference (RIAO 2000), 2000
An important issue pertaining to the retro-conversion of newspapers, i.e. the conversion of newspaper issues into digital resources, is the identification and appropriate digital representation of an article. To complete this task, a number of steps have to be followed, from segmentation of the newspaper image to optical character recognition and linking of different items belonging to the same article. This paper presents an evaluation of different information retrieval techniques for linking the textual parts of an article that appear on different pages of a newspaper issue. Three document matching techniques are evaluated, namely title-to-title, title-to-text and text-to-text matching. In addition, the effect on matching accuracy of using a stemmer and of employing appropriate conflict resolution techniques is studied for each of these approaches. Experimental results involving a number of issues of a Greek newspaper show that the best technique, namely text-to-text matching augmented with a stemmer and conflict resolution, can reach a high linking accuracy rate of 96%.
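As a rough illustration of text-to-text matching with a stemmer and conflict resolution, the sketch below links continuation fragments to article openings by cosine similarity over stemmed term counts. The abstract does not say which stemmer, similarity measure, or conflict resolution rule the authors used, so everything here is an assumption: `toy_stem` is a crude English suffix stripper standing in for a real Greek stemmer, and `link_parts` uses a greedy one-to-one assignment as a hypothetical conflict resolution rule.

```python
import math
from collections import Counter

def toy_stem(word):
    """Crude suffix stripping; a stand-in (assumption) for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def vectorize(text):
    """Bag of stemmed, lower-cased terms."""
    return Counter(toy_stem(w) for w in text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def link_parts(heads, tails):
    """Map each tail index to a head index, best-scoring pairs first.

    Accepting pairs in descending similarity and using each head and tail
    at most once is the assumed conflict resolution rule.
    """
    pairs = sorted(
        ((cosine(vectorize(h), vectorize(t)), i, j)
         for i, h in enumerate(heads) for j, t in enumerate(tails)),
        reverse=True,
    )
    links, used_heads, used_tails = {}, set(), set()
    for score, i, j in pairs:
        if score > 0 and i not in used_heads and j not in used_tails:
            links[j] = i
            used_heads.add(i)
            used_tails.add(j)
    return links

# Invented example fragments (not from the paper's data).
heads = ["minister announces new taxes on imports",
         "team wins the national cup final"]
tails = ["the cup was lifted by the winning team",
         "import taxes announced by the minister rise"]
links = link_parts(heads, tails)
```

Here both tails share the word "the" with the second head, so without the one-to-one constraint a weaker match could compete for the same opening; the greedy pass settles the strongest pair first.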
DCU and ISI@INEX 2010: Ad-hoc and Data-Centric tracks
We describe the participation of Dublin City University (DCU) and the Indian Statistical Institute (ISI) in INEX 2010. The main contributions of this paper are: i) a simplified version of the Hierarchical Language Model (HLM), which scores XML elements with a combined probability of generating the given query from the element itself and from the top-level article node, is shown to outperform the baselines of Language Model (LM) and Vector Space Model (VSM) scoring of XML elements; ii) Expectation Maximization (EM) feedback in LM is shown to be the most effective on the domain-specific IMDB collection; iii) automated removal of sentences indicating aspects of irrelevance from the narratives of INEX ad-hoc topics is shown to improve retrieval effectiveness.
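One way to picture scoring an XML element by a combined probability of generating the query from the element and from its top-level article is a linear mixture of unigram language models. The sketch below is a guess at that idea, not the paper's formulation: the function names, the interpolation weights `lam` and `mu`, and the extra collection-level smoothing term are all assumptions made for illustration.

```python
import math
from collections import Counter

def lm_prob(term, counts, total):
    """Maximum-likelihood probability of a term in a bag of words."""
    return counts[term] / total if total else 0.0

def hlm_log_score(query, element, article, collection, lam=0.5, mu=0.3):
    """Log-probability of the query under a mixture of element, article,
    and collection models. Weights lam and mu are assumed values."""
    e, a, c = Counter(element), Counter(article), Counter(collection)
    ne, na, nc = len(element), len(article), len(collection)
    score = 0.0
    for t in query:
        p = (lam * lm_prob(t, e, ne)
             + mu * lm_prob(t, a, na)
             + (1 - lam - mu) * lm_prob(t, c, nc))
        if p == 0.0:
            return float("-inf")  # term unseen at every level
        score += math.log(p)
    return score

# Invented toy data: one article with two candidate elements.
article = "the film stars a famous actor in a crime drama".split()
sec_a = "a crime drama".split()
sec_b = "a famous actor".split()
collection = article + "other films and other actors".split()
query = ["crime", "drama"]

score_a = hlm_log_score(query, sec_a, article, collection)
score_b = hlm_log_score(query, sec_b, article, collection)
```

Because the article-level term contributes to both elements, `sec_b` still receives a finite score even though it contains neither query term, while `sec_a` ranks higher because it matches the query directly.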

