Results 1 - 6 of 6
Flexible web document analysis for delivery to narrow-bandwidth devices
- In: Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR’01), 2001
Cited by 12 (2 self)
We propose a set of baseline heuristics for identifying genuinely tabular information and news links in HTML documents. A prototype implementation of these heuristics is described for delivering content from news providers' home pages to a narrow-bandwidth device such as a portable digital assistant or cellular phone display. Its evaluation on 75 web-sites is provided, along with a discussion of topics for future research.
What Fraction of Images on the Web Contain Text?
- In Proceedings of Web Document Analysis, 2001. Online: http://www.csc.liv.ac.uk/ wda2001/ papers/27 kanungo wda2001.pdf, 2001
Cited by 7 (1 self)
Web search engines index text represented in symbolic form. However, it is well known that a fraction of the text on the web is present in the form of images, and the textual content of these images is not indexed by the search engines. This fact immediately raises a few questions: i) What fraction of the images on the web contain text? ii) What fraction of the text content of these images does not appear in the web page in symbolic form? Answers to these questions will give web users an idea of the amount of information being missed by the search engines, and will indicate whether or not Optical Character Recognition should be a standard part of search engine indexing. To answer these questions, we statistically sample the images referenced in the web pages retrieved by a search engine for specific queries and then find the fraction of sampled images that contain text.
Colour text segmentation in web images based on human perception
- 2007
Cited by 2 (0 self)
There is a significant need to extract and analyse the text in images on Web documents, for effective indexing, semantic analysis and even presentation by non-visual means (e.g., audio). This paper argues that the challenging segmentation stage for such images benefits from a human perspective of colour perception in preference to RGB colour space analysis. The proposed approach enables the segmentation of text in complex situations such as in the presence of varying colour and texture (characters and background). More precisely, characters are segmented as distinct regions with separate chromaticity and/or lightness by performing a layer decomposition of the image. The method described here is a result of the authors’ systematic approach to approximate the human colour perception characteristics for the identification of character regions. In this instance, the image is decomposed by performing histogram analysis of Hue and Lightness in the HLS colour space and merging using information on human discrimination of wavelength and luminance.
Colour text segmentation in web images based on human perception
Cited by 1 (0 self)
Flexible Web Document Analysis for Delivery to Narrow-Bandwidth Devices
- 2001
We propose a set of baseline heuristics for identifying genuinely tabular information and news links in HTML documents. A prototype implementation of these heuristics is described for delivering content from news providers' home pages to a narrow-bandwidth device such as a portable digital assistant or cellular phone display. Its evaluation on 75 web-sites is provided, along with a discussion of topics for future research.
To Search for Images on the Web,
- In Proceedings of the First International Workshop on Web Document Analysis (WDA2001), online at http://www.csc.liv.ac.uk/ wda2001, 2001
... this paper, we want to argue that image pro- ... This material is based upon work supported by the U.S. Department of Defense and by the National Science Foundation under Grant No. 9734102. Additional support was provided by Sun Microsystems.

