Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices
- In Intl. World Wide Web Conf. (WWW), 2003
Cited by 73 (5 self)
Mobile devices are already widely used to access the Web. However, because most available web pages are designed with desktop PCs in mind, it is inconvenient to browse these large web pages on a mobile device with a small screen. In this paper, we propose a new browsing convention to facilitate navigation and reading on a small-form-factor device. A web page is organized into a two-level hierarchy, with a thumbnail representation at the top level providing a global view and an index to a set of subpages at the bottom level that hold the detailed information. A page adaptation technique is also developed to analyze the structure of an existing web page and split it into small, logically related units that fit the screen of a mobile device. For a web page not suitable for splitting, auto-positioning or scrolling-by-block is used as an alternative to assist browsing. Our experimental results show that the proposed browsing convention and page adaptation scheme greatly improve the user’s browsing experience on a device with a small display.
Mining Data Records in Web Pages
- 2003
Cited by 47 (0 self)
A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. It is useful to mine such data records in order to extract information from them to provide value-added services. Existing automatic techniques are not satisfactory because of their poor accuracy. In this paper, we propose a more effective technique to perform the task. The technique is based on two observations about data records on the Web and a string matching algorithm. The proposed technique is able to mine both contiguous and noncontiguous data records. Our experimental results show that the proposed technique outperforms existing techniques substantially.
Categories and Subject Descriptors: I.5 [Pattern Recognition]: statistical and structural; H.2.8 [Database Applications]: data mining
Keywords: Web data records, Web mining, Web information integration
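The abstract above mentions a string matching algorithm but does not spell it out. As a rough illustration only, the sketch below compares the HTML tag strings of adjacent sibling nodes with a normalized edit distance and groups runs of similar siblings as candidate data records. The function names, the 0.3 threshold, and the restriction to contiguous siblings are all assumptions made for this example; they are not taken from the paper, which also handles noncontiguous records.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def similar_runs(tag_strings: List[List[str]],
                 threshold: float = 0.3) -> List[List[int]]:
    """Return index runs of adjacent siblings whose tag strings are similar.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    runs: List[List[int]] = []
    current: List[int] = [0] if tag_strings else []
    for i in range(1, len(tag_strings)):
        if normalized_distance(tag_strings[i - 1], tag_strings[i]) <= threshold:
            current.append(i)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [i]
    if len(current) > 1:
        runs.append(current)
    return runs


# Hypothetical tag strings for five sibling nodes of one page region.
siblings = [
    ["div", "h1"],
    ["li", "a", "img", "span"],
    ["li", "a", "img", "span"],
    ["li", "a", "span"],
    ["p"],
]
candidate_records = similar_runs(siblings)
```

With these made-up inputs, the three `li` siblings form one candidate run, while the heading and the trailing paragraph are left out.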
WebTables: Exploring the Power of Tables on the Web
- 2008
Cited by 39 (4 self)
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own “schema” of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WebTables system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power
Information extraction
- FnT Databases
Cited by 24 (2 self)
The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities, such as the names of people and organizations, in news articles. As society became more data oriented, with easy online access to both structured and unstructured data, new applications of structure extraction emerged. Now, there is interest in converting our personal desktops to structured databases, the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact-finding queries. Consequently, there are many different communities of researchers bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem. This review is a survey of more than two decades of information extraction research from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities, and handling uncertainty in the extraction process.
OntoMiner: Bootstrapping and Populating Ontologies from Domain Specific Websites
- Proceedings of the First International Workshop on Semantic Web and Databases (SWDB 2003), 2003
Cited by 19 (2 self)
RDF/XML has been widely recognized as the standard for annotating online Web documents and for transforming the HTML Web to the so-called Semantic Web. In order to enable widespread usability for the Semantic Web, there is a need to bootstrap large, rich, and up-to-date domain ontologies that organize the most relevant concepts, their relationships, and instances. In this paper, we present automated techniques for bootstrapping and populating specialized domain ontologies by organizing and mining a set of relevant Web sites provided by the user. We develop algorithms that detect and utilize HTML regularities in the Web documents to turn them into hierarchical semantic structures encoded as XML. Next, we present tree-mining algorithms that identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances annotated with their labels whenever they are available. Experimental evaluation for the News and Hotels domains indicates that our algorithms can bootstrap and populate domain-specific ontologies with high precision and recall.
Design of an End-to-End Method to Extract Information From Tables
- International Journal Document Analysis Research
Cited by 14 (1 self)
This paper plans an end-to-end method for extracting information from tables embedded in documents; the input format is ASCII, to which any richer format can be converted while preserving all of the textual and much of the layout information. We start by defining what a table is. Then we describe the steps involved in extracting information from tables and analyse table-related research to place the contributions of different authors, find the paths research is following, and identify issues that are still unsolved. We then analyse current approaches to evaluating table processing algorithms and propose two new metrics for the task of segmenting cells/columns/rows. We proceed to design our own end-to-end method, in which there is a higher interaction between the different steps; we indicate how back loops in the usual order of the steps can reduce the possibility of errors and contribute to solving previously unsolved problems. Finally, we explore how the actual interpretation of the table not only allows inferring the accuracy of the overall extraction process but also contributes to improving its quality. To do so, we believe interpretation has to consider context-specific knowledge; we explore how this knowledge can be added in a plug-in/out manner, such that the overall method remains operable in different contexts.
Uncovering the relational web
- Under review, 2008
Cited by 14 (6 self)
The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema” of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web’s HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google’s general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl’s distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.
Tableseer: Automatic table metadata extraction and searching in digital libraries
- Technical Report, 2007
Cited by 11 (5 self)
Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatically extracting tables from untagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables in documents, extracts table metadata, indexes and ranks tables, and provides a user-friendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm, TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to automatically examine tables. We demonstrate the value of TableSeer with empirical studies on scientific documents.
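To make the idea of rating a <query, table> pair in a vector space model concrete, the sketch below ranks tables by cosine similarity over plain TF-IDF vectors built from each table's metadata text. This is a generic textbook scheme swapped in for illustration; the abstract does not describe TableRank's tailored term weighting, so the smoothed IDF formula, the whitespace tokenizer, and the function names here are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def rank_tables(query: str, tables: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank table ids by cosine similarity between query and metadata vectors."""
    docs = {tid: Counter(tokenize(meta)) for tid, meta in tables.items()}
    n_docs = len(docs)

    # Document frequency: in how many tables does each term occur?
    doc_freq: Counter = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    def idf(term: str) -> float:
        # Smoothed inverse document frequency; always positive.
        return math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1.0

    def weigh(counts: Counter) -> Dict[str, float]:
        return {term: tf * idf(term) for term, tf in counts.items()}

    def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
        dot = sum(w * v.get(term, 0.0) for term, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = weigh(Counter(tokenize(query)))
    scored = [(tid, cosine(query_vec, weigh(counts)))
              for tid, counts in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical table metadata (e.g. captions) keyed by a made-up table id.
tables = {
    "t1": "precision recall results for table detection",
    "t2": "protein binding energies",
    "t3": "recall of ranking schemes",
}
ranking = rank_tables("table detection precision", tables)
```

In this toy corpus only `t1` shares terms with the query, so it is ranked first and the other two tables score zero.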
Web Mining: Machine Learning for Web Applications
- Annual Review of Information Science and Technology, 2004
Cited by 9 (7 self)
With more than two billion pages created by millions of Web page authors and organizations, the World Wide Web is a tremendously rich
Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model
- 2006
Cited by 7 (2 self)
Tables on web pages contain a huge amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, the task of table extraction from web pages is difficult because HTML is designed to convey visual rather than semantic information. In this paper, we propose a robust technique for table extraction from arbitrary web pages. This technique relies upon the positional information of visualized DOM element nodes in a browser and thereby separates the intricacies of code implementation from the actual intended visual appearance. The novel aspect of the proposed web table extraction technique is the effective use of spatial reasoning on the CSS2 visual box model, which shows a high level of robustness even without any form of learning (F-measure ~ 90%). We describe the ideas behind our approach, the tabular pattern recognition algorithm operating on a double topographical grid structure and allowing for effective and robust extraction, and general observations on web tables that should be borne in mind by any automatic web table extraction mechanism.
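The abstract above relies on the rendered positions of DOM elements rather than on the HTML markup. The sketch below shows one very simplified way such positions could be turned into a grid: boxes whose top edges nearly coincide are treated as one row, and boxes whose left edges nearly coincide as one column. It assumes the box coordinates have already been read from a browser, and the `Box` type, the pixel tolerance, and the clustering rule are hypothetical; this is not the paper's double topographical grid algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Rendered position of one element (pixel coordinates are assumed)."""
    left: float
    top: float
    text: str


def cluster(values: List[float], tol: float) -> List[float]:
    """Return one anchor per group of coordinates lying within `tol` of it."""
    anchors: List[float] = []
    for value in sorted(values):
        if not anchors or value - anchors[-1] > tol:
            anchors.append(value)
    return anchors


def index_of(value: float, anchors: List[float], tol: float) -> int:
    """Find the anchor (row or column index) that `value` was grouped under."""
    for i, anchor in enumerate(anchors):
        if 0 <= value - anchor <= tol:
            return i
    raise ValueError(f"{value} does not belong to any anchor")


def build_grid(boxes: List[Box], tol: float = 3.0) -> List[List[Optional[str]]]:
    """Arrange boxes into rows and columns by aligned top and left edges."""
    row_anchors = cluster([b.top for b in boxes], tol)
    col_anchors = cluster([b.left for b in boxes], tol)
    grid: List[List[Optional[str]]] = [
        [None] * len(col_anchors) for _ in row_anchors
    ]
    for box in boxes:
        row = index_of(box.top, row_anchors, tol)
        col = index_of(box.left, col_anchors, tol)
        grid[row][col] = box.text
    return grid


# Hypothetical rendered boxes with slightly misaligned edges.
boxes = [
    Box(10, 100, "Name"), Box(120, 101, "Price"),
    Box(10, 130, "Tea"), Box(121, 130, "3.50"),
]
grid = build_grid(boxes)
```

Because alignment is judged on rendered coordinates, the same grid would come out whether the page used `<table>` markup or positioned `<div>` elements, which is the point the abstract makes about separating code from visual appearance.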

