Results 1 - 10 of 98
Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages
Data & Knowledge Engineering, 1999
Cited by 101 (43 self)
Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich, multiple-record documents (e.g. advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology (a conceptual model instance) that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents an...
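As an illustration only, the idea of deriving a database scheme and constant recognizers from one declarative description can be sketched as follows. This is a minimal sketch, not the authors' system: the `ONTOLOGY` dictionary, its attribute names, the regular expressions, and the sample advertisement are all hypothetical, and context keywords and relationships are left out.

```python
import re

# Hypothetical mini-ontology for car advertisements (an assumption, not taken
# from the paper): each attribute maps to a regular expression describing the
# lexical appearance of its constants.
ONTOLOGY = {
    "year": r"\b(?:19|20)\d{2}\b",
    "price": r"\$\d[\d,]*",
    "phone": r"\b\d{3}-\d{4}\b",
}


def database_scheme(ontology, name="Record"):
    """Derive a flat relation scheme with one column per ontology attribute."""
    return f"{name}({', '.join(ontology)})"


def build_recognizers(ontology):
    """Compile one constant recognizer per attribute."""
    return {attr: re.compile(pattern) for attr, pattern in ontology.items()}


def extract(text, recognizers):
    """Return the first recognized constant per attribute, or None if absent."""
    row = {}
    for attr, recognizer in recognizers.items():
        match = recognizer.search(text)
        row[attr] = match.group(0) if match else None
    return row


scheme = database_scheme(ONTOLOGY)
ad = "1995 Ford Taurus, runs great, $4,500 obo. Call 555-1234."
row = extract(ad, build_recognizers(ONTOLOGY))
```

Switching to another domain would, under these assumptions, mean editing only `ONTOLOGY` while the three functions stay fixed.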
Record-Boundary Discovery In Web Documents
1998
Cited by 101 (18 self)
Extraction of information from unstructured or semistructured Web documents often requires a recognition and delimitation of records. (By "record" we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information will not likely succeed. In this thesis we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (runs linearly for practical cases within the context of the larger data-extraction problem) and accurate (100% in the...
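The tag-tree step described above can be sketched roughly as follows. This is a simplified sketch under stated assumptions, not the thesis's method: the abstract does not say how the record subtree is located or what the five heuristics are, so the sketch assumes the subtree is the node with the most children and uses plain tag frequency as a single stand-in heuristic. The names `Node`, `TreeBuilder`, and `candidate_separators` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """One element in the tree of nested HTML tags."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Build a tag tree; assumes reasonably well-nested HTML."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no close tag

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent  # climb to the matching open tag
        if node is not self.root:
            self.current = node.parent


def all_nodes(node):
    """Yield every node of the tree in document order."""
    yield node
    for child in node.children:
        yield from all_nodes(child)


def candidate_separators(html):
    """Return (subtree tag, child tags ranked by frequency).

    Assumption: the records live under the node with the most children.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    subtree = max(all_nodes(builder.root), key=lambda n: len(n.children))
    counts = Counter(child.tag for child in subtree.children)
    return subtree.tag, counts.most_common()


page = ("<html><body><h1>Obituaries</h1>"
        "<hr><p>Ann Lee, 81</p><br>"
        "<hr><p>Bob Ray, 77</p><br>"
        "<hr><p>Cy Dow, 90</p><br>"
        "</body></html>")
subtree_tag, ranked = candidate_separators(page)
```

On this toy page several tags occur equally often, which suggests why a single frequency count is not enough and a consensus over several heuristics is used in the approach described.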
Ontology-based extraction and structuring of information from data-rich unstructured documents
1998
Cited by 60 (10 self)
We present a new approach to extracting information from unstructured documents based on an application ontology that describes a domain of interest. Starting with such an ontology, we formulate rules to extract constants and context keywords from unstructured documents. For each unstructured document of interest, we extract its constants and keywords and apply a recognizer to organize extracted constants as attribute values of tuples in a generated database schema. To make our approach general, we fix all the processes and change only the ontological description for a different application domain. In experiments we conducted on two different types of unstructured documents taken from the Web, our approach attained recall ratios in the 80% and 90% range and precision ratios near 98%.
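For readers unfamiliar with the two measures, the standard definitions are: precision is the share of extracted facts that are correct, and recall is the share of facts actually present that were extracted. A small worked sketch follows; the toy data is hypothetical and unrelated to the paper's experiments.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for two collections of facts."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: each fact is a (record_id, attribute, value) triple.
relevant = {
    (1, "year", "1995"), (1, "make", "Ford"),
    (2, "year", "1998"), (2, "make", "Honda"),
}
extracted = {(1, "year", "1995"), (1, "make", "Ford"), (2, "year", "1998")}

precision, recall = precision_recall(extracted, relevant)
```

Here every extracted fact is correct, so precision is 3/3, while one of the four true facts was missed, so recall is 3/4.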
Developing XML Documents with Guaranteed "Good" Properties
In ER’01, 2001
Cited by 40 (4 self)
Many XML documents are being produced, but there are no agreed-upon standards formally defining what it means for complying XML documents to have "good" properties. In this paper we present a formal definition for a proposed canonical normal form for XML documents called XNF. XNF guarantees that complying XML documents have maximally compact connectivity while simultaneously guaranteeing that the data in complying XML documents cannot be redundant. Further, we present a conceptual-model-based methodology that automatically generates XNF-compliant DTDs and prove that the algorithms, which are part of the methodology, produce DTDs to ensure that all complying XML documents satisfy the properties of XNF.
Multifaceted exploitation of metadata for attribute match discovery in information integration
In Proceedings of the International Workshop on Information Integration on the Web (WIIW’01), 2001
Cited by 31 (8 self)
Automating semantic matching of attributes for the purpose of information integration is challenging, and the dynamics of the Web further exacerbate this problem. Believing that many facets of metadata can contribute to a resolution, we present a framework for multifaceted exploitation of metadata in which we gather information about potential matches from various facets of metadata and combine this information to generate and place confidence values on potential attribute matches. To make the framework apply in the highly dynamic Web environment, we base our process largely on machine learning. Experiments we have conducted are encouraging, showing that when the combination of facets converges as expected, the results are highly reliable.
Existence Dependency: The key to semantic integrity between Structural And Behavioural . . .
IEEE Transactions on Software Engineering, 1998
Cited by 30 (20 self)
In object-oriented conceptual modelling, the Generalisation/Specialisation hierarchy and the Whole/Part relationship are prevalent classification schemes for object types. This paper presents an object-oriented conceptual model where, in the end, object types are classified according to two relationships only: existence dependency and generalisation/specialisation. Existence dependency captures some of the interesting semantics that are usually associated with the concept of aggregation (also called composition or Part Of relation), but in contrast with the latter concept, the semantics of existence dependency are very precise and its use clear cut. The key advantages of classifying object types according to existence dependency are the simplicity of the concept, its absolute unambiguity and the fact that it makes it possible to check conceptual schemes for semantic integrity and consistency. We will
A Conceptual-Modeling Approach to Extracting Data from the Web
In Proceedings of the 17th International Conference on Conceptual Modeling (ER'98), 1998
Cited by 26 (7 self)
Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data. The approach is based on an ontology (a conceptual model instance) that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the...
Part-Whole Relationship Categories and their Application in Object-Oriented Analysis
Cited by 18 (2 self)
Part decomposition and, conversely, the construction of composite objects out of individual parts have long been recognized as ubiquitous and essential mechanisms involving abstraction. This applies, in particular, in areas such as CAD, manufacturing, software development, and computer graphics. Although the part-of relationship is distinguished in object-oriented modeling techniques, it ranks far behind the concept of generalization/specialization and a rigorous definition of its semantics is still missing. In this paper we first show in which ways a shift in emphasis on the part-of relationship leads to analysis and design models that are easier to understand and to maintain. We then investigate the properties of part-of relationships in order to define their semantics. This is achieved by means of a categorization of part-of relationships and by associating semantic constraints with individual categories. We further suggest a precise and, compared with existing techniques, less redundant specification of constraints accompanying part-of categories based on the degree of exclusiveness and dependence of parts on composite objects. Although the approach appears generally applicable, the object-oriented Unified Modeling Language (UML) is used to present our findings. Several examples demonstrate the applicability of the categories introduced.
From Rules To Rule Patterns
1996
Cited by 15 (4 self)
Rule-based systems are a commonly accepted solution for smoothly capturing the context-dependent and time-dependent organizational knowledge of large enterprises, also known as business policies. At the same time, however, the design of rule-based applications is one of the most pressing open research problems. This is largely because of the expressive power and flexibility of existing rule-based models together with a lack of design guidelines on how to apply these models. Learning from analogous problems in object-oriented system development and borrowing their solution metaphor, we introduce rule patterns as generic rule-based solutions for specifying business policies. The advantage of rule patterns is their predefined, reusable, and dynamically customizable nature, which allows the designer to reuse existing experience for building new rule-based applications. The paper introduces the general notion of rule patterns and illustrates the approach by sample rule patterns for specifying i...
Formal Deadlock Elimination In An Object Oriented Conceptual Schema
1995
Cited by 13 (6 self)
Object-oriented models capture structural and behavioural aspects of objects in the Universe of Discourse. As the dynamic aspects of objects include parallelism and synchronisation of object life cycles, conceptual schemas must be verified for problematic behaviour like deadlock. In this paper we will present fragments of a method for object-oriented analysis and the process algebra that allows a conceptual schema built according to this method to be formally verified for deadlock behaviour.

