Probase: A Probabilistic Taxonomy for Text Understanding
"... Knowledge is indispensable to understanding. The ongoing information explosion highlights the need to enable machines to better understand electronic text in human language. Much work has been devoted to creating universal ontologies or taxonomies for this purpose. However, none of the existing onto ..."
Abstract
-
Cited by 76 (21 self)
- Add to MetaCart
(Show Context)
Knowledge is indispensable to understanding. The ongoing information explosion highlights the need to enable machines to better understand electronic text in human language. Much work has been devoted to creating universal ontologies or taxonomies for this purpose. However, none of the existing ontologies has the needed depth and breadth for “universal understanding”. In this paper, we present a universal, probabilistic taxonomy that is more comprehensive than any existing one. It contains 2.7 million concepts harnessed automatically from a corpus of 1.68 billion web pages. Unlike traditional taxonomies that treat knowledge as black and white, it uses probabilities to model the inconsistent, ambiguous, and uncertain information it contains. We present details of how the taxonomy is constructed, its probabilistic modeling, and its potential applications in text understanding.
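The probabilistic modeling centers on scores such as typicality, P(concept | instance), estimated from how often an instance co-occurs with a concept in extraction patterns. A minimal sketch of that estimate, with invented counts (none of this is actual Probase data):

```python
# Illustrative (instance, concept) co-occurrence counts, as might be
# harvested from Hearst-style patterns ("companies such as Apple").
# The numbers are invented for this example.
counts = {
    ("apple", "company"): 9000,
    ("apple", "fruit"): 6000,
    ("apple", "brand"): 2000,
}

def typicality(instance, concept, counts):
    """Estimate P(concept | instance) from extraction counts."""
    total = sum(n for (i, _), n in counts.items() if i == instance)
    return counts.get((instance, concept), 0) / total if total else 0.0

print(round(typicality("apple", "company", counts), 3))  # 0.529
print(round(typicality("apple", "fruit", counts), 3))    # 0.353
```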
Biperpedia: An Ontology for Search Applications
"... Search engines make significant efforts to recognize queries that can be answered by structured data and invest heavily in creating and maintaining high-precision databases. While these databases have a relatively wide coverage of entities, the number of attributes they model (e.g., GDP, CAPITAL, AN ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
(Show Context)
Search engines make significant efforts to recognize queries that can be answered by structured data and invest heavily in creating and maintaining high-precision databases. While these databases have a relatively wide coverage of entities, the number of attributes they model (e.g., GDP, CAPITAL, ANTHEM) is relatively small. Extending the number of attributes known to the search engine can enable it to more precisely answer queries from the long and heavy tail, extract a broader range of facts from the Web, and recover the semantics of tables on the Web. We describe Biperpedia, an ontology with 1.6M (class, attribute) pairs and 67K distinct attribute names. Biperpedia extracts attributes from the query stream, and then uses the best extractions to seed attribute extraction from text. For every attribute, Biperpedia saves a set of synonyms and text patterns in which it appears, thereby enabling it to recognize the attribute in more contexts. In addition to a detailed analysis of the quality of Biperpedia, we show that it can increase the number of Web tables whose semantics we can recover by more than a factor of 4 compared with Freebase.
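The abstract's mention of per-attribute synonyms and text patterns suggests a simple lookup structure. A hypothetical sketch of such an entry and a crude recognizer (field names, patterns, and data are invented, not Biperpedia's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class AttributeEntry:
    """One Biperpedia-style entry: an attribute of a class, plus the
    synonyms and text patterns used to recognize it in new contexts."""
    class_name: str
    attribute: str
    synonyms: set = field(default_factory=set)
    patterns: list = field(default_factory=list)  # e.g. "the <A> of <E>"

    def matches(self, text: str) -> bool:
        """Crude recognizer: does the text mention the attribute or a synonym?"""
        text = text.lower()
        return self.attribute in text or any(s in text for s in self.synonyms)

gdp = AttributeEntry("country", "gdp", synonyms={"gross domestic product"},
                     patterns=["the <A> of <E>", "<E>'s <A>"])
print(gdp.matches("What is the gross domestic product of France?"))  # True
```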
Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases
"... Tabular data on the Web has become a rich source of struc-tured data that is useful for ordinary users to explore. Due to its potential, tables on the Web have recently attracted a number of studies with the goals of understanding the se-mantics of those Web tables and providing effective search and ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
Tabular data on the Web has become a rich source of structured data that is useful for ordinary users to explore. Due to its potential, tables on the Web have recently attracted a number of studies with the goals of understanding the semantics of those Web tables and providing effective search and exploration mechanisms over them. An important part of table understanding and search is column concept determination, i.e., identifying the most appropriate concepts associated with the columns of the tables. The problem becomes especially challenging with the availability of increasingly rich knowledge bases that contain hundreds of millions of entities. In this paper, we focus on an important instantiation of the column concept determination problem, namely, determining the concepts of a column by fuzzy matching its cell values to the entities within a large knowledge base. We provide an efficient MapReduce-based solution that scales to both the number of tables and the size of the knowledge base, and propose two novel techniques: knowledge concept aggregation and knowledge entity partition. We prove that both the problem of finding the optimal aggregation strategy and that of finding the optimal partition strategy are NP-hard, and propose efficient heuristic techniques by leveraging the hierarchy of the knowledge base. Experimental results on real-world datasets show that our method achieves high annotation quality and performance, and scales well.
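The core scoring idea, stripped of the paper's MapReduce distribution and the aggregation/partition optimizations: rate each candidate concept by the fraction of a column's cells that fuzzy-match its entities. A toy single-machine sketch with an invented knowledge base:

```python
import difflib

# Toy knowledge base: concept -> entity names (invented data).
kb = {
    "us_city": {"new york", "los angeles", "chicago"},
    "country": {"united states", "france", "china"},
}

def fuzzy_match(cell, entities, threshold=0.8):
    """True if the cell value approximately matches some entity."""
    return any(difflib.SequenceMatcher(None, cell, e).ratio() >= threshold
               for e in entities)

def column_concepts(cells):
    """Score each concept by the fraction of cells matching its entities."""
    return {c: sum(fuzzy_match(cell.lower(), ents) for cell in cells) / len(cells)
            for c, ents in kb.items()}

print(column_concepts(["New York", "Chicago", "Los Angles"]))  # typo still matches
```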
Context-dependent conceptualization
- In IJCAI, 2013
"... Conceptualization seeks to map a short text (i.e., a word or a phrase) to a set of concepts as a mecha-nism of understanding text. Most of prior research in conceptualization uses human-crafted knowl-edge bases that map instances to concepts. Such approaches to conceptualization have the limitation ..."
Abstract
-
Cited by 4 (4 self)
- Add to MetaCart
Conceptualization seeks to map a short text (i.e., a word or a phrase) to a set of concepts as a mechanism of understanding text. Most prior research in conceptualization uses human-crafted knowledge bases that map instances to concepts. Such approaches to conceptualization have the limitation that the mappings are not context sensitive. To overcome this limitation, we propose a framework in which we harness the power of a probabilistic topic model, which inherently captures the semantic relations between words. By combining latent Dirichlet allocation, a widely used topic model, with Probase, a large-scale probabilistic knowledge base, we develop a corpus-based framework for context-dependent conceptualization. Through this simple but powerful framework, we improve conceptualization and enable a wide range of applications that rely on semantic understanding of short texts, including frame element prediction, word similarity in context, ad-query similarity, and query similarity.
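One simplified way to picture the combination: rescale each concept's typicality for a term by how well that concept fits the surrounding context, where the context score would come from the topic model. The sketch below uses made-up numbers in place of real LDA and Probase outputs, and a plain product in place of the paper's actual model:

```python
# Made-up stand-ins: P(concept | instance) from a Probase-style KB,
# and P(concept | context) derived from an LDA topic posterior.
p_concept_given_instance = {"fruit": 0.35, "company": 0.53, "brand": 0.12}
p_concept_given_context = {"fruit": 0.05, "company": 0.80, "brand": 0.15}

def contextual_concepts(p_ci, p_cc):
    """Rescore concepts by combining instance typicality with context fit."""
    scores = {c: p_ci[c] * p_cc.get(c, 0.0) for c in p_ci}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()} if z else scores

# "apple" in a context about stock prices: 'company' now dominates.
print(contextual_concepts(p_concept_given_instance, p_concept_given_context))
```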
Automatically Generating Government Linked Data from Tables
"... Most open government data is encoded and published in structured tables found in reports, on the Web, and in spreadsheets or databases. Current approaches to gener-ating Semantic Web representations from such data re-quires human input to create schemas and often results in graphs that do not follow ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Most open government data is encoded and published in structured tables found in reports, on the Web, and in spreadsheets or databases. Current approaches to generating Semantic Web representations from such data require human input to create schemas and often result in graphs that do not follow best practices for linked data. Evidence for a table’s meaning can be found in its column headers, cell values, implicit relations between columns, caption, and surrounding text, but it also requires general and domain-specific background knowledge. We describe techniques grounded in graphical models and probabilistic reasoning to infer the meaning (semantics) associated with a table using background knowledge from the Linked Open Data cloud. We represent a table’s meaning by mapping columns to classes in an appropriate ontology, linking cell values to literal constants, implied measurements, or entities in the linked data cloud (existing or new), and discovering and identifying relations between columns.
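Before joint inference refines anything, candidate classes for a column can be scored from the classes of the entities its cells link to. A simplified majority-vote sketch (entity linking is stubbed with invented data; the paper's graphical-model inference is not reproduced):

```python
from collections import Counter

# Stub for linking a cell value to candidate entity classes in a KB.
entity_classes = {
    "Maryland": ["dbo:AdministrativeRegion", "dbo:Place"],
    "Virginia": ["dbo:AdministrativeRegion", "dbo:Place"],
    "Annapolis": ["dbo:City", "dbo:Place"],
}

def score_column_classes(cells):
    """Vote for column classes using the classes of linked cell entities."""
    votes = Counter()
    for cell in cells:
        for cls in entity_classes.get(cell, []):
            votes[cls] += 1
    return votes.most_common()

print(score_column_classes(["Maryland", "Virginia"]))
# [('dbo:AdministrativeRegion', 2), ('dbo:Place', 2)]
```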
A Domain Independent Framework for Extracting Linked Semantic Data from Tables
- In Search Computing III, LNCS, 2012
"... Abstract. Vast amounts of information is encoded in tables found in documents, on the Web, and in spreadsheets or databases. Integrating or searching over this information benefits from understanding its intended meaning and making it explicit in a semantic representation language like RDF. Most cur ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Vast amounts of information are encoded in tables found in documents, on the Web, and in spreadsheets or databases. Integrating or searching over this information benefits from understanding its intended meaning and making it explicit in a semantic representation language like RDF. Most current approaches to generating Semantic Web representations from tables require human input to create schemas and often result in graphs that do not follow best practices for linked data. Evidence for a table’s meaning can be found in its column headers, cell values, implicit relations between columns, caption, and surrounding text, but it also requires general and domain-specific background knowledge. Approaches that work well for one domain may not necessarily work well for others. We describe a domain independent framework for interpreting the intended meaning of tables and representing it as Linked Data. At the core of the framework are techniques grounded in graphical models and probabilistic reasoning to infer the meaning associated with a table. Using background knowledge from resources in the Linked Open Data cloud, we jointly infer the semantics of column headers, table cell values (e.g., strings and numbers), and relations between columns, and represent the inferred meaning as a graph of RDF triples. A table’s meaning is thus captured by mapping columns to classes in an appropriate ontology, linking cell values to literal constants, implied measurements, or entities in the linked data cloud (existing or new), and discovering and identifying relations between columns.
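The output side, representing an interpreted table as RDF triples, can be sketched with the rdflib library; the class and property URIs below are illustrative stand-ins, not the framework's actual output:

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")   # illustrative namespace
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
row = URIRef(EX["row/1"])

# Column mapped to a class, cell linked to a label, literal kept as-is.
g.add((row, RDF.type, DBO.City))
g.add((row, EX.label, Literal("Baltimore")))
g.add((row, DBO.populationTotal, Literal(585708)))

print(g.serialize(format="turtle"))
```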
Synthesizing Union Tables from the Web
- In IJCAI, 2013
"... Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, hundreds of millions of high quality structured data tables can now be explored by the us ..."
Abstract
- Add to MetaCart
Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, hundreds of millions of high-quality structured data tables can now be explored by users. In this paper, we argue that those efforts only scratch the surface of the true value of structured data on the Web, and study the challenging problem of synthesizing tables from the Web, i.e., producing never-before-seen tables from raw tables on the Web. Table synthesis offers an important semantic advantage: when a set of related tables is combined into a single union table, powerful mechanisms, such as temporal or geographical comparison and visualization, can be employed to understand and mine the underlying data holistically. We focus on one fundamental task of table synthesis, namely, table stitching. Within a given site, many tables with identical schemas can be scattered across many pages. The task of table stitching involves combining such tables into a single meaningful union table and identifying extra attributes and values for its rows so that rows from different original tables can be distinguished. Specifically, we first define the notion of stitchable tables and identify collections of tables that can be stitched. Second, we design an effective algorithm for extracting hidden attributes that are essential for the stitching process and for aligning values of those attributes across tables to synthesize new columns. We also assign meaningful names to these synthesized columns. Experiments on real-world tables demonstrate the effectiveness of our approach.
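Once the hidden attribute has been extracted and named (the hard part), the stitching itself is a union with one appended column. A toy sketch with invented data:

```python
# Each fragment: (hidden attribute value taken from its page, rows).
# Schema is identical across fragments; the data is invented.
fragments = [
    ("2011", [("Toyota", 1644660), ("GM", 1575596)]),
    ("2012", [("Toyota", 2082504), ("GM", 1595764)]),
]

def stitch(fragments, schema, hidden_attr_name):
    """Union the fragments into one table, appending the hidden attribute
    so rows from different source tables stay distinguishable."""
    header = schema + [hidden_attr_name]
    rows = [list(row) + [value] for value, frag_rows in fragments
            for row in frag_rows]
    return header, rows

header, rows = stitch(fragments, ["maker", "units_sold"], "year")
print(header)   # ['maker', 'units_sold', 'year']
for r in rows:
    print(r)
```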
Probase: a Universal Knowledge Base for Semantic Search
"... We demonstrate a prototype system that showcases the power of using a knowledge base (Probase) for search. The goal of Probase is to enable common sense computing, and its foundation is a universal, probabilistic ontology that is more comprehensive than any of the existing ontologies. Currently, it ..."
Abstract
- Add to MetaCart
(Show Context)
We demonstrate a prototype system that showcases the power of using a knowledge base (Probase) for search. The goal of Probase is to enable common sense computing, and its foundation is a universal, probabilistic ontology that is more comprehensive than any of the existing ontologies. Currently, it contains over 2.7 million concepts harnessed automatically from a corpus of 1.68 billion web pages. Unlike traditional knowledge bases that treat knowledge as black and white, it supports probabilistic interpretations of the information it contains. The probabilistic nature also enables it to incorporate heterogeneous information in a natural way. Besides the system, we also demonstrate two applications, i) semantic web search and ii) understanding and searching web tables, that are built on top of the Probase framework. They indicate that a little common sense goes a long way: machines can be made more intelligent.
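As one concrete flavor of taxonomy-backed search, a concept phrase in a query could be expanded into its most typical instances before retrieval. A hypothetical sketch (the instance lists and scores are invented, and this is not the demo's actual pipeline):

```python
# Invented typicality lists: concept -> [(instance, P(instance | concept)), ...]
taxonomy = {
    "big tech company": [("google", 0.22), ("apple", 0.21), ("microsoft", 0.19)],
}

def expand_query(query, taxonomy, top_k=2):
    """Replace a known concept phrase with its most typical instances."""
    for concept, instances in taxonomy.items():
        if concept in query:
            top = [i for i, _ in sorted(instances, key=lambda x: -x[1])[:top_k]]
            return [query.replace(concept, inst) for inst in top]
    return [query]

print(expand_query("big tech company headquarters", taxonomy))
# ['google headquarters', 'apple headquarters']
```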