Entityrank: Searching entities directly and holistically
- In VLDB, 2007
- Cited by 22 (0 self)
As the Web has evolved into a data-rich repository, with the standard “page view,” current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data “entities” (e.g., phone number, paper PDF, date), today’s engines only take us indirectly to pages. While entities appear in many pages, current engines only find each page individually. Toward searching directly and holistically for information of finer granularity, we study the problem of entity search, a significant departure from traditional document retrieval. We focus on the core challenge of ranking entities, by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We evaluate our online prototype over a 2TB Web corpus, and show that EntityRank performs effectively.
Indexing Dataspaces
- 2007
- Cited by 14 (1 self)
Dataspaces are collections of heterogeneous and partially unstructured data. Unlike data-integration systems that also offer uniform access to heterogeneous data sources, dataspaces do not assume that all the semantic relationships between sources are known and specified. Much of the user interaction with dataspaces involves exploring the data, and users do not have a single schema to which they can pose queries. Consequently, it is important that queries are allowed to specify varying degrees of structure, spanning keyword queries to more structure-aware queries. This paper considers indexing support for queries that combine keywords and structure. We describe several extensions to inverted lists to capture structure when it is present. In particular, our extensions incorporate attribute labels, relationships between data items, hierarchies of schema elements, and synonyms among schema elements. We describe experiments showing that our indexing techniques improve query efficiency by an order of magnitude compared with alternative approaches, and scale well with the size of the data.
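The idea of extending inverted lists with attribute labels can be illustrated with a minimal sketch. This is not the paper's index structure: the function names `build_index` and `lookup`, the choice of keying postings by `(attribute, token)` pairs, and the toy items are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(items):
    """Map plain keywords and (attribute, keyword) pairs to item ids.

    `items` maps an item id to a dict of attribute -> text value.
    """
    index = defaultdict(set)
    for item_id, attributes in items.items():
        for attribute, value in attributes.items():
            for token in value.lower().split():
                index[token].add(item_id)               # serves keyword-only queries
                index[(attribute, token)].add(item_id)  # serves structure-aware queries
    return index

def lookup(index, keyword, attribute=None):
    """Return sorted item ids matching a keyword, optionally within one attribute."""
    token = keyword.lower()
    key = token if attribute is None else (attribute, token)
    return sorted(index.get(key, set()))

# Hypothetical toy data: "smith" appears as an author in one item
# and as a title word in another.
items = {
    1: {"title": "dataspaces survey", "author": "smith"},
    2: {"title": "smith chart basics", "author": "jones"},
}
index = build_index(items)

print(lookup(index, "smith"))                      # keyword query
print(lookup(index, "smith", attribute="author"))  # keyword restricted to an attribute
```

Storing both kinds of key in one map lets the same lookup path answer queries with or without structure, at the cost of roughly doubling the number of postings.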
Space-Efficient Framework for Top-k String Retrieval Problems
- Cited by 10 (1 self)
Given a set D = {d1, d2, ..., dD} of D strings of total length n, our task is to report the “most relevant” strings for a given query pattern P. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of “most relevant” is involved. In information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents. The first formal framework to study this kind of retrieval problem was given by Muthukrishnan [25]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework of reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [25] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
Keywords: document retrieval; text indexing; succinct data structures; top-k queries
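To make the query itself concrete, the following is a naive baseline for top-k retrieval under the frequency metric. It scans every string on each query, which is exactly the cost the abstract's data structures are meant to avoid, so it illustrates the problem statement rather than the paper's framework. The function names and the tie-breaking rule (smaller document id first) are assumptions.

```python
import heapq

def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def top_k_by_frequency(documents, pattern, k):
    """Return up to k (doc_id, count) pairs, highest count first.

    Naive baseline: every document is scanned for every query.
    Ties are broken by smaller doc_id (an arbitrary choice).
    """
    scored = []
    for doc_id, text in enumerate(documents):
        count = count_occurrences(text, pattern)
        if count > 0:
            scored.append((count, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda s: (-s[0], s[1]))
    return [(doc_id, count) for count, doc_id in best]

documents = ["banana", "ananas", "bandana", "cherry"]
print(top_k_by_frequency(documents, "ana", 2))
```

With these toy strings, "banana" and "ananas" each contain two overlapping occurrences of "ana" and "bandana" contains one, so the two highest-scoring documents are the first two.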
Web information extraction and user modeling: towards closing the gap
- IEEE Data Engineering Bulletin, 2005
- Cited by 4 (0 self)
Web search engines have become the primary method of accessing information on the web. Billions of queries are submitted to major web search engines, reflecting a wide range of information needs. While significant progress has been made on improving the relevance of the results, the web search process often remains a frustrating experience. At the same time, web information extraction has seen tremendous progress, such that knowledge bases of millions of facts extracted from the web are now a reality. Yet it is not clear how effectively these knowledge bases support common user information needs. We posit that a key for web information extraction to significantly impact the web search experience is to connect the extraction process with user modeling, particularly with automatic methods for inferring user information needs and anticipated interaction patterns. In this paper we overview some recent efforts for user modeling and inferring user preferences in the context of closing the gap between web information extraction and user modeling.
Annotating and Searching Web Tables Using Entities, Types and Relationships
- Cited by 4 (0 self)
Tables are a universal idiom to present relational data. Billions of tables on Web pages express entity references, attributes and relationships. This representation of relational world knowledge is usually considerably better than completely unstructured, free-format text. At the same time, unlike manually-created knowledge bases, relational information mined from “organic” Web tables need not be constrained by availability of precious editorial time. Unfortunately, in the absence of any formal, uniform schema imposed on Web tables, Web search cannot take advantage of these high-quality sources of relational information. In this paper we propose new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns seek to express. We propose a new graphical model for making all these labeling decisions for each table simultaneously, rather than make separate local decisions for entities, types and relations. Experiments using the YAGO catalog, DB-Pedia, tables from Wikipedia, and over 25 million HTML tables from a 500 million page Web crawl uniformly show the superiority of our approach. We also evaluate the impact of better annotations on a prototype relational Web search tool. We demonstrate clear benefits of our annotations beyond indexing tables in a purely textual manner.
Exploring the knowledge in Semi Structured Data Sets with Rich Queries
- Cited by 1 (0 self)
Semantics can be integrated into search processing during both document analysis and querying stages. We describe a system that both incorporates semantic annotations of Wikipedia articles into the search process and allows for rich annotation search, enabling users to formulate queries based on their knowledge about how entities relate to one another while simultaneously retaining the freedom of free text search where appropriate. The outcome of this work is an application consisting of semantic annotators, an extended search engine and an interactive user interface.
EntityEngine: Answering Entity-Relationship Queries using Shallow Semantics
- Cited by 1 (0 self)
We introduce EntityEngine, a system for answering entity-relationship queries over text. Such queries combine SQL-like structures with IR-style keyword constraints and can therefore be expressive and flexible in querying about entities and their relationships. EntityEngine consists of various offline and online components, including a position-based ranking model for accurate ranking of query answers and a novel entity-centric index for efficient query evaluation.
Weighted Proximity Best-Joins for Information Retrieval
We consider the problem of efficiently computing weighted proximity best-joins over multiple lists, with applications in information retrieval and extraction. We are given a multi-term query, and for each query term, a list of all its matches with scores, sorted by locations. The problem is to find the overall best matchset, consisting of one match from each list, such that the combined score according to a scoring function is maximized. We study three types of functions that consider both individual match scores and proximity of match locations in scoring a matchset. We present algorithms that exploit the properties of the scoring functions in order to achieve time complexities linear in the size of the match lists. Experiments show that these algorithms greatly outperform the naive algorithm based on taking the cross product of all match lists. Finally, we extend our algorithms for an alternative problem definition applicable to information extraction, where we need to find all good matchsets in a document.
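The naive cross-product baseline mentioned in the abstract can be sketched as follows. The scoring function used here (sum of match scores minus `alpha` times the span between the leftmost and rightmost match) is a hypothetical stand-in, not one of the three function types studied in the paper; the name `best_matchset` and the `(location, score)` tuple layout are likewise assumptions.

```python
from itertools import product

def best_matchset(match_lists, alpha=1.0):
    """Exhaustively find the best matchset, one match per list.

    Each match is a (location, score) tuple. The combined score is an
    assumed example: total match score minus alpha times the span
    covered by the chosen locations. Runs in time proportional to the
    product of the list lengths.
    """
    best, best_score = None, float("-inf")
    for matchset in product(*match_lists):
        locations = [location for location, _ in matchset]
        total = sum(score for _, score in matchset)
        combined = total - alpha * (max(locations) - min(locations))
        if combined > best_score:
            best, best_score = matchset, combined
    return best, best_score

# Hypothetical match lists for a two-term query, sorted by location.
term_a = [(1, 9), (10, 5)]
term_b = [(12, 8), (30, 10)]
print(best_matchset([term_a, term_b]))
```

In this toy input the pair at locations 10 and 12 wins despite lower individual scores, because the other combinations are penalized for spanning a wider range.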
Department of Economics and Business Engineering,
In this work we propose intelligent, automated content analysis techniques for different media to extract knowledge from the multimedia content. Information derived from different sources/modalities will be analyzed and fused, in terms of spatiotemporal, personal and even social contextual information. In order to achieve this goal, semantic analysis will be applied to the content items, taking into account the content itself (e.g. text, images and video), as well as existing personal, social and contextual information (e.g. semantic and machine-processable metadata and tags). The above process exploits the so-called “Media Intelligence” towards the ultimate goal of identifying “Collective Intelligence”, emerging from the collaboration and competition among people, empowering innovative services and user interactions. The utilization of “Media Intelligence” constitutes a departure from traditional methods for information sharing, since semantic multimedia analysis has to fuse information from both the content itself and the social context, while at the same time the social dynamics have to be taken into account. Such intelligence provides added-value to the available multimedia content and renders

