Topic cube: Topic modeling for olap on multidimensional text databases
- In Proc. of the SIAM International Conference on Data Mining (SDM), 2009
Cited by 5 (3 self)
As the amount of textual information grows explosively in various kinds of business systems, it becomes more and more desirable to analyze both structured data records and unstructured text data simultaneously. While online analytical processing (OLAP) techniques have proven very useful for analyzing and mining structured data, they face challenges in handling text data. On the other hand, probabilistic topic models are among the most effective approaches to latent topic analysis and mining on text data. In this paper, we propose a new data model called topic cube to combine OLAP with probabilistic topic modeling and enable OLAP on the dimension of text data in a multidimensional text database. Topic cube extends the traditional data cube to cope with a topic hierarchy and store probabilistic content measures of text documents learned through a probabilistic topic model. To materialize topic cubes efficiently, we propose a heuristic method to speed up the iterative EM algorithm for estimating topic models by leveraging the models learned on component data cells to choose a good starting point for iteration. Experimental results show that this heuristic method is much faster than the baseline method of computing each topic cube from scratch. We also discuss potential uses of topic cube and show sample experimental results.
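The warm-start idea in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a plain PLSA model, assumes that topic k means the same thing in every component cell (the sketch does not enforce this alignment), and uses hypothetical helper names (`plsa_em`, `merge_child_topics`) and toy data invented for illustration. The parent cell's EM run starts from a document-count-weighted average of the child models instead of a random point.

```python
import math
import random


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def plsa_em(docs, n_topics, vocab_size, init_topics=None,
            max_iter=200, tol=1e-6, seed=0):
    """Fit PLSA with EM on docs (each a dict word_id -> count).

    Returns (topics, history): topics[k][w] = p(w | topic k), and history
    holds the log-likelihood measured at the start of each iteration.
    """
    rng = random.Random(seed)
    if init_topics is None:
        # Baseline: random starting point.
        topics = [normalize([rng.random() + 0.1 for _ in range(vocab_size)])
                  for _ in range(n_topics)]
    else:
        # Warm start: reuse a supplied model; a tiny floor avoids zeros.
        topics = [normalize([p + 1e-9 for p in row]) for row in init_topics]
    doc_mix = [[1.0 / n_topics] * n_topics for _ in docs]
    history = []
    for _ in range(max_iter):
        topic_counts = [[1e-9] * vocab_size for _ in range(n_topics)]
        new_mix = []
        log_lik = 0.0
        for d, doc in enumerate(docs):
            mix_counts = [1e-9] * n_topics
            for w, count in doc.items():
                joint = [doc_mix[d][k] * topics[k][w] for k in range(n_topics)]
                p_w = sum(joint)
                log_lik += count * math.log(p_w)
                for k in range(n_topics):
                    # E-step responsibility, accumulated for the M-step.
                    resp = count * joint[k] / p_w
                    topic_counts[k][w] += resp
                    mix_counts[k] += resp
            new_mix.append(normalize(mix_counts))
        topics = [normalize(row) for row in topic_counts]
        doc_mix = new_mix
        converged = bool(history) and abs(log_lik - history[-1]) < tol
        history.append(log_lik)
        if converged:
            break
    return topics, history


def merge_child_topics(child_models):
    """Average child topic-word distributions, weighted by document count.

    child_models: list of (topics, n_docs). Assumes topic k is aligned
    across children, which this sketch does not enforce.
    """
    n_topics = len(child_models[0][0])
    vocab_size = len(child_models[0][0][0])
    total_docs = sum(n for _, n in child_models)
    merged = [[0.0] * vocab_size for _ in range(n_topics)]
    for topics, n_docs in child_models:
        for k in range(n_topics):
            for w in range(vocab_size):
                merged[k][w] += topics[k][w] * n_docs / total_docs
    return merged


# Two toy component cells over a 4-word vocabulary (invented data).
cell_a = [{0: 5, 1: 3}, {2: 4, 3: 2}, {0: 2, 1: 6}]
cell_b = [{2: 3, 3: 5}, {0: 4, 1: 4}, {2: 6, 3: 1}]
topics_a, _ = plsa_em(cell_a, n_topics=2, vocab_size=4)
topics_b, _ = plsa_em(cell_b, n_topics=2, vocab_size=4)

# Parent cell = union of children; start EM from the merged child models.
parent = cell_a + cell_b
start = merge_child_topics([(topics_a, len(cell_a)), (topics_b, len(cell_b))])
warm_topics, warm_history = plsa_em(parent, 2, 4, init_topics=start)
cold_topics, cold_history = plsa_em(parent, 2, 4, seed=1)
print("warm-start iterations:", len(warm_history))
print("cold-start iterations:", len(cold_history))
```

Comparing the two printed iteration counts gives a rough feel for the effect on a given data set; the toy cells here are too small to support any conclusion about speed.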
Creating Relational Data from Unstructured and Ungrammatical Data Sources
- 2008
Cited by 5 (2 self)
In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data “posts.” The unstructured nature of posts makes query and integration difficult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically, making accurate search difficult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate to perform this task because it relies on clues from the data, such as structure or natural language, neither of which is found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to
Efficient Techniques for Document Sanitization
Cited by 3 (0 self)
Sanitization of a document involves removing sensitive information from the document, so that it may be distributed to a broader audience. Such sanitization is needed while declassifying documents involving sensitive or confidential information such as corporate emails, intelligence reports, medical records, etc. In this paper, we present the ERASE framework for performing document sanitization in an automated manner. ERASE can be used to sanitize a document dynamically, so that different users get different views of the same document based on what they are authorized to know. We formalize the problem and present algorithms used in ERASE for finding the appropriate terms to remove from the document. Our preliminary experimental study demonstrates the efficiency and efficacy of the proposed algorithms.
disclosure of proprietary information while sharing data with outsourced operations.
Example. Figure 1 shows an example U.S. government document that has been sanitized prior to release [16]. This sanitized document gives limited information (such as the purpose and the funding amount) on an erstwhile secret medical research project, while hiding the names of the funding sources, principal investigators and their affiliation.
RankIE: Document Retrieval on Ranked Entity Graphs
Cited by 3 (0 self)
Developer communities built around software products, like the SAP Community Network, provide a knowledge base for recurring problems and their solutions. Due to the large amount of content maintained in such communities, e.g., in forums, finding relevant solutions is a major challenge beyond the scope of common keyword-based search engines. In fact, measurements show that around 50% of the forum questions in our particular scenario have already been answered at the time they are posted. We address this challenge with an entity-aware search, which exploits structured knowledge, such as domain-specific ontologies, for both query interpretation and creation of document indexes. The system takes a natural language query as input, interprets it as an entity graph, matches this graph with pre-processed content and supports users in refining their query based on the top-k relevant entities. Results are presented in a user interface that supports faceted search based on entities. Additionally, the user interface is structured according to possible search intentions of users. The evaluation of our system on the SCN scenario shows that the top 5 entities in user queries are recognized with a precision of 83% compared to 61% for state-of-the-art algorithms.
1. ENTITY-BASED FORUM SEARCH
The SAP Community Network (SCN) is a community platform for customers and developers working with SAP products. SCN forums serve as a knowledge base for recurring problems and their solutions. We have developed the RankIE system (Ranking of documents based on Information Extraction) to support users in finding relevant content based on entities, such as software components, error messages, field-specific terms, etc. Users post queries to the system and refine their search intention. RankIE points to existing high-quality answers and allows filtering results according to the document source.
The main processing steps are: (P1) Offline recognition, ranking and indexing of entity graphs for the documents in the text corpus (Sec. 2). Fig-
iNextCube: Information Network-Enhanced Text Cube
Cited by 2 (1 self)
Nowadays, most business, administration, and/or scientific databases contain both structured attributes and text attributes. We call a database that consists of both multidimensional structured data and narrative text data a multidimensional text database. Searching, OLAP, and mining such databases pose many research challenges. To enhance the power of data analysis, interesting entities and relationships can be extracted from such databases to derive heterogeneous information networks, which in turn will substantially increase the power and flexibility of data exploration in such databases. Based on our previous studies on TextCube [1], TopicCube [2], and information network analysis, such as RankClus [3] and NetClus [4], we construct iNextCube, an information-Network-enhanced text Cube. In this demo, we show the power of iNextCube in the search and analysis of two multidimensional text databases: (i) a DBLP-based CS bibliographic database, and (ii) an online news database.
Matching Reviews to Objects using a Language Model
Cited by 2 (0 self)
We develop a general method to match unstructured text reviews to a structured list of objects. For this, we propose a language model for generating reviews that incorporates a description of objects and a generic review language model. This mixture model gives us a principled method to find, given a review, the object most likely to be the topic of the review. Extensive experiments and analysis on reviews from Yelp show that our language model-based method vastly outperforms traditional TF-IDF-based methods.
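The mixture described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' model: it assumes unigram language models with additive smoothing, whitespace tokenization, and a fixed mixing weight `lam`; the function names (`unigram`, `match_review`), the two restaurant entries, and the sample reviews are all hypothetical. Each review word is scored as `lam * p(w | object description) + (1 - lam) * p(w | generic reviews)`, and the object with the highest total log-probability is returned.

```python
import math
from collections import Counter


def unigram(tokens, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over vocab."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def match_review(review, objects, generic_reviews, lam=0.5):
    """Return (best_object, scores) under a two-component mixture model.

    objects: dict name -> textual description of the structured record.
    lam: assumed weight of the object-specific component.
    """
    review_tokens = review.lower().split()
    generic_tokens = [t for r in generic_reviews for t in r.lower().split()]
    object_tokens = {name: desc.lower().split()
                     for name, desc in objects.items()}
    # Shared vocabulary so every review word has a probability everywhere.
    vocab = set(review_tokens) | set(generic_tokens)
    for tokens in object_tokens.values():
        vocab |= set(tokens)
    generic_lm = unigram(generic_tokens, vocab)
    scores = {}
    for name, tokens in object_tokens.items():
        object_lm = unigram(tokens, vocab)
        # Log-likelihood of the review if it were generated about this object.
        scores[name] = sum(
            math.log(lam * object_lm[w] + (1 - lam) * generic_lm[w])
            for w in review_tokens)
    return max(scores, key=scores.get), scores


# Toy structured list and generic review text (invented for illustration).
objects = {
    "Luigi's Pizzeria": "luigi's pizzeria pizza italian main street",
    "Sakura Sushi": "sakura sushi japanese restaurant oak avenue",
}
generic_reviews = [
    "the food was great and the service was friendly",
    "the place was crowded but the food was worth the wait",
]
best, scores = match_review(
    "great pizza and friendly service on main street",
    objects, generic_reviews)
print(best)  # Luigi's Pizzeria
```

The generic component absorbs words such as "great" and "service" that appear in reviews of any object, so the decision rests on the words that overlap with an object's description.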
Targeted Disambiguation of Ad-hoc, Homogeneous Sets of Named Entities
- 2012
Cited by 2 (0 self)
In many entity extraction applications, the entities to be recognized are constrained to be from a list of “target entities”. In many cases, these target entities are (i) ad-hoc, i.e., do not exist in a knowledge base and (ii) homogeneous (e.g., all the entities are IT companies). We study the following novel disambiguation problem in this unique setting: given the candidate mentions of all the target entities, determine which ones are true mentions of a target entity. Prior techniques only consider target entities present in a knowledge base and/or having a rich set of attributes. In this paper, we develop novel techniques that require no knowledge about the entities except their names. Our main insight is to leverage the homogeneity constraint and disambiguate the candidate mentions collectively across all documents. We propose a graph-based model, called MentionRank, for that purpose. Furthermore, if additional knowledge is available for some or all of the entities, our model can leverage it to further improve quality. Our experiments demonstrate the effectiveness of our model. To the best of our knowledge, this is the first work on targeted entity disambiguation for ad-hoc entities.
Lightweight Database Wrapper for Unstructured Data
We propose an approach to interpreting a structured query language over unstructured data. We define partial records on heterogeneous relations as a means to bridge the gap between structured and unstructured data, as well as an algebra that merges relational operations with Information Extraction operations. This algebra provides support to our Lightweight SQL (L-SQL) language. As part of this effort we define an interface between a relational engine and an Information Extraction module. We have implemented a system based on these ideas and describe the system, including optimization efforts and early experiments.
Unstructured information integration through data-driven similarity discovery
Information integration from multiple heterogeneous sources is one of the major challenges facing enterprises and service providers today, and one of the important problems in this domain is the integration of structured and unstructured (or text) data. In this paper we describe our work on a data-driven approach to integrating various sources of text data, without relying on the availability of schema information. To this end, we have used various existing tools from natural language processing, data mining and related areas in a novel manner. The tools are used at the 'preprocessing' stage to (a) characterise each set of unstructured information (or collection of text data), (b) identify the related sets of unstructured information and (c) relate these sets to various reference data sets. All these steps are based solely on the instance values of the data sets. Subsequently the information compiled in the preprocessing stage may be used at query time to query the structured and text data. We also present our results on applying our techniques for data integration across multiple unstructured data sources, relating to customer comments of a service provider.
Mixed-Initiative, Entity-Centric Data Aggregation using Assistopedia
Wikis allow collaborators to collect information about entities. In turn, such entity information can be used for AI tasks, such as information extraction. However, these collaborators are almost exclusively human users. Allowing arbitrary software agents to act as collaborators can greatly enrich a wiki since agents can contribute structured data to complement the human-contributed, unstructured data. For instance, agents can import huge volumes of structured data about entities, enriching the pages, and agents can update wiki pages to reflect real-time information changes (e.g., win-loss records in sports). This paper describes an approach that allows both arbitrary software agents and human users to collaborate. In particular, we address three key problems: agents updating the correct wiki pages, policies for agent updates, and sharing the schema across collaborators. Using our approach, we describe creating entity-focused wikis that include the ability to create dynamic categories of entities based on their wiki pages. These categories dynamically update their membership based upon real-world changes.

