Results 1 - 8 of 8
Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
Cited by 11 (4 self)
Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction. For large collections, it is clearly impractical to consider all possible groupings of mentions into distinct entities. To solve the problem we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference. To evaluate these ideas, we constructed a labeled corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors referring to Wikipedia entities. We show that the combination of the hierarchical model with distributed inference quickly obtains high accuracy (with error reduction of 38%) on this large dataset, demonstrating the scalability of our approach.
FastBit: Interactively Searching Massive Data
- Proc. of SciDAC 2009, 2009
Cited by 4 (2 self)
As scientific instruments and computer simulations produce more and more data, the task of locating the essential information to gain insight becomes increasingly difficult. FastBit is an efficient software tool to address this challenge. In this article, we present a summary of the key techniques, namely bitmap compression, encoding and binning. The advances in these techniques have led to a search tool that can answer structured (SQL) queries orders of magnitude faster than popular database systems. To illustrate how FastBit is used in applications, we present three examples involving a high-energy physics experiment, a combustion simulation, and an accelerator simulation. In each case, FastBit significantly reduces the response time and enables interactive exploration on terabytes of data.
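The abstract above names binning and bitmap encoding as key techniques. As a rough illustration of the general idea only, here is a minimal sketch of a binned, equality-encoded bitmap index. It does not use FastBit's API and omits compression entirely. The names `build_index` and `range_query`, the use of Python integers as bitmaps, and the half-open bin layout are all assumptions made for this example.

```python
from bisect import bisect_right


def build_index(values, edges):
    """Build one bitmap per bin; bin i covers [edges[i], edges[i + 1]).

    Each bitmap is a Python int whose bit `row` is set when values[row]
    falls in that bin. Values outside [edges[0], edges[-1]) are not indexed
    (an assumption of this sketch).
    """
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(values):
        bin_id = bisect_right(edges, value) - 1
        if 0 <= bin_id < len(bitmaps):
            bitmaps[bin_id] |= 1 << row
    return bitmaps


def range_query(values, edges, bitmaps, lo, hi):
    """Return the rows with lo <= value < hi."""
    hits = 0
    for bin_id, bitmap in enumerate(bitmaps):
        left, right = edges[bin_id], edges[bin_id + 1]
        if lo <= left and right <= hi:
            # The bin lies fully inside the query range: accept every row.
            hits |= bitmap
        elif left < hi and lo < right:
            # Edge bin: the bitmap only yields candidates, so each
            # candidate row is checked against the raw value.
            bits, row = bitmap, 0
            while bits:
                if bits & 1 and lo <= values[row] < hi:
                    hits |= 1 << row
                bits >>= 1
                row += 1
    return [row for row in range(len(values)) if (hits >> row) & 1]
```

For example, with `values = [0.5, 1.5, 2.5, 3.5, 1.2, 2.9]` and `edges = [0, 1, 2, 3, 4]`, a query for `1.1 <= value < 3.0` takes the bin `[2, 3)` wholesale and only inspects raw values in the bin `[1, 2)`.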
Online generation of locality sensitive hash signatures
- In ACL Short Papers, 2010
Cited by 4 (2 self)
Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
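One way to see why additive updates allow online maintenance: a random-hyperplane signature bit is the sign of a dot product, and a dot product is linear in the feature vector, so each update can be folded into a running sum per bit. The sketch below illustrates that general idea and is not the paper's implementation. The class `OnlineLSH`, the helper `offline_signature`, and the choice to derive each Gaussian projection weight from a string-seeded `random.Random` (so no projection matrix is stored) are assumptions made for this example.

```python
import random


class OnlineLSH:
    """Running random-hyperplane signature for one streamed feature vector."""

    def __init__(self, num_bits=32, seed=0):
        self.num_bits = num_bits
        self.seed = seed
        # One running dot product per signature bit.
        self.sums = [0.0] * num_bits

    def _weight(self, bit, feature):
        # Deterministic pseudo-random Gaussian weight for (bit, feature),
        # regenerated on demand instead of being stored.
        rng = random.Random(f"{self.seed}|{bit}|{feature}")
        return rng.gauss(0.0, 1.0)

    def update(self, feature, delta):
        """Fold an additive update `delta` to `feature` into every bit."""
        for bit in range(self.num_bits):
            self.sums[bit] += delta * self._weight(bit, feature)

    def signature(self):
        """Bit i is 1 when the running dot product with hyperplane i is >= 0."""
        return tuple(1 if s >= 0.0 else 0 for s in self.sums)


def offline_signature(vector, num_bits=32, seed=0):
    """Batch signature from a complete {feature: value} vector."""
    ref = OnlineLSH(num_bits, seed)
    bits = []
    for bit in range(num_bits):
        dot = sum(value * ref._weight(bit, feature)
                  for feature, value in vector.items())
        bits.append(1 if dot >= 0.0 else 0)
    return tuple(bits)
```

Streaming the updates `("cat", 1.0)`, `("dog", 2.0)`, `("cat", 3.0)`, `("fish", -1.0)` through `update` should give the same signature as calling `offline_signature` on the accumulated vector `{"cat": 4.0, "dog": 2.0, "fish": -1.0}`, up to floating-point rounding in the running sums.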
The infinite HMM for unsupervised PoS tagging
- In Proceedings of 2009 Conference on Empirical Methods in Natural Language Processing, 2009
Cited by 2 (1 self)
We extend previous work on fully unsupervised part-of-speech tagging. Using a non-parametric version of the HMM, called the infinite HMM (iHMM), we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging. We experiment with two non-parametric priors, the Dirichlet and Pitman-Yor processes, on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm. We evaluate the results with a variety of clustering evaluation metrics and achieve equivalent or better performances than previously reported. Building on this promising result we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations.
Grammar based statistical MT on Hadoop
- 2009
"... An end-to-end toolkit for large scale PSCFG based MT ..."
The unique strengths and storage access characteristics of discard-based search
- J Internet Serv Appl, DOI 10.1007/s13174-010-0001-z, 2010
© The Author(s) 2010. This article is published with open access at Springerlink.com
Discard-based search is a new approach to searching the content of complex, unlabeled, nonindexed data such as digital photographs, medical images, and real-time surveillance data. The essence of this approach is query-specific content-based computation, pipelined with human cognition. In this approach, query-specific parallel computation shrinks a search task down to human scale, thus allowing the expertise, judgment, and intuition of an expert to be brought to bear on the specificity and selectivity of the search. In this paper, we report on the lessons learned in the Diamond project from applying discard-based search to a variety of applications in the health sciences. From the viewpoint of a user, discard-based search offers unique strengths. From the viewpoint of server hardware and software, it offers unique opportunities for optimization that contradict long-established tenets of storage design. Together, these distinctive end-to-end attributes herald a new genre of Internet applications.
Keywords: Data-intensive computing · Non-text search technology · Medical image processing · Interactive search · Computer vision · Pattern recognition · Distributed
Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor
We present progress on Joshua, an open-source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open-source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats.
8 Parallel File Systems
The success of a CDI Grid is dependent upon the design of its storage infrastructure. As seen in Chapter 7, processing in this environment revolves around the simultaneous movement and transformation of data on many compute elements. Effective storage solutions combine hardware and software to meet these needs.

