Results 1 - 10
of
138
TwitterStand: News in Tweets ∗
"... Twitter is an electronic medium that allows a large user populace to communicate with each other simultaneously. Inherent to Twitter is an asymmetrical relationship between friends and followers that provides an interesting social networklike structure among the users of Twitter. Twitter messages, c ..."
Abstract
-
Cited by 37 (9 self)
- Add to MetaCart
Twitter is an electronic medium that allows a large user populace to communicate with each other simultaneously. Inherent to Twitter is an asymmetrical relationship between friends and followers that provides an interesting social networklike structure among the users of Twitter. Twitter messages, called tweets, are restricted to 140 characters and thus are usually very focused. We investigate the use of Twitter to build a news processing system, called TwitterStand, from Twitter tweets. The idea is to capture tweets that correspond to late breaking news. The result is analogous to a distributed news wire service. The difference is that the identities of the contributors/reporters are not known in advance and there may be many of them. Furthermore, tweets are not sent according to a schedule: they occur as news is happening, and tend to be noisy while usually arriving at a high throughput rate. Some of the issues addressed include removing the noise, determining tweet clusters of interest bearing in mind that the methods must be online, and determining the relevant locations associated with the tweets.
A maximum entropy model of phonotactics and phonotactic learning
, 2006
"... The study of phonotactics (e.g., the ability of English speakers to distinguish possible words like blick from impossible words like *bnick) is a central topic in phonology. We propose a theory of phonotactic grammars and a learning algorithm that constructs such grammars from positive evidence. Our ..."
Abstract
-
Cited by 35 (5 self)
- Add to MetaCart
The study of phonotactics (e.g., the ability of English speakers to distinguish possible words like blick from impossible words like *bnick) is a central topic in phonology. We propose a theory of phonotactic grammars and a learning algorithm that constructs such grammars from positive evidence. Our grammars consist of constraints that are assigned numerical weights according to the principle of maximum entropy. Possible words are assessed by these grammars based on the weighted sum of their constraint violations. The learning algorithm yields grammars that can capture both categorical and gradient phonotactic patterns. The algorithm is not provided with any constraints in advance, but uses its own resources to form constraints and weight them. A baseline model, in which Universal Grammar is reduced to a feature set and an SPE-style constraint format, suffices to learn many phonotactic phenomena. In order to learn nonlocal phenomena such as stress and vowel harmony, it is necessary to augment the model with autosegmental tiers and metrical grids. Our results thus offer novel, learning-theoretic support for such representations. We apply the model to English syllable onsets, Shona vowel harmony, quantity-insensitive stress typology, and the full phonotactics of Wargamay, showing that the learned grammars capture the distributional generalizations of these languages and accurately predict the findings of a phonotactic experiment.
Speaking with Hands: Creating Animated Conversational Characters from Recordings of Human Performance
, 2004
"... We describe a method for using a database of recorded speech and captured motion to create an animated conversational character. People's utterances are composed of short, clearly-delimited phrases; in each phrase, gesture and speech go together meaningfully and synchronize at a common point of maxi ..."
Abstract
-
Cited by 32 (2 self)
- Add to MetaCart
We describe a method for using a database of recorded speech and captured motion to create an animated conversational character. People's utterances are composed of short, clearly-delimited phrases; in each phrase, gesture and speech go together meaningfully and synchronize at a common point of maximum emphasis. We develop tools for collecting and managing performance data that exploit this structure. The tools help create scripts for performers, help annotate and segment performance data, and structure specific messages for characters to use within application contexts. Our animations then reproduce this structure. They recombine motion samples with new speech samples to recreate coherent phrases, and blend segments of speech and motion together phraseby -phrase into extended utterances. By framing problems for utterance generation and synthesis so that they can draw closely on a talented performance, our techniques support the rapid construction of animated characters with rich and appropriate expression.
A Shallow Parser Based on Closed-Class Words to Capture Relations in Biomedical Text
, 2003
"... Natural language processing for biomedical text currently focuses mostly on entity and relation extraction. These entities and relations are usually pre-specified entities, e.g., proteins, and pre-specified relations, e.g., inhibit relations. A shallow parser that captures the relations between noun ..."
Abstract
-
Cited by 27 (4 self)
- Add to MetaCart
Natural language processing for biomedical text currently focuses mostly on entity and relation extraction. These entities and relations are usually pre-specified entities, e.g., proteins, and pre-specified relations, e.g., inhibit relations. A shallow parser that captures the relations between noun phrases automatically from free text has been developed and evaluated. It uses heuristics and a noun phraser to capture entities of interest in the text. Cascaded finite state automata structure the relations between individual entities. The automata are based on closed-class English words and model generic relations not limited to specific words. The parser also recognizes coordinating conjunctions and captures negation in text, a feature usually ignored by others. Three cancer researchers evaluated 330 relations extracted from 26 abstracts of interest to them. There were 296 relations correctly extracted from the abstracts resulting in 90% precision of the relations and an average of 11 correct relations per abstract.
Text mining and ontologies in biomedicine: making sense of raw text
- Briefings in Bioinformatics
, 2005
"... The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill info rmation, extract facts, discover implicit links and generate hypotheses relevant to use ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill info rmation, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. In this article, we summarize different approaches in which ontologies have been used for text mining applications in biomedicine.
STEWARD: Architecture of a Spatio-Textual Search Engine
- in ACMGIS 2007
, 2007
"... STEWARD (“Spatio-Textual Extraction on the Web Aiding Retrieval of Documents”), a system for extracting, querying, and visualizing textual references to geographic locations in unstructured text documents, is presented. Methods for retrieving and processing web documents, extracting and disambiguati ..."
Abstract
-
Cited by 23 (15 self)
- Add to MetaCart
STEWARD (“Spatio-Textual Extraction on the Web Aiding Retrieval of Documents”), a system for extracting, querying, and visualizing textual references to geographic locations in unstructured text documents, is presented. Methods for retrieving and processing web documents, extracting and disambiguating georeferences, and identifying geographic focus are described. A brief overview of STEWARD’s querying capabilities, as well as the design of an intuitive user interface, are provided. Finally, several application scenarios and future extensions to STEWARD are discussed.
Fast Context-Free Grammar Parsing Requires Fast Boolean Matrix Multiplication
, 2002
"... In 1975, Valiant showed that Boolean matrix multiplication can be used for parsing context-free grammars (CFGs), yielding the asympotically fastest (although not practical) CFG parsing algorithm known. We prove a dual result: any CFG parser with time complexity $O(g n^{3 - \epsilson})$, where $g$ is ..."
Abstract
-
Cited by 21 (0 self)
- Add to MetaCart
In 1975, Valiant showed that Boolean matrix multiplication can be used for parsing context-free grammars (CFGs), yielding the asympotically fastest (although not practical) CFG parsing algorithm known. We prove a dual result: any CFG parser with time complexity $O(g n^{3 - \epsilson})$, where $g$ is the size of the grammar and $n$ is the length of the input string, can be efficiently converted into an algorithm to multiply $m \times m$ Boolean matrices in time $O(m^{3 - \epsilon/3})$. Given that practical, substantially sub-cubic Boolean matrix multiplication algorithms have been quite difficult to find, we thus explain why there has been little progress in developing practical, substantially sub-cubic general CFG parsers. In proving this result, we also develop a formalization of the notion of parsing.
Representing hierarchical POMDPs as DBNs for multi-scale robot localization
, 2004
"... We explore the advantages of representing hierarchical partially observable Markov decision processes (H-POMDPs) as dynamic Bayesian networks (DBNs). In particular, we focus on the special case of using H-POMDPs to represent multiresolution spatial maps for indoor robot navigation. Our results show ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
We explore the advantages of representing hierarchical partially observable Markov decision processes (H-POMDPs) as dynamic Bayesian networks (DBNs). In particular, we focus on the special case of using H-POMDPs to represent multiresolution spatial maps for indoor robot navigation. Our results show that a DBN representation of H-POMDPs can train significantly faster than the original learning algorithm for H-POMDPs or the equivalent flat POMDP, and requires much less data. In addition, the DBN formulation can easily be extended to parameter tying and factoring of variables, which further reduces the time and sample complexity. This enables us to apply H-POMDP methods to much larger problems than previously possible. 1.
Multi-level annotation of linguistic data with MMAX2
- Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods
, 2006
"... This paper describes how richly annotated corpora can be created with the annotation tool MMAX2. The description is from the point of view of Computational Linguistics, a discipline where annotated corpora are often used as resources for software development. The paper outlines the important steps i ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
This paper describes how richly annotated corpora can be created with the annotation tool MMAX2. The description is from the point of view of Computational Linguistics, a discipline where annotated corpora are often used as resources for software development. The paper outlines the important steps in the life cycle of an annotation and details how the tool MMAX2 can be employed in each of them. 1
Web-Assisted Annotation, Semantic Indexing and Search of Television and Radio News
- In Proceedings of the 14th International World Wide Web Conference
, 2005
"... The Rich News system, that can automatically annotate radio and television news with the aid of resources retrieved from the World Wide Web, is described. Automatic speech recognition gives a temporally precise but conceptually inaccurate annotation model. Information extraction from related web new ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
The Rich News system, that can automatically annotate radio and television news with the aid of resources retrieved from the World Wide Web, is described. Automatic speech recognition gives a temporally precise but conceptually inaccurate annotation model. Information extraction from related web news sites gives the opposite: conceptual accuracy but no temporal data. Our approach combines the two for temporally accurate conceptual semantic annotation of broadcast news. First low quality transcripts of the broadcasts are produced using speech recognition, and these are then automatically divided into sections corresponding to individual news stories. A key phrases extraction component finds key phrases for each story and uses these to search for web pages reporting the same event. The text and meta-data of the web pages is then used to create index documents for the stories in the original broadcasts, which are semantically annotated using the KIM knowledge management platform. A web interface then allows conceptual search and browsing of news stories, and playing of the parts of the media files corresponding to each news story. The use of material from the World Wide Web allows much higher quality textual descriptions and semantic annotations to be produced than would have been possible using the ASR transcript directly. The semantic annotations can form a part of the Semantic Web, and an evaluation shows that the system operates with high precision, and with a moderate level of recall.

