Web-Based Inference Detection
Cited by 7 (1 self)
Newly published data, when combined with existing public knowledge, allows for complex and sometimes unintended inferences. We propose semi-automated tools for detecting these inferences prior to releasing data. Our tools give data owners a fuller understanding of the implications of releasing data and help them adjust the amount of data they release to avoid unwanted inferences. Our tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review. We call this new technology Web-based inference control. The paper reports on two experiments that demonstrate early successes of this technology. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.
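The three steps described in this abstract (keyword extraction, subset queries against a reference corpus, and collection of keywords absent from the private data) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the frequency-based keyword extractor, the in-memory list standing in for the Web, the stopword list, and all function names are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter
from itertools import combinations

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was",
             "for", "on", "at", "with", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(private_text, top_n):
    """Step 1: take the most frequent tokens as a crude stand-in for salient keywords."""
    counts = Counter(tokenize(private_text))
    return [word for word, _ in counts.most_common(top_n)]


def search(corpus, query_terms):
    """Step 2: return documents containing every query term.
    A toy replacement for a Web search engine."""
    wanted = set(query_terms)
    return [doc for doc in corpus if wanted <= set(tokenize(doc))]


def detect_inferences(private_text, corpus, subset_size=2, top_n=4):
    """Step 3: count terms that occur in matching documents but not in the private text.
    A document matching several keyword subsets is counted once per matching subset."""
    private_vocab = set(tokenize(private_text))
    keywords = extract_keywords(private_text, top_n)
    new_terms = Counter()
    for subset in combinations(keywords, subset_size):
        for doc in search(corpus, subset):
            new_terms.update(set(tokenize(doc)) - private_vocab)
    return new_terms


private = ("The author grew up in Springfield and studied nuclear safety. "
           "The author works on nuclear safety inspections in Springfield.")
corpus = [
    "Homer is a nuclear safety inspector in Springfield",
    "Springfield has a nuclear plant",
    "A recipe for donuts",
]
scores = detect_inferences(private, corpus)
# Terms with high counts would be candidates for manual review.
flagged = [term for term, count in scores.items() if count >= 2]
```

In this toy run, terms that co-occur with several keyword subsets (such as a name) receive higher counts than incidental terms, which is the kind of signal the abstract says is flagged for manual review. How the paper actually scores likelihood is not stated in the abstract.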
Application of a probability-based algorithm to extraction of product features from online reviews
, 2006
Cited by 5 (1 self)
Prior research has demonstrated the viability of automatically extracting product features from online reviews. This paper presents a probability-based algorithm and compares it to an existing support-based approach. Specifically, I used each algorithm to extract features from 7 Amazon.com product categories and then asked end users to rate the features in terms of helpfulness for choosing products. The end users preferred the features identified by the probability-based algorithm. This probability-based algorithm can identify features that comprise a single noun or two successive nouns (which end users rated as more helpful than features comprising only one noun), yet even for collections of tens of thousands of reviews, it still executes fast enough (at around 1 ms per review) for practical use.

Over one dozen colleagues helped pre-test early versions of the survey. Norman Sadeh and George Duncan provided valuable input concerning the study’s design and analysis. This work has been funded in part by the EUSES Consortium via the National Science Foundation (ITR-0325273) and by the National Science Foundation under Grant CCF-0438929. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the sponsors.
Extending the Cochran rule for the comparison of word frequencies between corpora
- In Proceedings of the 7th International Conference on Statistical Analysis of Textual Data (JADT 2004), 2004
Cited by 5 (0 self)
We first describe a number of inter-related issues that need to be considered by the researcher when comparing frequencies of linguistic features in two or more corpora. We then describe the chi-squared and log-likelihood tests used in previous research for the comparison of word frequencies. Our focus, in this paper, is on the issue of reliability of the statistical tests, and we describe simulation experiments to compare the reliability of the chi-squared and log-likelihood statistics under conditions of different-sized corpora and probability of a word occurring in text. We observe that the Cochran rule provides a good guide to accuracy of both statistics in general, but in some cases it needs to be extended. We conclude by recommending higher cut-off values for the Cochran rule at the 5%, 1% and 0.1% levels. In order to extend applicability of the frequency comparisons to expected values of 1 or more, use of the log-likelihood statistic is preferred over the chi-squared statistic, at the 0.01% level. The trade-off for corpus linguists is that the new critical value is 15.13.
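As a rough illustration of the two statistics compared in this abstract, the sketch below computes Pearson's chi-squared and the log-likelihood (G2) statistic for a word's frequency in two corpora, using the standard textbook 2x2 contingency table. The function names are hypothetical, the default threshold of 5 is the conventional form of the Cochran rule rather than the extended cut-offs the paper recommends (those are not given in the abstract), and the critical values are the usual chi-squared values for one degree of freedom.

```python
import math

# Standard chi-squared critical values for 1 degree of freedom.
CRITICAL_VALUES = {"5%": 3.84, "1%": 6.63, "0.1%": 10.83, "0.01%": 15.13}


def contingency(a, b, n1, n2):
    """Build observed and expected 2x2 tables.
    a, b: occurrences of the word in corpus 1 and corpus 2.
    n1, n2: total word counts of corpus 1 and corpus 2."""
    observed = [[a, b], [n1 - a, n2 - b]]
    row_totals = [a + b, (n1 - a) + (n2 - b)]
    col_totals = [n1, n2]
    grand = n1 + n2
    expected = [[row_totals[i] * col_totals[j] / grand for j in range(2)]
                for i in range(2)]
    return observed, expected


def chi_squared(observed, expected):
    """Pearson's chi-squared: sum of (O - E)^2 / E over all four cells."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


def log_likelihood(observed, expected):
    """G2 = 2 * sum of O * ln(O / E); cells with O == 0 contribute nothing."""
    return 2 * sum(o * math.log(o / e)
                   for o_row, e_row in zip(observed, expected)
                   for o, e in zip(o_row, e_row)
                   if o > 0)


def satisfies_cochran(expected, threshold=5.0):
    """Conventional check: every expected cell value is at least `threshold`."""
    return min(min(row) for row in expected) >= threshold


# A word seen 20 times in one 1000-word corpus and 10 times in another.
observed, expected = contingency(20, 10, 1000, 1000)
chi2 = chi_squared(observed, expected)
g2 = log_likelihood(observed, expected)
significant_at_5 = g2 >= CRITICAL_VALUES["5%"]
reliable = satisfies_cochran(expected)
```

In this example both statistics fall below 3.84, so the difference would not be reported as significant at the 5% level, and the smallest expected value (15) passes the conventional threshold. Passing a different `threshold` is one way to experiment with stricter cut-offs.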
Peeping Tom in the Neighborhood: Keystroke Eavesdropping on Multi-User Systems
Cited by 4 (2 self)
A multi-user system usually involves a large amount of information shared among its users. The security implications of such information can never be underestimated. In this paper, we present a new attack that allows a malicious user to eavesdrop on other users’ keystrokes using such information. Our attack takes advantage of the stack information of a process disclosed by its virtual file within procfs, the process file system supported by Linux. We show that on a multi-core system, the ESP of a process when it is making system calls can be effectively sampled by a “shadow” program that continuously reads the public statistical information of the process. Such sampling is shown to be reliable even in the presence of multiple users, when the system is under a realistic workload. From the ESP content, keystroke events can be identified if they trigger system calls. As a result, we can accurately determine inter-keystroke timings and launch a timing attack to infer the characters the victim entered. We developed techniques for automatically analyzing an application’s binary executable to extract the ESP pattern that fingerprints a keystroke event. The occurrences of such a pattern are identified from an ESP trace the shadow program records from the application’s runtime to calculate timings. These timings are further analyzed using a Hidden Markov Model and other public information related to the victim on a multi-user system. Our experimental study demonstrates that our attack greatly facilitates password cracking and also works very well on recognizing English words.
Grammatical word class variation within the British National Corpus Sampler
Cited by 2 (0 self)
This paper examines the relationship between part-of-speech frequencies and text typology in the British National Corpus Sampler. Four pairwise comparisons of part-of-speech frequencies were made: written language vs. spoken language; informative writing vs. imaginative writing; conversational speech vs. ‘task-oriented’ speech; and imaginative writing vs. ‘task-oriented’ speech. The following variation gradient was hypothesized: conversation – task-oriented speech – imaginative writing – informative writing; however, the actual progression was: conversation – imaginative writing – task-oriented speech – informative writing. It thus seems that genre and medium interact in a more complex way than originally hypothesized. However, this conclusion has been made on the basis of broad, pre-existing text types within the BNC, and, in future, the internal structure of these text types may need to be addressed.
VARD 2: A tool for dealing with spelling variation in historical corpora
Cited by 1 (1 self)
When applying corpus linguistic techniques to historical corpora, the corpus researcher should be cautious about the results obtained. Corpus annotation techniques such as part of speech tagging, trained for modern languages, are particularly vulnerable to inaccuracy due to vocabulary and grammatical shifts in language over time. Basic corpus retrieval techniques such as frequency profiling and concordancing will also be affected, in addition to the more sophisticated
COMPUTING THE VOCABULARY DEMANDS OF L2 READING
Linguistic computing can make two important contributions to second language (L2) reading instruction. One is to resolve longstanding research issues that are based on an insufficiency of data for the researcher, and the other is to resolve related pedagogical problems based on insufficiency of input for the learner. The research section of the paper addresses the question of whether reading alone can give learners enough vocabulary to read. When the computer’s ability to process large amounts of both learner and linguistic data is applied to this question, it becomes clear that, for the vast majority of L2 learners, free or wide reading alone is not a sufficient source of vocabulary knowledge for reading. But computer processing also points to solutions to this problem. Through its ability to reorganize and link documents, the networked computer can increase the supply of vocabulary input that is available to the learner. The development section of the paper elaborates a principled role for computing in L2 reading pedagogy, with examples, in two broad areas, computer-based text design and computational enrichment of undesigned texts.
Phrase-based Memory-based Machine Translation
This master thesis aims to investigate a phrase-based approach of Memory-based Machine Translation. This is a form of automatic translation powered by lazy-learning classifiers to translate fragments of the input sentence. A parallel corpus serves as the basis for training such a classifier. In the phrase-based approach the principal component of these fragments is a phrase of arbitrary length. This can be contrasted to prior research in the field in which this component was a single word. A key element in the research is a comparison of three methods of phrase extraction. A new decoder has been developed to deal with the characteristics unique to this approach, and re-assemble the translated fragments into one final translation. This research will show that one of the proposed phrase-extraction methods is capable of outperforming previous word-based approaches, even though this gain is limited and the impact
The Relationship Between Language and the Environment: Information Theory Shows Why We Have Only Three Lightness Terms
- Psychological Science (Research Article)
The surface reflectance of objects is highly variable, ranging between 4% for, say, charcoal and 90% for fresh snow. When stimuli are presented simultaneously, people can discriminate hundreds of levels of visual intensity. Despite this, human languages possess a maximum of just three basic terms for describing lightness. In English, these are white (or light), black (or dark), and gray. Why should this be? Using information theory, combined with estimates of the distribution of reflectances in the natural world and the reliability of lightness recall over time, we show that three lightness terms is the optimal number for describing surface reflectance properties in a modern urban or indoor environment. We also show that only two lightness terms would be required in a forest or rural environment.

People can discriminate hundreds of levels of visual intensity (Chapanis, 1965), and yet English possesses only three basic terms for describing lightness. English is far from unusual in this paucity of brightness terms. Although the theoretical details of Berlin and Kay’s (1969) survey of basic color terms are controversial (Saunders & van Brakel, 1997), it is a robust finding that even the simplest languages (in terms of the number of color terms) have two basic lightness terms, whereas the most complicated have only three. Why should this be? One way of addressing this interesting question is to use information theory

