Results 11 - 20
of
524
Criteria for evaluating usability evaluation methods
- International Journal of Human-Computer Interaction
, 2001
"... The current variety of alternative approaches to usability evaluation methods (UEMs) designed to assess and improve usability in software systems is offset by a general lack of understanding of the capabilities and limitations of each. Practitioners need to know which methods are more effective and ..."
Abstract
-
Cited by 38 (0 self)
- Add to MetaCart
The current variety of alternative approaches to usability evaluation methods (UEMs) designed to assess and improve usability in software systems is offset by a general lack of understanding of the capabilities and limitations of each. Practitioners need to know which methods are more effective and in what ways and for what purposes. However, UEMs cannot be evaluated and compared reliably because of the lack of standard criteria for comparison. In this article, we present a practical discussion of factors, comparison criteria, and UEM performance measures useful in studies comparing UEMs. In demonstrating the importance of developing appropriate UEM evaluation criteria, we offer operational definitions and possible measures of UEM performance. We highlight specific challenges that researchers and practitioners face in comparing UEMs and provide a point of departure for further discussion and refinement of the principles and techniques used to approach UEM evaluation and comparison. 1.
A reference collection for Web spam
- SIGIR Forum
, 2006
"... We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by ..."
Abstract
-
Cited by 36 (12 self)
- Add to MetaCart
We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges. 1
The pyramid method: incorporating human content selection variation in summarization evaluation
- ACM Transactions on Speech and Language Processing
, 2007
"... Human variation in content selection in summarization has given rise to some fundamental research questions: How can one incorporate the observed variation in suitable evaluation measures? How can such measures reflect the fact that summaries conveying different content can be equally good and infor ..."
Abstract
-
Cited by 35 (3 self)
- Add to MetaCart
Human variation in content selection in summarization has given rise to some fundamental research questions: How can one incorporate the observed variation in suitable evaluation measures? How can such measures reflect the fact that summaries conveying different content can be equally good and informative? In this paper we address these very questions by proposing a method for analysis of multiple human abstracts into semantic content units. Such analysis allows us not only to quantify human variation in content selection, but also to assign empirical importance weight to different content units. It serves as the basis for an evaluation method, the Pyramid Method, that incorporates the observed variation and is predictive of different equally informative summaries. We discuss the reliability of content unit annotation, the properties of Pyramid scores, and their correlation with other evaluation methods.
How Well do Experienced Software Developers Predict Software Change?
, 1998
"... practitioners and their customers who are familiar with the effects of large deviations between planned time for delivery and the actual one. Time is often estimated based on the size of the software to build and it is therefore interesting to investigate how well experienced software developers pre ..."
Abstract
-
Cited by 34 (5 self)
- Add to MetaCart
practitioners and their customers who are familiar with the effects of large deviations between planned time for delivery and the actual one. Time is often estimated based on the size of the software to build and it is therefore interesting to investigate how well experienced software developers predict change.
Recognizing Subjectivity: A Case Study of Manual Tagging
- Natural Language Engineering
, 1999
"... In this paper, we describe a case study of a sentence-level categorization in which tagging instructions are developed and used by four judges to classify clauses from the Wall Street Journal as either subjective or objective. Agreement among the four judges is analyzed, and, based on that analysis, ..."
Abstract
-
Cited by 34 (7 self)
- Add to MetaCart
In this paper, we describe a case study of a sentence-level categorization in which tagging instructions are developed and used by four judges to classify clauses from the Wall Street Journal as either subjective or objective. Agreement among the four judges is analyzed, and, based on that analysis, each clause is given a final classification. To provide empirical support for the classifications, correlations are assessed in the data between the subjective category and a basic semantic class posited by Quirk et al. (1985).
Automatic summarization of open-domain multiparty dialogues in diverse genres
- Computational Linguistics
, 2002
"... Automatic summarization of open-domain spoken dialogues is a relatively new research area. This article introduces the task and the challenges involved and motivates and presents an approach for obtaining automatic-extract summaries for human transcripts of multiparty dialogues of four different gen ..."
Abstract
-
Cited by 30 (0 self)
- Add to MetaCart
Automatic summarization of open-domain spoken dialogues is a relatively new research area. This article introduces the task and the challenges involved and motivates and presents an approach for obtaining automatic-extract summaries for human transcripts of multiparty dialogues of four different genres, without any restriction on domain. We address the following issues, which are intrinsic to spoken-dialogue summarization and typically can be ignored when summarizing written text such as news wire data: (1) detection and removal of speech disfluencies; (2) detection and insertion of sentence boundaries; and (3) detection and linking of cross-speaker information units (question-answer pairs). A system evaluation is performed using a corpus of 23 dialogue excerpts with an average duration of about 10 minutes, comprising 80 topical segments and about 47,000 words total. The corpus was manually annotated for relevant text spans by six human annotators. The global evaluation shows that for the two more informal genres, our summarization system using dialoguespecific components significantly outperforms two baselines: (1) a maximum-marginal-relevance ranking algorithm using TF*IDF term weighting, and (2) a LEAD baseline that extracts the first n words from a text. 1.
Nearest neighbor classification from multiple feature subsets
- Intelligent Data Analysis
, 1999
"... Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging, Boosting, or Error Correcting Output Coding, that significantly improve classifiers like decision trees, rule learners, or neural networks. Unfortunately, th ..."
Abstract
-
Cited by 29 (1 self)
- Add to MetaCart
Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging, Boosting, or Error Correcting Output Coding, that significantly improve classifiers like decision trees, rule learners, or neural networks. Unfortunately, these combining methods do not improve the nearest neighbor classifier. In this paper, we present MFS, a combining algorithm designed to improve the accuracy of the nearest neighbor (NN) classifier. MFS combines multiple NN classifiers each using only a random subset of features. The experimental results are encouraging: On 25 datasets from the UCI Repository, MFS signi cantly outperformed several standard NN variants and was competitive with boosted decision trees. In additional experiments, we show that MFS is robust to irrelevant features, and is able to reduce both bias and variance components of error.
Verb Class Disambiguation Using Informative Priors
- COMPUTATIONAL LINGUISTICS
, 2004
"... Levin’s (1993) study of verb classes is a widely used resource for lexical semantics. In her framework, some verbs, such as give, exhibit no class ambiguity. But other verbs, such as write, have several alternative classes. We extend Levin’s inventory to a simple statistical model of verb class ambi ..."
Abstract
-
Cited by 29 (4 self)
- Add to MetaCart
Levin’s (1993) study of verb classes is a widely used resource for lexical semantics. In her framework, some verbs, such as give, exhibit no class ambiguity. But other verbs, such as write, have several alternative classes. We extend Levin’s inventory to a simple statistical model of verb class ambiguity. Using this model we are able to generate preferences for ambiguous verbs without the use of a disambiguated corpus. We additionally show that these preferences are useful as priors for a verb sense disambiguator.
Data Management and Analysis Methods
- IN DENZIN N, LINCOLN Y (EDS.) HANDBOOK OF QUALITATIVE RESEARCH, 2ND ED., THOUSAND OAKS, CA: SAGE PUBLICATIONS
, 2000
"... This chapter is about methods for managing and analyzing qualitative data. By qualitative data we mean text: newspapers, movies, sitcoms, e-mail traffic, folktales, life histories. We also mean narratives—narratives about getting divorced, about being sick, about surviving hand-to-hand combat, about ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
This chapter is about methods for managing and analyzing qualitative data. By qualitative data we mean text: newspapers, movies, sitcoms, e-mail traffic, folktales, life histories. We also mean narratives—narratives about getting divorced, about being sick, about surviving hand-to-hand combat, about selling sex, about trying to quit smoking. In fact, most of the archaeologically recoverable information about human thought and human behavior is text, the good stuff of social science. Scholars in content analysis began using computers in the 1950s to do statistical analysis of texts (Pool, 1959), but recent advances in technology are changing the economics of the social
A Comprehensive Investigation of Quality Factors in Object-Oriented Designs. An Industrial Case Study
, 1998
"... This paper aims at empirically exploring the relationships between most of the existing design coupling, cohesion, and inheritance measures for object-oriented (OO) systems, and the fault-proneness of OO system classes. The underlying goal of this study is to better understand the relationship betwe ..."
Abstract
-
Cited by 26 (5 self)
- Add to MetaCart
This paper aims at empirically exploring the relationships between most of the existing design coupling, cohesion, and inheritance measures for object-oriented (OO) systems, and the fault-proneness of OO system classes. The underlying goal of this study is to better understand the relationship between existing design measurement in OO systems and the quality of the software developed. In addition, we aim at assessing whether such relationships, once modeled, can be used to effectively drive and focus inspections or testing. The study described here is a replication of an analogous study conducted in a university environment with systems developed by students. In order to draw more general conclusions and to (dis)confirm the results obtained there, we now replicated the study using data collected on an industrial system developed by professionals. Results show that many of our findings are consistent across systems, despite the very disparate nature of the systems under study. Some of the strong dimensions captured by the measures in each data set are visible in both the university and industrial case study. For example, the frequency of method invocations appears to be the main driving factor of fault-proneness in all systems. However, there are also differences across studies, which illustrate the fact that, although many principles and techniques can be reused, quality does not follow universal laws and quality models must be developed locally, wherever needed.

