Advances on online learning-based spam filters
- In PhD Thesis - Tufts University
, 2008
Cited by 2 (0 self)
I would like to take this opportunity to thank my advisor Carla Brodley for her patient guidance, my parents David and Paula Sculley for their support and encouragement, and my bride Jessica Evans for making everything worth doing. I gratefully acknowledge Rediff.com for funding the writing of this dissertation.
On Free Speech and Civil Discourse: Filtering Abuse in Blog Comments
Cited by 1 (1 self)
Internet blogs provide forums for discussions within virtual communities, allowing readers to post comments on what they read. However, such comments may contain abuse, such as personal attacks, offensive remarks about race or religion, or commercial spam, all of which reduce the value of community discussion. Ideally, filters would promote civil discourse by removing abusive comments while protecting free speech by not removing any comments unnecessarily. In this paper, we investigate the use of user flags to train filters for this task, with the goal of empowering each community to enforce its own standards. We find encouraging results in experiments using a large corpus of blog comment data with real user flags. We conclude by proposing several novel deployment schemes for filters in this setting.
Filtering Email Spam in the Presence of Noisy User Feedback
Cited by 1 (0 self)
Recent email spam filtering evaluations, such as those conducted at TREC, have shown that near-perfect filtering results are attained with a variety of machine learning methods when filters are given perfectly accurate labeling feedback for training. Yet in real-world settings, labeling feedback may be far from perfect. Real users give feedback that is often mistaken, inconsistent, or even maliciously inaccurate. To our knowledge, the impact of this noisy labeling feedback on current spam filtering methods has not been previously explored in the literature. In this paper, we show that noisy feedback may harm or even break state-of-the-art spam filters, including recent TREC winners. We then propose and evaluate several approaches to make such filters robust to label noise. We find that although such modifications are effective for uniform random label noise, more realistic “natural” label noise from human users remains a difficult challenge.
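The "uniform random label noise" mentioned in this abstract can be illustrated with a minimal sketch. It assumes binary labels (0 = ham, 1 = spam) and flips each one independently with a fixed probability. The function name `flip_labels`, the `noise_rate` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary labels where each label is flipped
    independently with probability noise_rate (uniform label noise).

    Assumption: labels are 0 (ham) or 1 (spam).
    """
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in labels]


# Toy data: 1000 alternating ham/spam labels (illustrative only).
clean = [0, 1] * 500
noisy = flip_labels(clean, noise_rate=0.1)

# Number of labels that ended up wrong; close to 10% in expectation.
flipped = sum(c != n for c, n in zip(clean, noisy))
```

A filter could then be trained on `noisy` and evaluated against `clean` to measure how much the noise hurts. "Natural" noise from real users is not uniform, so this sketch covers only the simpler of the two settings the abstract describes.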
Sisterhood of Classifiers: A Comparative Study of Naive Bayes and Noisy-or Networks
Cited by 1 (0 self)
Classification is a task central to many machine learning problems. In this paper we examine two Bayesian network classifiers, the naive Bayes and the noisy-or models. They are of particular interest because of their simple structures. We compare them on two dimensions: expressive power and ability to learn. As it turns out, naive Bayes, noisy-or, and logistic regression classifiers all have equivalent expressiveness. We show mathematical derivations of how to transform a classifier in one model into the other two. These classifiers differ in their ability to learn, though. We conducted an experiment confirming the intuition that naive Bayes performs better than noisy-or when the data fits its independence assumptions, and vice versa. However, we still do not have a clear set of criteria for determining under exactly what conditions each classifier would excel. Further study of the strengths and weaknesses of each classifier should provide deeper insight into how to improve the current models. One possible extension would be to combine the naive Bayes and noisy-or models so that the network more closely depicts the actual relationship between the attributes.
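One direction of the equivalence claimed in this abstract can be checked numerically. The sketch below uses the standard textbook identity for naive Bayes with binary (Bernoulli) features, whose posterior log-odds are linear in the features. It is not necessarily the derivation given in the paper, and all names and parameter values are hypothetical.

```python
import itertools
import math


def nb_posterior(x, prior1, theta1, theta0):
    """P(class = 1 | x) under naive Bayes with binary features.

    theta1[i] = P(x_i = 1 | class 1), theta0[i] = P(x_i = 1 | class 0).
    """
    p1, p0 = prior1, 1.0 - prior1
    for xi, t1, t0 in zip(x, theta1, theta0):
        p1 *= t1 if xi else 1.0 - t1
        p0 *= t0 if xi else 1.0 - t0
    return p1 / (p1 + p0)


def nb_to_logistic(prior1, theta1, theta0):
    """Convert naive Bayes parameters into a logistic regression bias and weights."""
    bias = math.log(prior1 / (1.0 - prior1))
    weights = []
    for t1, t0 in zip(theta1, theta0):
        # Log-odds contribution when the feature is absent (x_i = 0).
        absent = math.log((1.0 - t1) / (1.0 - t0))
        bias += absent
        # Extra contribution when the feature is present (x_i = 1).
        weights.append(math.log(t1 / t0) - absent)
    return bias, weights


def logistic_posterior(x, bias, weights):
    """P(class = 1 | x) under logistic regression."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for three binary features.
prior1, theta1, theta0 = 0.3, [0.8, 0.1, 0.5], [0.2, 0.4, 0.5]
bias, weights = nb_to_logistic(prior1, theta1, theta0)

# Largest disagreement between the two models over all 8 inputs.
max_gap = max(
    abs(nb_posterior(x, prior1, theta1, theta0) - logistic_posterior(x, bias, weights))
    for x in itertools.product([0, 1], repeat=3)
)
```

The two posteriors agree up to floating-point error for every input, so the converted logistic model makes the same decisions. The third feature has identical probabilities in both classes and therefore receives a weight of zero.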
Analysis of Biological Networks: Protein-Protein Interaction Data Preprocessing
In Lecture 1 we learned about two methods for identifying protein interactions: Yeast Two-Hybrid and Co-Immunoprecipitation. These techniques are designed to identify physical binding between proteins.
1.1.1 Yeast Two-Hybrid (Y2H)
The yeast two-hybrid technique [17, 9] allows the detection of pairwise protein interactions. It exploits the modular property typical of many eukaryotic transcription factors, which can usually be decomposed into two distinct modules, one directly binding to DNA (DB, DNA-binding domain) and the other activating transcription (AD, transcriptional activating domain). The first component, DB, is able to bind to DNA even by itself, while the second module, AD, will activate transcription only if physically associated with a binding domain. In the two-hybrid experiment the test proteins are expressed as fusion proteins (hybrids) with a DNA-binding domain (DB, the bait) and a transcriptional activating domain (AD, the prey). Fusion partners are coexpressed in the yeast nucleus, where a protein-protein interaction is identified through the activation of the reporter gene, which can be detected and measured. Figure 1 shows that the two proteins whose interaction is under scrutiny, here indicated as bait and prey, are expressed as fusion proteins with a binding domain (BD) and an activation domain (AD), respectively. If an interaction between bait and prey takes place, the complex formed activates the transcription of the reporter gene, which in turn allows the interaction itself to be detected.
Investigation of data clustering preprocessing algorithm on independent attributes to improve the performance of CLONALG
- Dr. S. Chitra, B. MadhuSudhanan, Dr. M. Rajaram, Dr. S. N. Sivanandham
It is a popularly held belief that preprocessing of data generally improves the classification efficiency of data mining algorithms. We study the effects of preprocessing by using an algorithm that clusters the points in a data set on each attribute independently, which yields additional information about the data points with respect to each dimension. Noise and data boundaries are identified, and the cleaned data subset is used to compare the performance of the CLONALG data mining algorithm against its performance on the unprocessed dataset.
Novel Methods to Elucidate Core Classes in Multi-Dimensional Biomedical Data
Breast cancer, which is the most common cancer in women, is a complex disease characterised by multiple molecular alterations. Current routine clinical management relies on the availability of robust clinical and pathologic prognostic and predictive factors, like the Nottingham Prognostic Index, to support decision making. Recent advances in high-throughput molecular technologies have supported the evidence of a biological heterogeneity of breast cancer. This thesis is a multi-disciplinary work involving both computer scientists and molecular pathologists. It focuses on the development of advanced computational models for the classification of breast cancer into sub-types of the disease based on protein expression levels of selected markers. A previous study conducted at the University of Nottingham suggested that immunohistochemical analysis may be used to identify distinct biological classes of breast cancer. The objectives of this work were related both to the clinical and technical aspects. From a clinical point of view, the aim was to encourage a multiple techniques approach
Analysis of Biological Networks: Network Motifs PPI Networks: Data Preprocessing and Functional Annotation
, 2009
Network motifs are defined as “recurring patterns of interactions that are significantly over-represented”. The motivation for analyzing the motif content of a network rests on the basic assumption that the over-representation of a certain subnetwork indicates that it has some functional importance. Thus, exploring the abundant motifs in a network may provide novel insights into the functionality of these motifs in the network. Most of the notions and analyses described here were developed in Uri Alon’s laboratory at the Weizmann Institute.
1.1 Motifs in general networks
Milo et al. [20] analyzed 18 different networks from the following sources (Figure 1):
• Transcription networks from E. coli and S. cerevisiae.
• Synaptic connections between neurons in the nematode C. elegans.
• Food webs of trophic interactions between predator and prey in different ecological systems.
For each network, all the possible motifs of size n = 3 (n denoting the number of nodes; shown in Figure 1) and of size n = 4 were enumerated and compared to the average count over 1000 random networks. The randomized networks were generated while preserving the following properties of the original network:
• In-degree, out-degree and mutual degree. This is done by the edge-swapping procedure for generating random graphs, as described in Lecture 2, where swapping of a single edge and a mutual edge is done separately. A mutual edge between u and v consists of a directed edge (u, v) and a directed edge (v, u).
• The number of appearances of all (n − 1)-node subgraphs (for n > 3). This is done to ensure that a high significance was not assigned to a pattern only because it has a highly significant sub-pattern. For example, a high number of 4-cliques in a network is less surprising if it has an enrichment of
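The comparison against randomized networks described in these notes can be sketched for one 3-node pattern, the feed-forward loop. This is a simplified illustration under stated assumptions. The swap below preserves in-degree and out-degree only and does not treat mutual edges separately, and the z-score summary is an added assumption, since the notes only mention comparing against the average count. All function names and the toy graph are hypothetical.

```python
import random
import statistics
from collections import Counter


def count_ffl(edges):
    """Count feed-forward loops: a->b, b->c and a->c with distinct nodes."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    count = 0
    for a, outs in succ.items():
        for b in outs:
            for c in succ.get(b, ()):
                if c != a and c in outs:
                    count += 1
    return count


def swap_randomize(edges, n_swaps, rng):
    """Randomize by swapping (a,b),(c,d) -> (a,d),(c,b).

    Each accepted swap keeps every node's in-degree and out-degree.
    Simplification: mutual edges are not handled separately.
    """
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if a == d or c == b:
            continue  # would create a self-loop
        if (a, d) in present or (c, b) in present:
            continue  # would create a duplicate edge
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def out_in_degrees(edges):
    """Return (out-degree, in-degree) counters, used to check preservation."""
    return Counter(u for u, _ in edges), Counter(v for _, v in edges)


def motif_summary(edges, n_random=100, seed=0):
    """Return (real count, mean random count, z-score or None)."""
    rng = random.Random(seed)
    real = count_ffl(edges)
    counts = [count_ffl(swap_randomize(edges, 10 * len(edges), rng))
              for _ in range(n_random)]
    mean = statistics.mean(counts)
    std = statistics.pstdev(counts)
    z = (real - mean) / std if std > 0 else None
    return real, mean, z


# Toy directed graph (illustrative only).
demo_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4),
              (2, 4), (4, 5), (5, 0), (1, 5), (3, 0)]
randomized = swap_randomize(demo_edges, 100, random.Random(1))
real_count, mean_random, z_score = motif_summary(demo_edges)
```

A real count well above the random mean would suggest that the pattern is over-represented. The procedure in the notes additionally preserves mutual degree and, for n > 3, the counts of (n − 1)-node subgraphs, which this sketch omits.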
Semantic Grid Map Building
A conventional Occupancy Grid (OG) map, which contains occupied and unoccupied cells, can be enhanced by incorporating semantic labels of places to build a semantic grid map. A map with semantic information is more understandable to humans and hence can be used for efficient communication, leading to effective human-robot interactions. This paper proposes a new approach that enables a robot to explore an indoor environment to build an occupancy grid map and then perform semantic labeling to generate a semantic grid map. Geometrical information is obtained by classifying the places into three different semantic classes based on data collected by a 2D laser range finder. Classification is achieved by implementing logistic regression as a multi-class classifier, and the results are combined in a probabilistic framework. Labeling accuracy is further improved by topological correction on the robot position map, which is an intermediate product, and by an outlier removal process on the semantic grid map. Simulation on data collected in a university environment shows appealing results.
Analysis of Human Immune System Inspired Intrusion Detection System
Artificial Immune Systems (AIS) are algorithms inspired by the human immune system. The human immune system is a robust, decentralized, error-tolerant and adaptive system. Such properties are highly desirable for the development of novel computer systems. Unlike some other bio-inspired techniques, such as genetic algorithms and neural networks, the field of AIS encompasses a spectrum of algorithms to implement different functions. In this paper we investigate CLONALG for network intrusion classification. The Clonal Selection Algorithm (CLONALG) is inspired by the clonal selection theory of acquired immunity and has shown success on a broad range of engineering problem domains.

