A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach
- J MOL BIOL, 2001
Cited by 177 (3 self)
We have introduced a new method of protein secondary structure prediction which is based on the theory of support vector machine (SVM). SVM represents a new approach to supervised pattern classification which has been successfully applied to a wide range of pattern recognition problems, including object recognition, speaker identification, gene function prediction with microarray expression profile, etc. In these cases, the performance of SVM either matches or is significantly better than that of traditional machine learning approaches, including neural networks. The first use of the SVM approach to predict protein secondary structure is described here. Unlike the previous studies, we first constructed several binary classifiers, then assembled a tertiary classifier for three secondary structure states (helix, sheet and coil) based on these binary classifiers. The SVM method achieved a good performance of segment overlap accuracy SOV = 76.2% through sevenfold cross validation on a database of 513 non-homologous protein chains with multiple sequence alignments, which out-performs existing methods. Meanwhile three-state overall per-residue accuracy Q3 achieved 73.5%, which is at least comparable to existing single prediction methods. Furthermore a useful "reliability index" for the predictions was developed. In addition, SVM has many attractive features, including effective avoidance of overfitting, the ability to handle large feature spaces, information condensing of the given data set, etc. The SVM method is conveniently applied to many other pattern classification tasks in biology.
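The abstract above mentions assembling binary classifiers into a three-state predictor and reporting the per-residue accuracy Q3. As a minimal sketch only: the abstract does not say how the binary outputs are combined, so the one-vs-rest "largest decision value wins" rule below is an assumption, and the function names `combine_one_vs_rest` and `q3` are hypothetical. Q3 is computed as the percentage of residues whose predicted state matches the observed one.

```python
from typing import Dict, Sequence

# Three secondary structure states: helix (H), sheet (E), coil (C).
STATES = ("H", "E", "C")


def combine_one_vs_rest(scores: Dict[str, float]) -> str:
    """Pick the state whose binary classifier gives the largest decision value.

    Assumed combination rule for illustration; the paper may use another scheme.
    """
    return max(STATES, key=lambda state: scores[state])


def q3(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Three-state per-residue accuracy, as a percentage of matching residues."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if len(observed) == 0:
        raise ValueError("sequences must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For instance, `q3("HHEC", "HHCC")` counts three matching residues out of four.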
Mining E-mail Content for Author Identification Forensics
- SIGMOD RECORD, 2001
Cited by 124 (3 self)
We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors for the case of both aggregated e-mail topics as well as across different e-mail topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and, together with a Support Vector Machine learning algorithm, was used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
Boosting trees for anti-spam email filtering
- In Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001
Cited by 120 (0 self)
This paper describes a set of comparative experiments for the problem of automatically filtering unwanted electronic mail messages. Several variants of the AdaBoost algorithm with confidence-rated predictions (Schapire & Singer 99) have been applied, which differ in the complexity of the base learners considered. Two main conclusions can be drawn from our experiments: a) The boosting-based methods clearly outperform the baseline learning algorithms (Naive Bayes and Induction of Decision Trees) on the PU1 corpus, achieving very high levels of the F1 measure; b) Increasing the complexity of the base learners makes it possible to obtain better "high-precision" classifiers, which is a very important issue when misclassification costs are considered.
Finding Deceptive Opinion Spam by Any Stretch of the Imagination
Cited by 98 (10 self)
Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008). Consequently, websites containing consumer reviews are becoming targets of opinion spam. While recent work has focused primarily on manually identifiable instances of opinion spam, in this work we study deceptive opinion spam: fictitious opinions that have been deliberately written to sound authentic. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. Based on feature analysis of our learned models, we additionally make several theoretical contributions, including revealing a relationship between deceptive opinions and imaginative writing.
Authorship Attribution with Support Vector Machines
- APPLIED INTELLIGENCE, 2000
Cited by 90 (0 self)
In this paper we explore the use of text-mining methods for the identification of the author of a text. For the first time we apply the support vector machine (SVM) to this problem. As it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors and detected the target author in 60-80% of the cases. In a second experiment we ignored nouns, verbs and adjectives and replaced them by grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVM on full word forms was remarkably robust even if the author wrote about different topics.
Automatically assessing review helpfulness
- In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006
Cited by 75 (1 self)
User-supplied reviews are widely and increasingly used to enhance ecommerce and other websites. Because reviews can be numerous and varying in quality, it is important to assess how helpful each review is. While review helpfulness is currently assessed manually, in this paper we consider the task of automatically assessing it. Experiments using SVM regression on a variety of features over Amazon.com product reviews show promising results, with rank correlations of up to 0.66. We found that the most useful features include the length of the review, its unigrams, and its product rating.
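The abstract above reports results as rank correlations between predicted and actual helpfulness. It does not say which rank correlation coefficient was used; the sketch below assumes Spearman's coefficient, computed as the Pearson correlation of average ranks so that ties are handled. The function names are hypothetical and the code is illustrative rather than the authors' evaluation script.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(predicted), average_ranks(actual))
```

A value near 1 means the predicted scores order the reviews almost exactly as the reference helpfulness scores do, regardless of the scale of either score.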
SVMs for the Blogosphere: Blog identification and Splog detection
- In Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing Weblogs, 2006
Cited by 72 (7 self)
Weblogs, or blogs, have become an important new way to publish information, engage in discussions and form communities. The increasing popularity of blogs has given rise to search and analysis engines focusing on the “blogosphere”. A key requirement of such systems is to identify blogs as they crawl the Web. While this ensures that only blogs are indexed, blog search engines are also often overwhelmed by spam blogs (splogs). Splogs not only incur computational overheads but also reduce user satisfaction. In this paper we first describe experimental results of blog identification using Support Vector Machines (SVM). We compare results of using different feature sets and introduce new features for blog identification. We then report preliminary results on splog detection and identify future work.
On Attacking Statistical Spam Filters
- IN PROCEEDINGS OF THE CONFERENCE ON E-MAIL AND ANTI-SPAM (CEAS), 2004
Cited by 64 (0 self)
The efforts of anti-spammers and spammers have often been described as an arms race. As we devise new ways to stem the flood of bulk mail, spammers respond by working their way around the new mechanisms. Their attempts to bypass spam filters illustrate this struggle. Spammers have tried many things, from HTML layout tricks and letter substitution to adding random data. While at times their attacks are clever, they have yet to work strongly against the statistical nature that drives many filtering systems. The challenges in successfully developing such an attack are great, as the variety of filtering systems makes it less likely that a single attack can work against all of them. Here, we examine the general attack methods spammers use, along with challenges faced by developers and spammers. We also demonstrate an attack that, while easy to implement, attempts to work more strongly against the statistical nature behind filters.
Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora
- In Technical Report, Computer Science department, IR-418
Cited by 58 (2 self)
Office workers everywhere are drowning in email, not only spam but also large quantities of legitimate email to be read and organized for browsing. Although there have been extensive investigations of automatic document categorization, email gives rise to a number of unique challenges, and there has been relatively little study of classifying email into folders. This paper presents an extensive benchmark study of email foldering using two large corpora of real-world email messages and foldering schemes: one from former Enron employees, another from participants in an SRI research project. We discuss the challenges that arise from differences between email foldering and traditional document classification. We show experimental results from an array of automated classification methods and evaluation methodologies, including a new evaluation method of foldering results based on the email timeline, and including enhancements to the exponential gradient method Winnow, providing top-tier accuracy with a fraction of the training time of alternative methods. We also establish that classification accuracy in many cases is relatively low, confirming the challenges of email data, and pointing toward email foldering as an important area for further research.
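The abstract above refers to enhancements to Winnow without describing them. For orientation only, here is a minimal sketch of the classic two-class Winnow rule with binary features and multiplicative updates, which is an assumption about the baseline and not the enhanced variant from the report. The names `winnow_train` and `winnow_predict`, the promotion factor `alpha`, and the threshold choice (the number of features) are illustrative.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[int], int]  # (binary feature vector, label 0 or 1)


def winnow_predict(weights: Sequence[float], x: Sequence[int]) -> int:
    """Predict 1 when the summed weight of active features reaches the threshold."""
    threshold = len(weights)
    active_weight = sum(w for w, xi in zip(weights, x) if xi)
    return 1 if active_weight >= threshold else 0


def winnow_train(
    examples: Sequence[Example],
    n_features: int,
    alpha: float = 2.0,
    epochs: int = 10,
) -> List[float]:
    """Classic Winnow: multiply or divide the weights of active features on mistakes."""
    weights = [1.0] * n_features
    for _ in range(epochs):
        mistakes = 0
        for x, label in examples:
            if winnow_predict(weights, x) == label:
                continue
            mistakes += 1
            # Promote on a missed positive, demote on a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights
```

Because only the weights of features present in a misclassified example change, each update costs time proportional to the number of active features, which is one reason this family of learners trains quickly on sparse text data.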