Results 1–10 of 96
An overview of similarity measures for clustering XML documents
 Chapter in Athena Vakali and George Pallis (eds.), 2006
Abstract

Cited by 7 (2 self)
The large amount and heterogeneity of XML documents on the Web require the development of clustering techniques to group together similar documents. Documents can be grouped together according to their content, their structure, and links inside and among documents. For instance, grouping together documents with similar structures has interesting applications in the context of information extraction, heterogeneous data integration, personalized content delivery, access control definition, web site structural analysis, and comparison of RNA secondary structures. Many approaches have been proposed for evaluating the structural and content similarity between tree-based and vector-based representations of XML documents. Link-based similarity approaches developed for Web data clustering have been adapted for XML documents. This chapter discusses and compares the most relevant similarity measures and their employment for XML document clustering.
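One simple instance of a vector-based structural measure is to represent each document by the frequency of its element tags and compare the resulting vectors with cosine similarity. This is only an illustrative sketch, not one of the chapter's specific measures, which include much richer tree-, vector-, and link-based approaches:

```python
# Illustrative sketch: a tag-frequency vector per document, compared by
# cosine similarity. Documents with more shared structure score closer to 1.
import math
import xml.etree.ElementTree as ET
from collections import Counter

def tag_vector(xml_text: str) -> Counter:
    """Count how often each element tag occurs in the document."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = "<book><title/><author/><author/></book>"
doc2 = "<book><title/><author/><year/></book>"
print(cosine_similarity(tag_vector(doc1), tag_vector(doc2)))  # between 0 and 1
```

Such a measure ignores element nesting and order, which is exactly why the tree-based measures surveyed in the chapter exist.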
Hybrid metaheuristics for the vehicle routing problem with stochastic demands
 JOURNAL OF MATHEMATICAL, 2006
The Life and Death of Statically Detected Vulnerabilities: An Empirical Study
Abstract

Cited by 5 (0 self)
Vulnerable statements constitute a major problem for developers and maintainers of networking systems. Their presence can ease the success of security attacks, aimed at gaining unauthorized access to data and functionality, or at causing system crashes and data loss. Examples of attacks caused by source code vulnerabilities are buffer overflows, command injections, and cross-site scripting. This paper reports on an empirical study, conducted across four networking systems, aimed at observing the evolution and decay of vulnerabilities detected by three freely available static analysis tools. In particular, the study compares the decay of different kinds of vulnerabilities, characterizes the decay likelihood through probability density functions, and reports a quantitative and qualitative analysis of the reasons for vulnerability removals. The study is performed by using a framework that traces the evolution of source code fragments across subsequent commits.
Foundations of Statistical Processing of Set-Valued Data: Towards Efficient Algorithms
 Proceedings of the Fifth International Conference on Intelligent Technologies InTech’04, 2004
Abstract

Cited by 5 (4 self)
Due to measurement uncertainty, often, instead of the actual values xi of the measured quantities, we only know the intervals xi = [x̃i − ∆i, x̃i + ∆i], where x̃i is the measured value and ∆i is the upper bound on the measurement error (provided, e.g., by the manufacturer of the measuring instrument). These intervals can be viewed as random intervals, i.e., as samples from an interval-valued random variable. In such situations, instead of the exact value of a sample statistic such as the covariance Cx,y, we can only have an interval Cx,y of possible values of this statistic. In this paper, we extend the foundations of traditional statistics to statistics of such set-valued data, and describe how this foundation can lead to efficient algorithms for computing the corresponding set-valued statistics.
The Effectiveness of Source Code Obfuscation: an Experimental Assessment
Abstract

Cited by 4 (2 self)
Source code obfuscation is a protection mechanism widely used to limit the possibility of malicious reverse engineering or attack activities on a software system. Although several code obfuscation techniques and tools are available, little is known about the capability of obfuscation to reduce attackers’ efficiency, and about the contexts in which such efficiency may vary. This paper reports the outcome of two controlled experiments meant to measure the ability of subjects to understand and modify decompiled, obfuscated Java code, compared to decompiled, clear code. Results quantify to what extent code obfuscation is able to make attacks more difficult to perform, and reveal that obfuscation can mitigate the effect of factors that can alter the likelihood of a successful attack, such as the attackers’ skill and experience, or the intrinsic characteristics of the system under attack.
Keywords: Empirical studies, Software Obfuscation, Program comprehension
Data preprocessing in liquid chromatography-mass spectrometry-based proteomics
 Bioinformatics, 2005
Abstract

Cited by 4 (0 self)
Motivation: In liquid chromatography-mass spectrometry (LC-MS) based expressional proteomics, multiple samples from different groups are analyzed in parallel. It is necessary to develop a data mining system to perform peak quantification, peak alignment, and data quality assurance.
Results: We have developed an algorithm for spectrum deconvolution. A two-step alignment algorithm is proposed for recognizing peaks generated by the same peptide but detected in different samples. The quality of LC-MS data is evaluated using statistical tests and alignment quality tests.
Availability: Xalign software is available upon request from the author.
Measures of Deviation (and Dependence) for Heavy-Tailed Distributions and their Estimation under Interval and Fuzzy Uncertainty
Abstract

Cited by 4 (3 self)
techniques are based on the assumption that the random variables are normally distributed. For such distributions, a natural characteristic of the “average” value is the mean, and a natural characteristic of the deviation from the average is the variance. However, in many practical situations, e.g., in economics and finance, we encounter probability distributions for which the variance is infinite; such distributions are called heavy-tailed. For such distributions, we describe which characteristics can be used to describe the average and the deviation from the average, and how to estimate these characteristics under interval and fuzzy uncertainty. We also discuss what the reasonable analogues of correlation are for such heavy-tailed distributions.
Interval Computations and Interval-Related Statistical Techniques: Tools for Estimating Uncertainty of the Results of Data Processing and Indirect Measurements
Abstract

Cited by 4 (1 self)
In many practical situations, we only know the upper bound ∆ on the (absolute value of the) measurement error ∆x, i.e., we only know that the measurement error is located on the interval [−∆, ∆]. The traditional engineering approach to such situations is to assume that ∆x is uniformly distributed on [−∆, ∆], and to use the corresponding statistical techniques. In some situations, however, this approach underestimates the error of indirect measurements. It is therefore desirable to directly process this interval uncertainty. Such “interval computations” methods have been developed since the 1950s. In this chapter, we provide a brief overview of related algorithms, results, and remaining open problems.
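The underestimation can be seen already for the sum of n measured values. A small numeric sketch, under the stated assumptions (error bound Δ per measurement, errors either worst-case or uniform on [−Δ, Δ]):

```python
# Sketch: error of y = x_1 + ... + x_n when each error lies in [-Δ, Δ].
#  - Interval computation gives the guaranteed worst-case bound n·Δ.
#  - The traditional uniform-distribution assumption gives a standard
#    deviation of Δ/sqrt(3) per term, hence Δ·sqrt(n/3) for the sum --
#    much smaller, which is how that approach can underestimate the error.
import math

def interval_bound(n, delta):
    return n * delta  # worst case: all errors at the same endpoint

def uniform_std(n, delta):
    return delta * math.sqrt(n / 3)  # std of sum of n uniform[-Δ, Δ] errors

n, delta = 100, 0.1
print(interval_bound(n, delta))  # 10.0 (guaranteed bound)
print(uniform_std(n, delta))     # ≈ 0.577 (statistical estimate)
```

The gap between the two numbers grows with n, so for long computation chains the interval bound and the statistical estimate can differ by orders of magnitude.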
Parallel Distributed Genetic Fuzzy Rule Selection
 SOFT COMPUTING (SPECIAL ISSUE ON GENETIC FUZZY SYSTEMS)
Abstract

Cited by 4 (1 self)
Genetic fuzzy rule selection has been successfully used to design accurate and compact fuzzy rule-based classifiers. It is, however, very difficult to handle large data sets due to the increase in computational costs. This paper proposes a simple but effective idea to improve the scalability of genetic fuzzy rule selection to large data sets. Our idea is based on its parallel distributed implementation. Both a training data set and a population are divided into subgroups (i.e., into training data subsets and subpopulations, respectively) for the use of multiple processors. We compare seven variants of the parallel distributed implementation with the original non-parallel algorithm through computational experiments on some benchmark data sets.
Trade-Off Between Sample Size and Accuracy: Case of Measurements under Interval Uncertainty, 2009
Abstract

Cited by 4 (4 self)
In many practical situations, we are not satisfied with the accuracy of the existing measurements. There are two possible ways to improve the measurement accuracy:
• first, instead of a single measurement, we can make repeated measurements; the additional information coming from these additional measurements can improve the accuracy of the result of this series of measurements;
• second, we can replace the current measuring instrument with a more accurate one; correspondingly, we can use a more accurate (and more expensive) measurement procedure provided by a measuring lab – e.g., a procedure that includes the use of a higher-quality reagent.
In general, we can combine these two ways, and make repeated measurements with a more accurate measuring instrument. What is the appropriate trade-off between sample size and accuracy? This is the general problem that we address in this paper.
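The two options can be compared numerically under a simple textbook model (my assumption here, not the paper's model): averaging n repeated measurements shrinks the random error component (standard deviation σ) roughly as σ/√n, but leaves any systematic bias bound b untouched, while a better instrument reduces both σ and b at a higher cost.

```python
# Sketch under the stated toy model: overall error bound after averaging
# n measurements ≈ bias bound + a 2σ-style bound on the averaged random error.
# Repetition helps only the second term; the bias term is a floor.
import math

def accuracy_after_repeats(sigma, bias_bound, n):
    """Approximate overall error bound after averaging n measurements."""
    return bias_bound + 2 * sigma / math.sqrt(n)

# Option 1: cheap instrument (σ = 0.1, bias bound 0.05), repeated 25 times:
print(accuracy_after_repeats(0.1, 0.05, 25))  # ≈ 0.09
# Option 2: accurate instrument (σ = 0.02, bias bound 0.01), single use:
print(accuracy_after_repeats(0.02, 0.01, 1))  # ≈ 0.05
```

In this toy model the single accurate measurement beats 25 cheap ones, because the cheap instrument's bias bound dominates no matter how large n gets; the paper studies how to make such trade-offs systematically, including under interval uncertainty.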