Results 1 - 10 of 52
A Brief History of Generative Models for Power Law and Lognormal Distributions
- INTERNET MATHEMATICS
Cited by 192 (7 self)
Recently, I became interested in a current debate over whether file size distributions are best modelled by a power law distribution or a lognormal distribution. In trying ...
Information Retrieval Interaction
, 1992
Cited by 158 (6 self)
this document, text or image about?' Gradually moving from the left to the right in Figure 3.1, different understandings of this concept evolve
Creating efficient codebooks for visual recognition
- In Proceedings of the IEEE International Conference on Computer Vision
, 2005
Cited by 111 (12 self)
Visual codebook based quantization of robust appearance descriptors extracted from local image patches is an effective means of capturing image statistics for texture analysis and scene classification. Codebooks are usually constructed by using a method such as k-means to cluster the descriptor vectors of patches sampled either densely (‘textons’) or sparsely (‘bags of features’ based on keypoints or salience measures) from a set of training images. This works well for texture analysis in homogeneous images, but the images that arise in natural object recognition tasks have far less uniform statistics. We show that for dense sampling, k-means over-adapts to this, clustering centres almost exclusively around the densest few regions in descriptor space and thus failing to code other informative regions. This gives suboptimal codes that are no better than using randomly selected centres. We describe a scalable acceptance-radius based clusterer that generates better codebooks and study its performance on several image classification tasks. We also show that dense representations outperform equivalent keypoint based ones on these tasks and that SVM or Mutual Information based feature selection starting from a dense codebook further improves the performance.
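As a rough illustration of the codebook idea in this abstract, the sketch below builds a codebook with a greedy fixed-radius rule and then quantizes descriptors into a histogram of codeword counts. It is a simplified stand-in, not the acceptance-radius clusterer the paper describes; the function names (`radius_codebook`, `quantize`), the Euclidean distance, and the toy 2-D descriptors are all assumptions made for the example.

```python
import math
from collections import Counter


def nearest_centre(descriptor, centres):
    """Index of the centre closest to the descriptor (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(descriptor, centres[i]))


def radius_codebook(descriptors, radius):
    """Greedy stand-in clusterer (hypothetical, not the paper's method):
    a descriptor becomes a new centre only if no existing centre lies
    within `radius` of it, so dense regions cannot absorb every centre."""
    centres = []
    for d in descriptors:
        if all(math.dist(d, c) > radius for c in centres):
            centres.append(d)
    return centres


def quantize(descriptors, centres):
    """Histogram of codeword counts for one image's descriptors."""
    counts = Counter(nearest_centre(d, centres) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(centres))]


# Toy data: three descriptors in a dense region, one far away.
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
centres = radius_codebook(descriptors, radius=1.0)
histogram = quantize(descriptors, centres)
```

Unlike k-means, the radius rule spends one centre per occupied neighbourhood regardless of how many descriptors fall there, which is the intuition behind avoiding over-adaptation to the densest regions.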
Learning Taxonomic Relations from Heterogeneous Evidence
Cited by 63 (8 self)
We present a novel approach to the automatic acquisition of taxonomic relations. The main difference from earlier approaches is that we do not only consider one single source of evidence, i.e. a specific algorithm or approach, but examine the possibility of learning taxonomic relations by considering various and heterogeneous forms of evidence. In particular, we derive these different kinds of evidence by using well-known NLP techniques and resources and combine them via two simple strategies. Our approach shows very promising results compared to other results from the literature. The main aim of the work presented in this paper is (i) to gain insight into the behaviour of different approaches to learn taxonomic relations, (ii) to provide a first step towards combining these different approaches, and (iii) to establish a baseline for further research.
Random texts exhibit Zipf's-law-like word frequency distribution
- IEEE Transactions on Information Theory
, 1992
Cited by 63 (2 self)
It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf’s law observed in natural languages such as English. The facts that the frequency of occurrence of a word is almost an inverse power law function of its rank and the exponent of this inverse power law is very close to 1 are largely due to the transformation from the word’s length to its rank, which stretches an exponential function to a power law function. Key words: statistical linguistics, Zipf’s law, power-law distribution, random texts. Zipf observed a long time ago [1] that the distribution of word frequencies in English, if the words are aligned according to their ranks, is an inverse power law with the exponent very close to 1. In other words, if the most frequently occurring word appears in the text with the frequency P(1), the next most frequently occurring word has the frequency P(2), and the rank-r word has the frequency P(r), the frequency distribution is $P(r) = C / r^{\alpha}$ (1)
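The claim in this abstract can be explored with a small simulation. The sketch below generates a random text by drawing characters uniformly from a tiny alphabet plus the space character, ranks the resulting "words" by frequency, and fits a straight line to the log-log rank-frequency data. The alphabet, text length, seed, and function names are assumptions chosen for illustration, and the least-squares fit is only a crude estimate of the exponent in the rank-frequency relation.

```python
import math
import random
from collections import Counter


def random_text(n_chars, alphabet="ab", seed=0):
    """Draw characters uniformly from the alphabet plus a space (assumed model)."""
    rng = random.Random(seed)
    symbols = alphabet + " "
    return "".join(rng.choice(symbols) for _ in range(n_chars))


def rank_frequency(text):
    """Relative word frequencies, ordered from rank 1 (most frequent) downward."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


text = random_text(20000)
freqs = rank_frequency(text)
slope = loglog_slope(freqs)  # negative; its magnitude estimates the exponent
```

Because all words of the same length are equally likely in this model, the rank-frequency curve is a staircase; the power law appears as the trend across steps, which is why a fit over all ranks is only approximate.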
Hierarchical Document Clustering Using Frequent Itemsets
- IN PROC. SIAM INTERNATIONAL CONFERENCE ON DATA MINING 2003 (SDM 2003)
, 2003
Cited by 55 (1 self)
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains a small fraction of words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
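The intuition that a cluster is identified by words shared among its documents can be sketched as follows. This is a brute-force illustration only, not the algorithm from the paper: it enumerates word sets of size one and two directly instead of using an association-rule miner, and the names `frequent_itemsets` and `initial_clusters`, the `min_support` fraction, and the toy documents are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Word sets contained in at least `min_support` (a fraction) of the documents.
    Brute-force enumeration, suitable only for tiny examples."""
    doc_sets = [set(d.split()) for d in docs]
    counts = Counter()
    for words in doc_sets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(docs)
    return {s for s, c in counts.items() if c >= threshold}


def initial_clusters(docs, itemsets):
    """Map each frequent itemset to the indices of documents containing all its words."""
    doc_sets = [set(d.split()) for d in docs]
    return {s: [i for i, w in enumerate(doc_sets) if s <= w] for s in itemsets}


docs = [
    "grid job scheduling",
    "grid data replication",
    "word frequency law",
    "grid job throughput",
]
itemsets = frequent_itemsets(docs, min_support=0.5)
clusters = initial_clusters(docs, itemsets)
```

Here the larger itemset {grid, job} labels a more specific cluster nested inside the {grid} cluster, which hints at how frequent itemsets could also organise a topic hierarchy. Documents may fall into several initial clusters, so a real method would still need a step to resolve the overlap.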
Authorship Attribution with Support Vector Machines
- APPLIED INTELLIGENCE
, 2000
Cited by 45 (0 self)
In this paper we explore the use of text-mining methods for the identification of the author of a text. For the first time we apply the support vector machine (SVM) to this problem. As it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors and detected the target author in 60-80% of the cases. In a second experiment we ignored nouns, verbs and adjectives and replaced them by grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVM on full word forms was remarkably robust even if the author wrote about different topics.
Towards a benchmark for Semantic Web reasoners - an analysis of the DAML ontology library
, 2003
Cited by 31 (4 self)
Introduction. Benchmarks are one important aspect of performance evaluation. This paper concentrates on the development of a representative benchmark for the Semantic Web. To this extent we perform a statistical analysis of available Semantic Web ontologies, in our case the DAML ontology library, and derive parameters that can be used for the generation of synthetic ontologies. These synthetic ontologies can be used as workloads in benchmarks. Naturally, performance evaluation can also be performed using a real workload, viz. a workload that is observed on a reasoner being used for normal operations. However, such workloads can usually not be applied repeatedly in a controlled manner. Therefore, synthetic workloads are typically used in performance evaluations. Synthetic workloads should be a representation or model of the real workload. Hence, it is necessary to measure and characterize the workload on existing reasoners to produce meaningful synthetic workloads. This should allow us to systematically evaluate different reasoners ...
Evaluating Scheduling and Replica Optimisation Strategies in OptorSim
- In 4th International Workshop on Grid Computing (Grid2003)
, 2003
Cited by 27 (4 self)
Grid computing is fast emerging as the solution to the problems posed by the massive computational and data handling requirements of many current international scientific projects. Simulation of the Grid environment is important to evaluate the impact of potential data handling strategies before being deployed on the Grid. In this paper, we look at the effects of various job scheduling and data replication strategies and compare them in a variety of Grid scenarios, evaluating several performance metrics. We use the Grid simulator OptorSim, and base our simulations on a world-wide Grid testbed for data intensive high energy physics experiments. Our results show that the choice of scheduling and data replication strategies can have a large effect on both job throughput and the overall consumption of Grid resources.

