Results 11–20 of 181
Improving Category Specific Web Search by Learning Query Modifications
In Symposium on Applications and the Internet, 2001
Abstract

Cited by 86 (8 self)
A user searching for documents within a specific category using a general-purpose search engine might have a difficult time finding valuable documents. To improve category-specific search, we show that a trained classifier can recognize pages of a specified category with high precision by using textual content, text location, and HTML structure. We show that query modifications to web search engines increase the probability that the documents returned are of the specific category.
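The query-modification idea in this abstract can be sketched as appending terms a classifier found discriminative for the target category. The category names, modifier terms, and function below are illustrative assumptions, not the paper's actual learned modifications:

```python
# Hypothetical sketch: augment a user query with terms learned to be
# discriminative for a target category. The term lists here are made up
# for illustration only.
LEARNED_MODIFIERS = {
    "personal homepage": ["home page", "favorite links"],
    "course page": ["syllabus", "assignments"],
}

def modify_query(query, category):
    """Append quoted category-specific modifier terms to the query."""
    extra = " ".join(f'"{t}"' for t in LEARNED_MODIFIERS.get(category, []))
    return f"{query} {extra}".strip()

print(modify_query("machine learning", "course page"))
# machine learning "syllabus" "assignments"
```

The modified query biases a general-purpose engine toward pages of the desired category without any change on the engine's side.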
A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese
In COLING-94, 1994
Abstract

Cited by 57 (0 self)
In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model of natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but this statistical analysis of large text data for large n had never been carried out because of the memory limitations of computers and the shortage of text data. Taking advantage of recent powerful computers, we developed a new algorithm for computing n-grams of large text data for arbitrarily large n, and successfully calculated, within a relatively short time, n-grams of Japanese text data containing between two and thirty million characters. From this experiment it became clear that the automatic extraction or determination of words, compound words, and collocations is possible by mutually comparing n-gram statistics for different values of n.
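The counting step the abstract describes can be sketched in a few lines; this is a minimal illustration of character n-gram statistics, not the paper's memory-efficient algorithm for very large n:

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every character n-gram of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

text = "the theory of the thing"
counts = ngram_counts(text, 3)
print(counts["the"])  # 3
```

Comparing counts across different n (e.g. how often a frequent n-gram extends to the same (n+1)-gram) is what lets frequent strings be promoted to candidate words and collocations.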
Predicting Unseen Triphones with Senones
In IEEE Transactions on Speech and Audio Processing, 1996
An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning
1996
Abstract

Cited by 48 (8 self)
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real-world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data, especially for applications in data mining. One approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results. Moreover, data can be inherently distributed across multiple sites on the network, and merging all the data in one location can be expensive or prohibitive. In this thesis we propose, investigate, and evaluate a meta-learning approach to integrating the results of mul...
Fifty Years of Shannon Theory
1998
Abstract

Cited by 38 (0 self)
A brief chronicle is given of the historical development of the central problems in the theory of fundamental limits of data compression and reliable communication.
Practical Implementations of Arithmetic Coding
In Image and Text Compression, 1992
Abstract

Cited by 35 (6 self)
We provide a tutorial on arithmetic coding, showing how it provides nearly optimal data compression and how it can be matched with almost any probabilistic model. We indicate the main disadvantage of arithmetic coding, its slowness, and give the basis of a fast, space-efficient, approximate arithmetic coder with only minimal loss of compression efficiency. Our coder is based on the replacement of arithmetic by table lookups, coupled with a new deterministic probability-estimation scheme.
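The interval-narrowing idea underlying arithmetic coding can be shown with a toy float-based coder; practical coders, including the table-lookup one this abstract describes, use renormalized integer arithmetic instead:

```python
# Toy arithmetic coder: encodes a message as a single number in [0, 1)
# by repeatedly narrowing the interval. Float-based, so only suitable
# for short messages; real coders use integer renormalization.
def build_intervals(probs):
    """Assign each symbol a subinterval of [0, 1) of width = its probability."""
    intervals, low = {}, 0.0
    for sym, p in probs.items():
        intervals[sym] = (low, low + p)
        low += p
    return intervals

def encode(message, probs):
    intervals = build_intervals(probs)
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        s_lo, s_hi = intervals[sym]
        low, high = low + span * s_lo, low + span * s_hi
    return (low + high) / 2  # any value in [low, high) identifies the message

def decode(code, length, probs):
    intervals = build_intervals(probs)
    out = []
    for _ in range(length):
        for sym, (s_lo, s_hi) in intervals.items():
            if s_lo <= code < s_hi:
                out.append(sym)
                code = (code - s_lo) / (s_hi - s_lo)  # rescale and repeat
                break
    return "".join(out)

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
code = encode("abca", probs)
print(code, decode(code, 4, probs))
```

Note how a more probable symbol narrows the interval less, so it costs fewer bits to pin down the final number; this is how the coder approaches the entropy of the model.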
Robust temporal coding of contrast by V1 neurons for transient but not for steady-state stimuli
In J Neurosci, 1998
Some equivalences between Shannon entropy and Kolmogorov complexity
In IEEE Transactions on Information Theory, 1978
Abstract

Cited by 31 (0 self)
It is shown that the average codeword length L_{1:1} for the best one-to-one (not necessarily uniquely decodable) code for X is shorter than the average codeword length L_{UD} for the best uniquely decodable code by no more than (log2 log2 n) + 3. Let Y be a random variable taking on a finite or countable number of values and having entropy H. Then it is proved that L_{1:1} > H − log2(H+1) − log2 log2(H+1) − … − 6. Some relations are established among the Kolmogorov, Chaitin, and extension complexities. Finally it is shown that, for all computable probability distributions, the universal prefix codes associated with the conditional Chaitin complexity have expected codeword length within a constant of the Shannon entropy.
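The one-to-one codes in this abstract can beat the entropy because, unlike uniquely decodable codes, they may reuse short strings (including the empty string) for single symbols. A numeric check of this gap, under the standard construction that pairs the i-th most probable symbol with the i-th shortest binary string (length floor(log2 i), 1-indexed):

```python
import math

def one_to_one_length(probs):
    """Expected length of the best one-to-one code: sort probabilities in
    descending order and pair them with binary strings by length, where the
    i-th binary string (1-indexed, empty string first) has length
    floor(log2(i))."""
    p_sorted = sorted(probs, reverse=True)
    return sum(p * math.floor(math.log2(i))
               for i, p in enumerate(p_sorted, start=1))

n = 8
probs = [1 / n] * n
H = math.log2(n)                 # entropy of the uniform source: 3 bits
L11 = one_to_one_length(probs)   # (0+1+1+2+2+2+2+3)/8 = 1.625 bits
print(H, L11)

# The gap stays within the abstract's bound of (log2 log2 n) + 3
# (here L_UD = H = 3 bits, since the uniform code on 8 symbols is optimal).
assert H - L11 <= math.log2(math.log2(n)) + 3
```

So for a uniform source on 8 symbols the best one-to-one code spends 1.625 bits on average against an entropy of 3, well inside the stated bound.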
What’s the code? automatic classification of source code archives
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002
Abstract

Cited by 26 (0 self)
There are various source code archives on the World Wide Web. These archives are usually organized by application category and programming language. However, manually organizing source code repositories is not a trivial task, since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and can be identified as belonging to a specific programming language class.
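The bag-of-tokens setup behind such classifiers can be sketched compactly. The paper trains SVMs; the multinomial Naive Bayes below is a simplified stand-in over the same token-count representation, with made-up toy training snippets:

```python
import math
from collections import Counter

# Toy labeled snippets standing in for an archive's training set.
train = [
    ("c", "#include <stdio.h> int main void printf return 0"),
    ("c", "struct malloc free char * int main argc argv"),
    ("python", "def main import sys print return for in range"),
    ("python", "class self def __init__ import os print"),
]

def fit(train):
    """Per-language token counts plus the shared vocabulary."""
    counts, vocab = {}, set()
    for label, text in train:
        counts.setdefault(label, Counter()).update(text.split())
        vocab.update(text.split())
    return counts, vocab

def predict(text, counts, vocab):
    """Pick the language maximizing the add-one-smoothed log-likelihood."""
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        lp = sum(math.log((c[t] + 1) / (total + len(vocab)))
                 for t in text.split())
        if lp > best_lp:
            best, best_lp = label, lp
    return best

counts, vocab = fit(train)
print(predict("def foo import json print", counts, vocab))  # python
print(predict("int main printf return", counts, vocab))     # c
```

The same representation (token counts per document) is what an SVM would consume as feature vectors; only the decision rule differs.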
The Effects of Feature-Label-Order and Their Implications for Symbolic Learning
2009
Abstract

Cited by 24 (4 self)
Symbols enable people to organize and communicate about the world. However, the ways in which symbolic knowledge is learned and then represented in the mind are poorly understood. We present a formal analysis of symbolic learning—in particular, word learning—in terms of prediction and cue competition, and we consider two possible ways in which symbols might be learned: by learning to predict a label from the features of objects and events in the world, and by learning to predict features from a label. This analysis predicts significant differences in symbolic learning depending on the sequencing of objects and labels. We report a computational simulation and two human experiments that confirm these differences, revealing the existence of Feature-Label-Ordering effects in learning. Discrimination learning is facilitated when objects predict labels, but not when labels predict objects. Our results and analysis suggest that the semantic categories people use to understand and communicate about the world can only be learned if labels are predicted from objects. We discuss the implications of this for our understanding of the nature of language and symbolic thought, and in particular, for theories of reference.
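The asymmetry claimed in this abstract can be illustrated with error-driven (Rescorla-Wagner-style) updates, the family of model its formal analysis draws on; the specific features, trial schedule, and learning rate below are illustrative assumptions:

```python
# Sketch of the two learning directions. Category A objects carry a
# discriminating feature "a" plus a feature "shared" that also occurs
# in category B objects (which carry "b" and "shared").
ALPHA = 0.1  # learning rate

def rw_update(weights, cues, target):
    """Rescorla-Wagner step: move all present cues by the shared error."""
    error = target - sum(weights[c] for c in cues)
    for c in cues:
        weights[c] += ALPHA * error

fl = {"a": 0.0, "b": 0.0, "shared": 0.0}  # features -> label A
lf = {"a": 0.0, "shared": 0.0}            # label A -> features

for _ in range(200):
    # Feature-to-Label order: cues are features, outcome is the label.
    rw_update(fl, ["a", "shared"], 1.0)   # category A trial: label A present
    rw_update(fl, ["b", "shared"], 0.0)   # category B trial: label A absent
    # Label-to-Feature order: the label is the lone cue for each feature.
    rw_update(lf, ["a"], 1.0)             # label A predicts feature a
    rw_update(lf, ["shared"], 1.0)        # label A predicts shared feature

# FL: cue competition suppresses the uninformative shared feature.
print(fl["a"] > fl["shared"])                 # True
# LF: a single cue faces no competition, so both features end up equal.
print(abs(lf["a"] - lf["shared"]) < 1e-6)     # True
```

In the feature-to-label direction the shared feature is driven down on category B trials, so only informative features keep weight; in the label-to-feature direction nothing distinguishes informative from uninformative features, which is the Feature-Label-Ordering effect in miniature.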