A Wikipedia-Based Multilingual Retrieval Model
Cited by 12 (1 self)
Abstract. This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document d written in language L, we construct a concept vector $\mathbf{d}$ for d, where each dimension i of $\mathbf{d}$ quantifies the similarity of d with respect to a document $d^*_i$ chosen from the "L-subset" of Wikipedia. Likewise, for a second document d′ written in language L′, with $L \neq L'$, we construct a concept vector $\mathbf{d}'$, using from the L′-subset of Wikipedia the topic-aligned counterparts $d'^*_i$ of our previously chosen documents. Since the two concept vectors $\mathbf{d}$ and $\mathbf{d}'$ are collection-relative representations of d and d′, they are language-independent; i.e., their similarity can be computed directly, for instance with the cosine similarity measure. We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.
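The construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes raw term-frequency vectors with whitespace tokenisation, and it uses a hand-made, three-article "aligned index" in place of Wikipedia. All function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Sparse term-frequency vector (assumption: whitespace tokenisation)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dict-like objects."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def concept_vector(document, index_documents):
    """Dimension i holds the similarity of the document to index document i."""
    d = tf_vector(document)
    return [cosine(d, tf_vector(index_doc)) for index_doc in index_documents]

def cross_language_similarity(doc_a, index_a, doc_b, index_b):
    """Compare two documents through their concept vectors.

    index_a[i] and index_b[i] must describe the same topic in the two languages.
    """
    vec_a = dict(enumerate(concept_vector(doc_a, index_a)))
    vec_b = dict(enumerate(concept_vector(doc_b, index_b)))
    return cosine(vec_a, vec_b)

# Toy topic-aligned index (hypothetical stand-in for aligned Wikipedia articles).
INDEX_EN = ["dog animal pet bark", "car engine road drive", "music song guitar play"]
INDEX_DE = ["hund tier haustier bellen", "auto motor strasse fahren", "musik lied gitarre spielen"]

if __name__ == "__main__":
    query = "my dog is a loyal pet"
    for candidate in ["der hund ist ein treues haustier", "das auto hat einen neuen motor"]:
        score = cross_language_similarity(query, INDEX_EN, candidate, INDEX_DE)
        print(f"{score:.3f}  {candidate}")
```

The two documents never share a vocabulary; they are compared only through their similarity profiles over the aligned index, which is what makes the representation language-independent.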
P.: Overview of the 1st International Competition on Plagiarism Detection
In: SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), CEUR-WS.org, 2009
Cited by 9 (1 self)
Abstract: The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism detection and intrinsic plagiarism detection, which were tackled by 13 participating groups. An important by-product of the competition is an evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and detection quality measures. The framework may serve as a unified test environment to compare future plagiarism detection research. In this paper we describe the corpus design and the quality measures, survey the detection approaches developed by the participants, and compile the achieved performance results of the competitors.
Understanding plagiarism linguistic patterns, textual features and detection methods
IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2011
Cited by 3 (2 self)
Abstract—Plagiarism can take many different forms, ranging from copying texts to adopting ideas, without giving credit to the originator. This paper presents a new taxonomy of plagiarism that highlights differences between literal plagiarism and intelligent plagiarism, from the plagiarist's behavioral point of view. The taxonomy supports a deep understanding of the different linguistic patterns used in committing plagiarism, for example, changing texts into semantically equivalent ones with different words and organization, shortening texts with concept generalization and specification, and adopting the ideas and important contributions of others. Different textual features that characterize different plagiarism types are discussed. Systematic frameworks and methods of monolingual, extrinsic, intrinsic, and cross-lingual plagiarism detection are surveyed and correlated with the plagiarism types listed in the taxonomy. We conduct an extensive study of state-of-the-art techniques for plagiarism detection, including character n-gram-based (CNG), vector-based (VEC), syntax-based
Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance
Cited by 2 (0 self)
Abstract. Automatic plagiarism detection considering a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential source. Publications on this task often assume that the search space (the set of reference documents) is a narrow set where any search strategy will produce a good output in a short time. However, this is not always true. Reference corpora are often composed of a large set of original documents, for which a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a preceding search space reduction stage, based on the Kullback-Leibler symmetric distance, reduces the search process time dramatically. Additionally, it improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams.
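As a rough illustration of such a reduction stage, the sketch below ranks reference documents by one common symmetric form of the Kullback-Leibler distance, $\sum_t (P(t) - Q(t)) \log \frac{P(t)}{Q(t)}$, and keeps only the closest ones for the later exhaustive comparison. The abstract does not give the exact formulation, feature selection, or smoothing used in the paper, so the unigram distributions, the floor probability `epsilon`, and all names and sample texts here are assumptions.

```python
import math
from collections import Counter

def term_probs(text):
    """Relative unigram frequencies (assumption: whitespace tokenisation)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}

def symmetric_kl(p, q, epsilon=1e-6):
    """Symmetric KL distance: sum over terms of (P - Q) * log(P / Q).

    Terms missing from one distribution get the floor probability epsilon
    (a simplification; the distributions are not renormalised afterwards).
    """
    distance = 0.0
    for term in set(p) | set(q):
        p_t = p.get(term, epsilon)
        q_t = q.get(term, epsilon)
        distance += (p_t - q_t) * math.log(p_t / q_t)
    return distance

def reduce_search_space(suspicious, references, keep=2):
    """Return the names of the `keep` reference documents closest to the suspicious text."""
    p = term_probs(suspicious)
    scored = sorted((symmetric_kl(p, term_probs(text)), name)
                    for name, text in references.items())
    return [name for _, name in scored[:keep]]

if __name__ == "__main__":
    references = {
        "a": "the cat sat on the mat",
        "b": "stock markets fell sharply today",
        "c": "a cat sat on a mat quietly",
    }
    print(reduce_search_space("the cat sat on the mat", references, keep=2))
```

Only the documents that survive this filter would then be passed to the expensive word n-gram comparison.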
Information Retrieval: Concepts and Practical Considerations for Teaching a Rising Topic
Abstract. The last two decades have seen an enormous increase in the amount of information available, in the form of text documents as well as multimedia data such as images, speech and video. As a result, information retrieval (IR) has become a central topic of computer science and related disciplines and is now part of many curricula for bachelor and master programs. In this article, we outline which concepts should be an integral part of IR courses depending on the orientation of the degree program (e.g. business vs. research). In addition to the theoretical content of IR courses, we also address practical considerations, based on the authors' extensive experience in teaching IR. We comment on the suitability of a number of tools and systems and of different forms of teaching, including e-learning, in the IR classroom.
Extensions to Self-Taught Hashing: Kernelisation and Supervision
The ability to perform fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing, which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Since each bit in the binary code for a document can be regarded as a binary feature of it, semantic hashing is essentially a process of generating a few highly informative binary features to represent the documents. Recently, we have proposed a novel Self-Taught Hashing (STH) approach to semantic hashing (that is going to be published in SIGIR-2010): we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any previously unseen query document. In this paper, we present two further extensions to our STH technique: one is kernelisation (i.e., employing nonlinear kernels to achieve nonlinear hashing), and the other is supervision (i.e., exploiting the category label information to enhance the effectiveness of hashing). The advantages of these extensions have been shown through experiments on synthetic datasets and real-world datasets respectively.
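The retrieval step that semantic hashing enables can be sketched in a few lines. The example below does not learn codes as STH does; it assumes the l-bit codes already exist (here, hand-picked 8-bit integers with hypothetical document names) and only shows how a query code is matched against stored codes within a Hamming radius.

```python
def hamming_distance(code_a, code_b):
    """Number of bit positions in which two integer-encoded binary codes differ."""
    return bin(code_a ^ code_b).count("1")

def search_within_radius(query_code, codes, radius):
    """Return document names whose code lies within `radius` bits of the query,
    ordered from nearest to farthest (ties broken by name)."""
    hits = sorted((hamming_distance(query_code, code), name)
                  for name, code in codes.items())
    return [name for distance, name in hits if distance <= radius]

# Hypothetical 8-bit codes; in semantic hashing these would be learned from the corpus.
CODES = {
    "doc_sports_1": 0b10110010,
    "doc_sports_2": 0b10110110,
    "doc_finance_1": 0b01001101,
}

if __name__ == "__main__":
    print(search_within_radius(0b10110011, CODES, radius=2))
```

Because the comparison is an XOR followed by a bit count, scanning even a very large set of codes stays cheap, which is the practical appeal of mapping documents to short binary codes.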
HashFile: An Efficient Index Structure For Multimedia Data
Abstract—Nearest neighbor (NN) search in high dimensional space is an essential query in many multimedia retrieval applications. Due to the curse of dimensionality, existing index structures might perform even worse than a simple sequential scan of the data when answering exact NN queries. To improve the efficiency of NN search, locality sensitive hashing (LSH) and its variants have been proposed to find approximate NNs. They adopt hash functions that can preserve the Euclidean distance so that similar objects have a high probability of colliding in the same bucket. Given a query object, candidates for the query result are obtained by accessing the points that are located in the same bucket. To improve the precision, each hash table is associated with m hash functions to recursively hash the data points into smaller buckets and remove the false positives. On the other hand, multiple hash tables are required to guarantee a high retrieval
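The general LSH scheme summarised above (not the HashFile structure itself) can be sketched as follows. The example assumes the widely used random-projection hash for Euclidean distance, $h(v) = \lfloor (a \cdot v + b) / w \rfloor$ with Gaussian $a$ and uniform $b$, and it concatenates the m hash values of a table into one bucket key rather than hashing recursively. The class name, parameters, and defaults are hypothetical.

```python
import math
import random
from collections import defaultdict

class EuclideanLSH:
    """Several hash tables, each keyed by m concatenated random-projection hashes."""

    def __init__(self, dim, num_tables=4, m=3, w=4.0, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.tables = []
        for _ in range(num_tables):
            # Each hash function is a (projection vector a, offset b) pair.
            funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
                     for _ in range(m)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, point):
        """Bucket key: one integer per hash function, floor((a . v + b) / w)."""
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / self.w)
            for a, b in funcs
        )

    def add(self, ident, point):
        """Insert a point into its bucket in every table."""
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, point)].append(ident)

    def candidates(self, query):
        """Union of the buckets the query falls into, across all tables."""
        found = set()
        for funcs, buckets in self.tables:
            found.update(buckets.get(self._key(funcs, query), []))
        return found

if __name__ == "__main__":
    index = EuclideanLSH(dim=2)
    index.add("p1", [0.0, 0.0])
    index.add("p2", [0.1, 0.1])
    index.add("p3", [50.0, -40.0])
    print(index.candidates([0.0, 0.0]))
```

Raising m makes each bucket smaller and filters out more false positives, while adding tables gives true neighbours more chances to collide with the query; candidates would normally still be verified by computing exact distances.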

