Learning similarity measures in non-orthogonal space
- In CIKM ’04: Proceedings of the thirteenth ACM conference on Information and knowledge management, 2004
Cited by 10 (0 self)
Many machine learning and data mining algorithms rely crucially on similarity metrics. Cosine similarity, which calculates the inner product of two normalized feature vectors, is one of the most commonly used similarity measures. However, in many practical tasks such as text categorization and document clustering, Cosine similarity is calculated under the assumption that the input space is orthogonal, an assumption that usually cannot be satisfied because of synonymy and polysemy. Various algorithms such as Latent Semantic Indexing (LSI) were used to solve this problem by projecting the original data into an orthogonal space. However, LSI also suffered from high computational cost and data sparseness. These shortcomings led to increases in computation time and storage requirements for large-scale realistic data. In this paper, we propose a novel and effective similarity metric in the non-orthogonal input space.
Similarity Group-by Operators for Multi-dimensional Relational Data
The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity-aware grouping provides a more realistic view of real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently materialize this "approximate" semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. As a result, correlated attributes, such as those in spatial data, are processed independently, and hence groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude improvement in performance over baseline methods developed to solve the same problem.
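The distance-to-any semantics described above can be illustrated with a small sketch. This is not the PostgreSQL implementation from the paper: it is a naive quadratic pass over in-memory points, it assumes Euclidean distance, and the names `sgb_any`, `points`, and `eps` are hypothetical.

```python
from itertools import combinations
from math import dist


def sgb_any(points, eps):
    """Group points so that each point is within eps of some other point in its group."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive O(n^2) pass: merge the groups of every pair within distance eps.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    # Collect the members of each group under its representative.
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(groups.values())


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
groups = sgb_any(points, eps=1.0)
```

With `eps=1.0`, the first three points land in one group even though `(0.0, 0.0)` and `(2.0, 0.0)` are two units apart, because each is within `eps` of `(1.0, 0.0)`. Under the clique (distance-to-all) semantics those three points could not share a group, which is where the overlap-handling options (i) to (iii) come into play.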
Efficient Content-Based and Metadata Retrieval in Image Database
Managing image data in a database system using metadata has been practiced for the last two decades. However, describing an image fully and adequately with metadata is practically not possible. The alternative is to describe image content by its low-level features, such as color, texture, and shape, and to use these for similarity-based image retrieval. However, practice has shown that using only the low-level features is not sufficient either. Hence, systems need to integrate both low-level and metadata descriptions for efficient image data management. However, due to the lack of an adequate image data model, the absence of a formal algebra for content-based image operations, and the lack of precision of existing image processing and retrieval techniques, little work has been done to integrate the use of low-level and metadata description and retrieval methods. In this paper, we first present a global image data model that supports both metadata and low-level descriptions of images and their salient objects. This enables multi-criteria image retrieval (context-, semantic-, and content-based queries). Furthermore, we present an image data repository model that captures all data described in the model and permits the integration of heterogeneous operations in a DBMS. In particular, similarity-based operations (similarity-based join and selection) can be carried out in combination with traditional ones. Finally, we present an image DBMS architecture that we use to develop a prototype supporting both content-based and metadata retrieval.
A Similarity Reinforcement Algorithm for Heterogeneous Web Pages
Many machine learning and data mining algorithms rely crucially on similarity metrics. However, most early research works, such as the Vector Space Model or Latent Semantic Indexing, used only a single relationship to measure the similarity of data objects. In this paper, we first use an Intra- and Inter-Type Relationship Matrix (IITRM) to represent a set of heterogeneous data objects and their inter-relationships. Then, we propose a novel similarity-calculating algorithm over the Intra- and Inter-Type Relationship Matrix. It integrates information from heterogeneous sources through iterative computation. This algorithm can help detect latent relationships among heterogeneous data objects. Our new algorithm is based on the intuition that the intra-relationship should affect the inter-relationship, and vice versa. Experimental results on the MSN logs dataset show that our algorithm outperforms the traditional Cosine similarity.
unknown title, 2003
An efficient query optimization strategy for spatio-temporal queries in video databases
An Efficient Query Optimization Strategy for Spatio-Temporal
- Journal of Systems and Software, 2002
Interest in multimedia database management systems has grown rapidly due to the need to store huge volumes of multimedia data in computer systems. An important building block of a multimedia database system is the query processor, and a query optimizer embedded in the query processor is needed to answer user queries efficiently. The query optimization problem has been widely studied for conventional database systems; however, it is a new research area for multimedia database systems. Due to the differences in query processing strategies, query optimization techniques used in multimedia database systems are different from those used in traditional databases. In this paper, a query optimization strategy is proposed for processing spatio-temporal queries in video database systems. The proposed strategy includes reordering algorithms to be applied to the query execution tree. The performance results obtained by testing the reordering algorithms on different query sets are also presented.
LEARNING SIMILARITY MEASURES IN NON-ORTHOGONAL SPACES
Many machine learning and data mining algorithms rely crucially on similarity metrics. Cosine similarity, which calculates the inner product of two normalized feature vectors, is one of the most commonly used similarity measures. However, in many practical tasks such as text categorization and document clustering, Cosine similarity is calculated under the assumption that the input space is orthogonal, an assumption that usually cannot be satisfied because of synonymy and polysemy. Various algorithms such as Latent Semantic Indexing (LSI) were used to solve this problem by projecting the original data into an orthogonal space. However, LSI also suffered from high computational cost and data sparseness. These shortcomings led to increases in computation time and storage requirements for large-scale realistic data. In this paper, we propose a novel and effective similarity metric in the non-orthogonal input space.
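To make the orthogonality assumption concrete, the sketch below compares plain Cosine similarity with a generalized form that replaces the inner product by x^T G y, where G is a Gram matrix describing how the term axes overlap. This is a textbook construction and is not claimed to be the metric proposed in the paper; the vocabulary, the matrix `gram`, and the function names are hypothetical.

```python
from math import sqrt


def bilinear(x, g, y):
    """Return x^T G y for vectors x, y and a square matrix g (list of rows)."""
    return sum(x[i] * g[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def similarity(x, y, g=None):
    """Cosine similarity under Gram matrix g; identity (orthogonal axes) if g is None."""
    if g is None:
        g = [[1.0 if i == j else 0.0 for j in range(len(x))] for i in range(len(x))]
    return bilinear(x, g, y) / sqrt(bilinear(x, g, x) * bilinear(y, g, y))


# Two documents over the hypothetical vocabulary ("car", "automobile", "banana").
doc_a = [1.0, 0.0, 0.0]
doc_b = [0.0, 1.0, 0.0]

# Assumed Gram matrix: the "car" and "automobile" axes overlap with correlation 0.8.
gram = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

plain = similarity(doc_a, doc_b)           # orthogonal assumption
adjusted = similarity(doc_a, doc_b, gram)  # non-orthogonal axes
```

Under the orthogonal assumption the two documents share no terms, so `plain` is zero, while `adjusted` reflects the assumed overlap between the synonym axes. How such a matrix would be learned from data is exactly what the sketch leaves out.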
Salient-Object-Based Image Query By Visual Content Dawit Bulcha,
Content-Based Image Retrieval (CBIR) has attracted much attention from the research community. As exact matching is not possible with image retrieval, the approach is to use similarity-based matching, using the global features of the entire image to compute a similarity score between two images. Equally important is the use of salient objects, that is, objects in an image that are of particular interest, as the basis of similarity-based computation. However, current work on CBIR does not adequately address the issues related to salient objects. In this work, we propose a data repository model that captures the spatial features of salient objects. Moreover, we propose an extension to the similarity-based selection operator defined earlier to allow salient-object-based selection. We also propose spatial operators that can be used to compute spatial relations between an image and the salient objects it contains. To demonstrate the viability of our proposals, we extend a previous system named EMIMS to develop EMIMS-S (Extended Medical Image Management System to support Salient objects). We also experimentally evaluate the retrieval effectiveness of salient-object-based image queries.
Keywords: Salient-object-based image retrieval, image database, image data model, similarity-based algebra, spatial relation of salient-objects.