## SAXually Explicit Images: Finding Unusual Shapes (2006)

Venue: | In proceedings of the 2006 IEEE International Conference on Data Mining. Hong Kong. Dec |

Citations: | 14 - 1 self |

### BibTeX

@INPROCEEDINGS{Wei06saxuallyexplicit,

author = {Li Wei and Eamonn Keogh and Xiaopeng Xi},

title = {SAXually Explicit Images: Finding Unusual Shapes},

booktitle = {In proceedings of the 2006 IEEE International Conference on Data Mining. Hong Kong. Dec},

year = {2006},

pages = {18--22}

}

### Abstract

Among the visual features of multimedia content, shape is of particular interest because humans can often recognize objects solely on the basis of shape. Over the past three decades, there has been a great deal of research on shape analysis, focusing mostly on shape indexing, clustering, and classification. In this work, we introduce the new problem of finding shape discords, the most unusual shapes in a collection. We motivate the problem by considering the utility of shape discords in diverse domains including zoology, anthropology, and medicine. While the brute force search algorithm has quadratic time complexity, we avoid this cost by using locality-sensitive hashing to estimate the similarity between shapes, which enables us to reorder the search more efficiently. An extensive experimental evaluation demonstrates that our approach can speed up computation by three to four orders of magnitude.
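To make the quadratic baseline mentioned in the abstract concrete, here is a minimal Python sketch of a brute-force discord search. It assumes each shape has already been converted to an equal-length one-dimensional sequence, and it uses a rotation-invariant Euclidean distance (the minimum over all circular shifts), which the citation contexts below mention. The function names `rotation_invariant_dist` and `find_discord` are hypothetical, and the loops visit shapes in plain index order; the hashing-based reordering that the paper contributes is not reproduced here.

```python
import math


def rotation_invariant_dist(q, c):
    """Smallest Euclidean distance between q and any circular shift of c."""
    n = len(q)
    best = math.inf
    for shift in range(n):
        total = sum((q[i] - c[(i + shift) % n]) ** 2 for i in range(n))
        best = min(best, math.sqrt(total))
    return best


def find_discord(shapes, dist=rotation_invariant_dist):
    """Return (index, distance) of the shape farthest from its nearest neighbor.

    Illustrative brute-force search, quadratic in the number of shapes.
    The inner loop stops early once the candidate is known to be closer to
    some other shape than the best discord found so far.
    """
    best_so_far_dist = -1.0
    best_so_far_index = None
    for i, candidate in enumerate(shapes):
        nearest = math.inf
        for j, other in enumerate(shapes):
            if i == j:
                continue
            nearest = min(nearest, dist(candidate, other))
            if nearest < best_so_far_dist:
                break  # this candidate cannot be the discord
        if nearest > best_so_far_dist:
            best_so_far_dist = nearest
            best_so_far_index = i
    return best_so_far_index, best_so_far_dist


if __name__ == "__main__":
    shapes = [
        [0, 1, 2, 3],
        [1, 2, 3, 0],    # a rotation of the first shape
        [0, 1, 2, 3.1],  # a near-duplicate of the first shape
        [5, 0, 0, 0],    # the odd one out
    ]
    print(find_discord(shapes))
```

The early exit is where the order of comparisons matters: the sooner the inner loop meets a close neighbor, the sooner a non-discord candidate is abandoned, which is the behavior a good search ordering exploits.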

### Citations

232 | On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, Data Mining and Knowledge Discovery 7
- Keogh, Kasetty
- 2003
Citation Context ...that the function be symmetric, that is, Dist(Q, C) = Dist(C, Q). As a concrete instantiation of a distance function, we define the most common distance measure for time series, Euclidean distance [6][16]. Definition 5. Euclidean Distance: Given two time series Q and C of length n, the Euclidean distance between them is defined as $ED(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$. If the shapes are rotationally aligned, Eucli...

213 | Finding motifs using random projections
- Buhler, Tompa
- 2001
Citation Context ... based on sparse sampling of positions from feature vectors has been used in diverse areas for different purposes, including high-dimensional search [28], multimedia indexing [38], and motif discovery [37], etc. Among the rich literature, the locality-sensitive hashing search technique proposed by Indyk and Motwani [12] is perhaps the most referenced in this area. Since this technique is a cornerstone o...

190 | A Symbolic Representation of Time Series, with Implications for Streaming Algorithms
- Lin, Keogh, et al.
- 2003
Citation Context ...ere are many different symbolic approximations of time series in the literature [2][9][11]. In this work, we choose the Symbolic Aggregate ApproXimation (SAX) representation introduced by Lin, et al. [23], because it allows both dimensionality reduction and lower bounding. Below, we give a brief review of the SAX representation. We start with the Piecewise Aggregate Approximation (PAA) [15]. Definitio...

167 | Machine Vision: Theory, Algorithms and Practicalities
- Davies
- 1990
Citation Context ...presentations have been shown to achieve comparable or superior accuracy in shape matching [20]. Therefore this simple representation has been used by an increasingly large fraction of the literature [8][20][38]. There exist dozens of techniques to convert shapes into one-dimensional representations (also known as pseudo “time series”). We refer the interested reader to [25] and [42] for excellent sur...

165 | Dimensionality Reduction for Fast Similarity Search
- Keogh, Chakrabarti, et al.
Citation Context ...in, et al. [23], because it allows both dimensionality reduction and lower bounding. Below, we give a brief review of the SAX representation. We start with the Piecewise Aggregate Approximation (PAA) [15]. Definition 8. Piecewise Aggregate Approximation (PAA): Given a time series $C = c_1, c_2, \ldots, c_j, \ldots, c_n$ and the desired lower dimensionality w, the Piecewise Aggregate Approximation of time series ...

163 | Review of Shape Representation and Description Techniques
- Zhang, Lu
Citation Context ... science [36]. Among the visual features contained in an image (e.g. shape, color, and texture), shape is of particular importance since humans can often recognize objects on the basis of shape alone [42]. Because of this special property, shape analysis has received much research attention in the past three decades. Most research effort in the shape analysis community is focused on indexing, clusteri...

151 | Fast algorithms for sorting and searching strings
- Bentley, Sedgewick
- 1997
Citation Context ... The intuition behind our inner heuristic is that shapes which frequently collide with each other are very likely to be highly similar (this fact is at the heart of more than twenty research efforts [3][6][19][21][24][36]). As noted in observation 2, we just need to find one such shape that is similar enough (having a distance to the candidate less than the current value of the best_so_far_dist vari...

129 | Distance-based outliers: algorithms and applications
- Knorr, Ng, et al.
- 2000
Citation Context ...es [17][18][34], while we have individual time series here. Another possibility would be to simply project the shape time series into n-dimensional space and use existing outlier detection methods [5][22]. The problem with this approach is that most outlier detection methods require the distance function be a metric. While the Euclidean distance is a metric, the rotation invariant Euclidean distance i...

124 | Probabilistic discovery of time series motifs
- Chiu, Keogh, et al.
- 2003
Citation Context ...re that the function be symmetric, that is, Dist(Q, C) = Dist(C, Q). As a concrete instantiation of a distance function, we define the most common distance measure for time series, Euclidean distance [6][16]. Definition 5. Euclidean Distance: Given two time series Q and C of length n, the Euclidean distance between them is defined as $ED(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$. If the shapes are rotationally aligned, E...

119 | Towards parameter-free data mining
- Keogh, Lonardi, et al.
- 2004
Citation Context ...endent. The definition eliminates the need of an explicit description of usual shapes. In addition, this definition requires zero parameters, which is especially suitable for data mining applications [19]. In particular, as we demonstrate below, we can apply our algorithm in very diverse domains without having to do any tuning or “tweaking” of any kind. Our definition of shape discord would be of litt...

100 | Finding surprising patterns in a time series database in linear time and space
- Keogh, Lonardi, et al.
Citation Context ... We cannot leverage off the existing time series novelty detection techniques because most of them assume that time series subsequences are extracted by sliding a window across a long time series [17][18][34], while we have individual time series here. Another possibility would be to simply project the shape time series into n-dimensional space and use existing outlier detection methods [5][22]. The p...

65 | HOT SAX: efficiently finding the most unusual time series subsequence
- Keogh, Lin, et al.
- 2005
Citation Context ... no. We cannot leverage off the existing time series novelty detection techniques because most of them assume that time series subsequences are extracted by sliding a window across a long time series [17][18][34], while we have individual time series here. Another possibility would be to simply project the shape time series into n-dimensional space and use existing outlier detection methods [5][22]. T...

62 | Compressed text databases with efficient query algorithms based on the compressed suffix array - Sadakane - 2000 |

60 | State of the art in shape matching - Veltkamp, Hagedoorn - 2001 |

50 | Locality-Preserving Hashing in Multidimensional Spaces
- Indyk, Motwani, et al.
- 1997
Citation Context ...including high-dimensional search [28], multimedia indexing [38], and motif discovery [37], etc. Among the rich literature, the locality-sensitive hashing search technique proposed by Indyk and Motwani [12] is perhaps the most referenced in this area. Since this technique is a cornerstone of our contribution, we give the formal definition of locality-sensitive hashing below. Definition 11. Locality-sens...

45 | TSA-tree: A wavelet-based approach to improve the efficiency of multilevel surprise and trend queries on time-series data
- Shahabi, Tian, et al.
Citation Context ...cannot leverage off the existing time series novelty detection techniques because most of them assume that time series subsequences are extracted by sliding a window across a long time series [17][18][34], while we have individual time series here. Another possibility would be to simply project the shape time series into n-dimensional space and use existing outlier detection methods [5][22]. The probl...

40 | A multiscale representation method for nonrigid shapes with a single closed contour - Adamek, O’Connor |

37 | Visually mining and monitoring massive time series
- Lin, Keogh, et al.
- 2004
Citation Context ...n behind our inner heuristic is that shapes which frequently collide with each other are very likely to be highly similar (this fact is at the heart of more than twenty research efforts [3][6][19][21][24][36]). As noted in observation 2, we just need to find one such shape that is similar enough (having a distance to the candidate less than the current value of the best_so_far_dist variable) to termin...

28 | Enhanced perceptual distance functions and indexing for image replica recognition - Qamra, Meng, et al. - 2005 |

27 | Principal Component Analysis
- T
Citation Context ...ssion with techniques that use global bases, like Principal Component Analysis (PCA), since very unusual images are bound to introduce more bases (or more reconstruction error among the ‘true’ bases) [14]. As we have shown, the notion of unusual shapes can be useful in different domains. However, to the best of our knowledge, the problem of finding these shapes has not yet been addressed. In this pape...

27 | Computer vision classification of leaves from Swedish trees
- Soderkvist
- 2001
Citation Context ...stance measures. Table 1: The classification error rate on several datasets (columns: Dataset Name; Error Rate using 1D representation (%); Error Rate using other representations (%)). Swedish Leaves: 13.33%, 17.82% [35]; Chicken: 19.96%, 20.5% [26]; MixedBag: 4.375%, 6% [40]; Diatoms: 27.53%, 26% [13]. These experiments show that the very simple time series representations of shapes and the simple Euclidean distance can be co...

24 | Using Signature Files for Querying Timeseries Data
- Andre-Jonsson, Badal
- 1997
Citation Context ...it gives us a string representation that will be used in the subsequent step by the location-sensitive hash function. There are many different symbolic approximations of time series in the literature [2][9][11]. In this work, we choose the Symbolic Aggregate ApproXimation (SAX) representation introduced by Lin, et al. [23], because it allows both dimensionality reduction and lower bounding. Below, we...

22 | Anytime algorithm development tools
- Grass, Zilberstein
- 1996
Citation Context ...e algorithm keeps the best_so_far_index variable, it always has a candidate discord to show at any point of time after the collision matrix has been built (this actually makes it an anytime algorithm [10]). Or we can stop when the collision matrix “converges” (the change of the values of its entries are less than some threshold). Again, for simplicity, we fix the number of iterations to 30 in this wor...

22 | Cyclic sequence alignments: Approximate versus optimal techniques
- Mollineda, Vidal, et al.
Citation Context ...The classification error rate on several datasets (columns: Dataset Name; Error Rate using 1D representation (%); Error Rate using other representations (%)). Swedish Leaves: 13.33%, 17.82% [35]; Chicken: 19.96%, 20.5% [26]; MixedBag: 4.375%, 6% [40]; Diatoms: 27.53%, 26% [13]. These experiments show that the very simple time series representations of shapes and the simple Euclidean distance can be competitive to other more co...

12 | A Contour-Oriented Approach to Shape Analysis
- Otterloo
- 2008
Citation Context ...ations have been shown to achieve comparable or superior accuracy in shape matching [20]. Therefore this simple representation has been used by an increasingly large fraction of the literature [8][20][38]. There exist dozens of techniques to convert shapes into one-dimensional representations (also known as pseudo “time series”). We refer the interested reader to [25] and [42] for excellent surveys. No...

12 | Rotation invariant indexing of shapes and line drawings
- Vlachos, Vagena, et al.
- 2005
Citation Context ... rate on several datasets (columns: Dataset Name; Error Rate using 1D representation (%); Error Rate using other representations (%)). Swedish Leaves: 13.33%, 17.82% [35]; Chicken: 19.96%, 20.5% [26]; MixedBag: 4.375%, 6% [40]; Diatoms: 27.53%, 26% [13]. These experiments show that the very simple time series representations of shapes and the simple Euclidean distance can be competitive to other more complex representations an...

11 | Symbolic analysis of experimental data, Review of Scientific Instruments
- Daw, Finney, et al.
- 2001
Citation Context ...gives us a string representation that will be used in the subsequent step by the location-sensitive hash function. There are many different symbolic approximations of time series in the literature [2][9][11]. In this work, we choose the Symbolic Aggregate ApproXimation (SAX) representation introduced by Lin, et al. [23], because it allows both dimensionality reduction and lower bounding. Below, we gi...

11 | Motif discovery algorithm from motion data
- Tanaka, Uehara
- 2004
Citation Context ...Keywords: Anomaly Detection, Shape. 1. Introduction. Large image databases are used in an increasing number of applications in fields as diverse as entertainment, business, art, engineering, and science [36]. Among the visual features contained in an image (e.g. shape, color, and texture), shape is of particular importance since humans can often recognize objects on the basis of shape alone [42]. Because...

6 | On complementarity of cluster and outlier detection schemes
- Chen, Fu, et al.
- 2003
Citation Context ...eries [17][18][34], while we have individual time series here. Another possibility would be to simply project the shape time series into n-dimensional space and use existing outlier detection methods [5][22]. The problem with this approach is that most outlier detection methods require the distance function be a metric. While the Euclidean distance is a metric, the rotation invariant Euclidean distan...

6 | Automatic Diatom Identification using Contour Analysis by Morphological Curvature Scale Spaces
- Jalba, Wilkinson, et al.
- 2005
Citation Context ...s (columns: Dataset Name; Error Rate using 1D representation (%); Error Rate using other representations (%)). Swedish Leaves: 13.33%, 17.82% [35]; Chicken: 19.96%, 20.5% [26]; MixedBag: 4.375%, 6% [40]; Diatoms: 27.53%, 26% [13]. These experiments show that the very simple time series representations of shapes and the simple Euclidean distance can be competitive to other more complex representations and distance measures. The...

6 | Extracting feature based on motif from a chronic hepatitis dataset
- Kitaguchi
- 2004
Citation Context ...ition behind our inner heuristic is that shapes which frequently collide with each other are very likely to be highly similar (this fact is at the heart of more than twenty research efforts [3][6][19][21][24][36]). As noted in observation 2, we just need to find one such shape that is similar enough (having a distance to the candidate less than the current value of the best_so_far_dist variable) to te...

6 | Cladistics is useful for reconstructing Archaeological Phylogenies: Palaeoindian Points from the Southeastern United States
- O’Brien, Darwent, et al.
- 2001
Citation Context ... challenges for data mining, particularly mining of shapes [7]. Examples of shapes which anthropologists may be interested in mining include petroglyphs, pottery [7], projectile points (“arrowheads”) [30], and bones [20]. It is difficult to overstate the need for efficient algorithms when working with such datasets. As another example, the number of projectile points in the collection at the authors’ ...

5 | Quantitative trait loci affecting components of wing shape in Drosophila melanogaster
- Zimmerman, Palsson, et al.
Citation Context ...ed and the developing organism is examined for changes in physiology or behavior. Figure 2 shows a subset of wing images collected for a mutagenesis experiment carried out at Florida State University [41], and the discord discovered by our algorithm. Note that the entire wing image was analyzed up to, but not including, the articulation (1 mm from wing attachment to thorax). [Figure 2 panel label: 1st Discord] ...

4 | A survey of shape analysis techniques
- Loncaric
- 1998
Citation Context ...etection approaches are not suitable for the problem at hand. 2.1 Shape Representation and Distance Measure. We consider the shape of an object as a binary image representing the outline of the object [25]. In order to find/index/classify a shape, the shape must be described or represented in some way. However this is a difficult task as shapes may be corrupted with noise, defects, arbitrary distortion...

2 | Digital Archive Network for Anthropology (DANA): Three-Dimensional Modeling and Database Development for Internet Access
- Clark, Bergstrom, et al.
- 2002
Citation Context ...mpt to find this discord would have required 512,880,378 shape comparisons. Anthropological Data Mining: Anthropology offers many interesting challenges for data mining, particularly mining of shapes [7]. Examples of shapes which anthropologists may be interested in mining include petroglyphs, pottery [7], projectile points (“arrowheads”) [30], and bones [20]. It is difficult to overstate the need fo...

1 | Fractal dimension in butterflies' wings: a novel approach to understanding wing patterns
- Castrejon-Pita, Sarmiento-Galan, et al.
- 2005
Citation Context ...terfly wings are an interesting domain in which to test image mining algorithms. Depending on the area of research, the shape, color, texture or even fractal dimension of the wings may be of interest [4]. Here we restrict our attention to shape. The large size of such collections motivates the use of scalable algorithms. For example, the Morphbank archive [27] currently has approximately 2,000 butter...

1 | Adaptive query processing for timeseries Data
- S
- 1999
Citation Context ...es us a string representation that will be used in the subsequent step by the location-sensitive hash function. There are many different symbolic approximations of time series in the literature [2][9][11]. In this work, we choose the Symbolic Aggregate ApproXimation (SAX) representation introduced by Lin, et al. [23], because it allows both dimensionality reduction and lower bounding. Below, we give a...

1 | LB_Keogh allows exact indexing of shapes under rotation invariance with arbitrary representations and distance measures
- Keogh, Wei, et al.
- 2006
Citation Context ...data mining, particularly mining of shapes [7]. Examples of shapes which anthropologists may be interested in mining include petroglyphs, pottery [7], projectile points (“arrowheads”) [30], and bones [20]. It is difficult to overstate the need for efficient algorithms when working with such datasets. As another example, the number of projectile points in the collection at the authors’ institution exce...

1 | Gapped Local Similarity Search with Provable Guarantees
- Narayanan, Karp
- 2004
Citation Context ...arities between all shapes. Estimation of similarity based on sparse sampling of positions from feature vectors has been used in diverse areas for different purposes, including high-dimensional search [28], multimedia indexing [38], and motif discovery [37], etc. Among the rich literature, the locality-sensitive hashing search technique proposed by Indyk and Motwani [12] is perhaps the most referenced i...

1 | Personal Communication
- Philip
- 2006
Citation Context ...rstate the need for efficient algorithms when working with such datasets. As another example, the number of projectile points in the collection at the authors’ institution exceeds one million objects [31]. We collected more than 16,000 projectile point images for an unrelated project, but can consider this dataset with our discord mining algorithm. As Figure 3 shows, the dataset comes from diverse sou...

1 | Discovering representative models in large time series models
- Rombo, Terracina
- 2004
Citation Context ...buckets. The good news is that there is little freedom for the |Σ| parameter. Extensive experiments carried out by the current authors [6][17][18][19] and dozens of other researchers worldwide [3][21][32][36] suggest that a value of either 3 or 4 is best for virtually any task on any dataset. After empirically confirming this on the current problem with experiments on more than 50 datasets, we will si...