SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness (2009)
Venue: | FAST Work-in-Progress Report and Poster Session |
Citations: | 19 - 5 self |
Citations
3772 | Indexing by latent semantic analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context ...he D-dimensional attribute space. Since the computational costs for all attributes are unacceptably high in practice, we use a simple but effective semantic tool, i.e., Latent Semantic Indexing (LSI) =-=[12, 13]-=- to generate semantically correlated groups as shown in Section 3. The notion of semantic correlation has been used in many systems designs, optimizations and real-world applications. In what follows ... |
3631 | Authoritative sources in a hyperlinked environment
- Kleinberg
Citation Context ...contents, while ignoring file context that is utilized by most users in organizing and searching their data [38]. Furthermore, typical techniques successful for the web search, such as HITS algorithm =-=[39]-=- and Google search engine [40], leverage tagged and contextual links that do not inherently, let alone explicitly, exist in large-scale file systems. 6.2 Directory-based Subtree Partitioning Subtree-p... |
3326 | Handbook of Applied Cryptography
- Menezes, Oorschot, et al.
- 1997
Citation Context ...antics, denoted as R-tree. On the other hand, each Bloom filter embedded within an R-tree node for point query is assigned 1024 bits with k = 7 hash functions to fit memory constraints. We select MD5 =-=[34]-=- as the hash function for its relatively fast implementation. The value of an attribute is hashed into 128 bits by calculating its MD5 signature, which is then divided into four 32-bit values. We set ... |
2750 | R-trees: a dynamic index structure for spatial searching
- Guttman
- 1984
Citation Context ...ponents respectively from user and system views with automatic configuration to match query patterns. 2.1 Overview A semantic R-tree as shown on the right of Figure 1 is evolved from classical R-tree =-=[26]-=- and consists of index units (i.e., non-leaf nodes) containing location and mapping information and storage units (i.e., leaf nodes) containing file metadata, both of which are hosted on a collection ... |
2096 | Space/time trade-offs in hash coding with allowable errors
- Bloom
- 1970
Citation Context ...either an on-line multicast-based approach or an off-line pre-computation-based approach to locating the corresponding R-tree node. Specifically, for a point query, the home unit checks Bloom filters =-=[27]-=- stored locally in a way similar to the group-based hierarchical Bloom-filter array approach [28] and, for a complex query, the home unit checks the Minimum Bounding Rectangles (MBR) [26] to determine... |
1028 | Bigtable: a distributed storage system for structured data
- Chang, Dean, et al.
- 2006
653 | The ubiquitous B-tree
- Comer
- 1979
Citation Context .... Similar workload scale-up approaches have also been used in other studies [28, 32]. We compare SmartStore with two baseline systems. The first one is a popular database approach that uses a B+ tree =-=[33]-=- to index each metadata attribute, denoted as DBMS that here does not take into account database optimization. The second one is a simple, non-semantic R-tree-based database approach that organizes ea... |
322 | Latent semantic indexing: A probabilistic analysis
- Papadimitriou, Raghavan, et al.
- 1998
Citation Context ...he D-dimensional attribute space. Since the computational costs for all attributes are unacceptably high in practice, we use a simple but effective semantic tool, i.e., Latent Semantic Indexing (LSI) =-=[12, 13]-=- to generate semantically correlated groups as shown in Section 3. The notion of semantic correlation has been used in many systems designs, optimizations and real-world applications. In what follows ... |
299 | Algorithm AS 136: A k-means clustering algorithm. Applied Statistics
- Hartigan, Wong
- 1979
Citation Context ...r inner product. Due to space limitation, this paper only gives basic introductions to LSI and more details can be found in [12, 13]. While there are other tools available for grouping, such as K-means =-=[30]-=-, we choose LSI because of its high efficiency and easy implementation. The K-means [30] algorithm exploits multi-dimensional attributes of n items to cluster them into K (K ≤ n) partitions. While the... |
275 | Ceph: A Scalable, High-Performance Distributed File System
- Weil, Brandt, et al.
- 2006
Citation Context ...mputing [1]. As the storage capacity is approaching Exabytes and the number of files stored is reaching billions, directory-tree based metadata management widely deployed in conventional file systems =-=[2, 3]-=- can no longer meet the requirements of scalability and functionality. For the next-generation large-scale storage systems, new metadata organization schemes are desired to meet two critical goals: (1... |
269 | A comparison of file system workloads.
- Roselli, Lorch, et al.
- 2000
Citation Context ... the next-generation file systems, metadata accesses will very likely become a severe performance bottleneck as metadata-based transactions not only account for over 50% of all file system operations =-=[4, 5]-=- but also result in billions of pieces of metadata in directories. Given the sheer scale and complexity of the data and ... |
267 | Semantic file systems.
- Gifford, Jouvelot, et al.
- 1991
Citation Context ...ta, which exploits higher-dimensional static and dynamic attributes, or higher-dimensional localities than the simple temporal or spatial locality utilized in existing approaches. Semantic correlation =-=[10]-=- comes from the exploitation of high-dimensional attributes of metadata. To put things in perspective, linear brute-force approach uses 0-dimensional correlation while spatial/temporal locality approac... |
225 | From databases to dataspaces: a new abstraction for information management
- Franklin, Halevy, et al.
- 2005
Citation Context ...nce. Database research community has argued that existing DBMS for general-purpose applications would not be a “one size fit all” solution [44] and improvements may result from semantic-based designs =-=[45]-=-. 7. Conclusion The paper presents a new paradigm for organizing file metadata for next-generation file systems, called SmartStore, by exploiting file semantic information to provide efficient and sca... |
163 | Avoiding the Disk Bottleneck in the Data Domain Deduplication File System
- Zhu, Li, et al.
- 2008
Citation Context ...0 files that are closest to this description?". From a system’s point of view, SmartStore may help optimize storage system designs such as de-duplication, caching and prefetching. Data de-duplication =-=[22, 23]-=- aims to effectively and efficiently remove redundant data and compress data into a highly compact form for the purpose of data backup and archiving. One of the key problems is how to identify multipl... |
108 | Metadata Efficiency in Versioning File Systems
- Soules, Goodson, et al.
- 2003
Citation Context ...anges that have not been directly updated in the original replicas. This method eliminates many small, random and frequent visits to the index and has been widely used in most versioning file systems =-=[8, 9, 31]-=-. In order to maintain semantic correlation and locality, SmartStore creates versions for every group, represented as the first-level index unit that has been replicated to other index units. At time... |
98 | “One size fits all”: An idea whose time has come and gone.
- Stonebraker, Cetintemel
- 2005
Citation Context ...cess locality and “hot spot” data, to enhance system performance. Database research community has argued that existing DBMS for general-purpose applications would not be a “one size fit all” solution =-=[44]-=- and improvements may result from semantic-based designs [45]. 7. Conclusion The paper presents a new paradigm for organizing file metadata for next-generation file systems, called SmartStore, by expl... |
91 | Connections: using context to enhance file search.
- Soules, Ganger
- 2005
Citation Context ...ncy of content-based search heavily depends on files that contain explicitly understandable contents, while ignoring file context that is utilized by most users in organizing and searching their data =-=[38]-=-. Furthermore, typical techniques successful for the web search, such as HITS algorithm [39] and Google search engine [40], leverage tagged and contextual links that do not inherently, let alone expli... |
83 | A framework for evaluating storage system security.
- Riedel, Kallahalla, et al.
- 2002
Citation Context ...ile metadata prefetching performance. The probability of inter-file access is found to be up to 80% when considering four typical file system traces. Our preliminary results based on these and the HP =-=[17]-=-, MSN [18], and EECS [19] traces further show that exploiting semantic correlation of multi-dimensional attributes can help prune up to 99.9% search space [20]. Therefore, in this paper we proposed a ... |
82 | Measurement and analysis of large-scale network file system workloads.
- Leung, Pasupathy, et al.
- 2008
Citation Context ...of real traces where 45% requests from all 11,568,086 requests visit only 6.5% files from all 65,536 files that are sorted by file popularity. Measurement of large-scale network file system workloads =-=[15]-=- further verifies that fewer than 1% clients issue 50% file requests and over 60% re-open operations take place within one minute. Semantic correlation can be exploited to optimize system performance.... |
55 | A nine year study of file system and storage benchmarking
- Traeger, Zadok, et al.
- 2008
Citation Context ... the next-generation file systems, metadata accesses will very likely become a severe performance bottleneck as metadata-based transactions not only account for over 50% of all file system operations =-=[4, 5]-=- but also result in billions of pieces of metadata in directories. Given the sheer scale and complexity of the data and metadata in such systems, we must seriously ponder a few critical research probl... |
53 | Characterization of storage workload traces from production Windows Servers.
- Kavalanekar, Worthington, et al.
- 2008
Citation Context ...ta prefetching performance. The probability of inter-file access is found to be up to 80% when considering four typical file system traces. Our preliminary results based on these and the HP [17], MSN =-=[18]-=-, and EECS [19] traces further show that exploiting semantic correlation of multi-dimensional attributes can help prune up to 99.9% search space [20]. Therefore, in this paper we proposed a novel dece... |
36 | DB2 Parallel Edition
- Baru, Fecteau, et al.
- 1995
Citation Context ... billions of records. Some database vendors developed parallel databases to support large-scale data management, such as Oracle’s Real Application Cluster database [42] and IBM’s DB2 Parallel Edition =-=[43]-=-, by using a complete relational model with transactions. Although successful for managing relational databases, existing database management systems (DBMS) do not fully satisfy the requirements of me... |
29 | Hierarchical file systems are dead.
- Seltzer, Murphy
- 2009
Citation Context ...00. and/or a fee. SC 09 November 14-20, 2009, Portland, Oregon, USA (c) 2009 ACM 978-1-60558-744-8/09/11 ...$10.00. metadata in such systems, we must seriously ponder a few critical research problems =-=[6, 7]-=- such as “How to efficiently extract useful knowledge from an ocean of data?”, “How to manage the enormous number of files that have multi-dimensional or increasingly higher dimensional attributes?”, ... |
22 | Giga+: scalable directories for shared file systems
- Patil, Gibson, et al.
- 2007
Citation Context ...et alone explicitly, exist in large-scale file systems. 6.2 Directory-based Subtree Partitioning Subtree-partitioning based approaches have been widely used in recent studies, such as Ceph [3], GIGA+ =-=[41]-=-, Farsite [2] and Spyglass [8]. Ceph [3] maximizes the separation between data and metadata management by using a pseudorandom data distribution function to support a scalable and decentralized placem... |
16 | File grouping for scientific data management: Lessons from experimenting with real traces
- Doraimani, Iamnitchi
- 2008
Citation Context ... body of published work. Spyglass [8] reports that the locality ratios are below 1% in many given traces, meaning that correlated files are contained in less than 1% of the directory space. Filecules =-=[14]-=- reveals the existence of file grouping by examining a large set of real traces where 45% requests from all 11,568,086 requests visit only 6.5% files from all 65,536 files that are sorted by file popu... |
15 | Evaluation in (XML) information retrieval: expected precision-recall with user modelling (EPRUM)
- Piwowarski, Dupret
- 2006
Citation Context ... served accurately by Bloom filters. Figure 9. Average hit rate for point query. 5.4.2 Complex Queries We adopt “Recall” as a measure for complex query quality from the field of information retrieval =-=[35]-=-. Given a query q, we denote T(q) the ideal set of K nearest objects and A(q) the actual neighbors reported by SmartStore. We define recall as $\mathrm{recall} = |T(q) \cap A(q)| / |T(q)|$. (a) Top-8 NN query. (b) Range... |
12 | HBA: Distributed metadata management for large cluster-based storage systems
- Zhu, Jiang, et al.
- 2008
Citation Context ...he number of sub-traces replayed concurrently is denoted as the Trace Intensifying Factor (TIF) as shown in Table 1, 2 and 3. Similar workload scale-up approaches have also been used in other studies =-=[28, 32]-=-. We compare SmartStore with two baseline systems. The first one is a popular database approach that uses a B+ tree [33] to index each metadata attribute, denoted as DBMS that here does not take into ... |
10 | Nexus: A Novel Weighted-Graph-Based Prefetching Algorithm for Metadata Servers
- Gu, Zhu, et al.
- 2006
Citation Context ...loitation of high-dimensional attributes of metadata. To put things in perspective, linear brute-force approach uses 0-dimensional correlation while spatial/temporal locality approaches, such as Nexus =-=[11]-=- and Spyglass [8], use 1-dimensional correlation, which can be considered as special cases of our proposed approach that considers higher dimensional correlation. The main benefit of using semantic co... |
6 | Tap: Table-based prefetching for storage caches
- Li, Varki, et al.
Citation Context ...acent groups where duplicate copies can be placed together with high probability to narrow the search space and further facilitate fast identification. On the other hand, caching [24] and prefetching =-=[25]-=- are widely used in storage systems to improve I/O performance by exploiting spatial or temporal access locality. However, their performance in terms of hit rate varies largely from application to app... |
5 | FARMER: A novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file system performance
- Xia, Feng, et al.
- 2008
Citation Context ...n operations take place within one minute. Semantic correlation can be exploited to optimize system performance. Our research group has proposed metadata prefetching algorithms, Nexus [11] and FARMER =-=[16]-=-, in which both file access sequences and semantic attributes are considered in the evaluation of the correlation among files to improve file metadata prefetching performance. The probability of inter... |
3 | CLIC: CLient-Informed Caching for Storage Servers
- Liu, Aboulnaga, et al.
- 2009
Citation Context ... into the same or adjacent groups where duplicate copies can be placed together with high probability to narrow the search space and further facilitate fast identification. On the other hand, caching =-=[24]-=- and prefetching [25] are widely used in storage systems to improve I/O performance by exploiting spatial or temporal access locality. However, their performance in terms of hit rate varies largely fr... |
2 | High End Computing File System and
- Nunez
- 2008
Citation Context ...ing complex queries in large-scale file systems. 1. INTRODUCTION Fast and flexible metadata retrieving is a critical requirement in the next-generation data storage systems serving high-end computing =-=[1]-=-. As the storage capacity is approaching Exabytes and the number of files stored is reaching billions, directory-tree based metadata management widely deployed in conventional file systems [2, 3] can ... |
2 | New Challenges in Petascale Scientific Databases (Keynote Talk)
- Szalay
- 2008
Citation Context ... metadata in such systems, we must seriously ponder a few critical research problems =-=[6, 7]-=- such as “How to efficiently extract useful knowledge from an ocean of data?”, “How to manage the enormous number of files that have multi-dimensional or increasingly higher dimensional attributes?”, ... |
1 | Distributed Directory Service
- Douceur, Howell
- 2006
Citation Context ...mputing [1]. As the storage capacity is approaching Exabytes and the number of files stored is reaching billions, directory-tree based metadata management widely deployed in conventional file systems =-=[2, 3]-=- can no longer meet the requirements of scalability and functionality. For the next-generation large-scale storage systems, new metadata organization schemes are desired to meet two critical goals: (1... |
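Illustrative sketch: the citation context quoted under "Handbook of Applied Cryptography" describes hashing an attribute value with MD5 and dividing the 128-bit signature into four 32-bit values for a 1024-bit Bloom filter. The Python below is a minimal sketch of that step only, not the paper's implementation. The byte order, the modulo mapping to bit positions, and the names md5_words, bit_positions and BloomFilter are assumptions made for illustration. The excerpt is cut off before it explains how the k = 7 hash functions are obtained, so this sketch probes only the four positions it can derive directly.

```python
import hashlib
import struct

FILTER_BITS = 1024  # filter size quoted in the excerpt


def md5_words(value: bytes) -> tuple:
    """Split the 128-bit MD5 digest of `value` into four 32-bit integers."""
    digest = hashlib.md5(value).digest()  # 16 bytes = 128 bits
    # Big-endian unpacking is an assumption; the excerpt gives no byte order.
    return struct.unpack(">4I", digest)


def bit_positions(value: bytes, bits: int = FILTER_BITS) -> list:
    """Map each 32-bit word to a bit index (hypothetical modulo mapping)."""
    return [word % bits for word in md5_words(value)]


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, bits: int = FILTER_BITS):
        self.bits = bits
        self.array = 0

    def add(self, value: bytes) -> None:
        for pos in bit_positions(value, self.bits):
            self.array |= 1 << pos

    def might_contain(self, value: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all((self.array >> pos) & 1 for pos in bit_positions(value, self.bits))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add(b"owner=alice")  # hypothetical attribute value
    print(bf.might_contain(b"owner=alice"))
```

A point query would call might_contain on the queried attribute value; a False answer rules the value out, while a True answer can be a false positive and still needs checking against the stored metadata.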