A parallel approach to XML parsing
- In The 7th IEEE/ACM International Conference on Grid Computing, 2006
Cited by 8 (3 self)
Abstract: A language for semi-structured documents, XML has emerged as the core of the web services architecture and plays crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has a reputation for poor performance, and a number of optimizations have been developed to address this problem from different perspectives, none of which has been entirely satisfactory. In this paper, we present a seemingly quixotic, but novel approach: parallel XML parsing. Parallel XML parsing leverages the growing prevalence of multicore architectures in all sectors of the computer market, and yields significant performance improvements. This paper presents our design and implementation of parallel XML parsing. Our design consists of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse. The results of the preparsing phase are used to help partition the XML document for data-parallel processing. Our parallel parsing phase is a modification of the libxml2 [1] XML parser, which shows that our approach applies to real-world, production-quality parsers. Our empirical study shows that our parallel XML parsing algorithm can improve XML parsing performance significantly and scales well.
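The two-phase design described in this abstract (a cheap preparse that finds the document structure, then a parallel full parse of the resulting partitions) can be illustrated with a minimal sketch. This is not the authors' libxml2-based implementation: the function names `preparse` and `parallel_parse` are hypothetical, Python's standard `xml.etree.ElementTree` stands in for libxml2, and the scanner assumes input with no comments, CDATA sections, processing instructions, namespace prefixes, or `>` inside attribute values. Threads in CPython will not give a real speedup here; the point is only the shape of the pipeline.

```python
import re
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Matches start, end, and self-closing tags (simplifying assumption:
# no comments, CDATA, processing instructions, or '>' in attribute values).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)[^>]*?(/?)>")

def preparse(text):
    """Phase 1: find the root name and the (start, end) offsets of its children."""
    depth = 0
    spans = []
    root_name = None
    start = None
    for m in TAG.finditer(text):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if closing:
            depth -= 1
            if depth == 1:              # a child of the root just closed
                spans.append((start, m.end()))
        elif self_closing:
            if depth == 1:              # childless child of the root
                spans.append((m.start(), m.end()))
        else:
            if depth == 0:
                root_name = name
            elif depth == 1:            # a child of the root just opened
                start = m.start()
            depth += 1
    return root_name, spans

def parallel_parse(text, workers=4):
    """Phase 2: fully parse each partition in a worker, then join the subtrees."""
    root_name, spans = preparse(text)
    chunks = [text[s:e] for s, e in spans]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        subtrees = list(pool.map(ET.fromstring, chunks))
    # Root attributes and text between children are dropped in this sketch.
    root = ET.Element(root_name)
    root.extend(subtrees)
    return root

doc = "<root><a><b/></a><c x='1'/><d>t</d></root>"
tree = parallel_parse(doc)
print([child.tag for child in tree])
```

Splitting only at the children of the root is the simplest possible partitioning; a document with one very large child would need the preparse to descend further before any parallelism is available.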
A static load-balancing scheme for parallel XML parsing on multicore CPUs
- In CCGrid’07 (IEEE International Symposium on Cluster Computing and the Grid), Rio de Janeiro, 2007
Cited by 5 (2 self)
A number of techniques to improve the parsing performance of XML have been developed. Generally, however, these techniques have limited impact on the construction of a DOM tree, which can be a significant bottleneck. Meanwhile, the trend in hardware technology is toward an increasing number of cores per CPU. As we have shown in previous work, these cores can be used to parse XML in parallel, resulting in significant speedups. In this paper, we introduce a new static partitioning and load-balancing mechanism. By using a static, global approach, we reduce synchronization and load-balancing overhead, thus improving performance over dynamic schemes for a large class of XML documents. Our approach leverages libxml2 without modification, which reduces development effort and shows that our approach is applicable to real-world, production parsers. Our scheme works well with Sun’s Niagara class of CMT architectures, and shows that multiple hardware threads can be effectively used for XML parsing.
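The idea of deciding the whole work assignment up front, so that threads never synchronize to rebalance, can be sketched as follows. The abstract does not spell out the paper's partitioning algorithm, so this example substitutes a well-known greedy heuristic (largest item first, always onto the least-loaded worker). The function name `static_partition` and the use of subtree byte sizes as the cost estimate are assumptions made for illustration.

```python
import heapq

def static_partition(sizes, workers):
    """Assign each item to a worker before any parsing starts.

    sizes   -- estimated cost of each subtree (here: a size in bytes)
    workers -- number of threads
    Returns one list of item indices per worker.
    """
    # Min-heap of (current load, worker id); the lightest worker is on top.
    heap = [(0, w) for w in range(workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(workers)]
    # Place the largest subtrees first so small ones can even out the loads.
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, w = heapq.heappop(heap)
        assignment[w].append(idx)
        heapq.heappush(heap, (load + sizes[idx], w))
    return assignment

print(static_partition([7, 5, 4, 3, 1], 2))  # → [[0, 3], [1, 2, 4]]
```

In the printed example both workers end up with a total load of 10. Because the assignment is fixed before parsing begins, it is only as good as the size estimates; a dynamic scheme trades synchronization cost for robustness to bad estimates.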
ParaXML: A Parallel XML Processing Model on the Multicore CPUs
Cited by 3 (1 self)
XML has emerged as the de facto standard interoperable data format for web services, databases, and document processing systems. The processing of XML documents, however, has been recognized as the performance bottleneck in those systems; as a result, the demand for high-performance XML processing is growing rapidly. On the hardware front, multicore processors are increasingly available on desktop machines, with quad-core systems shipping now and 16-core systems expected within two or three years. Unfortunately, almost all present XML processing algorithms still use a serial processing model and are thus unable to take advantage of multicore resources. We believe a parallel XML processing model should be a cost-effective solution to the XML performance issue in the multicore era. In this paper, we present a general-purpose parallel XML processing model, ParaXML, designed for multicore CPUs. Generally speaking, ParaXML treats the XML document as a general tree structure and the XML processing task as an extension of the parallel tree traversal algorithm used for classic discrete optimization problems. XML processing, however, has characteristics quite distinct from those of classic discrete optimization problems, and thus demands special treatment and fine-grained tuning. ParaXML internally adopts a fine-grained work-stealing scheme to dynamically balance the load among the parallel threads, and a novel approach is introduced to trace the stealing actions and the partial results so that those results can be reduced into a final one. In addition, ParaXML provides tuning options, particularly for large XML documents, to control the trade-off between the parallelism gain and the task-partitioning overhead.
To show the feasibility and effectiveness of the ParaXML model, we demonstrate our parallel implementations of three fundamental XML processing tasks based on ParaXML: traversal, serializing, and parsing. The empirical study in this paper shows that those parallel implementations substantially improve performance and scale well on a multicore machine.
Storing Semi-structured Data on Disk Drives
Applications that manage semi-structured data are becoming increasingly commonplace. Current approaches for storing semi-structured data use existing storage machinery: they either map the data to relational databases, or use a combination of flat files and indexes. While employing these existing storage mechanisms provides readily available solutions, there is a need to more closely examine their suitability to this class of data. In particular, retrofitting existing solutions for semi-structured data can result in a mismatch between the tree structure of the data and the access characteristics of the underlying storage device (disk drive). This study explores the design space of native storage solutions for semi-structured data by examining alternative approaches that match application data access characteristics to those of the underlying disk drive. To evaluate the effectiveness of the proposed native techniques relative to the existing solution, we experiment with XML data using the XPathMark benchmark. Extensive evaluation reveals the strengths and weaknesses of the proposed native data layout techniques. While the existing solutions work well for deep-focused queries into a semi-structured document (those that result in retrieving entire subtrees), the proposed native solutions substantially outperform them for non-deep-focused queries, which we demonstrate are at least as important as deep-focused ones. We believe that native data layout techniques offer a unique direction for improving the performance of semi-structured data stores for a variety of important workloads. However, given that the proposed native techniques require circumventing current storage stack abstractions, further investigation is warranted before they can be applied to general-purpose storage systems.

