CiteSeerX
The SPLASH-2 programs: characterization and methodological considerations. ISCA, (1995)

by S C Woo, M Ohara, E Torrie, J P Singh, A Gupta
Results 1 - 10 of 1,420

The PARSEC benchmark suite: Characterization and architectural implications

by Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, Kai Li - IN PRINCETON UNIVERSITY , 2008
Abstract - Cited by 518 (4 self)
This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

Citation Context

...ng benchmark suites fall short of the presented requirements and must thus be considered unsuitable for evaluating CMP performance. SPLASH-2 is a suite composed of multi-threaded applications [44] and hence seems to be an ideal candidate to measure performance of CMPs. However, its program collection is skewed towards HPC and graphics programs. It thus does not include parallelization models s...

The Landscape of Parallel Computing Research: A View from Berkeley

by Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick - TECHNICAL REPORT, UC BERKELEY , 2006
Abstract - Cited by 487 (25 self)

Logtm: Log-based transactional memory

by Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, David A. Wood - in HPCA , 2006
Abstract - Cited by 282 (11 self)
Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values “in place” (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, Log-based Transactional Memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32-way multiprocessor support our decision to optimize for commit by showing that only 1-2% of transactions abort.

Citation Context

...section describes the simulation of LogTM and a baseline system using spin locks (Section 3.1) and compares them using a microbenchmark (Section 3.2) and parallel applications from the SPLASH-2 suite [32] (Section 3.3). 3.1. Target System & Simulation Model LogTM and the baseline system share the same basic SPARC/Solaris multiprocessor architecture, summarized in Table 3. Each system has 32 processors...
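The LogTM abstract above describes eager version management: old values go to a per-thread undo log while new values are written in place, so commits are cheap and aborts walk the log backwards. A minimal software sketch of that idea (the `EagerTxn` class and its methods are hypothetical illustrations, not the authors' hardware mechanism):

```python
# Sketch of LogTM-style eager version management (hypothetical names).
# Old values are logged once per write; new values go in place.
# Commit (the common case) just discards the log; abort rolls it back.

class EagerTxn:
    def __init__(self, memory):
        self.memory = memory      # shared address space: addr -> value
        self.undo_log = []        # per-thread log of (addr, old_value)

    def write(self, addr, value):
        # Save the old value, then update memory in place.
        self.undo_log.append((addr, self.memory.get(addr)))
        self.memory[addr] = value

    def commit(self):
        # Fast path: nothing to copy back, just drop the log.
        self.undo_log.clear()

    def abort(self):
        # Rare path: restore old values in reverse order.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            self.memory[addr] = old

mem = {"x": 1}
t = EagerTxn(mem)
t.write("x", 99)
t.abort()
assert mem["x"] == 1   # old value restored from the log
t.write("x", 42)
t.commit()
assert mem["x"] == 42  # new value was already in place
```

The asymmetry matches the paper's measurement that only 1-2% of transactions abort: the expensive path is taken rarely.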

Memory System Characterization of Commercial Workloads

by Luiz André Barroso, Kourosh Gharachorloo, Edouard Bugnion - In Proceedings of the 25th annual international symposium on Computer architecture , 1998
Abstract - Cited by 253 (5 self)
Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers. This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches.

Citation Context

...ark suite. Similarly, design of multiprocessor architectures, along with academic research in this area, have been heavily influenced by popular scientific and engineering benchmarks such as SPLASH-2 [26] and STREAMS [13], with only a handful of published architectural studies that have in some way tried to address issues specific to commercial workloads [3, 7, 9, 12, 14, 16, 20, 21, 24]. The lack of ...

Disco: Running commodity operating systems on scalable multiprocessors

by Edouard Bugnion, Scott Devine, Mendel Rosenblum - ACM Transactions on Computer Systems , 1997
Abstract - Cited by 253 (10 self)
In this paper we examine the problem of extending modern operating systems to run efficiently on large-scale shared memory multiprocessors without a large implementation effort. Our approach brings back an idea popular in the 1970s, virtual machine monitors. We use virtual machines to run multiple commodity operating systems on a scalable multiprocessor. This solution addresses many of the challenges facing the system software for these machines. We demonstrate our approach with a prototype called Disco that can run multiple copies of Silicon Graphics’ IRIX operating system on a multiprocessor. Our experience shows that the overheads of the monitor are small and that the approach provides scalability as well as the ability to deal with the non-uniform memory access time of these systems. To reduce the memory overheads associated with running multiple operating systems, we have developed techniques where the virtual machines transparently share major data structures such as the program code and the file system buffer cache. We use the distributed system support of modern operating systems to export a partial single system image to the users. The overall solution achieves most of the benefits of operating systems customized for scalable multiprocessors yet it can be achieved with a significantly smaller implementation effort.

Citation Context

...ate of a remote virtual processor, for example TLB shootdowns and posting of an interrupt to a given virtual CPU. Overall, Disco is structured more like a highly tuned and scalable SPLASH application [27] than like a general-purpose operating system. 4.2.1 Virtual CPUs Like previous virtual machine monitors, Disco emulates the execution of the virtual CPU by using direct execution on the real CPU. To ...

Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory

by Daniel J. Scales, Kourosh Gharachorloo, Chandramohan A. Thekkath
Abstract - Cited by 236 (5 self)
This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient

Citation Context

...lementation. We primarily focus on characterizing the overhead of miss checks by presenting the static overheads for individual miss checks, the dynamic overheads for all of the SPLASH-2 applications [22], and the frequency of instrumented accesses in these applications. In addition, we present preliminary parallel performance results for some of the SPLASH-2 applications running on a cluster of Alpha...
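The Shasta abstract above explains that inserted code checks, on each shared load or store, whether the data is available locally before communicating. An illustrative software analogue of such an inline miss check (the `ShastaNode` class and its names are hypothetical; the real system rewrites the application executable):

```python
# Analogue of a Shasta-style inline miss check (hypothetical names).
# Each shared access first consults a per-node state table; a miss
# stands in for fetching the block from a remote node.

INVALID, SHARED, EXCLUSIVE = 0, 1, 2

class ShastaNode:
    def __init__(self):
        self.state = {}   # block -> local coherence state
        self.data = {}
        self.misses = 0   # counts simulated remote fetches

    def fetch(self, block):
        self.misses += 1             # stand-in for remote communication
        self.state[block] = SHARED
        self.data.setdefault(block, 0)

    def load(self, block):
        # The inserted miss check: is the block valid locally?
        if self.state.get(block, INVALID) == INVALID:
            self.fetch(block)
        return self.data[block]

n = ShastaNode()
n.load("A")          # first access misses and fetches
v = n.load("A")      # second access hits locally, no communication
assert n.misses == 1
```

The paper's overhead question is exactly the cost of that `if` test executed on every instrumented access, which is why reducing check overhead dominates the design.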

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

by Ravi Rajwar, James R. Goodman , 2001
Abstract - Cited by 227 (10 self)
Serialization of threads due to critical sections is a fundamental bottleneck to achieving high performance in multithreaded programs. Dynamically, such serialization may be unnecessary because these critical sections could have safely executed concurrently without locks. Current processors cannot fully exploit such parallelism because they do not have mechanisms to dynamically detect such false inter-thread dependences. We propose Speculative Lock Elision (SLE), a novel micro-architectural technique to remove dynamically unnecessary lock-induced serialization and enable highly concurrent multithreaded execution. The key insight is that locks do not always have to be acquired for a correct execution. Synchronization instructions are predicted as being unnecessary and elided. This allows multiple threads to concurrently execute critical sections protected by the same lock. Misspeculation due to inter-thread data conflicts is detected using existing cache mechanisms and rollback is used for recovery. Successful speculative elision is validated and committed without acquiring the lock. SLE can be implemented entirely in microarchitecture without instruction set support and without system-level modifications, is transparent to programmers, and requires only trivial additional hardware support. SLE can provide programmers a fast path to writing correct high-performance multithreaded programs.

Citation Context

...a critical section may in fact not conflict, and such accesses do not require serialization. Two such examples are shown in Figure 1. Figure 1a shows an example from a multithreaded application, ocean [40]. Since a store instruction (line 3) to a shared object is present, the lock is required. However, most dynamic executions do not perform the store operation and thus do not require the lock. Addition...
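SLE as described above is a micro-architectural mechanism, but its core idea — skip the lock, validate afterwards, fall back on conflict — has a simple read-side software analogue in the seqlock style. This sketch is an illustration of the concept only (all names are hypothetical; real SLE also elides writes and uses cache-based conflict detection):

```python
# Read-side software analogue of lock elision (hypothetical names):
# run the critical section speculatively without the lock, validate
# a version counter afterwards, and retry under the real lock if a
# concurrent writer was detected.
import threading

class ElidableCounter:
    def __init__(self):
        self.lock = threading.Lock()
        self.version = 0
        self.value = 0

    def read_if_quiet(self):
        # Speculative path: no lock acquired.
        v0 = self.version
        snapshot = self.value
        if self.version == v0:
            return snapshot          # elision succeeded, lock never touched
        with self.lock:              # conflict detected: take the lock
            return self.value

    def increment(self):
        with self.lock:
            self.version += 1
            self.value += 1

c = ElidableCounter()
c.increment()
assert c.read_if_quiet() == 1
```

As in SLE, uncontended executions never serialize on the lock, so multiple readers can proceed concurrently through the "critical section".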

Transactional Lock-Free Execution of Lock-Based Programs

by Ravi Rajwar, James R Goodman - In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems , 2002
Abstract - Cited by 201 (9 self)
This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmers often rely on conservative locking at the expense of performance. The resulting serialization of threads is a performance bottleneck. Locks also interact poorly with thread scheduling and faults, resulting in poor system performance.

Citation Context

...efore a successive local lock re-acquire, thus reducing unfairness. 5.2 Applications We use barnes, cholesky, and mp3d from SPLASH [34] and radiosity, water-nsq, ocean-cont, and raytrace from SPLASH-2 [39]. A locking version of mp3d is used to study the impact of TLR on a lock-intensive benchmark [16]. This version of mp3d does frequent synchronization to largely uncontended locks and lock access laten...

AVIO: Detecting Atomicity Violations via Access Interleaving Invariants

by Shan Lu, Joseph Tucek, Feng Qin, Yuanyuan Zhou - In ASPLOS , 2006
Abstract - Cited by 193 (26 self)
Concurrency bugs are among the most difficult to test and diagnose of all software bugs. The multicore technology trend worsens this

Citation Context

...s are actually implemented using data races. Examples include barriers, flag synchronization, producer-consumer queues, etc. Figure 2 gives two such examples from the SPLASH2 parallel benchmark suite [40], in which user-defined barrier and flag synchronization are achieved via races on the shared variables b→synch and r, respectively. Furthermore, sometimes programmers intentionally choose to allow a ...
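The citation context above mentions flag synchronization implemented via intentional data races. A sketch of that idiom (illustrative names; in C this is a genuine data race that a race detector like AVIO must recognize as intended synchronization, while CPython's interpreter lock happens to make this Python version well-behaved):

```python
# The flag-synchronization idiom: the consumer spins on a plain shared
# variable that the producer sets with an ordinary, unsynchronized write.
import threading

flag = False
result = []

def producer():
    global flag
    result.append(42)    # publish the data first
    flag = True          # then signal readiness via a racy write

def consumer():
    while not flag:      # racy spin-read, no lock or condition variable
        pass
    result.append(result[0] + 1)

t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start(); t2.start()
t2.join(); t1.join()
assert result == [42, 43]
```

The point for AVIO is that the write-then-read interleaving here is the programmer's intent, so an invariant-based detector must distinguish it from a harmful atomicity violation.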

Embra: Fast and Flexible Machine Simulation

by Emmett Witchel, Mendel Rosenblum - In Measurement and Modeling of Computer Systems , 1996
Abstract - Cited by 182 (3 self)
This paper describes Embra, a simulator for the processors, caches, and memory systems of uniprocessors and cache-coherent multiprocessors. When running as part of the SimOS simulation environment, Embra models the processors of a MIPS R3000/R4000 machine faithfully enough to run a commercial operating system and arbitrary user applications. To achieve high simulation speed, Embra uses dynamic binary translation to generate code sequences which simulate the workload. It is the first machine simulator to use this technique. Embra can simulate real workloads such as multiprocess compiles and the SPEC92 benchmarks running on Silicon Graphic's IRIX 5.3 at speeds only 3 to 9 times slower than native execution of the workload, making Embra the fastest reported complete machine simulator. Dynamic binary translation also gives Embra the flexibility to dynamically control both the simulation statistics reported and the simulation model accuracy with low performance overheads. For example, Embra...

Citation Context

... Andrew Benchmark [Ousterhout90], while the MAB pmake is a slightly modified form of the MAB (described in [Rosenblum95a]). The other multiprocessor workloads are taken from the Splash benchmark suite [Woo95]. Those applications run with settings different from the default have their arguments shown. of these workloads, reflected by the large percentage of time spent in the translation cache, accounts for...
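The Embra abstract and context above center on dynamic binary translation with a translation cache: guest code blocks are translated once into host code and reused on later executions. A toy illustration of that caching structure (all names and the "guest" instruction format are invented for the example; Embra itself translates MIPS code):

```python
# Toy dynamic-binary-translation loop (hypothetical names): compile a
# "guest" block into one host closure, cache it keyed by block id, and
# reuse the cached translation instead of re-interpreting.

translation_cache = {}

def translate(block):
    # "Compile" a guest block (a list of (op, operand) pairs) into a
    # single host function, paying the translation cost once.
    ops = list(block)
    def run(x):
        for op, n in ops:
            x = x + n if op == "add" else x * n
        return x
    return run

def execute(block_id, block, x):
    fn = translation_cache.get(block_id)
    if fn is None:                       # translation-cache miss
        fn = translation_cache[block_id] = translate(block)
    return fn(x)                         # hit: run the cached translation

prog = [("add", 3), ("mul", 2)]
assert execute(0, prog, 5) == 16         # (5 + 3) * 2
assert 0 in translation_cache            # subsequent executions reuse it
```

The "large percentage of time spent in the translation cache" in the quoted context corresponds to the hit path here: once hot blocks are translated, execution stays inside cached host code.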


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University