A Dependency Parser for Variable-Word-Order Languages
, 1990
Cited by 34 (1 self)
This paper presents a new approach to the recognition of sentence structure by computer in human languages that have variable word order. In a sense, the algorithm is not new; there is good evidence that it was known 700 years ago (Covington 1984). But it has not been implemented on computers, and the modern implementations that are most like it fail to realize its crucial advantage for dealing with variable word order. In fact, present-day parsing technology is so tied to the fixed word order of English that researchers in Germany and Japan customarily build parsers for English rather than their own languages. The new
Code Optimizers and Register Organizations for Vector Architectures
, 1992
Cited by 19 (0 self)
A major challenge facing computer architects today is designing cost-effective hardware that executes multiple operations simultaneously. The goal of such designs is to improve performance by taking advantage of fine-grain parallelism. In this dissertation, I study vector architectures, the oldest of several processor designs that support fine-grain parallelism. Because implementing a cost-effective processor that performs well requires studying not only the design of processors but also the design of algorithms for compilers, this dissertation encompasses aspects of both hardware and software design. In the first half of this dissertation, I demonstrate that a vector architecture is a cost-effective processor that supports fine-grain parallelism. I show that implementing a vector architecture is no more costly than implementing a superscalar architecture, which is currently popular among designers of VLSI microprocessors. I then show that programs that are rich in parallelism tend als...
DASC cache
, 1995
Cited by 10 (0 self)
For many microprocessors, cache hit time determines the clock cycle. On the other hand, cache miss penalty (measured in instruction issue delays) becomes higher and higher. Reconciling a low cache miss ratio with a low cache hit time is an important issue. When caches are virtually indexed, the operating system (or some specific hardware) has to manage data consistency of caches and memory. Unfortunately, reconciling physical indexing of the cache with a low cache hit time is very difficult. In this paper, we propose the Direct-mapped Access Set-associative Check cache (DASC) for addressing both difficulties. On a DASC cache, the cache array is direct-mapped, so the cache hit time is low. However, the tag array is set-associative, and the external miss ratio on a DASC cache is the same as the miss ratio on a set-associative cache. When the size of an associativity degree of the tag array is tied to the minimum page size, a virtually indexed but physically tagged DASC cache correctly handles a...
Cache Refill/Access Decoupling for Vector Machines
- In MICRO 37: Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
, 2004
Cited by 9 (5 self)
Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands, but then require expensive logic to track large numbers of outstanding cache misses to sustain peak bandwidth from memory. We present refill/access decoupling, which augments the vector processor with a Vector Refill Unit (VRU) to quickly pre-execute vector memory commands and issue any needed cache line refills ahead of regular execution. The VRU reduces costs by eliminating much of the outstanding miss state required in traditional vector architectures and by using the cache itself as a cost-effective prefetch buffer. We also introduce vector segment accesses, a new class of vector memory instructions that efficiently encode two-dimensional access patterns. Segments reduce address bandwidth demands and enable more efficient refill/access decoupling by increasing the information contained in each vector memory command. Our results show that refill/access decoupling is able to achieve better performance with fewer resources than more traditional decoupling methods. Even with a small cache and memory latencies as long as 800 cycles, refill/access decoupling can sustain several kilobytes of in-flight data with minimal access management state and no need for expensive reserved element buffering.
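The two-dimensional access pattern that a vector segment access encodes can be illustrated with a small software model. This is only a sketch under assumptions, not the instruction set defined in the paper: the function name `segment_load` and its parameters (`base`, `stride`, `seg_len`, `vlen`) are hypothetical, and memory is modeled as a flat Python list.

```python
def segment_load(memory, base, stride, seg_len, vlen):
    """Model of a segment load (hypothetical interface, for illustration only).

    For each of the vlen vector elements, read seg_len consecutive words
    starting at base + i * stride. Returns seg_len registers of vlen
    elements each: register f holds field f of every segment.
    """
    regs = [[0] * vlen for _ in range(seg_len)]
    for i in range(vlen):              # one segment per vector element
        seg_base = base + i * stride
        for f in range(seg_len):       # consecutive words within the segment
            regs[f][i] = memory[seg_base + f]
    return regs


# Four RGB pixels stored as an array of structures: R0 G0 B0 R1 G1 B1 ...
pixels = [10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33]
red, green, blue = segment_load(pixels, base=0, stride=3, seg_len=3, vlen=4)
```

In this model a single command carries the base address, stride, segment length, and vector length, whereas the same transfer expressed with ordinary strided loads would need one command per field (three here). That is one plausible reading of how segments increase the information contained in each vector memory command.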
Functional Implementation Techniques for CPU Cache Memories
- IEEE TRANS. ON COMPUTERS, SPECIAL ISSUE ON CACHE MEMORY
, 1999
Cited by 5 (1 self)
As the performance gap between processors and main memory continues to widen, increasingly aggressive implementations of cache memories are needed to bridge the gap. In this paper, we consider some of the issues that are involved in the implementation of highly optimized cache memories and survey the techniques that can be used to help achieve the increasingly stringent design targets and constraints of modern processors. In particular, we consider techniques that enable the cache to be accessed quickly and still achieve a good hit ratio. We also consider issues such as area cost and bandwidth requirements. Trace-driven simulations of a TPC-C-like workload and selected applications from the SPEC95 benchmark suite are used in the paper to compare the performance of some of the techniques.
Implementation Issues in Modern Cache Memory
, 1998
Cited by 2 (0 self)
As the performance gap between processors and main memory continues to widen, increasingly aggressive implementations of cache memories are needed to bridge the gap. In this paper, we consider some of the issues that are involved in the implementation of highly optimized cache memories and survey the techniques that can be used to help achieve the increasingly stringent design targets and constraints of modern processors. In particular, we consider techniques that enable the cache to be accessed quickly and still achieve a good hit ratio. We also consider issues such as area cost and bandwidth requirements. Trace-driven simulations of a TPC-C-like workload and selected applications from the SPEC95 benchmark suite are used in the paper to compare the performance of some of the techniques.
Ring-oriented Block Matrix Factorization Algorithms for Shared and Distributed Memory Architectures
- Report UMINF-92.04, Inst. of Information Processing, Univ. of Umea, S-901 87 Umea
, 1992
Cited by 1 (1 self)
Utilizing experiences from the implementations on shared memory multiprocessors (SMM) and distributed memory multicomputers (DMM), general ring-oriented routines are developed for the LU, Cholesky, and QR factorizations. Since all machine dependencies are confined to a small set of communication routines, the same factorization routines can be used on both the SMM and DMM architectures. The algorithms are described at a high level with a focus on the portability aspects. Further, detailed implementations of the LU factorization and machine-specific communication routines for the Alliant FX2816, Intel iPSC/2, and IBM 3090VF/600J are enclosed. Timing results show that the performance of machine-specific implementations is preserved for the general ring-oriented block algorithms. Keywords: Block matrix factorizations, parallel algorithms, portability, shared and distributed memory architectures. 1 Introduction With the introduction of advanced parallel computer architectures a demand ...
Software Exploitation of a Fault-Tolerant Computer with a Large Memory
, 1998
Cited by 1 (0 self)
The DM/6000 hardware (a prototype, fault-tolerant RS/6000 built at the TJ Watson Research Center) provides fault tolerance and a large, nonvolatile main memory. Running a commercial, general-purpose operating system on it, of itself, does nothing to increase software availability. In fact, the time to rebuild the contents of a large memory may decrease availability. We describe our techniques for hiding most of the main memory, which requires the operating system to access it only by way of services separate from the operating system. This can allow the memory and those access services to achieve much higher availability, which, in turn, increases the availability of the system as a whole. We also performed simulation studies to determine those conditions where this system organization can lead to improved performance for recoverable database applications. 1 Introduction The DM/6000 [1] is a prototype, fault-tolerant 4-way multiprocessor RS/6000 with a large main memory built at the TJ...
ARCHITECTURE-INDEPENDENT ENVIRONMENT FOR DEVELOPING ENGINEERING SOFTWARE ON MIMD COMPUTERS
Engineers are constantly faced with solving problems of increasing complexity and detail. They frequently rely upon numerical methods to solve these problems, and their insatiable appetite for improved performance from computing hardware has reached a point where the computational requirements exceed reasonable expectations of the performance of von Neumann (serial) computers. Multiple Instruction stream Multiple Data stream (MIMD) computers have been developed to overcome the performance limitations of serial computers. The hardware architectures of MIMD computers vary considerably and are much more sophisticated than serial computers. Developing large-scale software for a variety of MIMD computers is difficult and expensive. There is a need to provide tools that facilitate programming these machines. The first part of this report examines the issues that must be considered to develop those tools. The two main areas of concern were architecture independence and data management. Architecture-independent software facilitates software portability and improves the longevity and utility of the software product. It provides some form of insurance for the

