Praise for Embedded Computing: A VLIW
Approach to Architecture, Compilers
and Tools
There is little doubt that embedded computing is the new frontier of computer research.
There is also a consensus that VLIW technology is extremely powerful in this domain.
This book speaks with an authoritative voice on VLIW for embedded with true technical
depth and deep wisdom from the pioneering experiences of the authors. This book will
find a place on my shelf next to the classic texts on computer architecture and compiler
optimization. It is simply that good.
Tom Conte, Center for Embedded Systems Research, North Carolina State University
Written by one of the field’s inventors with his collaborators, this book is the first complete
exposition of the VLIW design philosophy for embedded systems. It can be read as a
stand-alone reference on VLIW — a careful treatment of the ISA, compiling and program
analysis tools needed to develop a new generation of embedded systems — or as a series
of design case studies drawn from the authors’ extensive experience. The authors’ style
is careful yet informal, and the book abounds with “flames,” debunked “fallacies” and
other material that engages the reader in the lively interplay between academic research
and commercial development that has made this aspect of computer architecture so
exciting. Embedded Computing: A VLIW Approach to Architecture, Compilers, and
Tools will certainly be the definitive treatment of this important chapter in computer
architecture.
Richard DeMillo, Georgia Institute of Technology
This book does a superb job of laying down the foundations of VLIW computing and
conveying how the VLIW principles have evolved to meet the needs of embedded computing.
Due to the additional attention paid to characterizing a wide range of embedded
applications and development of an accompanying toolchain, this book sets a new standard
both as a reference and a text for embedded computing.
Rajiv Gupta, The University of Arizona
A wealth of wisdom on a high-performance and power-efficient approach to embedded
computing. I highly recommend it for both engineers and students.
Embedded Computing
A VLIW Approach to
Architecture, Compilers and Tools
Joseph A. Fisher
Paolo Faraboschi
Cliff Young
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Publisher: Denise E. M. Penrose
Publishing Services Manager: Simon Crump
Senior Production Editor: Angela Dooley
Editorial Assistant: Valerie Witte
Cover Design: Hannus Design
Cover Image: Santiago Calatrava’s Alamillo Bridge
Text Design: Frances Baca Design
Composition: CEPHA
Technical Illustration: Dartmouth Publishing
Copyeditor: Daril Bentley
Proofreader: Phyllis Coyne & Associates
Indexer: Northwind Editorial
Interior printer: The Maple-Vail Manufacturing Group
Cover printer: Phoenix Color, Inc.
Morgan Kaufmann Publishers is an imprint of Elsevier. 500 Sansome Street, Suite 400, San Francisco, CA 94111
This book is printed on acid-free paper.
© 2005 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks.
In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or
To the memory of my late parents Silvio and Gina,
to my wife Tatiana and our daughter Silvia.
Paolo Faraboschi
To the women of my family:
Yueh-Jing, Dorothy, Matilda, Joyce, and Celeste.
Cliff Young
To Bob Rau, a VLIW pioneer and true visionary,
and a wonderful human being.
We were privileged to know and work with him.
The Authors
About the Authors
JOSEPH A. FISHER is a Hewlett-Packard Senior Fellow at HP Labs, where he has
worked since 1990 in instruction-level parallelism and in custom embedded VLIW
processors and their compilers. Josh studied at the Courant Institute of NYU (B.A., M.A.,
and then Ph.D. in 1979), where he devised the trace scheduling compiler algorithm
and coined the term instruction-level parallelism. As a professor at Yale University, he
created and named VLIW architectures and invented many of the fundamental
technologies of ILP. In 1984, he started Multiflow Computer with two members of his Yale
team. Josh won an NSF Presidential Young Investigator Award in 1984, was the 1987
Connecticut Eli Whitney Entrepreneur of the Year, and in 2003 received the ACM/IEEE
Eckert-Mauchly Award.
PAOLO FARABOSCHI is a Principal Research Scientist at HP Labs. Before joining
Hewlett-Packard in 1994, Paolo received an M.S. (Laurea) and Ph.D. (Dottorato di
Ricerca) in electrical engineering and computer science from the University of Genoa
(Italy) in 1989 and 1993, respectively. His research interests skirt the boundary of
hardware and software, including VLIW architectures, compilers, and embedded systems.
More recently, he has been looking at the computing aspects of demanding
content-processing applications. Paolo is an active member of the computer architec-
was interesting, but mostly in the same sense that studying cosmology was interesting:
intellectually challenging, but what does it have to do with me?
I should have known better. I don’t think Josh Fisher can write boring text. He
doesn’t know how. (I still consider his “Very Long Instruction Word Architectures and
the ELI-512” paper from ISCA-10 to be the finest conference publication I have ever read.)
And he seems to have either found like-minded coauthors in Faraboschi and Young or
has taught them well, because Embedded Computing: A VLIW Approach to Architecture,
Compilers and Tools is enthralling in its clarity and exhilarating in its scope. If you are
involved in computer system design or programming, you must still read this book,
because it will take you to places where the views are spectacular, including those
looking over to where you usually live. You don’t necessarily have to agree with every
point the authors make, but you will understand what they are trying to say, and they
will make you think.
One of the best legacies of the classic Hennessy and Patterson computer architecture
textbooks is that the success of their format and style has encouraged more books like
theirs. In Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools,
you will find the pitfalls, controversies, and occasional opinion sidebars that made
H&P such a joy to read. This kind of technical exposition is like vulcanology done while
standing on an active volcano. Look over there, and see molten lava running under a
new fissure in the rocks. Feel the heat; it commands your full attention. It’s immersive,
it’s interesting, and it’s immediate. If your Vibram soles start melting, it’s still worth it.
You probably needed new shoes anyway.
I first met Josh when I was a grad student at Carnegie-Mellon in 1982. He spent an
hour earnestly describing to me how a sufficiently talented compiler could in principle
find enough parallelism, via a technique he called trace scheduling, to keep a really
wild-looking hardware engine busy. The compiler would speculatively move code all
over the place, and then invent more code to fix up what it got wrong. I thought to myself
An Introduction to Embedded Processing 1
1.1 What Is Embedded Computing? 3
1.1.1 Attributes of Embedded Devices 4
1.1.2 Embedded Is Growing 5
1.2 Distinguishing Between Embedded and General-Purpose Computing 6
1.2.1 The “Run One Program Only” Phenomenon 8
1.2.2 Backward and Binary Compatibility 9
1.2.3 Physical Limits in the Embedded Domain 10
1.3 Characterizing Embedded Computing 11
1.3.1 Categorization by Type of Processing Engine 12
Digital Signal Processors 13
Network Processors 16
1.3.2 Categorization by Application Area 17
The Image Processing and Consumer Market 18
The Communications Market 20
The Automotive Market 22
1.3.3 Categorization by Workload Differences 22
1.4 Embedded Market Structure 23
1.4.1 The Market for Embedded Processor Cores 24
1.4.2 Business Model of Embedded Processors 25
1.4.3 Costs and Product Volume 26
1.4.4 Software and the Embedded Software Market 28
1.4.5 Industry Standards 28
1.4.6 Product Life Cycle 30
1.4.7 The Transition to SoC Design 31
Effects of SoC on the Business Model 34
Centers of Embedded Design 35
Attached Signal Processors 72
Horizontal Microcode 72
2.5.2 The Development of ILP Code Generation in the 1980s 73
Acyclic Microcode Compaction Techniques 73
Cyclic Techniques: Software Pipelining 75
2.5.3 VLIW Development in the 1980s 76
2.5.4 ILP in the 1990s and 2000s 77
2.6 Exercises 78
CHAPTER 3
An Overview of ISA Design 83
3.1 Overview: What to Hide 84
3.1.1 Architectural State: Memory and Registers 84
3.1.2 Pipelining and Operational Latency 85
3.1.3 Multiple Issue and Hazards 86
Exposing Dependence and Independence 86
Structural Hazards 87
Resource Hazards 89
3.1.4 Exception and Interrupt Handling 89
3.1.5 Discussion 90
3.2 Basic VLIW Design Principles 91
3.2.1 Implications for Compilers and Implementations 92
3.2.2 Execution Model Subtleties 93
3.3 Designing a VLIW ISA for Embedded Systems 95
3.3.1 Application Domain 96
3.3.2 ILP Style 98
3.3.3 Hardware/Software Tradeoffs 100
3.4 Instruction-set Encoding 101
Fixed-point Multiplication 133
Integer Division 135
Floating-point Operations 136
Saturated Arithmetic 137
4.1.4 Micro-SIMD Operations 139
Alignment Issues 141
Precision Issues 141
Dealing with Control Flow 142
Pack, Unpack, and Mix 143
Reductions 143
4.1.5 Constants 144
4.2 Registers and Clusters 144
4.2.1 Clustering 145
Architecturally Invisible Clustering 147
Architecturally Visible Clustering 147
4.2.2 Heterogeneous Register Files 149
4.2.3 Address and Data Registers 149
4.2.4 Special Register File Features 150
Indexed Register Files 150
Rotating Register Files 151
4.3 Memory Architecture 151
4.3.1 Addressing Modes 152
4.3.2 Access Sizes 153
4.3.3 Alignment Issues 153
4.3.4 Caches and Local Memories 154
Prefetching 154
Local Memories and Lockable Caches 156
4.3.5 Exotic Addressing Modes for Embedded Processing 156
4.4 Branch Architecture 156
5.3 VLIW Fetch, Sequencing, and Decoding 191
5.3.1 Instruction Fetch 191
5.3.2 Alignment and Instruction Length 192
5.3.3 Decoding and Dispersal 194
5.3.4 Decoding and ISA Extensions 195
5.4 The Datapath 195
5.4.1 Execution Units 197
5.4.2 Bypassing and Forwarding Logic 200
5.4.3 Exposing Latencies 202
5.4.4 Predication and Selects 204
5.5 Memory Architecture 206
5.5.1 Local Memory and Caches 206
5.5.2 Byte Manipulation 209
5.5.3 Addressing, Protection, and Virtual Memory 210
5.5.4 Memories in Multiprocessor Systems 211
5.5.5 Memory Speculation 213
5.6 The Control Unit 214
5.6.1 Branch Architecture 214
5.6.2 Predication and Selects 215
5.6.3 Interrupts and Exceptions 216
5.6.4 Exceptions and Pipelining 218
Drain and Flush Pipeline Models 218
Early Commit 219
Delayed Commit 220
5.7 Control Registers 221
5.8 Power Considerations 221
6.4.2 Compiled Simulation 259
Memory 262
Registers 263
Control Flow 263
Exceptions 266
Analysis of Compiled Simulation 267
Performance Measurement and Compiled Simulation 268
6.4.3 Dynamic Binary Translation 268
6.4.4 Trace-driven Simulation 270
6.5 System Simulation 271
6.5.1 I/O and Concurrent Activities 272
6.5.2 Hardware Simulation 272
Discrete Event Simulation 274
6.5.3 Accelerating Simulation 275
In-Circuit Emulation 275
Hardware Accelerators for Simulation 276
6.6 Validation and Verification 276
6.6.1 Co-simulation 278
6.6.2 Simulation, Verification, and Test 279
Formal Verification 280
Design for Testability 280
Debugging Support for SoC 281
6.7 Further Reading 282
6.8 Exercises 284
CHAPTER 7
Embedded Compiling and Toolchains 287
7.1 What Is Important in an ILP Compiler? 287
7.2 Embedded Cross-Development Toolchains 290
7.6 DSP-Specific Compiler Optimizations 320
7.6.1 Compiler-visible Features of DSPs 322
Heterogeneous Registers 322
Addressing Modes 322
Limited Connectivity 323
Local Memories 323
Harvard Architecture 324
7.6.2 Instruction Selection and Scheduling 325
7.6.3 Address Computation and Offset Assignment 327
7.6.4 Local Memories 327
7.6.5 Register Assignment Techniques 328
7.6.6 Retargetable DSP and ASIP Compilers 329
7.7 Further Reading 332
7.8 Exercises 333
CHAPTER 8
Compiling for VLIWs and ILP 337
8.1 Profiling 338
8.1.1 Types of Profiles 338
8.1.2 Profile Collection 341
8.1.3 Synthetic Profiles (Heuristics in Lieu of Profiles) 341
8.1.4 Profile Bookkeeping and Methodology 342
8.1.5 Profiles and Embedded Applications 342
8.2 Scheduling 343
8.2.1 Acyclic Region Types and Shapes 345
Basic Blocks 345
Traces 345
Superblocks 345
Hyperblocks 347
Treegions 347
Percolation Scheduling 348
The Run-time System 399
9.1 Exceptions, Interrupts, and Traps 400
9.1.1 Exception Handling 400
9.2 Application Binary Interface Considerations 402
9.2.1 Loading Programs 404
9.2.2 Data Layout 406
9.2.3 Accessing Global Data 407
9.2.4 Calling Conventions 409
Registers 409
Call Instructions 409
Call Sites 410
Function Prologues and Epilogues 412
9.2.5 Advanced ABI Topics 412
Variable-length Argument Lists 412
Dynamic Stack Allocation 413
Garbage Collection 414
Linguistic Exceptions 414
9.3 Code Compression 415
9.3.1 Motivations 416
9.3.2 Compression and Information Theory 417
9.3.3 Architectural Compression Options 417
Decompression on Fetch 420
Decompression on Refill 420
Load-time Decompression 420
9.3.4 Compression Methods 420
Hand-tuned ISAs 421
Ad Hoc Compression Schemes 421
RAM Decompression 422
Restricted Pointers 456
Fixed-point Data Types 459
Circular Arrays 461
Matrix Referencing and Operators 462
10.1.7 Pragmas, Intrinsics, and Inline Assembly Language Code 462
Compiler Pragmas and Type Annotations 462
Assembler Inserts and Intrinsics 463
10.2 Performance, Benchmarking, and Tuning 465
10.2.1 Importance and Methodology 465
10.2.2 Tuning an Application for Performance 466
Profiling 466
Performance Tuning and Compilers 467
Developing for ILP Targets 468
10.2.3 Benchmarking 473
10.3 Scalability and Customizability 475
10.3.1 Scalability and Architecture Families 476
10.3.2 Exploration and Scalability 477
10.3.3 Customization 478
Customized Implementations 479
10.3.4 Reconfigurable Hardware 480
Using Programmable Logic 480
10.3.5 Customizable Processors and Tools 481
Describing Processors 481
10.3.6 Tools for Customization 483
Customizable Compilers 485
10.3.7 Architecture Exploration 487
Dealing with the Complexity 488
Other Barriers to Customization 488
Engine Control Units 520
In-vehicle Networking 520
11.3.3 Hard Disk Drives 522
Motor Control 524
Data Decoding 525
Disk Scheduling and On-disk Management Tasks 526
Disk Scheduling and Off-disk Management Tasks 527
11.3.4 Networking and Network Processors 528
Network Processors 531
11.4 Further Reading 535
11.5 Exercises 537
APPENDIX A
The VEX System 539
A.1 The VEX Instruction-set Architecture 540
A.1.1 VEX Assembly Language Notation 541
A.1.2 Clusters 542
A.1.3 Execution Model 544
A.1.4 Architecture State 545
A.1.5 Arithmetic and Logic Operations 545
Examples 547
A.1.6 Intercluster Communication 549
A.1.7 Memory Operations 550
A.1.8 Control Operations 552
Examples 553
A.1.9 Structure of the Default VEX Cluster 554
Register Files and Immediates 555
A.1.10 VEX Semantics 556
A.2 The VEX Run-time Architecture 558
A.2.1 Data Allocation and Layout 559
A.2.2 Register Usage 560