Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 36 2009-10-13
36 Model-Based Design for Embedded Systems
a translation of the binary code into the SystemC code generates a fast code
compared to an interpreting ISS, as no decoding of instructions is needed and
the generated SystemC code can be easily used within a SystemC simulation
environment. However, this approach has some major disadvantages. One
main drawback is that the same problems that have to be solved in the static
compilation (binary translation) have to be solved here (e.g., addresses of
calculated branch targets have to be determined). Another disadvantage is
that the automatically generated code is not very easily read by humans.
2.4.1 Back-Annotation of WCET/BCET Values
In this section, we will describe our approachinmore detail. Figure 2.6 shows
an overview of the approach.
First, the C source code has to be taken and translated using an ordinary
C (cross)-compiler into the binary code for the embedded processor (source
processor). After that, our back-annotation tool reads the object file and a
description of the used source processor. This description contains both a
description of the architecture and a description of the instruction set of the
processor.
Figure 2.4 shows an example for the description of the architecture. It
contains information about the resources of the processor (Figure 2.4a). This
information is used for the modeling of the pipeline. Furthermore, it contains
a description of the properties of the instruction (Figure 2.4b) and data caches
(Figure 2.4c). Furthermore, such a description can contain information about
the branch prediction of the processor.
Annotation of C code for a basic block
Architectural
model
C code corresponding to the cache
analysis blocks of the basic block
Cache
<cachelinesize>8</cachelinesize>
<cachesize>4096</cachesize> (c)
<replacement>lru</replacement>
<writebackpolicy>write-back</writebackpolicy>
</dcache>
</architecture>
FIGURE 2.4
Example for a description of the architecture.
Figure 2.5 shows an example for the description of the instruction set.
This description contains information about the structure of the bit image of
the instruction code (Figure 2.5c). It also contains information to determine
the timing behavior of instructions and the timing behavior of instructions
that are executed in context with other instructions (Figure 2.5d). Further-
more, for debugging and documentation purposes more information about
the instruction can be given (Figure 2.5a and b).
Using this description, the object code is decoded and translated into
an intermediate representation consisting of a list of objects. Each of these
objects represents one intermediate instruction.
In the next step, the basic blocks of this program are determined using the
intermediate representation consisting of a list of objects. As a result, using
this list, a list of basic blocks is built.
After that, the execution time is statically calculated for each basic block
with respect to the provided pipeline model of the proposed source proces-
sor. This calculation step is described in more detail in Section 2.4.3.
Subsequently, the back-annotation correspondences between the C
source code and the binary code are identified. Then, the back-annotation
process takes place. This is done by automated code instrumentation for
cycle generation and dynamic cycle correction. The structure and functional-
ity of this code are described in Section 2.4.2.
Not every impact of the processor architecture on the number of cycles
</instruction>
.
.
.
</processor>
FIGURE 2.5
Example for a description of an instruction.
additional code needs to be added. Further details concerning this code are
described in Section 2.4.5.
During back-annotation, the C program is transformed into a cycle-
accurate SystemC program that can be compiled to be executed on the pro-
cessor of the simulation host (target processor).
One advantage of this approach is a fast execution of the annotated code
as the C source code does not need major changes for back-annotation. More-
over, the generated SystemC code can be easily used within a SystemC sim-
ulation environment. The difficulty in using this approach is to find the cor-
responding parts of the binary code in the C source code if the compiler opti-
mizes or changes the structure of the binary code too much. If this happens,
recompilation techniques [4] have to be used to find the correspondences.
2.4.2 Annotation of SystemC Code
On the left-hand side of Figure 2.3, there is the necessary annotation of a
piece of the C code that corresponds to a basic block. The right-hand side of
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 39 2009-10-13
SystemC-Based Performance Analysis of Embedded Systems 39
Binary code
Annotated SystemC program
Find correspondences between
C source code and binary code
Construction of intermediate
representation
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 40 2009-10-13
40 Model-Based Design for Embedded Systems
If there is a conditional branch at the end of a basic block, branch predic-
tion has to be considered and possible correction cycles have to be added.
This is described in more detail in Section 2.4.5.
As shown in Figure 2.3, the back-annotation tool adds a call to the
consume function that performs cycle generation at the end of each basic
block code. If necessary, this instruction generates the number of cycles this
basic block would need on the source processor. How this consume function
works is described in Section 2.4.7.
In order to guarantee both—as fast as possible the execution of the code
as well as the highest possible accuracy—it is possible to choose different
accuracy levels of the generated code that parameterize the annotation tool.
The first and the fastest one is a purely static prediction. The second one
additionally includes the modeling of the branch prediction. And the third
one takes also the dynamic inclusion of instruction caches into account.
The cycle calculation in these different levels will be discussed in more
detail in the following sections.
2.4.3 Static Cycle Calculation of a Basic Block
In modern architectures, pipeline effects, superscalarity, and caches have an
important impact on the execution time. Because of this, a calculation of the
execution time of a basic block by summing the execution or latency times of
the single instructions of this block is very inaccurate.
Therefore, the incorporation of a pipeline model per basic block becomes
necessary [21]. This model helps statically predict pipeline effects and the
effects of superscalarity. For the generation of this model, informations about
the instruction set and the pipelines of the used processor are needed. These
informations is contained in the processor description that is used by the
annotation tool. With regard to this, the tool uses a modeling of the pipeline
to determine which instructions of the basic block will be executed in parallel
were used to detect conflicts for the scheduling of instructions [25]. In a reser-
vation table, the vertical dimension represents the pipeline stages and the
horizontal dimension represents the time. Figure 2.7 shows an example of
a basic block and the corresponding reservation table. In the figure, every
entry in the reservation table shows that the corresponding pipeline stage
is used in the particular time slot. The entry consists of the number of the
instruction that uses the resource. The timing interdependencies between the
instructions of a basic block are analyzed using the composition of their basic
block.
In the reservation table, not only conflicts that occur because of the differ-
ent pipeline stages, but also data dependencies between the instructions can
be considered.
76
5
543
21
21
521
521
521
43
43
43
43
int-Pipeline
EX
DI
FI
DI
EX
cuted instructions, register accesses should be in the same order as they are in
the program. This restriction is comparable with the usage of pipeline stages
in the order they are in the program, and, therefore, it can be modeled by an
extension of the reservation table.
2.4.4.1.3 Control Hazards
Some processors (like the MIPS R3000 [12]) use delayed branches to avoid
the waiting cycle that otherwise would occur because of the control hazard.
This can be modeled by adding a delay slot to the basic block with the branch
instruction. Such a modeling is possible, because the instruction in the delay
slot is executed regardless of the result of the branch instruction.
2.4.4.2 Calculation of Pipeline Overlapping
In order to be able to model the impact of architectural components such
as pipelines, the state of these components has to be known when the basic
block is entered. If the state is known, then it is possible to find out the gain
that results from the use of this component.
If it is known that in the control-flow graph of the program, node e
i
is the
predecessor of node e
j
, and the pipeline state after the execution of node e
i
is also known, then the information about this state can be used to calculate
the execution time of node e
j
. This means the gain resulting from the fact that
node e
i
is executed before node e
j
additional cycles can be needed if a correctly predicted branch is taken, as
the branch target has to be calculated and loaded in the program counter.
This problem is solved by implementing a model of the branch prediction
and by a comparison of the predicted branch behavior with the executed
branch behavior. If dynamic branch prediction is used, a model of the under-
lying state machine is implemented and its results are compared with the
executed branch behavior. The cycle count of each possible case is calcu-
lated and added to the cumulative cycle count before the next basic block is
entered.
2.4.5.2 Instruction Cache
Figure 2.3 shows that for the simulation of the instruction cache, every basic
block of the translated program has to be divided into several cache analysis
blocks. This has to be done until the tag changes or the basic block ends. After
that, a function call to the cache handling model is added. This code uses a
cache model to find out possible cache hits or misses.
The cache simulation will be explained in more detail in the next few
paragraphs. This explanation will start with a description of the cache
model.
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 44 2009-10-13
44 Model-Based Design for Embedded Systems
cycleCalcICache
C program Binary code Cache model
datalru
tagv
asm_inst
1
asm_inst
l+1
asm_inst
n
into a single cache line.
As every machine language instruction in such a cache analysis block has
the same tag and the same cache index, the addresses of the instructions can
be used to determine how a basic block has to be divided into cache analysis
blocks. This is because each address consists of the tag information and the
cache index.
The cache index information (iStart to iEnd in Figure 2.3) is used to deter-
mine at which cache position the instruction with this address is cached. The
tag information is used to determine which address was cached, as there can
be multiple addresses with the same cache index. Therefore, a changed cache
tag can be easily determined during the traversal of the binary code with
respect to the cache parameters. The block offset information is not needed
for the cache simulation, as no real caching of data takes place.
After the tag has been changed or at the end of a basic block, a function
call that handles the simulated cache and the calculation of the additional
cycles of cache misses are added to this block. More details about this func-
tion are described in the next section.
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 45 2009-10-13
SystemC-Based Performance Analysis of Embedded Systems 45
✞ ☎
int cycleCalculationICache( tag, iStart, iEnd )
{
for index = iStart to iEnd
{
if tag is found in index and valid bit is set then
{ // cache hit
renew lru information
return 0
}
else
tag has to be overwritten. After that, the new tag has to be written instead of
the found old one, and the valid bit for this tag has to be set. The lru infor-
mation has to be renewed as well. In the final step, the additional cycles are
returned and added to the cycle correction counter.
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 46 2009-10-13
46 Model-Based Design for Embedded Systems
2.4.6 Consideration of Task Switches
In modern embedded systems, software performance simulation has to han-
dle task switching and multiple interrupts. Cooperative task scheduling can
already be handled by the previously mentioned approach since the pre-
sented cache model is able to cope with nonpreemptive task switches. Inter-
rupts, and cooperative and nonpreemptive task scheduling can be handled
similarly because the task preemption is usually implemented by using soft-
ware interrupts. Therefore, the incorporation of interrupts is discussed in the
following.
Software interrupts had to be included in the SystemC model. This has
been achieved by the automatic insertion of dedicated preemption points
after cycle calculation. This approach provides an integration of different
user-defined task scheduling policies, and a task switch generates a soft-
ware interrupt. Since the cycle calculation is completed before a task switch
is executed and a global cache and branch prediction model is used, no other
changes are necessary. A minor deviation of the cycle count for certain pro-
cesses can occur because of the actual task switch that is carried out with a
small delay caused by the projection of the task preemption at the binary-
code level to the C/C++ source-code level. But, nevertheless, the cumulative
cycle count is still correct. The accuracy can be increased by the insertion of
the cycle calculation code after each C/C++ statement.
If the additional delay caused by the context switch itself has to be
included, the (binary) code of the context switch routine can be treated like
any other code.
return taskTime∗t_PERIOD;
}
void resetTaskTime()
{
taskTime=0;
}
✌
✝
✆
Listing 2.2
The delay function.
If a software task calls the consume function with a time value, T,as
a parameter, it decrements the time only if the calling software task is in
the state RUNNING. If the execution unit is withdrawn by the RTOS sched-
uler by a change of the execution state, the decrementation of the time in the
consume function will be suspended. By changing the state to RUNNING by
the scheduler, the software task can allocate an execution unit again, lead-
ing to a continuation of the decrementation of the time that was suspended
before.
2.5 Experimental Results
In order to test the execution speed and the accuracy of the translated code,
a few examples were compiled using a C compiler into an object code for the
Infineon TriCore processor [15]. This object code was also used to generate
an annotated SystemC code from the C code, as described in Section 2.4.1. As
a reference, the execution speed and the cycle count of the TriCore code have
been measured on a TriCore TC10GP evaluation board and on a TriCore
ISS [16].
The examples consist of two filters (fir and ellip) and two programs
that are part of audio-decoding routines (dpcm and subband).
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 48 2009-10-13
100
110
120
130
140
150
160
dpcm fir ellip subband
Tricore eva-
luation board
Annotated
SystemC 1
Annotated
SystemC 2
Tricore ISS
Million instructions per second
FIGURE 2.9
Comparison of speed. (Copyright: ACM. Used with permission.)
Figure 2.9 shows the comparison of the execution speed of the generated
code with the execution speed of the TriCore evaluation board and the ISS.
The execution speed in this figure is represented by million instructions of
the TriCore Processor per second. The Athlon 64 processor running the Sys-
temC code and the ISS had a clock rate of 2.4 GHz. The TriCore processor of
the evaluation board ran at 48 MHz.
Using the annotated SystemC code, two different types of annotations
have been used: the first one generates the cycles after the execution of each
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 49 2009-10-13
SystemC-Based Performance Analysis of Embedded Systems 49
basic block, the second one adds cycles to a cycle counter after each basic
block. The cycles are only generated when it is necessary (e.g., when com-
27500
30000
32500
35000
37500
dpcm fir ellip subband
Cycles
Tricore eva-
luation board
Annotated
SystemC 2
Tricore ISS
FIGURE 2.10
Comparison of cycle accuracy. (Copyright: ACM. Used with permission.)
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 50 2009-10-13
50 Model-Based Design for Embedded Systems
prediction and caches included) compared to the measured cycle count from
the evaluation board ranges between 4% for the program fir to 7% for the
program dpcm. This is in the same range as it is using conventional ISS.
2.6 Outlook
As clock frequencies cannot be increased as linearly as the number of cores,
modern processor architectures can exploit multiple cores to satisfy increas-
ing computational demands. The different cores can share architectural
resources such as data caches to speed up the access to common data. There-
fore, access conflicts and coherency protocols have a potential impact on the
runtimes of tasks executing on the cores.
The incorporation of multiple cores is directly supported by our SystemC
approach. Parallel tasks can easily be assigned to different cores, and the
code instrumentation by cycle information can be carried out independently.
However, shared caches can have a significant impact on the number of exe-
2003.
4. C. Cifuentes. Reverse compilation techniques. PhD thesis, Queensland
University of Technology Brisbane, Australia, November 19, 1994.
5. CoWare Inc. CoWare Processor Designer. http://www.coware.com/
PDF/products/ProcessorDesigner.pdf.
6. L. B. de Brisolara, Marcio F. da S. Oliveira, R. Redin, L. C. Lamb, L. Carro,
and F. R. Wagner. Using UML as front-end for heterogeneous software
code generation strategies. In Proceedings of the Design, Automation and
Test in Europe (DATE) Conference, Munich, Germany, pp. 504–509, 2008.
7. A. Donlin. Transaction level modeling: Flows and use models. In Pro-
ceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/
Software Codesign and System Synthesis (CODES+ISSS), San Jose, CA, pp.
75–80, 2004.
8. T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC.
Kluwer, Dordrecht, the Netherlands, 2002.
9. M. González Harbour, J. J. Gutiérrez García, J. C. Palencia Gutiérrez, and
J. M. Drake Moyano. MAST: Modeling and analysis suite for real time
applications. In Proceedings of the 13th Euromicro Conference on Real-Time
Systems (ECRTS), Delft, the Netherlands, pp. 125–134, 2001.
10. H. Heinecke. Automotive open system architecture – An industry-wide
initiative to manage the complexity of emerging automotive E/E archi-
tectures. In Convergence International Congress & Exposition On Transporta-
tion Electronics, Detroit, MI, 2004.
11. R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst. Sys-
tem level performance analysis—the SymTA/S approach. IEE Proceed-
ings Computers and Digital Techniques, 152(2):148–166, March 2005.
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 52 2009-10-13
52 Model-Based Design for Embedded Systems
12. Y. Hur, Y. H. Bae, S S. Lim, S K. Kim, B D. Rhee, S. L. Min,
C. Y. Park, H. Shin, and C S. Kim. Worst case timing analysis
the Design, Automation and Test in Europe (DATE) Conference,Munich,
Germany, pp. 236–241, 2006.
21. S S. Lim, Y. H. Bae, G. T. Jang, B D. Rhee, S. L. Min, C. Y. Park, H. Shin,
K. Park, S M. Moon, and C. S. Kim. An accurate worst case timing
analysis for RISC processors. IEEE Transactions on Software Engineering,
21(7):593–604, 1995.
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 53 2009-10-13
SystemC-Based Performance Analysis of Embedded Systems 53
22. R. Marculescu and A. Nandi. Probabilistic application modeling for
system-level performance analysis. In Proceedings of the Conference on
Design, Automation and Test in Europe (DATE), Munich, Germany, pp.
572–579, 2001.
23. M. Ajmone Marsan, G. Conte, and G. Balbo. A class of generalized
stochastic petri nets for the performance evaluation of multiprocessor
systems. ACM Transactions on Computer Systems, 2(2):93–122, 1984.
24. The MathWorks, Inc. Real-Time Workshop
R
Embedded Coder 5, Sep-
tember 2007.
25. Steven S. Muchnick. Advanced Compiler Design and Implementation.Mor-
gan Kaufmann Publishers, San Francisco, CA, 1997.
26. A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoff-
mann. A universal technique for fast and flexible instruction-set archi-
tecture simulation. In Proceedings of the 39th Design Automation Conference
(DAC), New York, pp. 22–27, 2002.
27. C. Norström, A. Wall, and W. Yi. Timed automata as task models for
event-driven systems. In Proceedings of the Sixth International Conference
on Real-Time Computing Systems and Applications (RTCSA), Hong Kong,
China, pp. 182–189, 1999.
timing simulation of embedded software. In Proceedings of the 45th Design
Automation Conference (DAC), Anaheim, CA, pp. 290–295, June 2008.
38. J. Schnerr, G. Haug, and W. Rosenstiel. Instruction set emulation for
rapid prototyping of SoCs. In Proceedings of the Design, Automation and
Test in Europe (DATE) Conference, Munich, Germany, pp. 562–567, 2003.
39. A. Siebenborn, O. Bringmann, and W. Rosenstiel. Communication analy-
sis for network-on-chip design. In International Conference on Parallel Com-
puting in Electrical Engineering (PARELEC), Dresden, Germany, pp. 315–
320, 2004.
40. A. Siebenborn, O. Bringmann, and W. Rosenstiel. Communication anal-
ysis for system-on-chip Design. In Proceedings of the Design, Automation
and Test in Europe (DATE) Conference, Paris, France, pp. 648–655, 2004.
41. A. Siebenborn, A. Viehl, O. Bringmann, and W. Rosenstiel. Control-flow
aware communication and conflict analysis of parallel processes. In Pro-
ceedings of the 12th Asia and South Pacific Design Automation Conference
(ASP-DAC), Yokohama, Japan, pp. 32–37, 2007.
42. E. W. Stark and S. A. Smolka. Compositional analysis of expected delays
in networks of probalistic I/O Automata. In IEEE Symposium on Logic in
Computer Science, Indianapolis, IN, pp. 466–477, 1998.
43. Synopsys, Inc. Synopsys Virtual Platforms. http://www.synopsys.com/
products/designware/virtual_platforms.html.
44. L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for
scheduling hard real-time systems. In IEEE International Symposium on
Circuits and Systems (ISCAS), Geneva, Switzerland, volume 4, pp. 101–
104, 2000.
Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 55 2009-10-13
SystemC-Based Performance Analysis of Embedded Systems 55
45. VaST Systems Technology. CoMET
R
3.3 FromDistributedSystemstoMPSoCs 64
3.3.1 Deriving Output Event Models 65
3.3.2 Response Time Analysis in the Presence of Shared Memory
Accesses 66
3.3.3 Deriving Aggregate Busy Time 68
3.4 HierarchicalCommunication 69
3.5 Scenario-AwareAnalysis 73
3.5.1 Echo Effect 74
3.5.2 Compositional Scenario-Aware Analysis 75
3.6 Sensitivity Analysis 76
3.6.1 Performance Characterization 76
3.6.2 Performance Slack 77
3.7 RobustnessOptimization 79
3.7.1 Use-Cases for Design Robustness 79
3.7.2 Evaluating Design Robustness 80
3.7.3 Robustness Metrics 81
3.7.3.1 Static Design Robustness 81
3.7.3.2 Dynamic Design Robustness 81
3.8 Experiments 82
3.8.1 Analyzing Scenario 1 . 85
3.8.2 Analyzing Scenario 2 . 86
3.8.3 Considering Scenario Change 86
3.8.4 Optimizing Design 87
3.8.5 System Dimensioning 87
3.9 Conclusion 87
References 88
57
Nicolescu/Model-Based Design for Embedded Systems 67842_C003 Finals Page 58 2009-10-13
58 Model-Based Design for Embedded Systems
3.1 Introduction
example, formal performance analysis methods are also becoming increas-
ingly important in the domain of tightly integrated multiprocessor system-
on-chips (MPSoCs). Although such components promise to deliver higher
performance at a reduced production cost and power consumption, they
introduce a new level of integration complexity. Like in distributed embed-
ded systems, multiprocessing comes at the cost of higher timing com-
plexity of interdependent computation, communication, and data storage
operations.
Also, many embedded systems (distributed or integrated) feature com-
munication layers that introduce a hierarchical timing structure into
the communication. This is addressed in this chapter with a formal
Nicolescu/Model-Based Design for Embedded Systems 67842_C003 Finals Page 59 2009-10-13
Formal Performance Analysis 59
representation and accurate modeling of the timing effects induced during
transmission.
Finally, today’s embedded systems deliver a multitude of different
software functions, each of which can be particularly important in a spe-
cific situation (e.g., in automotives: an electronic stability program (ESP)
and a parking assistance). A hardware platform designed to execute all of
these functions at the same time will be expensive and effectively over-
dimensioned given that the scenarios are often mutually exclusive. Thus,
in order to supply the desired functions at a competitive cost, systems are
only dimensioned for subsets of the supplied functions, so-called scenarios,
which are investigated individually. This, however, poses new pitfalls when
dimensioning distributed systems under real-time constraints. It becomes
mandatory to also consider the scenario-transition phase to prevent timing
failures.
This chapter presents an overview of a general, modular, and formal per-
formance analysis framework, which has successfully accommodated many
extensions. First, we present its basic procedure in Section 3.2. Several exten-
gather all relevant data, such as the execution time. This information can then
be used to derive the behavior within individual components, accounting for
local scheduling interference. Finally, the system-level timing is derived on
the basis of the lower-level results.
For an efficient system-level performance verification, embedded systems
are modeled with the highest possible level of abstraction. The smallest unit
modeling performance characteristics at the application level is called a task.
Furthermore, to distinguish computation and communication, tasks are cat-
egorized into computational and communication tasks. The hardware plat-
form is modeled by computational and communication resources, which are
referred to as CPUs and buses, respectively. Tasks are mapped on resources
in order to be executed. To resolve conflicting requests, each resource is asso-
ciated with a scheduler.
Tasks are activated and executed due to activating events that can be gen-
erated in a multitude of ways, including timer expiration, and task chaining
according to inter-task dependencies. Each task is assumed to have one input
first-in first-out (FIFO) buffer. In the basic task model, a task reads its acti-
vating data solely from its input FIFO and writes data into the input FIFOs
of dependent tasks. This basic model of a task is depicted in Figure 3.1a. Var-
ious extensions of this model also exist. For example, if the task may be sus-
pended during its execution, this can be modeled with the requesting-task
model presented in Section 3.3. Also, the direct task activation model has
been extended to more complex activation conditions and semantics [10].
3.2.2 Event Streams
The timing properties of the arrival of workload, i.e., activating events, at the
task inputs are described with an activation model. Instead of considering
each activation individually, as simulation does, formal performance anal-
ysis abstracts from individual activating events to event streams. Generally,
(b) System-level transactions
Local task execution