Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 186 2009-10-2
186 Model-Based Design for Embedded Systems
object request broker (HORBA) when the support of small-grain parallelism
is needed.
Our most recent developments in MultiFlex are mostly focused on the
support of the streaming programming model, as well as its interaction
with the client–server model. SMP subsystems are still of interest, and they
are becoming increasingly well supported commercially [14,21]. Moreover,
our focus is on data-intensive applications in multimedia and communica-
tions. For these applications, our focus has been primarily on streaming and
client–server programming models for which explicit communication centric
approaches seem most appropriate.
This chapter will introduce the MultiFlex framework specialized at sup-
porting the streaming and client–server programming models. However,
we will focus primarily on our recent streaming programming model and
mapping tools.
7.2.1 Iterative Mapping Flow
MultiFlex supports an iterative process, using initial mapping results to
guide the stepwise refinement and optimization of the application-to-
platform mapping. Different assignment and scheduling strategies can be
employed in this process.
An overview of the MultiFlex toolset, which supports the client–server
and streaming programming models, is given in Figure 7.2. The design
methodology requires three inputs:
Application specification
Client/server Streaming
Abstract
platform
specification
Performance
analysis
model or client–server programming model semantics.
• Application-specific information (e.g., quality-of-service requirements,
measured or estimated execution characteristics of the application,
data I/O characteristics, etc.).
• The abstract platform specification—this information includes the
main characteristics of the target platform which will execute the
application.
An intermediate representation (IR) is used to express the high-level appli-
cation in a language-neutral form. It is translated automatically from one or
more user-level capture environments. The internal structure of the appli-
cation capture is highly inspired by the Fractal component model [23].
Although we have focused mostly on the IR-to-platform mapping stages, we
have experimented with graphical capture from a commercial toolset [7], and
a textual capture language similar to StreamIt [3] has also been experimented
with.
In the MultiFlex approach, the IR is mapped, transformed, and sched-
uled; finally the application is transformed into targeted code that can run
on the platform. There is a flexibility or performance trade-off between what
can be calculated and compiled statically, and what can be evaluated at run-
time. As shown on Figure 7.2, our approach is currently implemented using
a combination of both, allowing a certain degree of adaptive behaviors, while
making use of more powerful offline static tools when possible. Finally, the
MultiFlex visualization and performance analysis tools help to validate the
final results or to provide information for the improvement of the results
through further iterations.
7.2.2 Streaming Programming Model
As introduced above, the streaming programming model [1] has been
designed for use with data-dominated applications. In this computing
model, an application is organized into streams and computational kernels
to expose its inherent locality and concurrency. Streams represent the flow of
7.3 MultiFlex Streaming Mapping Flow
The MultiFlex technology includes support for a range of streaming pro-
gramming model variants. Streaming applications can be used alone or in
interoperation with client–server applications. The MultiFlex streaming tool
flow is illustrated in Figure7.3. The different stages of this flow will be
described in the next sections.
The application mapping begins with the assignment of the application
blocks to the platform resources. The IR transformations consist mainly in
splitting and/or clustering the application blocks; they are performed for
optimization purposes (e.g., memory optimization); the transformations also
imply the insertion of communication mechanisms (e.g., FIFOs, and local
buffers).
The scheduling defines the sharing of a processor between several blocks
of the application. Most of the IR mapping, transforming, and scheduling
is realized statically (at compilation time), rather than dynamically (at run-
time).
The methodology targets large-scale multicore platforms including a
uniform layered communication network based on STMicroelectronics’
network-on-chip (NoC) backbone infrastructure [18] and a small number
of H/W-based communication IPs for efficient data transfer (e.g., stream-
oriented DMAs or message-passing accelerators [9]). Although we consider
our methodology to be compatible with the integration of application-
specific hardware accelerators using high-level hardware synthesis, we are
not targeting such platforms currently.
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 189 2009-10-2
MPSoC Platform Mapping Tools for Data-Dominated Applications 189
Application functional capture
Filter core
C functions
Filter
Target platform
Video
platform
Mobile
platform
Multimedia
platform
FIGURE 7.3
MultiFlex tool flow for streaming applications.
7.3.1 Abstraction Levels
In the MultiFlex methodology, a data-dominated application is gradually
mapped on a multicore platform by passing through several abstractions:
• The application level—at this level, the application is organized as a
set of communicating blocks. The targeted architecture is completely
abstracted.
• The partitioning level—at this level the application blocks are grouped
in partitions; each partition will be executed on a PE of the target
architecture. PEs can be instruction-set programmable processors,
reconfigurable hardware or standard hardware.
• The communication level—at this level, the scheduling and the com-
munication mechanisms used on each processor between the different
blocks forming a partition are detailed.
• The target architecture level—at this level, the final code executed on
the targeted platforms is generated.
Table 7.2 summarizes the different abstractions, models, and tools provided
by MultiFlex in order to map complex data-oriented applications onto mul-
tiprocessor platforms.
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 190 2009-10-2
190 Model-Based Design for Embedded Systems
TABLE 7.2
the results.
• Synchronous client–server block: This block needs to perform one or
many remote procedural calls before being able to push data in the
output interface. It must therefore be scheduled differently than the
simple data-flow block.
• Server block: This block can be executed once all the arguments of the
call are available. Often this type of block can be used to model a H/W
coprocessor.
• Delay memory: This type of block can be used to store a given number
of data tokens (an explicit state).
Figure 7.4 gives the graphical representation of a streaming application cap-
ture which interacts with a client–server application. Here, we focus mostly
on streaming applications.
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 191 2009-10-2
MPSoC Platform Mapping Tools for Data-Dominated Applications 191
Sync. dataflow
semantics
Composite
Memory delay
App interface
Dataflow
interface
Init
Process
End
State
Max elem
Data type
Token rate
Synchronous
c. Assign two blocks to any two different processors
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 192 2009-10-2
192 Model-Based Design for Embedded Systems
7.3.4 The High-Level Platform Specification
The high-level platform specification is an abstraction of the processing,
communication, and storage resources of the target platform. In the current
implementation, the information stored is as follows:
• Number and type of PEs.
• Program and data memory size constraints (for each programmable
PE).
• Information on the NoC topology. Our target platform uses the
STNoC, which is based on the “Spidergon” topology [18]. We include
the latency measures for single and multihop communication.
• Constraints on communication engines: Number of physical links
available for communication with the NoC.
7.3.5 Intermediate Format
MultiFlex relies on intermediate representations (IRs) to capture the applica-
tion, the constraints, and high-level platform descriptions. The topology of
the application—the block declaration and their connectivity—is expressed
using an XML-based intermediate format. It is also used to store task anno-
tations, such as the block execution semantics. Other block annotations are
used for the application profiling and block assignments. Edges are anno-
tated with the communication volume information.
The IR is designed to support the refinement of the application as it
is iteratively mapped to the platform. This implies supporting the mul-
tiple abstraction levels involved in the assignment and mapping process
described in the next sections.
7.3.6 Model Assumptions and Distinctive Features
In this section, we provide more details about the streaming model. This
background information will help in explaining the mapping tools in the next
tasks that use a mix of streaming and client–server constructs. Such tasks
are able to invoke remote services via client–server (DSOC) calls, including
synchronous methods (with return values) that cause the caller task to block,
waiting for an answer.
In addition, we are evaluating the pros and cons of supporting tasks with
unrestricted I/O and very fine-grain communication. To be able to eventu-
ally run several tasks of this nature on the same processor, we may need a
software kernel or make use of hardware threading if the underlying plat-
form provides it.
To be able to choose the correct scheduler to deploy on each PE, we have
introduced semantic tags, which describe the high-level behavior type of each
task. This information is stored in the IR. We have defined a small set of
task types, previously listed in Section7.3.2. This allows a mix of execution
models and firing conditions, thus providing a rich programming environ-
ment. Having clear semantic tags is a way to ensure the mapping tools can
optimize the scheduling and communications on each processor, rather than
systematically supporting all features and be designed for the worst case.
The nonblocking execution is only one characteristic of streaming com-
pared to our DSOC client–server message-passing programming model. As
opposed to DSOC, our streaming programming model does not provide data
marshaling (although, in principle, this could be integrated in the case of het-
erogeneous streaming subsystems).
When compared to asynchronous concurrent components, another dis-
tinction of the streaming model is the data-driven scheduling. In event-
based programming, asynchronous calls (of unknown size) can be generated
during the execution of a single reaction, and those must be queued. The
quantity of events may result in complex triggering protocols to be defined
and implemented by the application programmer. This remains to be a well-
known drawback of event-based systems. With the data-flow approach, the
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 194 2009-10-2
B2
B2
Application graph
B4
B4
B5
B5
B3
B3
Assignment
directives
Communication
volume
Platform
specification
Number and
types of PE
Commn.
resources
and topology
Storage
resources
User
assignment
directives
FIGURE 7.5
MpAssign tool.
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 195 2009-10-2
MPSoC Platform Mapping Tools for Data-Dominated Applications 195
The inter-processor communication cost is given by the data volume
proc
+w
2
∗C
comm
+w
3
∗C
succ
(7.1)
where
C
proc
is the additional average processing cost required when the task t is
assigned to processor p
C
comm
is the communication cost required for the communication of task t
with the preceding tasks
C
succ
represents a look-ahead cost concerning the successor tasks, the min-
imal cost estimate of mapping a number of successor tasks
This assumes state space exploration for a predefined look-ahead depth. w
i
represents the weight associated with each cost factor (C
proc
, C
comm
,and
B3
B3
B1
B2
LB
1/2
Control
I/F
PE1 PE2
PE1
PE2
Abstract comm.
services
Scheduler
shell
Local
bindings
(buffer)
Global
bindings
(fifo)
FIGURE 7.6
MpCompose tool.
• The application capture
• The platform description
• The set of directives, optionally generated by the MpAssign tool
The MpCompose tool relies on a library of abstract communication services
that provide different communication mechanisms that can be inserted in
the application graph. Three types of services are currently supported by
MpCompose:
cating processors use the platform-specific features to actually implement
the buffer-based communication at runtime. The set of independent compo-
nent definitions allow a monoprocessor component–based infrastructure to
be used for compilation.
7.4.3 Component Back-End Compilation
Starting from the set of processor graphs, the component back-end gener-
ates the targeted code that can run on the platform. MultiFlex tools currently
target the Fractal component model, and more specifically its C implemen-
tation [19]. Even though this toolset supports features such as a binding
controller and a life cycle manager to allow dynamic insertion–removal of
components in the graph at runtime, we are not currently using any of the
dynamic features of components, such as runtime elaboration, introspection,
etc., mainly for code size reasons. Nevertheless, we expect multimedia appli-
cation requirements to push toward this direction. Until then, we mainly use
the component model as a back-end to represent the software architecture to
be built on each processor. MpCompose generates one architecture (.fractal)
file describing the components and their topology for each CPU. The Fractal
tools will generate the required C glue code to bind components, to create
block instance structures and will compile all the code into an executable
for the specified processor by invoking the target cross-compiler. This build
process is invoked for each PE, thus producing a binary for each processor.
7.4.4 Runtime Support Components
The main services provided by the MultiFlex components at runtime are
scheduling and communication. The scheduler in fact controls both the com-
munication components and the application tasks.
The scheduler interleaves communication and processing at the block
level. For each input port, the scheduler scans if there is available data in the
local memory. If not, it checks if the input FIFO is empty. If not, the sched-
uler orders the input FIFO to perform the transfer into local memory. This is
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 198 2009-10-2
(70)
rx_vtd_1
(205)
rx_vtd_0
(205)
tx_crc_1
(70)
aa
a
aa
b
bc
bb
b
b
ccd
dcc
aa
b
a
tx_vtc_0
(75)
tx_vtc_1
(75)
tx_rm
(40)
tx_fi
(80)
rx_fi
(80)
e
FIGURE 7.7
3G application block diagram. The communication volumes are a = 260,
b = 3136, c = 1280, and d = 768.
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 199 2009-10-2
MPSoC Platform Mapping Tools for Data-Dominated Applications 199
are annotated with numbers that represent the estimated processing load,
while each edge has an estimated communication volume given in the figure
caption. These numbers are extracted from [13], where the computation cost
corresponds to a latency (in microseconds) for a PE to execute one iteration of
the corresponding functional block, while the edge cost corresponds to the
volume of data (in 16-bit words) transferred at each iteration between the
connected functional blocks.
A manual and static mapping of this application is presented in [13],
using a 2D-mesh of 46 PEs, where each PE is executing only one of the
functional blocks, for which some of them are duplicated to expose more
potential parallel processing. We use this example in this chapter mainly for
illustrative purposes, to show that MpAssign can be used to explore auto-
matically different mappings where, optionally, multiple functional blocks
can be mapped on the same PE to balance the processing load. To expose
more potential parallel processing, we create a set of functionally equivalent
application graphs of the above reference application in which we duplicate
the transmitter and receiver processing chains several times. In our experi-
ments, four versions have been explored:
• v1: 1 transmitter and 1 receiver (original reference application)
• v2: 2 transmitters and 2 receivers
• v3: 3 transmitters and 3 receivers
• v4: 4 transmitters and 4 receivers
The version v1 will be mapped on a 16 processor architecture (v1/16). The
version v2 will be mapped on a 16 processor architecture (v2/16) and a
load and communications.
• c3
(
w1 = 10, w2 = 1000
)
: This weight combination tends to minimize
the communications.
Each of the six configurations described above will be tested with these three
weight parameter combinations, which results in a total of 18 experiments.
For each experiment, we will extract the following statistics:
• Load variance (LV), given by Equation 7.2, where for each mapping
solution x, load
(
PE
i
)
is the sum of the task costs assigned to the PE
i
,
avgload is the average load defined by the sum of all task costs divided
by the number of PE, and p is the PE number.
LV(x) =
p−1
i=0
Figure 7.10 presents the resulting TC statistic of the different application
configurations and mapping weight combinations. This time, the best results
are given by the mapping weight combination c3. This is predictable because
c3 promotes solutions with low communication costs.
Figure 7.11 presents the resulting MC statistic for the different application
configurations and mapping weight combinations. Contrary to Figure 7.10,
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 201 2009-10-2
MPSoC Platform Mapping Tools for Data-Dominated Applications 201
400
300
200
100
0
v1/16 v2/16 v2/32 v3/32 v3/48 v4/48
c1
c2
c3
FIGURE 7.8
Load variance.
1500
1000
500
0
v1/16 v2/16 v2/32 v3/32 v3/48 v4/48
c3
c2
c1
FIGURE 7.9
Maximal load (at any PE).
0
heuristics that needs improvement.
These results show that the selection of the final task assignment solution
really depends on the target performance, architecture hardware budget,
and acceptable bandwidth for communications. Nevertheless, the MpAs-
sign tool, by generating an interesting subset of mapping solutions, allows
architects to concentrate on the more detailed and time consuming analysis,
rather than trying to find task assignment solutions. At this level, the various
costs remain estimates based on platform and application abstractions and
assumptions. For a candidate solution, the refinement can continue down to
the target architecture level.
7.5.2 Refinement and Simulation
For a given solution, MpAssign provides an output text file that contains
the task assignment directives, the resulting average load of each processor
and the cost for each inter-processor communication. The mapping results
are also available in a graphical representation. For the purpose of a sim-
pler display, we have created a mapping example of only eight processors
that is shown in Figure 7.12. We see that the dataflow source and sink blocks
have been assigned to dedicated I/O processors (the first and the last respec-
tively), thus keeping the data intensive tasks on the remaining other 6 PEs.
The intent was to isolate on different processors the interesting task set that
we will want to profile later on, during the simulation.
Starting from the mapping solution presented in Figure 7.12, MpCom-
pose uses those task assignment directives to perform the software synthesis.
Final compilation is carried out by the component back-end.
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 203 2009-10-2
MPSoC Platform Mapping Tools for Data-Dominated Applications 203
Geometric location corresponds
to PE assignment (here PE2)
Block height proportional to
task load (here 80us)
In this chapter, we elaborated on these challenges and introduced the
MP-SoC platform mapping technologies under development at STMicroelec-
tronics called MultiFlex, with emphasis on assignment and scheduling of
streaming applications. The use of these tools, integrated in a design flow
that proposes a stepwise mapping refinement, was illustrated with a 3G
base-station application example.
While keeping the application code unmodified, we have seen how the
MultiFlex tools refine an IR of the application, based on information related
to the application properties, the high-level platform characteristics, as well
as user mapping constraints.
The MpAssign tool provides mapping solutions that minimize (1) the
communications on the NoC and (2) the processing Load Variance. By chang-
ing the weight factors, the user can direct the heuristics to favor different
classes of solutions.
For a chosen solution, the MpCompose tool provides the required intra-
and inter-processor communications and local task schedulers, by instantiat-
ing generic components with the proper specific configuration.
Finally, for each processor, a self-contained component description gen-
erated by MpCompose can be given to the component compilation back-end,
which takes care of implementing the low-level component glue code and
invoking the compiler for final compilation and linking, ready for simula-
tion and analysis.
This methodology supports a user-driven, iterative approach that auto-
mates the mechanical mapping transformations. Performance and profil-
ing information obtained from a given platform mapping iteration can be
exploited by the user and the mapping tools to guide the next optimization
cycle.
7.6.1 Outlook
We are currently looking at alternative approaches for the implementation
of the MpAssign tool, such as evolutionary algorithms, to cope with the scal-
8. Data marshalling, http://www.webopedia.com/TERM/D/data_mars-
halling.html
9. P. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, and D. Lyonnard,
A multi-processor SoC platform and tools for communications applica-
tions, in Embedded Systems Handbook, CRC Press, Boca Raton, FL, 2004.
10. J. Hu and R. Marculescu, Energy-aware communication and task
scheduling for network-on-chip architectures under real-time con-
straints, in Proceedings of DATE 2004, Pairs, France.
11. M. Paganini, Nomadik
R
: A Mobile multimedia application proces-
sor platform, in Proceedings of ASP-DAC (Asia and South Pacific Design
Automation Conference), Yokohama, Japan, January 2007, pp. 749–750.
12. P.G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, D. Lyonnard,
O. Benny, B. Lavigueur, D. Lo, G. Beltrame, V. Gagné, and G. Nicolescu,
Parallel programming models or a multi-processor SoC platform applied
to networking and multimedia, IEEE Transactions on VLSI Journal, 14(7),
July 2006, 667–680.
Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 206 2009-10-2
206 Model-Based Design for Embedded Systems
13. D. Wiklund and D. Liu, Design, mapping, and simulations of a 3G WCD-
MA/FDD base station using network on chip, in Proceedings of the Fifth
International Workshop on System-on-Chip for Real-Time Applications, Banff,
Canada, July 2005, pp. 252–256.
14. ARM Cortex-A9 MPCore, available online at http://www.arm.com/
products/CPUs/ARMCortex-A9_MPCore.html
15. D.R. Butenhof, Programming with POSIX Threads, Addison-Wesley,
Reading, MA, 1997.
16. R. Ben-Natan, CORBA: A Guide to Common Object Request Broker Architec-
CONTENTS
8.1 Introduction 207
8.2 RelatedWork 209
8.3 ProposedWorkflowofEmbeddedSoftwareDevelopment 211
8.4 CommonIntermediateCode 212
8.4.1 Task Code 212
8.4.2 Architecture Information File 214
8.5 CICTranslator 215
8.5.1 Generic API Translation 216
8.5.2 HW-Interfacing Code Generation 217
8.5.3 OpenMP Translator 217
8.5.4 Scheduling Code Generation . 218
8.6 PreliminaryExperiments 220
8.6.1 Design Space Exploration 220
8.6.2 HW-Interfacing Code Generation 221
8.6.3 Scheduling Code Generation . 223
8.6.4 Productivity Analysis 224
8.7 Conclusion 227
References 228
8.1 Introduction
As semiconductor and communication technologies improve continuously,
we can make very powerful embedded hardware by integrating many pro-
cessing elements so that a system with multiple processing elements inte-
grated in a single chip, called MPSoC (multiprocessor system on chip), is
becoming popular. While extensive research has been performed on the
This chapter is an updated version of the following paper: S. Kwon, Y. Kim, W. Jeun, S. Ha,
and Y Paek, A retargetable parallel-programming framework for MPSoC, ACM Transactions on
Design Automation of Electronic Systems (TODAES), Vol. 13, No. 3, Article 39, July 2008.
207
Nicolescu/Model-Based Design for Embedded Systems 67842_C008 Finals Page 208 2009-10-14
specific target architecture and design constraints. If the task partition or
communication architecture is changed, significant coding effort is needed
to rewrite the optimized code. Another difficulty of programming with MPI
and OpenMP is that it is the programmer’s responsibility to confirm the satis-
faction of the design constraints, such as memory requirements and real-time
constraints, in the manually designed code.
The current practice of parallel-embedded software is multithreaded pro-
gramming with lock-based synchronization, considering all target specific
features. The same application should be rewritten if the target is changed.
Moreover, it is well known that debugging and testing a multithreaded pro-
gram is extremely difficult. Another effort of parallel programming is to use
a parallelizing compiler that creates a parallel program from a sequential C
code. But automatic parallelization of a C code has been successful only for
a limited class of applications after a long period of extensive research [7].
In order to increase the design productivity of embedded software, we
propose a novel methodology for embedded software design based on a
Nicolescu/Model-Based Design for Embedded Systems 67842_C008 Finals Page 209 2009-10-14
Retargetable, Embedded Software Design Methodology 209
parallel programming model, called a common intermediate code (CIC). In
a CIC, the functional and data parallelism of application tasks are specified
independent of the target architecture and design constraints. Information on
the target architecture and the design constraints is separately described in
an xml-style file, called the architecture information file. Based on this informa-
tion, the programmer maps tasks to processing components either manually
or automatically. Then, the CIC translator automatically translates the task
codes in the CIC model into the final parallel code, following the partition-
ing decision. If a new partitioning decision is made, the programmer need
not modify the task codes, only the partitioning information. The CIC trans-
lator automatically generates the newly optimized code from the modified
architecture information file.
To be retargetable, the interface code between tasks should be automat-
ically generated after a partitioning decision on the target architecture is
made. Since the interfacing between the processing units is one of the most
important factors that affect system performance, some research has focused
on this interfacing (including HW–SW components). Wolf et al. [10] defined
a task-transaction-level (TTL) interface for integrating HW–SW components.
In the logical model for TTL intertask communication, a task is connected
to a channel via a port, and it communicates with other tasks through chan-
nels by transferring tokens. In this model, tasks call target-independent TTL
interface functions on their ports to communicate with other tasks. If the TTL
interface functions are defined optimally for each target architecture, the pro-
gram becomes retargetable. This approach can be integrated in the proposed
framework.
For retargetable interface code generation, Jerraya et al. [11] proposed a
parallel programming model to abstract both HW and SW interfaces. They
defined three layers of SW architecture: hardware abstraction layer (HAL),
hardware-dependent software (HdS), and multithreaded application. To
interface between software and hardware, translation to application
programming interfaces (APIs) of different abstraction models should be
performed. This work is complementary to ours.
Compared with related work, the proposed approach has the following
characteristics that make it more suitable for an MPSoC architecture:
1. We specifically concentrate on the retargetability of the software devel-
opment framework and suggest CIC as a parallel programming model.
The main idea of CIC is the separation of the algorithm specification and
its implementation. CIC consists of two sections: the tasks codes and the
architecture information file. An application programmer writes for all
tasks considering the potential parallelism of the application itself, inde-
pendent of the target architecture. Based on the target architecture, we
determine how to exploit the parallelism in implementation.