The challenges of next-gen multicore networks-on-chip systems: Part 6 | Network Systems Designline

Get the latest news, products and how-to information on network systems. Sign up for the Network Systems DesignLine newsletter, a weekly e-mail guide dedicated to the needs of engineers developing networking equipment and components. Here is our RSS feed.








 Network Systems DesignLine » How-To » IP Networking

 
 HOW-TO : IP Networking

The challenges of next-gen multicore networks-on-chip systems: Part 6

Communications-exposed programming
Print This Story Send As Email Discuss This Story Reprints



Courtesy of Embedded.com

Rate this article
WORSE | BETTER
1 2 3 4 5
Parallel programming libraries as described in the previous sections are a viable short-term answer to the quest for increased software productivity for NoC platforms, but ultimately they are only partial extensions to traditional sequential programming languages.

If we reason more abstractly about parallel programming environments as syntactic frameworks for expressing concurrency, today we are in parallel computing at a level similar to sequential programming at the dawn of the structured programming era. Locks and threads are basic structured constructs of concurrency. Similarly to what happened to sequential programming languages with the object-oriented revolution, there is a need for higher-level abstractions for building large-scale concurrent programs.

The evolution toward high-level concurrent software development environments is going to be a difficult but unavoidable step, because parallelism exploitation will be essential for achieving higher performance. Commercial and systems programming languages will be assessed based on their support for concurrent programming. Existing languages, such as C, are likely to be extended with concurrency support beyond simple library extensions.

Languages that will not evolve to support parallel programming will gradually become useless in high-performance application. On the other hand, parallel programming is demonstrably more difficult than sequential. For example, simultaneous analysis of program context and synchronization is proved to be undecidable [31]. Moreover, human programmers experience many difficulties in reasoning in terms of concurrency, especially when analyzing the side effects of the many possible interleaving of parallel tasks.

A well-tried approach to raise the abstraction level without compromising efficiency is to reduce the expressiveness of the language. Taking a more positive viewpoint, this approach implies focusing the software development framework on a model of computation which is both abstract and efficient. The price to be paid is in lack of generality and flexibility, but this is an acceptable loss if the model of computation is well-suited to describe a wide spectrum of market-relevant applications.

The most successful example of this approach in the area of parallel programming for NoCs is stream computing. The key idea in stream computing is to organize an application into streams and computational kernels to expose its inherent locality and concurrency. Streams represent the flow of data, while kernels are computational tasks that manipulate and transform data. Many signal- and media-processing algorithms can easily be seen as sequences of transformations applied on a data stream.

In this case, streams are a natural way of expressing the functionality of the application, while at the same time, they match very well the hardware execution engines.

More formally, a stream computation can be represented by a directed graph, where nodes represent computations and edge represent data communication. Both nodes and edges are annotated. Nodes are annotated with the function that they execute, while edges are annotated with the data they carry. The stream model fully exposes inter-task communication. Intuitively, this is a very desirable feature because it forces the programmer to focus on communication, not only on computation.

By analyzing the communication edges in a stream computation, it is possible to obtain precise estimates of the communication requirements for a given application. This greatly simplifies analysis and mapping of application onto parallel architectures. In fact, one of the most challenging issues in parallel programming is the analysis of communication, which is critical for performance optimization, but it is often obscured by the computation-oriented semantic of traditional programming languages.

Stream-oriented models of computations have been studied since the early 1960s (the interested reader is referred to the good survey by Stephen [32]), under the name of dataflow. Numerous languages (e.g., LUCID) have been developed to support dataflow semantics. Probably, the best-known semantic for dataflow has been proposed by Kahn, and it is known as kahn process network (KPN) [32]. In the KPN formalism, computational nodes perform arbitrary computations and communicate through unbounded FIFOs associated with the edges.

A computation consumes one or more data tokens (of arbitrary size) on the input FIFOs and produces one or more data tokens on the output. Since FIFOs are unbounded, there is no problem in blocking or overflow of full FIFOs. Computational nodes can be blocked only on empty FIFOs. The KPN have been proved to be Turing-equivalent, hence they are expressive as any sequential semantic, but they have a much more explicit model for parallel execution of tasks and inter-task communication.

One important result proved by Kahn is that the result of the execution of a KPN is invariant in the time order (schedule) of the node computations, provided that the partial order rules implicitly created by the FIFO semantics on the channels are respected.

Clearly KPNs are a theoretical model, which is not implementable in practice. The most obvious unrealistic assumption is that of unbounded FIFOs. Any real-life system must use bounded FIFOs. Ensuring schedulability of a computation with bounded FIFO space is a much tougher theoretical problem, and it often solved by restricting the execution semantic of the KPN model.

One of the most well-known KPN specializations featuring an efficient schedulability analysis is Synchronous Dataflow (SDF). In SDF computational nodes (called actors) have static, non-uniform rates of execution (firing), firing is atomic and data driven. This model is implementable on real single- and multi-processors and its properties have been studied extensively [33]. Unfortunately, SDF is not Turing equivalent, in other words there are computations that can be expressed by KPN and cannot be performed in and SDF model.

The mathematical properties of dataflow have been extensively studied by the theoretical computer science community [32], and the importance of a solid theoretical foundation for program verification is widely recognized and understood. Dataflow/streaming languages and programming environments move from these solid theoretical foundations, but they usually make compromises to provide flexibility and to ease code development.

Graphical dataflow programming environments
Many commercial environments supporting stream programming are available and widely adopted (e.g., Mathlab/Simulink, SPW, COSSAP, etc.) and even more viable prototypes have been demonstrated in research (e.g., Ptolemy). Developing streaming applications in these environments is relatively straightforward. The programmer usually works using a graphic user interface (GUI) to specify the dataflow graph nodes and dependencies. A large library of computational nodes, implementing common processing tasks is available (for instance, an FIR filter block, with programmable coefficients).

For instance, the Mathlab/Simulink environment provides hundreds of pre-defined actors that can be readily instantiated without any code development. When the behavior of an actor is not available in the library, the programmer moves to a text-based interface where it can specify the node behavior, using pre-specified function calls to read input tokens and produce output tokens. Computation is usually specified in an imperative language, such as C.

Graphical dataflow environments are very effective in enhancing the productivity of the software developer, and this explains their widespread usage in software prototyping. However, in practice they have not been used for final product-code implementation, because they generate quite inefficient code for the target MPSoC architectures. In order to understand the reasons for this shortcoming, we need to look deeper in implementation details.

Probably the most critical issue in translating dataflow specifications into efficient NoC platform code is memory management. Let us consider as an example a video-processing pipeline that operates on a large image frame (which can reach multi-megabyte size for high-definition color images). Each actor in the pipeline performs processing on an incoming frame and passes it to the following actor.

Let us assume that, we parallelize the execution by allocating each frame-processing actor on a different processor on a multi-core architecture. The token-passing semantic hides a performance pitfall: passing a frame token implicitly specifies a multi-megabyte data transfer. If the processors have private memories, passing a frame requires a very expensive memory-to-memory copy.

Figure 7.9. Partitioning a large memory frame to improve streaming performance.

Even worse, if the frame does not fit into the on-chip private memory of each processor, an external memory overflow area will have to be defined, and each token passing will then imply a significant amount of off-chip memory traffic, which is extremely slow and energy-inefficient. Moreover, it creates a destination contention bottleneck that no NoC architecture can alleviate, as all background memory transfers must go through a singleNoCtarget interface to off-chip memory.

As a result, the convenient token-passing abstraction breaks down in terms of efficiency, as execution becomes severely memory and communication bound. Intuitively, it is quite clear that this memory bottleneck can be eased if computation is specified in a more memory-aware fashion.

For instance, the computation can be performed in sub-sections of the image by finergranularity task, as shown in Figure 7.9 above. In this case, the full frame resides in main (off-chip) memory, and the image is processed one sub-unit ad a time (e.g., stripe and sub-square). Tokens have much finer granularity and they can be passed along without clogging the on-chip interconnects and the Ios.

Unfortunately, traditional prototyping-oriented dataflow environments do not offer analysis and optimization features that help programmers in this difficult balancing act. Hence, the traditional design flow is that an executable specification obtained in a dataflow programming environment will have to manually translated in efficient code in a slow and painful manual tuning process.

Another critical issue with dataflow programming is global datadependent control flow. Control flow (conditionals, loops) can be specified in a dataflow environment as a part of actor behavior. For instance, conditional execution of an actor's computation can be specified by guarding its code with a conditional. However, in many practical applications a single control-flow decision impacts the execution of a number of actors.

Consider, for example, MPEGvideo stream processing: different computations are performed on different frames, depending on the frame type (B, I, T). Even worse, asynchronous events (e.g., the user pressing a button) may imply significant changes in the computation to be performed by actors. The "pure'' dataflow semantics have been extended to deal with global control flow and asynchronous events, but the efficient mapping of these extensions on target platform hardware is still not mature.

Stream Languages
Stream languages were developed with the purpose of closing the gap between abstraction and efficiency. In other words, they aim at providing high-level abstraction to the programmer for controlling stream manipulation and processing. At the same time they syntactically enforce a set of rules to facilitate the creation of complex programs that can be compiled very efficiently onto the hardware platform targets.

On one hand, developing a complex stream-processing application using a stream language should be as fast and productive as using a graphical prototyping environment. On the other hand, the compilation of such a program should produce a very efficient executable without needing slow and error-prone manual translation.

The basic rationale of this approach is to create languages that are a better match for communication-dominated NoC-based platforms than traditional imperative languages such as plain C or C++. This matching is obtained by making parallelism and communication explicit at the language level, while at the same time providing high-level primitives to manage them.

Efficient automated generation of executable code is facilitated by restricting the freedom given to the programmer in accessing global resources (e.g., global variables), and in specifying arbitrarily complex control and data flow. In order to give concreteness to these observations, we will now analyze in some details one of the most welldeveloped stream languages, StreamIt, proposed by the MIT Group lead by S. Amarasinghe [34].

A case study: StreamIt
StreamIt is a language explicitly designed for stream application programming. It defines a set constructs that expose the parallelism and communication of streaming applications, but it avoids explicit references to the topology of the target architecture and the granularity of its computational engines.

A program consists of a number of simple blocks which communicate data in a limited set of patterns. The atomic block in StreamIt is called Filter: it specifies a single-input"single-output computations. A filter body is specified using a single-threaded JAVA-like procedural language. This is the work function, which specifies the Filter's steady-state computation. An example of a simple filter is shown in Figure 7.10, below.

Figure 7.10. Filter construct in StreamIT

A Filter communicates with connected blocks through input and output FIFO queues. These channels support three primitive access functions: (i) pop(), which removes an item from the output end of the channel and returns its value, (ii) peek(i), which returns the value of the item i spaces from the end of the FIFO, without removing it and (iii) push(x) writes the value of x to the front of the channel. If x is an object, a copy is enqueued on the channel.

Filter input"output operations have statically (compile-time) specified rates of production and consumption of FIFOs elements. Note that the work function is triggered when enough samples are in the input FIFO to allow all input queue peek operations specified by its body. Filters also contain initialization functions, which serve two purposes: they set the initial state of the Filter (such as, for instance, the taps coefficients in a FIR filter), second, they specify the Filter's I/O types and data rates to the compiler.

Filters are compositional. They can be connected via three composite blocks: pipelines, split-joins and feedback loops. Each of these structures also has a single entry and exit points, allowing modular recursive composition. The Pipeline construct builds a sequence of streams. It has an initialization function, whose primary purpose is to add component streams in a specified sequence.

There is no work function, because the behavior of a pipeline is fully specified by its components. The SplitJoin construct specifies independent parallel streams that diverge from a common splitter will eventually merge into a common joiner. Similarly to Pipeline, the components of a SplitJoin are specified with successive calls to add from the init function. The splitter specifies how to distribute items from the input of the SplitJoin to the components.

Three types of split are supported: (i) Duplicate, which replicates each data item and sends a copy to each parallel stream, (ii) RoundRobin, which assigns a weight to each parallel stream and sends a number of items to each stream equal to its weight before moving to the next and (iii) Null, which means that the parallel components do not require inputs.

The joiner indicates how the outputs of the parallel streams should be interleaved on the output channel of the SplitJoin. There are two joiner types: RoundRobin and Null. The FeedbackLoop construct enable the specification cycles in a stream graph. It contains:

(i) a body stream, which is the block around which the "feedback path'' will be closed,
(ii) a loop stream, to perform computation on the feedback path,
(iii) a splitter, to distribute items between the feedback path and the output channel at the bottom of the loop, (iii) a joiner, which merges items between the feedback path and the input channel. All these components are specified in the initialization function. Upon initialization, there are no items on the feedback path and they can be provided by an initPath function.

An important characteristic of the StreamIt language is that it cannot specify arbitrary data flow graph. Only hierarchical composition of pipelines, splitjoins and feedback loops can be specified. This greatly facilitates compiler analysis and optimization, while forcing the programmer to build well-organized code.

StreamIt also support constructs for passing irregular, low-volume control information, called messages, between filters and streams. It is possible to send a message from within the body of a filter work function to modify a parameter setting in other filters. The sender continues execution while the message is being delivered: message delivery is asynchronous, and it returns no value.

One important characteristics of the message delivery abstraction is that it is possible to specify latency bounds for the delivery times. Recall that execution is parallel and asynchronous in dataflow languages, and there is no global notion of time. Hence, latencies can be specified only relatively to data token.

It is therefore possible to specify a latency bound in terms of number of data token starting from a reference token. For instance, it is possible to specify that a message arrives to the receivers work function K evaluations after the arrival of the data token that has triggered message issue by the sender. Various types of messages are supported, including broadcast messages and re-initialization messages.

From this cursory view of the StreamIt language, we can draw a few observations. First, data communication between filters and filter compositions are completely explicit can be analyzed at compile time. Second, data production and consumption is statically defined at compile time. This greatly eases allocation and scheduling of hardware resources at compilation time (e.g., FIFO sizing an channel sizing).

Moreover, the language's strictly defined firing rules make it much easier to schedule computations and to guard them. Overall, the language is expressive enough to model non-trivial computations, and its strong structure makes it much easier to generate highly efficient executable on an NoC-based tiled architecture.

However, there are some limitations that clearly point to the need for significant additional work in this area. First, there is no way to keep the size of data tokens under control. Hence, the programmer can easily specify a behavior where very large data tokens are passed around. If data tokens cannot fit on local storage associated to computational nodes, a lot of traffic from"to global memories will be created, with obvious efficiency losses. Second, the language constructs are quite limited and a lot of the behavior has to be fully at compile time.

Data-dependent global control flow is very difficult to express (messages help, but they are cumbersome). It is quite clear than many choices have been made to limit expressiveness in an effort to ease the job of the compiler. The suitability of StreamIt to describe truly complex application behaviors is therefore not yet proved.

Next in Part 7: Computer-aided software development tools
To read Part 1, go to Why on-chip networking?
To read Part 2, go to SoC objectives and NoC needs.
To read Part 3, go to Once over lightly
To read Part 4 go to  Networks-on-chip programming issues
To read Part 5, go to Task-Level  Parallel  Programming

Used with the permission of the publisher, Newnes/Elsevier, this series of seven articles is based on material from "Networks On Chips: Technology and Tools," by Luca Benini and Giovanni De Micheli.

Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems  Center at EPF in Lausanne, Switzerland.

References
[1] F. Boekhorst, "Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon?,'' International Solid-State Circuits Conference, Vol. 1, 2002, pp. 28"31.

[2] G. Declerck, "A Look into the Future of Nanoelectronics,'' IEEE Symposium on VLSI Technology, 2005, pp. 6"10.

[3] W. Weber, J. Rabaey and E. Aarts (Eds.), Ambient Intelligence. Springer, Berlin, Germany, 2005.

[4] S. Borkar, et al., "Platform 2015: Intel Processor and Platform Evolution for the Next Decade,'' INTEL White Paper 2005 (PDF).

[5] D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999.

[6] L. Hennessy and D. Patterson, Computer Architecture " A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.

[7] Philips Semiconductor, Philips Nexperia Platform.

[8] M. Rutten, et al., "Eclipse: Heterogeneous Multiprocessor Architecture for Flexible Media Processing,'' International Conference on Parallel and Distributed Processing, 2002, pp. 39"50.

[9] ARM Ltd, MPCore Multiprocessors Family

[10] B. Ackland, et al., "A Single Chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP,'' IEEE Journal of Solid State Circuits, Vol. 35, No. 3, 2000, pp. 412"424.

[11] G. Strano, S. Tiralongo and C. Pistritto, "OCP/STBUS Plug-in Methodology,'' GSPX Conference 2004.

[12] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini and R. Zafalon, "Analyzing On-Chip Communication in a MPSoC Environment,'' Design and Test in Europe Conference (DATE) 2004, pp. 752"757.

[13] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor and J. M. Mendias, "An Integrated Hardware/Software Approach For Run-Time Scratch-pad Management,'' in Design Automation Conference, Vol. 2, 2004, pp. 238"243.

[14] D. Pham, et al., "The Design and Implementation of a First-generation CELL Processor,'' in IEEE International Solid-State Circuits Conference, Vol. 1, 2005, pp. 184"592.

[15] L. Hammond, et al., "The Stanford Hydra CMP,'' IEEE Micro, Vol. 20, No. 2, 2000, pp. 71"84.

[16] L. Barroso, et al., "Piranha: A Scalable Architecture Based on Single-chip Multiprocessing,'' in International Symposium on Computer Architecture, 2000, pp. 282"293.

[17] Intel Semiconductor, IXP2850 Network Processor,

[18] STMicroelectronics, Nomadik Platform

[19] Texas Instruments, "OMAP5910 Platform''

[20] M. Banikazemi, R. Govindaraju, R. Blackmore and D. Panda. "MP-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP systems'', IEEE transactions Parallel and Distributed Systems, Vol. 12, No. 10, 2001, pp. 1081"1093.

[21] W. Lee, W. Dally, S. Keckler, N. Carter and A. Chang, "An Efficient Protected Message Interface,'' IEEE Computer, Vol. 31, No. 11, 1998, pp. 68"75.

[22] U. Ramachandran, M. Solomon and M. Vernon, "Hardware Support for Interprocess Communication,'' IEEE transactions Parallel and Distributed Systems, Vol. 1, No. 3, 1990, pp. 318"329.

[23] H. Arakida et al., "A 160 mW, 80 nA Standby, MPEG-4 Audiovisual LSI 16Mb Embedded DRAM and a 5 GOPS Adaptive Post Filter,'' IEEE International Solid-State Circuits Conference 2003, pp. 62"63.

[24] F. Gilbert, M. Thul and N. When, "Communication Centric Architectures for Turbo-decoding on Embedded Multiprocessors,'' Design and Test in Europe Conference 2003, pp. 356"351.

[25] S. Hand, A. Baghdadi, M. Bonacio, S. Chae and A. Jerraya, "An Efficient Scalable and Flexible Data Transfer Architectures for Multiprocessor SoC with Massive Distributed Memory,'' Design Automation Conference 2004, pp. 250"255.

[26] P. Paulin, C. Pilkington, E. Bensoudane, "StepNP: A system-level Exploration Platform for Network Processors,'' IEEE Design and Test of Computers, Vol. 19, No. 6, 2002, pp. 17"26.

[27] P. Stenström, "A Survey of Cache Coherence Schemes for Multiprocessors,'' IEEE Computer, Vol. 23, No. 6, 1990, pp. 12"24.

[28] M. Tomasevic, V.M. Milutinovic, "Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors,'' IEEE Micro, Vol. 14, No. 5"6, 1994, pp. 52"59.

[29] I. Tartalja, V.M. Milutinovic, "Classifying Software-Based Cache Coherence Solutions,'' IEEE Software, Vol. 14, No. 3, 1997, pp. 90"101.

[30] P. Kongetira, K. Aingaran and K. Olukotun, "Niagara: a 32-way Multithreaded Sparc Processor,'' IEEE Micro, Vol. 25, No. 2, 2005, pp. 21"29.

[31] H. Sutter and J. Larus, "Software and the Concurrency Revolution,'' ACM Queue, Vol. 3, No. 7, 2005, pp. 54"62.

[32] R. Stephens, "A Survey of Stream Processing,'' Acta Informatica, Vol. 34, No. 7, 1997, pp. 491"541.

[33] E. Lee and D. Messerschmitt, "Pipeline Interleaved Programmable DSP's: Synchronous Data Flow Programming,'' IEEE Transactions on Signal Processing, Vol. 35, No. 9, 1987, pp. 1334"1345.

[34] W. Thies, et al., "Language and Compiler Design for Streaming Applications,'' IEEE International Parallel and Distributed Processing Symposium, 2004, pp. 201.

[35] Silicon Hive, AVISPA-CH

[36] M. Bekoij, Constraint Driven Operation Assignment for Retargetable VLIW Compilers, Ph.D. Dissertation, 2004.

[37] R. Allen, K. Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan-Kaufman, 2001.

[38] P. Faraboschi, J. Fisher and C. Young, "Instruction Scheduling for Instruction Level Parallel Processors,'' Proceedings of the IEEE, vol. 89, No. 11, 2001, pp. 1638"1659.

[39] M. Taylor, W. Lee, S. Amarasinghe and A. Agarwal, "Scalar Operand Networks,'' IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 2, 2005, pp. 145"162.

[40] P. Mattson, et al., "Communication Scheduling,'' ACM Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 82"92.

[41] M. Sato, "OpenMP: Parallel Programming API for Shared Memory Multiprocessors and On-chip Multiprocessors,'' IEEE International Symposium on System Synthesis, 2002, pp. 109"111.

[42] N. Genko, D. Atienza, G. De Micheli, J. Mendias, R. Hermida, F. Catthoor, "A Complete Network-on-chip Emulation Framework,'' Design, Automation and Test in Europe, Vol. 1, 2005, pp. 246"251.

[43] C. Shin, et al., "Fast Exploration of Parameterized Bus Architecture for Communication-centric SoC Design,'' Design, Automation and Test in Europe, Vol. 1, 2004, pp. 352"357.

[44] M. Michael, "Scalable, Lock-free Dynamic memory Allocation,'' ACM Conference on Programming Languages Design and Implementation, 2004, pp.110"122.




Print This Story Send As Email Discuss This Story Reprints

 
eSearch  

 Top 5 Most Read
 How-To Stories
1. 2. 3. 4. 5.

 Top 5 Most Read
 News Stories
1. 2. 3. 4. 5.

  • Introduction to Optical Transmission Systems

  • Optimizing Embedded Systems for Broadband 10 Gigabit Ethernet Connectivity

  • Interfacing a DS3231 with an 8051-Type Microcontroller

  • The entire library >>  

     
     Top 5 Most Read
     Product Stories
    1. 2. 3. 4. 5.

     Sponsor

    EE Times TechCareers
    Search Jobs

    Enter Keyword(s):


    Function:


    State:
      

    Post Your Resume
    -----------------
    Employers Area
    Most Recent Posts More career-related news, resources and job postings for technology professionals

     Tech Library
    ¤ Looking for the appropriate Industry Association? This comprehensive, up-to-date list will take you to the right Web site for the help you need.

    ¤ Got a question about a standard? Here are direct links to resources detailing the industry's most important communications standards.

    ¤ Freshen up on technology, new and old, with these links to interesting and informative tutorials.

    More from TechLibrary

    Welcome to our DesignLine network of web communities. On these sites, we provide practical how-to technical information for engineers and engineering managers involved in Automotive,audio, DSP, DTV, EDA, Industrial Control, Mobile Handset, Power Management, Programmable Logic,RF,Video, and Wireless networking design. Check out the sites and let us know your thoughts.
     



    Career Center | CommsDesign.com | Embedded.com | EE Times | TechOnline
    Planet Analog | DeepChip | eeProductCenter | Electronic Supply & Manufacturing | Webinars