High Performance Computing
(course code 90535, 9 CFU, autumn semester)

Goals

The traditional sequential computer model is now confined to application environments where low information-processing capabilities are sufficient (industrial equipment, domestic appliances, some mobile phones). In all other cases, computing platforms make use of multiple computing units simultaneously. These parallel computers require specific programming techniques in order to obtain more processing power through the (explicit or implicit) control of cooperation among the various computing units.

For a computer scientist it is therefore important to know the various types of parallel architectures available today, to be able to program parallel computers, and more generally to have some knowledge of the high-performance computing sector, especially given today's emphasis on large-scale and high-dimensional datasets.

The aim of this course is to provide basic knowledge of the architecture of parallel processing systems, along with the minimum programming skills essential for using such systems. As far as programming is concerned, the emphasis is on message-passing parallel programming with MPI, but the OpenMP shared-memory paradigm is also practiced. GPU architecture is presented as well, together with hands-on GPU programming in OpenCL. Lastly, the problem of parallel I/O is treated at an introductory level.

Topics

Some of the topics below might not have been considered during this year's edition of the course.

  1. Performance of a computer:
    Direct and inverse, absolute and relative performance indices. Benchmarks.
  2. Pipeline processor architecture:
    Performance of a pipeline system and its analytical evaluation. Overall structure of a pipeline processor. Pipeline hazards (structural, data, control) and their impact on performance. Reducing hazards and/or their impact: hardware techniques. Instruction-level parallelism in sequential programs. How to make better use of instruction-level parallelism when using pipeline processors: loop unrolling, instruction reordering.
  3. Advanced pipeline processors:
    Dynamic instruction scheduling with Tomasulo architecture.
    Branch prediction.
    Speculative execution of instructions.
  4. Superscalar processors:
    Multiple-issue processors. Scheduling instructions on multiple-issue processors. VLIW processors (omitted this year).
  5. Cache Memory and Computer Performance:
    Hardware techniques to reduce cache miss penalties. Hardware and software techniques to reduce cache miss frequency. Hardware techniques to hide cache miss overheads. Practicals with matrix multiplication.
  6. Multiprocessor computers:
    Purposes of parallel computers. Limits of parallel computers: Amdahl's law (a worked example is given after this list), communication delays. MIMD computers: shared memory, distributed memory with shared address space, distributed memory with message-passing.
  7. MIMD shared-memory computers:
    Overall organization. Reducing memory access contention. Cache coherency: snooping-based protocols.
  8. MIMD computers with distributed memory and shared address space:
    Overall organization. Directory-based cache coherence protocols.
  9. Cooperation among processes on shared address space:
    Communication and synchronization. Synchronization algorithms: lock/unlock and barrier synchronization on shared address space and their performance.
  10. Consistency in main memory. Weak consistency models and their performance benefits. Memory fence.
  11. High-level parallel programming on shared address space: the OpenMP directives. Practicals with OpenMP (a minimal code sketch is given after this list).
  12. Message-passing MIMD computers:
    Overall organization. Cooperation among processes: message-passing communication and synchronization. Blocking vs. non-blocking, point-to-point vs. collectives, implementation aspects. Non-blocking communication and instruction reordering.
  13. High-level parallel programming with message-passing:
    the SPMD paradigm, the Message Passing Interface (MPI) standard. Practicals with MPI (a minimal code sketch is given after this list).
  14. The concept of load balancing and its impact on performance.
    Dynamic load balancing: the "farm" parallel structure and its performance analysis.
  15. SIMD parallel computers: vector processors and vector extensions (omitted this year).
  16. Parallel computing on GPUs:
    GPU architecture, GPU programming with OpenCL. Practicals with OpenCL on a GPU.
  17. Parallel I/O and MPI-I/O (omitted this year).
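
As a quick illustration of Amdahl's law (topic 6): if a fraction p of a program's execution time can be parallelized over n processors, the speedup is S(n) = 1 / ((1 - p) + p/n); with p = 0.9, for example, the speedup can never exceed 1 / (1 - 0.9) = 10, however many processors are used.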
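
The following minimal sketch (an illustration only, not part of the course material) shows what an OpenMP directive looks like in C, as mentioned in topic 11: the parallel for directive splits the loop iterations among threads, and the reduction clause combines the per-thread partial sums.

    /* Minimal OpenMP sketch (illustration only, not course material).
       Compile with, e.g.:  gcc -fopenmp -O2 omp_sum.c -o omp_sum */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* The directive splits the iterations among the available threads;
           the reduction clause merges the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }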
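
Similarly, a minimal sketch of the SPMD message-passing style used with MPI (topic 13): every process runs the same program, queries its own rank, and rank 0 collects a value from each of the other processes with point-to-point communication. File names and the computed values are illustrative only.

    /* Minimal MPI/SPMD sketch (illustration only, not course material).
       Compile and run with, e.g.:  mpicc mpi_hello.c && mpirun -np 4 ./a.out */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* identifier of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

        if (rank == 0) {
            for (int src = 1; src < size; src++) {
                int value;
                MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank %d\n", value, src);
            }
        } else {
            int value = rank * rank;            /* some local computation */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }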

Exam

The exam consists of a written essay on course topics suggested by the instructor. Each student also carries out an individual final project. The final project must be completed before a deadline proposed by the instructor; late completion reduces the project grade by 0.5 points per day of delay. The final grade is the weighted arithmetic mean of the grade of the oral discussion (weight 0.8) and the grade of the final project (weight 0.2), plus a small, unspecified delta that takes into account the work done during the practicals. Those who do not take part in the practicals or do not carry out the final project must take an additional practical exam on parallel programming or, at the instructor's discretion, complete additional assignments.
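
As a purely illustrative example with hypothetical marks: an oral grade of 28/30 combined with a project graded 27/30 but delivered two days late (27 - 2 × 0.5 = 26) gives 0.8 × 28 + 0.2 × 26 = 27.6, before the delta for the practicals is added.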

Textbooks

John Hennessy, David Patterson: Computer Architecture: A Quantitative Approach (fifth edition), Morgan Kaufmann.
The book can be ordered online from India at a modest price (less than 30 euros).
The book's website provides the appendices (not included in the printed book) along with additional material.

Ian Foster: Designing and Building Parallel Programs, Addison Wesley. Available online.

Slides, tutorials, and code samples, available on Aulaweb, complement but do not replace the textbooks.

Prerequisites

Basic knowledge of computer architecture, fair programming skills in C / C++.
Last update: January 2018