OpenCL将计算系统视为组成自一组“计算设备”,它们可以是CPU或是附加的“加速器”比如GPU。它定义了一种类C语言(英语:List of C-family programming languages)用来写程序。在OpenCL设备上执行的函数叫做“内核”[30]:17。一个单一的计算设备典型的组成自一些“计算单元”,它们依次又包含很多“处理元素”(PE)。一个单一的内核执行可以在所有或多个PE上并行运行。OpenCL定义了API,允许运行于主机上的程序,启动在计算设备上的内核,并管理设备内存,它至少在概念上分离于主机内存。用OpenCL语言写的程序预期会被即时编译,所以使用OpenCL的应用程序在针对各种设备的实现之间是可移植的[31]。
^ 6.06.1Michael McCool; James Reinders; Arch Robison. Structured Parallel Programming: Patterns for Efficient Computation(PDF). Elsevier. 2013 [2019-12-12]. (原始内容存档(PDF)于2018-11-23). Threading Building Blocks (TBB) supports fork–join with a work-stealing load balancer as its basic model. In contrast with Cilk Plus, TBB is a portable ISO C++ library, not a compiler extension. Because TBB is not integrated into the compiler, its fork–join implementation is not quite as efficient as Cilk Plus, and it cannot directly generate vectorized code. However, TBB also provides implementations of several patterns not available directly in Cilk Plus, such as pipelines.
^Gropp, William; Lusk, Ewing; Skjellum, Anthony. A High-Performance, Portable Implementation of the MPI Message Passing Interface. Parallel Computing. 1996, 22 (6): 789–828. doi:10.1016/0167-8191(96)00024-5.
^Sur, Sayantan; Koop, Matthew J.; Panda, Dhabaleswar K. MPI and communication---High-performance and scalable MPI over Infini Band with reduced memory usage. High-performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-depth Performance Analysis. ACM. 4 August 2017: 105. ISBN 978-0769527000. doi:10.1145/1188455.1188565.
^Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (原始内容存档于2019-12-24). SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid.
^Flynn, Michael J.Some Computer Organizations and Their Effectiveness(PDF). IEEE Transactions on Computers. September 1972, C–21 (9): 948–960 [2019-12-05]. doi:10.1109/TC.1972.5009071. (原始内容(PDF)存档于2022-04-11). 2) The single-instruction stream-multiple-data stream(SIMD), which includes most array processes, including Solomon[2](Illiac IV(英语:ILLIAC)). 3) Multiple-instruction stream-single-data stream type organizations(MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board(英语:Plugboard) machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum(SD) passed from program step i to program step i+1(MI). 4) Multiple-Instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor."Univac[3], among other corporations, has proposed various MIMD structures.
^Michael McCool; James Reinders; Arch Robison. 2.4.3 Flynn’s Characterization. Structured Parallel Programming: Patterns for Efficient Computation(PDF). Elsevier. 2013: 52 [2019-12-12]. (原始内容存档(PDF)于2018-11-23). There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple “threads” (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously. Parallel Thread Execution ISA Version 6.5. https://docs.nvidia.com/. [2019-12-18]. (原始内容存档于2019-12-01). On architectures prior to Volta, warps used a singleprogram counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained(英语:Granularity (parallel computing)) sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from. Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity(英语:Granularity (parallel computing)), either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
^Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (原始内容存档于2019-12-24). MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid. ……MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later under Partioning).
^McBurney, D. L., and M. Ronan Sleep. "Transputer-based experiments with the ZAPP architecture." PARLE Parallel Architectures and Languages Europe. Springer Berlin Heidelberg, 1987. Hammond, Kevin. Parallel functional programming: An introduction. In International Symposium on Parallel Symbolic Computation, p. 46. 1994. “The Alfalfa project implemented serial combinators for the Intel iPSC[44]. …… Buckwheat re-implemented this model for the Encore Multimax, a shared-memory multiprocessor. ……[45].”
^Blelloch, Guy. Programming Parallel Algorithms(PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (原始内容存档(PDF)于2016-03-04). An important advance in parallel computing was the introduction of the notion of virtual models. …… Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms(PDF). Spring 2002 [2019-12-06]. (原始内容存档(PDF)于2019-12-06). The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. …… In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages(PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (原始内容存档(PDF)于2017-12-15), For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo “programming construct”, which is heavily used in these notes to express operations that are performed in parallell: for Pi , 1 ≤ i ≤ n pardo A(i) := B(i) This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. …… An alternative model which is actually an alternative presentation mode, called Work-Depth, …… Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form for i , 1 ≤ i ≤ α pardo A(i) := B(C(i)) where the time unit under consideration, consists of a sequence of α concurrent instructions, for some positive integer α.
^Jonathan Hill, Bill McColl, Dan Stefanescu, Mark Goudreau, Kevin Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSP Programming Library(PDF). Parallel Computing. December 1998, 24 (14): 1947–1980 [2019-12-05]. (原始内容存档(PDF)于2019-11-04). Like many other communications libraries, BSPlib adopts a Single Program Multiple Data(SPMD) programming model. The task of writing an SPMD program will typically involve mapping a problem that manipulates a data structure of size N into p instances of a program that each manipulate an N/p sized block of the original domain. The role of BSPlib is to provide the infrastructure required for the user to take care of the data distribution, and any implied communication necessary to manipulate parts of the data structure that are on a remote process.
^Introduction to Apache Giraph. [2019-12-09]. (原始内容存档于2019-12-19). Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop. The input to a Giraph computation is a graph composed of vertices and directed edges, …… Computation proceeds as a sequence of iterations, called supersteps in BSP. ……There is a barrier between consecutive supersteps.
^Apache Hama. [2019-12-09]. (原始内容存档于2019-12-18). Apache Hama(TM) is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model, ……It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief.
^OpenSHMEM. [2019-12-09]. (原始内容存档于2019-12-09). The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme scale systems. ……Currently, the team partners with NVIDIA, University of Tennessee, Knoxville, Florida State University and Paratools. …… UCX provides communication interfaces, and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models.
^Armstrong, Joe. Erlang. Communications of the ACM. September 2010, 53 (9): 68–75 [2020-05-07]. doi:10.1145/1810891.1810910. (原始内容存档于2020-06-09). Erlang is conceptually similar to the occam programming language, though it recasts the ideas of CSP in a functional framework and uses asynchronous message passing instead of the synchronous message passing in CSP.
^CUDA Programming Guide 1.0(PDF). 6/23/2007 [2019-12-09]. (原始内容存档(PDF)于2018-04-17). CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API.请检查|date=中的日期值 (帮助) Apple to Ship Mac OS X Snow Leopard on August 28. August 24, 2009 [2019-12-09]. (原始内容存档于2019-12-09). OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics. September 2009 OpenCL Public Downloads. [2019-12-09]. (原始内容存档于2019-12-09). Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation. MPI Solutions for GPUs. [2019-12-09]. (原始内容存档于2019-12-09). MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. OpenSHMEM Communication on GPUs. [2019-12-09]. (原始内容存档于2019-12-09). NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs.
^Brook Language. [2019-12-09]. (原始内容存档于2019-11-19). Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language. The general computational model, referred to as streaming, ……A stream is a new data type addition which represents a collection of data which can be operated on in parallel. ……Kernels are special functions that operate on streams. A kernel is a parallel function applied to every element of the input streams.
^MapReduce: Simplified Data Processing on Large Clusters(PDF). googleusercontent.com. [2019-12-05]. (原始内容存档(PDF)于2020-06-15). MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
^MapReduce: Simplified Data Processing on Large Clusters(PDF). googleusercontent.com. [2019-12-05]. (原始内容存档(PDF)于2020-06-15). Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
^Lämmel, R. Google's MapReduce programming model — Revisited(PDF). Science of Computer Programming. 2008, 70: 1–30 [2019-12-07]. doi:10.1016/j.scico.2007.07.001. (原始内容存档(PDF)于2019-12-07). The following overview lists more detailed questions and summarizes our findings: Is MAP essentially the map combinator? NO. Is MAP essentially an application of the map combinator? NO. Does MAP essentially serve as the argument of map? YES. Is REDUCE essentially the reduce combinator? NO. Is REDUCE essentially an application of the reduce combinator? TYPICALLY. Does REDUCE essentially serve as the argument of reduce? NO. Does REDUCE essentially serve as the argument of map? YES. Hence, there is no trivial correspondence between MapReduce's MAP&REDUCE and what is normally called map&reduce in functional programming.
^Vishkin, Uzi, Using simple abstraction to reinvent computing for parallelism(PDF), Communications of the ACM, 2011, 54: 75–85, doi:10.1145/1866739.1866757, The array exchange serial algorithm serially iterates the standard exchange algorithm n times. Its pseudo-code follows. For i =0 to n−1 do X:=A( i ) ; A( i ):=B( i ) ; B( i ):=X …… A parallel array exchange algorithm uses an auxiliary array X[0..n-1] of size n, the parallel algorithm applies concurrently the iterations of the above serial algorithm, each exchanging A(i) with B(i) for a different value of i. Note the new pardo command in the following pseudo-code. For i =0 to n−1 pardo X( i ):=A( i ) ; A( i ):=B( i ) ; B( i ):=X( i ) …… Consider the following example of a small XMTC program for the parallel exchange algorithm of the previous section: spawn (0 ,n−1) { var x x:=A( $ ) ; A( $):=B( $ ) ; B($ ):=x } The program simply spawns a concurrent thread for each of the depth-3 serial exchange iterations, using a local variable x. Note that the join command is implicit, and implied by the right parenthesis at the end of the above program. …… Commitments to silicon of XMT include a 64-processor, 75MHz computer based on field-programmable gate array(FPGA) technology [29], and 64-processor ASIC 10mm X 10mm chip using IBM’ s 90nm technology, pictured in Figure5. A basic yet stable compiler has also been developed. …… XMT is easy to build. A single graduate student, with no prior design experience, completed the XMT hardware description (in Verilog) in just over 2 year.
^Vishkin, Uzi; Dascal, Shlomit; Berkovich, Efraim; Nuzman, Joseph, Explicit Multi-Threading (XMT) bridging models for instruction parallelism, Proc. 1998 ACM Symposium on Parallel Algorithms and Architectures (SPAA): 140–151, 1998 [2019-12-12], (原始内容存档于2017-09-21), This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm …… . …… We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This is more involved model concept enables nesting of Spawn-Join commands (similar to NESL(英语:NESL)) and relies on the ability of the hardware to suspend, and later reactive, threads.
J. Darlinton; M. Ghanem; H. W. To, Structured Parallel Programming(PDF), In Programming Models for Massively Parallel Computers. IEEE Computer Society Press, 1993