如果这篇博客帮助到你,可以请我喝一杯咖啡~
CC BY 4.0 (除特别声明或转载文章外)
课后作业
Why is it difficult to construct a true shared-memory computer? What is the minimum number of switches for connecting p processors to a shared memory with b words (where each word can be accessed independently)?
共享内存的使用大大降低了在大规模数据处理过程中内存的消耗,但是共享内存的使用中有很多的陷阱,使理解和管理数据的局部性变得困难,一不注意就很容易导致程序崩溃。数据在读写过程中会更透明,但是数据写入进程或数据读出进程中,需要附加的数据结构控制。
每个word都需要一个开关,一共需要$b$个开关。
Consider a memory system with a level 1 cache of 32 KB and DRAM of 512 MB with the processor operating at 1 GHz. The latency to L1 cache is one cycle and the latency to DRAM is 100 cycles. In each memory cycle, the processor fetches four words (cache line size is four words). What is the peak achievable performance of a dot product of two vectors? Note: Where necessary, assume an optimal cache placement policy.
/* dot product loop */ for (i = 0; i < dim; ++i) dot_prod += a[i] * b[i];
最佳的内存策略是cache的一半用于缓存a,另一半用于缓存b。在这样的情况下,i每增加4,会发生两次cache miss,访问2次DRAM和6次cache。因此,完成上述循环,需要$\frac{dim}{4}\times(2\times 100+6\times 1)\times\frac{1}{1GHz}=dim\times 5.15\times 10^{-8}s$
Homework-asst-1
Consider the following code where each line within the function represents a single function.
typedef struct { float x, y; } Point; inline void innerProduct(const Point *a, const Point *b, float *result) { float x1 = a->x, //Uses a load instruction x2 = b->x, product1 = x1 * x2, y1 = a->y, y2 = b->y, product2 = y1 * y2, inner = product1 + product2; *result = inner; //Uses a store instruction } void computeInnerProduct(const Point a[], const Point b[], float result[], int n) { for (int i = 0; i < n; ++i) innerProduct(&a[i], &b[i], &result[i]); }
In the following questions, you can assume the following:
- $N$ is very large($>10^6$).
- The machines described have modern CPUs, providing out-of-order execution, speculative execution, branch prediction, etc.
- There are no cache misses.
- The overhead of updating the loopindex i is negligible.
- The overhead due to procedure calls, as well as starting and ending loops, is negligible.
Problem 1: Instruction-Level Parallelism Suppose you have a machine $M_1$ with two load/store unites that can each load or store a single value on each clock cycle, and one arithmetic unit can perform one arithmetic operation(e.g., multiplication or addition) on each clock cycle. A. Assume that the load/store and arithmetic have latencies of one cycle. How many clock cycles would be required to execute
computeInnerProduct
as a function ofn
? Explain what limits the performance.
做一次innerProduct
需要7次load,1次store,3次arithmetic. 而computeInnerProduct
的循环中有n次innerProduct
,如果不考虑循环处的消耗的话需要$11n$个cycle。
老师上课说,这一问也考虑了不同指令的并行,所以还是3n。
B. Now assume that the load/store and arithmetic unit have latencies of 10 clock cycles, but they are fully pipelined, able to initiate new operations every clock cycle. How many cycles would be required to execute
computeInnerProduction
as a function ofn
?Explain how this relates your answer to Part-A.
最理想的情况下不发生竞争,完全并行,因此不考虑循环处的消耗的话需要3n
个cycle。和A部分相比,由于指令完全流水线,因此每次循环时的load/store/arithmetic间没有间断。
Problem 2: SIMD with ISPC Consider running the following ISPC code.
export void computeInnerProductISPC( uniform point[] a, uniform point[] b, uniform float[] result, uniform int n ) { foreach (i = 0...n) result[i] = a[i].x * b[i].x + a[i].y * b[i].y; }
Suppose machine $M_2$ has one 8-wide SIMD load/store unit, and one 8-wide SIMD arithmetic unit. both have latencies of one clock cycle. A. How many clock cycles would be required to execute
computeInnerProductISPC
as a function ofn
? Explain what lmits the performance.
对于循环中的每个i,发生可4次load,1次store,3次arithmetic。所以一共需要$8n$个cycle。
老师上课说是$\frac{5}{8}n$,因为5次内存操作是瓶颈,8路同时运算。
B. If we run the
computeInnerProductISPC
on a five-core machine $M_3$, where each core has the same SIMD capabilities as $M_2$, what would be the best speedup it could achieve over the single-core performance of Part-A? Explain.
最理想的情况下能够加速5倍,因为每个进程间的计算是彼此独立的。
由于使用了SIMD指令集,这是一套单盒的指令集,没有多核并行的内容,所以没有加速,一倍。太坑了!!!
Homework-asst-2
Parallel Fractal Generation Using Pthreads
Leverage the sample code provided in the course web site.
Build and run the code in the prob1_mandelbrot_threads directory of the Assignment 1 code base.This program produces the image file mandelbrot-vV -serial.ppm, where V is the view index. This image is a visualization of a famous set of complex numbers called the Mandelbrot set. As you can see in the images below, the result is a familiar and beautiful fractal. Each pixel in the image corresponds to a value in the complex plane, and the brightness of each pixel is proportional to the computational cost of determining whether the value is contained in the Mandelbrot set—white pixels required the maximum (256) number of iterations, dark ones only a few iterations, and colored pixels were somewhere in between. (See function mandel() defined in mandelbrot.cpp.) You can learn more about the definition of the Mandelbrot set at en.wikipedia.org/wiki/Mandelbrot set. Use the command option 「–view V 」 for V between 0 and 6 to get the different images. You can click the links below to see the different images on a browser. Take the time to do this—the images are quite striking. (View 0 is not shown— it is all white.)
Your job is to parallelize the computation of the images using Pthreads. The command-line option 「–threads T」 specifies that the computation is to be partitioned over T threads. In function mandelbrotThread(), located in mandelbrot.cpp, the main application thread creates T-1 additional thread using pthread_create(). It waits for these threads to complete using pthread_join(). Currently, neither the launched threads nor the main thread do any computation, and so the program generates an error message. You should add code to the workerThreadStart() function to accomplish this task. You will not need to use of any other Pthread API calls in this assignment. What you need to do:
- Modify the code in mandelbrot.cpp to parallelize the Mandelbrot generation using two cores. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors.
- Extend your code to utilize T threads for {2, 4, 8,16} , partitioning the image generation work into the appropriate number of horizontal blocks. You will need to modify the code in function workerThreadStart, to partition the work over the threads.
- To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). How do your measurements explain the speedup graph you previously created?
问题本身还是比较容易的,只在mandelbrot.cpp
中加入了$126\sim 134$行和$164\sim 174$行就解决了问题,详见源代码。以上三个文件都在文件夹prog1_mandelbrot_threads\
下,虽然老师给的代码有点乱,但好在写了Makefile
。于是在终端中执行下列指令编译并运行。
cd prog1_mandelbrot_threads
make
./mandelbrot -t 2
得到如下输出。
[mandelbrot serial]: [708.508] ms
Wrote image file mandelbrot-serial.ppmHello world from thread 0
Hello world from thread 1
Hello world from thread 0
Hello world from thread 1
Hello world from thread 0
Hello world from thread 1
Hello world from thread 0
Hello world from thread 1
Hello world from thread 0
Hello world from thread 1
[mandelbrot thread]: [336.797] ms
Wrote image file mandelbrot-thread.ppm++++ (2.10x speedup from 2 threads)
以上输出表明并行计算的版本在双线程时比单线程快了2.1倍。生成了两个ppm格式的文件mandelbrot-serial.ppm
、mandelbrot-thread.ppm
,但是并看不了。使用下述指令进行转码:
ffmpeg -i mandelbrot-serial.ppm mandelbrot-serial.png
ffmpeg -i mandelbrot-thread.ppm mandelbrot-thread.png
得到如下两张图片,可以看到也是一模一样的。
mandelbrot-serial.png | mandelbrot-thread.png |
假如要使用更多的线程进行计算,只需要修改./mandelbrot -t 2
中-t
后的参数即可。
$ ./mandelbrot -t 4
[mandelbrot serial]: [713.990] ms
Wrote image file mandelbrot-serial.ppmHello world from thread 1
Hello world from thread 2
Hello world from thread 0
Hello world from thread 3
Hello world from thread 1
Hello world from thread 2
Hello world from thread 0
Hello world from thread 3
Hello world from thread 1
Hello world from thread 0
Hello world from thread 2
Hello world from thread 3
Hello world from thread 1
Hello world from thread 2
Hello world from thread 0
Hello world from thread 3
Hello world from thread 1
Hello world from thread 2
Hello world from thread 0
Hello world from thread 3
[mandelbrot thread]: [291.995] ms
Wrote image file mandelbrot-thread.ppm++++ (2.45x speedup from 4 threads)
4线程的时候只快了2.45倍。
$ ./mandelbrot -t 8
[mandelbrot serial]: [707.881] ms
Wrote image file mandelbrot-serial.ppmHello world from thread 1
Hello world from thread 2
Hello world from thread 3
Hello world from thread 4
Hello world from thread 5
Hello world from thread 6
Hello world from thread 0
Hello world from thread 7
Hello world from thread 1
Hello world from thread 0
Hello world from thread 3
Hello world from thread 4
Hello world from thread 2
Hello world from thread 5
Hello world from thread 6
Hello world from thread 7
Hello world from thread 1
Hello world from thread 0
Hello world from thread 3
Hello world from thread 4
Hello world from thread 5
Hello world from thread 6
Hello world from thread 2
Hello world from thread 7
Hello world from thread 1
Hello world from thread 2
Hello world from thread 3
Hello world from thread 4
Hello world from thread 5
Hello world from thread 0
Hello world from thread 6
Hello world from thread 7
Hello world from thread 1
Hello world from thread 2
Hello world from thread 3
Hello world from thread 4
Hello world from thread 5
Hello world from thread 6
Hello world from thread 0
Hello world from thread 7
[mandelbrot thread]: [196.575] ms
Wrote image file mandelbrot-thread.ppm++++ (3.60x speedup from 8 threads)
8线程的时候3.6倍。
$ ./mandelbrot -t 16
[mandelbrot serial]: [738.481] ms
Wrote image file mandelbrot-serial.ppmHello world from thread 1
Hello world from thread 4
Hello world from thread 13
Hello world from thread 5
Hello world from thread 6
Hello world from thread 2
Hello world from thread 7
Hello world from thread 8
Hello world from thread 9
Hello world from thread 10
Hello world from thread 11
Hello world from thread 0
Hello world from thread 12
Hello world from thread 14
Hello world from thread 15
Hello world from thread 3
Hello world from thread 1
Hello world from thread 6
Hello world from thread 3
Hello world from thread 4
Hello world from thread 5
Hello world from thread 7
Hello world from thread 8
Hello world from thread 9
Hello world from thread 2
Hello world from thread 10
Hello world from thread 11
Hello world from thread 12
Hello world from thread 13
Hello world from thread 0
Hello world from thread 14
Hello world from thread 15
Hello world from thread 1
Hello world from thread 2
Hello world from thread 0
Hello world from thread 4
Hello world from thread 5
Hello world from thread 6
Hello world from thread 7
Hello world from thread 8
Hello world from thread 9
Hello world from thread 10
Hello world from thread 11
Hello world from thread 12
Hello world from thread 13
Hello world from thread 14
Hello world from thread 15
Hello world from thread 3
Hello world from thread 1
Hello world from thread 2
Hello world from thread 9
Hello world from thread 15
Hello world from thread 5
Hello world from thread 6
Hello world from thread 7
Hello world from thread 8
Hello world from thread 3
Hello world from thread 10
Hello world from thread 11
Hello world from thread 12
Hello world from thread 13
Hello world from thread 0
Hello world from thread 14
Hello world from thread 4
Hello world from thread 1
Hello world from thread 2
Hello world from thread 9
Hello world from thread 3
Hello world from thread 5
Hello world from thread 6
Hello world from thread 7
Hello world from thread 8
Hello world from thread 0
Hello world from thread 4
Hello world from thread 10
Hello world from thread 11
Hello world from thread 12
Hello world from thread 13
Hello world from thread 14
Hello world from thread 15
[mandelbrot thread]: [185.233] ms
Wrote image file mandelbrot-thread.ppm++++ (3.99x speedup from 16 threads)
16线程的时候3.99倍。
可以看到,加速比随进程增加而增加,但是增加的量不是线性的。这是因为并行的时候会有额外的开销。
源代码
mandelbrot.cpp
主要的代码,
#include <stdio.h>
#include <pthread.h>
// Use this code to time your threads
// 我加入的代码在126~134行和164~174行
#include "CycleTimer.h"
/*
15418 Spring 2012 note: This code was modified from example code
originally provided by Intel. To comply with Intel's open source
licensing agreement, their copyright is retained below.
-----------------------------------------------------------------
Copyright (c) 2010-2011, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
// Core computation of Mandelbrot set membershop
// Iterate complex number c to determine whether it diverges
static inline int mandel(float c_re, float c_im, int count)
{
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i)
{
if (z_re * z_re + z_im * z_im > 4.f)
break;
float new_re = z_re * z_re - z_im * z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}
return i;
}
//
// MandelbrotSerial --
//
// Compute an image visualizing the mandelbrot set. The resulting
// array contains the number of iterations required before the complex
// number corresponding to a pixel could be rejected from the set.
//
// * x0, y0, x1, y1 describe the complex coordinates mapping
// into the image viewport.
// * width, height describe the size of the output image
// * startRow, endRow describe how much of the image to compute
void mandelbrotSerial(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int endRow,
int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
for (int j = startRow; j < endRow; j++)
{
for (int i = 0; i < width; ++i)
{
float x = x0 + i * dx;
float y = y0 + j * dy;
int index = (j * width + i);
output[index] = mandel(x, y, maxIterations);
}
}
}
// Struct for passing arguments to thread routine
typedef struct
{
float x0, x1;
float y0, y1;
unsigned int width;
unsigned int height;
int maxIterations;
int *output;
int threadId;
int numThreads;
} WorkerArgs;
//
// workerThreadStart --
//
// Thread entrypoint.
void *workerThreadStart(void *threadArgs)
{
WorkerArgs *args = static_cast<WorkerArgs *>(threadArgs);
// TODO: Implement worker thread here.
printf("Hello world from thread %d\n", args->threadId);
//以下是我加的第一部分内容
mandelbrotSerial(
args->x0, args->y0, args->x1, args->y1,
args->width, args->height,
args->height / args->numThreads * args->threadId, //startRow
args->threadId == args->numThreads - 1 ? args->height : args->height / args->numThreads * (args->threadId + 1), //endRow
args->maxIterations,
args->output);
//增加完毕,只需要调用串行部分的代码即可
return NULL;
}
//
// MandelbrotThread --
//
// Multi-threaded implementation of mandelbrot set image generation.
// Multi-threading performed via pthreads.
void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
const static int MAX_THREADS = 32;
if (numThreads > MAX_THREADS)
{
fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
exit(1);
}
pthread_t workers[MAX_THREADS];
WorkerArgs args[MAX_THREADS];
for (int i = 0; i < numThreads; i++)
{
// TODO: Set thread arguments here.
args[i].threadId = i;
//以下是我加的第二段内容
args[i].x0 = x0;
args[i].x1 = x1;
args[i].y0 = y0;
args[i].y1 = y1;
args[i].width = width;
args[i].height = height;
args[i].maxIterations = maxIterations;
args[i].output = output;
args[i].numThreads = numThreads;
//增加完毕,把所有信息传给worker线程即可
}
// Fire up the worker threads. Note that numThreads-1 pthreads
// are created and the main app thread is used as a worker as
// well.
for (int i = 1; i < numThreads; i++)
pthread_create(&workers[i], NULL, workerThreadStart, &args[i]);
workerThreadStart(&args[0]);
// wait for worker threads to complete
for (int i = 1; i < numThreads; i++)
pthread_join(workers[i], NULL);
}
main.cpp
#include <stdio.h>
#include <algorithm>
#include <getopt.h>
#include "CycleTimer.h"
extern void mandelbrotSerial(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int endRow,
int maxIterations,
int output[]);
extern void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations,
int output[]);
extern void writePPMImage(
int *data,
int width, int height,
const char *filename,
int maxIterations);
void scaleAndShift(float &x0, float &x1, float &y0, float &y1,
float scale,
float shiftX, float shiftY)
{
x0 *= scale;
x1 *= scale;
y0 *= scale;
y1 *= scale;
x0 += shiftX;
x1 += shiftX;
y0 += shiftY;
y1 += shiftY;
}
void usage(const char *progname)
{
printf("Usage: %s [options]\n", progname);
printf("Program Options:\n");
printf(" -t --threads <N> Use N threads\n");
printf(" -v --view <INT> Use specified view settings (1-6)\n");
printf(" -? --help This message\n");
}
bool verifyResult(int *gold, int *result, int width, int height)
{
int i, j;
for (i = 0; i < height; i++)
{
for (j = 0; j < width; j++)
{
if (gold[i * width + j] != result[i * width + j])
{
printf("Mismatch : [%d][%d], Expected : %d, Actual : %d\n",
i, j, gold[i * width + j], result[i * width + j]);
return 0;
}
}
}
return 1;
}
#define VIEWCNT 6
int main(int argc, char **argv)
{
const int width = 1440;
const int height = 900;
const int maxIterations = 256;
int numThreads = 2;
float x0 = -2.167;
float x1 = 1.167;
float y0 = -1;
float y1 = 1;
// Support VIEWCNT views
float scaleValues[VIEWCNT + 1] = {1.0f, 1.0f, 0.015f, 0.02f, 0.02f, 0.02f, 0.002f};
float shiftXs[VIEWCNT + 1] = {0.0f, 0.0f, -0.98f, 0.35f, 0.0f, -1.5f, -1.4f};
float shiftYs[VIEWCNT + 1] = {0.0f, 0.0f, 0.30f, 0.05f, 0.73f, 0.0f, 0.0f};
// parse commandline options ////////////////////////////////////////////
int opt;
static struct option long_options[] = {
{"threads", 1, 0, 't'},
{"view", 1, 0, 'v'},
{"help", 0, 0, '?'},
{0, 0, 0, 0}};
while ((opt = getopt_long(argc, argv, "t:v:?", long_options, NULL)) != EOF)
{
switch (opt)
{
case 't':
{
numThreads = atoi(optarg);
break;
}
case 'v':
{
int viewIndex = atoi(optarg);
// change view settings
if (viewIndex >= 1 && viewIndex <= VIEWCNT)
{
float scaleValue = scaleValues[viewIndex];
float shiftX = shiftXs[viewIndex];
float shiftY = shiftYs[viewIndex];
scaleAndShift(x0, x1, y0, y1, scaleValue, shiftX, shiftY);
}
else
{
fprintf(stderr, "Invalid view index\n");
return 1;
}
break;
}
case '?':
default:
usage(argv[0]);
return 1;
}
}
// end parsing of commandline options
int *output_serial = new int[width * height];
int *output_thread = new int[width * height];
//
// Run the serial implementation. Run the code three times and
// take the minimum to get a good estimate.
//
memset(output_serial, 0, width * height * sizeof(int));
double minSerial = 1e30;
for (int i = 0; i < 3; ++i)
{
double startTime = CycleTimer::currentSeconds();
mandelbrotSerial(x0, y0, x1, y1, width, height, 0, height, maxIterations, output_serial);
double endTime = CycleTimer::currentSeconds();
minSerial = std::min(minSerial, endTime - startTime);
}
printf("[mandelbrot serial]:\t\t[%.3f] ms\n", minSerial * 1000);
writePPMImage(output_serial, width, height, "mandelbrot-serial.ppm", maxIterations);
//
// Run the threaded version
//
memset(output_thread, 0, width * height * sizeof(int));
double minThread = 1e30;
for (int i = 0; i < 5; ++i)
{
double startTime = CycleTimer::currentSeconds();
mandelbrotThread(numThreads, x0, y0, x1, y1, width, height, maxIterations, output_thread);
double endTime = CycleTimer::currentSeconds();
minThread = std::min(minThread, endTime - startTime);
}
printf("[mandelbrot thread]:\t\t[%.3f] ms\n", minThread * 1000);
writePPMImage(output_thread, width, height, "mandelbrot-thread.ppm", maxIterations);
if (!verifyResult(output_serial, output_thread, width, height))
{
printf("ERROR : Output from threads does not match serial output\n");
delete[] output_serial;
delete[] output_thread;
return 1;
}
// compute speedup
printf("++++\t\t\t\t(%.2fx speedup from %d threads)\n", minSerial / minThread, numThreads);
delete[] output_serial;
delete[] output_thread;
return 0;
}
Makefile
CXX=g++ -m64
CXXFLAGS=-I../common -Iobjs/ -O3 -Wall
APP_NAME=mandelbrot
OBJDIR=objs
COMMONDIR=../common
PPM_CXX=$(COMMONDIR)/ppm.cpp
PPM_OBJ=$(addprefix $(OBJDIR)/, $(subst $(COMMONDIR)/,, $(PPM_CXX:.cpp=.o)))
default: $(APP_NAME)
.PHONY: dirs clean
dirs:
/bin/mkdir -p $(OBJDIR)/
clean:
/bin/rm -rf $(OBJDIR) *.ppm *~ $(APP_NAME)
OBJS=$(OBJDIR)/main.o $(OBJDIR)/mandelbrot.o $(PPM_OBJ)
$(APP_NAME): dirs $(OBJS)
$(CXX) $(CXXFLAGS) -o $@ $(OBJS) -lm -lpthread
$(OBJDIR)/%.o: %.cpp
$(CXX) $< $(CXXFLAGS) -c -o $@
$(OBJDIR)/%.o: $(COMMONDIR)/%.cpp
$(CXX) $< $(CXXFLAGS) -c -o $@
$(OBJDIR)/main.o: $(COMMONDIR)/CycleTimer.h