Khóa luận A parallel implementation on modern hardware for geo-Electrical tomographical software

Tài liệu Khóa luận A parallel implementation on modern hardware for geo-Electrical tomographical software: ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC CÔNG NGHỆ Nguyễn Hoàng Vũ A PARALLEL IMPLEMENTATION ON MODERN HARDWARE FOR GEO-ELECTRICAL TOMOGRAPHICAL SOFTWARE KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY Ngành: Công nghệ thông tin HÀ NỘI – 2010 ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC CÔNG NGHỆ Nguyễn Hoàng Vũ A PARALLEL IMPLEMENTATION ON MODERN HARDWARE FOR GEO-ELECTRICAL TOMOGRAPHICAL SOFTWARE KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY Ngành: Công nghệ thông tin Cán bộ hướng dẫn: PGS. TSKH. Phạm Huy Điển Cán bộ đồng hướng dẫn: TS. Đoàn Văn Tuyến HÀ NỘI – 2010 ABSTRACT Geo-electrical tomographical software plays a crucial role in geophysical research. However, imported software is expensive and does not provide much customizability, which is essential for more advanced geophysical study. Besides, these programs are unable to exploit the full potential of modern hardware, so the running time is inadequate for large-scale geophysical surveys. It is therefore an essential task...

doc57 trang | Chia sẻ: hunglv | Lượt xem: 1001 | Lượt tải: 0download
Bạn đang xem trước 20 trang mẫu tài liệu Khóa luận A parallel implementation on modern hardware for geo-Electrical tomographical software, để tải tài liệu gốc về máy bạn click vào nút DOWNLOAD ở trên
ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC CÔNG NGHỆ Nguyễn Hoàng Vũ A PARALLEL IMPLEMENTATION ON MODERN HARDWARE FOR GEO-ELECTRICAL TOMOGRAPHICAL SOFTWARE KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY Ngành: Công nghệ thông tin HÀ NỘI – 2010 ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC CÔNG NGHỆ Nguyễn Hoàng Vũ A PARALLEL IMPLEMENTATION ON MODERN HARDWARE FOR GEO-ELECTRICAL TOMOGRAPHICAL SOFTWARE KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY Ngành: Công nghệ thông tin Cán bộ hướng dẫn: PGS. TSKH. Phạm Huy Điển Cán bộ đồng hướng dẫn: TS. Đoàn Văn Tuyến HÀ NỘI – 2010 ABSTRACT Geo-electrical tomographical software plays a crucial role in geophysical research. However, imported software is expensive and does not provide much customizability, which is essential for more advanced geophysical study. Besides, these programs are unable to exploit the full potential of modern hardware, so the running time is inadequate for large-scale geophysical surveys. It is therefore an essential task to develop domestic software for overcoming all these problems. The development of this software is based on our research in using parallel programming on modern multi-core processors and stream processors for high performance computing. While this project with its inter-disciplinary aspect poses many challenges, it has also enabled us to gain valuable insights in making scientific software and especially the new field of personal supercomputing. List of Acronyms CPU Central Processing Unit CUDA Compute Unified Device Architecture GPU Graphical Processing Unit OpenMP Open Multi Processing OpenCL Open Computing Language TBB Intel Threading Building Blocks INTRODUCTION Geophysical methods are based on studying the propagation of the different physical fields within the earth’s interior. One of the most widely used fields in geophysics is the electromagnetic field generated by natural or artificial (controlled) sources. Electromagnetic methods comprise one of the three principle technologies in applied geophysics (the other two being seismic methods and potential field methods). There are many geo-electromagnetic methods currently used in the world. Of these electromagnetic methods, resistivity tomography is the most widely used and it is of major interest in our work. Resistivity tomography [17] or resistivity imaging is a method used in exploration geophysics [18] to measure underground physical properties in mineral, hydrocarbon, ground water or even archaeological exploration. It is closely related to the medical imaging technique called electrical impedance tomography (EIT), and mathematically is the same inverse problem. In contrast to medical EIT however, resistivity tomography is essentially a direct current method. This method is relatively new compared to other geophysical methods. Since the 1970s, extensive research has been done on the inversion theory for this method and it is still an active research field today. A detailed historical description can be seen in [27]. Resistivity tomography surveys searching for oil and gas (left) or water (right) Resistivity tomography has the advantage of being relatively easy to carry out with inexpensive equipment and therefore has seen widespread use all over the world for many decades. With the increasing computing power of personal computers, inversion software for resistivity tomography has been made, most notably being Res2Dinv by Loke [5]. According to geophysicists at Institute of Geology (Vietnam Academy of Science and Technology), the use of imported resistivity software encountered the following serious problems: The user interface is not user-friendly; Some computation steps cannot be modified to adapt to measurement methods used in Vietnam; With large datasets, the computational power of modern hardware is not fully exploit; High cost for purchasing and upgrading software. Resistivity software is a popular tool for both short term and long term projects in research, education and exploration by Vietnamese geophysicists. Replacing imported software is therefore essential not only to reduce cost but also to enable more advance research on the theoretical side, which requires custom software implementations. The development of this software is based on research in using modern multi-core processors and stream processors for scientific software. This can also be the basis for solving larger geophysical problems on distributed systems if necessary. Our resistivity tomographical software is an example of applying high performance computing on modern hardware to computational geoscience. For 2-D surveys with small datasets, sequential programs still provide results in acceptable time. Parallelizing for these situations provides faster response time and therefore increases research productivity but is not a critical feature. However, for 3-D surveys, datasets are much larger with high computational expenses. A solution for this situation is using clusters. Clusters, however, are not a feasible option for many scientific institutions in Vietnam. Clusters are expensive with high power consumption. With limited availability only in large institutions, getting access to clusters is also inconvenient. Clusters are not suitable for field trip as well because of difficulties in transportation and power supply. Exploiting the parallel capabilities of modern hardware is therefore a must to enable cost-effective scientific computing on desktop systems for such problems. This can help reduce hardware cost, power consumption and increase user convenience and software development productivity. These benefits are especially valuable to scientific software customers in Vietnam where cluster deployment is costly in both money and human resources. Chapter 1 High Performance Computing on Modern Hardware 1.1 An overview of modern parallel architectures Computer speed is crucial in most software, especially scientific applications. As a result, computer designers have always looked for mechanisms to improve hardware performance. Processor speed and packaging densities have been enhanced greatly over the past decades. However, due to the physical limitations of electronic components, other mechanisms have been introduced to improve hardware performance. According to [1], the objectives of architectural acceleration mechanisms are to decrease latency, the time from start to completion of an operation; increase bandwidth, the width and rate of operations. Direct hardware implementations of expensive operations help reduce execution latency. Memory latency has been improved with larger register files, multiple register sets and caches, which exploit the spatial and temporal locality of reference in the program. With the bandwidth problem, the solutions can be classified into two forms of parallelism: pipelining and replication. Pipelining [22] divides an operation into different stages to enable the concurrent execution of these stages for a stream of operations. If all of the stages of the pipeline are filled, a new result is available every unit of time it takes to complete the slowest stage. Pipelines are used in many kinds of processors. In the picture below, a generic pipeline with four stages is shown. Without pipelining, four instructions take 16 clock cycles to complete. With pipelining, this is reduced to just 8 clock cycles. On the other hands, replication duplicates hardware components to enable concurrent execution of different operations. Pipelining and replication appear at different architectural levels and in various forms complementing each other. While numerous, these architectures can be divided into three groups [1]: Instruction-Level Parallel (Fined-Grained Control Parallelism) Process-Level Parallel (Coarse-Grained Control Parallelism) Data Parallel (Data Parallelism) These categories are not exclusive of each other. A hardware device (such as the CPU) can belong to all these three groups. 1.1.1 Instruction-Level Parallel Architectures There are two common kinds of instruction-level parallel architecture. The first is superscalar pipelined architectures which subdivide the execution of each machine instruction into a number of stages. As short stages allow for high clock frequencies, the recent trend is to use longer pipeline. For example the Pentium 4 uses a 20-stage pipeline and the latest Pentium 4 core contains a 31-stage pipeline. Figure 1 Generic 4-stage pipeline; the colored boxes represent instructions independent of each other [21]. A common problem with these pipelines is branching. When branches happen, the processor has to wait until the branch finishes fetching the next instruction. A branch prediction unit is put into the CPU to guess which branch would be executed. However, if branches are predicted poorly, the performance penalty can be high. Some programming techniques to make branches in code more predictable for hardware can be found in [2]. Programming tools such as Intel VTune Performance Analyzer can be of great help in profiling programs for missed branch predictions. The second kind of instruction-level parallel architecture is VLIW (very long instruction word) architectures. A very long instruction word usually controls 5 to 30 replicated execution units. An example of VLIW architecture is the Intel Itanium processor [23]. As of 2009, Itanium processors can execute up to six instructions per cycle. For ordinary architectures, superscalar execution and out-of-order execution is used to speed up computing. This increases hardware complexity. The processor must decide at runtime whether instruction parts are independent so that they can be executed simultaneously. In VLIW architectures, this is decided at compile time. This shifts the hardware complexity to software complexity. All operations in one instruction must be independent so efficient code generation is a hard task for compilers. The problem of writing compilers and porting legacy software to the new architectures make the Itanium architecture unpopular. 1.1.2 Process-Level Parallel Architectures Process-level parallel architectures are architectures that exploit coarse-grained control parallelism in loops, functions or complete programs. They replicate complete asynchronously executing processors to increase execution bandwidth and, hence, fit the multiple-instruction-multiple-data (MIMD) paradigm. Until a few years ago, these architectures comprised of multiprocessors and multicomputers. A multiprocessor uses a shared memory address space for all processors. There are two kinds of multiprocessors: Symmetric Multiprocessor or SMP computers: the cost of accessing an address in memory is the same for each processor. Furthermore, the processors are all equal in the eyes of the operation system. Non-uniform Memory Architecture or NUMA computers: the cost of accessing a given address in memory varies from one processor to another. In a multicomputer, each processor has its own local memory. Access to remote memory requires explicit message passing over the interconnection network. They are also called distributed memory architectures or message-passing architectures. An example is cluster system. A cluster consists of many computing nodes, which can be built using high-performance hardware or commodity desktop hardware. All the nodes in a cluster are connected via Infiniband or Gigabit Ethernet. Big clusters can have thousands of nodes with special topologies for interconnect. Cluster is currently the only affordable way for large scale supercomputing at the level of hundreds of teraflops or more. Figure 2 Example SMP system (left) and NUMA system (right) A recent derivative of cluster computing is grid computing [19]. While traditional clusters often consist of similar nodes close to each other, grids will incorporate heterogeneous collections of computers, possibly distributed geographically. They are, therefore, optimized for workloads containing many independent packets of work. The two biggest grid computing network is Folding@home and SETI@home (BOINC). Both have the computing capability of a few petaflops while the most powerful traditional cluster can barely reach over 1 petaflops. Figure 3 Intel CPU trends [12]. The most notable change to process-level parallel architectures happened in the last few years. Figure 3 shows that although the number of transistors a CPU contains still increases according to Moore’s law (which means doubling every 18 months), the clock speed has virtually stopped rising due to heating and manufacturing problems. CPU manufacturers have now turned to adding more cores to a single CPU while the clock speed stays the same or decreases. An individual core is a distinct processing element and is basically the same as a CPU in an older single-core PC. A multi-core chip can now be considered a SMP MIMD parallel processor. A multi-core chip can run at lower clock speed and therefore consumes less power but still has increases in processing power. The latest Intel Core i7-980 (Gulftown) CPU has 6 cores and 12 MB of cache. With hyper-threading it can support up to 12 hardware threads. Future multi-core CPU generations may have 8, 16 or even 32 cores in the next few years. These new architectures, especially in multi-processor node, can provide the level of parallelism that has been only available to cluster systems. Figure 4 Intel Gulftown CPU . 1.1.3 Data parallel architectures Data parallel architectures appeared very soon on the history of computing. They utilize data parallelism to increase execution bandwidth. Data parallelism is common in many scientific and engineering tasks where a single operation is applied to a whole data set, usually a vector or a matrix. This allows applications to exhibit a large amount of independent parallel workloads. Both pipelining and replication have been applied to hardware to utilize data parallelism. Pipelined vector processors such as the Cray 1 [15], operates on vectors rather than scalar. After the instruction is decoded, vectors of data stream directly from memory into the pipelined functional units. Separate pipelines can be chained together to get higher performance. The translation of sequential code into vector instructions is called vectorization. A vectorizing compiler played a crucial role in programming for vector processors. This has significantly pushed the maturity of compilers in generating efficient parallel code. Through replication, processor arrays can utilize data parallelism as a single control unit can order a large number of simple processing elements to operate the same instruction on different data elements. These massively parallel supercomputers fit into the single-instruction-multiple-data (SIMD) paradigm. Although both of the kinds of supercomputers mentioned above have virtually disappeared from common use, they are precursors for current data parallel architectures, most notably the CPU SIMD processing and GPUs. The CPU SIMD extension instruction set for Intel CPUs include MMX, SSE, SSE2, SSE3, SSE4 and AVX. They allow the CPU to use a single operation to operate on several data elements simultaneously. AVX, the latest extension instruction set is expected to be implemented on both Intel and AMD products in 2010 and 2011. With AVX, the size of SIMD vector register is increased from 128-bit to 256-bit, which means the CPU can operate on 8 single-precision or 4 double-precision floating point numbers during one instruction. CPU SIMD processing has been used widely by programmers in many applications such as multimedia and encryption and compiler code generation for these architectures are now considerably good. Even when multicore CPUs are popular, understanding SIMD extensions is still vital for optimizing program execution on each CPU core. A good handbook on utilizing software vectorization is [1]. However, graphics processing units (GPUs) are perhaps the hardware with the most dramatic growth in processing power over the last few years. Graphics chips started as fixed function graphics pipelines. Over the years, these graphics chips became increasingly programmable with newer graphics API and shaders. In the 1999-2000 timeframe, computer scientists in particular, along with researchers in fields such as medical imaging and electromagnetic started using GPUs for running general purpose computational applications. They found the excellent floating point performance in GPUs led to a huge performance boost for a range of scientific applications. This was the advent of the movement called GPGPU or General Purpose computing on GPUs. With the advent of programming languages such as CUDA and OpenCL, GPUs are now easier to program. With the processing power of a few Teraflops, GPUs are now massively parallel processors at a much smaller scale. They are now also termed stream processors as data is streamed directly from memory into the execution units without the latency like the CPUs. As can be seen in Figure 5, GPUs have currently outpaced CPUs many times in both speed and bandwidth. Figure 5 Comparison between CPU and GPU speed and bandwidth (CUDA programming Guide) [8]. The two most notable GPU architectures now are the ATI Radeon 5870 (Cypress) and Nvidia GF100 (Fermi) processor. The Radeon 5870 processor has 20 SIMD engines, each of which has 16 thread processors inside of it. Each of those thread processors has five arithmetic logic units, or ALUs. With a total of 1600 stream processors and a clock speed of 850 MHz, Radeon 5870 has the single-precision computing power of 2.72 Tflops while top of the line CPU still has processing power counted in Gflops. Double-precision computing is done at one fifth of the rate for single-precision, at 544 Gflops. This card supports both OpenCL and DirectCompute. The double version, the Radeon 5970 (Hemlock) dual graphics processor has a single-precision computing power of 4.7 Tflops in a graphics card at a thermal envelope of less than 300 W. Custom over clocked versions made by graphics card manufacturer can even offer much more computing power than the original version. Figure 6 ATI Radeon 5870 (Cypress) graphics processor The Nvidia GF100 processor has 3 billion transistors with 15 SM (Shader Multiprocessor) units, each has 32 shader cores or CUDA processor compared to 8 of previous Nvidia GPUs. Each CUDA processor has a fully pipelined integer arithmetic logic unit and floating point unit with better standard conformance and fused multiply-add instruction for both single and double precision. The integer precision was raised from 24 bit to 32 bit so multi-instruction emulation is no longer required. Special function units in each SM can execute transcendental instructions such as sin, cosine, reciprocal and square root. Figure 7 Nvidia GF100 (Fermi) processor with parallel kernel execution Single-precision performance of GF100 is about 1.7 Tflops but double-precision performance is only half at 800 Gflops, significantly better than the Radeon 5870. Previous architectures required that all SMs in the chip worked on the same kernel (function/program/loop) at the same time. In this generation the GigaThread scheduler can execute threads from multiple kernels in parallel. This chip is specifically designed to provide better support for GPGPU with memory error correction, native support for C++ (including virtual functions, function pointers, dynamic memory management using new and delete and exception handling), and compatible with CUDA, OpenCL and DirectCompute. A true cache hierarchy with two levels is added with more shared memory than previous GPU generations. Context switching and atomic operations are also faster. Fortran compilers are also available from PGI. Specific versions for scientific computing will have from 3GB to 6GB GDDR5. 1.1.4 Future trends in hardware Although the current parallel architectures are very powerful, especially for parallel workload, they won’t stay the same way in the future. From the current situation, we can present some trends for future hardware in the next few years. The first is the change in the composition of clusters. A cluster node can now have several multicore processors and some graphics processors. Consequently, clusters with fewer nodes can still have the same processing power. This also enables the maximum limit of cluster processing capabilities to increase. Traditional clusters consisting of only CPU nodes have virtually reached their peak at about 1 Pflops. Adding more nodes would result in more system overhead with marginal increase in speed. Electricity consumption is also enormous for such systems. Supercomputing now accounts for 2 percents of the total electric consumption of the entire United States. Building supercomputer at the exascale (1000 Pflops) using traditional clusters is too much costly. Graphics processors or similar architectures provide a good Gflops/W ratio and are, therefore, vital to building supercomputers with larger processing power. The IBM Roadrunner supercomputer [21] using Cell processors is a clear example for this trend. The second trend is the convergence of stream processors and CPUs. Graphics cards currently act the role of co-processors to the CPU in floating point intensive tasks. In the long term, all the functionalities of the graphics card may reside on the CPU, just like what happened in the case of math co-processors which are now CPU floating point units. The Cell processor by Sony, Toshiba and IBM is heading towards that direction. AMD has also been continuously pursuing this with its Fusion project. The Nvidia GF100 is a GPU with many CPU features such as memory correction and large caches. The Intel’s Larrabee experiment project event went further by aiming to produce an x86-compatible GPU that would later be integrated into Intel CPUs. These would all lead to a new kind of processor called Accelerated Processing Unit (APU). The third trend is the evolution of multicore CPUs into many-core processors in which individual cores form a cluster system. In December 2009, Intel unveiled the newest product of its Terascale Computing Research program, a 48-core x86 processor. Figure 8 The Intel 48-core processor. To the right is a dual-core tile. The processor has 24 such tiles in a 6 by 4 layout. It represents the sequel to Intel's 2007 Polaris 80-core prototype that was based on simple floating point units. This device is called a "Single-chip Cloud Computer" (SCC). The structure of the chip resembles that of a cluster with cores connected through a message-passing network with 256 GB/s bandwidth. Shared-memory is simulated on software. Cache coherence and power management is also software-based. Each core can run its own OS and software, which resembles a cloud computing center. Each tile (2 cores) can have its own frequency, and groupings of four tiles (8 cores) can each run at their own voltage. The SCC can run all 48 cores at one time over a range of 25W to 125W and selectively vary the voltage and frequency of the mesh network as well as sets of cores. This 48 core device consists of 1.3 billion transistors produced using 45nm high-k metal gate. Intel are currently handing out these processors to its partners in both industry and academy to enhance further research in parallel computing. Tilera corporation is also producing processors with one hundred cores. Each core can run a Linux OS independently. The processor also has Dynamic Distributed Cache technology which provides a fully coherent shared cache system across an arbitrary sized array of tiles. Programming can be done normally on a Linux derivative with full support for C and C++ and Tilera parallel libraries. The processor utilizes VLIW (Very Long Instruction Word) with RISC instructions for each core. The primary focus of this processor is for networking, multimedia and clouding computing with a strong emphasis on integer computation to complement GPU’s floating point computation. From all these trends, it would be reasonable to assume that in the near future, we will be able to see new architectures which resemble all current architectures, such as many-core processors where each core has a CPU core and stream-processors as co-processors. Such systems would provide tremendous computing power per processor that would cause major changes in the field of computing. 1.2 Programming tools for scientific computing on personal desktop systems Traditionally, most scientific computing tasks have been done on clusters. However, with the advent of modern hardware that provide great level of parallelism, many small to medium-sized tasks can now be run on a single high-end desktop computer in reasonable time. Such systems are called “personal supercomputers”. Although they have variable configurations, most today employ multicore CPUs with multiple GPUs. An example is the Fastra II desktop supercomputer [3] at University of Antwep, Belgium, which can achieve 12 Tflops computing power. The FASTRA II contains six NVIDIA GTX295 dual-GPU cards, and one GTX275 single-GPU card with a total cost of less than six thousands euros. The real processing speed of this system can equal that of a cluster with thousands of CPU cores. Although these systems are more cost-effective, consume less power and provide greater convenience for their users, they pose serious problems for software developers. Traditional programming tools and algorithms for cluster computing are not appropriate for exploiting the full potential of multi-core CPUs and GPUs. There are many kinds of interaction between components in such heterogeneous systems. The link between the CPU and the GPUs is through the PCI Express bus. The GPU has to go through the CPU to access system memory. The inside of the multicore CPU is a SMP system. As each GPU has separate graphics memory, the relationship between GPUs is like in a distributed-memory system. As these systems are in early stages of development, programming tools do not provide all the functionalities programmers need and many tasks still need to be done manually. Algorithms also need to be adapted to the limitations of current hardware and software tools. In the following parts, we will present some programming tools for desktop systems with multi-core CPUs and multi GPUs that we that we consider useful for exploiting parallelism in scientific computing. The grouping is just for easy comparison between similar tools as some tools provide more than one kind of parallelization. 1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++ Windows and Linux (and other Unixes) provide API’s for creating and manipulating operating system threads using WinAPI threads and POSIX threads (Pthreads), respectively. These threading approaches may be convenient when there's a natural way to functionally decompose an application - for example, into a user interface thread, a compute thread or a render thread. However, in the case of more complicated parallel algorithms, the manual creating and scheduling thread can lead to more complex code, longer development time and not optimal execution. The alternative is to program atop a concurrency platform — an abstraction layer of software that coordinates, schedules, and manages the multicore resources. Using thread pools is a parallel pattern that can provide some improvements. A thread pool is a strategy for minimizing the overhead associated with creating and destroying threads and is possibly the simplest concurrency platform. The basic idea of a thread pool is to create a set of threads once and for all at the beginning of the program. When a task is created, it executes on a thread in the pool, and returns the thread to the pool when finished. A problem is when the task arrives and the pool has no thread available. The pool then suspends the task and wakes it up when a new thread is available. This requires synchronization such as locks to ensure atomicity and avoid concurrency bugs. Thread pools are common for the server-client model but for other tasks, scalability and deadlocks still pose problems. This calls for concurrency platforms with higher levels of abstraction that provide more scalability, productivity and maintainability. Some examples are OpenMP, Intel Threading Building Blocks, and Cilk++ . OpenMP (Open Multiprocessing) [25] is an open concurrency platform with support for multithreading through compiler pragmas in C, C++ and Fortran. It is an API specification and compilers can provide different implementations. OpenMP is governed by the OpenMP Architecture Review Board (ARB). The first OpenMP specification came out in 1997 with support for Fortran, followed by C/C++ support in 1998. Version 2.0 was released in 2000 for Fortran and 2002 for C/C++. Version 3.0 was released in 2008 and is the current API specification. It contains many major enhancements, especially the task construct. Most recent compilers have added some level of support for OpenMP. Programmers can inspect the code to find places that require parallelization and insert the pragmas to tell the compiler to produce multithreaded code. This makes the code have a fork-join model, in which when the parallel section has finished; all the threads join back the master thread. Workloads in one loop are given to threads using work-sharing. There are four kinds of loop workload scheduling in OpenMP: Static scheduling, each thread is given an equal chunk of iterations. Dynamic scheduling, the iterations are assigned to threads as the threads request them. The thread executes the chunk of iterations (controlled through the chunk size parameter), then requests another chunk until there are no more chunks to work on. Guided scheduling is almost the same as dynamic scheduling, except that for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations, divided by the number of threads, decreasing to 1. For a chunk size of “k” (k > 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations. Runtime scheduling, if this schedule is selected, the decision regarding scheduling kind is made at run time. The schedule and (optional) chunk size are set through the OMP_SCHEDULE environment variable. Beside scheduling clauses, OpenMP also has clauses for data sharing attribute, synchronization, IF control, initialization, data copying, reduction and other concurrent operations. A typical OpenMP parallelized loop may look like: #pragma omp for schedule(dynamic, CHUNKSIZE) for(int i = 2; i <= N-1; i++) for(int j = 2; j <= i; j++) for(int k = 1; k <= M; k++) b[i][j]+=a[i-1][j]/k+a[i+1][j]/k; Figure 9 OpenMP fork-join model. Intel’s Threading Building Blocks (TBB) [4] is an open source C++ template library developed by Intel for writing task based multithreaded applications with ideas and models inherited from many previous languages and libraries. While OpenMP uses the pragma approach for parallelization, TBB uses the library approach. The first version came out in August 2006 and since then TBB has seen widespread use in many applications, especially game engines such as Unreal. TBB is available with both a commercial license and an open source license. The latest version 2.2 was introduced in August 2009. The library has also received Jolt Productivity award and InfoWorld OSS award. It is a library based on generic programming, requires no special compiler support, and is processor and OS independent. This makes TBB ideal for parallelizing legacy applications. TBB has support for Windows, Linux, OS X, Solaris, PowerPC, Xbox, QNX, FreeBSD and can be compiled using Visual C++, Intel C++, gcc and other popular compilers. TBB is not a thread-replacement library but provides a higher level of abstraction. Developers do not work directly with threads but tasks, which are mapped to threads by the library runtime. The number of threads are automatically managed by the library or set manually by the user, just like the case with OpenMP. Beside basic loop parallelizing parallel_for constructs, TBB also have parallel patterns such as parallel_reduce, parallel_scan, parallel_do; concurrent data containers including vectors, queues and hash map; scalable memory allocator and synchronization primitives such as atomics and mutexes; and pipelining. TBB parallel algorithms operate on the concept of blocked_range, which is the iteration space. Work is divided between threads using work stealing, in which the range is recursively divided into smaller tasks. A thread work on the task it meets depth first and steals the task breadth first following the principles: do task from own queue (FIFO) steal task from another queue. This can be applied to one, two or three-dimensional ranges, allowing for effective blocking on data structures with dimensions greater than one. Task stealing also enables better cache utilization and avoid false sharing as much as possible. The figures below illustrate the mechanism behind task stealing. Figure 10 TBB’s range split Figure 11 TBB’s task stealing illustration Normally, TBB calls using function objects can be quite verbose as parallelization is done through libraries. However, with the new C++0x lambda syntax, TBB code can be much shorter and easier to read. Below is an example TBB call using C++0x lambdas: void ParallelApplyFoo(float a[], size_t n ){ parallel_for( blocked_range( 0, n ), [=](const blocked_range& range) { for(int i= range.begin();i!=range.end();++i) Foo(a[i]); }, auto_partitioner() );} } Another platform for multi-threading development is Cilk++. Cilk++ extends the C++ programming language with three new keywords: cilk_spawn, cilk_sync, cilk_for; and Cilk++ hyper objects which can be global variables but avoid data races. Parallelized code using Cilk++ can therefore be quite compact. void matrix_multiply(matrix_t A,matrix_t B,matrix_t C) { cilk_for (int i = 0; i < A.rows; ++i) { cilk_for (int j = 0; j < B.cols; ++j) { for (int k = 0; k < A.cols; ++k) C[i][j] += A[i][k] * B[k][j]; } } } Matrix multiplication implementation in Cilk++ However, this requires the Cilk++ compiler which extends on a standard C++ compiler such as Visual C++ or GCC. Cilk++ also uses work stealing like TBB for scheduling threads but with its own modification. Both are based on the work stealing method by the MIT Cilk project. Intel has recently acquired the Cilk++ company and Cilk++ will be integrated into Intel’s line of parallel programming tools together with TBB and other tools. 1.2.2 GPU programming with CUDA The first GPU platform we present is CUDA [16]. CUDA is Nvidia’s parallel computing architecture that allows programmer to program GPGPU applications on Nvidia’s graphics card architecture including Geforce, Quadro and Tesla. CUDA was introduced in the late 2006 and received a lot of attention when released, especially from high performance computing communities. It is now the basis for all other kinds of parallel programming tools on the Nvidia hardware. CUDA is now the most popular tool for programming GPGPU applications with various research papers, libraries and commercial development tools. Previous to CUDA, GPGPU was done using graphics API. Although speedup for this was great, there were several drawbacks that limits the growth of GPGPU. Firstly, the programmer is required to possess in-depth knowledge of graphics APIs and GPU architecture. Secondly, problems had to be expressed in terms of vertex coordinates, textures and shader programs, which results in greatly increased program complexity, low productivity and poor code maintenance. Thirdly, basic programming features such as random reads and writes to memory were not supported, greatly restricting the programming model and algorithms available. CUDA has to some extent made GPU programming more accessible to programmers with a more intuitive programming model and a good collections of libraries for popular tasks. Figure 12 CUDA processing flow [16] As CUDA is used for an architecture which supports massively data parallel programming, programming using CUDA is also different from traditional programming on the CPU. The processing flow for CUDA applications is shown in Figure 12. As the GPU does not have direct access to main memory, processing data must be copied from the main memory to GPU memory through the PCI Express lanes. The GPU is also not a totally self-control processor and needs the CPU to instruct the processing. The GPU then executes the data processing in parallel on its core using its hardware scheduler. When the processing is done, processed data is copied back the main memory. All the memory copy operations are much more expensive than computation operations so it is best to keep the processing on the GPU for as long as possible to avoid expensive memory copies. C for CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads. A kernel is defined using the __global__ declaration specifier and the number of CUDA threads for each call is specified using a new >> syntax. The first parameter for the >> call is the number of blocks and the second one is the block size. For example: // Kernel definition __global__ void VecAdd(float* A, float* B, float* C) { int i = threadIdx.x; C[i] = A[i] + B[i]; } int main() { // Kernel invocation VecAdd>>(A, B, C); } All files containing CUDA syntax must be compiled with the Nvidia nvcc compiler. It can be used together with both Visual C++, GCC and other compilers. All the CUDA threads run the same kernel when invoked. Threads are grouped into blocks of the same size and blocks together form a grid. Both grid and block can be one, two or three-dimensional. Through the CUDA extensions blockIdx, blockDim and threadIdx, the kernel implementer can get the co-ordinate of the thread in the grid and process the corresponding data element. Thread blocks are required to execute independently which means it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, enabling programmers to write code that is scalable with the increasing number of cores. The number of blocks in a grid is typically governed by the size of the data being processed rather than by the number of stream processors in the system, which it can greatly exceed. CUDA’s hierarchy of threads is mapped to the hierarchy of processors on the GPU. A GPU executes one or more kernel grids. A streaming multiprocessor (SM) executes one or more thread blocks, and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, greatly improved performance can be achieved by having threads in a warp execute the same code path and access memory in nearby addresses. Figure 13 shows an example of thread grids with each block in the grids containing executable threads. For a problem of certain size, the number of threads per block and the number of blocks and the grid and block dimension can be manually controlled by the programmer at kernel launch. Choosing these numbers appropriately can have great effects on execution speed. As threads are scheduled in warps of 32 threads each, the number of threads per block should be a multiple of 32. Each generation of processor also has maximum limitations on number of threads, number of threads per stream processor and number of blocks per stream processor. Careful considerations of these parameters can enable the kernel to utilize the full potential of the GPU. For example the G80 architecture can schedule up to 8 blocks and 768 threads per stream processor. If there are 64 threads per block, there are 12 blocks but only 8 blocks can go into the stream processor so only 512 threads is scheduled. With 256 threads per block, each stream processor can execute three blocks and achieve full capacity. For 1024 threads per block, even one block can not fit into a stream processor and cannot be scheduled. Choosing the right parameter for kernel launch is a difficult problem which has different solutions for each situation and optimal execution can only be obtained with some amount of experiments. Figure 13 Grid of thread blocks [8] The CUDA memory hierarchy is also different from that of traditional C programming with much more complexities. Each thread has a private local memory called registers. Registers are the fastest form of memory in the GPU but only accessible by the thread and has the life time of the thread. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Shared memory can be as fast as registers when there are no bank conflicts or when reading from the same address. Finally, all threads have access to the same global memory. Global memory is much slower than shared memory and registers, requiring 400 to 600 clock cycles for one access. As a result, programmers should strive to utilize registers and shared memory to increase bandwidth. When reading from global memory, coalesced memory access also help avoid performance penalties of global memory access. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. Constant memory is read only from kernels and is optimized for the case when threads read from the same memory location. Texture memory can offer the ability to cache global memory and interact with the dedicated graphics hardware of the GPU. In appropriate cases, using texture memory can provide considerable performance gain over global memory access. However, this depends on particular situations and using constant memory and texture memory effectively is still an expert topic in CUDA as careless uses of these advanced features can lead to even slower performance, erroneous execution or reduced code clarity. Figure 14 The CUDA Memory Hierarchy [8] The CUDA programming model also assumes that the host (CPU) and the device (GPU) maintain separate memory as host memory and device memory. A CUDA application has to manually manage all the memory related to the device including memory allocation and deallocation using cudaMalloc and cudaFree; and data transfer between host and device using cudaMemcpy. Since CUDA 2.2, there is also cudaHostAlloc which allows host memory to be into CUDA address space providing asynchronous transparent access to data without explicit copy using cudaMemcpy. A notable property that applications using CUDA have is the memory sharing between CUDA memory and OpenGL or DirectX memory. This makes it quite convenient for scientific simulations that can push computed results on GPU shaders directly to the display device by calling appropriate APIs. This feature is also useful in games and other graphic tasks. Further information about CUDA programming can be found in the Nvidia CUDA Programming Guide [8]. 1.2.3 Heterogeneous programming and OpenCL The next platform is OpenCL (Open Computing Language) [24]. It was proposed by Apple and has now become a standard maintained by the Khronos group just like OpenGL. The first version of OpenCL came out in December 2008. OpenCL shares many similarities with CUDA but support heterogeneous computing on a wide range of platforms, including Intel and AMD CPUs; Nvidia and ATI GPUs; Cell processors, FPGAs, digital signal processors and other platforms. At the time of this writing, there are OpenCL implementations on most major parallel platforms although features and performance still varies. OpenCL has a lower-level interface than CUDA, like the CUDA driver API. Programmers still write compute kernels but have to load them into the device command queue. This also makes OpenCL more flexible than CUDA as it can support both task parallelism and data parallelism. Data movements between the host and compute devices, as well as OpenCL tasks, are coordinated via command queues. Command queues provide a general way of specifying relationships between tasks, ensuring that tasks are executed in an order that satisfies the natural dependences in the computation. The OpenCL runtime is free to execute tasks in parallel if their dependencies are satisfied, which provides a general-purpose task parallel execution model. Tasks themselves can be comprised of data-parallel kernels, which apply a single function over a range of data elements, in parallel, allowing only restricted synchronization and communication during the execution of a kernel. OpenCL’s data parallelism is quite like CUDA. There are many similar concepts to CUDA but with different names. Instead of grid and blocks, the kernel is executed over a similar index space with equal workgroups. The memory hierarchy also share many similarities with CUDA. While the division of a kernel into work-items and work-groups supports data-parallelism, task parallelism is supported via the command queue. Different kernels representing different tasks may be enqueued. Dependency between tasks can also be specified. Multiple command queues can also be created. Figure 15 OpenCL data parallelism and task parallelism Below is an example OpenCL code in the Nvidia GPU Computing SDK for computing dot product: __kernel void DotProduct (__global float* a, __global float* b, __global float* c, int iNumElements) { // find position in global arrays int iGID = get_global_id(0); // bound check (equivalent to the limit on a 'for' loop for standard/serial C code if (iGID >= iNumElements) { return; } // process int iInOffset = iGID << 2; c[iGID] = a[iInOffset] * b[iInOffset] + a[iInOffset + 1] * b[iInOffset + 1] + a[iInOffset + 2] * b[iInOffset + 2] + a[iInOffset + 3] * b[iInOffset + 3]; } Compared to CUDA, OpenCL has both task parallelism and data parallelism while CUDA has limited support for task parallelism. However, this makes the OpenCL have a steeper learning curve. Writing portable OpenCL code with good performance on all architectures is still hard as there are many differences which call for platform specific optimizations. For example, while Nvidia hardware encourages the use of scalar code, AMD GPUs is designed for vector types such as float4 to get good performance. The efficiency of OpenCL model on CPUs versus CPU-only technologies such as OpenMP or TBB is a problem that needs further experiment. OpenCL drivers and support tools are also not as mature as established technologies. Support libraries is also a factor worth considering. While CUDA has cuBlas, cuFFT and many other third-party libraries, OpenCL has virtually none of these. As a result, OpenCL has not seen widespread adoption and the market share is still low compared to CUDA and other technologies. The mixture of data parallelism and task parallelism is also a high learning curve for programmers coming from other homogeneous parallel platforms. However, OpenCL is an open standard and in the long term, with more quality implementations, better support and the natural industrial trend towards open standards, it should be able to gain more acceptances and receive broader use than now. It is also highly possible that new parallel frameworks will be built on top of OpenCL to offer higher level parallel programming as well as provide additional utilities for various common tasks such as scientific computing, image processing, etc. Such frameworks are being considered by most major parallel software providers and would provide a great boost to OpenCL adoption. Third-party libraries may also see big improvements as well. Nevertheless, understanding OpenCL at its current level would make the programmer use these future tools more efficiently. Chapter 2. The Forward Problem in Resistivity Tomography 2.1 Inversion theory The inverse problem is a common problem found in many sciences. It has an extremely important role in most geophysical methods. The definition of the forward and inverse problem can be described as followed: Forward problem: model { model parameters, sources } data Inverse problem: { data , sources }model{ model parameters } In the case of our resistivity tomography problem, m is the resistivity model of the underground while d is the measured voltage on the surface of the earth and s is the DC electrical current injected into the earth through electrodes. The forward modeling problem therefore determines the potential that would be observed over a given subsurface structure. Figure 16 Electrical forward and inverse problems. While the forward problem has a unique solution with each model and sources, it is different for the inverse problem. When solving an inverse problem, three important questions must be answered: Does the solution exist? Is it unique? Is it stable? The question of the solution's existence is related to the mathematical formulation of the inverse problem. From the mathematical point of view, there could possibly be no adequate numerical model from the given model set which would fit our observed data. From the geophysical point of view, however, there should be some certain solution, since we study real geological structures of the earth's interior. The question of the uniqueness of the solution can be illustrated by the following formulae. Assume that we have two different models, and, and two different sources, and, which generate the same data with the forward modeling operator: , . In this case, it is impossible to distinguish these two models from the given data. That is why the question of uniqueness is so important in inversion. The question of solution stability is also a crucial one. Real geophysical data are always contaminated by some noises. The question is whether the different between responses from different models is larger than the noise level. If the answer is not, it is impossible to distinguish these models from each other based on our data. According to the French mathematician Hadamard [20], if all questions raised above have positive answer, the problem is said to be well-posed. Problems which are not well-posed are called ill-posed by Hadamard. Ill-posed problems may not have a solution or the solution is not unique or if a small change in the observed data would cause a large perturbation in the solution of the problem. Hadamard considered ill-posed problems not mathematically or physically meaningful. However, most scientific problems, including geophysical ones are ill-posed and it was subsequently found out that these problems are still meaningful and can be solved [26]. Foundations of the theory of ill-posed problems were developed by the Russian mathematician Andrei Tikhonov [13] in the middle of the 20th century. Tikhonov who is best known for his work on regularization of inverse problems also worked on geophysical problems. He explains in detail how to solve the resistivity tomography problem in a simple case of 2-layered medium. During the 1940s he collaborated with geophysicists and without the aid of computers they discovered large deposits of copper. As a result they were awarded a State Prize of Soviet Union. An in-depth book on the inverse problem topic in geophysics is [26]. With all common methods for solving the inverse problem, the forward problem has to be solved many times. It is therefore necessary to develop the forward module which can produce results with reasonable precision in acceptable time. 2.2 The geophysical model The resistivity of a material is defined as the resistance in ohms between the opposite faces of a unit cube of the material. The SI (Système International) unit of resistivity is ohm-metre (ohm.m). If resistivity is know, resistance can be computed by taking the integral of the formula: , with R being resistance, L being the length of the material and being the cross-sectional area. In exploration geophysics, resistivity is a very important physical parameter that provides information about the mineral content and physical structure of rocks, and also about fluids in the rocks. Different materials have different resistivity value. The composition of the underground can therefore be studied by measuring the underground resistivities. Electrical conductivity is the reciprocal (inverse) of electrical resistivity, and has the SI units of Siemens per meter (S.m-1). . For the simplest case with a homogenous subsurface and a single point current source on the ground surface, the electrical potential varies inversely with the distance from the source (Figure 17): . The equipotential surfaces for this case are spherical surfaces radiating from the current electrode position. Figure 17 Homogenous subsurface with single point current source. With two point sources one positive and one negative, we get , where and are distances of the point from the first and second current electrodes. In most surveys, the potential difference between two points is measured. Two emitting electrodes, one positive, one negative, are used to inject the current into the ground. Two electrodes are receiving electrodes. The potential difference between them can be measured using a volt meter. The most used material for electrodes is stainless steel but others, such as copper and brass, are also used. A typical arrangement with four electrodes is shown in Figure 18. The potential difference in the case of four electrodes over a homogenous half space is . (2.1) Figure 18 shows a four electrode arrangement with the current source, four electrodes A, B, M, N (A is the positive source, B is the negative source, M and N are receiving electrodes), current meter, volt meter together with current flows and equipotential surfaces. If the half space is not homogenous, the value of computed from the above equation is the apparent resistivity. Figure 18 A four electrode arrangement. There are many types of electrode array. [6] provides details about each type of common electrode arrays. A few electrode configurations are shown on Figure 19. Figure 19 Example surface electrode configurations [11]. Different arrays have different depth of investigation, sensitivity to vertical and horizontal changes in the subsurface resistivity, horizontal data coverage and signal strength. It is therefore up to the geophysicists to choose the appropriate electrode array depending on survey objectives, spatial variability of electrical properties, access to suitable electrode site and equipment availability. Figure 20 shows the difference in received signal between some electrode array types. Contours show relative contributions to the signal from unit volumes of homogeneous ground. Dashed lines indicate negative values. Figure 20 Signal contribution sections for a) Wenner array b) Schlumberger array c) dipole-dipole arrays [7]. A reasonable first reaction to the measuring of resistivities shown in this figure is that useful resistivity surveys are impossible, as the contributions from regions close to the electrodes are very large. However, the variations in sign imply that a conductive near-surface layer will in some places increase and in other places decrease the apparent resistivity. In homogeneous ground these effects can cancel quite precisely. When a Wenner or dipole–dipole array is expanded, all the electrodes are moved and the contributions from near-surface bodies vary from reading to reading. With a Schlumberger array, near-surface effects vary much less, provided that only the outer electrodes are moved, and for this reason the array is often preferred for depth sounding. However, offset techniques allow excellent results to be obtained with the Wenner. The electrode array is moved together along the profile and the apparent resistivity is measured. Data points are collected for each electrode position. Data points at deeper layer are measured by repeating for array with larger width parameters. In the past, geophysicists had to measure each data point manually but by the mid- to late 1990s, most instrument manufacturers were able to offer multiplexed instruments, and in some cases multi-measurement channel hardware, thus enabling reduced survey time. Autonomous data acquisition systems are emerging and offer great potential for remote monitoring. From the collected data, a pseudo-section of the underground apparent resistivity can be constructed. With each measurement, the potential difference is measured and the apparent resistivity is calculated using equation (2.1). This apparent resistivity provides a datum point. After all measurements are completed, we can produce an apparent resistivity model of the underground (Figure 21). Figure 21 An example pseudo-section [11]. 2.3 The forward problem by differential method Fully analytical methods have been used for simple cases of the forward modeling problem, such as a sphere in a homogenous medium or a vertical fault between two areas each with a constant resistivity. However, this can not be applied to general cases. Here we present the forward problem using differential equations. The input of the forward algorithm is a discretized mesh model of the subsurface in which each mesh cell is assigned a resistivity value. The most common model is a rectangular mesh while triangular meshes have also been proposed. For the general 3D case the partial differential equation to be solved is , (2.2) where is the point of current injection. Figure 22 An example 3D mesh [5]. The boundary conditions are V (x, y, z) must be continuous across each element boundary of the contrasting physical property distribution of σ (x, y). the potential is zero at infinity , there is no current leak into the air which means at the horizontal boundary . The equation can be solved using finite difference or finite element method with the above boundary conditions. The boundary condition at infinity is satisfied by expanding the mesh to a distance far enough. The result is a linear system of the form , which can be solved using sparse linear solvers as is a sparse matrix. Details on getting this linear system can be found in [9]. For the 2D case, the resistivity distribution is assumed to be constant in the direction. Figure 23 A 2-D Mesh [5] Applying the Fourier transform , to equation (2.2) yields (2.3) with being the wave number. This equation can be solved using the same method as (2.2). Once the result is obtained, the approximated inverse Fourier transform , is applied to get back ( is the number of wave numbers, are the summation weights for each wave number). An interesting problem is that although the wave number has to be theoretically continuous, in reality, we can only solve (2.3) for a limited number of wave numbers. The number of wave numbers should be minimized to speed up the computation but the accuracy must not be affected. A method for getting the optimal wave numbers and their Fourier inverses is detailed in [10]. The idea is to get the wave numbers from cases where the analytical solution to the forward problem equation is known. Chapter 3 Software Implementation 3.1 Algorithms for solving large linear systems For solving large linear systems, both iterative methods and direct methods can be used. While direct methods such as Gaussian elimination or LU decomposition give us an exact solution after a finite number of operations, iterative methods start with an initial solution and improve upon it after each iteration, obtaining a reasonably good approximation of the solution after a number of iterations. For dense matrices, both direct and iterative methods are commonly used. With large sparse matrices, however, direct methods can become more costly in both memory footprint and implementation complexity as new nonzero elements can appear when using direct methods causing problems for storage and computation. As a result, iterative solvers are more common place for sparse linear systems. Iterative methods are also easier for fine-grained parallelization. In our case, we use the conjugate gradient method, a common iterative method for solving our linear systems. The Wikipedia article [14] provides a good amount of details on this method. Here we only present an overview of the conjugate gradient method. Although all iterative methods obtain the results through a number of iterations, there are many differences in how the iterations are carried out. For the family of relaxation methods, the result of an iteration only depends on the result of the previous one. The conjugate gradient method and some other methods can use the result of all previous iterations for producing the result of an iteration for faster convergence. Proper treatment of the theory behind iterative methods is a complicated subject. Below is the conjugate gradient method for solving the system as presented in [14]: Choose initial guess (possibly the zero vector) for if is less than some tolerance then stop end for. To increase convergence, a preconditioner is often used. This is called the preconditioned conjugate gradient method. The preconditioner is a matrix that can provide an approximation of the matrix. There are many types of preconditioners, including diagonal, incomplete Cholesky and incomplete LU. The above algorithm then becomes: Choose initial guess Solve for for if is less than some tolerance then stop Solve for end for. 3.2 CPU implementation For each wave number, a large number of linear systems have to be solved. In theory, this equals the number of data points which means that for each electrode arrangement, a linear system must be solved. This is the common method in most research articles [9], [10]. However, according the superposition property of the electrical field, the electrical potential caused by two electrodes is the sum of the two potential caused by each electrode: , where is the potential caused by the positive electrode and is the potential caused by the negative electrode. As a result, the potential has to be calculated only for each electrode position and the results for each data point can be produced from the calculated results for single electrodes using equation (2.1). This means the number of linear systems we have to solve now only equals the number of electrodes, not the number of data points. For example, a survey with 100 electrodes can have more than 900 data points for a pseudo-section depth of 10. With this notice, we can reduce the number of needed computations by a large magnitude. Although the number of linear systems is large, they can be solved independently. Parallelization can therefore be done on a coarse-grain scale. In our software, Intel Threading Building Blocks [4] is used for parallelizing the code as this library is portable across platforms, provides good load balancing and has good documentation with a large user community. The loop over all the linear systems is parallelized using TBB’s parallel_for. We use the Matrix Template Library 4 by Open Systems Lab at Indiana University for solving each linear system. Running time for typical test cases are quite optimistic as near linear scale can be achieved. On our Core 2 Quad 9400 test system, for example, a speedup of 3.9x is observed. 3.2 Example Results Here we present some results for common underground model using the Wenner alpha electrodes array. Each example contains two picture. The picture on top is the model. The picture below is the result after running the forward modeling module. Surface Block Environment Figure 24 Example 1 Surface Environment Area with higher resistivity Figure 25 Example 2 Surface Block 1 Block 2 Environment Figure 26 Example 3 Surface Figure 27 Example 4 3.3 GPU Implementation using CUDA For the 2-D case, the number of electrodes is often less than 1000. The above multi-core CPU implementation can therefore be adequate for 2-D surveys. However, for 3-D surveys, the number of electrodes can be a few thousands or more and the 3-D mesh is significantly larger than the 2-D mesh. It is therefore essential to get more performance increase for large data sets. Traditionally, this has been done using clusters, which are expensive and inconvenient for ordinary users. At the moment, we are experimenting with porting computationally expensive parts of the program to run on the GPU. It can be seen form chapter 1 that GPU has a very different model of execution from CPU. The memory hierarchy is also different. This lead to more work in porting a program from CPU to GPU. Choosing the right matrix format is an important first step. There are a few common sparse matrix formats to choose from. For dense matrices, most elements are nonzero and the storage format often stores all the matrix elements in column-major or row-major form. For sparse matrices, because of the fact that a large number of elements are zero, it is more efficient to store only the nonzero ones to save space and increase computation speed. This requires for more complicated storage scheme than dense matrices. Some common sparse matrix formats are Compressed Sparse Row (CSR) format: a vector contains the entire matrix elements in which the elements in the same row are next to each other, rows are placed according to their position in the matrix. A second vector contains the columns of the elements. The third vector contains the start of each row in the first vector. For example the matrix has the CSR presentation . Diagonal format: every diagonal of the matrix that has a non zero is stored. Zeros are padded in so every diagonal has the same length. This is most suitable for matrices produced by finite differencing. For example, the matrix has the diagonal presentation . ELL format: every row is assumed to have no more than a fixed number of non-zeros. Zeros are padded in if the row has fewer non-zeros. The column of each element is also explicitly stored. For example, the matrix has the ELL presentation: . COO format: both the coordinate and the value of the element are stored. For example, the matrix has COO presentation . Many other formats have also been proposed. Some are hybrid of the existing types, some are completely new format. There are many other types of matrix data formats. Finding the appropriate format for specific parallel architectures requires a much deeper understanding of the inner of hardware and parallel algorithms. This should be an interesting subject for further research. In the CPU implementation, we use the CSR format as it is the most common in matrix libraries and supported by many solvers. However, this format is inefficient for GPU as memory access in the matrix-vector multiplication kernel is not contiguous, which significantly slow down GPU computation. As we use finite difference in our mesh, we use the diagonal format to enable contiguous memory access. We have also applied some common optimizations for GPU kernels. Attention has been paid to the use of shared memory within GPU threads in the same block to avoid access to the slower global memory. This alone proves to be quite effective in optimizing for GPU. We also tried to keep the processing on the GPU for as long as possible to avoid long memory transfer time between GPU and CPU. For small cases, grouping smaller matrices into larger ones can help get more bandwidth as the bandwidth is often under-utilized for these cases. Porting the whole computation part of the program to GPU is an on-going task that requires more time to complete. Below are preliminary results for comparison between CPU and GPU conjugate gradient method with diagonal preconditioner after some optimization on the GPU kernels. The CPU code is run on a Core 2 Quad Q9400 CPU (2.66 GHz, 6 MB cache). The GPU code is run on an MSI Nvidia 9800GT GPU (600 MHz core clock, 1375 MHz shader clock, 512 MB GDDR3). The time is measured for 2-D survery datasets. However, with 3200 electrodes, the amount of computation is quite the same as large 3-D datasets. Memory copy operation between CPU and GPU is already included in the GPU execution timing to provide real comparison between GPU time and CPU time. Figure 28 Running time for different implementations (CPU sequential time is not measured after 800 electrodes because of long running time). It can be seen that for large datasets, the GPU shows up to more than 6x speed up over a 4-core CPU. This is quite promising for an old GPU such as the Geforce 9800. The GPU also has a 75W TDP compared to the 120W TDP of the CPU. The speed up can be improved in the future as there is still much room for improvement in our GPU kernel. GPUs are also getting much faster with each generation with better compilers and tools. GPU computing can therefore be a viable option for processing large 3-D survey datasets with a great amount of speed up over CPU computing. We are currently working on a multi-GPU implementation. This requires more research into the problem of load balancing between different GPUs. Using a cluster with multi-GPU nodes is also an interesting research area which has the potential to solve larger problems at lower cost and power consumption. CONCLUSION In the first phase of our project, we have developed the forward module for the resistivity tomography problem. Our implementation can run on multi-core CPUs and GPUs. Future versions may see new modules for resistivity inversion and other geophysical features. We are also working on making our code more efficient and more scalable over different kinds of architecture. References [1] Bik A. J. C. Software Vectorization Handbook. Intel Press,2005. [2] Gerber R., Bik A. J. C., Smith K. B., Tian X. Software optimization Cookbook, 2nd Edition. Intel Press, 2006. [3] FASTRA II, ( [4] Intel Threading Building Blocks ( [5] Loke M. H. GeoTomoSoftware ( [6] Loke M. H. Tutorial: 2-D and 3-D electrical imaging surveys. ( [7] Milton J. Field Geophysics. Wiley, 2003. [8] Nvidia CUDA Programming Guide ( ) [9] Pidlisecky A., Haber E., Knight R. RESINVM3D: A MATLAB 3-D Resistivity Inversion Package, Geophysics 72, 2006. [10] Pidlisecky A., Knight R. FW2_5D: A MATLAB 2.5-D electrical resistivity modeling code. Computer and Geosciences 34, 2008. [11] Rubin Y., Hubbard S. S. Hydro geophysics (Water Science and Technology Library). Springer, 2006. [12] Sutter H. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, March 2005. [13] Wikipedia. Andrey Nikolayevich Tychonoff ( [14] Wikipedia. Conjugate gradient method ( [15] Wikipedia. Cray-1( [16] Wikipedia. CUDA ( [17] Wikipedia. Electrial Resistivity Imaging ( [18] Wikipedia. Exploration Geophysics ( [19] Wikipedia. Grid computing ( [20] Wikipedia. Hadamard ( [21] Wikipedia. IBM Roadrunner ( [22] Wikipedia. Instruction Pipeline ( [23] Wikipedia. Itanium ( [24] Wikipedia. OpenCL ( [25] Wikipedia. OpenMP ( [26] Zhdanov M. S. Geophysical Inverse Theory and Regularization Problems (Methods in Geochemistry and Geophysics). Elsevier Science, 2002. [27] Zhdanov M. S. Geophysical Electromagnetic Theory and Methods (Methods in Geochemistry and Geophysics). Elsevier Science, 2009.

Các file đính kèm theo tài liệu này:

  • docNguyen Hoang Vu_K51KHMT_Khoa luan tot nghiep dai hoc.doc