The calculation of transmission is the longest part of the radiative transfer model. One way to implement it on a GPU is to split the workload so that each thread computes the results for one level. Each CUDA thread would in effect perform a matrix-vector multiplication between coefficients and predictors; in BLAS terms, this is a batched SGEMV operation. Since there are only 101 levels, there would be only 101 threads. Compared with the more than 30,000 active threads available on the Tesla C1070, this thread count is extremely low for the CUDA execution model. NVIDIA recommends that the number of threads per block be a multiple of 32 [30], so 101 threads could effectively use at most 4 of the 30 multiprocessors. Because this division of labor is inefficient, we next describe in detail implementations that divide the work so that each thread computes the results for one channel.

3.6 Implementing the GPU version using six CUDA kernels

We implement the radiance calculation on a GPU using six kernels: five for the transmission calculation and one for the radiance calculation. Our GPU implementation of the radiative transfer transmission model is shown in Figure 6, which depicts what happens inside each of the 30 multiprocessors of a Tesla C1070 during the layer-space transmission calculation. The figure also illustrates the use of shared memory to store the predictors for the effective optical layer depths. The longest part of the radiative transfer model is the dot products in the layer-space transmission calculation, as shown in Figure 3: the dot product between C, the regression coefficients used to predict the effective optical depths of the layers, and X, the predictors for the effective optical layer depths. […] execution, the precalculated values Sqp are stored in shared memory. Sqp represents the square root of an oblique-path gas amount for a layer.
For each level, 101 threads participate in transferring the values of X, the predictors for the effective optical layer depths of the fixed gases, into shared memory, and the threads are then synchronized. After that, the dot product between the coefficients and the values of X is calculated. This operation is the longest in the kernel because of the large number of global memory accesses. The memory access pattern for the coefficients used in the dot product is arranged so that consecutive threads access consecutive memory addresses; this coalesced pattern maximizes memory bandwidth utilization. Similar kernels are used for the other transmission components; for brevity we do not describe them in this article.
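The per-channel dot product and the coalesced coefficient layout described above can be sketched as plain C showing the work of one thread. This is a hedged illustration under our own assumptions: the predictor-major layout coeff[j * nchan + chan] and the parameter names are not from the paper, but with that layout consecutive channels (threads) read consecutive addresses at each step j, which is exactly the coalesced pattern the text describes.

```c
/* Work of one CUDA thread in the per-channel mapping: thread `chan`
 * forms the dot product between its channel's regression coefficients
 * and the predictor vector X held in shared memory.
 *
 * Assumed layout (illustrative): coefficients stored predictor-major,
 * coeff[j * nchan + chan], so at each loop step j the threads
 * chan = 0, 1, 2, ... touch consecutive memory addresses. */
float predict_optical_depth(const float *coeff, int nchan, int chan,
                            const float *x_shared, int npred)
{
    float acc = 0.0f;
    for (int j = 0; j < npred; ++j)
        acc += coeff[j * nchan + chan] * x_shared[j];
    return acc;
}
```

On the GPU this loop body is what each thread executes after the predictor load and the barrier synchronization; the shared-memory reads of x_shared are broadcast to all threads of a warp.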