  • Commands for compiling C and C++ source files: pgcc and pgCC, respectively
  • Command for compiling Fortran 77 source files: pgf77
  • Commands for compiling Fortran 90 source files: pgf90, pgf901, pgf902, pgf90_ex, pgf95, and pgfortran
  • Combined with the NVIDIA CUDA libraries, these can compile CUDA Fortran programs
  • END PROGRAM hello

    cndaqiang@girl:~/code/cuda$ pgf90 hello.f90 -o hello
    cndaqiang@girl:~/code/cuda$ ./hello
     hello
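
    A minimal hello.f90 consistent with the commands and output above would be (a sketch, not the original listing):

    PROGRAM hello
    IMPLICIT NONE
    WRITE(*,*) "hello"
    END PROGRAM hello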

    Compiling a CUDA example

    cndaqiang@girl:~/code/cuda$ cat example.cuf 
    !attributes(global): a device subroutine; global means it can be launched from the host
    attributes(global) subroutine hellocuda()
        IMPLICIT NONE
        INTEGER :: i,d,j
        i = BlockIdx%x  !block index
        d = BlockDim%x  !block size (threads per block)
        j = ThreadIdx%x !thread index
        WRITE(*,*) "Block",i,"Thread",j,"In",d
    ENDSUBROUTINE
    PROGRAM main
    USE cudafor
    IMPLICIT NONE
    INTEGER :: i
    call hellocuda<<<2,3>>>()
    i=cudaDeviceSynchronize() !requires USE cudafor
    !control returns to the CPU as soon as hellocuda is launched
    END PROGRAM
    cndaqiang@girl:~/code/cuda$ pgfortran example.cuf && ./a.out 
     Block            2 Thread            1 In            3
     Block            2 Thread            2 In            3
     Block            2 Thread            3 In            3
     Block            1 Thread            1 In            3
     Block            1 Thread            2 In            3
     Block            1 Thread            3 In            3
    

    Because control returns to the CPU as soon as the GPU kernel hellocuda is launched, i=cudaDeviceSynchronize() is added so that the host waits for the GPU output before continuing.

  • Host: the CPU and its memory.
  • Device: the GPU and its memory.
  • CPU code: an implementation that uses only the CPU, i.e. code executed on the host.
  • Kernel: a subroutine called from the host and executed on the device, e.g. attributes(global) subroutine hellocuda().
  • Launching a kernel: calling a device function from the host, e.g. CALL hellocuda <<<2,3>>>([arguments]),
    where the chevrons give <<<number of thread blocks, threads per block>>>, i.e. <<<grid,block>>>.
  • Subroutines are defined as attributes(attribute) subroutine funname(); when the attribute is:
    global: executes on the device, called from the host, e.g. attributes(global) subroutine hellocuda(); the launch specifies a grid with <<<>>>.
    device: executes on the device, called from the device, e.g. attributes(device) subroutine hellocuda(); called directly, without <<<>>> (see the sketch after the note on device subroutines below).
    host: executes on the host, called from the host; this is the default when no attribute is written, e.g. subroutine hellohost(); called directly, without <<<>>>.
  • Global memory: its size is set by the GPU memory; all threads can access and modify it; the device variables passed to kernels reside in global memory.
    Shared memory: shared within a thread block; its lifetime is that of the block.
    Local memory: private to a single thread.
    (A sketch illustrating these memory types follows the pgaccelinfo output below.)
  • Device subroutines cannot be CONTAINed in the main program or in other subprograms.
  • Many of these functions and variables come from USE cudafor.
  • The CPU handles control; the GPU does the computation.
  • Querying device (graphics card) properties

    cndaqiang@girl:~/code/cuda$ cat checkcuda.cuf 
    PROGRAM checkcuda
    USE cudafor
    IMPLICIT NONE
    type(cudaDeviceProp) :: prop
    INTEGER :: nDevices=0,i,ierr
    ierr = cudaGetDeviceCount(nDevices)
    if(nDevices .EQ. 0) STOP
    DO i=0,nDevices-1
        WRITE(*,"('Device Number:',i0)") i
        ierr = cudaGetDeviceProperties(prop,i)
        WRITE(*,"('Device Name: ',a)") TRIM(prop%name)
        !compute capability major.minor, essentially a version number
        WRITE(*,"('Compute Capability: ',i0,'.',i0)") prop%major, prop%minor
        !number of multiprocessors
        WRITE(*,"('Number of Multiprocessors: ',i0)") prop%multiProcessorCount
        WRITE(*,"('Max Threads per Multiprocessor:',i0)") prop%maxThreadsPerMultiprocessor
        WRITE(*,"('Global Memory (GB):',f9.3)" ) prop%totalGlobalMem/(1024.0**3)
        WRITE(*,"(/,A)") "Execution Configuration Limits "
        WRITE(*,"('Max Grid Dims: ',i0,'x',i0,'x',i0)") prop%maxGridSize
        !maximum threads per block in each dimension
        WRITE(*,"('Max Block Dims: ',i0,'x',i0,'x',i0)") prop%maxThreadsDim
        WRITE(*,"('Max Threads per Block: ',i0)") prop%maxThreadsPerBlock
    ENDDO
    END PROGRAM
    cndaqiang@girl:~/code/cuda$ pgf90 checkcuda.cuf && ./a.out 
    Device Number:0
    Device Name: GeForce 940MX
    Compute Capability: 5.0
    Number of Multiprocessors: 3
    Max Threads per Multiprocessor:2048
    Global Memory (GB):    1.958
    Execution Configuration Limits 
    Max Grid Dims: 2147483647x65535x65535
    Max Block Dims: 1024x1024x64
    Max Threads per Block: 1024
    

    Concurrent threads = number of multiprocessors × max threads per multiprocessor (3 × 2048 = 6144 for this card).
    For this 940MX, with a one-dimensional grid the total number of threads can reach 2147483647 blocks × 1024 threads per block.
    The same information can also be obtained with pgaccelinfo.

    cndaqiang@girl:~/code/cuda$ pgaccelinfo 
    CUDA Driver Version:           10010
    NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  435.21  Sun Aug 25 08:17:57 CDT 2019
    Device Number:                 0
    Device Name:                   GeForce 940MX
    Device Revision Number:        5.0
    Global Memory Size:            2101870592
    Number of Multiprocessors:     3
    Concurrent Copy and Execution: Yes
    Total Constant Memory:         65536
    Total Shared Memory per Block: 49152
    Registers per Block:           65536
    Warp Size:                     32
    Maximum Threads per Block:     1024
    Maximum Block Dimensions:      1024, 1024, 64
    Maximum Grid Dimensions:       2147483647 x 65535 x 65535
    Maximum Memory Pitch:          2147483647B
    Texture Alignment:             512B
    Clock Rate:                    1189 MHz
    Execution Timeout:             Yes
    Integrated Device:             No
    Can Map Host Memory:           Yes
    Compute Mode:                  default
    Concurrent Kernels:            Yes
    ECC Enabled:                   No
    Memory Clock Rate:             2000 MHz
    Memory Bus Width:              64 bits
    L2 Cache Size:                 1048576 bytes
    Max Threads Per SMP:           2048
    Async Engines:                 1
    Unified Addressing:            Yes
    Managed Memory:                Yes
    Concurrent Managed Memory:     No
    PGI Default Target:            -ta=tesla:cc50
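
    To illustrate the memory types listed earlier (global, shared, local), here is a minimal kernel sketch; the names, sizes, and launch configuration are illustrative, not from the original notes:

    MODULE m_mem
    CONTAINS
    attributes(global) SUBROUTINE memdemo(a_d)
        IMPLICIT NONE
        INTEGER :: a_d(*)         !device dummy array: resides in global memory, visible to all threads
        INTEGER, shared :: s(64)  !shared memory: one copy per thread block, lives as long as the block
        INTEGER :: t              !ordinary local variable: private to each thread
        t = ThreadIdx%x
        s(t) = a_d(t)             !stage global-memory data in shared memory
        call syncthreads()        !wait until every thread of the block has written its element
        a_d(t) = s(t) + 1
    END SUBROUTINE
    END MODULE

    With INTEGER,device :: a_d(64) declared on the host, it would be launched as call memdemo<<<1,64>>>(a_d).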
    

    Setting the kernel execution configuration (grid)

    One-dimensional grid, one-dimensional block

    !a one-dimensional grid with M blocks, each block containing N threads
    call sub<<<M,N>>>([argument])
    
    TYPE(dim3) grid,block
    grid=dim3(2,2,2)
    block=dim3(1,1,1)
    call hellocuda<<<grid,block>>>()
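
    Putting the pieces together, a complete multi-dimensional launch might look like this sketch (kernel name and dimensions are illustrative):

    MODULE m_grid
    CONTAINS
    attributes(global) SUBROUTINE whereami()
        IMPLICIT NONE
        WRITE(*,*) "Block",BlockIdx%x,BlockIdx%y,"Thread",ThreadIdx%x,ThreadIdx%y
    END SUBROUTINE
    END MODULE
    PROGRAM testgrid
    USE cudafor
    USE m_grid
    IMPLICIT NONE
    TYPE(dim3) :: grid,block
    INTEGER :: ierr
    grid  = dim3(2,2,1)   !2x2x1 = 4 blocks
    block = dim3(4,2,1)   !4x2x1 = 8 threads per block
    call whereami<<<grid,block>>>()
    ierr = cudaDeviceSynchronize()  !wait for the kernel's output before the program ends
    END PROGRAM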
    

    The built-in grid variables inside a kernel are of type dim3:

        gx=GridDim%x !grid size, i.e. number of blocks (x)
        gy=GridDim%y !grid size, i.e. number of blocks (y)
        gz=GridDim%z !grid size, i.e. number of blocks (z)
        bx=BlockIdx%x !block index (x)
        by=BlockIdx%y !block index (y)
        bz=BlockIdx%z !block index (z)
        bDx=BlockDim%x !block size, i.e. threads per block (x)
        bDy=BlockDim%y !block size, i.e. threads per block (y)
        bDz=BlockDim%z !block size, i.e. threads per block (z)
        tx=ThreadIdx%x !thread index (x)
        ty=ThreadIdx%y !thread index (y)
        tz=ThreadIdx%z !thread index (z)
    

    To determine the absolute (global) thread index, one can use:

    bi =    BlockIdx%x +        &
            (BlockIdx%y - 1)*GridDim%x + &
            (BlockIdx%z - 1)*GridDim%y*GridDim%x
    ti = (bi-1)*BlockDim%x*BlockDim%y*BlockDim%z + &
            ThreadIdx%x +        &
            (ThreadIdx%y - 1)*BlockDim%x + &
            (ThreadIdx%z - 1)*BlockDim%y*BlockDim%x
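
    For the common one-dimensional case this reduces to ti = (BlockIdx%x-1)*BlockDim%x + ThreadIdx%x. A minimal sketch that uses it to fill a device array (names and sizes are illustrative):

    MODULE m_fill
    CONTAINS
    attributes(global) SUBROUTINE fill(a_d,n)
        IMPLICIT NONE
        INTEGER :: a_d(*)
        INTEGER,value :: n
        INTEGER :: i
        i = (BlockIdx%x-1)*BlockDim%x + ThreadIdx%x  !global thread index, 1-based
        IF (i <= n) a_d(i) = i                       !guard: the last block may have extra threads
    END SUBROUTINE
    END MODULE
    PROGRAM testfill
    USE cudafor
    USE m_fill
    IMPLICIT NONE
    INTEGER,PARAMETER :: n=1000
    INTEGER,device :: a_d(n)
    INTEGER :: a(n),ierr
    call fill<<<(n+255)/256,256>>>(a_d,n)  !enough 256-thread blocks to cover n elements
    ierr = cudaDeviceSynchronize()
    a = a_d                                !copy the result back to the host
    WRITE(*,*) a(1),a(n)
    END PROGRAM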
    

    When launching a device subroutine from inside a kernel, no execution configuration can be given (each thread simply calls the subroutine itself). The grid built-in variables are also available inside device subroutines; their values are those of the calling thread.
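
    A minimal sketch of a device subroutine called from a kernel, without an execution configuration (names are illustrative):

    MODULE m_dev
    CONTAINS
    attributes(device) SUBROUTINE saywho()
        IMPLICIT NONE
        !the built-in grid variables refer to the thread that makes the call
        WRITE(*,*) "Block",BlockIdx%x,"Thread",ThreadIdx%x
    END SUBROUTINE
    attributes(global) SUBROUTINE caller()
        IMPLICIT NONE
        call saywho()  !no <<<>>> here: every thread calls the device subroutine itself
    END SUBROUTINE
    END MODULE

    From the host this would be launched as call caller<<<2,3>>>() followed by i=cudaDeviceSynchronize().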

    Host/device code synchronization

    By default, host code and device code (global kernels) execute asynchronously:
    in the code below, control returns to the CPU as soon as CALL hellocuda <<<2,3>>>() is launched.

    a_d=1 
    CALL hellocuda <<<2,3>>>()
    

    Data transfers between the host and the device, such as a_d=1, are synchronous (blocking): the transfer happens only when both the CPU and the GPU have reached that point.
    Other synchronization methods:

  • Add i=cudaDeviceSynchronize() in the code; the host (CPU) program continues only after all device (GPU) code issued so far has finished.
  • Set the environment variable export CUDA_LAUNCH_BLOCKING=1 to make kernel launches blocking, so the CPU and GPU run in lockstep.
  • Passing arguments to kernels

    By default, arguments are passed by reference.
    A variable passed by reference must reside on the device, i.e. it is declared with the device attribute, e.g. INTEGER,device :: x_d(100).
    Only addresses of data on the GPU/device should be passed this way;
    the address of a host variable cannot be passed directly.
    When passing by value:
    a host variable can be passed; in that case the dummy argument inside the kernel is declared with the value attribute, e.g. INTEGER,value :: N.
    A kernel with value dummy arguments must be defined inside a MODULE, not as an external subroutine; otherwise the values received are essentially random and cause wrong results.

    Passing example: the external kernel fails to receive the value argument while the module kernel receives it correctly; device arguments are passed correctly in both cases.

    cndaqiang@girl:~/code/cuda$ cat cudavar.cuf
    attributes(global) SUBROUTINE whoIam(n,n_d)
    IMPLICIT NONE
    INTEGER,value :: N
    INTEGER :: N_d
    WRITE(*,*) "External", "N",N,"N_d",N_d
    END SUBROUTINE
    MODULE m_var
    CONTAINS
    attributes(global) SUBROUTINE whoIam_m(n,n_d)
    IMPLICIT NONE
    INTEGER,value :: N
    INTEGER :: N_d
    WRITE(*,*) "Module", "N",N,"N_d",N_d
    END SUBROUTINE
    END MODULE
    PROGRAM testSend
    use cudafor
    USE m_var
    IMPLICIT NONE
    INTEGER   :: N,i
    INTEGER,device :: N_d
    N=10   !initialize the host value so the by-value transfer can be checked
    N_d=10
    CALL whoIam<<<1,1>>>(N,N_d)
    CALL whoIam_m<<<1,1>>>(N,N_d)
    i=cudaDeviceSynchronize()
    END PROGRAM
    cndaqiang@girl:~/code/cuda$ pgfortran cudavar.cuf && ./a.out 
     External N    816657356 N_d           10
     Module N           10 N_d           10
    

    Download fastcuda.cuf, compile it, and run the test.
    When the array size exceeds about 1E5, the GPU (940MX) compute time is roughly one third of the CPU (i7-7500U) time.
    Assigning between different variables costs less than assigning a variable to itself.
    For assignment between different variables: GPU-GPU is faster than CPU-CPU.
    GPU<->CPU transfer time is roughly the GPUx->GPUy time plus the CPUx->CPUy time.
    The default GPUx->GPUy assignment is quite fast.

    cndaqiang@girl:~/code/cuda$ pgfortran fastcuda.cuf  && ./a.out 
     -----------------------
    N     100000      CPU build, WALL         1.10000 ms
    N     100000        CPU cal, WALL         0.25800 ms
    N     100000      GPU build, WALL         0.12800 ms
    N     100000        GPU cal, WALL         0.06400 ms
    N     100000       GPU->CPU, WALL         0.33300 ms
    N     100000       CPU->GPU, WALL         0.30300 ms
    N     100000     CPUy->CPUx, WALL         0.20800 ms
    N     100000     CPUy->CPUy, WALL         0.42600 ms
    N     100000     GPUy->GPUz, WALL         0.07100 ms
    N     100000     GPUy->GPUy, WALL         0.59500 ms
    N     100000                Error         0.00000
     -----------------------
    N  100100000      CPU build, WALL       910.17902 ms
    N  100100000        CPU cal, WALL       202.79601 ms
    N  100100000      GPU build, WALL        45.75900 ms
    N  100100000        GPU cal, WALL        59.73200 ms
    N  100100000       GPU->CPU, WALL       244.66000 ms
    N  100100000       CPU->GPU, WALL       251.89999 ms
    N  100100000     CPUy->CPUx, WALL       210.46001 ms
    N  100100000     CPUy->CPUy, WALL       495.70901 ms
    N  100100000     GPUy->GPUz, WALL        31.10200 ms
    N  100100000     GPUy->GPUy, WALL       499.45700 ms
    N  100100000                Error         0.00000
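
    fastcuda.cuf itself is not listed here; the timing pattern behind numbers like these is roughly the following sketch (names and sizes are illustrative, and a real benchmark would repeat and average). The key point is to call cudaDeviceSynchronize() before reading the clock, because device work is launched asynchronously:

    PROGRAM timing
    USE cudafor
    IMPLICIT NONE
    INTEGER,PARAMETER :: n=100000
    REAL :: x(n),y(n)
    REAL,device :: x_d(n),y_d(n)
    INTEGER :: c0,c1,rate,ierr
    x = 1.0
    CALL system_clock(count_rate=rate)
    CALL system_clock(c0)
    x_d = x                         !CPU -> GPU transfer (blocking)
    CALL system_clock(c1)
    WRITE(*,*) "CPU->GPU, WALL",1000.0*real(c1-c0)/real(rate),"ms"
    CALL system_clock(c0)
    y_d = x_d                       !GPU -> GPU assignment
    ierr = cudaDeviceSynchronize()  !make sure the device finished before reading the clock
    CALL system_clock(c1)
    WRITE(*,*) "GPUx->GPUy, WALL",1000.0*real(c1-c0)/real(rate),"ms"
    CALL system_clock(c0)
    y = y_d                         !GPU -> CPU transfer (blocking)
    CALL system_clock(c1)
    WRITE(*,*) "GPU->CPU, WALL",1000.0*real(c1-c0)/real(rate),"ms"
    END PROGRAM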
    

    Checking how long each part runs with nvprof ./a.out

    (python27) cndaqiang@girl:~/code/cuda$ nvprof ./a.out
                2            2            2
    ==2410== NVPROF is profiling process 2410, command: ./a.out
    ...program output
    ==2410== Profiling application: ./a.out
    ==2410== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:  100.00%  509.55us         1  509.55us  509.55us  509.55us  m_mycuda_hellocuda_
          API calls:   99.61%  264.75ms         1  264.75ms  264.75ms  264.75ms  cudaLaunchKernel
                        0.21%  549.89us         1  549.89us  549.89us  549.89us  cudaDeviceSynchronize
                        0.12%  318.01us        97  3.2780us     250ns  139.94us  cuDeviceGetAttribute
                        0.05%  123.14us         1  123.14us  123.14us  123.14us  cuDeviceTotalMem
                        0.02%  50.688us         1  50.688us  50.688us  50.688us  cuDeviceGetName
                        0.00%  2.9850us         3     995ns     267ns  2.0500us  cuDeviceGetCount
                        0.00%  2.1570us         2  1.0780us     348ns  1.8090us  cuDeviceGet
                        0.00%     464ns         1     464ns     464ns     464ns
    

    CUDA code errors

    Calling host (CPU) routines from device (GPU) code, e.g. CALL SLEEP(1) or CALL system("")

    PGF90-S-0155-Calls from device code to a host subroutine are allowed only in emulation mode - sleep (hellocuda.cuf: 11)
      0 inform,   0 warnings,   1 severes, 0 fatal for mycuda2
    

    Printing a device variable from the host, e.g. WRITE(*,*) a_d

    PGF90-S-0155-device data allowed in I/O statements only in emulation mode  (hellocuda.cuf: 45)
      0 inform,   0 warnings,   1 severes, 0 fatal for hellocuda
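
    The fix is to copy the device variable into a host variable first, for example (a minimal sketch; the names a and a_d are illustrative):

    INTEGER,device :: a_d
    INTEGER :: a
    a_d = 10
    a = a_d         !device -> host copy
    WRITE(*,*) a    !printing the host copy is allowed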
    

    Device code apparently cannot use formatted output: WRITE(*,"(A)") "Hello" fails; only list-directed WRITE(*,*) "Hello" works

    PGF90-S-0155-I/O statements allowed in device routines only in emulation mode  (hellocuda.cuf: 12)
      0 inform,   0 warnings,   1 severes, 0 fatal for mycuda
    

    Calling an attributes(device) subroutine from the host

    PGF90-S-0155-Kernel launch of attributes(device) subprogram is not allowed - gpufun (hellocuda.cuf: 43)
      0 inform,   0 warnings,   1 severes, 0 fatal for main
    

    When the .cuf extension is written as .f90, the CUDA functions are not recognized

    /home/cndaqiang/code/cuda/checkcuda.f90:8: undefined reference to `cudagetdevicecount_'
    

    Passing a host variable to a kernel; if it really must be passed, add the value attribute to the dummy argument, and the kernel must be defined inside a MODULE

    PGF90-S-0528-Argument number 1 to whoiam_m: device attribute mismatch (cudavar.cuf: 47)
      0 inform,   0 warnings,   1 severes, 0 fatal for testsend
    

    Ordinary PGI Fortran syntax errors

    Typing stray full-width (Chinese) characters after a line of code (a full-width space, exclamation mark, Chinese character, ...), e.g. `IMPLICIT NONE` followed by a full-width space versus plain IMPLICIT NONE; this only produces a warning and the code still runs

    PGF90-W-0025-Illegal character (E3) - ignored (hellocuda.cuf: 7)
    

    Typing a full-width comma instead of an ASCII comma

    PGF90-W-0025-Illegal character (EF) - ignored (hellocuda.cuf: 12)
    

    sqrt expects a real argument; calling it with an integer, as in sqrt(10), causes an error

    PGF90-S-0038-Symbol, sqrt, has not been explicitly declared (fatercuda.cuf)
      0 inform,   0 warnings,   1 severes, 0 fatal for fast
    
     