[Solved] Nvidia Tesla T4 tensor core benchmark [closed]

Accepted Answer

This might be more of an extended comment, but hear me out ...

As pointed out in the comments, the CUDA Samples are not meant as performance measuring tools.
The second benchmark you provided does not actually use tensor cores; it just issues a normal multiply-add instruction that executes on the FP32 or FP64 cores.

for (int i = 0; i < compute_iterations; i++) {
    // plain multiply-add on the regular cores, no tensor core instruction
    tmps[j] = mad(tmps[j], tmps[j], seed);
}

On a Turing T4, this gives me a peak of 7.97 TFLOPS for single-precision operations, very close to the theoretical limit of 8.1 TFLOPS.
For half-precision operations I get 16.09 TFLOPS, which is, as expected, about double the single-precision performance.

Now, on to Tensor cores. As the previously mentioned benchmark does not use them, let’s look for something that does.
CUTLASS (https://github.com/NVIDIA/cutlass) is a high-performance matrix-matrix multiplication library from NVIDIA.
It ships with a profiling application that covers all of its kernels. If you run this on a T4, you should get output like this:

Problem ID: 1

   Provider: CUTLASS
   OperationKind: gemm
   Operation: cutlass_tensorop_h1688gemm_256x128_32x2_nt_align8

      Status: Success
Verification: ON
 Disposition: Passed

 reference_device: Passed
      cuBLAS: Passed

   Arguments: --gemm_kind=universal --m=1024 --n=1024 --k=1024 --A=f16:column --B=f16:row --C=f16:column --alpha=1  \
              --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f16 --cta_m=256 --cta_n=128  \
              --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75  \
              --max_cc=1024

       Bytes: 6291456  bytes
       FLOPs: 2149580800  flops

     Runtime: 0.0640419  ms
      Memory: 91.4928 GiB/s

        Math: 33565.2 GFLOP/s

As you can see, we are now actually using tensor cores and half-precision operations, with a performance of roughly 33.5 TFLOPS. This might not reach 65 TFLOPS, but for a kernel you can use in a real-world application, that is pretty good.
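
To put the three measurements side by side, a small sketch like the following can help. The input numbers are the ones quoted in this answer; the helper names are only illustrative.

```cpp
// Ratio of a measured throughput to a reference value (same units for both).
constexpr double ratio(double measured, double reference) {
    return measured / reference;
}

// Values quoted above, in TFLOPS.
constexpr double kTensorCoreGemm = 33.5652;  // CUTLASS h1688gemm result
constexpr double kHalfNoTensor   = 16.09;    // half-precision multiply-add loop
constexpr double kTensorPeak     = 65.0;     // peak figure the question refers to

// Fraction of the 65 TFLOPS figure reached by the CUTLASS kernel.
constexpr double kFractionOfPeak = ratio(kTensorCoreGemm, kTensorPeak);

// Speedup of the tensor core GEMM over the plain half-precision loop.
constexpr double kSpeedupOverHalf = ratio(kTensorCoreGemm, kHalfNoTensor);
```

With these numbers the CUTLASS kernel reaches a bit over half of the 65 TFLOPS figure and roughly twice the throughput of the half-precision loop that does not use tensor cores.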