CUDA Performance Profiling Tools: What the nvprof/ncu --metrics Parameters Mean

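Summary

I could not find a reasonably complete Chinese-language reference for the ncu --metrics parameters online, so I compiled one from the official documentation and from notes written by people abroad.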

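nvprof and ncu

nvprof used to be the most commonly used command-line profiler, but running nvprof ./*.o directly in a terminal on a recent GPU now prints the following warning: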

======== Warning: nvprof is not supported on devices with compute capability 8.0 and higher.
Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
Refer for more details.

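As the warning says, nvprof is no longer supported on devices with compute capability 8.0 and higher, which covers most current GPUs. An nvprof-style command is still available through NVIDIA Nsight Systems: running nsys nvprof ./*.o in a terminal shows what the CUDA program did during execution.

In addition, the functionality of nvprof --metrics has moved to ncu --metrics. The rest of this post explains the nvprof/ncu --metrics parameters in detail. Both nsys and ncu also have GUI versions, but only the command-line versions are covered here.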
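
As a rough illustration of the move from nvprof --metrics to ncu --metrics, the sketch below turns a few nvprof metric names into an ncu command line. The mapping is a small excerpt of the table later in this post; the helper name build_ncu_command and the target ./app are made up for illustration, and the only ncu option used is --metrics.

```python
import shlex

# Excerpt of the nvprof -> ncu mapping from the table in this post.
# Composite rows list every raw ncu metric they need.
NVPROF_TO_NCU = {
    "achieved_occupancy": ["sm__warps_active.avg.pct_of_peak_sustained_active"],
    "gld_transactions": ["l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum"],
    "cf_executed": [
        "smsp__inst_executed_pipe_cbu.sum",
        "smsp__inst_executed_pipe_adu.sum",
    ],
    "cf_issued": [],  # listed as n/a in the table
}


def build_ncu_command(nvprof_names, target):
    """Return an argv list for ncu covering the requested nvprof metrics.

    Hypothetical helper: `target` is whatever executable you want to profile.
    """
    ncu_metrics = []
    for name in nvprof_names:
        raw = NVPROF_TO_NCU[name]
        if not raw:
            raise ValueError(f"{name} has no ncu equivalent in the table")
        for metric in raw:
            if metric not in ncu_metrics:  # keep order, drop duplicates
                ncu_metrics.append(metric)
    return ["ncu", "--metrics", ",".join(ncu_metrics), target]


cmd = build_ncu_command(["achieved_occupancy", "cf_executed"], "./app")
print(shlex.join(cmd))
```

Metrics marked n/a in the table raise an error here instead of being dropped silently, so a missing counterpart is noticed before the profiling run.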

List

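  • inst_per_warp: Average number of instructions executed by each warp
  • branch_efficiency: Ratio of non-divergent branches to total branches
  • warp_execution_efficiency: Ratio of the average number of active threads per warp to the maximum number of threads per warp supported on an SM
  • warp_nonpred_execution_efficiency: Ratio of the average number of active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on an SM
  • inst_replay_overhead: Average number of replays for each instruction executed
  • shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
  • shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
  • local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
  • local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
  • gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load
  • gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
  • shared_store_transactions: Number of shared memory store transactions
  • shared_load_transactions: Number of shared memory load transactions
  • local_load_transactions: Number of local memory load transactions
  • local_store_transactions: Number of local memory store transactions
  • gld_transactions: Number of global memory load transactions
  • gst_transactions: Number of global memory store transactions
  • sysmem_read_transactions: Number of system memory read transactions
  • sysmem_write_transactions: Number of system memory write transactions
  • l2_read_transactions: Memory read transactions seen at the L2 cache for all read requests
  • l2_write_transactions: Memory write transactions seen at the L2 cache for all write requests
  • dram_read_transactions: Device memory read transactions
  • dram_write_transactions: Device memory write transactions
  • global_hit_rate: Hit rate for global loads in the unified L1/tex cache
  • local_hit_rate: Hit rate for local loads and stores
  • gld_requested_throughput: Requested global memory load throughput
  • gst_requested_throughput: Requested global memory store throughput
  • gld_throughput: Global memory load throughput
  • gst_throughput: Global memory store throughput
  • local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
  • tex_cache_hit_rate: Unified cache hit rate
  • l2_tex_read_hit_rate: Hit rate at the L2 cache for all read requests from the texture cache
  • l2_tex_write_hit_rate: Hit rate at the L2 cache for all write requests from the texture cache
  • dram_read_throughput: Device memory read throughput
  • dram_write_throughput: Device memory write throughput
  • tex_cache_throughput: Unified cache throughput
  • l2_tex_read_throughput: Memory read throughput seen at the L2 cache for read requests from the texture cache
  • l2_tex_write_throughput: Memory write throughput seen at the L2 cache for write requests from the texture cache
  • l2_read_throughput: Memory read throughput seen at the L2 cache for all read requests
  • l2_write_throughput: Memory write throughput seen at the L2 cache for all write requests
  • sysmem_read_throughput: System memory read throughput
  • sysmem_write_throughput: System memory write throughput
  • local_load_throughput: Local memory load throughput
  • local_store_throughput: Local memory store throughput
  • shared_load_throughput: Shared memory load throughput
  • shared_store_throughput: Shared memory store throughput
  • gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput
  • gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput
  • tex_cache_transactions: Unified cache read transactions
  • flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count
  • flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads
  • flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count
  • flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads
  • flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. Special operations are not included
  • flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads
  • flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count
  • flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads
  • flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads
  • inst_executed: Number of instructions executed
  • inst_issued: Number of instructions issued
  • dram_utilization: Utilization level of device memory relative to the theoretical peak utilization, on a scale of 0 to 10
  • sysmem_utilization: Utilization level of system memory relative to the theoretical peak utilization, on a scale of 0 to 10
  • stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
  • stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
  • stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed, either because the required resources are unavailable or fully utilized, or because too many requests of a given type are outstanding
  • stall_texture: Percentage of stalls occurring because the texture subsystem is fully utilized or has too many outstanding requests
  • stall_sync: Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
  • stall_other: Percentage of stalls occurring for miscellaneous reasons
  • stall_constant_memory_dependency: Percentage of stalls occurring because of an immediate constant cache miss
  • stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed while the compute pipeline is busy
  • shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
  • inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
  • inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
  • inst_integer: Number of integer instructions executed by non-predicated threads
  • inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
  • inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
  • inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
  • inst_misc: Number of miscellaneous instructions executed by non-predicated threads
  • inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
  • issue_slots: Number of issue slots used
  • cf_issued: Number of control-flow instructions issued
  • cf_executed: Number of control-flow instructions executed
  • ldst_issued: Number of local, global, shared, and texture memory load and store instructions issued
  • ldst_executed: Number of local, global, shared, and texture memory load and store instructions executed
  • atomic_transactions: Global memory atomic and reduction transactions
  • atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
  • l2_atomic_throughput: Memory read throughput seen at the L2 cache for atomic and reduction requests
  • l2_atomic_transactions: Memory read transactions seen at the L2 cache for atomic and reduction requests
  • l2_tex_read_transactions: Memory read transactions seen at the L2 cache for read requests from the texture cache
  • stall_memory_throttle: Percentage of stalls occurring because of memory throttling
  • stall_not_selected: Percentage of stalls occurring because the warp was not selected
  • l2_tex_write_transactions: Memory write transactions seen at the L2 cache for write requests from the texture cache
  • flop_count_hp: Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count
  • flop_count_hp_add: Number of half-precision floating-point add operations executed by non-predicated threads
  • flop_count_hp_mul: Number of half-precision floating-point multiply operations executed by non-predicated threads
  • flop_count_hp_fma: Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count
  • inst_fp_16: Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
  • ipc: Instructions executed per cycle
  • issued_ipc: Instructions issued per cycle
  • issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
  • sm_efficiency: Percentage of time at least one warp is active on a specific SM
  • achieved_occupancy: Ratio of the average number of active warps per active cycle to the maximum number of warps supported on an SM
  • eligible_warps_per_cycle: Average number of warps eligible to issue per active cycle
  • shared_utilization: Utilization level of shared memory relative to the theoretical peak utilization, on a scale of 0 to 10
  • l2_utilization: Utilization level of the L2 cache relative to the theoretical peak utilization, on a scale of 0 to 10
  • tex_utilization: Utilization level of the unified cache relative to the theoretical peak utilization, on a scale of 0 to 10
  • ldst_fu_utilization: Utilization level of the SM function units that execute shared load, shared store, and constant load instructions, on a scale of 0 to 10
  • cf_fu_utilization: Utilization level of the SM function units that execute control-flow instructions, on a scale of 0 to 10
  • tex_fu_utilization: Utilization level of the SM function units that execute global, local, and texture memory instructions, on a scale of 0 to 10
  • special_fu_utilization: Utilization level of the SM function units that execute sin, cos, ex2, popc, flo, and similar instructions, on a scale of 0 to 10
  • half_precision_fu_utilization: Utilization level of the SM function units that execute 16-bit floating-point instructions and integer instructions, on a scale of 0 to 10
  • single_precision_fu_utilization: Utilization level of the SM function units that execute single-precision floating-point instructions and integer instructions, on a scale of 0 to 10
  • double_precision_fu_utilization: Utilization level of the SM function units that execute double-precision floating-point instructions, on a scale of 0 to 10
  • flop_hp_efficiency: Ratio of achieved half-precision floating-point operations to the theoretical peak
  • flop_sp_efficiency: Ratio of achieved single-precision floating-point operations to the theoretical peak
  • flop_dp_efficiency: Ratio of achieved double-precision floating-point operations to the theoretical peak
  • sysmem_read_utilization: Read utilization level of system memory relative to the theoretical peak utilization, on a scale of 0 to 10
  • sysmem_write_utilization: Write utilization level of system memory relative to the theoretical peak utilization, on a scale of 0 to 10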
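
Several of the entries above are plain arithmetic over raw counts: the flop_count_* totals count each multiply-accumulate twice, and the *_transactions_per_request metrics are simple ratios. The sketch below works through both with made-up counter values. The function names are illustrative and not part of nvprof or ncu, and the memory example assumes 32-byte transactions.

```python
def flop_count(add, mul, fma):
    """Total floating-point operations: each multiply-accumulate counts as 2."""
    return add + mul + 2 * fma


def transactions_per_request(transactions, requests):
    """Average memory transactions per request; 0.0 when nothing was requested."""
    return transactions / requests if requests else 0.0


# Hypothetical single-precision counts for one kernel launch.
sp_add, sp_mul, sp_fma = 1_000, 500, 4_000
print(flop_count(sp_add, sp_mul, sp_fma))  # 1000 + 500 + 2 * 4000 = 9500

# Hypothetical load pattern, assuming 32-byte transactions:
# a warp reading 32 consecutive 4-byte floats touches 128 bytes = 4 transactions,
# while a warp whose 32 threads each hit a different 32-byte segment needs 32.
requests = 1_024
print(transactions_per_request(4 * requests, requests))   # 4.0
print(transactions_per_request(32 * requests, requests))  # 32.0
```

A lower transactions-per-request value for the same amount of requested data generally points to better-coalesced accesses.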

Table

| nvprof --metrics | ncu --metrics (>= SM 7.0) |
| --- | --- |
| achieved_occupancy | sm__warps_active.avg.pct_of_peak_sustained_active |
| atomic_transactions | l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum + l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum |
| atomic_transactions_per_request | (l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_red.sum) / (l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum + l1tex__t_requests_pipe_lsu_mem_global_op_red.sum) |
| branch_efficiency | smsp__sass_average_branch_targets_threads_uniform.pct |
| cf_executed | smsp__inst_executed_pipe_cbu.sum + smsp__inst_executed_pipe_adu.sum |
| cf_fu_utilization | n/a |
| cf_issued | n/a |
| double_precision_fu_utilization | smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active |
| dram_read_bytes | dram__bytes_read.sum |
| dram_read_throughput | dram__bytes_read.sum.per_second |
| dram_read_transactions | dram__sectors_read.sum |
| dram_utilization | dram__throughput.avg.pct_of_peak_sustained_elapsed |
| dram_write_bytes | dram__bytes_write.sum |
| dram_write_throughput | dram__bytes_write.sum.per_second |
| dram_write_transactions | dram__sectors_write.sum |
| eligible_warps_per_cycle | smsp__warps_eligible.sum.per_cycle_active |
| flop_count_dp | smsp__sass_thread_inst_executed_op_dadd_pred_on.sum + smsp__sass_thread_inst_executed_op_dmul_pred_on.sum + smsp__sass_thread_inst_executed_op_dfma_pred_on.sum * 2 |
| flop_count_dp_add | smsp__sass_thread_inst_executed_op_dadd_pred_on.sum |
| flop_count_dp_fma | smsp__sass_thread_inst_executed_op_dfma_pred_on.sum |
| flop_count_dp_mul | smsp__sass_thread_inst_executed_op_dmul_pred_on.sum |
| flop_count_hp | smsp__sass_thread_inst_executed_op_hadd_pred_on.sum + smsp__sass_thread_inst_executed_op_hmul_pred_on.sum + smsp__sass_thread_inst_executed_op_hfma_pred_on.sum * 2 |
| flop_count_hp_add | smsp__sass_thread_inst_executed_op_hadd_pred_on.sum |
| flop_count_hp_fma | smsp__sass_thread_inst_executed_op_hfma_pred_on.sum |
| flop_count_hp_mul | smsp__sass_thread_inst_executed_op_hmul_pred_on.sum |
| flop_count_sp | smsp__sass_thread_inst_executed_op_fadd_pred_on.sum + smsp__sass_thread_inst_executed_op_fmul_pred_on.sum + smsp__sass_thread_inst_executed_op_ffma_pred_on.sum * 2 |
| flop_count_sp_add | smsp__sass_thread_inst_executed_op_fadd_pred_on.sum |
| flop_count_sp_fma | smsp__sass_thread_inst_executed_op_ffma_pred_on.sum |
| flop_count_sp_mul | smsp__sass_thread_inst_executed_op_fmul_pred_on.sum |
| flop_count_sp_special | n/a |
| flop_dp_efficiency | smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_peak_sustained_elapsed |
| flop_hp_efficiency | smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_peak_sustained_elapsed |
| flop_sp_efficiency | smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_peak_sustained_elapsed |
| gld_efficiency | smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct |
| gld_requested_throughput | n/a |
| gld_throughput | l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second |
| gld_transactions | l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum |
| gld_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio |
| global_atomic_requests | l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum |
| global_hit_rate | (l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_st_lookup_hit.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_red_lookup_hit.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.sum) / (l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_red.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum) |
| global_load_requests | l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum |
| global_reduction_requests | l1tex__t_requests_pipe_lsu_mem_global_op_red.sum |
| global_store_requests | l1tex__t_requests_pipe_lsu_mem_global_op_st.sum |
| gst_efficiency | smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct |
| gst_requested_throughput | n/a |
| gst_throughput | l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second |
| gst_transactions | l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum |
| gst_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio |
| half_precision_fu_utilization | smsp__inst_executed_pipe_fp16.avg.pct_of_peak_sustained_active |
| inst_bit_convert | smsp__sass_thread_inst_executed_op_conversion_pred_on.sum |
| inst_compute_ld_st | smsp__sass_thread_inst_executed_op_memory_pred_on.sum |
| inst_control | smsp__sass_thread_inst_executed_op_control_pred_on.sum |
| inst_executed | smsp__inst_executed.sum |
| inst_executed_global_atomics | smsp__sass_inst_executed_op_global_atom.sum |
| inst_executed_global_loads | smsp__inst_executed_op_global_ld.sum |
| inst_executed_global_reductions | smsp__inst_executed_op_global_red.sum |
| inst_executed_global_stores | smsp__inst_executed_op_global_st.sum |
| inst_executed_local_loads | smsp__inst_executed_op_local_ld.sum |
| inst_executed_local_stores | smsp__inst_executed_op_local_st.sum |
| inst_executed_shared_atomics | smsp__inst_executed_op_shared_atom.sum + smsp__inst_executed_op_shared_atom_dot_alu.sum + smsp__inst_executed_op_shared_atom_dot_cas.sum |
| inst_executed_shared_loads | smsp__inst_executed_op_shared_ld.sum |
| inst_executed_shared_stores | smsp__inst_executed_op_shared_st.sum |
| inst_executed_surface_atomics | smsp__inst_executed_op_surface_atom.sum |
| inst_executed_surface_loads | smsp__inst_executed_op_surface_ld.sum + smsp__inst_executed_op_shared_atom_dot_alu.sum + smsp__inst_executed_op_shared_atom_dot_cas.sum |
| inst_executed_surface_reductions | smsp__inst_executed_op_surface_red.sum |
| inst_executed_surface_stores | smsp__inst_executed_op_surface_st.sum |
| inst_executed_tex_ops | smsp__inst_executed_op_texture.sum |
| inst_fp_16 | smsp__sass_thread_inst_executed_op_fp16_pred_on.sum |
| inst_fp_32 | smsp__sass_thread_inst_executed_op_fp32_pred_on.sum |
| inst_fp_64 | smsp__sass_thread_inst_executed_op_fp64_pred_on.sum |
| inst_integer | smsp__sass_thread_inst_executed_op_integer_pred_on.sum |
| inst_inter_thread_communication | smsp__sass_thread_inst_executed_op_inter_thread_communication_pred_on.sum |
| inst_issued | smsp__inst_issued.sum |
| inst_misc | smsp__sass_thread_inst_executed_op_misc_pred_on.sum |
| inst_per_warp | smsp__average_inst_executed_per_warp.ratio |
| inst_replay_overhead | n/a |
| ipc | smsp__inst_executed.avg.per_cycle_active |
| issue_slot_utilization | smsp__issue_active.avg.pct_of_peak_sustained_active |
| issue_slots | smsp__inst_issued.sum |
| issued_ipc | smsp__inst_issued.avg.per_cycle_active |
| l1_sm_lg_utilization | l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active |
| l2_atomic_throughput | 2 * ( lts__t_sectors_op_atom.sum.per_second + lts__t_sectors_op_red.sum.per_second ) |
| l2_atomic_transactions | 2 * ( lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum ) |
| l2_global_atomic_store_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_atom.sum |
| l2_global_load_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum |
| l2_local_global_store_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_st.sum + lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_st.sum |
| l2_local_load_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_ld.sum |
| l2_read_throughput | lts__t_sectors_op_read.sum.per_second + lts__t_sectors_op_atom.sum.per_second + lts__t_sectors_op_red.sum.per_second |
| l2_read_transactions | lts__t_sectors_op_read.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum |
| l2_surface_load_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_tex_mem_surface_op_ld.sum |
| l2_surface_store_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_tex_mem_surface_op_st.sum |
| l2_tex_hit_rate | lts__t_sector_hit_rate.pct |
| l2_tex_read_hit_rate | lts__t_sector_op_read_hit_rate.pct |
| l2_tex_read_throughput | lts__t_sectors_srcunit_tex_op_read.sum.per_second |
| l2_tex_read_transactions | lts__t_sectors_srcunit_tex_op_read.sum |
| l2_tex_write_hit_rate | lts__t_sector_op_write_hit_rate.pct |
| l2_tex_write_throughput | lts__t_sectors_srcunit_tex_op_write.sum.per_second |
| l2_tex_write_transactions | lts__t_sectors_srcunit_tex_op_write.sum |
| l2_utilization | lts__t_sectors.avg.pct_of_peak_sustained_elapsed |
| l2_write_throughput | lts__t_sectors_op_write.sum.per_second + lts__t_sectors_op_atom.sum.per_second + lts__t_sectors_op_red.sum.per_second |
| l2_write_transactions | lts__t_sectors_op_write.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum |
| ldst_executed | n/a |
| ldst_fu_utilization | smsp__inst_executed_pipe_lsu.avg.pct_of_peak_sustained_active |
| ldst_issued | n/a |
| local_hit_rate | n/a |
| local_load_requests | l1tex__t_requests_pipe_lsu_mem_local_op_ld.sum |
| local_load_throughput | l1tex__t_bytes_pipe_lsu_mem_local_op_ld.sum.per_second |
| local_load_transactions | l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum |
| local_load_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_local_op_ld.ratio |
| local_memory_overhead | n/a |
| local_store_requests | l1tex__t_requests_pipe_lsu_mem_local_op_st.sum |
| local_store_throughput | l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum.per_second |
| local_store_transactions | l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum |
| local_store_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_local_op_st.ratio |
| nvlink_data_receive_efficiency | n/a |
| nvlink_data_transmission_efficiency | n/a |
| nvlink_overhead_data_received | (nvlrx__bytes_data_protocol.sum / nvlrx__bytes.sum) * 100 |
| nvlink_overhead_data_transmitted | (nvltx__bytes_data_protocol.sum / nvltx__bytes.sum) * 100 |
| nvlink_receive_throughput | nvlrx__bytes.sum.per_second |
| nvlink_total_data_received | nvlrx__bytes.sum |
| nvlink_total_data_transmitted | nvltx__bytes.sum |
| nvlink_total_nratom_data_transmitted | n/a |
| nvlink_total_ratom_data_transmitted | n/a |
| nvlink_total_response_data_received | n/a |
| nvlink_total_write_data_transmitted | n/a |
| nvlink_transmit_throughput | nvltx__bytes.sum.per_second |
| nvlink_user_data_received | nvlrx__bytes_data_user.sum |
| nvlink_user_data_transmitted | nvltx__bytes_data_user.sum |
| nvlink_user_nratom_data_transmitted | n/a |
| nvlink_user_ratom_data_transmitted | n/a |
| nvlink_user_response_data_received | n/a |
| nvlink_user_write_data_transmitted | n/a |
| pcie_total_data_received | pcie__read_bytes.sum |
| pcie_total_data_transmitted | pcie__write_bytes.sum |
| shared_efficiency | smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct |
| shared_load_throughput | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum.per_second |
| shared_load_transactions | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum |
| shared_load_transactions_per_request | n/a |
| shared_store_throughput | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum.per_second |
| shared_store_transactions | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum |
| shared_store_transactions_per_request | n/a |
| shared_utilization | l1tex__data_pipe_lsu_wavefronts_mem_shared.avg.pct_of_peak_sustained_elapsed |
| single_precision_fu_utilization | smsp__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active |
| sm_efficiency | smsp__cycles_active.avg.pct_of_peak_sustained_elapsed |
| sm_tex_utilization | l1tex__texin_sm2tex_req_cycles_active.avg.pct_of_peak_sustained_elapsed |
| special_fu_utilization | smsp__inst_executed_pipe_xu.avg.pct_of_peak_sustained_active |
| stall_constant_memory_dependency | smsp__warp_issue_stalled_imc_miss_per_warp_active.pct |
| stall_exec_dependency | smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct + smsp__warp_issue_stalled_wait_per_warp_active.pct |
| stall_inst_fetch | smsp__warp_issue_stalled_no_instruction_per_warp_active.pct |
| stall_memory_dependency | smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct |
| stall_memory_throttle | smsp__warp_issue_stalled_drain_per_warp_active.pct + smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct |
| stall_not_selected | smsp__warp_issue_stalled_not_selected_per_warp_active.pct |
| stall_other | smsp__warp_issue_stalled_dispatch_stall_per_warp_active.pct + smsp__warp_issue_stalled_misc_per_warp_active.pct |
| stall_pipe_busy | smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active.pct + smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct |
| stall_sleeping | smsp__warp_issue_stalled_sleeping_per_warp_active.pct |
| stall_sync | smsp__warp_issue_stalled_barrier_per_warp_active.pct + smsp__warp_issue_stalled_membar_per_warp_active.pct |
| stall_texture | smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct |
| surface_atomic_requests | l1tex__t_requests_pipe_tex_mem_surface_op_atom.sum |
| surface_load_requests | l1tex__t_requests_pipe_tex_mem_surface_op_ld.sum |
| surface_reduction_requests | l1tex__t_requests_pipe_tex_mem_surface_op_red.sum |
| surface_store_requests | l1tex__t_requests_pipe_tex_mem_surface_op_st.sum |
| sysmem_read_bytes | lts__t_sectors_aperture_sysmem_op_read * 32 |
| sysmem_read_throughput | lts__t_sectors_aperture_sysmem_op_read.sum.per_second |
| sysmem_read_transactions | lts__t_sectors_aperture_sysmem_op_read.sum |
| sysmem_read_utilization | n/a |
| sysmem_utilization | n/a |
| sysmem_write_bytes | lts__t_sectors_aperture_sysmem_op_write * 32 |
| sysmem_write_throughput | lts__t_sectors_aperture_sysmem_op_write.sum.per_second |
| sysmem_write_transactions | lts__t_sectors_aperture_sysmem_op_write.sum |
| sysmem_write_utilization | n/a |
| tensor_precision_fu_utilization | sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_active |
| tensor_precision_int_utilization | sm__pipe_tensor_op_imma_cycles_active.avg.pct_of_peak_sustained_active (SM 7.2+) |
| tex_cache_hit_rate | l1tex__t_sector_hit_rate.pct |
| tex_cache_throughput | n/a |
| tex_cache_transactions | l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active + l1tex__tex_writeback_active.avg.pct_of_peak_sustained_active |
| tex_fu_utilization | smsp__inst_executed_pipe_tex.avg.pct_of_peak_sustained_active |
| tex_sm_tex_utilization | l1tex__f_tex2sm_cycles_active.avg.pct_of_peak_sustained_elapsed |
| tex_sm_utilization | sm__mio2rf_writeback_active.avg.pct_of_peak_sustained_elapsed |
| tex_utilization | n/a |
| texture_load_requests | l1tex__t_requests_pipe_tex_mem_texture.sum |
| warp_execution_efficiency | smsp__thread_inst_executed_per_inst_executed.ratio |
| warp_nonpred_execution_efficiency | smsp__thread_inst_executed_per_inst_executed.pct |
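
Some rows in the table are expressions over several ncu metrics rather than a single metric. One possible way to handle them, sketched below, is to pull the raw metric names out of the expression (so they can be passed to ncu --metrics) and then evaluate the expression once the values are known. The numbers in the values dictionary are made up, the helper names raw_metrics and evaluate are illustrative, and the regular expression is only an assumption about how the names in this table are shaped.

```python
import ast
import operator
import re

# Assumed shape of the names in the table: unit__counter, then optional .suffix parts.
METRIC_RE = re.compile(r"[a-z0-9]+__[a-z0-9_]+(?:\.[a-z_]+)*")

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def raw_metrics(expr):
    """Distinct ncu metric names in a table expression, in order of appearance."""
    return list(dict.fromkeys(METRIC_RE.findall(expr)))


def evaluate(expr, values):
    """Evaluate a table expression (+, -, *, /, parentheses) over collected values."""
    slots = {}

    def placeholder(match):
        # Swap each metric name for a plain identifier that `ast` can parse.
        return slots.setdefault(match.group(0), f"m{len(slots)}")

    tree = ast.parse(METRIC_RE.sub(placeholder, expr), mode="eval")
    lookup = {slot: values[name] for name, slot in slots.items()}

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return lookup[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported syntax in expression: {expr!r}")

    return walk(tree)


# The flop_count_sp row from the table.
FLOP_COUNT_SP = (
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"
    " + smsp__sass_thread_inst_executed_op_fmul_pred_on.sum"
    " + smsp__sass_thread_inst_executed_op_ffma_pred_on.sum * 2"
)
values = {  # made-up numbers standing in for ncu output
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 1000,
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 500,
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 4000,
}
print(",".join(raw_metrics(FLOP_COUNT_SP)))  # comma-separated list for --metrics
print(evaluate(FLOP_COUNT_SP, values))       # 1000 + 500 + 4000 * 2 = 9500
```

Walking the parsed tree with a whitelist of operators avoids calling eval on text copied from a table.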