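Abstract
I could not find a reasonably complete Chinese-language reference online for what the ncu --metrics parameters mean, so I compiled one from the official documentation and from notes written by developers abroad.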
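nvprof and ncu
nvprof used to be the most common command-line profiler, but running nvprof ./*.o in a terminal now produces the following warning:
======== Warning: nvprof is not supported on devices with compute capability 8.0 and higher.
nvprof no longer works on GPUs with compute capability 8.0 or higher, but its functionality is still available through NVIDIA Nsight Systems: running nsys nvprof ./*.o in a terminal shows what the CUDA program did during execution.
The functionality of nvprof --metrics, meanwhile, has moved to ncu --metrics. The rest of this post explains the parameters of nvprof/ncu --metrics in detail. Both nsys and ncu also have GUI versions; only the command-line versions are discussed here.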
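As a minimal sketch of how such a command line is put together: ncu accepts the metric names as a single comma-separated value of --metrics. The helper name build_ncu_command and the binary ./my_app are hypothetical, chosen only for illustration; the two metric names are taken from the table later in this post.

```python
import shlex


def build_ncu_command(metrics, app, app_args=()):
    """Return an argv list that profiles `app` with the given ncu metrics.

    `build_ncu_command` is a hypothetical helper, not part of any NVIDIA tool.
    """
    if not metrics:
        raise ValueError("at least one metric name is required")
    # ncu expects the metric names as one comma-separated --metrics value.
    return ["ncu", "--metrics", ",".join(metrics), app, *app_args]


cmd = build_ncu_command(
    [
        "sm__warps_active.avg.pct_of_peak_sustained_active",  # achieved_occupancy
        "dram__bytes_read.sum",                               # dram_read_bytes
    ],
    "./my_app",  # hypothetical application binary
)

# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd) on a machine where ncu is installed; here it is only printed.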
List
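inst_per_warp: Average number of instructions executed by each warp
branch_efficiency: Ratio of non-divergent branches to total branches
warp_execution_efficiency: Ratio of the average number of active threads per warp to the maximum number of threads per warp supported on an SM
warp_nonpred_execution_efficiency: Ratio of the average number of active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on an SM
inst_replay_overhead: Average number of replays for each instruction executed
shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load
gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
shared_store_transactions: Number of shared memory store transactions
shared_load_transactions: Number of shared memory load transactions
local_load_transactions: Number of local memory load transactions
local_store_transactions: Number of local memory store transactions
gld_transactions: Number of global memory load transactions
gst_transactions: Number of global memory store transactions
sysmem_read_transactions: Number of system memory read transactions
sysmem_write_transactions: Number of system memory write transactions
l2_read_transactions: Memory read transactions seen at the L2 cache for all read requests
l2_write_transactions: Memory write transactions seen at the L2 cache for all write requests
dram_read_transactions: Device memory read transactions
dram_write_transactions: Device memory write transactions
global_hit_rate: Hit rate for global loads in the unified L1/tex cache
local_hit_rate: Hit rate for local loads and stores
gld_requested_throughput: Requested global memory load throughput
gst_requested_throughput: Requested global memory store throughput
gld_throughput: Global memory load throughput
gst_throughput: Global memory store throughput
local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
tex_cache_hit_rate: Unified cache hit rate
l2_tex_read_hit_rate: Hit rate at the L2 cache for all read requests from the texture cache
l2_tex_write_hit_rate: Hit rate at the L2 cache for all write requests from the texture cache
dram_read_throughput: Device memory read throughput
dram_write_throughput: Device memory write throughput
tex_cache_throughput: Unified cache throughput
l2_tex_read_throughput: Memory read throughput seen at the L2 cache for read requests from the texture cache
l2_tex_write_throughput: Memory write throughput seen at the L2 cache for write requests from the texture cache
l2_read_throughput: Memory read throughput seen at the L2 cache for all read requests
l2_write_throughput: Memory write throughput seen at the L2 cache for all write requests
sysmem_read_throughput: System memory read throughput
sysmem_write_throughput: System memory write throughput
local_load_throughput: Local memory load throughput
local_store_throughput: Local memory store throughput
shared_load_throughput: Shared memory load throughput
shared_store_throughput: Shared memory store throughput
gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput
gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput
tex_cache_transactions: Unified cache read transactions
flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads
flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads
flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. Special operations are not included.
flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads
flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads
flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads
inst_executed: Number of instructions executed
inst_issued: Number of instructions issued
dram_utilization: Utilization level of device memory relative to the theoretical peak utilization, on a scale of 0 to 10
sysmem_utilization: Utilization level of system memory relative to the theoretical peak utilization
stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed, either because the required resources are unavailable or fully utilized, or because too many requests of a given type are outstanding
stall_texture: Percentage of stalls occurring because the texture subsystem is fully utilized or has too many outstanding requests
stall_sync: Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
stall_other: Percentage of stalls occurring for miscellaneous reasons
stall_constant_memory_dependency: Percentage of stalls occurring because of an immediate constant cache miss
stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed while the compute pipeline is busy
shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_integer: Number of integer instructions executed by non-predicated threads
inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
inst_misc: Number of miscellaneous instructions executed by non-predicated threads
inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
issue_slots: Number of issue slots used
cf_issued: Number of control-flow instructions issued
cf_executed: Number of control-flow instructions executed
ldst_issued: Number of local, global, shared, and texture memory load and store instructions issued
ldst_executed: Number of local, global, shared, and texture memory load and store instructions executed
atomic_transactions: Global memory atomic and reduction transactions
atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
l2_atomic_throughput: Memory read throughput seen at the L2 cache for atomic and reduction requests
l2_atomic_transactions: Memory read transactions seen at the L2 cache for atomic and reduction requests
l2_tex_read_transactions: Memory read transactions seen at the L2 cache for read requests from the texture cache
stall_memory_throttle: Percentage of stalls occurring because of memory throttling
stall_not_selected: Percentage of stalls occurring because the warp was not selected
l2_tex_write_transactions: Memory write transactions seen at the L2 cache for write requests from the texture cache
flop_count_hp: Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
flop_count_hp_add: Number of half-precision floating-point add operations executed by non-predicated threads
flop_count_hp_mul: Number of half-precision floating-point multiply operations executed by non-predicated threads
flop_count_hp_fma: Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
inst_fp_16: Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
ipc: Instructions executed per cycle
issued_ipc: Instructions issued per cycle
issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
sm_efficiency: Percentage of time at least one warp is active on a specific SM
achieved_occupancy: Ratio of the average number of active warps per active cycle to the maximum number of warps supported on an SM
eligible_warps_per_cycle: Average number of warps eligible to issue per active cycle
shared_utilization: Utilization level of shared memory relative to the theoretical peak utilization
l2_utilization: Utilization level of the L2 cache relative to the theoretical peak utilization, on a scale of 0 to 10
tex_utilization: Utilization level of the unified cache relative to the theoretical peak utilization
ldst_fu_utilization: Utilization level of the SM function units that execute shared load, shared store, and constant load instructions
cf_fu_utilization: Utilization level of the SM function units that execute control-flow instructions, on a scale of 0 to 10
tex_fu_utilization: Utilization level of the SM function units that execute global, local, and texture memory instructions, on a scale of 0 to 10
special_fu_utilization: Utilization level of the SM function units that execute sin, cos, ex2, popc, flo, and similar instructions, on a scale of 0 to 10
half_precision_fu_utilization: Utilization level of the SM function units that execute 16-bit floating-point instructions and integer instructions, on a scale of 0 to 10
single_precision_fu_utilization: Utilization level of the SM function units that execute single-precision floating-point instructions and integer instructions
double_precision_fu_utilization: Utilization level of the SM function units that execute double-precision floating-point instructions
flop_hp_efficiency: Ratio of achieved half-precision floating-point operations to the theoretical peak
flop_sp_efficiency: Ratio of achieved single-precision floating-point operations to the theoretical peak
flop_dp_efficiency: Ratio of achieved double-precision floating-point operations to the theoretical peak
sysmem_read_utilization: Read utilization level of system memory relative to the theoretical peak utilization, on a scale of 0 to 10
sysmem_write_utilization: Write utilization level of system memory relative to the theoretical peak utilization, on a scale of 0 to 10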
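The flop_count_* entries above follow one counting rule: the aggregate count (flop_count_sp, flop_count_dp, flop_count_hp) adds the add and multiply counts and weights each multiply-accumulate by 2, while the corresponding *_fma metric counts each multiply-accumulate once. A small sketch of that arithmetic, with a hypothetical function name and made-up counts:

```python
def aggregate_flop_count(add_ops, mul_ops, fma_ops):
    """Combine per-type counts the way flop_count_sp/dp/hp are described.

    Each multiply-accumulate is one entry in the *_fma metric but
    contributes 2 floating-point operations to the aggregate count.
    """
    return add_ops + mul_ops + 2 * fma_ops


# Made-up example: 1024 threads each execute exactly one FMA and nothing else.
fma_ops = 1024
total = aggregate_flop_count(add_ops=0, mul_ops=0, fma_ops=fma_ops)
print(fma_ops, total)
```

In this example the *_fma metric would be 1024 and the aggregate count 2048, which is why the two should not be compared directly.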
Table
nvprof --metrics | ncu --metrics (>= SM 7.0) |
---|---|
achieved_occupancy | sm__warps_active.avg.pct_of_peak_sustained_active |
atomic_transactions | l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum + l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum |
atomic_transactions_per_request | (l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_red.sum) / (l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum + l1tex__t_requests_pipe_lsu_mem_global_op_red.sum) |
branch_efficiency | smsp__sass_average_branch_targets_threads_uniform.pct |
cf_executed | smsp__inst_executed_pipe_cbu.sum + smsp__inst_executed_pipe_adu.sum |
cf_fu_utilization | n/a |
cf_issued | n/a |
double_precision_fu_utilization | smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active |
dram_read_bytes | dram__bytes_read.sum |
dram_read_throughput | dram__bytes_read.sum.per_second |
dram_read_transactions | dram__sectors_read.sum |
dram_utilization | dram__throughput.avg.pct_of_peak_sustained_elapsed |
dram_write_bytes | dram__bytes_write.sum |
dram_write_throughput | dram__bytes_write.sum.per_second |
dram_write_transactions | dram__sectors_write.sum |
eligible_warps_per_cycle | smsp__warps_eligible.sum.per_cycle_active |
flop_count_dp | smsp__sass_thread_inst_executed_op_dadd_pred_on.sum + smsp__sass_thread_inst_executed_op_dmul_pred_on.sum + smsp__sass_thread_inst_executed_op_dfma_pred_on.sum * 2 |
flop_count_dp_add | smsp__sass_thread_inst_executed_op_dadd_pred_on.sum |
flop_count_dp_fma | smsp__sass_thread_inst_executed_op_dfma_pred_on.sum |
flop_count_dp_mul | smsp__sass_thread_inst_executed_op_dmul_pred_on.sum |
flop_count_hp | smsp__sass_thread_inst_executed_op_hadd_pred_on.sum + smsp__sass_thread_inst_executed_op_hmul_pred_on.sum + smsp__sass_thread_inst_executed_op_hfma_pred_on.sum * 2 |
flop_count_hp_add | smsp__sass_thread_inst_executed_op_hadd_pred_on.sum |
flop_count_hp_fma | smsp__sass_thread_inst_executed_op_hfma_pred_on.sum |
flop_count_hp_mul | smsp__sass_thread_inst_executed_op_hmul_pred_on.sum |
flop_count_sp | smsp__sass_thread_inst_executed_op_fadd_pred_on.sum + smsp__sass_thread_inst_executed_op_fmul_pred_on.sum + smsp__sass_thread_inst_executed_op_ffma_pred_on.sum * 2 |
flop_count_sp_add | smsp__sass_thread_inst_executed_op_fadd_pred_on.sum |
flop_count_sp_fma | smsp__sass_thread_inst_executed_op_ffma_pred_on.sum |
flop_count_sp_mul | smsp__sass_thread_inst_executed_op_fmul_pred_on.sum |
flop_count_sp_special | n/a |
flop_dp_efficiency | smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_peak_sustained_elapsed |
flop_hp_efficiency | smsp__sass_thread_inst_executed_ops_hadd_hmul_hfma_pred_on.avg.pct_of_peak_sustained_elapsed |
flop_sp_efficiency | smsp__sass_thread_inst_executed_ops_fadd_fmul_ffma_pred_on.avg.pct_of_peak_sustained_elapsed |
gld_efficiency | smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct |
gld_requested_throughput | n/a |
gld_throughput | l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second |
gld_transactions | l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum |
gld_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio |
global_atomic_requests | l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum |
global_hit_rate | (l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_st_lookup_hit.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_red_lookup_hit.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_atom_lookup_hit.sum) / (l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_red.sum + l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum) |
global_load_requests | l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum |
global_reduction_requests | l1tex__t_requests_pipe_lsu_mem_global_op_red.sum |
global_store_requests | l1tex__t_requests_pipe_lsu_mem_global_op_st.sum |
gst_efficiency | smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct |
gst_requested_throughput | n/a |
gst_throughput | l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second |
gst_transactions | l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum |
gst_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio |
half_precision_fu_utilization | smsp__inst_executed_pipe_fp16.avg.pct_of_peak_sustained_active |
inst_bit_convert | smsp__sass_thread_inst_executed_op_conversion_pred_on.sum |
inst_compute_ld_st | smsp__sass_thread_inst_executed_op_memory_pred_on.sum |
inst_control | smsp__sass_thread_inst_executed_op_control_pred_on.sum |
inst_executed | smsp__inst_executed.sum |
inst_executed_global_atomics | smsp__sass_inst_executed_op_global_atom.sum |
inst_executed_global_loads | smsp__inst_executed_op_global_ld.sum |
inst_executed_global_reductions | smsp__inst_executed_op_global_red.sum |
inst_executed_global_stores | smsp__inst_executed_op_global_st.sum |
inst_executed_local_loads | smsp__inst_executed_op_local_ld.sum |
inst_executed_local_stores | smsp__inst_executed_op_local_st.sum |
inst_executed_shared_atomics | smsp__inst_executed_op_shared_atom.sum + smsp__inst_executed_op_shared_atom_dot_alu.sum + smsp__inst_executed_op_shared_atom_dot_cas.sum |
inst_executed_shared_loads | smsp__inst_executed_op_shared_ld.sum |
inst_executed_shared_stores | smsp__inst_executed_op_shared_st.sum |
inst_executed_surface_atomics | smsp__inst_executed_op_surface_atom.sum |
inst_executed_surface_loads | smsp__inst_executed_op_surface_ld.sum + smsp__inst_executed_op_shared_atom_dot_alu.sum + smsp__inst_executed_op_shared_atom_dot_cas.sum |
inst_executed_surface_reductions | smsp__inst_executed_op_surface_red.sum |
inst_executed_surface_stores | smsp__inst_executed_op_surface_st.sum |
inst_executed_tex_ops | smsp__inst_executed_op_texture.sum |
inst_fp_16 | smsp__sass_thread_inst_executed_op_fp16_pred_on.sum |
inst_fp_32 | smsp__sass_thread_inst_executed_op_fp32_pred_on.sum |
inst_fp_64 | smsp__sass_thread_inst_executed_op_fp64_pred_on.sum |
inst_integer | smsp__sass_thread_inst_executed_op_integer_pred_on.sum |
inst_inter_thread_communication | smsp__sass_thread_inst_executed_op_inter_thread_communication_pred_on.sum |
inst_issued | smsp__inst_issued.sum |
inst_misc | smsp__sass_thread_inst_executed_op_misc_pred_on.sum |
inst_per_warp | smsp__average_inst_executed_per_warp.ratio |
inst_replay_overhead | n/a |
ipc | smsp__inst_executed.avg.per_cycle_active |
issue_slot_utilization | smsp__issue_active.avg.pct_of_peak_sustained_active |
issue_slots | smsp__inst_issued.sum |
issued_ipc | smsp__inst_issued.avg.per_cycle_active |
l1_sm_lg_utilization | l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active |
l2_atomic_throughput | 2 * ( lts__t_sectors_op_atom.sum.per_second + lts__t_sectors_op_red.sum.per_second ) |
l2_atomic_transactions | 2 * ( lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum ) |
l2_global_atomic_store_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_atom.sum |
l2_global_load_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum |
l2_local_global_store_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_st.sum + lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_st.sum |
l2_local_load_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_ld.sum |
l2_read_throughput | lts__t_sectors_op_read.sum.per_second + lts__t_sectors_op_atom.sum.per_second + lts__t_sectors_op_red.sum.per_second |
l2_read_transactions | lts__t_sectors_op_read.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum |
l2_surface_load_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_tex_mem_surface_op_ld.sum |
l2_surface_store_bytes | lts__t_bytes_equiv_l1sectormiss_pipe_tex_mem_surface_op_st.sum |
l2_tex_hit_rate | lts__t_sector_hit_rate.pct |
l2_tex_read_hit_rate | lts__t_sector_op_read_hit_rate.pct |
l2_tex_read_throughput | lts__t_sectors_srcunit_tex_op_read.sum.per_second |
l2_tex_read_transactions | lts__t_sectors_srcunit_tex_op_read.sum |
l2_tex_write_hit_rate | lts__t_sector_op_write_hit_rate.pct |
l2_tex_write_throughput | lts__t_sectors_srcunit_tex_op_write.sum.per_second |
l2_tex_write_transactions | lts__t_sectors_srcunit_tex_op_write.sum |
l2_utilization | lts__t_sectors.avg.pct_of_peak_sustained_elapsed |
l2_write_throughput | lts__t_sectors_op_write.sum.per_second + lts__t_sectors_op_atom.sum.per_second + lts__t_sectors_op_red.sum.per_second |
l2_write_transactions | lts__t_sectors_op_write.sum + lts__t_sectors_op_atom.sum + lts__t_sectors_op_red.sum |
ldst_executed | n/a |
ldst_fu_utilization | smsp__inst_executed_pipe_lsu.avg.pct_of_peak_sustained_active |
ldst_issued | n/a |
local_hit_rate | n/a |
local_load_requests | l1tex__t_requests_pipe_lsu_mem_local_op_ld.sum |
local_load_throughput | l1tex__t_bytes_pipe_lsu_mem_local_op_ld.sum.per_second |
local_load_transactions | l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum |
local_load_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_local_op_ld.ratio |
local_memory_overhead | n/a |
local_store_requests | l1tex__t_requests_pipe_lsu_mem_local_op_st.sum |
local_store_throughput | l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum.per_second |
local_store_transactions | l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum |
local_store_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_local_op_st.ratio |
nvlink_data_receive_efficiency | n/a |
nvlink_data_transmission_efficiency | n/a |
nvlink_overhead_data_received | (nvlrx__bytes_data_protocol.sum / nvlrx__bytes.sum) * 100 |
nvlink_overhead_data_transmitted | (nvltx__bytes_data_protocol.sum / nvltx__bytes.sum) * 100 |
nvlink_receive_throughput | nvlrx__bytes.sum.per_second |
nvlink_total_data_received | nvlrx__bytes.sum |
nvlink_total_data_transmitted | nvltx__bytes.sum |
nvlink_total_nratom_data_transmitted | n/a |
nvlink_total_ratom_data_transmitted | n/a |
nvlink_total_response_data_received | n/a |
nvlink_total_write_data_transmitted | n/a |
nvlink_transmit_throughput | nvltx__bytes.sum.per_second |
nvlink_user_data_received | nvlrx__bytes_data_user.sum |
nvlink_user_data_transmitted | nvltx__bytes_data_user.sum |
nvlink_user_nratom_data_transmitted | n/a |
nvlink_user_ratom_data_transmitted | n/a |
nvlink_user_response_data_received | n/a |
nvlink_user_write_data_transmitted | n/a |
pcie_total_data_received | pcie__read_bytes.sum |
pcie_total_data_transmitted | pcie__write_bytes.sum |
shared_efficiency | smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct |
shared_load_throughput | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum.per_second |
shared_load_transactions | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum |
shared_load_transactions_per_request | n/a |
shared_store_throughput | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum.per_second |
shared_store_transactions | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum |
shared_store_transactions_per_request | n/a |
shared_utilization | l1tex__data_pipe_lsu_wavefronts_mem_shared.avg.pct_of_peak_sustained_elapsed |
single_precision_fu_utilization | smsp__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active |
sm_efficiency | smsp__cycles_active.avg.pct_of_peak_sustained_elapsed |
sm_tex_utilization | l1tex__texin_sm2tex_req_cycles_active.avg.pct_of_peak_sustained_elapsed |
special_fu_utilization | smsp__inst_executed_pipe_xu.avg.pct_of_peak_sustained_active |
stall_constant_memory_dependency | smsp__warp_issue_stalled_imc_miss_per_warp_active.pct |
stall_exec_dependency | smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct + smsp__warp_issue_stalled_wait_per_warp_active.pct |
stall_inst_fetch | smsp__warp_issue_stalled_no_instruction_per_warp_active.pct |
stall_memory_dependency | smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct |
stall_memory_throttle | smsp__warp_issue_stalled_drain_per_warp_active.pct + smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct |
stall_not_selected | smsp__warp_issue_stalled_not_selected_per_warp_active.pct |
stall_other | smsp__warp_issue_stalled_dispatch_stall_per_warp_active.pct + smsp__warp_issue_stalled_misc_per_warp_active.pct |
stall_pipe_busy | smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active.pct + smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct |
stall_sleeping | smsp__warp_issue_stalled_sleeping_per_warp_active.pct |
stall_sync | smsp__warp_issue_stalled_barrier_per_warp_active.pct + smsp__warp_issue_stalled_membar_per_warp_active.pct |
stall_texture | smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct |
surface_atomic_requests | l1tex__t_requests_pipe_tex_mem_surface_op_atom.sum |
surface_load_requests | l1tex__t_requests_pipe_tex_mem_surface_op_ld.sum |
surface_reduction_requests | l1tex__t_requests_pipe_tex_mem_surface_op_red.sum |
surface_store_requests | l1tex__t_requests_pipe_tex_mem_surface_op_st.sum |
sysmem_read_bytes | lts__t_sectors_aperture_sysmem_op_read * 32 |
sysmem_read_throughput | lts__t_sectors_aperture_sysmem_op_read.sum.per_second |
sysmem_read_transactions | lts__t_sectors_aperture_sysmem_op_read.sum |
sysmem_read_utilization | n/a |
sysmem_utilization | n/a |
sysmem_write_bytes | lts__t_sectors_aperture_sysmem_op_write * 32 |
sysmem_write_throughput | lts__t_sectors_aperture_sysmem_op_write.sum.per_second |
sysmem_write_transactions | lts__t_sectors_aperture_sysmem_op_write.sum |
sysmem_write_utilization | n/a |
tensor_precision_fu_utilization | sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_active |
tensor_precision_int_utilization | sm__pipe_tensor_op_imma_cycles_active.avg.pct_of_peak_sustained_active (SM 7.2+) |
tex_cache_hit_rate | l1tex__t_sector_hit_rate.pct |
tex_cache_throughput | n/a |
tex_cache_transactions | l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active + l1tex__tex_writeback_active.avg.pct_of_peak_sustained_active |
tex_fu_utilization | smsp__inst_executed_pipe_tex.avg.pct_of_peak_sustained_active |
tex_sm_tex_utilization | l1tex__f_tex2sm_cycles_active.avg.pct_of_peak_sustained_elapsed |
tex_sm_utilization | sm__mio2rf_writeback_active.avg.pct_of_peak_sustained_elapsed |
tex_utilization | n/a |
texture_load_requests | l1tex__t_requests_pipe_tex_mem_texture.sum |
warp_execution_efficiency | smsp__thread_inst_executed_per_inst_executed.ratio |
warp_nonpred_execution_efficiency | smsp__thread_inst_executed_per_inst_executed.pct |
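Several rows in the table map one nvprof metric to an expression over multiple ncu metrics, so the value has to be computed after collection. The sketch below evaluates two such rows (l2_read_transactions and atomic_transactions_per_request) from a dictionary of raw ncu values. The function names are hypothetical and the numbers are made up for illustration; they are not measurements.

```python
def l2_read_transactions(raw):
    """Sum of read, atomic, and reduction sectors at L2, as in the table row."""
    return (raw["lts__t_sectors_op_read.sum"]
            + raw["lts__t_sectors_op_atom.sum"]
            + raw["lts__t_sectors_op_red.sum"])


def atomic_transactions_per_request(raw):
    """(atom + red sectors) / (atom + red requests), as in the table row."""
    sectors = (raw["l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum"]
               + raw["l1tex__t_sectors_pipe_lsu_mem_global_op_red.sum"])
    requests = (raw["l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum"]
                + raw["l1tex__t_requests_pipe_lsu_mem_global_op_red.sum"])
    # Guard against kernels that issue no atomics or reductions at all.
    return sectors / requests if requests else 0.0


# Made-up raw values standing in for numbers read from ncu output.
raw = {
    "lts__t_sectors_op_read.sum": 1000,
    "lts__t_sectors_op_atom.sum": 24,
    "lts__t_sectors_op_red.sum": 8,
    "l1tex__t_sectors_pipe_lsu_mem_global_op_atom.sum": 64,
    "l1tex__t_sectors_pipe_lsu_mem_global_op_red.sum": 32,
    "l1tex__t_requests_pipe_lsu_mem_global_op_atom.sum": 16,
    "l1tex__t_requests_pipe_lsu_mem_global_op_red.sum": 8,
}

print(l2_read_transactions(raw))
print(atomic_transactions_per_request(raw))
```

All metrics named in an expression need to be requested in the same ncu --metrics list so that they come from the same kernel launch.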