Timeline
Top Kernels by GPU Time
Kernel Summary
| Rank | Kernel | Calls | Total GPU Time | Avg per Call | % of GPU Time | Occupancy |
| 1 | mul_mat_vec_q | 399 | 2.101 ms | 5.3 us | 38.3% | 11.4% |
| 2 | quantize_q8_1 | 399 | 1.043 ms | 2.6 us | 19.0% | 0.1% |
| 3 | mul_mat_vec_f | 132 | 660.1 us | 5.0 us | 12.0% | 7.1% |
| 4 | rms_norm_f32 | 135 | 536.1 us | 4.0 us | 9.8% | 0.0% |
| 5 | rope_norm | 132 | 400.6 us | 3.0 us | 7.3% | 0.1% |
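The summary columns are related by simple arithmetic, which can be cross-checked with a short script. The sketch below assumes that the average column is total time divided by call count and that the percentage column is each kernel's share of all GPU time in the run; the report does not state either definition, and the names `ROWS`, `to_us`, and `summarize` are illustrative only.

```python
# Rows copied from the Kernel Summary table:
# (kernel, calls, total time as printed, reported share of GPU time in %)
ROWS = [
    ("mul_mat_vec_q", 399, "2.101 ms", 38.3),
    ("quantize_q8_1", 399, "1.043 ms", 19.0),
    ("mul_mat_vec_f", 132, "660.1 us", 12.0),
    ("rms_norm_f32", 135, "536.1 us", 9.8),
    ("rope_norm", 132, "400.6 us", 7.3),
]

# Conversion factors to microseconds for the two units used in the table.
UNIT_TO_US = {"us": 1.0, "ms": 1000.0}


def to_us(text):
    """Convert a string such as '2.101 ms' to microseconds."""
    value, unit = text.split()
    return float(value) * UNIT_TO_US[unit]


def summarize(rows):
    """Derive the per-call average and the run total implied by each row."""
    derived = []
    for name, calls, total, share_pct in rows:
        total_us = to_us(total)
        derived.append({
            "kernel": name,
            # Assumption: Avg = Total / Calls.
            "avg_us": total_us / calls,
            # Assumption: % = Total / (GPU time of the whole run).
            "implied_run_us": total_us / (share_pct / 100.0),
        })
    return derived


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(f"{row['kernel']:<14} avg={row['avg_us']:.2f} us  "
              f"implied run total={row['implied_run_us']:.0f} us")
```

Under these assumptions the derived averages agree with the table after rounding, and every row implies a run total of roughly 5.5 ms, which suggests the percentages are taken over all kernels in the run rather than over the five listed.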
#1: mul_mat_vec_q
Occupancy: 11.4% (per the Kernel Summary table)
Workgroup Size: 64x2x1 (128)
Instruction Mix
| VALU | SALU | SMEM | VMEM_RD | VMEM_WR | LDS | FLAT | MFMA |
| 187,560,192 | 113,325,312 | 20,141,568 | 16,020,480 | 907,008 | 10,500,864 | 16,927,488 | 0 |
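Raw instruction counts are easier to compare as shares of the total. The following is a minimal sketch using the mul_mat_vec_q counts above; it assumes the eight categories are disjoint, so that their sum is the total instruction count, which the report does not confirm. The names `MUL_MAT_VEC_Q` and `instruction_shares` are illustrative.

```python
# Instruction counts copied from the mul_mat_vec_q Instruction Mix table.
MUL_MAT_VEC_Q = {
    "VALU": 187_560_192,
    "SALU": 113_325_312,
    "SMEM": 20_141_568,
    "VMEM_RD": 16_020_480,
    "VMEM_WR": 907_008,
    "LDS": 10_500_864,
    "FLAT": 16_927_488,
    "MFMA": 0,
}


def instruction_shares(counts):
    """Return each category's fraction of the summed instruction count.

    Assumption: the categories do not overlap, so the sum is the total.
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    shares = instruction_shares(MUL_MAT_VEC_Q)
    # Print categories from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<8} {share:6.1%}")
```

The same function can be applied to the counts of any other kernel in this report by swapping in that kernel's dictionary.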
[Figure: Roofline Utilization, showing Bandwidth and Compute (FLOPS / IOPS)]
#2: quantize_q8_1
Occupancy: 0.1% (per the Kernel Summary table)
Workgroup Size: 256x1x1 (256)
Instruction Mix
| VALU | SALU | SMEM | VMEM_RD | VMEM_WR | LDS | FLAT | MFMA |
| 1,547,616 | 757,344 | 131,712 | 16,464 | 32,928 | 164,640 | 49,392 | 0 |
[Figure: Roofline Utilization, showing Bandwidth and Compute (FLOPS / IOPS)]
#3: mul_mat_vec_f
Occupancy: 7.1% (per the Kernel Summary table)
Workgroup Size: 64x1x1 (64)
Instruction Mix
| VALU | SALU | SMEM | VMEM_RD | VMEM_WR | LDS | FLAT | MFMA |
| 46,362,624 | 51,093,504 | 7,299,072 | 1,622,016 | 675,840 | 6,352,896 | 2,297,856 | 0 |
[Figure: Roofline Utilization, showing Bandwidth and Compute (FLOPS / IOPS)]
#4: rms_norm_f32
Occupancy: 0.0% (per the Kernel Summary table)
Workgroup Size: 1024x1x1 (1024)
Instruction Mix
| VALU | SALU | SMEM | VMEM_RD | VMEM_WR | LDS | FLAT | MFMA |
| 216,000 | 244,080 | 19,440 | 12,960 | 4,320 | 25,920 | 17,280 | 0 |
[Figure: Roofline Utilization, showing Bandwidth and Compute (FLOPS / IOPS)]
#5: rope_norm
Occupancy: 0.1% (per the Kernel Summary table)
Workgroup Size: 1x256x1 (256)
Instruction Mix
| VALU | SALU | SMEM | VMEM_RD | VMEM_WR | LDS | FLAT | MFMA |
| 761,112 | 135,696 | 35,904 | 5,016 | 2,376 | 0 | 7,392 | 0 |
[Figure: Roofline Utilization, showing Bandwidth and Compute (FLOPS / IOPS)]
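With all five instruction mixes collected, one simple cross-kernel comparison is the ratio of scalar (SALU) to vector (VALU) instructions. The sketch below is only an illustration of that arithmetic on the counts reported above; the 1.0 threshold and the names `MIX`, `salu_to_valu`, and `scalar_heavy` are assumptions made for this example, not part of the profiler's output.

```python
# Column order used by the Instruction Mix tables in this report.
COLUMNS = ("VALU", "SALU", "SMEM", "VMEM_RD", "VMEM_WR", "LDS", "FLAT", "MFMA")

# Instruction counts copied from the per-kernel Instruction Mix tables.
MIX = {
    "mul_mat_vec_q": (187_560_192, 113_325_312, 20_141_568, 16_020_480,
                      907_008, 10_500_864, 16_927_488, 0),
    "quantize_q8_1": (1_547_616, 757_344, 131_712, 16_464,
                      32_928, 164_640, 49_392, 0),
    "mul_mat_vec_f": (46_362_624, 51_093_504, 7_299_072, 1_622_016,
                      675_840, 6_352_896, 2_297_856, 0),
    "rms_norm_f32": (216_000, 244_080, 19_440, 12_960,
                     4_320, 25_920, 17_280, 0),
    "rope_norm": (761_112, 135_696, 35_904, 5_016,
                  2_376, 0, 7_392, 0),
}


def salu_to_valu(counts):
    """Ratio of scalar ALU to vector ALU instructions for one kernel."""
    by_name = dict(zip(COLUMNS, counts))
    return by_name["SALU"] / by_name["VALU"]


def scalar_heavy(mix, threshold=1.0):
    """Kernels whose SALU count exceeds threshold times their VALU count.

    The default threshold of 1.0 is an arbitrary choice for illustration.
    """
    return [name for name, counts in mix.items()
            if salu_to_valu(counts) > threshold]


if __name__ == "__main__":
    for name, counts in MIX.items():
        print(f"{name:<14} SALU/VALU = {salu_to_valu(counts):.2f}")
    print("SALU > VALU:", scalar_heavy(MIX))
```

A ratio above 1 only says that scalar instructions outnumber vector instructions in these counters. Whether that matters for performance is not something the report itself states.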
Generated by rocm-profile-agent using rocprofv3