ROCm Profile Report

2026-03-27 08:11:54 — AMD Instinct MI300X (gfx942) — 304 CUs @ 2100 MHz
../llama.cpp/build/bin/llama-bench -m ../llama.cpp/models/tinyllama-1.1b-q4_0.gguf -t 1 -r 1 -p 0 -n 2 -dev ROCm0
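
The peak figures quoted in the Roofline Utilization sections of this report (163.4 TFLOPS FP32, 1307.4 TFLOPS FP16 MFMA, 638.4 GOPS SALU, 40857.0 GB/s L1, 81715.0 GB/s LDS) appear consistent with scaling the header's CU count and clock by a fixed per-CU, per-cycle factor. The sketch below reproduces them under that assumption. The factors are inferred from the reported numbers, not read from the profiler, and the names `PER_CU_PER_CYCLE` and `peak_rates` are hypothetical.

```python
# Device parameters taken from the report header.
NUM_CUS = 304
CLOCK_HZ = 2100e6

# ASSUMPTION: per-CU, per-cycle throughput factors inferred from the
# report's peak figures; they are not emitted by the profiler itself.
PER_CU_PER_CYCLE = {
    "VALU FP32 (FLOPs)": 256,
    "MFMA FP16 (FLOPs)": 2048,
    "SALU INT (ops)": 1,
    "L1 (bytes)": 64,
    "LDS (bytes)": 128,
}


def peak_rates(num_cus=NUM_CUS, clock_hz=CLOCK_HZ):
    """Return the peak rate for each unit in ops/s or bytes/s."""
    cu_cycles_per_second = num_cus * clock_hz
    return {
        name: factor * cu_cycles_per_second
        for name, factor in PER_CU_PER_CYCLE.items()
    }


peaks = peak_rates()
for name, rate in peaks.items():
    # Print in tera-units so the values line up with the report's TFLOPS.
    print(f"{name:20s} {rate / 1e12:10.4f} T/s")
```

The HBM and L2 peaks (5300.0 GB/s and 13926.0 GB/s) do not follow this per-CU pattern and are left out of the sketch.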

Timeline

[Timeline figure. Lanes: CPU (HIP), Memory, GPU. Time axis: 0 ns to 806.187 ms. Kernel legend: mul_mat_vec_q, quantize_q8_1, mul_mat_vec_f, rms_norm_f32, rope_norm, Other kernels. Memory copy legend: H→D (Host→Device), D→H (Device→Host), D→D (Device→Device).]

Top Kernels by GPU Time

Top 5 Kernels: mul_mat_vec_q (38.3%), quantize_q8_1 (19.0%), mul_mat_vec_f (12.0%), rms_norm_f32 (9.8%), rope_norm (7.3%)

Kernel Summary

#  Kernel         Calls  Total     Avg     % GPU Time  Occupancy
1  mul_mat_vec_q  399    2.101 ms  5.3 us  38.3%       11.4%
2  quantize_q8_1  399    1.043 ms  2.6 us  19.0%       0.1%
3  mul_mat_vec_f  132    660.1 us  5.0 us  12.0%       7.1%
4  rms_norm_f32   135    536.1 us  4.0 us  9.8%        0.0%
5  rope_norm      132    400.6 us  3.0 us  7.3%        0.1%
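
The Kernel Summary columns are related by simple arithmetic: Avg is Total divided by Calls, and each percentage is that kernel's share of total GPU time. A minimal sketch of the check follows, using the figures from the summary. The helper names `avg_us` and `implied_gpu_total_us` are hypothetical, and the implied total is only an estimate because the reported percentages are rounded.

```python
# (calls, total time in microseconds, reported share of GPU time in percent)
kernels = {
    "mul_mat_vec_q": (399, 2101.0, 38.3),
    "quantize_q8_1": (399, 1043.0, 19.0),
    "mul_mat_vec_f": (132, 660.1, 12.0),
    "rms_norm_f32": (135, 536.1, 9.8),
    "rope_norm": (132, 400.6, 7.3),
}


def avg_us(calls, total_us):
    """Average time per launch in microseconds."""
    return total_us / calls


def implied_gpu_total_us(rows):
    """Estimate total GPU time from the top-kernel times and their shares."""
    top_time = sum(total for _, total, _ in rows.values())
    top_share = sum(pct for _, _, pct in rows.values()) / 100.0
    return top_time / top_share


for name, (calls, total_us, pct) in kernels.items():
    print(f"{name:14s} avg = {avg_us(calls, total_us):.1f} us ({pct}% of GPU time)")

print(f"implied total GPU time = {implied_gpu_total_us(kernels) / 1000:.2f} ms")
```

With these inputs the five kernels account for 86.4% of GPU time, which implies roughly 5.5 ms of total kernel time for the run.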

#1: mul_mat_vec_q

Total:      2.101 ms
Avg:        5.3 us
Min:        2.3 us
Max:        34.0 us
Calls:      399
% GPU Time: 38.3%

Occupancy

Workgroup Size: 64x2x1 (128)
Waves / WG:     2
Occupancy:      11.4%
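
The Waves / WG values in this report match the workgroup thread count divided by a wavefront size of 64, rounded up. The sketch below checks that against all five workgroup sizes listed in the report. The 64-lane wavefront is an assumption about how the profiler derives the figure (it is the native wavefront width on gfx942), and `waves_per_workgroup` is a hypothetical helper.

```python
import math

# ASSUMPTION: 64 work-items per wavefront, the native width on gfx942.
WAVEFRONT_SIZE = 64


def waves_per_workgroup(dims, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts needed to cover one workgroup of shape dims."""
    threads = math.prod(dims)
    return math.ceil(threads / wavefront_size)


# Workgroup shapes as reported for the top five kernels.
workgroups = {
    "mul_mat_vec_q": (64, 2, 1),
    "quantize_q8_1": (256, 1, 1),
    "mul_mat_vec_f": (64, 1, 1),
    "rms_norm_f32": (1024, 1, 1),
    "rope_norm": (1, 256, 1),
}

for name, dims in workgroups.items():
    print(f"{name:14s} {dims} -> {waves_per_workgroup(dims)} waves")
```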

Instruction Mix

Type     Instructions  Share
VALU     187,560,192   51.3%
SALU     113,325,312   31.0%
SMEM     20,141,568    5.5%
VMEM_RD  16,020,480    4.4%
VMEM_WR  907,008       0.2%
LDS      10,500,864    2.9%
FLAT     16,927,488    4.6%
MFMA     0             0.0%
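
The Instruction Mix percentages can be reproduced by dividing each count by the sum of all listed counts. The sketch below does this for mul_mat_vec_q. Note that in these figures the FLAT count equals VMEM_RD plus VMEM_WR, so the denominator may count vector memory instructions twice; that is an observation from the numbers, not documented profiler behavior. The function name `instruction_shares` is hypothetical.

```python
# Instruction counts for mul_mat_vec_q, copied from the report.
counts = {
    "VALU": 187_560_192,
    "SALU": 113_325_312,
    "SMEM": 20_141_568,
    "VMEM_RD": 16_020_480,
    "VMEM_WR": 907_008,
    "LDS": 10_500_864,
    "FLAT": 16_927_488,
    "MFMA": 0,
}


def instruction_shares(instr_counts):
    """Return each category's share of the summed counts, in percent."""
    total = sum(instr_counts.values())
    return {name: 100.0 * n / total for name, n in instr_counts.items()}


shares = instruction_shares(counts)
for name, pct in shares.items():
    print(f"{name:8s} {counts[name]:>12,d} {pct:5.1f}%")

# FLAT matches the sum of the two VMEM categories in this data set.
print("FLAT == VMEM_RD + VMEM_WR:",
      counts["FLAT"] == counts["VMEM_RD"] + counts["VMEM_WR"])
```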

Roofline Utilization

Bandwidth
Level  Achieved      Peak          Utilization
HBM    456.2 GB/s    5300.0 GB/s   8.6%
L2     552.0 GB/s    13926.0 GB/s  4.0%
L1     32169.9 GB/s  40857.0 GB/s  78.7%
LDS    1281.0 GB/s   81715.0 GB/s  1.6%

Compute (FLOPS / IOPS)
Unit         Achieved    Peak           Utilization
VALU (FP32)  5.8 TFLOPS  163.4 TFLOPS   3.5%
MFMA (FP16)  0.0 MFLOPS  1307.4 TFLOPS  0.0%
SALU (INT)   54.4 GOPS   638.4 GOPS     8.5%
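
Each Roofline Utilization percentage is the achieved rate divided by the peak rate for that resource. The following sketch recomputes the mul_mat_vec_q values from the achieved and peak figures in the report; `utilization_pct` is a hypothetical helper, and the inputs are the rounded values as printed, so small rounding differences are possible for other kernels.

```python
def utilization_pct(achieved, peak):
    """Achieved rate as a percentage of the peak rate (same units)."""
    return 100.0 * achieved / peak


# (achieved, peak) for mul_mat_vec_q. Bandwidth in GB/s, VALU in TFLOPS,
# SALU in GOPS, all copied from the report.
roofline = {
    "HBM": (456.2, 5300.0),
    "L2": (552.0, 13926.0),
    "L1": (32169.9, 40857.0),
    "LDS": (1281.0, 81715.0),
    "VALU (FP32)": (5.8, 163.4),
    "SALU (INT)": (54.4, 638.4),
}

for name, (achieved, peak) in roofline.items():
    print(f"{name:12s} {utilization_pct(achieved, peak):5.1f}%")
```

Read this way, mul_mat_vec_q sits near 79% of L1 bandwidth while HBM, L2, and compute all remain below 10%, which suggests the L1 cache is the most heavily used resource for this kernel in this run.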

#2: quantize_q8_1

Total:      1.043 ms
Avg:        2.6 us
Min:        1.7 us
Max:        5.0 us
Calls:      399
% GPU Time: 19.0%

Occupancy

Workgroup Size: 256x1x1 (256)
Waves / WG:     4
Occupancy:      0.1%

Instruction Mix

Type     Instructions  Share
VALU     1,547,616     57.3%
SALU     757,344       28.0%
SMEM     131,712       4.9%
VMEM_RD  16,464        0.6%
VMEM_WR  32,928        1.2%
LDS      164,640       6.1%
FLAT     49,392        1.8%
MFMA     0             0.0%

Roofline Utilization

Bandwidth
Level  Achieved   Peak          Utilization
HBM    10.2 GB/s  5300.0 GB/s   0.2%
L2     5.7 GB/s   13926.0 GB/s  0.0%
L1     49.8 GB/s  40857.0 GB/s  0.1%
LDS    41.5 GB/s  81715.0 GB/s  0.1%

Compute (FLOPS / IOPS)
Unit         Achieved     Peak           Utilization
VALU (FP32)  97.6 GFLOPS  163.4 TFLOPS   0.1%
MFMA (FP16)  0.0 MFLOPS   1307.4 TFLOPS  0.0%
SALU (INT)   746.5 MOPS   638.4 GOPS     0.1%

#3: mul_mat_vec_f

Total:      660.1 us
Avg:        5.0 us
Min:        3.4 us
Max:        6.7 us
Calls:      132
% GPU Time: 12.0%

Occupancy

Workgroup Size: 64x1x1 (64)
Waves / WG:     1
Occupancy:      7.1%

Instruction Mix

Type     Instructions  Share
VALU     46,362,624    40.1%
SALU     51,093,504    44.2%
SMEM     7,299,072     6.3%
VMEM_RD  1,622,016     1.4%
VMEM_WR  675,840       0.6%
LDS      6,352,896     5.5%
FLAT     2,297,856     2.0%
MFMA     0             0.0%

Roofline Utilization

Bandwidth
Level  Achieved     Peak          Utilization
HBM    74.2 GB/s    5300.0 GB/s   1.4%
L2     428.4 GB/s   13926.0 GB/s  3.1%
L1     8273.2 GB/s  40857.0 GB/s  20.2%
LDS    2777.4 GB/s  81715.0 GB/s  3.4%

Compute (FLOPS / IOPS)
Unit         Achieved    Peak           Utilization
VALU (FP32)  5.1 TFLOPS  163.4 TFLOPS   3.1%
MFMA (FP16)  0.0 MFLOPS  1307.4 TFLOPS  0.0%
SALU (INT)   87.2 GOPS   638.4 GOPS     13.7%

#4: rms_norm_f32

Total:      536.1 us
Avg:        4.0 us
Min:        3.6 us
Max:        5.2 us
Calls:      135
% GPU Time: 9.8%

Occupancy

Workgroup Size: 1024x1x1 (1024)
Waves / WG:     16
Occupancy:      0.0%

Instruction Mix

Type     Instructions  Share
VALU     216,000       40.0%
SALU     244,080       45.2%
SMEM     19,440        3.6%
VMEM_RD  12,960        2.4%
VMEM_WR  4,320         0.8%
LDS      25,920        4.8%
FLAT     17,280        3.2%
MFMA     0             0.0%

Roofline Utilization

Bandwidth
Level  Achieved   Peak          Utilization
HBM    4.6 GB/s   5300.0 GB/s   0.1%
L2     3.9 GB/s   13926.0 GB/s  0.0%
L1     31.0 GB/s  40857.0 GB/s  0.1%
LDS    11.6 GB/s  81715.0 GB/s  0.0%

Compute (FLOPS / IOPS)
Unit         Achieved     Peak           Utilization
VALU (FP32)  23.9 GFLOPS  163.4 TFLOPS   0.0%
MFMA (FP16)  0.0 MFLOPS   1307.4 TFLOPS  0.0%
SALU (INT)   422.4 MOPS   638.4 GOPS     0.1%

#5: rope_norm

Total:      400.6 us
Avg:        3.0 us
Min:        2.1 us
Max:        5.2 us
Calls:      132
% GPU Time: 7.3%

Occupancy

Workgroup Size: 1x256x1 (256)
Waves / WG:     4
Occupancy:      0.1%

Instruction Mix

Type     Instructions  Share
VALU     761,112       80.3%
SALU     135,696       14.3%
SMEM     35,904        3.8%
VMEM_RD  5,016         0.5%
VMEM_WR  2,376         0.3%
LDS      0             0.0%
FLAT     7,392         0.8%
MFMA     0             0.0%

Roofline Utilization

Bandwidth
Level  Achieved   Peak          Utilization
HBM    11.8 GB/s  5300.0 GB/s   0.2%
L2     2.5 GB/s   13926.0 GB/s  0.0%
L1     51.8 GB/s  40857.0 GB/s  0.1%
LDS    0.0 GB/s   81715.0 GB/s  0.0%

Compute (FLOPS / IOPS)
Unit         Achieved      Peak           Utilization
VALU (FP32)  115.8 GFLOPS  163.4 TFLOPS   0.1%
MFMA (FP16)  0.0 MFLOPS    1307.4 TFLOPS  0.0%
SALU (INT)   322.6 MOPS    638.4 GOPS     0.1%

Generated by rocm-profile-agent using rocprofv3