LLM & VLM

  • This performance data was collected with each platform's CPU and NPU running at their maximum frequencies.
  • The script for setting the frequencies is located in the scripts directory.
  • All models are converted with optimization_level set to 1 to enable optimized runtime performance.
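As a reading aid for the tables below, TTFT and Tokens/s can be derived from raw timestamps roughly as follows. This is an illustrative sketch only; the function names are ours, not part of the RKLLM benchmark code:

```python
# Illustrative sketch of how the table's metrics are typically defined.
# These helpers are hypothetical, not taken from the RKLLM benchmark itself.

def ttft_ms(t_request: float, t_first_token: float) -> float:
    """Time To First Token (prefill latency), in milliseconds."""
    return (t_first_token - t_request) * 1000.0

def decode_tokens_per_s(new_tokens: int, t_first_token: float,
                        t_last_token: float) -> float:
    """Steady-state decode throughput over the tokens after the first one."""
    return (new_tokens - 1) / (t_last_token - t_first_token)
```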

RK3588

| Model | Model Size | Dtype | Seqlen | New_tokens | TTFT(ms) | Tokens/s | Memory(MB) |
|-------|------------|-------|--------|------------|----------|----------|------------|
| Qwen2 | 0.5B | w8a8 | 128 | 64 | 143.83 | 42.58 | 654.26 |
| MiniCPM4 | 0.5B | w8a8 | 128 | 64 | 128.46 | 45.13 | 524.55 |
| Qwen3 | 0.6B | w8a8 | 128 | 64 | 213.50 | 32.16 | 773.77 |
| TinyLLAMA | 1.1B | w8a8 | 128 | 64 | 239.00 | 24.49 | 1085.21 |
| Qwen2.5 | 1.5B | w8a8 | 128 | 64 | 412.27 | 16.32 | 1659.15 |
| RWKV7 | 1.5B | w8a8 | 128 | 64 | 788.00 | 13.33 | 1450.29 |
| InternLM2 | 1.8B | w8a8 | 128 | 64 | 374.00 | 15.58 | 1765.71 |
| Gemma2 | 2B | w8a8 | 128 | 64 | 679.90 | 9.80 | 2765.30 |
| Gemma3n | 2B | w8a8 | 128 | 64 | 1220.40 | 9.46 | 2709.25 |
| TeleChat2 | 3B | w8a8 | 128 | 64 | 649.60 | 10.22 | 2777.00 |
| Phi3 | 3.8B | w8a8 | 128 | 64 | 1022.00 | 7.50 | 3747.73 |
| MiniCPM3 | 4B | w8a8 | 128 | 64 | 1385.92 | 5.99 | 4339.61 |
| ChatGLM3 | 6B | w8a8 | 128 | 64 | 1395.34 | 4.94 | 5976.43 |
| Qwen3-VL | 2B | w8a8 | 128 | 64 | 391 | 15.12 | 1892.13 |
| DeepSeekOCR | 3B(A570M) | w8a8 | 128 | 64 | 696.21 | 31.81 | 3028.66 |
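A table row can be turned into an expected end-to-end generation time by combining TTFT with the decode rate. This is a back-of-envelope sketch (our own helper, not benchmark code); it ignores tokenization and sampling overhead:

```python
def total_generation_s(ttft_ms: float, tokens_per_s: float,
                       new_tokens: int) -> float:
    """Approximate wall-clock time to produce new_tokens tokens."""
    return ttft_ms / 1000.0 + new_tokens / tokens_per_s

# Example: Qwen2 0.5B (w8a8) on RK3588, numbers from the table above.
t = total_generation_s(143.83, 42.58, 64)  # ~1.65 s for 64 new tokens
```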

RK3576

| Model | Model Size | Dtype | Seqlen | New_tokens | TTFT(ms) | Tokens/s | Memory(MB) |
|-------|------------|-------|--------|------------|----------|----------|------------|
| Qwen2 | 0.5B | w4a16 | 128 | 64 | 327.72 | 34.24 | 426.24 |
| | 0.5B | w4a16_g128 | 128 | 64 | 363.58 | 33.22 | 445.95 |
| | 0.5B | w8a8 | 128 | 64 | 334.26 | 22.95 | 661.1 |
| MiniCPM4 | 0.5B | w4a16 | 128 | 64 | 348.87 | 35.8 | 322.41 |
| | 0.5B | w4a16_g128 | 128 | 64 | 371.96 | 32.88 | 362.23 |
| | 0.5B | w8a8 | 128 | 64 | 337.52 | 23.71 | 528.96 |
| Qwen3 | 0.6B | w4a16 | 128 | 64 | 482.82 | 25.16 | 495.99 |
| | 0.6B | w4a16_g128 | 128 | 64 | 512.36 | 24.3 | 528.48 |
| | 0.6B | w8a8 | 128 | 64 | 448.94 | 17.09 | 779.62 |
| TinyLLAMA | 1.1B | w4a16 | 128 | 64 | 517.82 | 21.32 | 591 |
| | 1.1B | w4a16_g128 | 128 | 64 | 658.78 | 18.89 | 681 |
| | 1.1B | w8a8 | 128 | 64 | 537.82 | 12.63 | 1082.83 |
| RWKV7 | 1.5B | w4a16 | 128 | 64 | 1779.65 | 9.96 | 799.89 |
| | 1.5B | w4a16_g128 | 128 | 64 | 1877.95 | 9.37 | 890.16 |
| | 1.5B | w8a8 | 128 | 64 | 1718.8 | 6.96 | 1458.48 |
| InternLM2 | 1.8B | w4a16 | 128 | 64 | 771.6 | 13.65 | 966.12 |
| | 1.8B | w4a16_g128 | 128 | 64 | 1001.23 | 12.18 | 1061.57 |
| | 1.8B | w8a8 | 128 | 64 | 777.86 | 7.91 | 1773.23 |
| Gemma2 | 2B | w4a16 | 128 | 64 | 1119.51 | 8.45 | 1529.03 |
| | 2B | w4a16_g128 | 128 | 64 | 1407.31 | 7.76 | 1616.45 |
| | 2B | w8a8 | 128 | 64 | 1052.77 | 5.01 | 2771.54 |
| Gemma-3n | 2B | w4a16 | 128 | 64 | 3187 | 7.38 | 1574.34 |
| | 2B | w8a8 | 128 | 64 | 3229.16 | 4.75 | 2722.76 |
| TeleChat2 | 3B | w4a16 | 128 | 64 | 1143.73 | 9.05 | 1514.98 |
| | 3B | w4a16_g128 | 128 | 64 | 1422.38 | 7.91 | 1633.54 |
| | 3B | w8a8 | 128 | 64 | 1035.37 | 5.15 | 2783.73 |
| Phi3 | 3.8B | w4a16 | 128 | 64 | 1800.92 | 6.52 | 1985.75 |
| | 3.8B | w4a16_g128 | 128 | 64 | 2236.9 | 5.96 | 2141.89 |
| | 3.8B | w8a8 | 128 | 64 | 1591.59 | 3.76 | 3757.22 |
| MiniCPM3 | 4B | w4a16 | 128 | 64 | 2484.63 | 4.94 | 2336.73 |
| | 4B | w4a16_g128 | 128 | 64 | 3053.52 | 4.49 | 2618.14 |
| | 4B | w8a8 | 128 | 64 | 2509.27 | 3.04 | 4366.85 |
| ChatGLM3 | 6B | w4a16 | 128 | 64 | 2121.26 | 4.7 | 3014.38 |
| | 6B | w4a16_g128 | 128 | 64 | 2958.88 | 4.03 | 3244.15 |
| | 6B | w8a8 | 128 | 64 | 1920.97 | 2.5 | 5958.65 |
| Qwen3-VL | 2B | w4a16 | 128 | 64 | 791.20 | 12.88 | 1082.65 |
| | 2B | w4a16_g128 | 128 | 64 | 1026.31 | 11.62 | 1170.89 |
| | 2B | w8a8 | 128 | 64 | 799.09 | 7.67 | 1900.80 |
| DeepSeekOCR | 3B(A570M) | w4a16 | 128 | 64 | 1010.15 | 24.85 | 1756.13 |
| | 3B(A570M) | w8a8 | 128 | 64 | 1312.00 | 16.21 | 3072.33 |
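The Memory(MB) column tracks quantized weight size plus runtime overhead. A rough lower bound on the weights alone is parameters × bits-per-weight; the sketch below is a back-of-envelope estimate of ours, ignoring group-quantization scales, higher-precision embeddings, and the KV cache:

```python
def weight_mb(params: float, bits_per_weight: float) -> float:
    """Approximate quantized weight footprint in MiB."""
    return params * bits_per_weight / 8 / 2**20

# Example: a 0.5B-parameter model such as Qwen2 0.5B.
w8 = weight_mb(0.5e9, 8)  # ~477 MiB -- compare 661.1 MB measured for w8a8
w4 = weight_mb(0.5e9, 4)  # ~238 MiB -- compare 426.24 MB measured for w4a16
```

The gap between the estimate and the measured figure is the runtime's working memory and activation buffers.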

RV1126B

| Model | Model Size | Dtype | Seqlen | New_tokens | TTFT(ms) | Tokens/s |
|-------|------------|-------|--------|------------|----------|----------|
| Qwen2 | 0.5B | w4a16 | 128 | 64 | 650.69 | 21.43 |
| | 0.5B | w4a16_g128 | 128 | 64 | 679.78 | 18.18 |
| | 0.5B | w8a8 | 128 | 64 | 636.90 | 13.91 |
| MiniCPM4 | 0.5B | w4a16 | 128 | 64 | 654.20 | 22.97 |
| | 0.5B | w4a16_g128 | 128 | 64 | 691.57 | 18.78 |
| | 0.5B | w8a8 | 128 | 64 | 663.41 | 15.12 |
| Qwen3 | 0.6B | w4a16 | 128 | 64 | 955.94 | 15.41 |
| | 0.6B | w4a16_g128 | 128 | 64 | 1019.94 | 12.60 |
| | 0.6B | w8a8 | 128 | 64 | 945.18 | 10.55 |

Multimodal

| Model | Stage | RK3588(w8a8) | RK3576(w4a16) |
|-------|-------|--------------|---------------|
| Qwen2-VL-2B | img-encoder(392*392) | 3.28s | 3.55s |
| | Prefill(len=196) | 632.6ms | 1234.9ms |
| | Decode | 16.6 tokens/s | 14.57 tokens/s |
| Qwen2.5-VL-3B | img-encoder(392*392) | 2.93s | 2.87s |
| | Prefill(len=196) | 1120ms | 2130ms |
| | Decode | 8.66 tokens/s | 7.87 tokens/s |
| MiniCPM-V-2_6 | img-encoder(448*448) | 3.27s | 2.4s |
| | Prefill(len=64) | 826ms | 1230ms |
| | Decode | 4.18 tokens/s | 3.85 tokens/s |
| SmolVLM-256M | img-encoder(512*512) | 842ms | 768ms |
| | Prefill(len=128) | 77.3ms | 180ms |
| | Decode | 78 tokens/s | 57.73 tokens/s |
| Qwen3-VL-2B | img-encoder(448*448) | 2.08s | 1.61s |
| | Prefill(len=196) | 649ms | 1587ms |
| | Decode | 14.91 tokens/s | 10.36 tokens/s |
| DeepSeekOCR-3B(A570M) | img-encoder(448*448) | 2.09s | 2.27s |
| | Prefill(len=128) | 696ms | 1010ms |
| | Decode | 31.8 tokens/s | 22.3 tokens/s |
  • The img-encoder runs inference on RKNN with FP16, tested using all NPU cores.
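A multimodal response chains the three stages in the table above: image encoding, then prefill, then decode. A sketch of the combined latency (our own illustrative helper, not benchmark code), using the Qwen2-VL-2B numbers for RK3588:

```python
def vlm_response_s(encoder_s: float, prefill_ms: float,
                   decode_tps: float, new_tokens: int) -> float:
    """Approximate end-to-end latency for one image plus new_tokens of text."""
    return encoder_s + prefill_ms / 1000.0 + new_tokens / decode_tps

# Qwen2-VL-2B on RK3588 (w8a8): 3.28 s encode, 632.6 ms prefill, 16.6 tok/s.
t = vlm_response_s(3.28, 632.6, 16.6, 64)  # ~7.77 s for 64 new tokens
```

Note that for short answers the image encoder dominates the total, which is why it is benchmarked as a separate stage.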

Inference

The performance benchmarks and inference data presented in this section are sourced from the official Rockchip RKNN Model Zoo. They reflect the optimized performance of various LLMs and VLMs on Rockchip NPU platforms using the latest RKNN-Toolkit2.

Source: