LLM & VLM

  • This performance data was collected with each platform's CPU and NPU running at their maximum frequencies.
  • The script for setting the frequencies is located in the scripts directory.
  • All models are converted with optimization_level set to 1 to enable optimized runtime performance.
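As a reading aid for the tables below, TTFT and Tokens/s can be derived from raw timestamps roughly as follows. This is an illustrative sketch only; the function names are ours, not part of the RKLLM benchmark code:

```python
# Illustrative sketch of how the table's metrics are typically defined.
# These helpers are hypothetical, not taken from the RKLLM benchmark itself.

def ttft_ms(t_request: float, t_first_token: float) -> float:
    """Time To First Token (prefill latency), in milliseconds."""
    return (t_first_token - t_request) * 1000.0

def decode_tokens_per_s(new_tokens: int, t_first_token: float,
                        t_last_token: float) -> float:
    """Steady-state decode throughput over the tokens after the first one."""
    return (new_tokens - 1) / (t_last_token - t_first_token)
```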

RK3588

| Model | Model Size | Dtype | Seqlen | New_tokens | TTFT(ms) | Tokens/s | Memory(MB) |
|-------|------------|-------|--------|------------|----------|----------|------------|
| Qwen2 | 0.5B | w8a8 | 128 | 64 | 143.83 | 42.58 | 654.26 |
| MiniCPM4 | 0.5B | w8a8 | 128 | 64 | 128.46 | 45.13 | 524.55 |
| Qwen3 | 0.6B | w8a8 | 128 | 64 | 213.50 | 32.16 | 773.77 |
| TinyLLAMA | 1.1B | w8a8 | 128 | 64 | 239.00 | 24.49 | 1085.21 |
| Qwen2.5 | 1.5B | w8a8 | 128 | 64 | 412.27 | 16.32 | 1659.15 |
| RWKV7 | 1.5B | w8a8 | 128 | 64 | 788.00 | 13.33 | 1450.29 |
| InternLM2 | 1.8B | w8a8 | 128 | 64 | 374.00 | 15.58 | 1765.71 |
| Gemma2 | 2B | w8a8 | 128 | 64 | 679.90 | 9.80 | 2765.30 |
| Gemma3n | 2B | w8a8 | 128 | 64 | 1220.40 | 9.46 | 2709.25 |
| TeleChat2 | 3B | w8a8 | 128 | 64 | 649.60 | 10.22 | 2777.00 |
| Phi3 | 3.8B | w8a8 | 128 | 64 | 1022.00 | 7.50 | 3747.73 |
| MiniCPM3 | 4B | w8a8 | 128 | 64 | 1385.92 | 5.99 | 4339.61 |
| ChatGLM3 | 6B | w8a8 | 128 | 64 | 1395.34 | 4.94 | 5976.43 |
| Qwen3-VL | 2B | w8a8 | 128 | 64 | 391 | 15.12 | 1892.13 |
| DeepSeekOCR | 3B(A570M) | w8a8 | 128 | 64 | 696.21 | 31.81 | 3028.66 |
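A table row can be turned into an expected end-to-end generation time by combining TTFT with the decode rate. This is a back-of-envelope sketch (our own helper, not benchmark code); it ignores tokenization and sampling overhead:

```python
def total_generation_s(ttft_ms: float, tokens_per_s: float,
                       new_tokens: int) -> float:
    """Approximate wall-clock time to produce new_tokens tokens."""
    return ttft_ms / 1000.0 + new_tokens / tokens_per_s

# Example: Qwen2 0.5B (w8a8) on RK3588, numbers from the table above.
t = total_generation_s(143.83, 42.58, 64)  # ~1.65 s for 64 new tokens
```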

RK3576

| Model | Model Size | Dtype | Seqlen | New_tokens | TTFT(ms) | Tokens/s | Memory(MB) |
|-------|------------|-------|--------|------------|----------|----------|------------|
| Qwen2 | 0.5B | w4a16 | 128 | 64 | 327.72 | 34.24 | 426.24 |
| | 0.5B | w4a16_g128 | 128 | 64 | 363.58 | 33.22 | 445.95 |
| | 0.5B | w8a8 | 128 | 64 | 334.26 | 22.95 | 661.1 |
| MiniCPM4 | 0.5B | w4a16 | 128 | 64 | 348.87 | 35.8 | 322.41 |
| | 0.5B | w4a16_g128 | 128 | 64 | 371.96 | 32.88 | 362.23 |
| | 0.5B | w8a8 | 128 | 64 | 337.52 | 23.71 | 528.96 |
| Qwen3 | 0.6B | w4a16 | 128 | 64 | 482.82 | 25.16 | 495.99 |
| | 0.6B | w4a16_g128 | 128 | 64 | 512.36 | 24.3 | 528.48 |
| | 0.6B | w8a8 | 128 | 64 | 448.94 | 17.09 | 779.62 |
| TinyLLAMA | 1.1B | w4a16 | 128 | 64 | 517.82 | 21.32 | 591 |
| | 1.1B | w4a16_g128 | 128 | 64 | 658.78 | 18.89 | 681 |
| | 1.1B | w8a8 | 128 | 64 | 537.82 | 12.63 | 1082.83 |
| RWKV7 | 1.5B | w4a16 | 128 | 64 | 1779.65 | 9.96 | 799.89 |
| | 1.5B | w4a16_g128 | 128 | 64 | 1877.95 | 9.37 | 890.16 |
| | 1.5B | w8a8 | 128 | 64 | 1718.8 | 6.96 | 1458.48 |
| InternLM2 | 1.8B | w4a16 | 128 | 64 | 771.6 | 13.65 | 966.12 |
| | 1.8B | w4a16_g128 | 128 | 64 | 1001.23 | 12.18 | 1061.57 |
| | 1.8B | w8a8 | 128 | 64 | 777.86 | 7.91 | 1773.23 |
| Gemma2 | 2B | w4a16 | 128 | 64 | 1119.51 | 8.45 | 1529.03 |
| | 2B | w4a16_g128 | 128 | 64 | 1407.31 | 7.76 | 1616.45 |
| | 2B | w8a8 | 128 | 64 | 1052.77 | 5.01 | 2771.54 |
| Gemma-3n | 2B | w4a16 | 128 | 64 | 3187 | 7.38 | 1574.34 |
| | 2B | w8a8 | 128 | 64 | 3229.16 | 4.75 | 2722.76 |
| TeleChat2 | 3B | w4a16 | 128 | 64 | 1143.73 | 9.05 | 1514.98 |
| | 3B | w4a16_g128 | 128 | 64 | 1422.38 | 7.91 | 1633.54 |
| | 3B | w8a8 | 128 | 64 | 1035.37 | 5.15 | 2783.73 |
| Phi3 | 3.8B | w4a16 | 128 | 64 | 1800.92 | 6.52 | 1985.75 |
| | 3.8B | w4a16_g128 | 128 | 64 | 2236.9 | 5.96 | 2141.89 |
| | 3.8B | w8a8 | 128 | 64 | 1591.59 | 3.76 | 3757.22 |
| MiniCPM3 | 4B | w4a16 | 128 | 64 | 2484.63 | 4.94 | 2336.73 |
| | 4B | w4a16_g128 | 128 | 64 | 3053.52 | 4.49 | 2618.14 |
| | 4B | w8a8 | 128 | 64 | 2509.27 | 3.04 | 4366.85 |
| ChatGLM3 | 6B | w4a16 | 128 | 64 | 2121.26 | 4.7 | 3014.38 |
| | 6B | w4a16_g128 | 128 | 64 | 2958.88 | 4.03 | 3244.15 |
| | 6B | w8a8 | 128 | 64 | 1920.97 | 2.5 | 5958.65 |
| Qwen3-VL | 2B | w4a16 | 128 | 64 | 791.20 | 12.88 | 1082.65 |
| | 2B | w4a16_g128 | 128 | 64 | 1026.31 | 11.62 | 1170.89 |
| | 2B | w8a8 | 128 | 64 | 799.09 | 7.67 | 1900.80 |
| DeepSeekOCR | 3B(A570M) | w4a16 | 128 | 64 | 1010.15 | 24.85 | 1756.13 |
| | 3B(A570M) | w8a8 | 128 | 64 | 1312.00 | 16.21 | 3072.33 |
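The Memory(MB) column tracks quantized weight size plus runtime overhead. A rough lower bound on the weights alone is parameters × bits-per-weight; the sketch below is a back-of-envelope estimate of ours, ignoring group-quantization scales, higher-precision embeddings, and the KV cache:

```python
def weight_mb(params: float, bits_per_weight: float) -> float:
    """Approximate quantized weight footprint in MiB."""
    return params * bits_per_weight / 8 / 2**20

# Example: a 0.5B-parameter model such as Qwen2 0.5B.
w8 = weight_mb(0.5e9, 8)  # ~477 MiB -- compare 661.1 MB measured for w8a8
w4 = weight_mb(0.5e9, 4)  # ~238 MiB -- compare 426.24 MB measured for w4a16
```

The gap between the estimate and the measured figure is the runtime's working memory and activation buffers.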

RV1126B

| Model | Model Size | Dtype | Seqlen | New_tokens | TTFT(ms) | Tokens/s |
|-------|------------|-------|--------|------------|----------|----------|
| Qwen2 | 0.5B | w4a16 | 128 | 64 | 650.69 | 21.43 |
| | 0.5B | w4a16_g128 | 128 | 64 | 679.78 | 18.18 |
| | 0.5B | w8a8 | 128 | 64 | 636.90 | 13.91 |
| MiniCPM4 | 0.5B | w4a16 | 128 | 64 | 654.20 | 22.97 |
| | 0.5B | w4a16_g128 | 128 | 64 | 691.57 | 18.78 |
| | 0.5B | w8a8 | 128 | 64 | 663.41 | 15.12 |
| Qwen3 | 0.6B | w4a16 | 128 | 64 | 955.94 | 15.41 |
| | 0.6B | w4a16_g128 | 128 | 64 | 1019.94 | 12.60 |
| | 0.6B | w8a8 | 128 | 64 | 945.18 | 10.55 |

Multimodal

| Model | Stage | RK3588(w8a8) | RK3576(w4a16) |
|-------|-------|--------------|---------------|
| Qwen2-VL-2B | img-encoder(392*392) | 3.28s | 3.55s |
| | Prefill(len=196) | 632.6ms | 1234.9ms |
| | Decode | 16.6 tokens/s | 14.57 tokens/s |
| Qwen2.5-VL-3B | img-encoder(392*392) | 2.93s | 2.87s |
| | Prefill(len=196) | 1120ms | 2130ms |
| | Decode | 8.66 tokens/s | 7.87 tokens/s |
| MiniCPM-V-2_6 | img-encoder(448*448) | 3.27s | 2.4s |
| | Prefill(len=64) | 826ms | 1230ms |
| | Decode | 4.18 tokens/s | 3.85 tokens/s |
| SmolVLM-256M | img-encoder(512*512) | 842ms | 768ms |
| | Prefill(len=128) | 77.3ms | 180ms |
| | Decode | 78 tokens/s | 57.73 tokens/s |
| Qwen3-VL-2B | img-encoder(448*448) | 2.08s | 1.61s |
| | Prefill(len=196) | 649ms | 1587ms |
| | Decode | 14.91 tokens/s | 10.36 tokens/s |
| DeepSeekOCR-3B(A570M) | img-encoder(448*448) | 2.09s | 2.27s |
| | Prefill(len=128) | 696ms | 1010ms |
| | Decode | 31.8 tokens/s | 22.3 tokens/s |
  • The img-encoder runs inference on RKNN with FP16, tested using all NPU cores.
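A multimodal response chains the three stages in the table above: image encoding, then prefill, then decode. A sketch of the combined latency (our own illustrative helper, not benchmark code), using the Qwen2-VL-2B numbers for RK3588:

```python
def vlm_response_s(encoder_s: float, prefill_ms: float,
                   decode_tps: float, new_tokens: int) -> float:
    """Approximate end-to-end latency for one image plus new_tokens of text."""
    return encoder_s + prefill_ms / 1000.0 + new_tokens / decode_tps

# Qwen2-VL-2B on RK3588 (w8a8): 3.28 s encode, 632.6 ms prefill, 16.6 tok/s.
t = vlm_response_s(3.28, 632.6, 16.6, 64)  # ~7.77 s for 64 new tokens
```

Note that for short answers the image encoder dominates the total, which is why it is benchmarked as a separate stage.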

Inference

The performance benchmarks and inference data presented in this section are sourced from the official Rockchip RKNN Model Zoo. They reflect the optimized performance of various LLMs and VLMs on Rockchip NPU platforms using the latest RKNN-Toolkit2.

Source: