AMD AI MAX+395 跑 QWEN3.5-122B-A10B-MTP-Q4 实测

之前有测了这个芯片跑QWEN3.6-27B和QWEN3.6-35B-A3B的MTP对比，《本地模型性能翻倍！qwen3.6-MTP测试》
但那两个模型体积小，还不如买两张二手前代甚至前前代的游戏N卡去跑，价格更便宜性能也更强。那我这AMD是笑话么？不，这台机器有128GB统一内存，windows下可以分96G给显存，完全能跑更大的模型，今天就来测下QWEN3.5-122B-A10B这个80GB左右的模型。

可能有人问现在QWEN都3.7了为什么还用3.5？其实从3.6开始开源策略就变了，大于35B的模型不再开源了，80B/122B/397B目前只有3.5的，3.6-PLUS和3.6-MAX都不开源，3.7甚至一个开源的都没有。

另外，我之前批量测试本地模型时，有关注到lmstudio里跑QWEN3.5-122B-A10B-Q4_K_S能到20t/s，如果开启MTP提一提速，没准真的能达到本地可用的级别，可能会比QWEN3.6-27B开MTP的性能更好，而且其参数量是27B的4.5倍！虽然激活参数量只有10B，还不到27B的一半，但大概率比35B-A3B只激活3B要更聪明。

我手上122B有三个版本，都是unsloth量化的：

Qwen3.5-122B-A10B-GGUF Q4_K_S
Qwen3.5-122B-A10B-MTP-GGUF Q4_K_S
Qwen3.5-122B-A10B-MTP-GGUF Q4_K_XL

由于模型太大下载耗时间，而且磁盘空间也不够了，Q4_K_XL的非MTP版本就没下（玩本地大模型高速大容量的硬盘是必备的，1T都是小硬盘了）。

llama.cpp 用的版本是 llama-b9190-bin-win-vulkan-x64

先说结论

MTP确实有用，Q4_K_S版本从21.66提到了33.15 tokens/s，提升了53%。不过这个是S的型号，XL的估计性能会稍弱点，但更适合复杂任务。

性能数据

Q4_K_S 非MTP vs MTP对比：

配置	生成速度 (tokens/s)	提升幅度
非MTP	21.66	-
MTP (spec-draft-n-max=4)	33.15	+53%

MTP Q4_K_S 不同 spec-draft-n-max 值：

spec-draft-n-max	生成速度 (tokens/s)	草稿接受率
2	29.15	83.9%
3	30.97	81.2%
4	33.15	77.8%
5	31.12	70.1%

MTP Q4_K_XL 版本：

spec-draft-n-max	生成速度 (tokens/s)	草稿接受率
1	25.30	92.3%
2	28.53	85.8%
3	30.66	83.2%
4	30.60	76.9%
5	28.93	68.1%

资源占用

但是，395这个芯片有个尴尬的地方。统一内存分配96GB给显存，加载模型参数文件时显存占了八十多GB，但内存32GB也几乎全占满了，再想启动其他大内存程序就难了。不过加 --no-mmap可以避免这个问题。

如果用N卡，可能要买张96G的RTX PRO 6000才能装下这个122B模型，虽然速度远超AMD 395，但其价格是AMD 395的好几倍了。

下面是Qwen3.5-122B-A10B-MTP-GGUF Q4_K_XL加 --no-mmap的资源使用情况：

加载模型阶段（电脑上已经运行了很多日常办公必要的程序了）
模型使用阶段（其实可以看到有2.6GB显存是从内存上借的，这和内存占用增加了2.6GB是符合的）
卸载模型

启动脚本

@echo off
setlocal EnableExtensions EnableDelayedExpansion

REM Run from this script's directory so relative executable paths always work.
cd /d "%~dp0"

REM =========================
REM Model files (local paths)
REM =========================
REM Split GGUF: only the first shard is specified; llama.cpp loads the rest automatically.
set "MODEL_DIR=%USERPROFILE%\.lmstudio\models\unsloth\Qwen3.5-122B-A10B-MTP-GGUF"
set "MODEL_FILE=%MODEL_DIR%\Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf"
set "MMPROJ_FILE=%MODEL_DIR%\mmproj-F32.gguf"

REM =========================
REM Server settings
REM =========================
set "HOST=127.0.0.1"
set "PORT=8001"
set "ALIAS=unsloth/Qwen3.5-122B-A10B-MTP-GGUF"
set "CTX_SIZE=262144"
set "N_GPU_LAYERS=999"
set "BATCH_SIZE=2048"
set "N_PARALLEL=1"

REM Recommended non-thinking defaults from Unsloth guide.
set "TEMP=0.7"
set "TOP_P=0.8"
set "TOP_K=20"
set "MIN_P=0.00"
set "PRESENCE_PENALTY=1.5"

REM MTP settings:
REM New flag (May 2026+) is --spec-type draft-mtp.
REM Unsloth notes draft-n-max=2 often gives best real speed/acceptance tradeoff.
set "SPEC_TYPE=draft-mtp"
set "SPEC_DRAFT_N_MAX=4"

REM Qwen3.5 reasoning mode control (new flag).
REM Use off for non-thinking mode; change to on if needed.
set "REASONING=off"

REM Auto-detect multi-slot flags:
REM   N_PARALLEL > 1 => enable --kv-unified and --cache-idle-slots
REM   N_PARALLEL = 1 => no extra flags (avoids "requires --kv-unified" warning)
set "EXTRA_FLAGS="
if %N_PARALLEL% GTR 1 (
  set "EXTRA_FLAGS=--kv-unified --cache-idle-slots"
)

echo.
echo [INFO] Starting llama-server for Qwen3.5-122B-A10B-MTP...
echo [INFO] Endpoint: http://%HOST%:%PORT%/v1
echo [INFO] Model: %MODEL_FILE%
echo [INFO] MMProj: %MMPROJ_FILE%
echo.

if not exist "%MODEL_FILE%" (
  echo [ERROR] Model file not found:
  echo         %MODEL_FILE%
  exit /b 1
)

if not exist "%MMPROJ_FILE%" (
  echo [ERROR] MMProj file not found:
  echo         %MMPROJ_FILE%
  exit /b 1
)

if not exist "llama-server.exe" (
  echo [ERROR] llama-server.exe not found in:
  echo         %CD%
  exit /b 1
)

llama-server.exe ^
  --host %HOST% ^
  --port %PORT% ^
  --model "%MODEL_FILE%" ^
  --mmproj "%MMPROJ_FILE%" ^
  --alias "%ALIAS%" ^
  --gpu-layers %N_GPU_LAYERS% ^
  --flash-attn on ^
  --kv-offload ^
  --batch-size %BATCH_SIZE% ^
  --parallel %N_PARALLEL% ^
  %EXTRA_FLAGS% ^
  --ctx-size %CTX_SIZE% ^
  --no-mmap ^
  --temp %TEMP% ^
  --top-p %TOP_P% ^
  --top-k %TOP_K% ^
  --min-p %MIN_P% ^
  --presence-penalty %PRESENCE_PENALTY% ^
  --spec-type %SPEC_TYPE% ^
  --spec-draft-n-max %SPEC_DRAFT_N_MAX% ^
  --image-min-tokens 1024 ^
  --reasoning %REASONING%

set "EXIT_CODE=%ERRORLEVEL%"
echo.
echo [INFO] llama-server exited with code %EXIT_CODE%.
exit /b %EXIT_CODE%

实际使用建议

--spec-draft-n-max 4 是 Qwen3.5-122B-A10B-MTP-GGUF Q4这个模型性能和接受率比较好的平衡点
用的时候关掉其他占内存的程序
Q4_K_XL版本速度稍慢但更适合复杂任务，看自己需求选
本测试仅为我当前环境测试结果，不具有普适性
max395这个芯片的AI性能目前肯定是弱于MAC M5和专业计算N卡的，而且是几倍的差异，建议想好自己的使用场景再进行选择，本文不包含任何购买建议。

接入vscode copilot执行真实任务

读取多个GaussDB生产的复杂SQL的plantrace文件（官方文档缺失解读方法），生成逐行解读的说明，这是一个有很大的上下文且生成token量也大的任务。

一开始使用的是llama.cpp

当上下文长度达到40k时，生成速度会掉到21 token/s

prompt eval time =  266727.26 ms / 40890 tokens (    6.52 ms per token,   153.30 tokens per second)
       eval time =  334591.12 ms /  7027 tokens (   47.62 ms per token,    21.00 tokens per second)
      total time =  601318.38 ms / 47917 tokens
draft acceptance rate = 0.64977 ( 5076 accepted /  7812 generated)
27.50.699.930 I statistics draft-mtp: #calls(b,g,a) = 11 2281 2281, #gen drafts = 2281, #acc drafts = 2002, #gen tokens = 9124, #acc tokens = 6248, dur(b,g,a) = 0.039, 78206.375, 6.603 ms
27.50.703.263 I slot      release: id  0 | task 376 | stop processing: n_tokens = 78672, truncated = 0
27.50.704.156 I slot print_timing: id  0 | task -1 | n_decoded =   7027, tg =  21.00 t/s

当上下文达到90K的时候,llama server奔溃了，没去排查原因，可能是我用的llama.cpp版本有BUG，切到 lmstudio 正常继续，但 lmstduio 的生成速度比llama.cpp慢多了，掉到了8t/s,预填充也变慢了，观察资源使用，内存都快用完了，而且GPU计算这个锯齿波形感觉使不上力一样。

114k 上下文，在生成阶段已经掉到 6t/s,

2026-06-10 14:25:49 [DEBUG]
 21.31.657.665 I slot create_check: id  3 | task 314 | created context checkpoint 7 of 32 (pos_min = 114859, pos_max = 114859, n_tokens = 114860, size = 375.590 MiB)
2026-06-10 14:25:49  [INFO]
 [qwen3.5-122b-a10b-mtp] Prompt processing progress: 100.0%
2026-06-10 14:26:06 [DEBUG]
 21.49.055.746 I slot print_timing: id  3 | task 314 | n_decoded =    103, tg =   6.08 t/s

观察到GPU没有全速了，部分计算在CPU上

lmstudio 也开了mtp4。只是gpu卸载层数只保持了默认的42，没拉满到49，这就是导致CPU参与计算的原因。

目前初步判断下来这套组合不适合执行长周期的多轮调度任务，只适合做单轮或者少数几轮的复杂任务。

在kv cache有效时，预填充的耗时还是可以接受的，参考下面的日志，130K的上下文几秒就弄完了，如果没命中cache，160t/s估计得15分钟左右才能填充完

2026-06-10 15:32:34  [INFO]
 [qwen3.5-122b-a10b-mtp] Running chat completion on conversation with 86 messages.
2026-06-10 15:32:34  [INFO]
 [qwen3.5-122b-a10b-mtp] Streaming response...
2026-06-10 15:32:34 [DEBUG]
 LlamaV4::predict slot selection: session_id=<empty> server-selected (LCP/LRU)
2026-06-10 15:32:34 [DEBUG]
 88.17.272.059 I slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.998 (> 0.100 thold), f_keep = 0.999
2026-06-10 15:32:34 [DEBUG]
 88.17.272.512 I slot launch_slot_: id  3 | task 5951 | processing task, is_child = 0
88.17.272.644 W slot update_slots: id  3 | task 5951 | cache reuse is not supported - ignoring n_cache_reuse = 256
88.17.272.774 I slot update_slots: id  3 | task 5951 | Checking checkpoint with [138885, 138885] against 138889...
2026-06-10 15:32:34 [DEBUG]
 88.17.363.052 W slot update_slots: id  3 | task 5951 | restored context checkpoint (pos_min = 138885, pos_max = 138885, n_tokens = 138886, n_past = 138886, size = 422.974 MiB)
2026-06-10 15:32:34  [INFO]
 [qwen3.5-122b-a10b-mtp] Prompt processing progress: 0.0%
2026-06-10 15:32:37  [INFO]
 [qwen3.5-122b-a10b-mtp] Prompt processing progress: 98.1%
2026-06-10 15:32:38  [INFO]
 [qwen3.5-122b-a10b-mtp] Prompt processing progress: 100.0%
2026-06-10 15:32:52 [DEBUG]
 88.35.286.021 I slot print_timing: id  3 | task 5951 | n_decoded =    102, tg =   7.02 t/s
2026-06-10 15:32:56 [DEBUG]
 88.38.640.241 I slot print_timing: id  3 | task 5951 | n_decoded =    128, tg =   7.16 t/s
2026-06-10 15:32:59 [DEBUG]
 88.41.850.033 I slot print_timing: id  3 | task 5951 | n_decoded =    154, tg =   7.30 t/s
2026-06-10 15:33:02 [DEBUG]
 88.45.060.957 I slot print_timing: id  3 | task 5951 | n_decoded =    177, tg =   7.28 t/s
2026-06-10 15:33:05 [DEBUG]
 88.48.193.931 I slot print_timing: id  3 | task 5951 | n_decoded =    200, tg =   7.29 t/s
2026-06-10 15:33:08 [DEBUG]
 88.51.262.193 I slot print_timing: id  3 | task 5951 | n_decoded =    223, tg =   7.31 t/s
2026-06-10 15:33:12 [DEBUG]
 88.54.617.008 I slot print_timing: id  3 | task 5951 | n_decoded =    247, tg =   7.30 t/s
2026-06-10 15:33:15 [DEBUG]
 88.57.819.669 I slot print_timing: id  3 | task 5951 | n_decoded =    270, tg =   7.29 t/s
2026-06-10 15:33:16 [DEBUG]
 88.59.444.844 I slot print_timing: id  3 | task 5951 | prompt eval time =    3487.09 ms /   213 tokens (   16.37 ms per token,    61.08 tokens per second)
88.59.444.856 I slot print_timing: id  3 | task 5951 |        eval time =   38685.01 ms /   283 tokens (  136.70 ms per token,     7.32 tokens per second)
88.59.444.857 I slot print_timing: id  3 | task 5951 |       total time =   42172.10 ms /   496 tokens
88.59.444.858 I slot print_timing: id  3 | task 5951 |    graphs reused =       5769
88.59.444.860 I slot print_timing: id  3 | task 5951 | draft acceptance = 0.71918 (  210 accepted /   292 generated)
88.59.444.888 I statistics        draft-mtp: #calls(b,g,a) =   23   5879   5879, #gen drafts =   5879, #acc drafts =  5250, #gen tokens =  23516, #acc tokens = 17178, dur(b,g,a) = 0.061, 176332.186, 12.013 ms
2026-06-10 15:33:16 [DEBUG]
 88.59.447.464 I slot      release: id  3 | task 5951 | stop processing: n_tokens = 139383, truncated = 0
88.59.447.489 I srv  update_slots: all slots are idle
2026-06-10 15:33:16 [DEBUG]
 LlamaV4: server assigned slot 3 to task 5951
2026-06-10 15:33:16  [INFO]
 [qwen3.5-122b-a10b-mtp] Finished streaming response

尝试把GPU卸载层数拉满，CPU和内存的开销降下来了，而且生成速度又回到了22t/s

2026-06-10 16:17:07  [INFO]
 [qwen3.5-122b-a10b-mtp] Running chat completion on conversation with 121 messages.
2026-06-10 16:17:07  [INFO]
 [qwen3.5-122b-a10b-mtp] Streaming response...
2026-06-10 16:17:07 [DEBUG]
 LlamaV4::predict slot selection: session_id=<empty> server-selected (LCP/LRU)
2026-06-10 16:17:07 [DEBUG]
 25.51.464.939 I slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.998 (> 0.100 thold), f_keep = 0.998
25.51.465.315 I slot launch_slot_: id  3 | task 418 | processing task, is_child = 0
25.51.465.451 W slot update_slots: id  3 | task 418 | cache reuse is not supported - ignoring n_cache_reuse = 256
25.51.465.596 I slot update_slots: id  3 | task 418 | Checking checkpoint with [149162, 149162] against 149166...
2026-06-10 16:17:08 [DEBUG]
 25.51.586.591 W slot update_slots: id  3 | task 418 | restored context checkpoint (pos_min = 149162, pos_max = 149162, n_tokens = 149163, n_past = 149163, size = 443.242 MiB)
2026-06-10 16:17:08  [INFO]
 [qwen3.5-122b-a10b-mtp] Prompt processing progress: 0.0%
2026-06-10 16:17:11 [DEBUG]
 25.54.740.905 I slot print_timing: id  3 | task 418 | prompt processing, n_tokens =    269, progress = 1.00, t =   3.28 s / 82.12 tokens per second
2026-06-10 16:17:11  [INFO]
 [qwen3.5-122b-a10b-mtp] Prompt processing progress: 98.5%
2026-06-10 16:17:11 [DEBUG]
 25.55.058.298 I slot create_check: id  3 | task 418 | created context checkpoint 7 of 32 (pos_min = 149431, pos_max = 149431, n_tokens = 149432, size = 443.773 MiB)
2026-06-10 16:17:11  [INFO]
 [qwen3.5-122b-a10b-mtp] Prompt processing progress: 100.0%
2026-06-10 16:17:17 [DEBUG]
 26.00.902.849 I slot print_timing: id  3 | task 418 | n_decoded =    101, tg =  17.69 t/s
2026-06-10 16:17:20 [DEBUG]
 26.03.956.178 I slot print_timing: id  3 | task 418 | n_decoded =    169, tg =  19.28 t/s
2026-06-10 16:17:23 [DEBUG]
 26.07.049.335 I slot print_timing: id  3 | task 418 | n_decoded =    247, tg =  20.83 t/s
2026-06-10 16:17:26 [DEBUG]
 26.10.077.118 I slot print_timing: id  3 | task 418 | n_decoded =    312, tg =  20.96 t/s
2026-06-10 16:17:29 [DEBUG]
 26.13.142.091 I slot print_timing: id  3 | task 418 | n_decoded =    385, tg =  21.45 t/s
2026-06-10 16:17:32 [DEBUG]
 26.16.323.415 I slot print_timing: id  3 | task 418 | n_decoded =    459, tg =  21.72 t/s
2026-06-10 16:17:35 [DEBUG]
 26.19.398.102 I slot print_timing: id  3 | task 418 | n_decoded =    537, tg =  22.19 t/s
2026-06-10 16:17:38 [DEBUG]
 26.22.406.446 I slot print_timing: id  3 | task 418 | n_decoded =    597, tg =  21.94 t/s
2026-06-10 16:17:41 [DEBUG]
 26.25.455.663 I slot print_timing: id  3 | task 418 | n_decoded =    676, tg =  22.34 t/s
2026-06-10 16:17:45 [DEBUG]
 26.28.510.836 I slot print_timing: id  3 | task 418 | n_decoded =    752, tg =  22.57 t/s
2026-06-10 16:17:48 [DEBUG]
 26.31.560.182 I slot print_timing: id  3 | task 418 | n_decoded =    816, tg =  22.44 t/s
2026-06-10 16:17:48 [DEBUG]
 26.32.140.132 I slot print_timing: id  3 | task 418 | prompt eval time =    3726.37 ms /   273 tokens (   13.65 ms per token,    73.26 tokens per second)
26.32.140.158 I slot print_timing: id  3 | task 418 |        eval time =   36947.50 ms /   830 tokens (   44.52 ms per token,    22.46 tokens per second)
26.32.140.161 I slot print_timing: id  3 | task 418 |       total time =   40673.87 ms /  1103 tokens
26.32.140.168 I slot print_timing: id  3 | task 418 |    graphs reused =        511
26.32.140.172 I slot print_timing: id  3 | task 418 | draft acceptance = 0.81959 (  636 accepted /   776 generated)
26.32.140.222 I statistics        draft-mtp: #calls(b,g,a) =    6    525    525, #gen drafts =    525, #acc drafts =   474, #gen tokens =   2100, #acc tokens =  1614, dur(b,g,a) = 0.019, 18657.710, 1.362 ms
2026-06-10 16:17:48 [DEBUG]
 26.32.147.501 I slot      release: id  3 | task 418 | stop processing: n_tokens = 150266, truncated = 0
26.32.147.572 I srv  update_slots: all slots are idle
2026-06-10 16:17:48 [DEBUG]
 LlamaV4: server assigned slot 3 to task 418
2026-06-10 16:17:48  [INFO]
 [qwen3.5-122b-a10b-mtp] Finished streaming response

总结

本次在amd ai max395上测试了80GB大小的Qwen3.5-122B-A10B-MTP-GGUF Q4_K_XL ，mtp配置到4的情况下输出可以达到30 tokens/s，并且在上下文达到140K时依然能有22 tokens/s的生成速度。
所以目前我本地LLM就有了三个选择：

qwen3.6-27b 编码任务
qwen3.6-35b-a3b 日常任务
qwen3.5-122b-a10b 27b解决不了时的备选

我下载测试了很多模型，下载量已经超过1TB了，但当前留下来的只有这三个，都是qwen的。

不过qwen目前的开源策略令人有点担心，其他家的开源模型不断在出新的，而qwen开源模型已经停更很久了，各个厂家的模型也有些"个性"需要使用人去适应。谁也不想好不容易适应了一个厂家的开源模型后，发现这个厂家的开源模型能力再也没有提升吧。