HP ZBook Ultra 14-inch G1a LLM Benchmarks
Authors: Sören Gebbert*, Natalia Gebbert*
Date: 2025-06-15
*Institute for Holistic Technology Research GmbH
Llama-4-Scout-17B-16e-Instruct Q4_K_M
The HP ZBook Ultra 14-inch G1a Mobile Workstation equipped with the AMD Ryzen AI MAX+ PRO 395 processor demonstrates exceptional capabilities for local large language model deployment, particularly excelling with Mixture of Experts (MoE) architectures. The system's substantial 128 GB RAM capacity and theoretical memory bandwidth of up to 256 GB/s position it as a compelling platform for enterprise-grade AI workloads that demand significant computational resources without relying on cloud infrastructure.
Our benchmark results reveal distinct performance characteristics across model architectures and sizes. Dense models such as the smaller Qwen variants (0.6B and 4B parameters) achieve throughput rates of 44.47 and 19.61 tokens per second respectively, demonstrating the system's efficiency with lightweight models. However, the true strength of this hardware configuration becomes apparent with larger, more complex architectures. The Qwen-30B-A3B models, representing advanced MoE designs, sustain roughly 23 to 26 tokens per second depending on quantization, a significant achievement for local inference on mobile hardware.
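One way to understand why the MoE models fare so well on this platform is a rough back-of-envelope estimate: token generation is typically limited by how many weight bytes must be read from memory per token, and for an MoE model only the active experts are read. The sketch below is a simplification under that assumption; it ignores KV cache reads, assumes roughly one byte per weight at Q8, and treats the 256 GB/s theoretical bandwidth as fully usable, which it is not in practice. The ceilings are therefore only illustrative.

```python
# Back-of-envelope sketch: decode throughput ceiling if generation were purely
# memory-bandwidth-bound (every active weight read once per generated token).
# Assumptions: 256 GB/s theoretical bandwidth, ~1 byte per weight at Q8;
# real effective bandwidth and throughput are lower.
BANDWIDTH_GB_S = 256.0

def decode_ceiling_tok_s(active_params_billion: float, bytes_per_weight: float) -> float:
    """Upper bound on tokens/sec: bandwidth divided by bytes read per token."""
    return BANDWIDTH_GB_S / (active_params_billion * bytes_per_weight)

# Dense 32B model at Q8: all ~32B weights are touched for every token.
print(f"Dense 32B Q8 ceiling:   {decode_ceiling_tok_s(32, 1.0):6.1f} tok/s  (measured ~4.6)")
# Qwen-30B-A3B at Q8: only ~3B parameters are active per token (MoE routing).
print(f"MoE 30B-A3B Q8 ceiling: {decode_ceiling_tok_s(3, 1.0):6.1f} tok/s  (measured ~23)")
```

Both measured values sit well below these ceilings, but the roughly order-of-magnitude advantage the estimate predicts for the MoE model over a dense 32B model matches the direction of the measurements.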
Particularly noteworthy is the system's ability to handle even the most demanding models in our test suite. The Llama-4-Scout-17B model, despite its substantial computational requirements, maintains respectable performance at approximately 6 tokens per second. This performance level is especially impressive considering the model's complexity and the fact that current Vulkan driver limitations restrict GPU memory allocation to 64GB, forcing some layers to execute on CPU cores.
The benchmark data clearly illustrates the substantial impact of KV cache optimization, with second and third runs showing dramatically reduced time-to-first-token latencies (from tens of seconds to milliseconds) while maintaining consistent throughput. This caching efficiency is crucial for interactive applications and demonstrates the system's suitability for real-world deployment scenarios.
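To make the caching impact concrete, the end-to-end response time can be approximated from the measured values as time to first token plus generated tokens divided by throughput. A small illustrative calculation using the Qwen-30B-A3B Q4_K_M rows of the results table below:

```python
# Approximate end-to-end response time: time to first token + generation time.
# Values taken from the Qwen-30B-A3B Q4_K_M rows of the results table.
def response_time(ttft_s: float, tokens: int, tok_per_s: float) -> float:
    return ttft_s + tokens / tok_per_s

cold = response_time(67.12, 668, 25.66)   # first run, no cache
warm = response_time(0.09, 773, 26.07)    # third run, cache used
print(f"Cold cache: {cold:6.1f} s total")  # ~93 s, dominated by prompt processing
print(f"Warm cache: {warm:6.1f} s total")  # ~30 s, dominated by generation itself
```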
A critical finding from our evaluation concerns the Vulkan driver's current 64 GB allocation limit, which prevents full GPU utilization for the largest models. Our analysis of the Llama-4-Scout model revealed that only 42 of 48 layers could be offloaded to the GPU, with the remaining layers processed on the CPU. Performance monitoring during these hybrid GPU-CPU operations showed GPU clock speeds dropping from 2650 MHz to 850 MHz and GPU utilization falling from 100% to about 70%, with CPU utilization around 61%. Addressing this limitation through driver updates could potentially unlock token generation rates exceeding 10 tokens per second for even the largest models in our test suite.
The integrated AMD Radeon 8060S GPU, featuring 40 compute units working in conjunction with the 16-core CPU, provides a well-balanced architecture for AI workloads. With a maximum power consumption of just 70 Watts for the GPU, this APU design offers significant advantages over discrete GPU solutions in mobile form factors, including improved power efficiency and reduced thermal constraints while maintaining competitive performance.
Based on our findings, the HP ZBook Ultra 14-inch G1a represents a strategic investment for organizations seeking to deploy large-scale language models locally on x86 architecture. Its exceptional memory capacity, robust processing capabilities, and proven performance with MoE models up to 100 billion parameters position it as a future-ready platform for advanced AI applications. As driver optimizations and software improvements continue to emerge, this system's already impressive capabilities are likely to expand further, making it an ideal choice for researchers, developers, and enterprises requiring powerful, portable AI inference capabilities without dependence on external cloud services.
This document presents a benchmark of various Large Language Models (LLMs) performed on an HP ZBook Ultra 14-inch G1a Mobile Workstation.
The document includes:
* An executive summary of the findings
* An overview table of the benchmark results for all tested models
* The hardware and software configuration of the test system
* The benchmark methodology, including the context article and prompt used
* Detailed per-model results with notes on GPU utilization and layer offloading
The purpose of this document is to provide a detailed record of LLM performance on the specified mobile workstation.
| Model | Run Type | Tokens/sec | Tokens | Time to First Token | Cache Used |
|---|---|---|---|---|---|
| Qwen3-0.6b Q8 | First run | 44.47 | 485 | 2.26s | No |
| Qwen3-0.6b Q8 | Second run | 43.71 | 446 | 0.06s | Yes |
| Qwen3-0.6b Q8 | Third run | 43.78 | 429 | 0.06s | Yes |
| Qwen3-4B Q8 | First run | 19.61 | 627 | 10.09s | No |
| Qwen3-4B Q8 | Second run | 19.85 | 697 | 0.09s | Yes |
| Qwen3-4B Q8 | Third run | 19.92 | 598 | 0.09s | Yes |
| Gemma-3 12b | First run | 7.75 | 324 | 28.61s | No |
| Gemma-3 12b | Second run | 7.65 | 289 | 0.17s | Yes |
| Gemma-3 12b | Third run | 7.60 | 333 | 0.17s | Yes |
| Magistral-Small Q8 | First run | 6.60 | 924 | 80.20s | No |
| Magistral-Small Q8 | Second run | 6.56 | 2561 | 0.19s | Yes |
| Magistral-Small Q8 | Third run | 6.62 | 1041 | 0.22s | Yes |
| Qwen-30B-A3B Q4_K_M | First run | 25.66 | 668 | 67.12s | No |
| Qwen-30B-A3B Q4_K_M | Second run | 25.80 | 947 | 0.08s | Yes |
| Qwen-30B-A3B Q4_K_M | Third run | 26.07 | 773 | 0.09s | Yes |
| Qwen-30B-A3B Q8 | First run | 23.12 | 666 | 62.42s | No |
| Qwen-30B-A3B Q8 | Second run | 23.68 | 651 | 0.08s | Yes |
| Qwen-30B-A3B Q8 | Third run | 23.51 | 836 | 0.08s | Yes |
| QwQ 32B Q8 | First run | 4.65 | 782 | 119.42s | No |
| QwQ 32B Q8 | Second run | 4.64 | 1125 | 0.24s | Yes |
| QwQ 32B Q8 | Third run | 4.62 | 1297 | 0.25s | Yes |
| Llama-4-Scout-17B-16e-Instruct | First run | 5.92 | 301 | 82.54s | No |
| Llama-4-Scout-17B-16e-Instruct | Second run | 6.11 | 356 | 0.23s | Yes |
| Llama-4-Scout-17B-16e-Instruct | Third run | 6.18 | 286 | 0.21s | Yes |
| Nous-Hermes-2-Mixtral-8x7B Q8 | First run | 10.65 | 208 | 38.46s | No |
| Nous-Hermes-2-Mixtral-8x7B Q8 | Second run | 10.75 | 225 | 0.13s | Yes |
| Nous-Hermes-2-Mixtral-8x7B Q8 | Third run | 10.76 | 266 | 0.13s | Yes |
Website: https://www.hp.com/de-de/workstations/zbook-ultra.html
Device: HP ZBook Ultra 14-inch G1a Mobile Workstation
Processor: AMD Ryzen AI MAX+ PRO 395 with Radeon 8060S graphics
RAM: 128 GB
VRAM: 512 MB (set in BIOS)
Edition: Windows 11 Pro
Version: 24H2
We benchmarked various LLMs using LM Studio v0.3.16, chosen for its reliability. A key part of our evaluation was the consistent use of the Vulkan llama.cpp v1.34.1 runtime. This llama.cpp version was selected for its local LLM optimizations, efficiency, and consistent results across hardware, ensuring direct comparability and minimizing software variables.
All LLM benchmark runs used a standardized configuration to ensure comparability and reproducibility across tests. A 27 KB English Wikipedia article about the ancient Greek philosopher Plato served as the benchmark context; it was selected because it provides a long, information-dense input that exercises both prompt processing and the KV cache.
Each model was asked to summarize this article using the following prompt; a minimal measurement sketch follows the prompt list below.

Create a summary of the provided text that includes:
* A brief 2-3 sentence introduction that captures the main topic and purpose
* 4 bullet points highlighting the key findings, arguments, or takeaways
* Keep each bullet point to 1-2 sentences and focus on the most important information
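As a rough illustration of how such a run can be timed outside of LM Studio's UI, the sketch below sends the article plus the summarization prompt to LM Studio's local OpenAI-compatible server and records the time to the first streamed token, once with a cold cache and once with a warm one. It is a minimal sketch under a few assumptions: the local server is running on LM Studio's default port 1234, `MODEL` is adjusted to the identifier of the loaded model, and `plato.txt` is a placeholder file name for the saved article text. LM Studio reports the same statistics directly in its UI.

```python
# Minimal sketch: measure time-to-first-token against LM Studio's local
# OpenAI-compatible server, first with a cold cache, then with a warm one.
# Assumptions: server on the default port 1234, MODEL set to the loaded model,
# and plato.txt containing the benchmark article (placeholder file name).
import json
import time

import requests

URL = "http://localhost:1234/v1/chat/completions"
MODEL = "qwen3-30b-a3b"  # placeholder identifier, adjust to the loaded model

PROMPT = (
    "Create a summary of the provided text that includes:\n"
    "* A brief 2-3 sentence introduction that captures the main topic and purpose\n"
    "* 4 bullet points highlighting the key findings, arguments, or takeaways\n"
    "* Keep each bullet point to 1-2 sentences and focus on the most important information"
)

def timed_run(context: str) -> float:
    """Stream one completion and return the time to the first generated token."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": context + "\n\n" + PROMPT}],
        "stream": True,
    }
    start = time.perf_counter()
    ttft = None
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            delta = json.loads(line[len(b"data: "):])["choices"][0]["delta"]
            if delta.get("content") and ttft is None:
                ttft = time.perf_counter() - start  # first visible token
    return ttft if ttft is not None else float("nan")

article = open("plato.txt", encoding="utf-8").read()  # the 27 KB benchmark context
print(f"First run (cold cache): {timed_run(article):.2f}s to first token")
print(f"Second run (warm cache): {timed_run(article):.2f}s to first token")
```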
Detailed results per model

Qwen3-0.6b Q8
First run (no cache): 44.47 tok/sec, 485 tokens, 2.26s to first token
Second run (cache used): 43.71 tok/sec, 446 tokens, 0.06s to first token
Third run (cache used): 43.78 tok/sec, 429 tokens, 0.06s to first token
Qwen3-4B Q8
First run (no cache): 19.61 tok/sec, 627 tokens, 10.09s to first token
Second run (cache used): 19.85 tok/sec, 697 tokens, 0.09s to first token
Third run (cache used): 19.92 tok/sec, 598 tokens, 0.09s to first token
Gemma-3 12b
First run (no cache): 7.75 tok/sec, 324 tokens, 28.61s to first token
Second run (cache used): 7.65 tok/sec, 289 tokens, 0.17s to first token
Third run (cache used): 7.60 tok/sec, 333 tokens, 0.17s to first token
Magistral-Small Q8
First run (no cache): 6.60 tok/sec, 924 tokens, 80.20s to first token
Second run (cache used): 6.56 tok/sec, 2561 tokens, 0.19s to first token
Third run (cache used): 6.62 tok/sec, 1041 tokens, 0.22s to first token
Qwen-30B-A3B Q4_K_M
First run (no cache): 25.66 tok/sec, 668 tokens, 67.12s to first token
Second run (cache used): 25.80 tok/sec, 947 tokens, 0.08s to first token
Third run (cache used): 26.07 tok/sec, 773 tokens, 0.09s to first token
The performance screenshot shows the KV cache computation at the beginning and the inference afterwards. During the KV cache computation the GPU runs at roughly 2750 MHz and close to 100% utilization; during inference the clock speed drops to about 2500 MHz and utilization to about 84%. This suggests a shift in computational demands as the system moves from the data-intensive KV cache preparation to token generation; possible causes include thermal throttling, power limits, or reduced parallelism in the decode phase.
Qwen-30B-A3B Q8
First run (no cache): 23.12 tok/sec, 666 tokens, 62.42s to first token
Second run (cache used): 23.68 tok/sec, 651 tokens, 0.08s to first token
Third run (cache used): 23.51 tok/sec, 836 tokens, 0.08s to first token
QwQ 32B Q8
First run (no cache): 4.65 tok/sec, 782 tokens, 119.42s to first token
Second run (cache used): 4.64 tok/sec, 1125 tokens, 0.24s to first token
Third run (cache used): 4.62 tok/sec, 1297 tokens, 0.25s to first token
Llama-4-Scout-17B-16e-Instruct
Only 42 of the 48 layers could be offloaded to the GPU; the remaining 6 layers run in CPU mode. This appears to be caused by an allocation limit of 64 GB in the Vulkan driver: memory beyond this threshold cannot be allocated on the GPU, so the affected layers fall back to CPU processing, which reduces overall throughput.
The performance screenshot again shows the KV cache computation at the beginning and the inference afterwards. The KV cache appears to be computed entirely on the GPU at 100% utilization, whereas the inference uses both GPU and CPU. The GPU clock speed drops significantly when CPU and GPU are used together for inference, which could indicate power or thermal throttling or a CPU bottleneck that leaves the GPU waiting for data; further analysis would be needed to determine the exact cause and its impact on performance.
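For readers who want to reproduce such a partial offload outside of LM Studio, the number of GPU-offloaded layers is an explicit parameter in llama.cpp-based tooling (LM Studio exposes it as the GPU offload setting). Below is a minimal sketch using the llama-cpp-python bindings, assuming a build with Vulkan support and a local GGUF file whose path is a placeholder; it is not the exact setup used for the benchmarks above.

```python
# Minimal sketch: load a GGUF model with only part of its layers on the GPU.
# Assumptions: llama-cpp-python built with Vulkan support, and a local GGUF
# file whose path is a placeholder here.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=42,   # 42 of 48 layers on the GPU, the rest stays on the CPU
    n_ctx=16384,       # large enough context for the 27 KB benchmark article
    verbose=True,      # load-time logs include the layer/device split
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: ..."}],
    max_tokens=300,
)
print(out["choices"][0]["message"]["content"])
```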
First run (no cache): 5.92 tok/sec, 301 tokens, 82.54s to first token
Second run (cache used): 6.11 tok/sec, 356 tokens, 0.23s to first token
Third run (cache used): 6.18 tok/sec, 286 tokens, 0.21s to first token
Nous-Hermes-2-Mixtral-8x7B Q8
First run (no cache): 10.65 tok/sec, 208 tokens, 38.46s to first token
Second run (cache used): 10.75 tok/sec, 225 tokens, 0.13s to first token
Third run (cache used): 10.76 tok/sec, 266 tokens, 0.13s to first token