HP ZBook Ultra 14-inch G1a LLM Benchmarks

Authors:        Sören Gebbert*, Natalia Gebbert*

Date:                2025-06-15

*Institute for Holistic Technology Research GmbH

Conclusion

Summary

LLM Benchmark Runs

Hardware

Software

Windows

LM Studio 0.3.16

LLMs

Benchmark Configuration

Test Dataset

Prompt

Qwen3-0.6b Q8

Setup

First run

Second run

Third run

Qwen3-4B Q8

Setup

Performance

First run

Second run

Third run

Gemma-3 12b

Setup

Performance

First run

Second run

Third run

Magistral-Small Q8

Setup

Performance

First run

Second run

Third run

Qwen-30B-A3B Q4_K_M

Setup

Performance

First run

Second run

Third run

Qwen-30B-A3B Q8

Setup

Performance

First run

Second run

Third run

QwQ 32B Q8

Setup

Performance

First run

Second run

Third run

Llama-4-Scout-17B-16e-Instruct Q4_K_M

Setup

Performance

First run

Second run

Third run

Nous-Hermes-2-Mixtral-8x7 Q8

Setup

Performance

First run

Second run

Third run

Conclusion

The HP ZBook Ultra 14-inch G1a Mobile Workstation equipped with the AMD Ryzen AI MAX+ PRO 395 processor demonstrates exceptional capabilities for local large language model deployment, particularly excelling with Mixture-of-Experts (MoE) architectures. The system's substantial 128GB RAM capacity and theoretical memory bandwidth of up to 256GB/s position it as a compelling platform for enterprise-grade AI workloads that demand significant computational resources without relying on cloud infrastructure.

Our comprehensive benchmark results reveal distinct performance characteristics across different model architectures and sizes. Dense models such as the smaller Qwen variants (0.6B and 4B parameters) achieve impressive throughput rates of 44.47 and 19.61 tokens per second respectively, demonstrating the system's efficiency with lightweight models. However, the true strength of this hardware configuration becomes apparent with larger, more complex architectures. The Qwen-30B-A3B models, representing advanced MoE designs, deliver remarkable performance with sustained speeds of roughly 23 to 26 tokens per second, a significant achievement for local inference on mobile hardware.

Particularly noteworthy is the system's ability to handle even the most demanding models in our test suite. The Llama-4-Scout-17B model, despite its substantial computational requirements, maintains respectable performance at approximately 6 tokens per second. This performance level is especially impressive considering the model's complexity and the fact that current Vulkan driver limitations restrict GPU memory allocation to 64GB, forcing some layers to execute on CPU cores.

The benchmark data clearly illustrates the substantial impact of KV cache optimization, with second and third runs showing dramatically reduced time-to-first-token latencies (from tens of seconds to milliseconds) while maintaining consistent throughput. This caching efficiency is crucial for interactive applications and demonstrates the system's suitability for real-world deployment scenarios.

A critical finding from our evaluation concerns the Vulkan driver's current 64GB allocation limit, which prevents full GPU utilization for the largest models. Our analysis of the Llama-4-Scout model revealed that only 42 of 48 layers could be offloaded to the GPU, with the remaining layers processed on the CPU. Performance monitoring during these hybrid GPU-CPU operations showed the GPU clock speed dropping from 2650 MHz to 850 MHz and the load shifting from 100% GPU utilization to roughly 70% GPU and 61% CPU. Addressing this limitation through driver updates could potentially unlock token generation rates exceeding 10 tokens per second for even the largest models in our test suite.

The integrated AMD Radeon 8060S GPU, featuring 40 compute units working in conjunction with the 16-core CPU, provides a well-balanced architecture for AI workloads. With a maximum power consumption of just 70 Watts for the GPU, this APU design offers significant advantages over discrete GPU solutions in mobile form factors, including improved power efficiency and reduced thermal constraints while maintaining competitive performance.

Based on our findings, the HP ZBook Ultra 14-inch G1a represents a strategic investment for organizations seeking to deploy large-scale language models locally on x86 architecture. Its exceptional memory capacity, robust processing capabilities, and proven performance with MoE models of over 100 billion total parameters position it as a future-ready platform for advanced AI applications. As driver optimizations and software improvements continue to emerge, this system's already impressive capabilities are likely to expand further, making it an ideal choice for researchers, developers, and enterprises requiring powerful, portable AI inference capabilities without dependence on external cloud services.

Summary

This document presents a benchmark of various Large Language Models (LLMs) performed on an HP ZBook Ultra 14-inch G1a Mobile Workstation.

The document includes:

  • Hardware Specifications: Details of the workstation's processor, RAM, and VRAM.
  • Software Environment: Information on the Windows version and LM Studio version used for the benchmarks.
  • Benchmark Configuration: Standardized settings applied to all LLM runs, such as GPU memory management, context window size, and KV cache usage.
  • Test Dataset and Prompt: Description of the English Wikipedia article used as the context and the specific prompt for generating summaries.
  • Individual LLM Benchmarks: Sections dedicated to each LLM tested (Qwen3-0.6b Q8, Qwen3-4B Q8, Gemma-3 12b, Magistral-Small Q8, Qwen-30B-A3B Q4_K_M, Qwen-30B-A3B Q8, QwQ 32B Q8, Llama-4-Scout-17B-16e-Instruct Q4_K_M, and Nous-Hermes-2-Mixtral-8x7 Q8), each detailing setup notes, performance metrics (tokens/sec, tokens generated, time to first token), and observations for multiple runs (with and without cache).

The purpose of this document is to provide a detailed record of LLM performance on the specified mobile workstation.

LLM Benchmark Runs

Model                           Run Type    Tokens/sec  Tokens  Time to First Token  Cache Used
Qwen3-0.6b Q8                   First run   44.47       485     2.26s                No
Qwen3-0.6b Q8                   Second run  43.71       446     0.06s                Yes
Qwen3-0.6b Q8                   Third run   43.78       429     0.06s                Yes
Qwen3-4B Q8                     First run   19.61       627     10.09s               No
Qwen3-4B Q8                     Second run  19.85       697     0.09s                Yes
Qwen3-4B Q8                     Third run   19.92       598     0.09s                Yes
Gemma-3 12b                     First run   7.75        324     28.61s               No
Gemma-3 12b                     Second run  7.65        289     0.17s                Yes
Gemma-3 12b                     Third run   7.60        333     0.17s                Yes
Magistral-Small Q8              First run   6.60        924     80.20s               No
Magistral-Small Q8              Second run  6.56        2561    0.19s                Yes
Magistral-Small Q8              Third run   6.62        1041    0.22s                Yes
Qwen-30B-A3B Q4_K_M             First run   25.66       668     67.12s               No
Qwen-30B-A3B Q4_K_M             Second run  25.80       947     0.08s                Yes
Qwen-30B-A3B Q4_K_M             Third run   26.07       773     0.09s                Yes
Qwen-30B-A3B Q8                 First run   23.12       666     62.42s               No
Qwen-30B-A3B Q8                 Second run  23.68       651     0.08s                Yes
Qwen-30B-A3B Q8                 Third run   23.51       836     0.08s                Yes
QwQ 32B Q8                      First run   4.65        782     119.42s              No
QwQ 32B Q8                      Second run  4.64        1125    0.24s                Yes
QwQ 32B Q8                      Third run   4.62        1297    0.25s                Yes
Llama-4-Scout-17B-16e-Instruct  First run   5.92        301     82.54s               No
Llama-4-Scout-17B-16e-Instruct  Second run  6.11        356     0.23s                Yes
Llama-4-Scout-17B-16e-Instruct  Third run   6.18        286     0.21s                Yes
Nous-Hermes-2-Mixtral-8x7 Q8    First run   10.65       208     38.46s               No
Nous-Hermes-2-Mixtral-8x7 Q8    Second run  10.75       225     0.13s                Yes
Nous-Hermes-2-Mixtral-8x7 Q8    Third run   10.76       266     0.13s                Yes
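The three columns of the table combine into an end-to-end response time: total time is roughly the time to first token plus the number of generated tokens divided by the tokens-per-second rate. The short Python sketch below illustrates this with the Qwen-30B-A3B Q4_K_M rows from the table; it is an estimate only and ignores minor overheads such as sampling and detokenization.

```python
# Rough end-to-end latency estimate from the table above:
# total time ≈ time to first token + generated tokens / tokens-per-second.

def total_time(ttft_s: float, tokens: int, tok_per_sec: float) -> float:
    """Approximate wall-clock time for one response."""
    return ttft_s + tokens / tok_per_sec

# Qwen-30B-A3B Q4_K_M, first run: prompt must be processed, no KV cache yet.
cold = total_time(ttft_s=67.12, tokens=668, tok_per_sec=25.66)
# Qwen-30B-A3B Q4_K_M, second run: KV cache reused, near-zero time to first token.
warm = total_time(ttft_s=0.08, tokens=947, tok_per_sec=25.80)

print(f"cold run (no cache): ~{cold:.1f}s for 668 tokens")   # ~93.2s
print(f"warm run (cache hit): ~{warm:.1f}s for 947 tokens")  # ~36.8s
```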

Hardware

Website: https://www.hp.com/de-de/workstations/zbook-ultra.html

Device           HP ZBook Ultra 14-inch G1a Mobile Workstation

Processor        AMD Ryzen AI MAX+ PRO 395 with Radeon 8060S graphics

RAM              128 GB

VRAM             512 MB (set in BIOS)

Software

Windows

Edition                Windows 11 Pro

Version        24H2

LM Studio 0.3.16

We benchmarked various LLMs using LM Studio v0.3.16, chosen for its reliability. A key part of our evaluation was the consistent use of the Vulkan llama.cpp v1.34.1 runtime. This specific runtime version was selected for its local LLM optimization, efficiency, and consistent results across hardware, ensuring direct comparability between models and minimizing software variables.

LLMs

Benchmark Configuration

All LLM benchmark runs utilized a standardized configuration to ensure comparability and reproducibility across tests:

  • GPU Memory Management: Full model offload to GPU when hardware permitted
  • Context Window: Fixed at 8,192 tokens for all evaluations
  • KV Cache: GPU-accelerated key-value cache enabled for optimal performance
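The runs themselves were performed through LM Studio's GUI. As a rough sketch of how the same settings would map onto llama.cpp used directly (LM Studio's underlying runtime), the Python snippet below launches llama-server with an equivalent configuration; the model path, port, and the assumption that llama-server is available on the PATH are placeholders rather than part of the original setup.

```python
# Illustrative only: the benchmarks were run in LM Studio's GUI, not via this
# script. The flags below show a rough llama.cpp (llama-server) equivalent of
# the configuration listed above; model path and port are placeholders.
import subprocess

subprocess.run([
    "llama-server",                # assumes llama-server is on the PATH
    "-m", "path/to/model.gguf",    # placeholder model path
    "-c", "8192",                  # fixed 8,192-token context window
    "-ngl", "99",                  # offload all layers to the GPU when possible
    "--port", "8080",              # placeholder port for the local server
])
```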

Test Dataset

A 27KB English Wikipedia article about the ancient Greek philosopher Plato served as the benchmark context. This article was specifically selected because it:

  • Fits within the 8,192-token context window without truncation
  • Represents a realistic use case with dense, informational content
  • Provides consistent complexity for fair cross-model comparison
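As a quick plausibility check on the first point, a common rule of thumb of roughly four characters per token for English prose puts the 27KB article well inside the 8,192-token window. The sketch below illustrates the estimate; the characters-per-token ratio is an assumption, not a measured tokenizer value.

```python
# Back-of-the-envelope check that the 27KB article fits into the
# 8,192-token context window. Assumes ~4 characters per token for
# English prose, a rule of thumb rather than a tokenizer result.
article_bytes = 27 * 1024          # ~27KB of mostly ASCII text
chars_per_token = 4                # assumed average for English text
estimated_tokens = article_bytes / chars_per_token

print(f"estimated prompt tokens: ~{estimated_tokens:.0f}")   # ~6912
print("fits in 8192-token window:", estimated_tokens < 8192)
```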

Prompt

Create a summary of the provided text that includes:

* A brief 2-3 sentence introduction that captures the main topic and purpose

* 4 bullet points highlighting the key findings, arguments, or takeaways

* Keep each bullet point to 1-2 sentences and focus on the most important information
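For readers who want to reproduce a run programmatically rather than through the LM Studio GUI, the sketch below streams this prompt (with the article prepended) through LM Studio's OpenAI-compatible local server and times the output. The base URL, file path, and model identifier are assumptions about the local setup, and counting streamed chunks only approximates the token count LM Studio reports.

```python
# Illustrative sketch of one benchmark run against LM Studio's local
# OpenAI-compatible server (default http://localhost:1234/v1). Model name,
# article path, and server address are assumptions about the setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("plato_wikipedia.txt", encoding="utf-8") as f:  # placeholder path
    article = f.read()

prompt = (
    "Create a summary of the provided text that includes:\n"
    "* A brief 2-3 sentence introduction that captures the main topic and purpose\n"
    "* 4 bullet points highlighting the key findings, arguments, or takeaways\n"
    "* Keep each bullet point to 1-2 sentences and focus on the most "
    "important information\n\n" + article
)

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen3-4b",  # placeholder model identifier as loaded in LM Studio
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
end = time.perf_counter()

print(f"time to first token: {first_token_at - start:.2f}s")
print(f"~{chunks} chunks in {end - first_token_at:.2f}s "
      f"(~{chunks / (end - first_token_at):.1f} chunks/sec)")
```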

Qwen3-0.6b Q8

Setup

First run

44.47 tok/sec

485 tokens

2.26s to first token ← No cache used

Second run

43.71 tok/sec

446 tokens

0.06s to first token ← Cache used

Third run

43.78 tok/sec

429 tokens

0.06s to first token ← Cache used

Qwen3-4B Q8

Setup

Performance

First run

19.61 tok/sec

627 tokens

10.09s to first token ← No cache used

Second run

19.85 tok/sec

697 tokens

0.09s to first token ← Cache used

Third run

19.92 tok/sec

598 tokens

0.09s to first token ← Cache used

Gemma-3 12b

Setup

Performance

First run

7.75 tok/sec

324 tokens

28.61s to first token ← No cache used

Second run

7.65 tok/sec

289 tokens

0.17s to first token ← Cache used

Third run

7.60 tok/sec

333 tokens

0.17s to first token ← Cache used

Magistral-Small Q8

Setup

Performance

First run

6.60 tok/sec

924 tokens

80.20s to first token ← No cache used

Second run

6.56 tok/sec

2561 tokens

0.19s to first token ← Cache used

Third run

6.62 tok/sec

1041 tokens

0.22s to first token ← Cache used

Qwen-30B-A3B Q4_K_M

Setup

Performance

First run

25.66 tok/sec

668 tokens

67.12s to first token ← No cache used

Second run

25.80 tok/sec

947 tokens

0.08s to first token ← Cache used

Third run

26.07 tok/sec

773 tokens

0.09s to first token ← Cache used

Qwen-30B-A3B Q8

Setup

Performance

The performance screenshot shows two distinct phases of GPU operation: the initial KV cache computation and the subsequent inference stage. During the transition between these phases, the GPU clock speed drops from roughly 2750 MHz to about 2500 MHz, and GPU utilization falls from nearly 100% to around 84%. This suggests a shift in computational demands or resource allocation as the system moves from the data-intensive KV cache preparation to the inference workload. Possible causes include thermal throttling, power limits, or differences in the computational parallelism of each phase.

First run

23.12 tok/sec

666 tokens

62.42s to first token ← No cache used

Second run

23.68 tok/sec

651 tokens

0.08s to first token ← Cache used

Third run

23.51 tok/sec

836 tokens

0.08s to first token ← Cache used

QwQ 32B Q8

Setup

Performance

First run

4.65 tok/sec

782 tokens

119.42s to first token ← No cache used

Second run

4.64 tok/sec

1125 tokens

0.24s to first token ← Cache used

Third run

4.62 tok/sec

1297 tokens

0.25s to first token ← Cache used

Llama-4-Scout-17B-16e-Instruct Q4_K_M

Setup

Only 42 of the model's 48 layers could be offloaded to the GPU; the remaining 6 layers run in CPU mode. This limitation appears to stem from the Vulkan driver, which restricts GPU memory allocation to a maximum of 64GB. Layers that would exceed this threshold fall back to CPU processing, reducing overall inference speed.
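The 42-of-48 split can be made plausible with a back-of-the-envelope calculation, sketched below. The total weight size and the reserve for KV cache and buffers are assumed, approximate figures used purely for illustration, not measured values from this benchmark.

```python
# Rough illustration of why only 42 of 48 layers fit under the 64GB
# Vulkan allocation limit. Both figures marked "assumption" below are
# approximate placeholders, not measurements from this benchmark run.
vulkan_limit_gb = 64
assumed_weights_gb = 65                        # assumption: approximate Q4_K_M weight size
n_layers = 48
per_layer_gb = assumed_weights_gb / n_layers   # ~1.35 GB per layer

kv_and_buffers_gb = 7                          # assumption: KV cache + scratch buffers
budget_gb = vulkan_limit_gb - kv_and_buffers_gb

max_layers_on_gpu = int(budget_gb // per_layer_gb)
print(f"layers that fit on the GPU: {max_layers_on_gpu} of {n_layers}")  # ~42
```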

Performance

The performance screenshot shows the KV cache computation at the beginning and the inference afterwards. The KV cache appears to be computed entirely on the GPU at 100% utilization, whereas inference uses both GPU and CPU. The GPU clock speed drops significantly once CPU and GPU are used together for inference, which could indicate power or thermal throttling, or a CPU bottleneck in which the GPU idles while waiting for data. Further analysis would be needed to determine the exact cause and its impact on performance.

First run

5.92 tok/sec

301 tokens

82.54s to first token ← No cache used

Second run

6.11 tok/sec

356 tokens

0.23s to first token ← Cache used

Third run

6.18 tok/sec

286 tokens

0.21s to first token ← Cache used

Nous-Hermes-2-Mixtral-8x7 Q8

Setup

Performance

First run

10.65 tok/sec

208 tokens

38.46s to first token ← No cache used

Second run

10.75 tok/sec

225 tokens

0.13s to first token ← Cache used

Third run

10.76 tok/sec

266 tokens

0.13s to first token ← Cache used