HP ZBook Ultra 14-inch G1a LLM Benchmarks
Authors: Sören Gebbert*, Natalia Gebbert*
Date: 2025-06-15
*Institute for Holistic Technology Research GmbH
Llama-4-Scout-17B-16e-Instruct Q4_K_M
The HP ZBook Ultra 14-inch G1a Mobile Workstation equipped with the AMD Ryzen AI MAX+ PRO 395 processor demonstrates exceptional capabilities for local large language model deployment, particularly excelling with Mixture of Experts (MoE) architectures. The system's substantial 128 GB RAM capacity and theoretical memory bandwidth of up to 256 GB/s position it as a compelling platform for enterprise-grade AI workloads that demand significant computational resources without relying on cloud infrastructure.
Our benchmark results reveal distinct performance characteristics across model architectures and sizes. Dense models such as the smaller Qwen variants (0.6B and 4B parameters) achieve throughput rates of 44.47 and 19.61 tokens per second respectively, demonstrating the system's efficiency with lightweight models. However, the true strength of this hardware configuration becomes apparent with larger, more complex architectures. The Qwen-30B-A3B models, representing advanced MoE designs, sustain roughly 23 to 26 tokens per second depending on quantization, a significant achievement for local inference on mobile hardware.
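One way to understand why the MoE models fare so well on this platform is a rough back-of-envelope estimate: token generation is typically limited by how many weight bytes must be read from memory per token, and for an MoE model only the active experts are read. The sketch below is a simplification under that assumption; it ignores KV cache reads, assumes roughly one byte per weight at Q8, and treats the 256 GB/s theoretical bandwidth as fully usable, which it is not in practice. The ceilings are therefore only illustrative.

```python
# Back-of-envelope sketch: decode throughput ceiling if generation were purely
# memory-bandwidth-bound (every active weight read once per generated token).
# Assumptions: 256 GB/s theoretical bandwidth, ~1 byte per weight at Q8;
# real effective bandwidth and throughput are lower.
BANDWIDTH_GB_S = 256.0

def decode_ceiling_tok_s(active_params_billion: float, bytes_per_weight: float) -> float:
    """Upper bound on tokens/sec: bandwidth divided by bytes read per token."""
    return BANDWIDTH_GB_S / (active_params_billion * bytes_per_weight)

# Dense 32B model at Q8: all ~32B weights are touched for every token.
print(f"Dense 32B Q8 ceiling:   {decode_ceiling_tok_s(32, 1.0):6.1f} tok/s  (measured ~4.6)")
# Qwen-30B-A3B at Q8: only ~3B parameters are active per token (MoE routing).
print(f"MoE 30B-A3B Q8 ceiling: {decode_ceiling_tok_s(3, 1.0):6.1f} tok/s  (measured ~23)")
```

Both measured values sit well below these ceilings, but the roughly order-of-magnitude advantage the estimate predicts for the MoE model over a dense 32B model matches the direction of the measurements.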
Particularly noteworthy is the system's ability to handle even the most demanding models in our test suite. The Llama-4-Scout-17B model, despite its substantial computational requirements, maintains respectable performance at approximately 6 tokens per second. This performance level is especially impressive considering the model's complexity and the fact that current Vulkan driver limitations restrict GPU memory allocation to 64GB, forcing some layers to execute on CPU cores.
The benchmark data clearly illustrates the substantial impact of KV cache optimization, with second and third runs showing dramatically reduced time-to-first-token latencies (from tens of seconds to milliseconds) while maintaining consistent throughput. This caching efficiency is crucial for interactive applications and demonstrates the system's suitability for real-world deployment scenarios.
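To make the caching impact concrete, the end-to-end response time can be approximated from the measured values as time to first token plus generated tokens divided by throughput. A small illustrative calculation using the Qwen-30B-A3B Q4_K_M rows of the results table below:

```python
# Approximate end-to-end response time: time to first token + generation time.
# Values taken from the Qwen-30B-A3B Q4_K_M rows of the results table.
def response_time(ttft_s: float, tokens: int, tok_per_s: float) -> float:
    return ttft_s + tokens / tok_per_s

cold = response_time(67.12, 668, 25.66)   # first run, no cache
warm = response_time(0.09, 773, 26.07)    # third run, cache used
print(f"Cold cache: {cold:6.1f} s total")  # ~93 s, dominated by prompt processing
print(f"Warm cache: {warm:6.1f} s total")  # ~30 s, dominated by generation itself
```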
A critical finding from our evaluation concerns the Vulkan driver's current 64 GB allocation limit, which prevents full GPU utilization for the largest models. Our analysis of the Llama-4-Scout model revealed that only 42 of 48 layers could be offloaded to the GPU, with the remaining layers processed on the CPU. Performance monitoring during these hybrid GPU-CPU operations showed GPU clock speeds dropping from 2650 MHz to 850 MHz and GPU utilization falling from 100% to about 70%, with CPU utilization around 61%. Addressing this limitation through driver updates could potentially unlock token generation rates exceeding 10 tokens per second for even the largest models in our test suite.
The integrated AMD Radeon 8060S GPU, featuring 40 compute units working in conjunction with the 16-core CPU, provides a well-balanced architecture for AI workloads. With a maximum power consumption of just 70 Watts for the GPU, this APU design offers significant advantages over discrete GPU solutions in mobile form factors, including improved power efficiency and reduced thermal constraints while maintaining competitive performance.
Based on our findings, the HP ZBook Ultra 14-inch G1a represents a strategic investment for organizations seeking to deploy large-scale language models locally on x86 architecture. Its exceptional memory capacity, robust processing capabilities, and proven performance with MoE models up to 100 billion parameters position it as a future-ready platform for advanced AI applications. As driver optimizations and software improvements continue to emerge, this system's already impressive capabilities are likely to expand further, making it an ideal choice for researchers, developers, and enterprises requiring powerful, portable AI inference capabilities without dependence on external cloud services.
This document presents a benchmark of various Large Language Models (LLMs) performed on an HP ZBook Ultra 14-inch G1a Mobile Workstation.
The document includes:
* An executive summary of the findings
* An overview table of the benchmark results for all tested models
* The hardware and software configuration of the test system
* The benchmark methodology, including the context article and prompt used
* Detailed per-model results with notes on GPU utilization and layer offloading
The purpose of this document is to provide a detailed record of LLM performance on the specified mobile workstation.
| Model | Run Type | Tokens/sec | Tokens | Time to First Token | Cache Used |
|---|---|---|---|---|---|
| Qwen3-0.6b Q8 | First run | 44.47 | 485 | 2.26s | No |
| Qwen3-0.6b Q8 | Second run | 43.71 | 446 | 0.06s | Yes |
| Qwen3-0.6b Q8 | Third run | 43.78 | 429 | 0.06s | Yes |
| Qwen3-4B Q8 | First run | 19.61 | 627 | 10.09s | No |
| Qwen3-4B Q8 | Second run | 19.85 | 697 | 0.09s | Yes |
| Qwen3-4B Q8 | Third run | 19.92 | 598 | 0.09s | Yes |
| Gemma-3 12b | First run | 7.75 | 324 | 28.61s | No |
| Gemma-3 12b | Second run | 7.65 | 289 | 0.17s | Yes |
| Gemma-3 12b | Third run | 7.60 | 333 | 0.17s | Yes |
| Magistral-Small Q8 | First run | 6.60 | 924 | 80.20s | No |
| Magistral-Small Q8 | Second run | 6.56 | 2561 | 0.19s | Yes |
| Magistral-Small Q8 | Third run | 6.62 | 1041 | 0.22s | Yes |
| Qwen-30B-A3B Q4_K_M | First run | 25.66 | 668 | 67.12s | No |
| Qwen-30B-A3B Q4_K_M | Second run | 25.80 | 947 | 0.08s | Yes |
| Qwen-30B-A3B Q4_K_M | Third run | 26.07 | 773 | 0.09s | Yes |
| Qwen-30B-A3B Q8 | First run | 23.12 | 666 | 62.42s | No |
| Qwen-30B-A3B Q8 | Second run | 23.68 | 651 | 0.08s | Yes |
| Qwen-30B-A3B Q8 | Third run | 23.51 | 836 | 0.08s | Yes |
| QwQ 32B Q8 | First run | 4.65 | 782 | 119.42s | No |
| QwQ 32B Q8 | Second run | 4.64 | 1125 | 0.24s | Yes |
| QwQ 32B Q8 | Third run | 4.62 | 1297 | 0.25s | Yes |
| Llama-4-Scout-17B-16e-Instruct | First run | 5.92 | 301 | 82.54s | No |
| Llama-4-Scout-17B-16e-Instruct | Second run | 6.11 | 356 | 0.23s | Yes |
| Llama-4-Scout-17B-16e-Instruct | Third run | 6.18 | 286 | 0.21s | Yes |
| Nous-Hermes-2-Mixtral-8x7B Q8 | First run | 10.65 | 208 | 38.46s | No |
| Nous-Hermes-2-Mixtral-8x7B Q8 | Second run | 10.75 | 225 | 0.13s | Yes |
| Nous-Hermes-2-Mixtral-8x7B Q8 | Third run | 10.76 | 266 | 0.13s | Yes |
Website: https://www.hp.com/de-de/workstations/zbook-ultra.html
Device: HP ZBook Ultra 14-inch G1a Mobile Workstation
Processor: AMD Ryzen AI MAX+ PRO 395 with Radeon 8060S graphics
RAM: 128 GB
VRAM: 512 MB (set in BIOS)
Edition: Windows 11 Pro
Version: 24H2
We benchmarked various LLMs using LM Studio v0.3.16, chosen for its reliability. A key part of our evaluation was the consistent use of the Vulkan llama.cpp v1.34.1 runtime. This llama.cpp version was selected for its local LLM optimizations, efficiency, and consistent results across hardware, ensuring direct comparability and minimizing software variables.
All LLM benchmark runs used a standardized configuration to ensure comparability and reproducibility across tests. A 27 KB English Wikipedia article about the ancient Greek philosopher Plato served as the benchmark context; it was selected because it provides a long, information-dense input that exercises both prompt processing and the KV cache.
Each model was asked to summarize this article using the following prompt; a minimal measurement sketch follows the prompt list below.

Create a summary of the provided text that includes:
* A brief 2-3 sentence introduction that captures the main topic and purpose
* 4 bullet points highlighting the key findings, arguments, or takeaways
* Keep each bullet point to 1-2 sentences and focus on the most important information
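As a rough illustration of how such a run can be timed outside of LM Studio's UI, the sketch below sends the article plus the summarization prompt to LM Studio's local OpenAI-compatible server and records the time to the first streamed token, once with a cold cache and once with a warm one. It is a minimal sketch under a few assumptions: the local server is running on LM Studio's default port 1234, `MODEL` is adjusted to the identifier of the loaded model, and `plato.txt` is a placeholder file name for the saved article text. LM Studio reports the same statistics directly in its UI.

```python
# Minimal sketch: measure time-to-first-token against LM Studio's local
# OpenAI-compatible server, first with a cold cache, then with a warm one.
# Assumptions: server on the default port 1234, MODEL set to the loaded model,
# and plato.txt containing the benchmark article (placeholder file name).
import json
import time

import requests

URL = "http://localhost:1234/v1/chat/completions"
MODEL = "qwen3-30b-a3b"  # placeholder identifier, adjust to the loaded model

PROMPT = (
    "Create a summary of the provided text that includes:\n"
    "* A brief 2-3 sentence introduction that captures the main topic and purpose\n"
    "* 4 bullet points highlighting the key findings, arguments, or takeaways\n"
    "* Keep each bullet point to 1-2 sentences and focus on the most important information"
)

def timed_run(context: str) -> float:
    """Stream one completion and return the time to the first generated token."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": context + "\n\n" + PROMPT}],
        "stream": True,
    }
    start = time.perf_counter()
    ttft = None
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            delta = json.loads(line[len(b"data: "):])["choices"][0]["delta"]
            if delta.get("content") and ttft is None:
                ttft = time.perf_counter() - start  # first visible token
    return ttft if ttft is not None else float("nan")

article = open("plato.txt", encoding="utf-8").read()  # the 27 KB benchmark context
print(f"First run (cold cache): {timed_run(article):.2f}s to first token")
print(f"Second run (warm cache): {timed_run(article):.2f}s to first token")
```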
Detailed results per model

Qwen3-0.6b Q8
First run (no cache): 44.47 tok/sec, 485 tokens, 2.26s to first token
Second run (cache used): 43.71 tok/sec, 446 tokens, 0.06s to first token
Third run (cache used): 43.78 tok/sec, 429 tokens, 0.06s to first token
Qwen3-4B Q8
First run (no cache): 19.61 tok/sec, 627 tokens, 10.09s to first token
Second run (cache used): 19.85 tok/sec, 697 tokens, 0.09s to first token
Third run (cache used): 19.92 tok/sec, 598 tokens, 0.09s to first token
Gemma-3 12b
First run (no cache): 7.75 tok/sec, 324 tokens, 28.61s to first token
Second run (cache used): 7.65 tok/sec, 289 tokens, 0.17s to first token
Third run (cache used): 7.60 tok/sec, 333 tokens, 0.17s to first token
Magistral-Small Q8
First run (no cache): 6.60 tok/sec, 924 tokens, 80.20s to first token
Second run (cache used): 6.56 tok/sec, 2561 tokens, 0.19s to first token
Third run (cache used): 6.62 tok/sec, 1041 tokens, 0.22s to first token
Qwen-30B-A3B Q4_K_M
First run (no cache): 25.66 tok/sec, 668 tokens, 67.12s to first token
Second run (cache used): 25.80 tok/sec, 947 tokens, 0.08s to first token
Third run (cache used): 26.07 tok/sec, 773 tokens, 0.09s to first token
The performance screenshot shows the KV cache computation at the beginning and the inference afterwards. During the KV cache computation the GPU runs at roughly 2750 MHz and close to 100% utilization; during inference the clock speed drops to about 2500 MHz and utilization to about 84%. This suggests a shift in computational demands as the system moves from the data-intensive KV cache preparation to token generation; possible causes include thermal throttling, power limits, or reduced parallelism in the decode phase.
Qwen-30B-A3B Q8
First run (no cache): 23.12 tok/sec, 666 tokens, 62.42s to first token
Second run (cache used): 23.68 tok/sec, 651 tokens, 0.08s to first token
Third run (cache used): 23.51 tok/sec, 836 tokens, 0.08s to first token
QwQ 32B Q8
First run (no cache): 4.65 tok/sec, 782 tokens, 119.42s to first token
Second run (cache used): 4.64 tok/sec, 1125 tokens, 0.24s to first token
Third run (cache used): 4.62 tok/sec, 1297 tokens, 0.25s to first token
Llama-4-Scout-17B-16e-Instruct
Only 42 of the 48 layers could be offloaded to the GPU; the remaining 6 layers run in CPU mode. This appears to be caused by an allocation limit of 64 GB in the Vulkan driver: memory beyond this threshold cannot be allocated on the GPU, so the affected layers fall back to CPU processing, which reduces overall throughput.
The performance screenshot again shows the KV cache computation at the beginning and the inference afterwards. The KV cache appears to be computed entirely on the GPU at 100% utilization, whereas the inference uses both GPU and CPU. The GPU clock speed drops significantly when CPU and GPU are used together for inference, which could indicate power or thermal throttling or a CPU bottleneck that leaves the GPU waiting for data; further analysis would be needed to determine the exact cause and its impact on performance.
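For readers who want to reproduce such a partial offload outside of LM Studio, the number of GPU-offloaded layers is an explicit parameter in llama.cpp-based tooling (LM Studio exposes it as the GPU offload setting). Below is a minimal sketch using the llama-cpp-python bindings, assuming a build with Vulkan support and a local GGUF file whose path is a placeholder; it is not the exact setup used for the benchmarks above.

```python
# Minimal sketch: load a GGUF model with only part of its layers on the GPU.
# Assumptions: llama-cpp-python built with Vulkan support, and a local GGUF
# file whose path is a placeholder here.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=42,   # 42 of 48 layers on the GPU, the rest stays on the CPU
    n_ctx=16384,       # large enough context for the 27 KB benchmark article
    verbose=True,      # load-time logs include the layer/device split
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: ..."}],
    max_tokens=300,
)
print(out["choices"][0]["message"]["content"])
```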
First run (no cache): 5.92 tok/sec, 301 tokens, 82.54s to first token
Second run (cache used): 6.11 tok/sec, 356 tokens, 0.23s to first token
Third run (cache used): 6.18 tok/sec, 286 tokens, 0.21s to first token
Nous-Hermes-2-Mixtral-8x7B Q8
First run (no cache): 10.65 tok/sec, 208 tokens, 38.46s to first token
Second run (cache used): 10.75 tok/sec, 225 tokens, 0.13s to first token
Third run (cache used): 10.76 tok/sec, 266 tokens, 0.13s to first token