What GitHub Actions runners actually give you, why OpenCL on CPU is harder than it sounds, and the one Intel package that saved our CI
If you’ve ever tried to run GPU code in CI, you’ve probably hit the same wall: standard runners don’t have GPUs. That’s fine for most projects - but what if your code generates GPU kernels, and you need to verify they’re correct? Not fast, just correct. That’s the problem we ran into while building an OpenCL-based LLM inference engine on the JVM, and solving it taught us more about the OpenCL runtime landscape than we ever expected to learn.
This post is about the practical challenges of getting GPU-targeting code tested in environments without GPU hardware - specifically GitHub Actions. The project behind it uses Project Babylon / HAT (an experimental OpenJDK fork that compiles Java to GPU kernels), but the lessons apply to anyone doing OpenCL, CUDA, or heterogeneous compute work in CI.
1. What GitHub Actions Actually Gives You
Let’s start with the hardware. When you spin up an ubuntu-latest runner on GitHub Actions, you’re getting a virtual machine with an AMD EPYC processor - typically a 2-core or 4-core slice of a server-class CPU. There is no GPU. No discrete card, no integrated graphics, nothing. The VM doesn’t even have a display adapter.
This is perfectly fine for 99% of CI workloads. But the moment your project involves GPU compute - OpenCL kernels, CUDA code, Metal shaders, anything that targets an accelerator - you're in trouble. GitHub's larger runners do offer GPU options (NVIDIA T4 and L4 since late 2024), but they're expensive, limited in availability, often overkill if all you need is a correctness check - and they're not available on the free tier.
The question then becomes: can you test GPU-targeting code on a CPU? The answer is “yes, but” - and the “but” is where things get interesting.
2. The Two Things You’re Actually Testing
When you run GPU tests in CI, it’s worth separating two concerns that are easy to conflate:
Correctness: Does the generated kernel code compile and produce the right results? This is what you care about on every PR. It’s the “does it work” question.
Performance: How fast does it actually run on real hardware? This matters for benchmarks and optimization, but you don’t need it on every push.
We started with GCP T4 instances for both - real NVIDIA hardware with full OpenCL and CUDA support. The numbers were validating (10× speedup from GPU HAT kernels vs. plain Java), but the cost and availability constraints made it impractical as a per-PR gate. What we really wanted was: real GPU benchmarks on a schedule, correctness checks on every push. And for correctness checks, you don’t need a GPU - you need an OpenCL runtime.
3. PoCL - The Obvious First Choice (and Its Limits)
PoCL (Portable Computing Language) is the go-to CPU-based OpenCL implementation. It’s open source, well-maintained, and trivially installable on Ubuntu:
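The install is a single apt transaction - a minimal sketch, using the package names as they appear in the Ubuntu 24.04 repositories (clinfo is optional but handy for confirming the platform registered):

```shell
# PoCL registers itself as an OpenCL ICD; clinfo lets you verify it shows up.
sudo apt-get update
sudo apt-get install -y pocl-opencl-icd opencl-headers clinfo
clinfo | grep -i "portable computing language"
```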
For F32 (single-precision floating point) workloads, PoCL works great. We ran our benchmarks and nightly validation workflows on it for weeks without issues. If your GPU code only uses standard 32-bit operations, PoCL might be all you ever need.
But here’s the catch: PoCL does not support cl_khr_fp16.
That’s the OpenCL extension for half-precision (16-bit) floating-point operations. If your code uses half types - increasingly common in ML workloads, where models store weights in FP16 to save memory - PoCL simply can’t execute those kernels. The runtime rejects them at build time with an error about the missing extension.
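For context, here’s a minimal kernel of the kind that trips this - any use of the half type needs the extension pragma, and on PoCL the build fails regardless (kernel name and logic are illustrative, not our actual generated code):

```c
// OpenCL C - requires cl_khr_fp16; PoCL's clBuildProgram rejects this.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void scale_f16(__global half *weights, const float factor) {
    size_t i = get_global_id(0);
    // Widen to float, scale, narrow back to half.
    weights[i] = (half)((float)weights[i] * factor);
}
```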
This was exactly our situation. We’d switched to native FP16 weight storage (saving ~1.8 GB for a 1B parameter model), and suddenly our CI couldn’t validate the generated kernels at all. PoCL went from “perfect” to “completely insufficient” with a single feature change.
4. The OpenCL Runtime Landscape on Linux (It’s a Mess)
If you’ve never had to shop for OpenCL runtimes, congratulations - it’s one of those experiences that makes you appreciate how well-organized the CUDA ecosystem is by comparison. Here’s what we found when searching for cl_khr_fp16 support on a CPU, running on Ubuntu 24.04:
Attempt 1: intel-opencl-icd from Ubuntu Repos
The package simply doesn’t exist in Ubuntu 24.04 (noble). It was available in older releases through Intel’s unofficial PPA, but that ship has sailed. PoCL remains the only OpenCL platform you’ll get from standard repositories.
Attempt 2: Intel Compute Runtime (NEO)
Intel’s compute-runtime (codenamed NEO) is actively maintained, has GitHub releases with .deb packages, and supports cl_khr_fp16. Looks perfect on paper.
The problem: NEO is a GPU-only runtime. It targets Intel integrated graphics - Gen9, Gen11, Xe, Arc. On an AMD EPYC processor (which is what GitHub Actions gives you), it installs fine, but clinfo shows zero devices. NEO literally has nothing to talk to. Dead end.
This is a crucial distinction that cost us a few hours: Intel compute-runtime ≠ Intel CPU runtime. The naming is confusing, and Google results mix them freely. If you see “Intel OpenCL” in a Stack Overflow answer, make sure you know which one they’re talking about.
Attempt 3: Intel oneAPI CPU Runtime
Intel’s oneAPI toolkit includes a CPU-based OpenCL runtime that actually runs on any x86-64 processor, regardless of whether Intel graphics are present. It lives in a dedicated APT repository:
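Setting it up means adding Intel’s signing key and repository, then installing the runtime package. This sketch follows Intel’s oneAPI APT instructions as of the time of writing - the repo URL and key location are worth double-checking against Intel’s current docs:

```shell
# Add Intel's oneAPI APT repository and install the CPU OpenCL runtime.
wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | sudo gpg --dearmor -o /usr/share/keyrings/oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt-get update
sudo apt-get install -y intel-oneapi-runtime-opencl
```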
The critical detail: FP16 support is behind an environment variable:
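Concretely:

```shell
# Enables experimental FP16 support in the Intel CPU OpenCL runtime.
export CL_CONFIG_CPU_EXPERIMENTAL_FP16=1
# Verify: clinfo | grep cl_khr_fp16  (the extension should now be listed)
```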
With that set, clinfo shows the Intel platform with cl_khr_fp16 in the extensions list. The word “experimental” might raise eyebrows, but remember, we’re not shipping production inference on this. We’re validating that the generated OpenCL C code compiles and computes correctly. For that purpose, “experimental” is more than sufficient.
5. The Platform Selection Trap
Getting the right OpenCL runtime installed is only half the battle. The other half is making sure your code actually uses it.
OpenCL has a concept of platforms - each installed runtime registers itself as a platform, and the application (or library) chooses which one to use. On our GitHub Actions runner, after installing Intel oneAPI alongside PoCL, clinfo reported:
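Abridged (clinfo -l form; the exact platform strings vary by runtime version, but the ordering is the part that matters):

```text
Platform #0: Portable Computing Language
Platform #1: Intel(R) OpenCL
```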
Guess which one our runtime picked by default? Platform #0 - PoCL. The one without FP16 support.
This is a general problem with OpenCL multi-platform setups. Many libraries and frameworks simply grab the first available platform, or the first device, without checking for specific extension support. You can build sophisticated platform selection logic, but in CI the pragmatic solution is much simpler: remove the runtime you don’t want.
Is it elegant? No. Does it guarantee the right platform is selected? Absolutely. In a CI environment where you control the entire stack, there’s no shame in brute-force simplicity.
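In practice, "remove the runtime you don’t want" is one step in the workflow - either unregister PoCL’s ICD file or purge the package. The vendors-directory path below follows the standard ICD loader convention; check that the file name matches your install:

```shell
# Option A: unregister PoCL from the ICD loader
# (standard location: /etc/OpenCL/vendors - confirm the file name with ls).
sudo rm -f /etc/OpenCL/vendors/pocl.icd

# Option B: remove the package entirely.
sudo apt-get remove -y pocl-opencl-icd

# Either way, only the Intel platform should remain:
clinfo -l
```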
6. What You Can (and Can’t) Validate Without a GPU
With Intel oneAPI’s CPU runtime, here’s what we’re actually able to test on every push:
Codegen correctness: the generated OpenCL C code compiles without errors. This catches syntax bugs, type mismatches, and incorrect pointer dereferences - the kind of issues that are embarrassing to ship and hard to debug on remote GPU hardware.
Semantic correctness: the kernel produces the right numerical results. FP16 operations on CPU might differ slightly from GPU in the last few ULPs (units of least precision), but for our test suite, an exact match was sufficient.
Extension compatibility: the generated code only uses OpenCL extensions that are available on the target platform. This is the cl_khr_fp16 check that PoCL couldn’t give us.
What you can’t validate:
Real-world performance: a CPU emulating FP16 GPU operations is orders of magnitude slower than an actual GPU. Our benchmarks still run on dedicated GCP T4 instances on a schedule.
GPU-specific bugs: driver quirks, memory coalescing issues, warp/wavefront divergence - these only manifest on real hardware. CPU OpenCL runtimes use very different execution models.
Memory constraints: GPU VRAM limits are completely different from system RAM. A kernel that fits comfortably in 16 GB of system memory might OOM on a 4 GB GPU.
The key insight is that most CI failures aren’t GPU-specific. They’re codegen bugs, type errors, and logic mistakes - exactly the things a CPU-based OpenCL runtime can catch. Save the real GPU time for benchmarks and integration tests.
7. The Bug That Proved the Approach
To illustrate why this matters, here’s the actual codegen bug that motivated the whole effort: our FP16 kernel code was generating a broken struct access.
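A reconstructed sketch of what it emitted - the identifier names are made up, but the two defects are the real ones:

```c
/* What the generator emitted (broken): */
half h = &ctx->buf[i];

/* What it should have emitted: */
half h = ctx.buf[i];
```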
Two bugs: a spurious address-of operator &, and -> instead of . (treating a struct as a pointer). This kind of codegen error would crash on any OpenCL platform - it’s not a GPU-vs-CPU issue. But without any OpenCL runtime in CI, we wouldn’t have caught it until someone ran the code on a real GPU.
With the CPU runtime in place, we built a reproducer workflow that tracked this upstream bug. We parameterized it with repository and branch inputs so we could test proposed fixes without merging anything. When the fix landed, our CI went green automatically. That’s the kind of feedback loop CI should give you.
8. A Practical Cheat Sheet for GPU Testing on GitHub Actions
If you’re setting up GPU-targeting CI on GitHub Actions, here’s the decision tree:
| What you need | Runtime | Notes |
| --- | --- | --- |
| OpenCL F32 only | PoCL | apt-get install pocl-opencl-icd |
| OpenCL with FP16 | Intel oneAPI CPU | Needs CL_CONFIG_CPU_EXPERIMENTAL_FP16=1 |
| Real GPU perf | GCP/AWS GPU runners | T4 or L4, schedule-based |
| CUDA kernels | No CPU fallback | Needs actual NVIDIA hardware |
A few more things to keep in mind:
Cache aggressively. If your project depends on a custom JDK, framework build, or any compilation step that takes more than a few minutes, invest in a proper caching strategy early. We cache our entire JDK fork build on a single hash key, and it cuts workflow time from 20 minutes to under 3.
Watch out for multi-platform conflicts. If you install multiple OpenCL runtimes, your library might pick the wrong one. Either remove the runtime you don’t need, or implement explicit platform selection. In CI, removing is simpler and more reliable.
Parameterize everything. Use workflow_dispatch inputs for repository URLs and branch names. When you depend on an upstream project that’s still in active development, being able to test any fork/branch combination without modifying your own code is invaluable.
Separate correctness from performance. Run correctness checks (CPU OpenCL) on every push. Run performance benchmarks (real GPU) on a nightly or weekly schedule. This gives you fast feedback where it matters and real-world numbers where they matter.
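Pulling the parameterization and caching advice together, here’s a trimmed sketch of what such a workflow can look like. Names, paths, defaults, and cache keys are placeholders rather than our actual config:

```yaml
name: opencl-correctness
on:
  push:
  workflow_dispatch:
    inputs:
      upstream_repo:
        description: "Upstream fork to test against (placeholder default)"
        default: "openjdk/babylon"
      upstream_branch:
        description: "Branch of the upstream fork"
        default: "main"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cache the expensive JDK fork build on a single hash key.
      - name: Cache JDK fork build
        uses: actions/cache@v4
        with:
          path: build/jdk
          key: jdk-${{ hashFiles('upstream.lock') }}
      # ... on cache miss: clone inputs.upstream_repo@inputs.upstream_branch
      #     and build; then install the Intel oneAPI CPU runtime, remove
      #     PoCL, export CL_CONFIG_CPU_EXPERIMENTAL_FP16=1, and run tests ...
```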
The underlying truth here is unsexy but important: most GPU CI failures have nothing to do with the GPU. They’re syntax errors, type bugs, and logic mistakes in the generated code. A CPU-based OpenCL runtime won’t catch driver quirks or memory bottlenecks, but it will catch the things that actually break every second PR. And it’ll catch them on a free GitHub Actions runner, in under five minutes, on every push.
Intel’s intel-oneapi-runtime-opencl with CL_CONFIG_CPU_EXPERIMENTAL_FP16=1 isn’t glamorous. The word “experimental” doesn’t inspire confidence. But it fills a gap in the CI toolchain that nothing else currently fills - and for a project where the code generator is a moving target, having that safety net is worth every minute we spent fighting OpenCL runtimes.