Researchers Demo VLA Code Gen That Scales Across Arm SVE Hardware Configurations

A team of researchers from Jan Moritz Joseph's group has published work tackling one of the thornier problems in modern compiler design: generating efficient machine learning code for vector-length-agnostic (VLA) instruction sets like Arm SVE, where each hardware implementation chooses its own vector length and the compiler does not know it ahead of time. The paper, posted to arXiv on May 12 and revised May 18, presents an approach built into MLIR/IREE that lets the compiler make sound tiling, fusion, and vectorization decisions even though the vector length is only known at runtime.

The VLA Compilation Problem

Traditional compiler code generation assumes you know the hardware's capabilities at compile time. Vector length, register sizes, and tile dimensions are typically baked in. SVE and similar scalable vector instruction sets break that assumption: SVE lets each implementation choose a vector length from 128 to 2048 bits in 128-bit increments. One chip might implement 256-bit vectors while another implements 1024-bit vectors, and both need to run the same binary efficiently. The researchers' answer is vector-length-aware packed data layouts: instead of fixing data layout decisions upfront, their system generates code that adapts its tiling and memory access patterns to whatever vector length it encounters at runtime.
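To make the idea of a vector-length-aware packed layout concrete, here is a minimal sketch in plain Python. It is not the paper's or IREE's implementation: the helper names `lanes_for` and `pack_columns` are hypothetical, and the packing scheme (column tiles whose width equals the number of lanes in one vector, zero-padded at the edge) is only one simple way such a layout could look.

```python
def lanes_for(vector_bits, element_bits=32):
    """Elements per vector register for a given SVE vector length (hypothetical helper)."""
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("SVE vector lengths are multiples of 128 bits, from 128 to 2048")
    return vector_bits // element_bits


def pack_columns(matrix, tile_n, pad=0):
    """Pack a row-major matrix into column tiles of width tile_n.

    Returns a list of tiles. Each tile holds every row's slice for that
    column block, padded to tile_n so each slice fills exactly one vector.
    """
    n_cols = len(matrix[0])
    tiles = []
    for start in range(0, n_cols, tile_n):
        tile = []
        for row in matrix:
            chunk = list(row[start:start + tile_n])
            chunk += [pad] * (tile_n - len(chunk))  # pad the ragged edge tile
            tile.append(chunk)
        tiles.append(tile)
    return tiles


# The same source data packs differently depending on the vector length
# that is only discovered at runtime.
source = [[0, 1, 2, 3, 4, 5],
          [6, 7, 8, 9, 10, 11]]
for bits in (128, 256):
    tiles = pack_columns(source, lanes_for(bits))
    print(bits, "bits:", len(tiles), "tile(s) of width", lanes_for(bits))
```

With 32-bit elements, a 128-bit machine gets two tiles of width 4 and a 256-bit machine gets one tile of width 8. The packing routine itself never changes; only the tile width, derived from the vector length, does.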

Integration Into MLIR/IREE

The work extends IREE's compilation pipeline with mechanisms for handling scalable vector lengths. Tiling now takes VLA constraints into account rather than assuming fixed-width vectors. Fusion decisions account for how packed layouts will behave across different hardware configurations. Vectorization emits code that adapts to the actual vector length rather than one-size-fits-all IR. This is not just theoretical: the approach has been validated on real-world ML workloads running on actual Arm CPUs, not only on synthetic benchmarks.
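The general shape of vector-length-agnostic code can be sketched as a strip-mined loop with a per-step predicate. The Python below only simulates that pattern; it is not IREE's generated code, and the function name `vla_axpy` and the `lanes` parameter (standing in for the runtime vector length) are assumptions made for illustration.

```python
def vla_axpy(alpha, x, y, lanes):
    """Compute alpha * x + y with a loop that never hard-codes the vector width."""
    n = len(x)
    out = [0.0] * n
    i = 0
    while i < n:
        # Predicate: which lanes of this step still fall inside the array.
        # SVE's WHILELT instruction produces a mask of this kind in hardware.
        active = [i + lane < n for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:  # inactive lanes are simply skipped, so no scalar tail loop is needed
                out[i + lane] = alpha * x[i + lane] + y[i + lane]
        i += lanes  # advance by the runtime vector length
    return out


x = [float(v) for v in range(10)]
y = [1.0] * 10
# The same function gives the same answer for any lane count.
results = [vla_axpy(2.0, x, y, lanes) for lanes in (4, 8, 16)]
print(all(r == results[0] for r in results))
```

Because the loop bound and the mask both derive from `lanes`, nothing needs to be recompiled when the vector length changes; only the number of steps does.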

Performance Results That Actually Matter

The numbers are worth dissecting carefully. Against IREE's existing NEON-based code generation (NEON is Arm's older fixed-width 128-bit SIMD), the SVE output achieved up to 1.45x speedup on production-style workloads. More tellingly, the researchers benchmarked against popular PyTorch ecosystem tools (ExecuTorch, TorchInductor, and eager execution) and consistently outperformed all of them. This matters because migrating ML inference pipelines to new hardware typically means accepting performance regressions or doing painful manual optimization; the results suggest that a properly tuned VLA compiler backend can remove much of that friction.

Simulator Study Validates Scalability Claims

Beyond real hardware testing, the team ran simulator-based experiments across different SVE vector lengths. On compute-bound workloads, the generated code scaled predictably as vector length increased. That is the expected outcome, but confirming it matters for performance portability planning: it suggests that on chips with longer vectors, the same compiled binaries can benefit from the extra width on compute-bound workloads without recompilation or hand-tuning.
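As a rough intuition for why compute-bound code should scale with vector length, consider an idealized model in which runtime is proportional to the number of vector steps. This is a back-of-the-envelope sketch, not the paper's methodology or data; `vector_steps` and `ideal_speedup` are hypothetical helpers, and the model ignores memory bandwidth, latency, and everything else that makes real scaling less than ideal.

```python
import math


def vector_steps(n_elements, vector_bits, element_bits=32):
    """Vector iterations needed to cover n_elements at a given vector length."""
    lanes = vector_bits // element_bits
    return math.ceil(n_elements / lanes)


def ideal_speedup(n_elements, vector_bits, baseline_bits=128, element_bits=32):
    """Upper bound on speedup over the baseline width under this simple model."""
    baseline = vector_steps(n_elements, baseline_bits, element_bits)
    return baseline / vector_steps(n_elements, vector_bits, element_bits)


for bits in (128, 256, 512, 1024, 2048):
    print(bits, vector_steps(4096, bits), ideal_speedup(4096, bits))
```

Under this model, doubling the vector length halves the step count whenever the problem size is a multiple of the lane count. Memory-bound workloads would not follow this curve, which is consistent with the article restricting the scaling claim to compute-bound cases.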

Key Takeaways

VLA instruction sets like Arm SVE require rethinking traditional compiler tiling and layout decisions
MLIR/IREE extended with vector-length-aware packed layouts achieves up to 1.45x speedup over NEON code generation
Outperforms PyTorch ecosystem frameworks including ExecuTorch, TorchInductor, and eager execution
Simulator validation confirms performance scales with increasing vector length on compute-bound workloads

The Bottom Line

This is the kind of foundational work that makes hardware diversity less painful for ML deployment. If you're building inference infrastructure that needs to span Arm chips from IoT to server class, getting VLA right in your compiler stack matters more than headline benchmark numbers suggest. Kudos to the IREE team for pushing this upstream rather than keeping it as another research prototype.
