Apple's On-Device AI Stack Gets Ground-Up Rebuild at WWDC 2026

WWDC 2026 brought no new silicon, but it delivered something arguably more significant: a structural rebuild of how AI actually runs on Apple hardware. The keynote headlines were all about consumer features, but dig into the developer documentation, WWDC session pages (324 "Meet Core AI", 325, 326, 330), and Apple's own machine learning research posts, and you find a clearer roadmap than Cupertino was willing to spell out on stage. I spent the night reading every layer of this so you don't have to—and some of what I found is genuinely odd.

The Framework Handoff: Core AI Replaces Core ML for Neural Networks

For a decade, Core ML was the answer to "run a model on an iPhone." That's over now. Apple introduced Core AI at WWDC 2026 with explicit framing that reads like a handover document, not an addition. Core AI's documentation sends legacy cases back to Core ML: "If your app uses model types other than neural networks, such as decision trees or tabular feature engineering, see Core ML." Meanwhile, Core ML's docs now point forward: "If your app integrates AI models using the latest architectures and inference techniques, see Core AI." Read together, it's a clean split—Core ML narrows to classic non-neural machine learning while neural networks and transformers move wholesale to Core AI. The tooling tells confirm this: Apple's new Core AI debug gauge explicitly states it "does not support the Core ML framework." The old APIs remain intact for backward compatibility, but the center of gravity has shifted.

A New Artifact: The Half-Open .aimodel Bundle

Core AI ships with a fresh on-disk format called .aimodel, and here's where it gets interesting—it's not actually a file. It's a directory. Apple's open coreai-models repository treats it as one throughout; the Python exporter deletes old artifacts with a directory-only call, and the Swift runtime resolves it as a ".aimodel directory." Inside sits a plain-JSON metadata.json (schema version 0.2) recording model kind, tokenizer, vocabulary size, context length, compression preset, and which file is the actual model weight payload. That JSON is documented and parseable—which means you can inspect what's in an .aimodel bundle without Apple's tools. But the weight data itself gets written by an opaque framework call with no published byte layout. Half-open: a readable manifest wrapped around an undocumented blob. Developers prepare models using Core AI Optimization (coreai-opt, coremltools' successor) and Core AI PyTorch Extensions (coreai-torch), then optionally compile ahead-of-time into per-architecture .aimodelc assets. The compression menu is wide—integer weights at 2, 4, and 8 bits; float micro-formats including FP8 (E4M3) and FP4 (E2M1); block-scaled MXFP8; palettization from 1 to 8 bits; plus activation quantization like w4a8 and w4a16. Given Apple's install base, the formats Apple blesses could end up shaping how sub-100B models ship industry-wide.

The Hardware Tell: Neural Accelerators Inside Every GPU Shader Core

No new chip generation was announced, but WWDC 2026 made the M5 and A19 GPU story explicit—and it's the clearest hardware signal of the week. Directly from Apple's "Accelerate your machine learning workloads with the M5 and A19 GPUs" tech talk: "Neural accelerators are dedicated hardware in M5 purpose built for matrix multiplication. They're built into each shader core right alongside the other GPU pipelines such as ALU, raytracing... Each shader core has its own neural accelerator." Apple's claimed numbers: matrix multiplications up to 4 to 8 times faster, LLM time-to-first-token up to four times faster on prefill (compute-bound), token generation up to 25% faster on decode (memory-bound). The underlying framing now matches what local-LLM runners have known for years—the roofline model is now Apple's own language in their Metal Performance Primitives Programming Guide: "GEMMs with low arithmetic intensity are memory bound workloads, and GEMMs with high arithmetic intensity are compute bound workloads." A second tell hides in code: the coreai-models source infers a model's preferred compute unit from its graph structure—chunked static-shape graphs prefer the Neural Engine; dynamic-shape graphs prefer the GPU. That quietly formalizes what Apple's been hinting at for years.

The Model: AFM 3 and the Bandwidth Wall Apple Explicitly Acknowledged

Apple's third-generation Foundation Models landed too: a 3-billion-parameter dense model (AFM 3 Core) and a 20-billion-parameter sparse mixture-of-experts variant (AFM 3 Core Advanced, natively multimodal, activating just 1 to 4 billion parameters at inference time). But the interesting part is where Apple admits the constraint plainly in their ML research post: "the full model is stored in flash memory (NAND)" and "NAND-to-DRAM bandwidth is too slow to swap weights token by token." That's Apple describing the exact wall every local-LLM runner hits—model too big for DRAM, paying in bytes moved per token. Their answer is mixture-of-experts with always-active shared experts plus input-dependent routed experts, keeping the shared weights resident while streaming minimal activations. It's a reminder that Apple isn't exempt from physics; it's just unusually candid about it in a research post.

The Cloud Boundary: Google and NVIDIA Run Apple's Flagship Model

Here's where it gets surprising. Apple's foundation models now span on-device to cloud, but the cloud end has an unexpected shape. From the AFM 3 research post: "we worked with Google and NVIDIA to extend Private Cloud Compute to NVIDIA GPUs in Google Cloud." And from Apple's security team directly: "collaborating with Google and NVIDIA to run new Apple Intelligence workloads on Google Cloud." The most demanding model runs on NVIDIA GPUs, in Google's cloud, built with Google's infrastructure. For a company that designs its own silicon and markets heavily on device-side privacy, the flagship cloud model living on competitor hardware in a competitor's cloud is the most surprising tell of WWDC 2026. What's still undocumented: exactly when a request transparently offloads from on-device to Private Cloud Compute, and whether that routing decision is visible to developers or users afterward. The spectrum is real; the switch mechanism and its auditability are simply not publicly specified.

What Developers Can Actually Profile

Core AI ships three profiling tools—a standalone Debugger app, an Xcode debug gauge, and an Instruments template—and they measure something real: "profiles execution timing across the CPU, GPU, and Neural Engine... such as which compute units run your model. The trace correlates Core AI events with hardware activity." Latency, token counts, which compute unit ran the model—inside Xcode for your own app's Core AI calls. What's notably absent from the profiling docs: energy consumption, memory bandwidth utilization, and thermal state don't appear anywhere in Core AI tooling documentation. That's a statement about what Apple chose to instrument, not an accident—and it's a notable gap given how much on-device performance is decided by exactly those three constraints.

The Parallel Track: MLX Keeps Its Own Path

Running alongside all of this, Apple continued investing in MLX as the bring-your-own-weights path for power users. WWDC 2026 added distributed inference across multiple Macs via a new JACCL backend over Thunderbolt 5, an OpenAI-compatible mlx_lm.server, and an agentic-on-Mac story built around it. Tellingly, the MLX sessions draw no line back to Core AI or Foundation Models—a deliberate two-track posture: Apple's own models run on Core AI and Foundation Models; the open community's models run on MLX.

Key Takeaways

Core AI replaces Core ML for neural networks—Core ML narrows to classic decision trees and tabular work while transformers move to the new stack
The .aimodel format is half-open: documented JSON manifest wrapped around opaque weight blobs you can't inspect without Apple's tools
M5/A19 Neural Accelerators live inside every GPU shader core—the matmul has officially moved from the Neural Engine to the GPU for transformer workloads
Apple explicitly acknowledges the NAND-to-DRAM bandwidth wall in published research, using mixture-of-experts as their answer
The cloud boundary remains undocumented: when Core AI transparently offloads and whether developers can audit it is simply not specified publicly
Apple's flagship cloud model runs on NVIDIA GPUs in Google Cloud—partnering with competitors on the most privacy-sensitive inference workloads

The Bottom Line

Pull back and the roadmap is clear: Apple just made on-device AI a first-class platform capability with its own framework, format, toolchain, and profiler. But they also shipped it faster than the story explaining it—Core ML, Core AI, and MLX now coexist, and developers were already asking which to use within hours of announcement. The most interesting tell isn't what Apple announced; it's that their flagship cloud model runs on Google/NVIDIA hardware while the switch mechanism between local and cloud inference remains opaque. For anyone who picked Apple silicon specifically to keep inference on-device, that's a trust-and-architecture question worth watching closely.

> Apple's On-Device AI Stack Gets Ground-Up Rebuild at WWDC 2026