Developer Ports 11 Model Families to Apple's Core AI On-Device Framework

Apple's Core AI framework quietly shipped as part of the iOS 27 and macOS 27 betas, and one developer has just made it significantly more useful. On Thursday, John Rocky published coreai-model-zoo on GitHub, a port of 11 major model families to Apple's new .aimodel format. The collection includes Qwen3.5, Qwen3.6, GLM-4.7-Flash, Gemma 4, LFM2.5, and Granite 4.0-H, plus vision-language variants that handle image understanding alongside text.

What's Actually Available

The zoo covers an impressive range of architectures: Qwen3.5 in 0.8B and 2B parameter sizes (Apache-2.0), the dense Qwen3.6-27B, and the Mixture-of-Experts Qwen3.6-35B-A3B with ~3B active parameters. GLM-4.7-Flash brings Multi-head Latent Attention to Apple's framework for the first time — a significant technical achievement given MLA's complexity across 47 layers. Gemma 4 comes in E2B and E4B variants, including official quantized QAT int4 weights. Vision models include Qwen3-VL (2B/4B/8B) and Gemma 4 E2B VL for image+text tasks. Object detection fans get RF-DETR with NMS-free inference at 33-39 FPS live on iPhone 17 Pro, plus instance segmentation in six sizes.

Real Performance Numbers

Benchmarks were measured on the iOS 27/macOS 27 beta using Apple's coreai-pipelined GPU engine, with no custom kernels. On the iPhone 17 Pro GPU, Qwen3.5-0.8B hits 71.9 tok/s, LFM2.5-1.2B reaches 45.4 tok/s, and Gemma 4 E2B manages 30.3 tok/s (30.7 for the QAT variant). The Neural Engine (ANE) column is less flattering for the one figure quoted: Gemma 4 E2B reaches 6 tok/s there, well below its 30.3 tok/s on the GPU. On an M4 Max Mac the numbers climb sharply: Qwen3.5-0.8B runs at 210 tok/s, LFM2.5-1.2B at 276.5 tok/s, and even Qwen3.6-35B-A3B (Mac-only due to memory requirements) hits 30.9 tok/s despite its 35B total parameters, since only ~3B of them are active per token.
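To put the quoted figures side by side, here is a minimal sketch that computes the Mac-to-iPhone speedup for the models that appear in both lists. The throughput numbers are copied from the paragraph above; the dictionary layout and the speedups helper are illustrative names, not part of the repository.

```python
# Throughput figures (tokens/second) as quoted in the article.
IPHONE_17_PRO_GPU = {
    "Qwen3.5-0.8B": 71.9,
    "LFM2.5-1.2B": 45.4,
    "Gemma 4 E2B": 30.3,
}
M4_MAX_GPU = {
    "Qwen3.5-0.8B": 210.0,
    "LFM2.5-1.2B": 276.5,
    "Qwen3.6-35B-A3B": 30.9,
}


def speedups(slow, fast):
    """Return the fast/slow throughput ratio for every model present in both tables."""
    return {name: fast[name] / slow[name] for name in slow if name in fast}


if __name__ == "__main__":
    for name, ratio in speedups(IPHONE_17_PRO_GPU, M4_MAX_GPU).items():
        print(f"{name}: {ratio:.1f}x faster on M4 Max")
```

On these numbers the gap is uneven: roughly 2.9x for Qwen3.5-0.8B but about 6.1x for LFM2.5-1.2B, so the Mac advantage depends heavily on the model.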

Developer Resources That Actually Help

Beyond the models themselves, Rocky includes PyTorch → .aimodel conversion scripts, verified knowledge-base docs on compression techniques and custom Metal kernels, and CoreAIRunner, a Swift package that drives .aimodel bundles, including architectures that Apple's standard runtime does not cover. The repository documents stateful KV cache handling, AOT compilation strategies, compute-unit routing rules, and the full Swift runtime API. Two demo apps ship with in-app model download: CoreAIChat runs Gemma 4 E2B in GPU, ANE, and pipelined modes alongside Qwen3.5 variants, while QwenChatFast showcases static kernel optimization.

The Beta Caveats

This is shipping on beta software, and Rocky doesn't hide the rough edges. A critical bug with in-graph KV-write causes crashes; the documented workarounds include an input-mask escape hatch, and the issue is tracked with Apple as FB23024751. Dense models like Qwen3.6-27B read the entire model for every token, whereas the MoE variant touches only its ~3B active parameters. That makes the dense models slower, even though their output, at an int8 precision said to match fp16, is potentially of higher quality.
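The dense-versus-MoE gap described above is largely a matter of how many weight bytes must be read per generated token. The sketch below is a back-of-the-envelope estimate under loud assumptions: both models are treated as int8 (one byte per parameter), KV cache and activation traffic are ignored, and the 500 GB/s bandwidth is a hypothetical round number, not a measured figure for any device. The function names are illustrative only.

```python
def weight_bytes_per_token(active_params_billion, bytes_per_param):
    """Approximate weight bytes read to decode one token.

    Ignores KV cache reads, activations, and any MoE router or shared weights.
    """
    return active_params_billion * 1e9 * bytes_per_param


def bandwidth_bound_tok_s(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode speed if weight reads were the only cost."""
    bytes_per_token = weight_bytes_per_token(active_params_billion, bytes_per_param)
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    INT8 = 1.0               # bytes per parameter; assumed for both models
    BANDWIDTH_GB_S = 500.0   # hypothetical memory bandwidth, not a measurement

    dense = bandwidth_bound_tok_s(27, INT8, BANDWIDTH_GB_S)  # Qwen3.6-27B, all weights
    moe = bandwidth_bound_tok_s(3, INT8, BANDWIDTH_GB_S)     # ~3B active parameters
    print(f"dense ceiling: {dense:.1f} tok/s")
    print(f"MoE ceiling:   {moe:.1f} tok/s")
    print(f"ratio:         {moe / dense:.1f}x")
```

Whatever bandwidth value is plugged in, the ratio between the two ceilings stays at 27/3 = 9x, which is why an MoE model with more total parameters can still decode faster than a smaller dense one. Real throughput will sit below these ceilings.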

Key Takeaways

11 model families now run natively on iOS 27/macOS 27 Core AI with downloadable .aimodel files on Hugging Face
M4 Max GPU throughput is impressive: Qwen3.5-0.8B hits 210 tok/s, LFM2.5-1.2B reaches 276.5 tok/s
Vision-language models (Qwen3-VL, Gemma 4 VL) and object detection (RF-DETR) extend beyond text-only use cases
Full conversion toolchain included — PyTorch → .aimodel with verification scripts and compression documentation

The Bottom Line

This is the kind of ecosystem work that Apple rarely does itself. One developer filling the model gap for Core AI while it's still in beta shows where Apple's priorities aren't — and that's exactly why open source bridges matter. If you're building anything on-device for iOS 27, this zoo is your starting point.
