Building AI‑Native iOS Features: On‑device LLMs with Core ML and MLX

Ship a semantic search feature that feels instant, works offline, preserves privacy, and raises engagement — built natively on iOS with Core ML and MLX. Success looks like faster findability, longer sessions, and users trusting the app with their notes, docs, or content.

What Success Looks Like

  • Latency: <100ms for typical queries on modern iPhones
  • Privacy: zero network calls for core interactions; clear user consent
  • Engagement: +20–30% more successful searches; +10–15% longer sessions
  • Reliability: graceful degradation when indexing is interrupted; safe cancellation

The User Problem

Users don’t remember exact words — they remember ideas. Literal search makes them feel clumsy and slow. We want the app to understand meaning: “winter boarding checklist,” “Swift actor pattern,” “sound design notes” — even if phrased differently.

Our Constraints

  • On‑device by default (ANE/GPU/CPU), no per‑keystroke backend calls
  • Battery‑aware and memory‑bounded; performance budgets per screen
  • Simple, testable architecture; no mystery schedulers or hidden queues

The Plan

  1. Represent meaning with embeddings
  2. Make retrieval fast with a local index
  3. Keep it private and instant with aggressive caching
  4. Respect energy and memory budgets
  5. Tell a clear UX story with tight feedback loops

Prototyping on Mac with MLX

We began on Apple Silicon, where iteration speed wins. MLX let us test small transformer‑based embedding models, tune dimensions, and measure throughput across real text (notes, code snippets, short docs). We weren’t chasing leaderboard scores — we optimized for consistency and speed in our domain.
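
The measurement itself was unglamorous: time an embed call over representative texts and compare backends. A minimal harness of that shape, with the embed closure standing in for whichever model (MLX prototype or Core ML build) is under test:

import Foundation

// Illustrative throughput harness: the embed closure is a stand-in for
// whichever backend (MLX prototype or Core ML model) is being measured.
func embeddingsPerSecond(samples: [String], embed: (String) throws -> [Float]) rethrows -> Double {
    let start = Date()
    for text in samples { _ = try embed(text) }
    return Double(samples.count) / Date().timeIntervalSince(start)
}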

Core ML Conversion for iOS

Once our embedding model behaved well, we converted it to Core ML (coremltools) and compiled it into .mlmodelc assets. This brought ANE acceleration and stable APIs. We wrapped the model in a boring interface: “give me a string, I’ll return a float vector.” No surprises.

A Boring, Reliable Wrapper

import CoreML

final class TextEmbeddingModel {
    enum ModelError: Error { case outputMissing }
    private let model: MLModel

    init() {
        // The compiled model ships in the app bundle, so a missing asset is a
        // packaging error; failing fast here is intentional.
        guard let url = Bundle.main.url(forResource: "TextEmbed", withExtension: "mlmodelc") else {
            fatalError("TextEmbed.mlmodelc is missing from the app bundle")
        }
        let config = MLModelConfiguration()
        config.computeUnits = .all // let Core ML schedule on ANE/GPU/CPU as available
        model = try! MLModel(contentsOf: url, configuration: config)
    }

    /// Give me a string, I return a float vector.
    func embed(_ text: String) throws -> [Float] {
        let input = try MLDictionaryFeatureProvider(dictionary: ["text": text])
        let output = try model.prediction(from: input)
        guard let arr = output.featureValue(for: "embedding")?.multiArrayValue else {
            throw ModelError.outputMissing
        }
        // Copy the MLMultiArray into a plain [Float] so callers never touch Core ML types.
        var result = [Float](repeating: 0, count: arr.count)
        for i in 0..<arr.count { result[i] = arr[i].floatValue }
        return result
    }
}

Pipeline: Normalize, Cache, Index

The model is a component; the pipeline is the feature. We normalized inputs, cached aggressively, and isolated mutation with actors.

struct EmbeddingResult: Sendable { let vector: [Float]; let key: String }

actor EmbeddingCache {
    private var store: [String: [Float]] = [:]
    func get(_ key: String) -> [Float]? { store[key] }
    func put(_ key: String, _ vector: [Float]) { store[key] = vector }
}

struct TextPreprocessor {
    static func normalize(_ s: String) -> String { s.lowercased().trimmingCharacters(in: .whitespacesAndNewlines) }
    static func key(for s: String) -> String { String(s.hashValue) } // replace with stable hash
}

actor EmbeddingService {
    private let cache = EmbeddingCache()
    private let model = TextEmbeddingModel()

    func embed(_ text: String) async throws -> EmbeddingResult {
        let clean = TextPreprocessor.normalize(text)
        let key = TextPreprocessor.key(for: clean)
        if let cached = await cache.get(key) { return EmbeddingResult(vector: cached, key: key) }
        let vector = try model.embed(clean)
        await cache.put(key, vector)
        return EmbeddingResult(vector: vector, key: key)
    }
}
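
The key(for:) helper above flags its own weakness: String.hashValue changes between launches. A stable alternative, sketched here with CryptoKit rather than whatever hashing the app actually ships:

import CryptoKit
import Foundation

// Stable cache key: SHA-256 over the normalized text survives app relaunches,
// unlike String.hashValue. Sketch only; swap in whichever digest you prefer.
extension TextPreprocessor {
    static func stableKey(for s: String) -> String {
        let digest = SHA256.hash(data: Data(s.utf8))
        return digest.map { String(format: "%02x", $0) }.joined()
    }
}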

Local Retrieval with Cosine Similarity

Cosine similarity is simple and effective for semantic search. We kept writes serialized and reads fast.

actor VectorIndex {
    struct Item: Sendable { let id: String; let vector: [Float] }
    private var items: [Item] = []

    func upsert(_ item: Item) {
        if let idx = items.firstIndex(where: { $0.id == item.id }) { items[idx] = item } else { items.append(item) }
    }

    func topK(query: [Float], k: Int = 10) -> [Item] {
        let scored = items.map { ($0, cosine($0.vector, query)) }
        return scored.sorted(by: { $0.1 > $1.1 }).prefix(k).map { $0.0 }
    }

    private func cosine(_ a: [Float], _ b: [Float]) -> Float {
        var dot: Float = 0, na: Float = 0, nb: Float = 0
        for i in 0..<min(a.count, b.count) { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i] }
        return dot / (sqrt(na) * sqrt(nb) + 1e-6)
    }
}
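
For larger indexes, the scalar loop can be swapped for Accelerate's vDSP overlay without changing the math. A sketch, assuming equal-length vectors:

import Accelerate

// Vectorized cosine similarity; assumes both vectors share a length.
func fastCosine(_ a: [Float], _ b: [Float]) -> Float {
    let dot = vDSP.dot(a, b)
    let normProduct = (vDSP.sumOfSquares(a) * vDSP.sumOfSquares(b)).squareRoot()
    return dot / (normProduct + 1e-6)
}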

UX: Instant Feedback, Honest Ranking

Users type; results update. We debounced input, embedded the query, fetched top matches locally, and updated the UI — no waiting on a network round‑trip. We surfaced “why” explanations next to results (“matched concepts: winter boarding, checklist”).

@MainActor
final class SearchViewModel: ObservableObject {
    @Published var query: String = ""
    @Published var results: [Doc] = []
    private let embedder = EmbeddingService()
    private let index = VectorIndex()
    private var searchTask: Task<Void, Never>?

    func search(_ q: String) {
        query = q
        searchTask?.cancel() // a newer keystroke supersedes the in-flight query
        searchTask = Task { [weak self] in
            guard let self else { return }
            do {
                let emb = try await self.embedder.embed(q)
                guard !Task.isCancelled else { return }
                let local = await self.index.topK(query: emb.vector, k: 20)
                self.results = local.map(toDoc) // toDoc maps index items to display models
            } catch {
                // Keep the previous results on failure; surface a quiet error state if needed.
            }
        }
    }
}
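
The debounce lives at the view layer: each keystroke restarts a short delay, and only the last one in a burst reaches the embedder. A minimal sketch assuming the SearchViewModel above; the SearchScreen wrapper and the 250 ms figure are illustrative:

import SwiftUI

struct SearchScreen: View {
    @ObservedObject var viewModel: SearchViewModel
    @State private var debounce: Task<Void, Never>?

    var body: some View {
        TextField("Search your notes", text: $viewModel.query)
            .onChange(of: viewModel.query) { newValue in
                debounce?.cancel() // restart the window on every keystroke
                debounce = Task {
                    try? await Task.sleep(nanoseconds: 250_000_000) // ~250 ms
                    guard !Task.isCancelled else { return }
                    await viewModel.search(newValue)
                }
            }
    }
}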

Budgets and Profiling (ANE/Metal)

We set measurable budgets and stuck to them:

  • Concurrency: 4 tasks for search, 6 for background indexing (sketched below)
  • Memory: cap vector lengths; compress idle caches
  • Energy: no long‑running work triggered by typing
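
In code, the indexing budget is just a task group that never has more than a handful of embeddings in flight, starting a new one only as another finishes. A sketch assuming the EmbeddingService and VectorIndex above; the default mirrors the budget in the list:

func reindex(_ documents: [(id: String, text: String)],
             embedder: EmbeddingService,
             index: VectorIndex,
             maxConcurrent: Int = 6) async {
    await withTaskGroup(of: Void.self) { group in
        var pending = documents[...]

        // Seed the group up to the budget…
        for doc in pending.prefix(maxConcurrent) {
            group.addTask {
                if let result = try? await embedder.embed(doc.text) {
                    await index.upsert(.init(id: doc.id, vector: result.vector))
                }
            }
        }
        pending = pending.dropFirst(maxConcurrent)

        // …then start one new embedding only when another finishes,
        // so indexing never exceeds maxConcurrent tasks in flight.
        for await _ in group {
            guard let doc = pending.first else { continue }
            pending = pending.dropFirst()
            group.addTask {
                if let result = try? await embedder.embed(doc.text) {
                    await index.upsert(.init(id: doc.id, vector: result.vector))
                }
            }
        }
    }
}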

We used Instruments — Energy, Time Profiler, Allocations, Concurrency — and added signposts around embedding and retrieval.

import os

let signposter = OSSignposter(subsystem: "com.app", category: "ai")

func signposted<T>(_ name: StaticString, _ op: () async throws -> T) async rethrows -> T {
    // Begin/end an os_signpost interval around the work so it shows up in Instruments.
    let state = signposter.beginInterval(name)
    defer { signposter.endInterval(name, state) }
    return try await op()
}
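
Call sites then wrap the hot paths; for example (embedder is the EmbeddingService above, queryText is a placeholder):

// Measure one embedding end to end; the interval shows up under the "ai" category.
let embedding = try await signposted("embed") { try await embedder.embed(queryText) }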

Guardrails and Trust (Consent, Accessibility, Explainability)

We treated AI like a respectful assistant:

  • Transparent consent and on‑device defaults
  • Clear controls to pause/stop generation
  • Constrained prompts and output lengths
  • Simple “why” explanations to reduce surprise

Persistence and Resilience

We persisted embeddings and outputs with lightweight indexing, batched writes, and versioned caches. When the model changed, we invalidated cleanly and rebuilt in the background. Checkpoints let long jobs resume.
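
Concretely, "versioned caches" can be as simple as a directory per model version: a model update switches directories, and the stale one is deleted and rebuilt in the background. A sketch, not the app's actual storage layer:

import Foundation

// Vectors are persisted under a directory named for the embedding model
// version; changing the version invalidates the old cache wholesale.
struct PersistentVectorStore {
    let modelVersion: String

    private var directory: URL {
        let base = FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
        return base.appendingPathComponent("embeddings-\(modelVersion)", isDirectory: true)
    }

    func save(_ vector: [Float], forKey key: String) throws {
        try FileManager.default.createDirectory(at: directory, withIntermediateDirectories: true)
        let data = vector.withUnsafeBufferPointer { Data(buffer: $0) }
        try data.write(to: directory.appendingPathComponent(key), options: .atomic)
    }

    func load(forKey key: String) -> [Float]? {
        guard let data = try? Data(contentsOf: directory.appendingPathComponent(key)) else { return nil }
        return data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
    }
}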

App Intents (Shortcuts)

We exposed quick actions and Shortcuts so users could jump directly to “ideas about audio” or “notes on actors,” making the feature feel native beyond the app.
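
The Shortcut is a thin App Intent over the same on-device pipeline. A minimal sketch; the intent name, phrasing, and dialog are illustrative:

import AppIntents

struct SemanticSearchIntent: AppIntent {
    static var title: LocalizedStringResource = "Search Notes by Meaning"

    @Parameter(title: "Query")
    var query: String

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // Hand query to the shared embedding + retrieval pipeline here,
        // then summarize what was found. The dialog below is a placeholder.
        return .result(dialog: "Searched your notes by meaning.")
    }
}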

Keeping the Binary Lean

We shipped a base embedding model, downloaded larger variants on demand, and audited assets ruthlessly. Smaller apps install more, start faster, and crash less.
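
Downloading a larger variant means fetching the raw model, compiling it on device, and loading the result. A sketch with a caller-supplied URL and no retry or integrity checks:

import CoreML
import Foundation

// Fetch, compile, and load a larger embedding model on demand.
// Error handling and resumable downloads are left to the real app.
func loadDownloadedModel(from remoteURL: URL) async throws -> MLModel {
    let (fileURL, _) = try await URLSession.shared.download(from: remoteURL)
    let compiledURL = try await MLModel.compileModel(at: fileURL) // produces an .mlmodelc
    return try MLModel(contentsOf: compiledURL)
}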

Testing and CI/CD

We tested actor‑isolated caches and indexes, verified cancellation, used fixtures for embeddings, and avoided sleeps. In CI, we staged model assets and gated releases with end‑to‑end tests. Budgets were checked on physical devices before TestFlight.
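
One flavor of those tests: drive the actor-isolated index with fixture vectors and assert the ranking directly, no sleeps. Assumes the VectorIndex shown earlier:

import XCTest

final class VectorIndexTests: XCTestCase {
    func testTopKPrefersCloserVectors() async {
        let index = VectorIndex()
        await index.upsert(.init(id: "close", vector: [1, 0, 0]))
        await index.upsert(.init(id: "far", vector: [0, 1, 0]))

        // A query pointing mostly along the first axis should rank "close" first.
        let top = await index.topK(query: [0.9, 0.1, 0], k: 1)
        XCTAssertEqual(top.first?.id, "close")
    }
}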

Results

  • Latency consistently under 100ms for typical queries
  • Dramatic increase in successful searches and longer sessions
  • Fewer support tickets about “can’t find my note”
  • Positive reviews citing speed and trust (“works offline, feels instant”)

Lessons

  • The model is not the feature; the pipeline is
  • Ownership and isolation prevent heisenbugs and copy storms
  • Budgets make performance a product choice, not luck
  • On‑device by default earns trust and word‑of‑mouth

Implementation Checklist

  • [ ] Define objective and metrics (latency, privacy, engagement)
  • [ ] Prototype embeddings with MLX on Mac (tune dimensions/tokenization)
  • [ ] Convert to Core ML (.mlmodelc) and wrap a stable API
  • [ ] Build pipeline: normalize, cache, index
  • [ ] Implement local retrieval and “why” explanations
  • [ ] Set concurrency/memory/energy budgets; add signposts; profile on device
  • [ ] Persist vectors; batch writes; version caches; checkpoints
  • [ ] Integrate App Intents (Shortcuts) for quick actions
  • [ ] Keep binary lean; stage assets; test on physical devices
  • [ ] Monitor results; iterate

FAQs

  • What is on‑device AI for iOS?
    • Running models locally on iPhone/iPad using Core ML/Metal/ANE, keeping latency low and data private.
  • Core ML vs MLX — which should I use?
    • Use MLX on Mac for rapid prototyping and custom layers; convert to Core ML for production iOS deployment with ANE acceleration.
  • Can iPhones run LLMs?
    • Yes, small distilled models are practical for templated generation, short summaries, and classification with rationale.
  • How do I keep battery usage low?
    • Cap concurrency, use ANE where available, measure with Instruments, avoid long tasks on user input.
  • How do I ensure privacy?
    • Avoid per‑keystroke network calls; keep embeddings and retrieval on device; offer opt‑in for remote expansion.
  • How do I tune search quality?
    • Normalize inputs, cache aggressively, and tune embedding dimensions/tokenization for your domain; surface “why” explanations.

Where This Goes Next

We’ve reused the pipeline to power intent suggestions, lightweight categorization, and short previews. The same embedding cache and index became a platform inside the app. Small, reliable pieces compound.
