Research

How we picked a language model

Dylan de Heer

When you commit to running a language model on the user's machine, the first thing that happens is the menu shrinks. Frontier models are out. Anything that needs more memory than a typical Mac has is out. Anything that takes ten seconds to start producing tokens is out, because nobody will wait.

What's left is a specific corner of the open-weight ecosystem, and the work is figuring out which model in that corner does the meeting summarization job best.

The constraints

We targeted Apple Silicon Macs with sixteen gigabytes of unified memory or more. We needed Q4 quantized models that fit comfortably alongside the rest of the operating system. We needed first-token latency under one second on an M-series chip. We needed quality that didn't make us wince when we read the summaries.
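
Written down as a config, those constraints look roughly like the sketch below; the names and types are illustrative, not how our build actually encodes them.

```swift
import Foundation

// Illustrative record of the shipping constraints described above.
// Names and exact values mirror the prose, not our build configuration.
struct ShippingConstraints {
    static let minimumUnifiedMemory: UInt64 = 16 * 1_073_741_824  // 16 GB, in bytes
    static let weightQuantization = "Q4"                          // ~4-bit weights
    static let firstTokenLatencyBudget: TimeInterval = 1.0        // seconds, on M-series
}
```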

That filtered the field to three families: Qwen3, Mistral Nemo, and the smaller GLM variants.

What we tested

We built an internal eval set of around four hundred meeting transcripts spanning sales calls, customer interviews, 1:1s, and a handful of clinical session simulations. For each one we had a human-written reference summary. We ran every candidate model against the set and scored on three things.

Completeness: did the summary contain the actual decisions and commitments? Accuracy: did anything in the summary contradict the transcript? Structure: was the output usable without editing?

Completeness mattered most. A summary that's missing the one thing you needed to remember is worse than a longer summary you have to scan.
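
For a sense of how the scoring worked mechanically, here's a simplified sketch; the field names, weights, and aggregation are illustrative, not our actual eval harness.

```swift
// Per-transcript scores against the human-written reference summary,
// each in 0.0–1.0. Names and weights are illustrative only.
struct SummaryEval {
    let completeness: Double  // decisions and commitments captured
    let accuracy: Double      // nothing contradicts the transcript
    let structure: Double     // usable without editing

    // Completeness weighted heaviest, matching how we rank the metrics.
    var weighted: Double {
        0.5 * completeness + 0.3 * accuracy + 0.2 * structure
    }
}

// Aggregate a candidate model's score over the whole eval set.
func meanWeightedScore(_ evals: [SummaryEval]) -> Double {
    guard !evals.isEmpty else { return 0 }
    return evals.map(\.weighted).reduce(0, +) / Double(evals.count)
}
```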

What won, the first time

Qwen3 in the four-to-eight billion parameter range produced the best balance for our workload. Mistral Nemo was close on accuracy and slightly weaker on completeness. GLM-4.7-Flash in Q4 was promising for the long-context cases but inconsistent on shorter meetings.

We shipped Qwen3 as the default with a Mistral Nemo option for users who wanted it. Both ran through mlx-swift-lm.

That answer held until Gemma 4 dropped.

Why we switched to Gemma 4

In April, Google released Gemma 4 under Apache 2.0, built from the same research as Gemini 3, with four sizes ranging from a 2B effective-parameter edge model up to a 31B dense model. Day-one MLX support meant we could pull it into our stack without writing a runtime around it. We re-ran the full eval.

The candidate that mattered for us was E4B, the four-billion effective-parameter model. It uses Per-Layer Embeddings, so the total parameter count is larger than what has to sit in fast memory at once: the per-layer embedding weights can stay outside the accelerator's working set and stream in as each layer needs them, which keeps the resident footprint close to that of a four-billion-parameter model. That trick is what makes it run comfortably on a sixteen-gigabyte Mac without trading off latency.
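
The arithmetic behind that claim is rough but worth making explicit. Treating Q4 as about four and a half bits per weight (real formats carry per-block scales on top of the 4-bit values), roughly four billion resident parameters come to a little over two gigabytes of weights, before the KV cache and the rest of the app.

```swift
// Back-of-envelope weight footprint at a given quantization width.
// Real Q4 formats carry per-block scales, so ~4.5 bits/weight is closer
// than a flat 4; the KV cache and activations add more on top.
func approxWeightGB(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1e9
}

let residentWeights = approxWeightGB(parameters: 4e9, bitsPerWeight: 4.5)
// ≈ 2.25 GB of weights for ~4B resident parameters, which leaves room
// on a 16 GB machine even with the rest of the system running.
```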

Three things came out of the re-eval.

On completeness, our most important metric, E4B beat Qwen3 by a margin that was visible in single-meeting reads, not only in the aggregate scores. The decisions and commitments buried at the end of long calls were caught more often.

On structure, Gemma 4's native function-calling and structured JSON output made our Templates feature dramatically cleaner to wire up. We had been wrapping the previous model's outputs in defensive parsing logic to recover from malformed extractions. Most of that code is gone; there's a sketch of what the new path looks like below.

On license, Apache 2.0 is the cleanest answer we could ask for when enterprise legal teams want to read the terms of every dependency in our stack themselves.
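
To make the structured-output point concrete, here's the shape of the decoding path it enables. The schema is a made-up example, not our actual Templates format; the point is that when the model reliably emits JSON matching a schema, parsing collapses to a single Codable pass.

```swift
import Foundation

// Hypothetical extraction schema for one template — not our real format.
struct MeetingExtraction: Codable {
    struct ActionItem: Codable {
        let owner: String
        let task: String
        let due: String?
    }
    let decisions: [String]
    let actionItems: [ActionItem]
}

// With schema-constrained output this is the whole parsing layer; the
// defensive regex-and-retry recovery code becomes unnecessary.
func parseExtraction(from modelOutput: String) throws -> MeetingExtraction {
    try JSONDecoder().decode(MeetingExtraction.self, from: Data(modelOutput.utf8))
}
```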

What's running now

Gemma 4 E4B is the default for every Weeve user. The 26B Mixture-of-Experts variant ships as an opt-in for users on higher-memory Macs who want better performance on long meetings. The MoE only activates around four billion parameters per token but needs all twenty-six billion loaded into memory, which is the price of that throughput. Both run through mlx-swift-lm. The model downloads once, in the background, and lives on your machine after that.

Qwen3 remains available as an alternative for anyone with less than 16 GB of RAM. On stronger hardware we don't see a compelling reason to pick it over E4B for the meeting summarization workload, but we'd rather offer the choice than remove it.
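
The selection itself amounts to a memory check at first launch, roughly the sketch below; the identifiers and threshold are illustrative rather than our shipping defaults code.

```swift
import Foundation

// Illustrative model identifiers and threshold; not our shipping values.
enum LocalModel: String {
    case gemma4E4B, gemma4MoE26B, qwen3
}

// E4B is the default; Qwen3 covers machines under 16 GB. The 26B MoE
// stays a deliberate opt-in rather than automatic, so it isn't chosen here.
func defaultModel() -> LocalModel {
    let sixteenGB: UInt64 = 16 * 1_073_741_824
    return ProcessInfo.processInfo.physicalMemory < sixteenGB ? .qwen3 : .gemma4E4B
}
```

Whichever model the check lands on, it runs through mlx-swift-lm and stays on the machine.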
