
How AI food tracking actually works — and where it still trips up

What the vision model sees when you snap your lunch, why portion estimation is the hard part, and how to get the most accurate log in under ten seconds.

Tags: AI · food logging · how it works

When you point your camera at a plate of food, a lot has to happen in the couple of seconds before your macros show up. This post is a plain-English walkthrough of what the model is actually doing, what it's good at, and the specific places it still misses — so you know when to trust the estimate and when to adjust.

The pipeline in three steps

1. Identify the foods. The image is passed to a vision model that labels what's on the plate. "Grilled chicken breast. Jasmine rice. Steamed broccoli." It can pick out three to five items on a single plate reliably; dense mixed dishes (like a stir-fry) compress into fewer labels.

2. Estimate portion size. This is the hard part. The model looks at relative area, depth cues from the plate rim, and typical serving size for the food in question. It converts pixels to grams — which, on a photo taken from above, with no known reference, is fundamentally a guess. Our model errs on the side of "average serving" and gives you a slider to adjust.

3. Compute macros. Once the food and portion are locked in, macros are a database lookup: grams → protein, carbs, fat, calories. This step is boring and accurate. All of the interesting error lives in step 2.
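To make step 3 concrete, here's a minimal sketch of that lookup. The per-100 g values and food names below are illustrative placeholders, not our production database:

```python
# Step 3 as code: once food and grams are known, macros are pure arithmetic.
# Per-100 g values here are rough illustrative figures, not real database rows.

MACROS_PER_100G = {
    "grilled chicken breast": {"protein": 31.0, "carbs": 0.0, "fat": 3.6},
    "jasmine rice (cooked)":  {"protein": 2.7,  "carbs": 28.0, "fat": 0.3},
    "steamed broccoli":       {"protein": 2.8,  "carbs": 7.0,  "fat": 0.4},
}

def compute_macros(food: str, grams: float) -> dict:
    per_100 = MACROS_PER_100G[food]
    scale = grams / 100.0
    macros = {k: round(v * scale, 1) for k, v in per_100.items()}
    # Calories via the standard 4/4/9 rule: 4 kcal/g for protein and carbs, 9 for fat.
    macros["calories"] = round(
        4 * macros["protein"] + 4 * macros["carbs"] + 9 * macros["fat"]
    )
    return macros
```

A 150 g chicken breast comes out to about 46.5 g protein and 235 kcal — and if the slider moves the grams, everything downstream just rescales.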

What it's genuinely good at

  • Single plates, from above, in reasonable light. The "phone over the dinner table" shot is the format the model is trained on most.
  • Common restaurant and home-cooked dishes. Chicken bowls, pasta plates, burgers, salads — things that look like what they are.
  • Multiple items per plate. The detection will typically catch 3–5 separate foods without you having to tap-select.
  • Side-by-side portions. Two tacos look like two tacos; it counts them.

Where it trips up

  • Dense, ambiguous food. A ground-beef chili and a lentil stew look very similar through a lens. Voice or manual search is faster for those.
  • Liquid measurements. Oils, dressings, and sauces have wildly different caloric density per volume. If a dish is swimming in something, nudge the fat slider up.
  • Look-alike foods with different densities. A croissant and a bagel are roughly the same shape; their macros aren't. The model is aware of this class of ambiguity but can't resolve it from a photo alone.
  • Unusual angles. Low-angle shots and close-ups lose the size reference. Top-down is best.
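The liquids point is worth a quick back-of-envelope calculation. Fat carries roughly 9 kcal per gram, and a tablespoon of olive oil weighs about 13.5 g — both round figures for illustration, not precise nutrition data:

```python
# Why "swimming in sauce" matters: oil is nearly invisible in a photo
# but adds real calories. Figures are rough, for illustration only.

OIL_KCAL_PER_GRAM = 9      # fat is ~9 kcal per gram
TBSP_OIL_GRAMS = 13.5      # ~13.5 g of olive oil per tablespoon

def hidden_oil_calories(tablespoons: float) -> float:
    """Calories contributed by oil the camera can't really see."""
    return tablespoons * TBSP_OIL_GRAMS * OIL_KCAL_PER_GRAM
```

Two tablespoons in a dressing is over 200 kcal — which is exactly the kind of miss the fat slider exists to correct.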

What we do, step by step

  1. Auto-crop and normalize. We strip simulator/phone chrome and resize to a standard resolution so the model sees a consistent input.
  2. Detection pass. The image goes to our vision provider with explicit instructions: no image retention, no training use, return structured output with portion in grams. We include a short list of recently logged foods from your diary as a prior — if you had chicken last Tuesday, the model weights that slightly higher today.
  3. Fallback matching. If confidence drops below a threshold, we offer a manual search modal instead of serving a bad guess. A wrong log is worse than one you had to type.
  4. Your correction is a signal. When you adjust a portion, that adjustment sticks for that food in your diary. The system learns your typical serving sizes over time.
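Steps 3 and 4 above can be sketched in a few lines. The function names and the 0.6 threshold are hypothetical, chosen to illustrate the logic rather than mirror our production values:

```python
# Sketch of the fallback + correction loop: serve nothing rather than a bad
# guess, and let user corrections override the model's "average serving".
# Names and the 0.6 threshold are illustrative, not production values.

CONFIDENCE_THRESHOLD = 0.6

def resolve_detection(label, confidence, model_grams, user_portions):
    """Return (food, grams), or None to signal 'open the manual search modal'."""
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # a wrong log is worse than one you had to type
    # Prefer the user's own typical serving over the model's average guess.
    grams = user_portions.get(label, model_grams)
    return (label, grams)

def record_correction(label, corrected_grams, user_portions):
    """A portion adjustment sticks for that food in the diary."""
    user_portions[label] = corrected_grams
```

The second time you log the same food, the system starts from your corrected portion instead of the population average — that's the "learns your serving sizes" behavior in miniature.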

The trade-off in one sentence

Photo logging turns 90 seconds of typing into five seconds of aiming, at the cost of some portion-estimation slop you can fix with one slider. For most meals on most days, that's the right trade.

When to skip the camera

  • You're logging something you eat identically every day — make it a custom food and one-tap it in.
  • The food is packaged with a nutrition label — the barcode scanner will always be more accurate.
  • You know the exact weight you cooked — enter it manually.

The camera is an accelerator for the messy middle, not a replacement for the other inputs. Use whichever is fastest for the meal in front of you.