What stack did you use to build the salon voice agent?

Vapi for realtime voice orchestration, GPT-4o as the LLM, Deepgram for speech-to-text, ElevenLabs for text-to-speech, n8n cloud for webhook orchestration, Supabase Postgres for the database (multi-tenant with Row-Level Security), Next.js 14 on Vercel for the operator dashboard, and Twilio underneath Vapi for the US DID. Total monthly cost is roughly $35-55 at pilot volume.

Why bake the service catalog into the prompt instead of using a tool call?

The salon has 85 services plus 10 active specials, which fits in about 4 KB of markdown — a rounding error against GPT-4o's 128K context window. Baking the catalog into the system prompt eliminates a 1-2 second tool round-trip on every service mention. Updating prices means editing one row in Supabase and re-running the publish script, which takes 3 seconds.

How do you handle speech-to-text errors on uncommon names?

Deepgram reliably mishears non-common names (Satyarthi becomes Satyati on the first try). The fix lives in the system prompt: anything beyond common American/English names triggers an explicit 'could you spell your last name?' step, captured letter-by-letter. Callers with common names like John Smith are not asked to spell them, so the friction is targeted.

How do you prevent the agent from passing 'yes' as a phone number?

Defense-in-depth across five layers: a CRITICAL prompt rule forbidding non-numeric values for customer_phone, n8n webhook normalization that strips formatting and keeps the last 10 digits, placeholder rejection for known fake numbers (1000000000, 1234567890), a Postgres CHECK constraint on customers.phone enforcing E.164 format, and a caller_phone/customer_phone split where the dashboard surfaces any mismatch.

How does the AI agent get routed to the salon's existing phone number?

Three layers, ordered by complexity. Layer 1 is a carrier-level conditional forward — the owner dials *71 followed by the Vapi number on the salon's existing line, and from then on calls that ring 4-6 times without being picked up forward automatically to the AI. Layer 2 is a full Twilio number port with TwiML time-of-day routing. Layer 3 is making Vapi the front door and having the AI transfer to the desk during hours. The pilot started at Layer 1 for $0 and 30 seconds of setup.

I Built an AI Receptionist for a Hair Salon. Here's What Actually Mattered.

The four engineering decisions that separate a working voice agent from a demo that falls apart on the first real call.

Voice agent answering a salon phone, booking flowing into a dashboard

TL;DR

I built a 24/7 voice-booking agent for Hair Time Salon in Franklin Park, NJ, on the Vapi platform with GPT-4o, Deepgram STT, ElevenLabs TTS, and a Supabase + Next.js back office. The agent answers the salon's existing line when nobody picks up, books appointments straight into a multi-tenant Postgres schema, and lets the owner manage everything from one iPad dashboard. Total monthly cost at pilot volume is roughly $35-55, well under what one captured after-hours booking is worth. Most of the engineering value was not in the model choice. It was in four decisions that almost every "I built a voice agent" tutorial skips: baking the service catalog into the prompt to remove a tool round-trip, fighting the STT and TTS edge cases that mangle phone numbers and dates, enforcing phone normalization at every single layer, and choosing the right call-routing strategy from the salon's existing carrier line. A 30-day production results report is coming on June 22, 2026.

Key Takeaways

The platform choice is the easy part. Vapi handles orchestration, Deepgram handles STT, ElevenLabs handles TTS, GPT-4o is the brain. The hard work is everything around these pieces.
Bake the catalog into the prompt, not a tool. 85 services and 10 specials live in the system prompt itself, rendered from Supabase at publish time. This eliminates a tool round-trip on every service mention and saves 1-2 seconds each time.
STT and TTS are the silent killers. Deepgram mishears non-common names and ElevenLabs mangles ordinals and colon-separated times. Both have to be fixed in the prompt, not the audio layer.
Phone normalization is a defense-in-depth problem. Without enforcement at every layer (prompt rule + webhook check + database CHECK constraint + placeholder rejection), the agent will eventually pass the word "yes" as customer_phone and you will lose a booking.
Start call routing at the carrier, graduate to Twilio when you grow. A 30-second *71 conditional forward on the salon's existing line is the cheapest reliable way to get the AI in front of real callers without porting numbers or changing how the salon operates.

The Salon Problem
The Stack
Decision 1: No find_service Tool
Decision 2: Beating Deepgram and ElevenLabs Edge Cases
Decision 3: Phone Normalization Defense-in-Depth
Decision 4: Where the AI Sits in the Call Path
Three Caller-ID Scenarios
Multi-Tenant From Day One
What the 30-Day Report Will Cover

The owner of Hair Time Salon runs a one-to-two-chair shop in Franklin Park, NJ. He cuts hair from 10 AM to 8 PM most days. While he is cutting, he cannot pick up the phone. After 8 PM and on Sundays, nobody can. Each missed call is a potential booking that leaks to whoever a customer dials next.

The simplest possible AI receptionist would solve this. Answer the phone when he cannot. Quote prices accurately. Pick a slot. Take a name and a phone. Write a row to a database he can read on an iPad.

That sentence describes maybe 5% of the work. Below is the other 95%.

The Salon Problem

Three constraints shaped every decision.

No staff to babysit the system. The salon owner is a hairdresser. He is not going to log into a dashboard at 8 AM to "check the queue." Anything that requires daily ops attention is dead on arrival.

Customers are real humans calling from cell phones. They mumble. They say "uh" and "yeah, that one" and "the special one with the color." They have names like Priya Sharma and Satyarthi that Deepgram routinely mishears. They give phone numbers as "five-one-seven, six-eight, six-six, eight-nine-two" with weird groupings.

The salon has 85 services and 10 active specials, with rules. Senior cut $20 is Mondays and Tuesdays only. Some specials are cash only. Some are women only. Some require a minimum service duration. The agent has to know all of this without sounding like it is reading from a script.

Hold these three constraints in mind. Every decision below is in service of one of them.

The Stack

Customer phone (732-419-3941)
   │
   │  *71 conditional forward
   ▼
Vapi DID (732-813-0948)
   │
   ├──► Deepgram (STT) ──► GPT-4o (LLM) ──► ElevenLabs (TTS) ──► back to caller
   │
   └──► tool calls ──► n8n cloud (7 webhooks)
                          │
                          ▼
                       Supabase Postgres
                       (multi-tenant, RLS-scoped)
                          ▲
                          │
                       Next.js dashboard on Vercel
                       (owner reads on iPad)

| Layer | Choice | Why |
| --- | --- | --- |
| Telephony | Vapi + Twilio underneath | Vapi handles the realtime orchestration; Twilio gives us a US DID and SIP control |
| STT | Deepgram (Vapi-managed) | Lowest end-to-end latency for English; works with Vapi's realtime pipeline |
| LLM | GPT-4o (Vapi-managed) | 128K context lets us bake the full catalog into the prompt; function-calling is reliable |
| TTS | ElevenLabs (Vapi-managed) | Voice quality is several notches above the Vapi defaults; matters for elderly callers |
| Orchestration | n8n cloud | Visual workflows let the owner inspect tool calls later if something breaks |
| Database | Supabase (Postgres + RLS) | Multi-tenant from day one; auth + RLS comes free; Postgres is the right boring choice |
| Dashboard | Next.js 14 on Vercel | Fast iteration; same stack as the rest of my work |

This is not the cheapest stack. It is the stack with the fewest moving parts I had to wire myself. Saving a few dollars a month by self-hosting any of these layers would have added engineering time worth more than the savings for years.

Decision 1: No `find_service` Tool

The naive design has the agent call a find_service(query) tool every time the caller mentions a service. Caller says "men's haircut," agent calls the tool, tool returns SVC-M01 — Regular Cut, $25, 30 min. Then the agent quotes the price.

That round-trip is 1-2 seconds of latency. Per service mentioned. Per call.

The salon has 85 services. They fit in roughly 2,500 characters of markdown. The same is true for the 10 specials. Total catalog is about 4 KB.

GPT-4o's context window is 128,000 tokens. Spending 4 KB (on the order of 1,000 tokens) on a baked-in catalog is a rounding error. So the agent does not call a find_service tool at all. The catalog is rendered directly into the system prompt at publish time.

Here is the publish flow:

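# scripts/publish_vapi.py (simplified)
# Not shown in this simplified listing: read() loads a file as text,
# render_markdown_table() turns rows into a markdown table, supabase and
# vapi are configured API clients, and assistant_id is the target assistant.

def publish():
    # Load the prompt template containing the two placeholders used below.
    template = read("vapi-prompt.template.md")

    # Pull the current catalog from Supabase.
    services = supabase.table("services").select("*").execute()
    specials = supabase.table("specials").select("*").execute()

    # Render each result set as a markdown table.
    services_table = render_markdown_table(services.data)
    specials_table = render_markdown_table(specials.data)

    # Substitute the rendered tables into the template.
    final_prompt = (
        template
        .replace("{{SERVICES_TABLE}}", services_table)
        .replace("{{SPECIALS_TABLE}}", specials_table)
    )

    # Push the finished system prompt to the Vapi assistant.
    vapi.patch_assistant(assistant_id, system_prompt=final_prompt)

if __name__ == "__main__":
    publish()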
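The listing above leaves render_markdown_table undefined. A minimal sketch of what such a helper could look like follows; the column names in the sample row are hypothetical, and the real script may render the tables differently.

```python
def render_markdown_table(rows: list) -> str:
    """Render a list of row dicts as a pipe-delimited markdown table."""
    if not rows:
        return "_(none)_"
    columns = list(rows[0].keys())  # column order follows the first row

    def cell(value) -> str:
        # Pipes and newlines would break the table layout, so neutralize them.
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ")

    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(cell(row.get(col)) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Hypothetical row shaped like the "Regular Cut" example quoted earlier.
sample = [{"id": "SVC-M01", "name": "Regular Cut", "price": 25, "minutes": 30}]
print(render_markdown_table(sample))
```

Escaping pipes matters here because a single stray "|" in a service name would shift every later column in that row, and the model would read the wrong price.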

The final prompt is about 16,000 characters. The agent knows every service ID, price, duration, gender restriction, and day restriction without ever calling a tool.

Updating prices means editing a row in Supabase and re-running the publish script. Takes 3 seconds.

The trade-off is that the prompt is larger on every call. GPT-4o caches stable prefixes, so this is amortized across the conversation. The latency saved on every service lookup pays for the slightly heavier first turn many times over.

This pattern generalizes. If you have a small-to-medium reference catalog (under 50 KB of structured text) and you call it on most turns, the prompt is the better place for it. Tools are for actions that mutate state, hit external APIs, or return data too large to inline.

Decision 2: Beating Deepgram and ElevenLabs Edge Cases

This was the hardest engineering work, and almost none of it shows up in the architecture diagram.

Deepgram misheard non-common names

Test call: "My name is Priya Satyarthi." Deepgram transcribed: "My name is Priya Satyati."

That single STT error is fatal. The booking gets written under the wrong name. The owner cannot find the customer when she calls back. The customer thinks the salon is incompetent.

The fix is in the prompt:

**If the name sounds uncommon or has unusual spelling**
(anything beyond common American/English names like John, Sarah,
Smith, Johnson) — ask them to spell the last name:

  "Thanks, {first_name} — could you spell your last name for me?"

Capture the spelling letter-by-letter. This is critical because
Deepgram STT mishears non-common names (e.g., "Satyarthi" → "Satyati").

The agent now reliably asks callers with Indian, Latin, Slavic, and East Asian names to spell them. It does not ask "John Smith" to spell anything. The line between common and uncommon is fuzzy, but GPT-4o handles the fuzziness well in practice.

ElevenLabs mangled times and ordinals

Test call: agent reads back "Saturday, May 23rd at 2:15 PM."

What ElevenLabs actually said: "Saturday, May twenty rd at two two PM."

The "rd" got pronounced as the letters "R, D." The colon-separated time was read as two separate numbers. Customer hangs up confused.

Two fixes in the prompt. First, every time is spelled out in words:

- Top of the hour → just the hour: "two PM" (not "two oh oh")
- Quarter past / half past / quarter to → spelled fractions:
  "two fifteen", "two thirty", "two forty-five"

❌ Wrong: "I have 1:45, 2:00, or 2:15"
✅ Right: "I have one forty-five, two PM, or two fifteen"

Second, every date is spelled with explicit ordinal words. The prompt includes a full mapping table for days 1 through 31:

- 1 → "first", 2 → "second", 3 → "third", ...
- 21 → "twenty-first", 22 → "twenty-second", 23 → "twenty-third", ...
- 30 → "thirtieth", 31 → "thirty-first"

Examples: "Saturday, May twenty-third" ✓
  Never "May 23rd" ✗
  Never "May twenty three" ✗
  Never "May 20 third" ✗

This kind of detail is invisible if you only test by typing into a chat playground. You only find these failures by making actual phone calls and listening to the audio.
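The same spoken forms can also be produced deterministically, for example if a slot-lookup tool returned speech-ready strings instead of leaving the conversion to the model. The sketch below is an illustration of the rules quoted above, not the shipped implementation; the function names are hypothetical, and it assumes slots fall on quarter hours.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
HOURS = ["twelve", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven"]
# Index 0 is unused so that ORDINALS[n] is the word for day n (1 to 19).
ORDINALS = ["", "first", "second", "third", "fourth", "fifth", "sixth",
            "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
            "thirteenth", "fourteenth", "fifteenth", "sixteenth",
            "seventeenth", "eighteenth", "nineteenth"]
QUARTERS = {15: "fifteen", 30: "thirty", 45: "forty-five"}


def spoken_day(day: int) -> str:
    """Day of month as an ordinal word, e.g. 23 -> 'twenty-third'."""
    if not 1 <= day <= 31:
        raise ValueError(f"not a day of the month: {day}")
    if day < 20:
        return ORDINALS[day]
    if day == 20:
        return "twentieth"
    if day == 30:
        return "thirtieth"
    tens = "twenty" if day < 30 else "thirty"
    return f"{tens}-{ORDINALS[day % 10]}"


def spoken_date(dt: datetime) -> str:
    return f"{WEEKDAYS[dt.weekday()]}, {MONTHS[dt.month - 1]} {spoken_day(dt.day)}"


def spoken_time(dt: datetime) -> str:
    """Top of the hour gets AM/PM; quarter hours are spoken as plain words."""
    hour = HOURS[dt.hour % 12]
    if dt.minute == 0:
        suffix = "AM" if dt.hour < 12 else "PM"
        return f"{hour} {suffix}"
    if dt.minute in QUARTERS:
        return f"{hour} {QUARTERS[dt.minute]}"
    raise ValueError("this sketch assumes slots fall on quarter hours")


slot = datetime(2026, 5, 23, 14, 15)
print(f"{spoken_date(slot)} at {spoken_time(slot)}")
```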

Phone digit confirmation

Phone numbers are a worse problem than dates. "Five-one-seven-six-eight-six-six-eight-nine-two" said quickly is genuinely hard for any STT system. And even when Deepgram gets it right, the agent reading it back as "5-1-7-6-8-6-6-8-9-2" sounds robotic.

The prompt forces a specific cadence:

Read back the phone digit-by-digit in groups of 3-3-4:

  "Just to confirm — that's nine five four, six eight six,
   six eight nine two, right?"

If they correct any digit, re-read the corrected number and ask again.
Don't skip this — STT mishears phone digits often, and a wrong number
means we can't contact the customer.

The cadence matches how humans speak phone numbers in the US. Customers correct one digit on roughly one in five calls. Without the explicit re-read step, half of those wrong numbers would have been silently committed to the database.
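For testing, it helps to have the expected read-back string available in code so that call transcripts can be compared against it. A small sketch of the 3-3-4 grouping, with a hypothetical function name:

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spoken_phone(phone: str) -> str:
    """Spell a 10-digit US number as words grouped 3-3-4."""
    digits = [c for c in phone if c in "0123456789"]
    if len(digits) != 10:
        raise ValueError("expected exactly 10 digits")
    words = [DIGIT_WORDS[int(d)] for d in digits]
    groups = (words[0:3], words[3:6], words[6:10])
    # Commas between groups give the TTS a natural pause.
    return ", ".join(" ".join(group) for group in groups)


print(spoken_phone("954-686-6892"))
```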

Decision 3: Phone Normalization Defense-in-Depth

This is the one I almost shipped without and would have regretted.

Early in testing, the agent occasionally called book_appointment with customer_phone set to the string "yes". Or "this one". Or "same as before".

This happens because GPT-4o is helpful. When the caller says "yes use that one" in response to "is this the best number to reach you," the model interprets "that one" as the phone value. From the model's point of view this is a reasonable inference. From the database's point of view, "yes" is not a phone number, and the booking is now broken.

The fix is enforcement at every layer.

Layer 1: The prompt.

**Phone rule (CRITICAL):** Whenever a tool needs `customer_phone`,
you MUST pass actual digits — either the caller's stated number,
or the caller ID above. NEVER pass the literal strings "yes", "no",
"this one", "same", "unknown", or any non-numeric reply.

This catches 95% of cases. GPT-4o respects strong negative constraints when they are explicit.

Layer 2: The webhook.

The n8n workflow that fronts book_appointment normalizes every incoming phone string. Strips spaces, parens, dashes. Keeps the last 10 digits. Rejects anything that does not match ^[0-9]{10}$ after stripping.

Layer 3: Placeholder rejection.

The webhook also rejects known placeholders. 1000000000, 1234567890, 0000000000, 5555555555. These appear when the model is hallucinating a number it should not have.
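Layers 2 and 3 together amount to a few lines of logic. The real checks run inside an n8n workflow; the Python below is only a sketch of the same rules, and the exact set of stripped characters (including "+" and ".") is an assumption.

```python
import re

# Placeholder numbers listed above.
PLACEHOLDERS = {"1000000000", "1234567890", "0000000000", "5555555555"}


def normalize_phone(raw: str) -> str:
    """Return a 10-digit US number, or raise ValueError."""
    # Strip formatting characters only, so words like "yes" stay visible
    # to the validation step instead of being silently erased.
    stripped = re.sub(r"[\s().+-]", "", raw)
    candidate = stripped[-10:]  # drops a leading country code such as "1"
    if not re.fullmatch(r"[0-9]{10}", candidate):
        raise ValueError(f"not a phone number: {raw!r}")
    if candidate in PLACEHOLDERS:
        raise ValueError(f"placeholder number rejected: {raw!r}")
    return candidate


print(normalize_phone("+1 (732) 813-0948"))
```

Raising an error rather than returning an empty value means the failure travels back to the agent as a tool error, which is what prompts it to ask the caller again.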

Layer 4: Database CHECK constraint.

The customers.phone column has a Postgres CHECK constraint that enforces E.164 format. If layers 1, 2, and 3 all fail, the database refuses the row. The agent gets back an error and is forced to ask again.
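The post does not show the constraint itself. One plausible shape, assuming the 10-digit value from the webhook is prefixed with "+1" before insert, is sketched below. The constraint name and the DDL string are hypothetical; the Python function mirrors the same pattern so it can be unit-tested without a database.

```python
import re

# E.164: a leading "+", a non-zero first digit, at most 15 digits in total.
E164_PATTERN = r"^\+[1-9][0-9]{1,14}$"

# Hypothetical migration using the same pattern with the Postgres "~" operator.
CHECK_DDL = f"""
ALTER TABLE customers
  ADD CONSTRAINT customers_phone_e164
  CHECK (phone ~ '{E164_PATTERN}');
"""


def to_e164_us(ten_digits: str) -> str:
    """Prefix a 10-digit US number with +1 and verify the E.164 shape."""
    candidate = "+1" + ten_digits
    if not re.fullmatch(E164_PATTERN, candidate):
        raise ValueError(f"not E.164: {candidate!r}")
    return candidate


print(to_e164_us("7328130948"))
```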

Layer 5: caller_phone vs customer_phone split.

Every booking captures two phone fields. caller_phone is what Vapi reports as the caller ID — this is ground truth. customer_phone is what the caller stated during the conversation. The dashboard surfaces any mismatch so the owner can call back if needed.
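The mismatch check can be as simple as comparing the last 10 digits of each field. This is a sketch with hypothetical function names; it assumes a blocked caller ID should not count as a mismatch, since there is nothing to compare against.

```python
from typing import Optional


def last10(phone: Optional[str]) -> str:
    """Keep only ASCII digits and return the last 10 of them."""
    digits = "".join(c for c in (phone or "") if c in "0123456789")
    return digits[-10:]


def phones_mismatch(caller_phone: Optional[str], customer_phone: Optional[str]) -> bool:
    """True when a caller ID is present and differs from the stated number."""
    caller = last10(caller_phone)
    if not caller:
        return False  # blocked or missing caller ID: nothing to compare
    return caller != last10(customer_phone)


print(phones_mismatch("+17328130948", "954-686-6892"))
```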

Five layers. Each catches something the others miss. None of them are clever. All of them are necessary.

Decision 4: Where the AI Sits in the Call Path

The salon already has a published number — the one on their website, their Google listing, their business cards. We could (a) keep that number and forward to the AI, or (b) ask everyone to call the AI directly.

Asking customers to call a different number is a non-starter. So the question is how we forward.

Three layers, ordered by complexity and control:

Layer 1: Carrier-level forwarding.

The owner dials *71 17328130948 from the salon's phone, hangs up, done. From then on, any call that the salon doesn't pick up within 4-6 rings forwards to the AI. To cancel: dial *73.

Cost: $0. Setup time: 30 seconds. Most US business landline and VoIP carriers support some form of no-answer forwarding, though the exact star codes vary by carrier. The salon's existing voicemail, caller ID display, and billing all stay the same.

Downsides: no time-of-day logic. The forward fires whether it is 10 AM or 11 PM. For our pilot this is a feature, not a bug — overflow during the day is exactly where the missed bookings hide.

Layer 2: Twilio + TwiML.

Port the salon's number into Twilio. Write about 30 lines of code that return TwiML and route based on time of day:

on incoming call:
  if current_time within salon_hours:
    forward → salon desk phone (over SIP)
  else:
    forward → Vapi number
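TwiML itself is static XML, so the time check has to live in whatever code generates it. A minimal sketch of that handler logic follows, assuming hours of 10 AM to 8 PM with Sundays closed (as described earlier in the post), a timestamp already in salon-local time, and a made-up SIP address for the desk phone.

```python
from datetime import datetime, time
from xml.sax.saxutils import escape

OPEN, CLOSE = time(10, 0), time(20, 0)  # assumed hours: 10 AM to 8 PM
CLOSED_WEEKDAYS = {6}                   # Sunday (Monday is 0)


def is_open(now: datetime) -> bool:
    """now must already be expressed in the salon's local time."""
    return now.weekday() not in CLOSED_WEEKDAYS and OPEN <= now.time() < CLOSE


def routing_twiml(now: datetime, desk_sip_uri: str, vapi_number: str) -> str:
    """Return TwiML that dials the desk during hours and Vapi otherwise."""
    if is_open(now):
        target = f"<Sip>{escape(desk_sip_uri)}</Sip>"
    else:
        target = f"<Number>{escape(vapi_number)}</Number>"
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Dial>{target}</Dial></Response>"
    )


# Hypothetical desk address; the Vapi number is the one from the diagram.
print(routing_twiml(datetime(2026, 5, 23, 21, 0),
                    "sip:desk@example.com", "+17328130948"))
```

Keeping is_open separate from the XML generation makes the hours rule testable on its own, which matters once holiday closures or per-location hours are added.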

Cost: $1/month for the Twilio number, plus minimal usage. Setup: 3-5 business days porting window. Owner loses some carrier-level features (their existing voicemail, the carrier-side caller ID display name) but those are rarely missed.

Upside: programmatic control. The hours rule lives in code. We can change it without the owner touching anything. We can do "forward to AI if desk hasn't answered in 4 rings AND it's a weekday." We get full call analytics in the Twilio dashboard.

Layer 3: Vapi as the front door.

All calls go directly to Vapi. The AI's prompt has a transfer rule: "During business hours, immediately say 'let me get you the front desk' and SIP-transfer the call." After hours, the AI takes the booking itself.

Upside: every single call gets logged. The owner literally never touches anything.

Downside: every caller hears 1-2 seconds of AI before being transferred during business hours. Some customers might not love that.

What we shipped

Layer 1. *71 conditional forward.

The plan is to run it for 20-30 real calls, look at how many the AI handled vs the owner picked up, count bookings created vs missed, then decide whether to graduate to Layer 2 or 3.

This is not premature optimization. It is sequenced infrastructure. You earn the right to Layer 3 by surviving the volume that justifies it.

Three Caller-ID Scenarios

The agent handles three scenarios for caller identity, all baked into the prompt.

Scenario A: returning customer matched by caller ID.

Welcome back, Priya! How can I help you today?

(after they answer)

Want me to confirm under the number ending in 1234,
or a different one?

The agent never asks for the phone again unless the caller offers a new one. The visit shows up under the same customer_id automatically.

Scenario B: new customer with caller ID available.

Got it. Can I get your first and last name?
(captures name; spells last name if uncommon)

And is the best number to reach you the one you're calling from,
ending in 9-8-7-6, or a different one?

Caller ID is the default. The caller can override if they prefer a different number.

Scenario C: caller ID blocked or anonymous.

Can I get your name and the best number to reach you?

The agent never tries to pretend the blocked caller ID is real. The phone gets explicitly asked for, digit-confirmed in groups of 3-3-4, then committed.

Three branches of the same intent. Distinct in the prompt, distinct in the conversation flow, but the user experience feels seamless because the agent picks the right branch automatically.
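The branch selection reduces to two questions: is there a usable caller ID, and does it match a known customer? In the shipped system the prompt handles this; the sketch below only illustrates the decision, with hypothetical function names and fictional phone numbers.

```python
from typing import Optional, Set


def digits_only(value: Optional[str]) -> str:
    return "".join(c for c in (value or "") if c in "0123456789")


def pick_scenario(caller_id: Optional[str], known_phones: Set[str]) -> str:
    """Return 'A', 'B', or 'C' for the three caller-ID scenarios."""
    caller = digits_only(caller_id)[-10:]
    if len(caller) < 10:
        return "C"  # blocked, anonymous, or missing caller ID
    if caller in known_phones:
        return "A"  # returning customer matched by caller ID
    return "B"      # new customer with caller ID available


known = {"2015550123"}
print(pick_scenario("+12015550123", known))
```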

Multi-Tenant From Day One

Hair Time has a second location in North Brunswick, NJ. It is not live on the system yet. But the Supabase schema treats every domain table as multi-tenant from the very first row.

Every record has a salon_id. Row-Level Security policies scope queries by salon. The book_appointment stored function takes p_salon_id as a parameter. n8n resolves the right salon_id from message.call.assistantId in the Vapi webhook payload, so two different Vapi assistants can hit the same n8n endpoint and land in different rows.
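The resolver step is a lookup from assistant ID to salon. The real one lives in n8n; this Python sketch shows the same idea, with made-up IDs on both sides of the mapping and only the message.call.assistantId path taken from the payload described above.

```python
# Hypothetical IDs; real values would come from Vapi and public.salons.
ASSISTANT_TO_SALON = {
    "asst-franklin-park": "salon-franklin-park",
    "asst-north-brunswick": "salon-north-brunswick",
}


def resolve_salon_id(payload: dict) -> str:
    """Map message.call.assistantId in a webhook payload to a salon_id."""
    assistant_id = payload.get("message", {}).get("call", {}).get("assistantId")
    if assistant_id not in ASSISTANT_TO_SALON:
        # Failing loudly is safer than writing a booking to the wrong tenant.
        raise LookupError(f"unknown assistant: {assistant_id!r}")
    return ASSISTANT_TO_SALON[assistant_id]


payload = {"message": {"call": {"assistantId": "asst-franklin-park"}}}
print(resolve_salon_id(payload))
```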

Adding North Brunswick is roughly three hours of engineering:

1. Insert a row into public.salons with the address, hours, phone.
2. Create a second Vapi assistant in the dashboard. Provision a second Twilio DID. Attach.
3. Re-publish the prompt template against the new salon (the script accepts --salon-slug).
4. Update n8n's salon_id resolver to map the new assistant ID.
5. Create the owner's dashboard account with scripts/create_owner.py.
6. Owner dials *71 on the existing North Brunswick line.

After that, two salons share one database, one dashboard, one prompt template, and one set of tools. Adding a third later is the same six steps.

The cost of building this in at the start was maybe four extra hours during initial schema design. The cost of retrofitting it later would have been a week.

What the 30-Day Report Will Cover

This post is the build log. On June 22, 2026, after roughly 30 days of pilot calls, I will publish a follow-up with real numbers:

How many calls the AI actually handled
How many turned into bookings
What percentage were after-hours (the bookings that would have been missed otherwise)
Total Vapi spend for the month
Cost per booking
Owner's subjective take after 30 days
The bugs we hit in production and how they got fixed

The follow-up will live at /case-studies/hair-time-voice-agent, which is the structured client-facing version of this story. This blog post is the engineering depth. The case study is the outcome.

If you are thinking about building something like this for your own business — a salon, a clinic, a dental office, a restaurant, a small-business front desk — the pattern in this post is reusable. The platform decisions might be different by the time you read this. The four engineering problems are not.

A voice agent dashboard quietly running on an iPad on the salon counter

A voice agent that handles real customers on a real phone line is a different category of thing from a voice agent that handles a demo. The four decisions above are the gap. The 30-day report tells you whether they were enough.