Observability and monitoring
Your AI features work. Users are happy. But something will break eventually—and you want to know before your users tell you. Observability means understanding what's happening in production without staring at logs all day.
Outcome
Set up monitoring for your AI features using AI Gateway analytics, structured logging, and cost alerts.
Fast track
- Vercel → AI Gateway → Analytics to review requests/tokens/costs breakdown
- Add `console.log(JSON.stringify({ event, requestId, productSlug, duration, tokens }))` to AI functions
- AI Gateway → Settings → Alerts: set daily cost threshold ($10) and error rate threshold (5%)
Hands-on exercise 3.3
Add observability to your AI features:
Requirements:
- Review AI Gateway analytics (requests, tokens, costs, errors)
- Add structured logging to `summarizeReviews` and `getReviewInsights`
- Configure alerts for cost thresholds and error rates
- Test logging by generating some AI requests
Implementation hints:
- AI Gateway dashboard shows real-time and historical data
- Log request metadata (product slug, token count, duration)
- Alerts can notify via email, Slack, or webhooks
- Start with conservative thresholds and adjust based on real usage
AI Gateway analytics
AI Gateway tracks everything automatically. No code changes needed.
Find your analytics:
- Vercel Dashboard → AI Gateway
- Click Analytics tab
What you'll see:
Overview (Last 7 days)
────────────────────────────────────────
Total requests: 2,847
Success rate: 99.2%
Avg latency: 1,847ms
Total tokens: 2.1M
Estimated cost: $6.32
Requests by model:
├─ anthropic/claude-sonnet-4.5 2,614 (91.8%)
├─ anthropic/claude-haiku-3.5 198 (7.0%)
└─ openai/gpt-4-turbo 35 (1.2%)
Errors:
├─ 429 Rate Limited: 18
├─ 503 Service Error: 4
└─ Timeout: 1
Key metrics to watch:
| Metric | Healthy | Investigate | Alert |
|---|---|---|---|
| Success rate | Above 99% | 95-99% | Below 95% |
| Avg latency | Under 2s | 2-5s | Over 5s |
| Error rate | Under 1% | 1-5% | Over 5% |
| Daily cost | Within budget | 2x budget | 5x budget |
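If you'd rather encode these thresholds than remember them, a small check can flag unhealthy numbers from a script or cron job. This is a rough sketch: `MetricSnapshot` and `checkMetrics` are hypothetical names, and you'd supply the numbers yourself from the dashboard or your own logs.

// Hypothetical health check that encodes the table above as code.
// The shape and names are illustrative, not an AI Gateway API.
type MetricSnapshot = {
  successRate: number; // 0-1
  avgLatencyMs: number;
  errorRate: number; // 0-1
  dailyCostUsd: number;
  dailyBudgetUsd: number;
};

export function checkMetrics(m: MetricSnapshot): string[] {
  const flags: string[] = [];
  if (m.successRate < 0.95) flags.push("Success rate below 95% (alert)");
  else if (m.successRate < 0.99) flags.push("Success rate 95-99% (investigate)");
  if (m.avgLatencyMs > 5000) flags.push("Avg latency over 5s (alert)");
  else if (m.avgLatencyMs > 2000) flags.push("Avg latency 2-5s (investigate)");
  if (m.errorRate > 0.05) flags.push("Error rate over 5% (alert)");
  else if (m.errorRate > 0.01) flags.push("Error rate 1-5% (investigate)");
  if (m.dailyCostUsd > 5 * m.dailyBudgetUsd) flags.push("Cost over 5x budget (alert)");
  else if (m.dailyCostUsd > 2 * m.dailyBudgetUsd) flags.push("Cost over 2x budget (investigate)");
  return flags;
}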
Understanding the dashboard
Requests over time: Shows request volume by hour/day. Look for:
- Unexpected spikes (bot traffic? viral post?)
- Sudden drops (deployment broke something?)
- Patterns (peak hours, quiet periods)
Latency distribution: Shows p50, p90, and p99 latency (see the percentile sketch after this section). Look for:
- p50 ~1-2s (typical AI generation)
- p99 under 5s (occasional slow requests are normal)
- p99 over 10s (something's wrong)
Token usage: Shows input vs output tokens. Look for:
- Input tokens >> Output tokens (normal for summarization)
- Unexpected token growth (prompts getting longer?)
- Spikes correlating with specific products (long reviews?)
Cost breakdown: Shows cost by model and day. Look for:
- Steady growth (normal with traffic)
- Sudden jumps (fallbacks triggering? new feature?)
- Model distribution (are fallbacks firing more than expected?)
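If you log durations yourself (as shown in the next section), you can recompute these percentiles from your own data. A minimal sketch using the nearest-rank method; the `percentile` helper is illustrative only.

// Illustrative percentile calculation (nearest-rank) over logged durations.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

const durations = [1200, 1450, 1800, 2100, 950, 4800, 1600]; // ms, pulled from your logs
console.log({
  p50: percentile(durations, 50),
  p90: percentile(durations, 90),
  p99: percentile(durations, 99),
});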
Adding structured logging
AI Gateway tracks aggregate metrics. For debugging specific requests, add your own logging.
Update lib/ai-summary.ts to add logging while keeping the "use cache" directive from Lesson 3.1:
import { generateText, generateObject } from "ai";
import { cacheLife, cacheTag } from "next/cache";
import { Product, ReviewInsights, ReviewInsightsSchema } from "./types";

export async function summarizeReviews(product: Product): Promise<string> {
  "use cache";
  cacheLife("hours");
  cacheTag(`product-summary-${product.slug}`);

  const startTime = Date.now();
  const requestId = crypto.randomUUID();

  console.log(JSON.stringify({
    event: "ai_request_start",
    requestId,
    function: "summarizeReviews",
    productSlug: product.slug,
    reviewCount: product.reviews.length,
    timestamp: new Date().toISOString(),
  }));

  const averageRating =
    product.reviews.reduce((acc, review) => acc + review.stars, 0) /
    product.reviews.length;

  const prompt = `Write a summary of the reviews for the ${product.name} product...`; // Your existing prompt

  try {
    const { text, usage } = await generateText({
      model: "anthropic/claude-sonnet-4.5",
      prompt,
      maxTokens: 1000,
      temperature: 0.75,
    });

    const duration = Date.now() - startTime;

    console.log(JSON.stringify({
      event: "ai_request_success",
      requestId,
      function: "summarizeReviews",
      productSlug: product.slug,
      duration,
      inputTokens: usage?.promptTokens,
      outputTokens: usage?.completionTokens,
      totalTokens: usage?.totalTokens,
      timestamp: new Date().toISOString(),
    }));

    return text.trim();
  } catch (error) {
    const duration = Date.now() - startTime;

    console.error(JSON.stringify({
      event: "ai_request_error",
      requestId,
      function: "summarizeReviews",
      productSlug: product.slug,
      duration,
      error: error instanceof Error ? error.message : "Unknown error",
      timestamp: new Date().toISOString(),
    }));

    throw new Error("Unable to generate review summary. Please try again.");
  }
}

What this logs:
Request start:
{
  "event": "ai_request_start",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "reviewCount": 12,
  "timestamp": "2024-01-15T14:32:01.234Z"
}

Request success:
{
  "event": "ai_request_success",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "duration": 2341,
  "inputTokens": 847,
  "outputTokens": 89,
  "totalTokens": 936,
  "timestamp": "2024-01-15T14:32:03.575Z"
}

Request error:
{
  "event": "ai_request_error",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "duration": 5023,
  "error": "Rate limit exceeded",
  "timestamp": "2024-01-15T14:32:06.257Z"
}

Why structured logging?
- Searchable — Find all errors for a specific product
- Parseable — Tools like Vercel Logs, Datadog, or Axiom can parse JSON
- Correlatable — Request IDs link start → success/error
- Measurable — Track duration, tokens, and patterns over time
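Once more than one function logs this way, it helps to centralize the JSON shape so every event stays consistent. A sketch with a hypothetical `logAiEvent` helper; this is a convention, not part of the AI SDK or AI Gateway.

// Hypothetical helper (e.g. lib/ai-logging.ts) so every AI function
// emits the same JSON shape with a timestamp attached.
type AiLogEvent = {
  event: "ai_request_start" | "ai_request_success" | "ai_request_error";
  requestId: string;
  function: string;
  productSlug: string;
  [key: string]: unknown; // duration, tokens, error, etc.
};

export function logAiEvent(entry: AiLogEvent) {
  const line = JSON.stringify({ ...entry, timestamp: new Date().toISOString() });
  if (entry.event === "ai_request_error") {
    console.error(line);
  } else {
    console.log(line);
  }
}

// Usage inside summarizeReviews:
// logAiEvent({ event: "ai_request_start", requestId, function: "summarizeReviews", productSlug: product.slug });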
Viewing logs in Vercel
Find your logs:
- Vercel Dashboard → Your project
- Click Logs tab
- Filter by:
  - Level: `error` (show only errors)
  - Time: Last hour/day/week
  - Search: `ai_request_error` or `productSlug: mower`
Example log search:
// Find all AI errors in the last 24 hours
ai_request_error
// Find all requests for a specific product
productSlug: aquaHeat
// Find slow requests (>3 seconds)
duration > 3000
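The same searches work offline if you export logs to a file with one JSON object per line. A minimal sketch assuming a hypothetical logs.ndjson export in the shape logged above.

// Illustrative offline analysis of exported JSON logs.
// Assumes logs.ndjson contains one JSON object per line.
import { readFileSync } from "node:fs";

const lines = readFileSync("logs.ndjson", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const errors = lines.filter((l) => l.event === "ai_request_error");
const slow = lines.filter((l) => typeof l.duration === "number" && l.duration > 3000);

console.log(`errors: ${errors.length}, slow (>3s): ${slow.length}`);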
Setting up alerts
Don't wait for users to tell you something's broken. Set up alerts.
AI Gateway alerts:
- Vercel Dashboard → AI Gateway → Settings
- Scroll to Alerts
- Configure thresholds:
Cost alerts:
├─ Daily spend > $10 → Email notification
├─ Daily spend > $50 → Slack notification
└─ Daily spend > $100 → PagerDuty (wake someone up)
Error alerts:
├─ Error rate > 5% → Email notification
├─ Error rate > 10% → Slack notification
└─ Error rate > 25% → PagerDuty
Latency alerts:
├─ p99 latency > 10s → Email notification
└─ p99 latency > 30s → Slack notification
Project-level alerts (Vercel):
- Project → Settings → Notifications
- Configure:
- Deployment failures
- Function errors
- Usage thresholds
Start conservative: It's better to get too many alerts initially and tune them down than to miss something critical.
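If you route alerts to a webhook, you need an endpoint to receive them. Here's a minimal sketch of a Next.js route handler that forwards an alert to a Slack incoming webhook; the alert payload fields and the ALERT_SLACK_WEBHOOK_URL environment variable are assumptions, so check the actual body your alerts send before relying on it.

// app/api/alerts/route.ts — hypothetical webhook receiver.
// The payload shape here is assumed; inspect the real alert body first.
export async function POST(request: Request) {
  const alert = await request.json();

  // Forward a human-readable message to a Slack incoming webhook.
  await fetch(process.env.ALERT_SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `AI alert: ${alert.name ?? "unknown"} — ${alert.message ?? JSON.stringify(alert)}`,
    }),
  });

  return Response.json({ received: true });
}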
Debugging production issues
When something goes wrong, here's how to investigate:
1. Check AI Gateway dashboard
- Error spike? What time did it start?
- Which model? (primary or fallback?)
- What error codes? (429, 503, timeout?)
2. Check Vercel logs
- Search for `ai_request_error`
- Filter to the timeframe
- Look for patterns (same product? same error?)
3. Correlate with deployments
- Did a deployment happen right before the errors?
- Check deployment logs for build issues
4. Check provider status
- Anthropic Status
- OpenAI Status
- If a provider is down, your fallbacks should be handling it
Common issues and causes:
| Symptom | Likely cause | Fix |
|---|---|---|
| Sudden 429 spike | Rate limit hit | Add fallback model, implement backoff |
| All requests failing | Bad API key | Check env vars in Vercel |
| Slow responses | Provider degradation | Fallbacks should kick in |
| Cost spike | Cache not working | Check "use cache" directive and cacheLife config |
| Token overflow | Long reviews | Truncate input or paginate |
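For the rate-limit row, "implement backoff" can be as small as retrying with an increasing delay. A rough sketch; the `withBackoff` wrapper and its retry counts are illustrative, not a library API.

// Hypothetical retry wrapper with exponential backoff for 429-style failures.
async function withBackoff<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries) throw error;
      const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: const { text } = await withBackoff(() => generateText({ model, prompt }));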
Production monitoring checklist
Before going live, verify:
- AI Gateway analytics accessible — Can you see requests, costs, errors?
- Structured logging added — JSON logs with request IDs and metadata
- Cost alerts configured — Get notified before bills surprise you
- Error alerts configured — Know when things break
- Fallbacks working — Verified backup models are configured
- Logs searchable — Can find specific requests when debugging
Try it
- Explore your AI Gateway dashboard:
  - How many requests have you made?
  - What's your average latency?
  - Any errors?
- Add structured logging:
  - Update `summarizeReviews` with the logging code
  - Generate a few summaries
  - Check Vercel logs for the JSON output
- Set up a cost alert:
  - AI Gateway → Settings → Alerts
  - Set a daily spend threshold (even $1 for testing)
  - Verify you receive the alert notification
- Simulate an error (optional):
  - Temporarily break your API key
  - Visit a product page
  - Check that error logs appear correctly
  - Fix the API key
Commit
git add lib/ai-summary.ts
git commit -m "feat(observability): add structured logging to AI functions"
git push

Done-when
- Explored AI Gateway analytics dashboard
- Understand key metrics (requests, latency, tokens, costs)
- Added structured logging to AI functions
- Configured at least one alert (cost or error)
- Know how to search logs in Vercel
- Understand debugging workflow for production issues
What's next
Your AI features are observable. You'll know when things break, why they broke, and how much it's costing. Time to wrap up the course and review everything you've built.