
Observability and monitoring

Your AI features work. Users are happy. But something will break eventually—and you want to know before your users tell you. Observability means understanding what's happening in production without staring at logs all day.

Outcome

Set up monitoring for your AI features using AI Gateway analytics, structured logging, and cost alerts.

Fast track

  1. Vercel → AI Gateway → Analytics to review requests/tokens/costs breakdown
  2. Add console.log(JSON.stringify({ event, requestId, productSlug, duration, tokens })) to AI functions
  3. AI Gateway → Settings → Alerts: set daily cost threshold ($10) and error rate threshold (5%)

Hands-on exercise 3.3

Add observability to your AI features:

Requirements:

  1. Review AI Gateway analytics (requests, tokens, costs, errors)
  2. Add structured logging to summarizeReviews and getReviewInsights
  3. Configure alerts for cost thresholds and error rates
  4. Test logging by generating some AI requests

Implementation hints:

  • AI Gateway dashboard shows real-time and historical data
  • Log request metadata (product slug, token count, duration)
  • Alerts can notify via email, Slack, or webhooks
  • Start with conservative thresholds and adjust based on real usage

AI Gateway analytics

AI Gateway tracks everything automatically. No code changes needed.

Find your analytics:

  1. Vercel Dashboard → AI Gateway
  2. Click Analytics tab

What you'll see:

Overview (Last 7 days)
────────────────────────────────────────
Total requests:     2,847
Success rate:       99.2%
Avg latency:        1,847ms
Total tokens:       2.1M
Estimated cost:     $6.32

Requests by model:
├─ anthropic/claude-sonnet-4.5    2,614 (91.8%)
├─ anthropic/claude-3.5-haiku      198 (7.0%)
└─ openai/gpt-4-turbo               35 (1.2%)

Errors:
├─ 429 Rate Limited:    18
├─ 503 Service Error:    4
└─ Timeout:              1

Key metrics to watch:

Metric         Healthy          Investigate    Alert
Success rate   Above 99%        95-99%         Below 95%
Avg latency    Under 2s         2-5s           Over 5s
Error rate     Under 1%         1-5%           Over 5%
Daily cost     Within budget    2x budget      5x budget

Understanding the dashboard

Requests over time: Shows request volume by hour/day. Look for:

  • Unexpected spikes (bot traffic? viral post?)
  • Sudden drops (deployment broke something?)
  • Patterns (peak hours, quiet periods)

Latency distribution: Shows p50, p90, p99 latency. Look for:

  • p50 ~1-2s (typical AI generation)
  • p99 under 5s (occasional slow requests are normal)
  • p99 over 10s (something's wrong)

Token usage: Shows input vs output tokens. Look for:

  • Input tokens >> Output tokens (normal for summarization)
  • Unexpected token growth (prompts getting longer?)
  • Spikes correlating with specific products (long reviews?)

Cost breakdown: Shows cost by model and day. Look for:

  • Steady growth (normal with traffic)
  • Sudden jumps (fallbacks triggering? new feature?)
  • Model distribution (are fallbacks firing more than expected?)

Adding structured logging

AI Gateway tracks aggregate metrics. For debugging specific requests, add your own logging.

Update lib/ai-summary.ts to add logging while keeping the "use cache" directive from Lesson 3.1:

lib/ai-summary.ts
import { generateText, generateObject } from "ai";
import { cacheLife, cacheTag } from "next/cache";
import { Product, ReviewInsights, ReviewInsightsSchema } from "./types";
 
export async function summarizeReviews(product: Product): Promise<string> {
  "use cache";
  cacheLife("hours");
  cacheTag(`product-summary-${product.slug}`);
 
  const startTime = Date.now();
  const requestId = crypto.randomUUID();
 
  console.log(JSON.stringify({
    event: "ai_request_start",
    requestId,
    function: "summarizeReviews",
    productSlug: product.slug,
    reviewCount: product.reviews.length,
    timestamp: new Date().toISOString(),
  }));
 
  const averageRating =
    product.reviews.reduce((acc, review) => acc + review.stars, 0) /
    product.reviews.length;
 
  const prompt = `Write a summary of the reviews for the ${product.name} product...`; // Your existing prompt
 
  try {
    const { text, usage } = await generateText({
      model: "anthropic/claude-sonnet-4.5",
      prompt,
      maxOutputTokens: 1000,
      temperature: 0.75,
    });
 
    const duration = Date.now() - startTime;
 
    console.log(JSON.stringify({
      event: "ai_request_success",
      requestId,
      function: "summarizeReviews",
      productSlug: product.slug,
      duration,
      inputTokens: usage?.inputTokens,
      outputTokens: usage?.outputTokens,
      totalTokens: usage?.totalTokens,
      timestamp: new Date().toISOString(),
    }));
 
    return text.trim();
  } catch (error) {
    const duration = Date.now() - startTime;
 
    console.error(JSON.stringify({
      event: "ai_request_error",
      requestId,
      function: "summarizeReviews",
      productSlug: product.slug,
      duration,
      error: error instanceof Error ? error.message : "Unknown error",
      timestamp: new Date().toISOString(),
    }));
 
    throw new Error("Unable to generate review summary. Please try again.");
  }
}

What this logs:

Request start:

{
  "event": "ai_request_start",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "reviewCount": 12,
  "timestamp": "2024-01-15T14:32:01.234Z"
}

Request success:

{
  "event": "ai_request_success",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "duration": 2341,
  "inputTokens": 847,
  "outputTokens": 89,
  "totalTokens": 936,
  "timestamp": "2024-01-15T14:32:03.575Z"
}

Request error:

{
  "event": "ai_request_error",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "duration": 5023,
  "error": "Rate limit exceeded",
  "timestamp": "2024-01-15T14:32:06.257Z"
}

Why structured logging?

  • Searchable — Find all errors for a specific product
  • Parseable — Tools like Vercel Logs, Datadog, or Axiom can parse JSON
  • Correlatable — Request IDs link start → success/error
  • Measurable — Track duration, tokens, and patterns over time
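
The exercise also asks for logging in getReviewInsights. Below is a minimal sketch of the same pattern applied to that function in the same lib/ai-summary.ts file, reusing the imports shown above. It assumes your Lesson 3.2 version uses generateObject with ReviewInsightsSchema; the prompt and cache tag here are placeholders, so keep whatever you already have.

lib/ai-summary.ts (continued)
export async function getReviewInsights(product: Product): Promise<ReviewInsights> {
  "use cache";
  cacheLife("hours");
  cacheTag(`product-insights-${product.slug}`); // keep your existing tag

  const startTime = Date.now();
  const requestId = crypto.randomUUID();

  console.log(JSON.stringify({
    event: "ai_request_start",
    requestId,
    function: "getReviewInsights",
    productSlug: product.slug,
    reviewCount: product.reviews.length,
    timestamp: new Date().toISOString(),
  }));

  try {
    const { object, usage } = await generateObject({
      model: "anthropic/claude-sonnet-4.5",
      schema: ReviewInsightsSchema,
      prompt: `Extract insights from the reviews for ${product.name}...`, // Your existing prompt
    });

    console.log(JSON.stringify({
      event: "ai_request_success",
      requestId,
      function: "getReviewInsights",
      productSlug: product.slug,
      duration: Date.now() - startTime,
      inputTokens: usage?.inputTokens,
      outputTokens: usage?.outputTokens,
      totalTokens: usage?.totalTokens,
      timestamp: new Date().toISOString(),
    }));

    return object;
  } catch (error) {
    console.error(JSON.stringify({
      event: "ai_request_error",
      requestId,
      function: "getReviewInsights",
      productSlug: product.slug,
      duration: Date.now() - startTime,
      error: error instanceof Error ? error.message : "Unknown error",
      timestamp: new Date().toISOString(),
    }));

    throw new Error("Unable to generate review insights. Please try again.");
  }
}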

Viewing logs in Vercel

Find your logs:

  1. Vercel Dashboard → Your project
  2. Click Logs tab
  3. Filter by:
    • Level: error (show only errors)
    • Time: Last hour/day/week
    • Search: ai_request_error or productSlug: mower

Example log search:

// Find all AI errors in the last 24 hours
ai_request_error

// Find all requests for a specific product
productSlug: aquaHeat

// Find slow requests (>3 seconds)
duration > 3000
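
If you prefer the terminal, the Vercel CLI can stream runtime logs for a deployment, and you can filter the JSON lines with standard tools. The deployment URL below is a placeholder for your own:

# Stream logs for a deployment and keep only AI error events
vercel logs my-app-abc123.vercel.app | grep ai_request_error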

Setting up alerts

Don't wait for users to tell you something's broken. Set up alerts.

AI Gateway alerts:

  1. Vercel Dashboard → AI Gateway → Settings
  2. Scroll to Alerts
  3. Configure thresholds:
Cost alerts:
├─ Daily spend > $10     → Email notification
├─ Daily spend > $50     → Slack notification
└─ Daily spend > $100    → PagerDuty (wake someone up)

Error alerts:
├─ Error rate > 5%       → Email notification
├─ Error rate > 10%      → Slack notification
└─ Error rate > 25%      → PagerDuty

Latency alerts:
├─ p99 latency > 10s     → Email notification
└─ p99 latency > 30s     → Slack notification

Project-level alerts (Vercel):

  1. Project → Settings → Notifications
  2. Configure:
    • Deployment failures
    • Function errors
    • Usage thresholds

Start conservative: It's better to get too many alerts initially and tune them down than to miss something critical.
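
If you route an alert to a webhook rather than email or Slack directly, a small Route Handler can receive it and fan it out however you like. This is a hedged sketch: the alert payload shape is illustrative (log what actually arrives and adjust), and SLACK_WEBHOOK_URL is an environment variable you would create yourself pointing at a Slack incoming webhook.

app/api/alerts/route.ts
// Receives alert webhooks and forwards a readable message to Slack.
// The incoming payload shape is not guaranteed; log it first and adapt.
export async function POST(request: Request) {
  const payload = await request.json();

  console.log(JSON.stringify({ event: "alert_received", payload }));

  // Slack incoming webhooks accept a simple { text } body
  if (process.env.SLACK_WEBHOOK_URL) {
    await fetch(process.env.SLACK_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: `AI alert: ${JSON.stringify(payload)}` }),
    });
  }

  return Response.json({ ok: true });
}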

Debugging production issues

When something goes wrong, here's how to investigate:

1. Check AI Gateway dashboard

  • Error spike? What time did it start?
  • Which model? (primary or fallback?)
  • What error codes? (429, 503, timeout?)

2. Check Vercel logs

  • Search for ai_request_error
  • Filter to the timeframe
  • Look for patterns (same product? same error?)

3. Correlate with deployments

  • Did a deployment happen right before the errors?
  • Check deployment logs for build issues

4. Check provider status

  • Is the provider (Anthropic, OpenAI) reporting an incident on its status page?
  • If it is, confirm your fallback model is absorbing the traffic

Common issues and causes:

Symptom                Likely cause            Fix
Sudden 429 spike       Rate limit hit          Add fallback model, implement backoff (sketch below)
All requests failing   Bad API key             Check env vars in Vercel
Slow responses         Provider degradation    Fallbacks should kick in
Cost spike             Cache not working       Check "use cache" directive and cacheLife config
Token overflow         Long reviews            Truncate input or paginate
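
For the 429 case, the "implement backoff" fix can be as simple as a small retry wrapper around the generateText call. A minimal sketch follows; the file name, retry count, and delays are arbitrary starting points, and your fallback models remain the real safety net.

lib/ai-retry.ts
// Hypothetical helper: retries a function with exponential backoff (1s, 2s, 4s).
export async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  let lastError: unknown;

  // First attempt plus up to maxRetries retries
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break;

      // Wait 1s, 2s, 4s, ... before the next attempt
      const delay = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}

You would then wrap the model call, for example const { text, usage } = await withBackoff(() => generateText({ ... })). In practice, only retry errors that are worth retrying (rate limits, transient 5xx); retrying every failure can make an incident worse.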

Production monitoring checklist

Before going live, verify:

  • AI Gateway analytics accessible — Can you see requests, costs, errors?
  • Structured logging added — JSON logs with request IDs and metadata
  • Cost alerts configured — Get notified before bills surprise you
  • Error alerts configured — Know when things break
  • Fallbacks working — Verified backup models are configured
  • Logs searchable — Can find specific requests when debugging

Try it

  1. Explore your AI Gateway dashboard:

    • How many requests have you made?
    • What's your average latency?
    • Any errors?
  2. Add structured logging:

    • Update summarizeReviews with the logging code
    • Generate a few summaries
    • Check Vercel logs for the JSON output
  3. Set up a cost alert:

    • AI Gateway → Settings → Alerts
    • Set a daily spend threshold (even $1 for testing)
    • Verify you receive the alert notification
  4. Simulate an error (optional):

    • Temporarily break your API key
    • Visit a product page
    • Check that error logs appear correctly
    • Fix the API key

Commit

git add lib/ai-summary.ts
git commit -m "feat(observability): add structured logging to AI functions"
git push

Done-when

  • Explored AI Gateway analytics dashboard
  • Understand key metrics (requests, latency, tokens, costs)
  • Added structured logging to AI functions
  • Configured at least one alert (cost or error)
  • Know how to search logs in Vercel
  • Understand debugging workflow for production issues

What's next

Your AI features are observable. You'll know when things break, why they broke, and how much it's costing. Time to wrap up the course and review everything you've built.

