Observability and monitoring
Your AI features work. Users are happy. But something will break eventually—and you want to know before your users tell you. Observability means understanding what's happening in production without staring at logs all day.
Outcome
Set up monitoring for your AI features using AI Gateway analytics, structured logging, and cost alerts.
Fast track
- Vercel → AI Gateway → Analytics to review requests/tokens/costs breakdown
- Add `console.log(JSON.stringify({ event, requestId, productSlug, duration, tokens }))` to AI functions
- AI Gateway → Settings → Alerts: set daily cost threshold ($10) and error rate threshold (5%)
Hands-on exercise 3.3
Add observability to your AI features:
Requirements:
- Review AI Gateway analytics (requests, tokens, costs, errors)
- Add structured logging to `summarizeReviews` and `getReviewInsights`
- Configure alerts for cost thresholds and error rates
- Test logging by generating some AI requests
Implementation hints:
- AI Gateway dashboard shows real-time and historical data
- Log request metadata (product slug, token count, duration)
- Alerts can notify via email, Slack, or webhooks
- Start with conservative thresholds and adjust based on real usage
AI Gateway analytics
AI Gateway tracks everything automatically. No code changes needed.
Find your analytics:
- Vercel Dashboard → AI Gateway
- Click Analytics tab
What you'll see:
Overview (Last 7 days)
────────────────────────────────────────
Total requests: 2,847
Success rate: 99.2%
Avg latency: 1,847ms
Total tokens: 2.1M
Estimated cost: $6.32
Requests by model:
├─ anthropic/claude-sonnet-4.5 2,614 (91.8%)
├─ anthropic/claude-haiku-3.5 198 (7.0%)
└─ openai/gpt-4-turbo 35 (1.2%)
Errors:
├─ 429 Rate Limited: 18
├─ 503 Service Error: 4
└─ Timeout: 1
Key metrics to watch:
| Metric | Healthy | Investigate | Alert |
|---|---|---|---|
| Success rate | Above 99% | 95-99% | Below 95% |
| Avg latency | Under 2s | 2-5s | Over 5s |
| Error rate | Under 1% | 1-5% | Over 5% |
| Daily cost | Within budget | 2x budget | 5x budget |
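If you'd rather encode these thresholds than remember them, a small check can flag unhealthy numbers from a script or cron job. This is a rough sketch: `MetricSnapshot` and `checkMetrics` are hypothetical names, and you'd supply the numbers yourself from the dashboard or your own logs.

// Hypothetical health check that encodes the table above as code.
// The shape and names are illustrative, not an AI Gateway API.
type MetricSnapshot = {
  successRate: number; // 0-1
  avgLatencyMs: number;
  errorRate: number; // 0-1
  dailyCostUsd: number;
  dailyBudgetUsd: number;
};

export function checkMetrics(m: MetricSnapshot): string[] {
  const flags: string[] = [];
  if (m.successRate < 0.95) flags.push("Success rate below 95% (alert)");
  else if (m.successRate < 0.99) flags.push("Success rate 95-99% (investigate)");
  if (m.avgLatencyMs > 5000) flags.push("Avg latency over 5s (alert)");
  else if (m.avgLatencyMs > 2000) flags.push("Avg latency 2-5s (investigate)");
  if (m.errorRate > 0.05) flags.push("Error rate over 5% (alert)");
  else if (m.errorRate > 0.01) flags.push("Error rate 1-5% (investigate)");
  if (m.dailyCostUsd > 5 * m.dailyBudgetUsd) flags.push("Cost over 5x budget (alert)");
  else if (m.dailyCostUsd > 2 * m.dailyBudgetUsd) flags.push("Cost over 2x budget (investigate)");
  return flags;
}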
Understanding the dashboard
Requests over time: Shows request volume by hour/day. Look for:
- Unexpected spikes (bot traffic? viral post?)
- Sudden drops (deployment broke something?)
- Patterns (peak hours, quiet periods)
Latency distribution: Shows p50, p90, and p99 latency (see the percentile sketch after this section). Look for:
- p50 ~1-2s (typical AI generation)
- p99 under 5s (occasional slow requests are normal)
- p99 over 10s (something's wrong)
Token usage: Shows input vs output tokens. Look for:
- Input tokens >> Output tokens (normal for summarization)
- Unexpected token growth (prompts getting longer?)
- Spikes correlating with specific products (long reviews?)
Cost breakdown: Shows cost by model and day. Look for:
- Steady growth (normal with traffic)
- Sudden jumps (fallbacks triggering? new feature?)
- Model distribution (are fallbacks firing more than expected?)
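If you log durations yourself (as shown in the next section), you can recompute these percentiles from your own data. A minimal sketch using the nearest-rank method; the `percentile` helper is illustrative only.

// Illustrative percentile calculation (nearest-rank) over logged durations.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

const durations = [1200, 1450, 1800, 2100, 950, 4800, 1600]; // ms, pulled from your logs
console.log({
  p50: percentile(durations, 50),
  p90: percentile(durations, 90),
  p99: percentile(durations, 99),
});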
Adding structured logging
AI Gateway tracks aggregate metrics. For debugging specific requests, add your own logging.
Update lib/ai-summary.ts to add logging while keeping the "use cache" directive from Lesson 3.1:
import { generateText, generateObject } from "ai";
import { cacheLife, cacheTag } from "next/cache";
import { Product, ReviewInsights, ReviewInsightsSchema } from "./types";

export async function summarizeReviews(product: Product): Promise<string> {
  "use cache";
  cacheLife("hours");
  cacheTag(`product-summary-${product.slug}`);

  const startTime = Date.now();
  const requestId = crypto.randomUUID();

  console.log(JSON.stringify({
    event: "ai_request_start",
    requestId,
    function: "summarizeReviews",
    productSlug: product.slug,
    reviewCount: product.reviews.length,
    timestamp: new Date().toISOString(),
  }));

  const averageRating =
    product.reviews.reduce((acc, review) => acc + review.stars, 0) /
    product.reviews.length;

  const prompt = `Write a summary of the reviews for the ${product.name} product...`; // Your existing prompt

  try {
    const { text, usage } = await generateText({
      model: "anthropic/claude-sonnet-4.5",
      prompt,
      maxTokens: 1000,
      temperature: 0.75,
    });

    const duration = Date.now() - startTime;

    console.log(JSON.stringify({
      event: "ai_request_success",
      requestId,
      function: "summarizeReviews",
      productSlug: product.slug,
      duration,
      inputTokens: usage?.promptTokens,
      outputTokens: usage?.completionTokens,
      totalTokens: usage?.totalTokens,
      timestamp: new Date().toISOString(),
    }));

    return text.trim();
  } catch (error) {
    const duration = Date.now() - startTime;

    console.error(JSON.stringify({
      event: "ai_request_error",
      requestId,
      function: "summarizeReviews",
      productSlug: product.slug,
      duration,
      error: error instanceof Error ? error.message : "Unknown error",
      timestamp: new Date().toISOString(),
    }));

    throw new Error("Unable to generate review summary. Please try again.");
  }
}

What this logs:
Request start:
{
  "event": "ai_request_start",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "reviewCount": 12,
  "timestamp": "2024-01-15T14:32:01.234Z"
}

Request success:
{
  "event": "ai_request_success",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "duration": 2341,
  "inputTokens": 847,
  "outputTokens": 89,
  "totalTokens": 936,
  "timestamp": "2024-01-15T14:32:03.575Z"
}

Request error:
{
  "event": "ai_request_error",
  "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "function": "summarizeReviews",
  "productSlug": "mower",
  "duration": 5023,
  "error": "Rate limit exceeded",
  "timestamp": "2024-01-15T14:32:06.257Z"
}

Why structured logging?
- Searchable — Find all errors for a specific product
- Parseable — Tools like Vercel Logs, Datadog, or Axiom can parse JSON
- Correlatable — Request IDs link start → success/error
- Measurable — Track duration, tokens, and patterns over time
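Once more than one function logs this way, it helps to centralize the JSON shape so every event stays consistent. A sketch with a hypothetical `logAiEvent` helper; this is a convention, not part of the AI SDK or AI Gateway.

// Hypothetical helper (e.g. lib/ai-logging.ts) so every AI function
// emits the same JSON shape with a timestamp attached.
type AiLogEvent = {
  event: "ai_request_start" | "ai_request_success" | "ai_request_error";
  requestId: string;
  function: string;
  productSlug: string;
  [key: string]: unknown; // duration, tokens, error, etc.
};

export function logAiEvent(entry: AiLogEvent) {
  const line = JSON.stringify({ ...entry, timestamp: new Date().toISOString() });
  if (entry.event === "ai_request_error") {
    console.error(line);
  } else {
    console.log(line);
  }
}

// Usage inside summarizeReviews:
// logAiEvent({ event: "ai_request_start", requestId, function: "summarizeReviews", productSlug: product.slug });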
Viewing logs in Vercel
Find your logs:
- Vercel Dashboard → Your project
- Click Logs tab
- Filter by:
  - Level: `error` (show only errors)
  - Time: Last hour/day/week
  - Search: `ai_request_error` or `productSlug: mower`
Example log search:
// Find all AI errors in the last 24 hours
ai_request_error
// Find all requests for a specific product
productSlug: aquaHeat
// Find slow requests (>3 seconds)
duration > 3000
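The same searches work offline if you export logs to a file with one JSON object per line. A minimal sketch assuming a hypothetical logs.ndjson export in the shape logged above.

// Illustrative offline analysis of exported JSON logs.
// Assumes logs.ndjson contains one JSON object per line.
import { readFileSync } from "node:fs";

const lines = readFileSync("logs.ndjson", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const errors = lines.filter((l) => l.event === "ai_request_error");
const slow = lines.filter((l) => typeof l.duration === "number" && l.duration > 3000);

console.log(`errors: ${errors.length}, slow (>3s): ${slow.length}`);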
Setting up alerts
Don't wait for users to tell you something's broken. Set up alerts.
AI Gateway alerts:
- Vercel Dashboard → AI Gateway → Settings
- Scroll to Alerts
- Configure thresholds:
Cost alerts:
├─ Daily spend > $10 → Email notification
├─ Daily spend > $50 → Slack notification
└─ Daily spend > $100 → PagerDuty (wake someone up)
Error alerts:
├─ Error rate > 5% → Email notification
├─ Error rate > 10% → Slack notification
└─ Error rate > 25% → PagerDuty
Latency alerts:
├─ p99 latency > 10s → Email notification
└─ p99 latency > 30s → Slack notification
Project-level alerts (Vercel):
- Project → Settings → Notifications
- Configure:
- Deployment failures
- Function errors
- Usage thresholds
Start conservative: It's better to get too many alerts initially and tune them down than to miss something critical.
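If you route alerts to a webhook, you need an endpoint to receive them. Here's a minimal sketch of a Next.js route handler that forwards an alert to a Slack incoming webhook; the alert payload fields and the ALERT_SLACK_WEBHOOK_URL environment variable are assumptions, so check the actual body your alerts send before relying on it.

// app/api/alerts/route.ts — hypothetical webhook receiver.
// The payload shape here is assumed; inspect the real alert body first.
export async function POST(request: Request) {
  const alert = await request.json();

  // Forward a human-readable message to a Slack incoming webhook.
  await fetch(process.env.ALERT_SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `AI alert: ${alert.name ?? "unknown"} — ${alert.message ?? JSON.stringify(alert)}`,
    }),
  });

  return Response.json({ received: true });
}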
Debugging production issues
When something goes wrong, here's how to investigate:
1. Check AI Gateway dashboard
- Error spike? What time did it start?
- Which model? (primary or fallback?)
- What error codes? (429, 503, timeout?)
2. Check Vercel logs
- Search for `ai_request_error`
- Filter to the timeframe
- Look for patterns (same product? same error?)
3. Correlate with deployments
- Did a deployment happen right before the errors?
- Check deployment logs for build issues
4. Check provider status
- Anthropic Status
- OpenAI Status
- If a provider is down, your fallbacks should be handling it
Common issues and causes:
| Symptom | Likely cause | Fix |
|---|---|---|
| Sudden 429 spike | Rate limit hit | Add fallback model, implement backoff |
| All requests failing | Bad API key | Check env vars in Vercel |
| Slow responses | Provider degradation | Fallbacks should kick in |
| Cost spike | Cache not working | Check "use cache" directive and cacheLife config |
| Token overflow | Long reviews | Truncate input or paginate |
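For the rate-limit row, "implement backoff" can be as small as retrying with an increasing delay. A rough sketch; the `withBackoff` wrapper and its retry counts are illustrative, not a library API.

// Hypothetical retry wrapper with exponential backoff for 429-style failures.
async function withBackoff<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries) throw error;
      const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: const { text } = await withBackoff(() => generateText({ model, prompt }));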
Production monitoring checklist
Before going live, verify:
- AI Gateway analytics accessible — Can you see requests, costs, errors?
- Structured logging added — JSON logs with request IDs and metadata
- Cost alerts configured — Get notified before bills surprise you
- Error alerts configured — Know when things break
- Fallbacks working — Verified backup models are configured
- Logs searchable — Can find specific requests when debugging
Try it
- Explore your AI Gateway dashboard:
  - How many requests have you made?
  - What's your average latency?
  - Any errors?
- Add structured logging:
  - Update `summarizeReviews` with the logging code
  - Generate a few summaries
  - Check Vercel logs for the JSON output
- Set up a cost alert:
  - AI Gateway → Settings → Alerts
  - Set a daily spend threshold (even $1 for testing)
  - Verify you receive the alert notification
- Simulate an error (optional):
  - Temporarily break your API key
  - Visit a product page
  - Check that error logs appear correctly
  - Fix the API key
Commit
git add lib/ai-summary.ts
git commit -m "feat(observability): add structured logging to AI functions"
git push

Done-when
- Explored AI Gateway analytics dashboard
- Understand key metrics (requests, latency, tokens, costs)
- Added structured logging to AI functions
- Configured at least one alert (cost or error)
- Know how to search logs in Vercel
- Understand debugging workflow for production issues
What's next
Your AI features are observable. You'll know when things break, why they broke, and how much it's costing. Time to wrap up the course and review everything you've built.