Large language models went from research curiosity to production tool in under two years. Nearly every SaaS product is adding AI features, and the developers building them need more than API keys and copy-pasted prompts. LLMs require structured integration: reliable API calls, well-crafted prompts, context management, and output validation. The difference between a demo and a production feature is how seriously you treat these concerns.
Calling the API from a TypeScript backend
The Anthropic and OpenAI SDKs are straightforward. The complexity lives in how you structure the call, handle errors, and parse the response. Streaming matters for user-facing features because waiting 10 seconds for a complete response kills the experience. The example below uses a plain, non-streaming call to keep the focus on response handling.
import Anthropic from '@anthropic-ai/sdk';

// With no arguments, the client reads the API key from the ANTHROPIC_API_KEY environment variable.
const client = new Anthropic();

async function generateSummary(content: string): Promise<string> {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Summarize this article in 2-3 sentences:\n\n${content}`,
      },
    ],
  });

  // The response is an array of content blocks; narrow the first one before reading its text.
  const block = message.content[0];
  if (block?.type === 'text') return block.text;
  throw new Error('Unexpected response type');
}
Checking the block type before reading it matters. The API returns an array of content blocks, and a block can be text or a tool use request, among other types. Assuming the first block is always text without checking leads to runtime errors in production.
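The function above waits for the whole message before returning. For the streaming case, here is a minimal sketch of accumulating text as it arrives. The `StreamEvent` type is a simplified local stand-in modelled on the `content_block_delta` events with `text_delta` payloads that the Messages API emits when streaming; `collectText`, `onText`, and `fakeStream` are hypothetical names for illustration, and the fake generator stands in for a real SDK stream.

```typescript
// Simplified stand-in for streaming events; only the fields used here are modelled.
interface StreamEvent {
  type: string;
  delta?: { type: string; text?: string };
}

// Forward each text chunk to the caller as it arrives and return the full text at the end.
async function collectText(
  events: AsyncIterable<StreamEvent>,
  onText: (chunk: string) => void,
): Promise<string> {
  let full = '';
  for await (const event of events) {
    const text = event.delta?.text;
    if (
      event.type === 'content_block_delta' &&
      event.delta?.type === 'text_delta' &&
      typeof text === 'string'
    ) {
      full += text;
      onText(text);
    }
    // Other event types (message start/stop and so on) are ignored in this sketch.
  }
  return full;
}

// Fake stream for local experimentation (an assumption, not SDK output).
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { type: 'message_start' };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Hello, ' } };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'world.' } };
  yield { type: 'message_stop' };
}
```

In a real backend, the iterable would likely come from calling `client.messages.create` with `stream: true`, and `onText` would write each chunk to the HTTP response so the user sees text immediately.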
Prompt engineering that works
The gap between a mediocre prompt and a good one is the gap between a useless feature and a reliable one. System prompts define the model’s behavior, constraints, and output format. User prompts provide the specific task.
const systemPrompt = `You are a code review assistant for a TypeScript/React codebase.
Rules:
- Focus on bugs, security issues, and performance problems
- Ignore style preferences
- Return JSON with this shape: { issues: Array<{ line: number, severity: "error" | "warning", message: string }> }
- If no issues found, return { issues: [] }`;

async function reviewCode(code: string, filename: string) {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2048,
    system: systemPrompt,
    messages: [{ role: 'user', content: `Review this file (${filename}):\n\n${code}` }],
  });

  const block = message.content[0];
  if (block?.type !== 'text') throw new Error('Unexpected response type');
  // JSON.parse throws on malformed output and returns an unvalidated value.
  return JSON.parse(block.text);
}
Specifying the exact JSON shape in the system prompt and parsing the output with JSON.parse is the minimum. JSON.parse throws on malformed output and says nothing about whether the fields match, so for production, validate the parsed output with Zod to catch cases where the model drifts from the expected format.
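To make the validation step concrete, here is a minimal sketch that checks the parsed output against the shape from the system prompt. It uses a hand-written type guard instead of Zod so that it runs without dependencies; in a real project a Zod schema would replace `isIssue`. The names `ReviewResult`, `isIssue`, and `parseReview` are hypothetical, and extracting the outermost braces is an assumption about one common failure mode (JSON wrapped in prose or a code fence), not a complete recovery strategy.

```typescript
type Severity = 'error' | 'warning';

interface Issue {
  line: number;
  severity: Severity;
  message: string;
}

interface ReviewResult {
  issues: Issue[];
}

// Runtime check that a value has the fields the system prompt asked for.
function isIssue(value: unknown): value is Issue {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.line === 'number' &&
    Number.isInteger(v.line) &&
    (v.severity === 'error' || v.severity === 'warning') &&
    typeof v.message === 'string'
  );
}

function parseReview(raw: string): ReviewResult {
  // Keep only the text between the first "{" and the last "}" to drop any wrapper text.
  const candidate = raw.slice(raw.indexOf('{'), raw.lastIndexOf('}') + 1);

  let parsed: unknown;
  try {
    parsed = JSON.parse(candidate);
  } catch {
    throw new Error('Model output is not valid JSON');
  }

  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Model output is not a JSON object');
  }
  const issues = (parsed as Record<string, unknown>).issues;
  if (!Array.isArray(issues) || !issues.every(isIssue)) {
    throw new Error('Model output does not match the expected shape');
  }
  return { issues };
}
```

Throwing on a mismatch lets the caller decide whether to retry the request, fall back to an empty result, or surface the error.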
Retrieval-Augmented Generation
LLMs hallucinate when they lack context. RAG mitigates this by fetching relevant documents before generating a response, so the model can answer from your data instead of relying only on its training data.
// generateEmbedding and vectorDb are application-specific helpers that are not defined in this article.
async function answerQuestion(question: string) {
  // Embed the question, then fetch the five most similar documents.
  const embedding = await generateEmbedding(question);
  const relevantDocs = await vectorDb.search(embedding, { limit: 5 });
  const context = relevantDocs.map((doc) => doc.content).join('\n\n---\n\n');

  const message = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system:
      'Answer based only on the provided context. If the context does not contain the answer, say so.',
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`,
      },
    ],
  });

  const block = message.content[0];
  if (block?.type !== 'text') throw new Error('Unexpected response type');
  return block.text;
}
The vector database stores embeddings of your documents. At query time, the question is embedded, similar documents are retrieved, and the model generates an answer grounded in that context. This pattern powers documentation chatbots, support assistants, and internal knowledge bases.
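The retrieval step is left abstract above, so here is a minimal in-memory sketch of what a similarity search does: score every stored document against the query embedding with cosine similarity and keep the best matches. `StoredDoc`, `cosineSimilarity`, and `topK` are hypothetical names, and the brute-force scan is an illustration only; real vector databases typically rely on approximate nearest-neighbor indexes rather than comparing against every document.

```typescript
interface StoredDoc {
  content: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Embedding dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero-length vector as having no similarity to anything.
  return denom === 0 ? 0 : dot / denom;
}

// Return the `limit` documents most similar to the query, best match first.
function topK(query: number[], docs: StoredDoc[], limit: number): StoredDoc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map(({ doc }) => doc);
}
```

Real embeddings have hundreds or thousands of dimensions, but the ranking logic is the same as with the toy two-dimensional vectors this sketch can be tried on.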
Cost and latency management
LLM API calls are expensive and slow compared to traditional backend operations. A naive implementation that sends the full page content on every keystroke will burn through budget and frustrate users.
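import { useRef, useCallback } from 'react';

function useDebouncedAI(delay: number) {
  // ReturnType<typeof setTimeout> type-checks under both browser and Node typings.
  const timeoutRef = useRef<ReturnType<typeof setTimeout> | undefined>(undefined);
  const abortRef = useRef<AbortController | undefined>(undefined);

  const request = useCallback(
    async (input: string, onResult: (text: string) => void) => {
      // Cancel the pending timer and any in-flight request before scheduling a new one.
      clearTimeout(timeoutRef.current);
      abortRef.current?.abort();

      timeoutRef.current = setTimeout(async () => {
        const controller = new AbortController();
        abortRef.current = controller;
        try {
          const response = await fetch('/api/ai/suggest', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ input }),
            signal: controller.signal,
          });
          if (!response.ok) throw new Error(`Request failed: ${response.status}`);
          const data = await response.json();
          onResult(data.suggestion);
        } catch (err) {
          // An aborted request is expected when newer input arrives; log anything else.
          if (controller.signal.aborted) return;
          console.error(err);
        }
      }, delay);
    },
    [delay]
  );

  return request;
}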
Debouncing collapses a burst of keystrokes into a single request. Aborting in-flight requests keeps stale responses from overwriting newer ones. Caching responses to identical prompts on the server, for example in Redis, avoids paying twice for the same input. These patterns are standard in web development but critical with LLMs, where each call costs money and takes seconds.
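The hook covers debouncing and aborting, but not caching. Here is a minimal server-side sketch of caching by prompt, keyed on a SHA-256 hash of the model name and prompt text. `KeyValueStore` is a hypothetical interface that a thin wrapper around a Redis client could satisfy; the in-memory `MemoryStore` stands in for Redis here, and `cachedCompletion` and `generate` are illustrative names. Expiry and concurrent misses for the same key are left out.

```typescript
import { createHash } from 'node:crypto';

// Minimal async key-value interface (an assumption, not a Redis client API).
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class MemoryStore implements KeyValueStore {
  private readonly data = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }

  async set(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// Hash the model and prompt so keys stay short and fixed-length.
function cacheKey(model: string, prompt: string): string {
  return 'llm:' + createHash('sha256').update(`${model}\n${prompt}`).digest('hex');
}

// Return a cached response when one exists; otherwise call the model and store the result.
async function cachedCompletion(
  store: KeyValueStore,
  model: string,
  prompt: string,
  generate: (prompt: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(model, prompt);
  const hit = await store.get(key);
  if (hit !== null) return hit;

  const result = await generate(prompt);
  await store.set(key, result);
  return result;
}
```

Including the model name in the key prevents a response generated by one model from being served after switching to another. Exact-match caching only pays off when inputs repeat verbatim, so it suits summaries of fixed content better than free-form chat.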
Conclusion
LLMs are a powerful primitive for developers, not a magic solution. Structured API integration, validated outputs, RAG for grounded answers, and cost-conscious patterns are what turn a prototype into a production feature. The tooling is mature enough for real applications, and the developers who understand both the capabilities and the constraints will build the most useful products.