To create an AI agent, you pick a framework, give it a model (Claude, GPT-4o, or open weights), define the tools it can call, write the system prompt that tells it how to plan, and add a memory or state layer so it can reason across steps. Production agents add evaluation, guardrails, and a way to hand off to a human when confidence is low. The whole loop is the same whether the agent qualifies inbound leads, drafts outbound emails, or runs a renewal motion. What changes is the tools you wire up and the data it can read.
This guide walks through the five-step build pattern and then maps it to the four agent frameworks GTM teams actually ship with in 2026: the OpenAI Assistants API, Anthropic's tool-use API, CrewAI, and LangChain (with LangGraph). It closes with three GTM use cases (lead enrichment, outbound personalization, renewal triage), the most common failure modes, and how to decide between build and buy.
An agent is a loop. The model reads a goal, decides which tool to call, calls it, reads the result, and either continues or stops. Every framework is a wrapper around that loop. The five parts you have to define yourself are the model, the tools, the system prompt, the memory, and the exit condition.
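The five parts map directly onto code. The sketch below is a minimal illustration, not any framework's API: `AgentSpec`, `run_agent`, and `fake_model` are hypothetical names, and the model is a plain stub function so the loop runs without an API key.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentSpec:
    model: Callable[[str, list], dict]          # 1. the model (a stub callable here)
    tools: dict[str, Callable[..., str]]        # 2. the tools it may call
    system_prompt: str                          # 3. the planning instructions
    memory: list = field(default_factory=list)  # 4. state carried across steps
    max_steps: int = 6                          # 5. the exit condition (hard cap)


def run_agent(spec: AgentSpec, goal: str) -> str:
    spec.memory.append({"role": "user", "content": goal})
    for _ in range(spec.max_steps):
        decision = spec.model(spec.system_prompt, spec.memory)
        if decision["type"] == "final":
            return decision["text"]
        # The model asked for a tool: call it and record the result in memory
        result = spec.tools[decision["tool"]](**decision["args"])
        spec.memory.append({"role": "tool", "name": decision["tool"], "content": result})
    return "stopped: step limit reached"


def fake_model(system_prompt: str, memory: list) -> dict:
    """Stand-in for an LLM call: look up the account once, then answer."""
    tool_results = [m for m in memory if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool", "tool": "lookup_account", "args": {"domain": "acme.com"}}
    return {"type": "final", "text": f"Summary: {tool_results[-1]['content']}"}


spec = AgentSpec(
    model=fake_model,
    tools={"lookup_account": lambda domain: f"{domain} is an open opportunity"},
    system_prompt="Look up the account, then summarize it.",
)
answer = run_agent(spec, "Qualify acme.com")
print(answer)
```

Swapping `fake_model` for a real API call changes one field. The loop, the memory list, and the step cap stay the same.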
The mistake new builders make is starting from the framework. The framework matters less than the five definitions above. A 200-line script that wires the five together with the OpenAI or Anthropic API is a working agent, and with evaluation and guardrails around it, a production-ready one. The frameworks below add structure for multi-agent orchestration and observability, not new capability.
Four frameworks cover most GTM agent builds in 2026. Each makes different tradeoffs on speed-to-prototype, observability, and how much code you own.
| Framework | Best for | Strength | Tradeoff |
|---|---|---|---|
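| OpenAI Assistants API | Single-agent GTM workflows on the OpenAI stack | Built-in threads, retrieval, code interpreter | Lock-in to OpenAI; thread state can drift on long workflows; OpenAI has announced its deprecation in favor of the Responses API, so check the sunset date before committing |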
| Anthropic tools (Claude API) | Tool-heavy single-agent workflows with long context | 200K context, strong tool-call accuracy, prompt caching (cache reads billed at about 10% of the base input price) | You manage threads and memory yourself |
| CrewAI | Multi-agent role-based workflows (researcher, writer, reviewer) | Fastest path to a multi-agent prototype, clean role abstraction | Less control over the loop; debugging multi-agent failures is harder |
| LangChain + LangGraph | Production agents with explicit state machines | Full observability, retries, branching, human-in-the-loop nodes | Steeper learning curve; more code to ship the first version |
A working rule of thumb: prototype on the OpenAI Assistants API or Anthropic tools in a single afternoon, prove the agent works on five real cases, then port to LangGraph when you need retries, branching, or human-in-the-loop. Use CrewAI when the natural decomposition is role-based and the agents need to talk to each other.
The simplest production agent pattern is a tool-use loop against the Claude API. The Python sketch below shows the structure; the `crm` and `bombora` clients are placeholders for your own integrations. Real code from the Anthropic SDK quickstart and the LangChain agent tutorials follows the same shape.
```python
# Minimal agent loop, ~40 lines
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [
    {"name": "lookup_account", "description": "Pull CRM data for a company domain.",
     "input_schema": {"type": "object", "properties": {"domain": {"type": "string"}}, "required": ["domain"]}},
    {"name": "search_intent", "description": "Check Bombora intent surge for a topic at a domain.",
     "input_schema": {"type": "object", "properties": {"domain": {"type": "string"}, "topic": {"type": "string"}}, "required": ["domain", "topic"]}},
]

def run_tool(name, args):
    # `crm` and `bombora` are placeholders: supply your own client objects
    if name == "lookup_account":
        return crm.get(args["domain"])  # your CRM client
    if name == "search_intent":
        return bombora.surge(args["domain"], args["topic"])
    return {"error": "unknown tool"}

messages = [{"role": "user", "content": "Qualify acme.com for our cybersecurity ICP and tell me if they are showing intent."}]

for _ in range(8):  # cap the loop so a single run cannot spin forever
    resp = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if resp.stop_reason != "tool_use":
        # Final answer (or a max_tokens stop): print the text blocks and exit
        print("".join(block.text for block in resp.content if block.type == "text"))
        break
    # Tool call requested: echo the assistant turn, then return one result per call
    messages.append({"role": "assistant", "content": resp.content})
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = run_tool(block.name, block.input)
            tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": str(result)})
    messages.append({"role": "user", "content": tool_results})
```
That is a working agent. It reads a goal, decides which tools to call, calls them, reads the results, and either continues or returns a final answer. The same pattern works against the OpenAI Chat Completions API with function calling. Move to LangGraph or CrewAI when you need a state machine, retries, or multi-agent coordination, not before.
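For the OpenAI variant, the first mechanical difference is the tool schema. The sketch below assumes the Chat Completions `tools` format, where each entry has `type: "function"` and the JSON Schema sits under `parameters`. `to_openai_tools` is a hypothetical helper name, not part of either SDK.

```python
def to_openai_tools(anthropic_tools: list[dict]) -> list[dict]:
    """Convert Anthropic-style tool definitions to Chat Completions function tools."""
    return [
        {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": tool["input_schema"],  # both sides take JSON Schema
            },
        }
        for tool in anthropic_tools
    ]


anthropic_tools = [
    {"name": "lookup_account", "description": "Pull CRM data for a company domain.",
     "input_schema": {"type": "object", "properties": {"domain": {"type": "string"}}, "required": ["domain"]}},
]
openai_tools = to_openai_tools(anthropic_tools)
print(openai_tools[0]["function"]["name"])
```

The loop itself also changes slightly on the OpenAI side: tool requests arrive as `tool_calls` on the assistant message, and each result goes back as a `tool` role message carrying the matching `tool_call_id`.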
The most common first agent on a GTM team enriches and qualifies inbound leads. The agent reads a form submission, pulls the company from Clearbit or ZoomInfo, checks the CRM for prior history, runs the ICP scoring rule, and routes to the right SDR with a one-paragraph brief. This works as a starter project because the inputs and outputs are crisp, the failure mode is recoverable (a misrouted lead is easy to fix), and the manual baseline is well understood.
The tools you wire up: a CRM lookup, an enrichment API call, an intent data check, an ICP scoring function (this is just a Python function the agent can call, not its own model), a Slack notification, and a CRM update. The system prompt instructs the agent on the order: enrich first, then check intent, then score, then route. Use Claude Haiku or GPT-4o-mini here. The reasoning is shallow and the volume is high. For deeper coverage on what the modern AI SDR stack looks like, see the AI SDR and outbound directory.
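The ICP scoring tool really is just a function. A minimal sketch follows, with every specific invented for illustration: the `Lead` fields, the target industries, the point weights, and the queue names are all assumptions to replace with your own rule.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    domain: str
    employees: int
    industry: str
    intent_surge: bool


# Hypothetical ICP rule: swap in your own segments and weights
TARGET_INDUSTRIES = {"fintech", "healthcare", "saas"}


def score_icp(lead: Lead) -> int:
    """Deterministic 0-100 score; no model call involved."""
    score = 0
    if lead.industry in TARGET_INDUSTRIES:
        score += 40
    if 200 <= lead.employees <= 5000:
        score += 40
    if lead.intent_surge:
        score += 20
    return score


def route(lead: Lead) -> str:
    """Map the score to a queue (queue names are placeholders)."""
    score = score_icp(lead)
    if score >= 80:
        return "enterprise_sdr"
    if score >= 40:
        return "smb_sdr"
    return "nurture"


# Tool definition the agent sees, in the same shape as the loop example above
SCORE_ICP_TOOL = {
    "name": "score_icp",
    "description": "Score an enriched lead against the ICP rule. Call after enrichment and intent checks.",
    "input_schema": {
        "type": "object",
        "properties": {
            "domain": {"type": "string"},
            "employees": {"type": "integer"},
            "industry": {"type": "string"},
            "intent_surge": {"type": "boolean"},
        },
        "required": ["domain", "employees", "industry", "intent_surge"],
    },
}

print(route(Lead("acme.com", 1200, "fintech", True)))
```

Keeping the score in code rather than in the prompt means the rule is testable and does not drift between model versions.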
The second common build is an outbound personalization agent. The agent reads a target account list, pulls a recent signal per account (a job posting, a 10-K mention, a LinkedIn announcement), drafts a personalized opening line, and writes the message into Outreach or Apollo as a draft for the AE to review.
This is the build that Clay and Lavender productized. The reason teams still build their own is twofold. First, the model and prompt matter more than the framework: a 200-line script using Claude Sonnet with a strong system prompt and three example outputs outperforms a generic SaaS tool on accounts the SaaS tool does not have signals for. Second, the agent can read your CRM, your product usage data, and your internal account notes, which a vendor cannot.
Tools: a CRM lookup, a web search, a LinkedIn scraper (or a vendor like PhantomBuster), a recent-news API, a prompt for opener generation, and an Outreach API draft create. Use Claude Sonnet or GPT-4o for the writing. Cache the system prompt and the example outputs (Anthropic bills cache reads at about 10% of the base input price, though the first write of a cached prefix costs somewhat more than normal input), and the cost per personalized opener drops to a fraction of a cent.
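Caching the system prompt and examples comes down to one field on the request. The sketch below only builds the keyword arguments you would pass to `client.messages.create`; it assumes Anthropic's `cache_control` block marker, and `build_opener_request`, the prompt text, and the model ID in the usage line are illustrative assumptions.

```python
def build_opener_request(model: str, system_prompt: str, examples: list[str], account_signal: str) -> dict:
    """Build Messages API kwargs with the static prefix marked for caching."""
    examples_text = "\n\n".join(f"Example {i}:\n{ex}" for i, ex in enumerate(examples, 1))
    return {
        "model": model,
        "max_tokens": 300,
        "system": [
            {"type": "text", "text": system_prompt},
            # The marker on the last static block caches everything up to and including it
            {"type": "text", "text": examples_text, "cache_control": {"type": "ephemeral"}},
        ],
        # Only the per-account signal changes between calls, so it stays outside the cache
        "messages": [{"role": "user", "content": f"Signal: {account_signal}\nWrite one opening line."}],
    }


request = build_opener_request(
    model="claude-sonnet-4-20250514",  # assumed model ID; check the current model list
    system_prompt="You write short, specific opening lines for outbound email.",
    examples=["Saw the Series B news...", "Noticed you are hiring three SREs..."],
    account_signal="Posted a job for a Head of Security",
)
print(request["system"][-1]["cache_control"])
```

Put everything static first and everything per-account last. Very short prompts fall below the minimum cacheable length, so this pays off once the system prompt and examples are substantial.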
The third pattern is a renewal triage agent that runs nightly across the customer base. The agent reads product usage data, support tickets, NPS scores, and CRM notes per account, decides which renewals are at risk, and writes a Slack brief for each CSM with the top three accounts to call this week and the specific signal that triggered the flag.
The agent does not run the renewal call. It triages. This works because the signal data is structured (usage drops, ticket spikes, NPS dips) and the model only ranks the signals it sees, instead of predicting renewal risk from scratch. Use Claude Sonnet with prompt caching. The CSM keeps full control of the relationship and uses the brief as their starting agenda for the week. The customer success directory covers the platforms that already do parts of this (Gainsight, ChurnZero, Catalyst) and where a custom agent still wins.
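Because the signals are structured, most of the triage can be plain code that runs before the model sees anything. The sketch below is a stand-in, not the model's ranking: it flags accounts against hypothetical thresholds and orders them by signal count, producing the per-CSM shortlist a model would then turn into a Slack brief. Field names and thresholds are assumptions.

```python
from collections import defaultdict

# Hypothetical thresholds; tune them against your own churn history
USAGE_DROP_PCT = -20
TICKET_SPIKE = 5
NPS_DETRACTOR_MAX = 6


def risk_signals(account: dict) -> list[str]:
    """Return the human-readable signals that fired for one account."""
    signals = []
    if account["usage_change_pct"] <= USAGE_DROP_PCT:
        signals.append(f"usage down {abs(account['usage_change_pct'])}%")
    if account["open_tickets"] >= TICKET_SPIKE:
        signals.append(f"{account['open_tickets']} open tickets")
    if account["nps"] is not None and account["nps"] <= NPS_DETRACTOR_MAX:
        signals.append(f"NPS {account['nps']}")
    return signals


def weekly_briefs(accounts: list[dict], limit: int = 3) -> dict[str, list[dict]]:
    """Group flagged accounts by CSM, most signals first, capped at `limit`."""
    by_csm = defaultdict(list)
    for account in accounts:
        signals = risk_signals(account)
        if signals:
            by_csm[account["csm"]].append({"account": account["name"], "signals": signals})
    return {
        csm: sorted(rows, key=lambda r: (-len(r["signals"]), r["account"]))[:limit]
        for csm, rows in by_csm.items()
    }


accounts = [
    {"name": "Acme", "csm": "dana", "usage_change_pct": -35, "open_tickets": 7, "nps": 4},
    {"name": "Globex", "csm": "dana", "usage_change_pct": -25, "open_tickets": 1, "nps": 9},
    {"name": "Initech", "csm": "dana", "usage_change_pct": 5, "open_tickets": 0, "nps": None},
    {"name": "Umbrella", "csm": "lee", "usage_change_pct": 2, "open_tickets": 6, "nps": 8},
]
briefs = weekly_briefs(accounts)
print(briefs["dana"][0])
```

Every flag in the brief traces back to a named signal, which is what lets the CSM trust it or overrule it.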
Five failures account for most production agent issues:
The first three are prompt and engineering work. The last two are operational. None of them go away by switching frameworks.
Build when one of three conditions is true. Your data is the moat (the agent needs to read CRM, product usage, or internal docs no vendor can). The vendor pricing breaks at your volume (most GTM AI tools price per seat or per enriched record, which gets expensive past a few hundred reps or a few hundred thousand records). Or the workflow is specific to your operating model and a generic tool would need too much customization.
Buy when one of three conditions is true. The use case is generic (cold email writing, meeting note summarization, calendar scheduling). The volume is low and you do not have engineering bandwidth. Or the vendor has signal data you cannot easily replicate (intent, technographics, hiring activity). The workflow automation and voice AI directories cover the vendor landscape if buy is the right call.
A current production stack for a custom GTM agent looks roughly like this. Model: Claude Opus or Sonnet for reasoning, Haiku for routing and classification, GPT-4o as a fallback or for ecosystem tooling. Framework: Anthropic SDK or OpenAI SDK for single-agent loops, LangGraph for state machines with retries and human-in-the-loop, CrewAI for multi-agent role play. Observability: LangSmith, Braintrust, or Helicone for tracing every tool call and prompt version. Evaluation: Promptfoo or a custom eval harness running on a labeled set of 50 to 200 cases. Memory: a vector store (Pinecone, Weaviate, or Postgres with pgvector) for long-term recall, a structured database for state.
The most important advice for a first build: start with the smallest agent that produces measurable lift on a real workflow. Wire five tools, not fifty. Run it on 20 real cases before scaling. Watch every trace for the first week. Most of the value of an agent comes from being honest about its failures, not from being clever about its prompt.
For a single-afternoon prototype, start with the OpenAI Assistants API or the Anthropic tool-use API. Both let you define tools, pass a goal, and get a working agent loop in roughly 40 lines of code. Move to LangGraph when you need retries, branching, or a human-in-the-loop step. Use CrewAI when the natural decomposition is multiple specialist agents (researcher, writer, reviewer).
For reasoning-heavy multi-step workflows, use Claude Opus or GPT-4o. For routing, classification, and high-volume enrichment, use Claude Haiku or GPT-4o-mini at roughly one-tenth the cost. For high-volume cases where data residency matters, use an open-weights model like Llama 3.1 70B or Qwen 2.5 served on your own infrastructure. Most production agents use a tiered model strategy, not one model for everything.
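A tiered strategy can start as a lookup table. In the sketch below the task names and tier labels are placeholders, not API model IDs; map them to real IDs in your own config. The fallback to the strongest tier for unknown task types is a design assumption, not a rule from any vendor.

```python
# Placeholder tier labels, not API model IDs
TIER_BY_TASK = {
    "multi_step_reasoning": "claude-opus",
    "drafting": "claude-sonnet",
    "routing": "claude-haiku",
    "classification": "claude-haiku",
    "enrichment": "claude-haiku",
}
SELF_HOSTED = "self-hosted-open-weights"


def pick_model(task_type: str, data_residency_required: bool = False) -> str:
    """Choose a model tier for one call."""
    if data_residency_required:
        # Residency outranks capability: the data must stay on your own infrastructure
        return SELF_HOSTED
    # Unknown task types fall back to the strongest tier rather than failing cheaply
    return TIER_BY_TASK.get(task_type, "claude-opus")


print(pick_model("classification"))
print(pick_model("multi_step_reasoning"))
print(pick_model("enrichment", data_residency_required=True))
```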
Three steps keep agent cost under control. First, use prompt caching: Anthropic bills cache reads at about 10% of the base input price, and OpenAI also discounts cached input tokens, though the rate varies by model. Cache the system prompt and the example outputs. Second, route trivial calls (intent classification, simple extraction) to a cheaper model. Third, cap the agent loop at a fixed number of iterations (often 6 to 10) so a single run cannot rack up an unbounded bill.
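The caching arithmetic is worth running once with your own numbers. The sketch below uses illustrative prices (3 USD per million input tokens, 15 USD per million output tokens) and token counts that are assumptions, not quotes from any price list. It also ignores the one-time premium on the first cache write.

```python
def call_cost_usd(
    fresh_input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    cached_read_multiplier: float = 0.10,  # cache reads at 10% of input price
) -> float:
    """Cost of one model call in USD, given per-million-token prices."""
    input_cost = (
        fresh_input_tokens * price_in_per_mtok
        + cached_input_tokens * price_in_per_mtok * cached_read_multiplier
    )
    return (input_cost + output_tokens * price_out_per_mtok) / 1_000_000


PRICE_IN, PRICE_OUT = 3.00, 15.00  # illustrative prices, USD per million tokens

# A 4,000-token system prompt plus examples, 300 tokens of per-account input, 60 tokens out
with_cache = call_cost_usd(300, 4000, 60, PRICE_IN, PRICE_OUT)
without_cache = call_cost_usd(4300, 0, 60, PRICE_IN, PRICE_OUT)
print(f"with cache: ${with_cache:.4f}  without: ${without_cache:.4f}")
```

Under these assumptions a cached call costs about 0.3 cents against roughly 1.4 cents uncached, which is where the "fraction of a cent" figure for a personalized opener comes from.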
A workflow automation runs a fixed sequence of steps you defined in advance (Zapier, n8n, Make). An AI agent decides at each step which tool to call next based on the goal and what it has learned so far. Use a workflow when the path is deterministic. Use an agent when the path depends on what the data tells you. Many production GTM stacks use both: an automation for the predictable plumbing and an agent for the decision points.
Build a labeled evaluation set of 50 to 200 real cases with known good outputs before shipping. Run every prompt change against the eval set and track precision and recall on the outputs that matter. Tools like Promptfoo, Braintrust, and LangSmith automate the harness. Without an eval set, every change to the prompt is a guess and regressions ship silently to production.
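A first eval harness needs very little code. The sketch below assumes a binary outcome (`qualified` or not) and a `predict` callable that wraps your agent; the four labeled cases and the stub predictor are invented for illustration.

```python
from typing import Callable


def precision_recall(cases: list[dict], predict: Callable[[dict], str], positive: str = "qualified") -> tuple[float, float]:
    """Score predictions against labeled cases for one positive class."""
    tp = fp = fn = 0
    for case in cases:
        predicted = predict(case["input"])
        expected = case["expected"]
        if predicted == positive and expected == positive:
            tp += 1
        elif predicted == positive:
            fp += 1  # agent said qualified, label says otherwise
        elif expected == positive:
            fn += 1  # agent missed a lead the label marks as qualified
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labeled set; a real one has 50 to 200 cases pulled from production
cases = [
    {"input": {"employees": 500}, "expected": "qualified"},
    {"input": {"employees": 50}, "expected": "qualified"},
    {"input": {"employees": 300}, "expected": "not_qualified"},
    {"input": {"employees": 10}, "expected": "not_qualified"},
]


def stub_agent(lead: dict) -> str:
    """Stand-in for the real agent call."""
    return "qualified" if lead["employees"] >= 200 else "not_qualified"


precision, recall = precision_recall(cases, stub_agent)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Run this on every prompt change and keep the numbers in version control next to the prompt, so a regression shows up as a diff instead of a customer complaint.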
No-code tools get you partway. Tools like Lindy, Gumloop, and Relay let you assemble agent-like workflows with a visual builder, and they cover most lead routing, enrichment, and summary use cases. The tradeoff is the same as no-code automation in general: fast to build, harder to debug, and the moment you need a custom tool or a non-standard model call, you are back in code. Most teams shipping high-volume GTM agents use a hybrid: a no-code layer for plumbing and a code layer for the agent loop itself.
A working prototype takes a single afternoon. A version you trust in front of real customers usually takes four to eight weeks of iteration: writing the system prompt, building the evaluation set, fixing the failure modes that show up on real cases, adding the observability and retry logic, and getting sign-off from the team whose workflow the agent is changing. The agent loop is easy. The operational discipline around it is the work.