Build a Complete Langfuse Observability and Evaluation Pipeline for Tracing, Prompt Management, Scoring, and Experiments
This tutorial implements a complete Langfuse pipeline for tracing, prompt management, scoring, datasets, and experiments, using either OpenAI or a mock LLM.
In this tutorial, we build a pipeline with Langfuse, an open-source LLM engineering platform, that covers tracing, prompt management, scoring, datasets, and experiments. The workflow runs with either a real OpenAI key or a deterministic mock LLM, so we can exercise every major Langfuse feature without depending on paid model access. We start by setting up credentials and connecting to Langfuse. We then trace simple function calls, instrument a small RAG pipeline, manage prompts centrally, attach evaluation scores, and run dataset-based experiments. Along the way, we see how Langfuse helps us observe, evaluate, and improve LLM applications in a structured and production-ready way.
import subprocess, sys

def pip_install(*pkgs):
    # Install or upgrade packages into the interpreter that runs this notebook.
    subprocess.run([sys.executable, "-m", "pip", "install", "-qU", *pkgs], check=True)

pip_install("langfuse", "openai")

import os
from getpass import getpass

def _ask(var, prompt, secret=True, default=None):
    # Reuse an existing environment variable; otherwise prompt and store the value.
    if os.environ.get(var):
        return os.environ[var]
    val = (getpass(prompt) if secret else input(prompt)).strip()
    if not val and default is not None:
        val = default
    os.environ[var] = val
    return val

print("Enter your Langfuse credentials (input is hidden):")
_ask("LANGFUSE_PUBLIC_KEY", " Langfuse PUBLIC key (pk-lf-...): ")
_ask("LANGFUSE_SECRET_KEY", " Langfuse SECRET key (sk-lf-...): ")

# Compare case-insensitively, but keep the original casing of a pasted URL.
region = input(" Region: EU (default) / US / or paste a self-hosted URL: ").strip()
if region.lower().startswith("http"):
    HOST = region
elif region.lower() in ("2", "us"):
    HOST = "https://us.cloud.langfuse.com"
else:
    HOST = "https://cloud.langfuse.com"
os.environ["LANGFUSE_HOST"] = HOST

OPENAI_API_KEY = getpass(" OpenAI key (optional, press Enter to skip): ").strip()
if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
USE_OPENAI = bool(OPENAI_API_KEY)
DEFAULT_MODEL = "gpt-4o-mini" if USE_OPENAI else "mock-llm-v1"

from langfuse import get_client, observe, propagate_attributes, Evaluation

langfuse = get_client()
assert langfuse.auth_check(), "Auth failed: double-check keys/region."
print(f"\n✅ Connected to Langfuse at {HOST}")
print(f" LLM backend: {'OpenAI (' + DEFAULT_MODEL + ')' if USE_OPENAI else 'built-in mock'}\n")
We begin by installing the required Langfuse and OpenAI packages inside the Colab environment. We then collect Langfuse credentials, choose the correct Langfuse region or self-hosted URL, and optionally accept an OpenAI API key. We finally initialize the Langfuse client, verify authentication, and confirm whether we are using OpenAI or the built-in mock LLM.
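The region prompt accepts three kinds of input: nothing (EU default), "US", or a full self-hosted URL. As a minimal sketch of that decision, we can isolate it in a small pure function that is easy to test without any credentials. The name resolve_host and the two constants are hypothetical helpers for illustration and are not part of the Langfuse SDK; the two cloud URLs are the ones used in the setup code.

```python
# Hypothetical helper for illustration; not part of the Langfuse SDK.
EU_HOST = "https://cloud.langfuse.com"
US_HOST = "https://us.cloud.langfuse.com"


def resolve_host(region: str) -> str:
    """Map free-form user input to a Langfuse base URL."""
    choice = region.strip()
    lowered = choice.lower()
    if lowered.startswith(("http://", "https://")):
        # Self-hosted URL: keep the user's casing, drop a trailing slash.
        return choice.rstrip("/")
    if lowered in ("us", "2"):
        return US_HOST
    # Anything else, including empty input, falls back to the EU cloud.
    return EU_HOST


for raw in ["", "US", "https://langfuse.example.com/"]:
    print(repr(raw), "->", resolve_host(raw))
```

Keeping the mapping separate from input() and os.environ makes the fallback behavior explicit: any unrecognized answer resolves to the EU host rather than raising an error.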
if USE_OPENAI:
    # Langfuse's drop-in OpenAI wrapper, so real completions are traced.
    from langfuse.openai import openai

_MOCK_FACTS = {
    "france": "Paris", "germany": "Berlin", "japan": "Tokyo",
    "italy": "Rome", "spain": "Madrid", "india": "New Delhi",
}

def _mock_answer(user_text: str) -> str:
    t = user_text.lower()
    for country, capital in _MOCK_FACTS.items():
        if country in t:
            return capital
    if "langfuse" in t:
        return ("Langfuse is an open-source LLM engineering platform for "
                "observability, prompt management, evaluation and datasets.")
    return "This is a mock response. Provide an OpenAI key for real generations."

def llm_chat(messages, *, model=DEFAULT_MODEL, temperature=0.3,
             name=None, langfuse_prompt=None) -> str:
    """Return assistant text; the call is traced as a Langfuse generation."""
    if USE_OPENAI:
        kwargs = dict(model=model, messages=messages, temperature=temperature)
        if name: kwargs["name"] = name
        if langfuse_prompt: kwargs["langfuse_prompt"] = langfuse_prompt
        resp = openai.chat.completions.create(**kwargs)
        return resp.choices[0].message.content
    # Mock path: answer from the lookup table and record a generation by hand.
    last_user = next((m["content"] for m in reversed(messages)
                      if m["role"] == "user"), "")
    answer = _mock_answer(last_user)
    gen_kwargs = dict(as_type="generation", name=name or "mock-llm",
                      model=model, input=messages)
    if langfuse_prompt is not None:
        gen_kwargs["prompt"] = langfuse_prompt
    with langfuse.start_as_current_observation(**gen_kwargs) as gen:
        gen.update(output=answer,
                   usage_details={"input_tokens": 24, "output_tokens": 12})
    return answer

print("PART 1 ── Decorator tracing -------------------------------------------")

@observe()
def write_story(topic: str) -> str:
    return llm_chat(
        [{"role": "user", "content": f"Write a one-sentence story about {topic}."}],
        name="story-generation",
    )

@observe()
def story_pipeline(topic: str) -> str:
    return write_story(topic)

print(" →", story_pipeline("a debugging robot"))
We define the LLM helper that supports both real OpenAI generations and deterministic mock responses. We also make sure that even the mock path creates a proper Langfuse generation observation, so the tutorial remains fully traceable without an OpenAI key. We then demonstrate basic decorator-based tracing by wrapping a simple story-generation pipeline with @observe.
print("\nPART 2 ── Manual RAG trace --------------------------------------------")

_KB = {
    "refund": "Refunds are processed within 5–7 business days to the original method.",
    "warranty": "All products carry a 1-year limited manufacturer warranty.",
}

@observe(name="retrieve")
def retrieve(question: str):
    q = question.lower()
    # Keyword match on the KB keys; fall back to every entry when nothing matches.
    hits = [v for k, v in _KB.items() if k in q] or list(_KB.values())
    return hits[:2]

@observe(name="rag-pipeline")
def rag_pipeline(question: str, user_id="user-42", session_id="sess-001"):
    """Return (answer, trace_id) for one traced RAG call."""
    with propagate_attributes(user_id=user_id, session_id=session_id,
                              tags=["rag", "support-bot", "tutorial"]):
        context = "\n".join(retrieve(question))
        answer = llm_chat(
            [{"role": "system",
              "content": "Answer the question using ONLY the provided context."},
             {"role": "user",
              "content": f"Context:\n{context}\n\nQuestion: {question}"}],
            name="rag-answer",
        )
    # Read the trace ID while the observation is still active; once the function
    # returns there is no active observation left to read it from.
    return answer, langfuse.get_current_trace_id()

rag_answer, rag_trace_id = rag_pipeline("How long do refunds take?")
print(" →", rag_answer)
We build a small manual RAG pipeline using a simple in-memory knowledge base with refund and warranty entries. We trace the retrieval step separately and use propagate_attributes to attach user ID, session ID, and tags across the full trace. We then run a refund-related question and capture the trace ID so we can attach scores to it later.
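The retrieval step is worth looking at in isolation, because its fallback decides what the model sees when no keyword matches. The sketch below repeats the same keyword-lookup logic without any tracing so it can run anywhere. The kb and top_k parameters and the build_user_message helper are additions for illustration and do not appear in the tutorial code.

```python
KB = {
    "refund": "Refunds are processed within 5–7 business days to the original method.",
    "warranty": "All products carry a 1-year limited manufacturer warranty.",
}


def retrieve(question, kb=KB, top_k=2):
    """Return KB entries whose key occurs in the question, or all entries if none do."""
    q = question.lower()
    hits = [text for key, text in kb.items() if key in q]
    return (hits or list(kb.values()))[:top_k]


def build_user_message(question, kb=KB):
    """Assemble the user message in the same Context/Question layout as the pipeline."""
    context = "\n".join(retrieve(question, kb))
    return f"Context:\n{context}\n\nQuestion: {question}"


print(build_user_message("How long do refunds take?"))
print("---")
print(build_user_message("Do you ship abroad?"))  # no key matches, so all entries are used
```

With only two entries the fallback is harmless, but with a larger knowledge base it would pass unrelated context to the model, so a real retriever would more likely return nothing or rank by similarity.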
print("\nPART 3 ── Prompt management -------------------------------------------")

langfuse.create_prompt(
    name="support-agent",
    type="chat",
    prompt=[
        {"role": "system",
         "content": "You are a {{tone}} customer-support agent for {{company}}. "
                    "Be concise."},
        {"role": "user", "content": "{{question}}"},
    ],
    labels=["production"],
    config={"model": DEFAULT_MODEL, "temperature": 0.2},
)

prompt = langfuse.get_prompt("support-agent", type="chat")
compiled = prompt.compile(tone="friendly", company="Acme",
                          question="Do you offer express shipping?")
print(" compiled prompt:", compiled)

@observe(name="prompt-managed-call")
def answer_with_managed_prompt():
    return llm_chat(compiled, name="support-reply", langfuse_prompt=prompt)

print(" →", answer_with_managed_prompt())

print("\nPART 4 ── Scoring -----------------------------------------------------")

def keyword_overlap(answer: str, expected_keyword: str) -> float:
    return 1.0 if expected_keyword.lower() in (answer or "").lower() else 0.0

langfuse.create_score(
    name="groundedness",
    value=keyword_overlap(rag_answer, "5"),
    trace_id=rag_trace_id,
    data_type="NUMERIC",
    comment="Heuristic: mentions the documented refund window.",
)
langfuse.create_score(name="user_feedback", value="helpful",
                      trace_id=rag_trace_id, data_type="CATEGORICAL")
langfuse.create_score(name="resolved", value=1,
                      trace_id=rag_trace_id, data_type="BOOLEAN")

@observe(name="scored-call")
def scored_call():
    out = llm_chat([{"role": "user", "content": "What is the capital of Japan?"}],
                   name="capital-q")
    with langfuse.start_as_current_observation(as_type="span", name="grade") as span:
        span.score(name="correct", value=keyword_overlap(out, "Tokyo"),
                   data_type="NUMERIC")
        span.score_trace(name="trace_quality", value=0.9, data_type="NUMERIC")
    return out

print(" →", scored_call(), "(scores attached)")
We create a managed Langfuse chat prompt, compile it with runtime variables, and link the prompt version to a traced generation. We then add different score types to the earlier RAG trace, including numeric, categorical, and boolean scores. We also demonstrate inline scoring by grading a capital-city answer inside the current observed span and trace.
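To make the compile step concrete, the following sketch fills {{variable}} placeholders in a list of chat messages using a regular expression. It only illustrates the substitution idea: compile_chat is a hypothetical function, not the SDK's prompt.compile, and leaving unknown placeholders untouched is an assumption made here rather than documented Langfuse behavior.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_chat(messages, **variables):
    """Return a copy of the chat messages with {{name}} placeholders filled in."""
    def fill(text):
        # Assumption: a placeholder without a supplied value is left as-is.
        return _VAR.sub(lambda m: str(variables.get(m.group(1), m.group(0))), text)

    return [{"role": m["role"], "content": fill(m["content"])} for m in messages]


template = [
    {"role": "system",
     "content": "You are a {{tone}} customer-support agent for {{company}}. Be concise."},
    {"role": "user", "content": "{{question}}"},
]

compiled = compile_chat(template, tone="friendly", company="Acme",
                        question="Do you offer express shipping?")
for message in compiled:
    print(f"{message['role']}: {message['content']}")
```

Because the template lives in Langfuse and only the variables come from our code, the wording can change between prompt versions without touching the application, and the generation stays linked to the version that produced it.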
print("\nPART 5 ── Datasets & experiments --------------------------------------")

DATASET = "capital-cities-tutorial"
langfuse.create_dataset(name=DATASET, description="Capital-city QA benchmark")

_items = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Germany?", "Berlin"),
    ("What is the capital of Japan?", "Tokyo"),
    ("What is the capital of Italy?", "Rome"),
]
# Fixed IDs (cap-0, cap-1, ...) keep reruns from adding duplicate items.
for i, (q, a) in enumerate(_items):
    langfuse.create_dataset_item(dataset_name=DATASET, id=f"cap-{i}",
                                 input={"question": q}, expected_output=a)

def capital_task(*, item, **kwargs):
    question = item.input["question"] if isinstance(item.input, dict) else item.input
    return llm_chat([{"role": "user", "content": question}], name="experiment-answer")

# Item-level evaluator: 1.0 when the expected answer appears in the output.
def accuracy(*, input, output, expected_output, metadata=None, **kwargs):
    hit = bool(expected_output) and expected_output.lower() in (output or "").lower()
    return Evaluation(name="accuracy", value=1.0 if hit else 0.0,
                      comment="case-insensitive substring check")

# Item-level evaluator: response length in characters.
def conciseness(*, input, output, **kwargs):
    return Evaluation(name="char_length", value=float(len(output or "")))

# Run-level evaluator: average the per-item accuracy scores.
def mean_accuracy(*, item_results, **kwargs):
    vals = [e.value for r in item_results for e in r.evaluations if e.name == "accuracy"]
    avg = sum(vals) / len(vals) if vals else 0.0
    return Evaluation(name="mean_accuracy", value=avg, comment=f"{avg:.0%} correct")

dataset = langfuse.get_dataset(DATASET)
result = dataset.run_experiment(
    name="capitals-baseline",
    description="Baseline run from the Colab tutorial",
    task=capital_task,
    evaluators=[accuracy, conciseness],
    run_evaluators=[mean_accuracy],
    max_concurrency=4,
)
print(result.format())
We create a Langfuse dataset of capital-city questions and give each item a fixed ID, so rerunning the notebook updates the same items instead of adding duplicates. We define a task function that answers each item, item-level evaluators for accuracy and response length, and a run-level evaluator that averages accuracy across all items. We then run an experiment on the dataset and print a formatted summary of item-level and aggregate results.
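The split between item-level and run-level evaluation can be checked offline with plain Python. In this sketch, Eval and ItemResult are simplified stand-ins that we define ourselves for the SDK's evaluation and item-result objects, and the four outputs are made-up sample answers, not results from a real run.

```python
from dataclasses import dataclass, field


@dataclass
class Eval:
    """Simplified stand-in for an evaluation result (name plus numeric value)."""
    name: str
    value: float


@dataclass
class ItemResult:
    """Simplified stand-in for one experiment item and its evaluations."""
    output: str
    expected: str
    evaluations: list = field(default_factory=list)


def accuracy(output, expected):
    # Item level: 1.0 if the expected answer occurs in the output, ignoring case.
    hit = bool(expected) and expected.lower() in (output or "").lower()
    return Eval("accuracy", 1.0 if hit else 0.0)


def mean_accuracy(item_results):
    # Run level: average the accuracy values collected across all items.
    vals = [e.value for r in item_results for e in r.evaluations if e.name == "accuracy"]
    return Eval("mean_accuracy", sum(vals) / len(vals) if vals else 0.0)


# Made-up sample outputs: two correct, one wrong, one missing.
results = [
    ItemResult("Paris", "Paris"),
    ItemResult("The capital is berlin.", "Berlin"),
    ItemResult("I am not sure.", "Tokyo"),
    ItemResult(None, "Rome"),
]
for r in results:
    r.evaluations.append(accuracy(r.output, r.expected))

summary = mean_accuracy(results)
print(f"{summary.name}: {summary.value:.0%}")
```

A missing output counts as a miss rather than raising an error, and an empty result list averages to 0.0, which mirrors the guards in the tutorial's evaluators.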
if USE_OPENAI:
    print("\nPART 6 ── LangChain integration ---------------------------------------")
    pip_install("langchain-core", "langchain-openai")
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    from langfuse.langchain import CallbackHandler

    handler = CallbackHandler()
    chain = (ChatPromptTemplate.from_template("Explain {concept} in one sentence.")
             | ChatOpenAI(model="gpt-4o-mini", temperature=0))
    lc_out = chain.invoke({"concept": "observability"},
                          config={"callbacks": [handler]})
    print(" →", lc_out.content)
else:
    print("\nPART 6 ── LangChain integration skipped (no OpenAI key).")

# Send any buffered events before the notebook finishes.
langfuse.flush()
print("Open your project at", HOST)
print(" • Tracing tab .... Parts 1–4 traces (incl. user/session/tags)")
print(" • Prompts tab .... the versioned 'support-agent' prompt")
print(" • Scores ......... groundedness / user_feedback / resolved / accuracy")
print(" • Datasets tab ... '%s' with the 'capitals-baseline' experiment run" % DATASET)
We optionally demonstrate the LangChain integration when an OpenAI key is available, using the Langfuse callback handler to trace chain execution. If no OpenAI key is provided, we skip this section while keeping the rest of the tutorial fully functional. We finally flush all buffered events to Langfuse and print where to inspect traces, prompts, scores, and dataset experiment results.
In conclusion, we created a practical end-to-end Langfuse workflow that covers the most important parts of LLM observability and evaluation. We learned how to trace operations both automatically with decorators and manually with explicit observations, link prompt versions to generations, score outputs, and benchmark an application using datasets and experiments. We also kept the tutorial flexible by supporting both OpenAI-powered generation and a mock LLM path, making it easier to test the full pipeline in any environment. Finally, we saw how Langfuse helps us monitor LLM behavior, compare experiment runs, and build more reliable AI applications.