Build a Complete Langfuse Observability and Evaluation Pipeline for Tracing, Prompt Management, Scoring, and Experiments

In this tutorial, we build a pipeline with Langfuse, an open-source LLM engineering platform, covering tracing, prompt management, scoring, datasets, and experiments. The workflow runs with either a real OpenAI key or a deterministic mock LLM, so we can explore every major Langfuse feature without depending on paid model access. We start by setting up credentials and connecting to Langfuse. We then trace simple function calls, instrument a small RAG pipeline, manage prompts centrally, attach evaluation scores, and run dataset-based experiments. Along the way, we see how Langfuse helps us observe, evaluate, and improve LLM applications in a structured, repeatable way.

import subprocess, sys
def pip_install(*pkgs):
   subprocess.run([sys.executable, "-m", "pip", "install", "-qU", *pkgs], check=True)
pip_install("langfuse", "openai")
import os
from getpass import getpass
def _ask(var, prompt, secret=True, default=None):
   if os.environ.get(var):
       return os.environ[var]
   val = (getpass(prompt) if secret else input(prompt)).strip()
   if not val and default is not None:
       val = default
   os.environ[var] = val
   return val
print("Enter your Langfuse credentials (input is hidden):")
_ask("LANGFUSE_PUBLIC_KEY", "  Langfuse PUBLIC key (pk-lf-...): ")
_ask("LANGFUSE_SECRET_KEY", "  Langfuse SECRET key (sk-lf-...): ")
region = input("  Region: EU (default), US, or paste a self-hosted URL: ").strip()
if region.lower().startswith("http"):
   HOST = region  # keep a self-hosted URL exactly as typed (paths can be case-sensitive)
elif region.lower() in ("2", "us"):
   HOST = "https://us.cloud.langfuse.com"
else:
   HOST = "https://cloud.langfuse.com"
os.environ["LANGFUSE_HOST"] = HOST
OPENAI_API_KEY = (getpass("  OpenAI key (optional, press Enter to skip): ").strip())
if OPENAI_API_KEY:
   os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
USE_OPENAI = bool(OPENAI_API_KEY)
DEFAULT_MODEL = "gpt-4o-mini" if USE_OPENAI else "mock-llm-v1"
from langfuse import get_client, observe, propagate_attributes, Evaluation
langfuse = get_client()
assert langfuse.auth_check(), "Auth failed — double-check keys/region."
print(f"\n✅ Connected to Langfuse at {HOST}")
print(f"   LLM backend: {'OpenAI (' + DEFAULT_MODEL + ')' if USE_OPENAI else 'built-in mock'}\n")

We begin by installing the required Langfuse and OpenAI packages inside the Colab environment. We then collect Langfuse credentials, choose the correct Langfuse region or self-hosted URL, and optionally accept an OpenAI API key. We finally initialize the Langfuse client, verify authentication, and confirm whether we are using OpenAI or the built-in mock LLM.
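
The region-selection logic can also be isolated into a small pure function, which makes it easy to check without typing into a prompt. The following is a minimal sketch: `resolve_host`, `EU_HOST`, and `US_HOST` are hypothetical names introduced here for illustration, and the two cloud URLs are the ones used in the setup code.

```python
EU_HOST = "https://cloud.langfuse.com"
US_HOST = "https://us.cloud.langfuse.com"

def resolve_host(region: str) -> str:
    """Map free-form region input to a Langfuse base URL (hypothetical helper)."""
    raw = region.strip()
    choice = raw.lower()
    if choice.startswith("http"):
        # Self-hosted URL: keep the casing as typed, drop a trailing slash.
        return raw.rstrip("/")
    if choice in ("2", "us"):
        return US_HOST
    # Anything else, including empty input, falls back to the EU cloud.
    return EU_HOST

print(resolve_host("US"))
print(resolve_host(""))
print(resolve_host("https://langfuse.example.internal/"))
```

Comparing on a lowercased copy while returning the original string is a deliberate choice in this sketch, since URL paths may be case-sensitive.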

if USE_OPENAI:
   from langfuse.openai import openai
_MOCK_FACTS = {
   "france": "Paris", "germany": "Berlin", "japan": "Tokyo",
   "italy": "Rome", "spain": "Madrid", "india": "New Delhi",
}
def _mock_answer(user_text: str) -> str:
   t = user_text.lower()
   for country, capital in _MOCK_FACTS.items():
       if country in t:
           return capital
   if "langfuse" in t:
       return ("Langfuse is an open-source LLM engineering platform for "
               "observability, prompt management, evaluation and datasets.")
   return "This is a mock response. Provide an OpenAI key for real generations."
def llm_chat(messages, *, model=DEFAULT_MODEL, temperature=0.3,
            name=None, langfuse_prompt=None) -> str:
   """Return assistant text; the call is traced as a Langfuse generation."""
   if USE_OPENAI:
       kwargs = dict(model=model, messages=messages, temperature=temperature)
       if name:            kwargs["name"] = name
       if langfuse_prompt: kwargs["langfuse_prompt"] = langfuse_prompt
       resp = openai.chat.completions.create(**kwargs)
       return resp.choices[0].message.content
   last_user = next((m["content"] for m in reversed(messages)
                     if m["role"] == "user"), "")
   answer = _mock_answer(last_user)
   gen_kwargs = dict(as_type="generation", name=name or "mock-llm",
                     model=model, input=messages)
   if langfuse_prompt is not None:
       gen_kwargs["prompt"] = langfuse_prompt
   with langfuse.start_as_current_observation(**gen_kwargs) as gen:
       gen.update(output=answer,
                  usage_details={"input_tokens": 24, "output_tokens": 12})
   return answer
print("PART 1 ── Decorator tracing -------------------------------------------")
@observe()
def write_story(topic: str) -> str:
   return llm_chat(
       [{"role": "user", "content": f"Write a one-sentence story about {topic}."}],
       name="story-generation",
   )
@observe()
def story_pipeline(topic: str) -> str:
   return write_story(topic)
print("  →", story_pipeline("a debugging robot"))

We define the LLM helper that supports both real OpenAI generations and deterministic mock responses. We also make sure that even the mock path creates a proper Langfuse generation observation, so the tutorial remains fully traceable without an OpenAI key. We then demonstrate basic decorator-based tracing by wrapping a simple story-generation pipeline with @observe.

print("\nPART 2 ── Manual RAG trace --------------------------------------------")
_KB = {
   "refund": "Refunds are processed within 5–7 business days to the original method.",
   "warranty": "All products carry a 1-year limited manufacturer warranty.",
}
@observe(name="retrieve")
def retrieve(question: str):
   q = question.lower()
   hits = [v for k, v in _KB.items() if k in q] or list(_KB.values())
   return hits[:2]
_trace_ids = {}  # filled from inside traced functions
@observe(name="rag-pipeline")
def rag_pipeline(question: str, user_id="user-42", session_id="sess-001") -> str:
   with propagate_attributes(user_id=user_id, session_id=session_id,
                             tags=["rag", "support-bot", "tutorial"]):
       # Record the trace ID while the observed span is still active; once the
       # function has returned there is no current trace left to read.
       _trace_ids["rag"] = langfuse.get_current_trace_id()
       context = "\n".join(retrieve(question))
       return llm_chat(
           [{"role": "system",
             "content": "Answer the question using ONLY the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"}],
           name="rag-answer",
       )
rag_answer = rag_pipeline("How long do refunds take?")
rag_trace_id = _trace_ids["rag"]
print("  →", rag_answer)

We build a small manual RAG pipeline using a simple in-memory knowledge base with refund and warranty information. We trace the retrieval step separately and use propagate_attributes to attach user ID, session ID, and tags across the full trace. We then run a refund-related question and capture the trace ID so we can attach scores to it later.
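
The retrieval step is a plain keyword lookup with a fallback, which is easy to miss when reading the traced version. The sketch below is an untraced copy for experimentation; `retrieve_plain` and its `top_k` parameter are hypothetical names, and the knowledge base matches the one in the tutorial code.

```python
KB = {
    "refund": "Refunds are processed within 5–7 business days to the original method.",
    "warranty": "All products carry a 1-year limited manufacturer warranty.",
}

def retrieve_plain(question: str, kb=KB, top_k: int = 2):
    """Untraced copy of the keyword lookup (hypothetical name)."""
    q = question.lower()
    hits = [text for key, text in kb.items() if key in q]
    # No key matched: fall back to every entry so the context is never empty.
    return (hits or list(kb.values()))[:top_k]

print(len(retrieve_plain("How long do refunds take?")))  # matches "refund" only
print(len(retrieve_plain("Do you ship abroad?")))         # no match, falls back
```

Because of the fallback, a question that matches no key still receives context, so the model may be handed passages unrelated to the question. That is acceptable for a demo but worth revisiting in a real retriever.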

print("\nPART 3 ── Prompt management -------------------------------------------")
langfuse.create_prompt(
   name="support-agent",
   type="chat",
   prompt=[
       {"role": "system",
        "content": "You are a {{tone}} customer-support agent for {{company}}. "
                   "Be concise."},
       {"role": "user", "content": "{{question}}"},
   ],
   labels=["production"],
   config={"model": DEFAULT_MODEL, "temperature": 0.2},
)
prompt = langfuse.get_prompt("support-agent", type="chat")
compiled = prompt.compile(tone="friendly", company="Acme",
                         question="Do you offer express shipping?")
print("  compiled prompt:", compiled)
@observe(name="prompt-managed-call")
def answer_with_managed_prompt():
   return llm_chat(compiled, name="support-reply", langfuse_prompt=prompt)
print("  →", answer_with_managed_prompt())
print("\nPART 4 ── Scoring -----------------------------------------------------")
def keyword_overlap(answer: str, expected_keyword: str) -> float:
   return 1.0 if expected_keyword.lower() in (answer or "").lower() else 0.0
langfuse.create_score(
   name="groundedness",
   value=keyword_overlap(rag_answer, "5"),
   trace_id=rag_trace_id,
   data_type="NUMERIC",
   comment="Heuristic: mentions the documented refund window.",
)
langfuse.create_score(name="user_feedback", value="helpful",
                     trace_id=rag_trace_id, data_type="CATEGORICAL")
langfuse.create_score(name="resolved", value=1,
                     trace_id=rag_trace_id, data_type="BOOLEAN")
@observe(name="scored-call")
def scored_call():
   out = llm_chat([{"role": "user", "content": "What is the capital of Japan?"}],
                  name="capital-q")
   with langfuse.start_as_current_observation(as_type="span", name="grade") as span:
       span.score(name="correct", value=keyword_overlap(out, "Tokyo"),
                  data_type="NUMERIC")
       span.score_trace(name="trace_quality", value=0.9, data_type="NUMERIC")
   return out
print("  →", scored_call(), "(scores attached)")

We create a managed Langfuse chat prompt, compile it with runtime variables, and link the prompt version to a traced generation. We then add different score types to the earlier RAG trace, including numeric, categorical, and boolean scores. We also demonstrate inline scoring by grading a capital-city answer inside the current observed span and trace.
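
Compiling a chat prompt amounts to filling the `{{variable}}` placeholders in each message. The following stdlib-only sketch illustrates that idea; it is not Langfuse's implementation, `compile_chat` is a hypothetical name, and leaving unknown placeholders untouched is a choice made for this example rather than a statement about the SDK.

```python
import re

_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def compile_chat(messages, **variables):
    """Fill {{name}} placeholders in each message; unknown names stay as-is."""
    def fill(text):
        return _VAR.sub(
            lambda m: str(variables.get(m.group(1), m.group(0))), text
        )
    return [{"role": m["role"], "content": fill(m["content"])} for m in messages]

template = [
    {"role": "system",
     "content": "You are a {{tone}} customer-support agent for {{company}}. Be concise."},
    {"role": "user", "content": "{{question}}"},
]
filled = compile_chat(template, tone="friendly", company="Acme",
                      question="Do you offer express shipping?")
for message in filled:
    print(message["role"], "->", message["content"])
```

The result is an ordinary list of role/content dictionaries, which is why the compiled prompt in the tutorial can be passed directly to `llm_chat` as its `messages` argument.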

print("\nPART 5 ── Datasets & experiments --------------------------------------")
DATASET = "capital-cities-tutorial"
langfuse.create_dataset(name=DATASET, description="Capital-city QA benchmark")
_items = [
   ("What is the capital of France?",  "Paris"),
   ("What is the capital of Germany?", "Berlin"),
   ("What is the capital of Japan?",   "Tokyo"),
   ("What is the capital of Italy?",   "Rome"),
]
for i, (q, a) in enumerate(_items):
   langfuse.create_dataset_item(dataset_name=DATASET, id=f"cap-{i}",
                                input={"question": q}, expected_output=a)
def capital_task(*, item, **kwargs):
   question = item.input["question"] if isinstance(item.input, dict) else item.input
   return llm_chat([{"role": "user", "content": question}], name="experiment-answer")
def accuracy(*, input, output, expected_output, metadata=None, **kwargs):
   hit = bool(expected_output) and expected_output.lower() in (output or "").lower()
   return Evaluation(name="accuracy", value=1.0 if hit else 0.0,
                     comment="exact-match contains check")
def conciseness(*, input, output, **kwargs):
   return Evaluation(name="char_length", value=float(len(output or "")))
def mean_accuracy(*, item_results, **kwargs):
   vals = [e.value for r in item_results for e in r.evaluations if e.name == "accuracy"]
   avg = sum(vals) / len(vals) if vals else 0.0
   return Evaluation(name="mean_accuracy", value=avg, comment=f"{avg:.0%} correct")
dataset = langfuse.get_dataset(DATASET)
result = dataset.run_experiment(
   name="capitals-baseline",
   description="Baseline run from the Colab tutorial",
   task=capital_task,
   evaluators=[accuracy, conciseness],
   run_evaluators=[mean_accuracy],
   max_concurrency=4,
)
print(result.format())

We create a Langfuse dataset for capital-city questions and add items with fixed IDs, so re-running the cell updates the same items instead of creating duplicates. We define a task function that answers each item, item-level evaluators for accuracy and response length, and a run-level evaluator that averages accuracy across all items. We then run an experiment on the dataset and print a formatted summary of item-level and aggregate results.
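
The relationship between the task, the item-level evaluators, and the run-level aggregate can be sketched without any SDK. Everything in the block below (`ToyEval`, `ToyItemResult`, `run_toy_experiment`, `mean_of`, and the deliberately incomplete `toy_task`) is a hypothetical stand-in for illustration and makes no claim about how Langfuse schedules or stores experiment runs.

```python
from dataclasses import dataclass, field

@dataclass
class ToyEval:
    name: str
    value: float

@dataclass
class ToyItemResult:
    output: str
    evaluations: list = field(default_factory=list)

ITEMS = [
    {"input": {"question": "What is the capital of France?"}, "expected_output": "Paris"},
    {"input": {"question": "What is the capital of Japan?"}, "expected_output": "Tokyo"},
    {"input": {"question": "What is the capital of Italy?"}, "expected_output": "Rome"},
]

def toy_task(item):
    # Stand-in model that only knows two capitals, so one item fails on purpose.
    known = {"france": "Paris", "japan": "Tokyo"}
    q = item["input"]["question"].lower()
    return next((cap for country, cap in known.items() if country in q), "unknown")

def accuracy(output, expected_output):
    hit = expected_output.lower() in output.lower()
    return ToyEval("accuracy", 1.0 if hit else 0.0)

def run_toy_experiment(items, task, evaluators):
    """Run the task on every item, then apply each item-level evaluator."""
    results = []
    for item in items:
        output = task(item)
        evals = [ev(output, item["expected_output"]) for ev in evaluators]
        results.append(ToyItemResult(output, evals))
    return results

def mean_of(results, name):
    """Run-level aggregate: average one named score across all items."""
    vals = [e.value for r in results for e in r.evaluations if e.name == name]
    return sum(vals) / len(vals) if vals else 0.0

results = run_toy_experiment(ITEMS, toy_task, [accuracy])
print(f"mean accuracy: {mean_of(results, 'accuracy'):.0%}")
```

Two of the three items pass here, so the aggregate lands at roughly two thirds. The same two-level structure, per-item scores followed by a run-level summary, is what the experiment summary in the tutorial reports.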

if USE_OPENAI:
   print("\nPART 6 ── LangChain integration ---------------------------------------")
   pip_install("langchain-core", "langchain-openai")
   from langchain_openai import ChatOpenAI
   from langchain_core.prompts import ChatPromptTemplate
   from langfuse.langchain import CallbackHandler
   handler = CallbackHandler()
   chain = (ChatPromptTemplate.from_template("Explain {concept} in one sentence.")
            | ChatOpenAI(model="gpt-4o-mini", temperature=0))
   lc_out = chain.invoke({"concept": "observability"},
                         config={"callbacks": [handler]})
   print("  →", lc_out.content)
else:
   print("\nPART 6 ── LangChain integration skipped (no OpenAI key).")
langfuse.flush()
print("Open your project at", HOST)
print("   • Tracing tab .... Parts 1–4 traces (incl. user/session/tags)")
print("   • Prompts tab .... the versioned 'support-agent' prompt")
print("   • Scores ......... groundedness / user_feedback / resolved / accuracy")
print("   • Datasets tab ... '%s' with the 'capitals-baseline' experiment run" % DATASET)

We optionally demonstrate the LangChain integration when an OpenAI key is available, using the Langfuse callback handler to trace chain execution. If no OpenAI key is provided, we skip this section while keeping the rest of the tutorial fully functional. We finally flush all buffered events to Langfuse and print where to inspect traces, prompts, scores, and dataset experiment results.

In conclusion, we created a practical end-to-end Langfuse workflow that covers the core parts of LLM observability and evaluation. We learned how to trace both decorator-based and manually instrumented operations, link prompt versions to generations, score outputs, and benchmark an application using datasets and experiments. We also kept the tutorial flexible by supporting both OpenAI-powered generation and a mock LLM path, which makes it possible to test the full pipeline without an OpenAI key. Overall, we gained an understanding of how Langfuse helps us monitor LLM behavior, compare experiment runs, and build more reliable AI applications.

The post Build a Complete Langfuse Observability and Evaluation Pipeline for Tracing, Prompt Management, Scoring, and Experiments appeared first on MarkTechPost.
