Langfuse
Production observability for LLM applications — traces every inference, scores outputs, and gives you the audit trail that pre-production testing tools can’t provide.
GitHub: langfuse/langfuse · ~28k stars · MIT (self-host) / Commercial (cloud)
Docs: langfuse.com/docs
Cloud: EU Cloud at cloud.langfuse.com · US Cloud at us.cloud.langfuse.com
Company: Langfuse GmbH, Berlin, Germany
At a Glance
| Dimension | Rating |
|---|---|
| Setup effort | ● ● ○ ○ ○ |
| Coding required | ● ● ○ ○ ○ (SDK decorator pattern) |
| RAG evaluation | ● ● ● ○ ○ (scoring, not metrics) |
| Conversational / general LLM | ● ● ● ● ○ |
| Red teaming / adversarial | ○ ○ ○ ○ ○ |
| Production observability | ● ● ● ● ● |
| EU AI Act support | ● ● ● ● ○ |
| Cost at scale | Free (self-host) / Usage-based (cloud) |
Best For
- Production LLM monitoring — tracing every inference, input, output, latency, and cost in a live application
- Human review pipelines — annotating traces with scores, building labeling queues for quality teams
- Eval + observability in one system — linking offline evaluation results back to production traces
- Teams with EU data residency requirements: EU Cloud data stays in an AWS region inside the EU, and Langfuse GmbH is a German company subject to GDPR
- Multi-step and agentic tracing — traces nest arbitrarily; a single user request that spawns 5 LLM calls shows as a unified trace tree
- Online evaluation — automatically scoring production outputs using LLM-as-judge or rule-based scorers
Not Great For
- Pre-production red teaming — Langfuse observes what happens in production; use promptfoo or inspect_ai to find vulnerabilities before deploy
- Deep RAG metric analysis — Langfuse can score RAG outputs, but it doesn’t compute faithfulness or context precision natively; use RAGAS to generate scores and send them to Langfuse
- One-off quality checks without a running application — Langfuse requires an instrumented application generating traces; it’s not a batch eval CLI
- Teams that want zero cloud dependency without operational overhead: avoiding the cloud means self-hosting with Docker Compose or Kubernetes (feasible, but non-trivial)
How It Works
Langfuse instruments your code through an SDK decorator that wraps your LLM calls and sends traces to the Langfuse backend:

    # Import path from Python SDK v2; newer SDK versions may expose `observe`
    # elsewhere, so check the docs for the version you install.
    from langfuse.decorators import observe, langfuse_context
    from langfuse.openai import openai  # drop-in OpenAI client

    @observe()
    def my_rag_pipeline(question: str) -> str:
        # Every LLM call inside here is automatically traced.
        # retrieve_documents is your own retrieval function (not shown).
        context = retrieve_documents(question)
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"Use this context: {context}"},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

Every call produces a trace with: input, output, latency, token usage, cost, model used — visible in the Langfuse dashboard.
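To make the captured fields concrete, here is a toy sketch of what a tracing decorator does conceptually. It is not Langfuse's implementation: `toy_observe`, `CAPTURED_TRACES`, and `answer` are hypothetical names, and records go to an in-memory list instead of a backend. Token usage and cost are left out because they depend on the model provider's response.

```python
import functools
import time
from typing import Any, Callable

# Hypothetical in-memory sink; a real SDK would send records to a backend.
CAPTURED_TRACES: list[dict[str, Any]] = []


def toy_observe(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record the input, output, and latency of each call to `func`."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        output = func(*args, **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000.0
        CAPTURED_TRACES.append(
            {
                "name": func.__name__,
                "input": {"args": args, "kwargs": kwargs},
                "output": output,
                "latency_ms": latency_ms,
            }
        )
        return output

    return wrapper


@toy_observe
def answer(question: str) -> str:
    # Stand-in for a real LLM call.
    return f"echo: {question}"


result = answer("What is tracing?")
```

After the call, `CAPTURED_TRACES` holds one record for `answer`, which is roughly the shape of data a dashboard would then display.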
Scoring: Attach scores to any trace programmatically or via the UI:

    # Call this inside a function decorated with @observe() so a current trace exists.
    langfuse_context.score_current_trace(
        name="faithfulness",
        value=0.87,
        comment="Grounded in retrieved context",
    )

This is how RAGAS scores get linked back to production traces.
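As a rough illustration of that linkage, the sketch below turns offline metric results into per-trace score records. It assumes you stored the trace ID alongside each evaluated sample and that metric values fall in the range 0 to 1. `ScoreRecord` and `to_score_records` are hypothetical helpers, not part of the Langfuse or RAGAS APIs; the field names simply mirror the `name`, `value`, and `comment` arguments shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    trace_id: str
    name: str
    value: float
    comment: str = ""


def to_score_records(
    trace_id: str, metrics: dict[str, float], source: str
) -> list[ScoreRecord]:
    """Build one score record per metric for a single trace."""
    records = []
    for name, value in sorted(metrics.items()):
        # Assumption: scores are normalized to [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        records.append(
            ScoreRecord(trace_id, name, float(value), comment=f"source: {source}")
        )
    return records


records = to_score_records(
    "trace-123",  # hypothetical trace ID
    {"faithfulness": 0.87, "answer_relevancy": 0.91},
    source="offline eval",
)
```

Each record could then be passed to whatever scoring call your SDK version provides.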
Trace Hierarchy
Langfuse traces nest naturally for multi-step and agentic workflows:

    Trace: user_question
    ├── Span: retrieve_documents (latency: 120ms)
    ├── Generation: gpt-4o call (tokens: 450, cost: $0.002)
    ├── Score: faithfulness = 0.87
    └── Score: answer_relevancy = 0.91

For agent harnesses with subagents, each agent’s work appears as a nested span. This is the foundation for the EU AI Act audit trail use case.
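The nesting can be pictured as an ordinary tree. The following is a minimal sketch of such a structure and a renderer for it; `Node` and `render` are hypothetical names chosen for illustration, the agent names are made up, and none of this reflects Langfuse's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # e.g. "Trace", "Span", "Generation"
    name: str
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def render(
    node: Node, prefix: str = "", is_root: bool = True, is_last: bool = True
) -> list[str]:
    """Return the tree as a list of text lines, one per node."""
    if is_root:
        lines = [f"{node.kind}: {node.name}"]
        child_prefix = ""
    else:
        connector = "└── " if is_last else "├── "
        lines = [f"{prefix}{connector}{node.kind}: {node.name}"]
        child_prefix = prefix + ("    " if is_last else "│   ")
    for index, child in enumerate(node.children):
        last_child = index == len(node.children) - 1
        lines.extend(render(child, child_prefix, False, last_child))
    return lines


trace = Node("Trace", "user_request")
planner = trace.add(Node("Span", "planner_agent"))
planner.add(Node("Generation", "plan_steps"))
researcher = trace.add(Node("Span", "research_subagent"))
researcher.add(Node("Generation", "summarize"))

print("\n".join(render(trace)))
```

Each subagent becomes a span under the root trace, and its LLM calls sit one level deeper.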
Tradeoffs vs. Alternatives
vs. LangSmith: Both are LLM observability platforms. Langfuse is open source (self-host for free) and German-based (strong EU data residency story). LangSmith is commercial-only and US-based. Feature parity is high; the decision usually comes down to data residency requirements and vendor preference. Langfuse’s self-hosting option is a strong differentiator for regulated industries.
vs. DeepEval / RAGAS: Complementary, not competing. DeepEval and RAGAS run offline evals before deployment. Langfuse traces what actually happens in production and can apply the same scoring logic to live traffic. Typical stack: RAGAS/DeepEval pre-deploy → Langfuse post-deploy.
vs. Braintrust: Braintrust has stronger step-efficiency metrics for agentic evals. Langfuse has a better self-hosting story and stronger EU compliance positioning. Both support datasets, experiments, and scoring.
Integration Effort
Time to first trace: 20–30 minutes
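Install the SDK:

    pip install langfuse

Then set your credentials and host in Python before creating any clients:

    import os

    os.environ["LANGFUSE_PUBLIC_KEY"] = "..."
    os.environ["LANGFUSE_SECRET_KEY"] = "..."
    # EU Cloud (use https://us.cloud.langfuse.com for US Cloud)
    os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"
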
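If you deploy to both regions, a small helper can keep the host choice in one place. This is only a sketch: `LANGFUSE_HOSTS` and `configure_langfuse_env` are hypothetical names, the two URLs are the cloud hosts listed at the top of this page, and the environment variable names are the ones used in the setup snippet.

```python
import os
from collections.abc import MutableMapping
from typing import Optional

# Cloud hosts as listed for the EU and US regions in this document.
LANGFUSE_HOSTS = {
    "eu": "https://cloud.langfuse.com",
    "us": "https://us.cloud.langfuse.com",
}


def configure_langfuse_env(
    region: str,
    public_key: str,
    secret_key: str,
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Write Langfuse settings into `env` (defaults to os.environ)."""
    if region not in LANGFUSE_HOSTS:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(LANGFUSE_HOSTS)}")
    target = os.environ if env is None else env
    target["LANGFUSE_PUBLIC_KEY"] = public_key
    target["LANGFUSE_SECRET_KEY"] = secret_key
    target["LANGFUSE_HOST"] = LANGFUSE_HOSTS[region]
    return target


# Passing a plain dict keeps the example free of side effects.
settings = configure_langfuse_env("us", "pk-example", "sk-example", env={})
```

Rejecting unknown regions early avoids silently sending traces to the wrong data region.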
Works with: OpenAI SDK (drop-in wrapper), LangChain, LlamaIndex, LiteLLM, Anthropic SDK, raw HTTP via API
Languages: Python, TypeScript/JavaScript, other languages via REST API
Cost at Scale
| Option | Cost |
|---|---|
| Self-hosted (Docker Compose) | Free — requires a server with ~2GB RAM |
| Self-hosted (Kubernetes, Helm) | Free — requires K8s cluster |
| Cloud (Hobby tier) | Free up to 50k observations/month |
| Cloud (Pro) | $29/month + usage beyond free tier |
| Cloud (Enterprise) | Custom pricing, SSO, audit logs, dedicated support |
Self-hosting is genuinely viable for most teams — Langfuse provides a maintained Docker Compose setup and Helm chart.
EU AI Act Relevance
Strong. This is Langfuse’s most important compliance differentiator in the eval ecosystem.
- Article 9 (Risk management): Production tracing provides continuous, automated risk monitoring throughout the operational lifecycle, not just at deploy time. Score thresholds support incident detection (e.g., alert when faithfulness drops below 0.7).
- Article 10 (Data governance): Traces create an auditable record of every inference, including inputs, outputs, and metadata. Data retention is configurable.
- Article 13 (Transparency): Trace data can support user-facing transparency requirements — you can demonstrate what information the AI used to generate a response.
- Article 17 (Quality management): Online scoring pipelines implement continuous quality monitoring as required by quality management system obligations.
- Annex IV (Technical documentation): Trace exports (JSONL, CSV) provide machine-readable evidence of system behavior over time.
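The score-threshold idea from the Article 9 item can be sketched as a rolling-mean check. This is an illustrative, standalone example under assumed parameters (a window of three scores and the 0.7 threshold mentioned above); `ScoreThresholdMonitor` is a hypothetical class, not a Langfuse feature or API.

```python
from collections import deque
from statistics import fmean


class ScoreThresholdMonitor:
    """Flag when the rolling mean of recent scores falls below a threshold."""

    def __init__(self, threshold: float, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        # deque with maxlen drops the oldest score automatically.
        self._scores: deque[float] = deque(maxlen=window)

    def add(self, value: float) -> bool:
        """Record a score; return True if the rolling mean is below the threshold."""
        self._scores.append(value)
        return fmean(self._scores) < self.threshold


monitor = ScoreThresholdMonitor(threshold=0.7, window=3)
alerts = [monitor.add(score) for score in [0.9, 0.8, 0.6, 0.5, 0.4]]
print(alerts)  # [False, False, False, True, True]
```

Averaging over a window instead of alerting on single scores reduces noise from one-off low outputs; the right window size depends on your traffic volume.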
EU data residency: The EU Cloud instance (cloud.langfuse.com) runs on AWS inside the EU (eu-west-1, Ireland, per Langfuse’s data-region documentation). Langfuse GmbH is a German company subject to GDPR. A Data Processing Agreement (DPA) is available at langfuse.com/security/dpa and applies to all subscription tiers, including free.
Self-hosting for maximum data control: For Annex III high-risk AI systems where data sovereignty is non-negotiable, self-hosting ensures no data leaves your infrastructure.
Version Tracked
- Current stable: v3.x
- Last verified: 2026-05-20
- Self-host note: Docker Compose setup is well-maintained. The `docker-compose.yml` in the repo is the recommended starting point. Helm chart for Kubernetes is available in the `langfuse/langfuse-k8s` repo.