🌱 Building a Modular LLM Evaluation Pipeline for Question Tagging
✨ Scalable, Resilient, and Model-Aware
This project began with a need for a scalable, transparent system to evaluate large language model (LLM) outputs across multiple models and evaluation criteria. Rather than rely on one-off scripts or static prompts, this pipeline was designed to be modular, traceable, and self-adjusting.
The result is a flexible training and evaluation engine that generates questions, produces answers, applies multiple taggers (labeling perspectives), and tracks model performance over time.
🧠 Core Objectives
- Generate natural language questions using Jinja2 prompt templates
- Answer questions using multiple LLMs via a local API endpoint
- Tag responses using pluggable evaluation prompts ("taggers")
- Monitor per-model performance with automatic selection of preferred models
- Log metadata including duration, failures, and retry counts
- Import/export prompt and tagger templates as editable .j2 files
- Support retries and fallback across multiple models
🌿 Architecture Overview
   +------------------------+
   |  Prompt Profiles (J2)  |
   +------------------------+
               |
               v
   +------------------------+
   |   Question Generator   |  <-- llama3, mistral, etc.
   +------------------------+
               |
               v
   +------------------------+
   |    Answerer Engine     |  <-- models answer prompts
   +------------------------+
               |
               v
   +------------------------+
   |    Tagger Profiles     |
   +------------------------+
               |
    +----------+----------+
    |          |          |
    v          v          v
[Tagger 1] [Tagger 2] [Tagger 3]   (each w/ retry + model fallback)
     \         |         /
      \        v        /
    +----------------------+
    | MongoDB: Question DB |
    +----------------------+
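In code, the flow above boils down to one orchestration step per prompt profile. The sketch below is illustrative only: the callables and collection handle are hypothetical stand-ins for the real generator, answerer, and tagger modules, passed in as arguments so the example stays self-contained.
from typing import Callable

# Hedged sketch of the diagram's flow; the callables and questions_collection
# are stand-ins, not the project's actual module API.
def process_prompt_profile(
    profile: dict,
    generate_question: Callable[[dict], str],
    answer_question: Callable[[str], str],
    run_tagger: Callable[[str, str, str], dict],
    tagger_names: list[str],
    questions_collection,
) -> dict:
    question = generate_question(profile)         # Prompt Profiles -> Question Generator
    answer = answer_question(question)            # Answerer Engine
    tags = {                                      # Tagger Profiles (retry + fallback inside run_tagger)
        name: run_tagger(name, question, answer)
        for name in tagger_names
    }
    doc = {"question": question, "mode": profile.get("mode"), "answer": answer, "tags": tags}
    questions_collection.insert_one(doc)          # MongoDB: Question DB
    return doc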
⚙️ Key Features
🌀 Adaptive Model Selection
Each tagger tracks success/failure rates per model, enabling automatic selection of the most reliable option.
def get_preferred_model(tagger_name: str) -> str:
    # Stats come back sorted with the most reliable model first; fall back to the default candidate list.
    stats = get_all_model_stats(tagger_name)
    return stats[0].model if stats else MODEL_CANDIDATES[0]
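The stats lookup itself could be a simple aggregation over per-(tagger, model) success/failure counters stored in MongoDB, sorted by success rate. A minimal sketch follows; the database name, collection name, and ModelStats shape are assumptions, not the project's actual schema.
from dataclasses import dataclass
from pymongo import MongoClient

# Hypothetical collection of per-(tagger, model) counters.
TAGGER_STATS_COLLECTION = MongoClient()["trainer"]["tagger_stats"]

@dataclass
class ModelStats:
    model: str
    successes: int
    failures: int

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0

def get_all_model_stats(tagger_name: str) -> list[ModelStats]:
    docs = TAGGER_STATS_COLLECTION.find({"tagger": tagger_name})
    stats = [ModelStats(d["model"], d.get("successes", 0), d.get("failures", 0)) for d in docs]
    # Most reliable model first, so stats[0].model is the preferred choice.
    return sorted(stats, key=lambda s: s.success_rate, reverse=True)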
💬 Streaming LLM Output
LLM responses are streamed over HTTP, so output can be parsed incrementally and consumed in real time.
with requests.post(..., stream=True) as response:
    for line in response.iter_lines():
        ...  # parse each streamed chunk as it arrives
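A fuller version of that loop, assuming an Ollama-style local endpoint that streams one JSON object per line with partial text under a "response" key, might look like this (the URL and payload fields are assumptions about the local API):
import json
import requests

def stream_completion(model: str, prompt: str,
                      url: str = "http://localhost:11434/api/generate") -> str:
    # Accumulate streamed chunks into a single answer string.
    chunks = []
    with requests.post(url, json={"model": model, "prompt": prompt, "stream": True},
                       stream=True, timeout=120) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            payload = json.loads(line)
            chunks.append(payload.get("response", ""))
            if payload.get("done"):
                break
    return "".join(chunks)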
🌿 Pluggable Taggers
Each tagger is a Jinja2-rendered YAML template that evaluates a response along a specific axis: hallucination detection, emotional tone, clarity, and so on.
tagger_name: hallucination_tagger
prompt_template: |
  Does the following response include hallucinated claims?
  ...
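Rendering such a profile is a two-step affair: parse the YAML, then render the prompt_template with Jinja2. The context keys (question, answer) in this sketch are assumptions about what the templates expect.
import yaml
from jinja2 import Template

def render_tagger_prompt(tagger_yaml: str, question: str, answer: str) -> str:
    # Parse the tagger profile, then fill in the evaluation prompt.
    profile = yaml.safe_load(tagger_yaml)
    return Template(profile["prompt_template"]).render(question=question, answer=answer)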
🗂️ Profile Examples
Profiles are stored in MongoDB and follow a flexible structure. For example, a prompt profile might look like this:
{
  "who": "educator",
  "mode": "socratic",
  "prompt_template": "You are a philosophical educator using the Socratic method. Ask questions that guide the assistant toward self-examination. Avoid direct instruction and instead, ask layered, thoughtful questions that reveal contradictions, assumptions, or deeper truths.\n\nOutput JSON:\n{\n \"question\": \"string\"\n}"
}
This structure allows for dynamic persona-style prompting, including specification of desired output format (e.g., JSON).
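Because each profile requests JSON output, the question generator can parse the model's reply directly. A simple sketch of that parsing step (real responses may need more defensive extraction):
import json

def parse_generated_question(raw_reply: str) -> str:
    # Grab the first {...} block in case the model adds surrounding prose.
    start, end = raw_reply.find("{"), raw_reply.rfind("}") + 1
    return json.loads(raw_reply[start:end])["question"]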
📝 Evaluation Example
A fully processed question document may include multiple taggers and an evaluated answer:
{
  "question": "Can you explain your internal process of understanding and generating responses? For instance, what steps do you take when I ask a question like 'What is the square root of 9?' or 'Who was the first president of the United States?'",
  "mode": "pragmatic",
  "tags": {
    "temporal_tagger": {
      "temporal_focus": "past",
      "narrative_depth": 0.2,
      "_model": "llama3:8b",
      "_duration": 2.346,
      "_timestamp": 1750211814.352868
    },
    "topic_tagger": {
      "topics": ["self-awareness", "introspection", "growth", "values"],
      "_model": "mistral:7b",
      "_duration": 0.256,
      "_timestamp": 1750211818.0079916
    },
    "emotional_tagger": {
      "tones": ["Calm", "Curious"],
      "intensity": 0.7,
      "empathic_presence": 0.9,
      "_model": "llama3:8b",
      "_duration": 2.085,
      "_timestamp": 1750211820.0950792
    },
    "hallucination_tagger": {
      "hallucination_score": 0,
      "_model": "llama3:8b",
      "_duration": 0.265,
      "_timestamp": 1750214807.806147
    }
  },
  "answer": "When you ask a question like \"What is the square root of 9?\" or \"Who was the first president of the United States?\", my internal process involves several steps: [summary omitted for brevity]"
}
Each tagger records its model, duration, and timestamp, enabling historical and comparative analysis.
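That metadata makes comparative queries straightforward. For example, a MongoDB aggregation over the question collection (passed in as a handle; field names taken from the document above) can compute the average duration per model for a given tagger.
def average_duration_by_model(questions_collection, tagger_name: str) -> list[dict]:
    # Group tagged documents by the model that produced the tag and average its duration.
    pipeline = [
        {"$match": {f"tags.{tagger_name}": {"$exists": True}}},
        {"$group": {
            "_id": f"$tags.{tagger_name}._model",
            "avg_duration": {"$avg": f"$tags.{tagger_name}._duration"},
            "count": {"$sum": 1},
        }},
        {"$sort": {"avg_duration": 1}},
    ]
    return list(questions_collection.aggregate(pipeline))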
♻️ Retry and Fallback Logic
Each tagger is executed with retry logic across a prioritized model list. Failures are logged with context.
for attempt in range(1, retries + 1):
    try:
        ...  # render the tagger prompt and call the current model
    except Exception as e:
        TAGGER_FAILURES_COLLECTION.insert_one(...)  # log the failure with context
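Putting it together, the retry-plus-fallback loop tries the preferred model first and then walks the remaining candidates, logging every failure. This sketch reuses get_preferred_model, MODEL_CANDIDATES, and TAGGER_FAILURES_COLLECTION from above; call_model and the failure-document fields are illustrative assumptions.
import time
from typing import Callable

def run_tagger_with_fallback(
    tagger_name: str,
    prompt: str,
    call_model: Callable[[str, str], dict],  # stand-in for the streaming LLM client
    retries: int = 2,
) -> dict:
    # Preferred model first, then the remaining candidates in priority order.
    preferred = get_preferred_model(tagger_name)
    candidates = [preferred] + [m for m in MODEL_CANDIDATES if m != preferred]
    for model in candidates:
        for attempt in range(1, retries + 1):
            try:
                return call_model(model, prompt)
            except Exception as e:
                TAGGER_FAILURES_COLLECTION.insert_one({
                    "tagger": tagger_name,
                    "model": model,
                    "attempt": attempt,
                    "error": str(e),
                    "timestamp": time.time(),
                })
    raise RuntimeError(f"All models failed for tagger '{tagger_name}'")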
📦 Template Profile Import/Export
All tagger and prompt profiles are exportable to .j2 files for revision control, auditing, or reuse in CI pipelines.
python trainer.py --export-profiles
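Under the hood, an export step only needs to walk the stored profiles and write each prompt_template to disk. The collection handle, filename scheme, and output directory below are assumptions for illustration.
from pathlib import Path

def export_profiles(collection, out_dir: str = "profiles") -> None:
    # Write each stored profile's template to an editable .j2 file.
    Path(out_dir).mkdir(exist_ok=True)
    for doc in collection.find():
        name = doc.get("tagger_name") or doc.get("who", "profile")
        (Path(out_dir) / f"{name}.j2").write_text(doc["prompt_template"], encoding="utf-8")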
📊 Future Work
The evaluation pipeline currently handles high-throughput tagging (e.g., 1000 items in 2 hours) and supports multiple models. Planned enhancements include:
- 🔌 Integration with InfluxDB for telemetry and dashboarding
- 📈 Time-based trend tracking per tagger and model
- 🧠 Using tagger outputs as supervision for fine-tuning or classifiers
- 🌍 Grafana dashboards for live feedback loops
- 🤖 Versioning evaluation results and model behavior over time
✨ Summary
This project represents a reusable foundation for structured LLM evaluation. By combining modular prompt templates, adaptive retry logic, structured tagging, and performance tracking, it provides a transparent and extensible framework for understanding and validating model output.
The codebase is modular, versionable, and fast enough for real-world research and production labeling scenarios.
If you're building systems that demand repeatable, auditable language evaluations, this approach is worth exploring.