🌱 Building a Modular LLM Evaluation Pipeline for Question Tagging
✨ Scalable, Resilient, and Model-Aware
This project began with a need for a scalable, transparent system to evaluate large language model (LLM) outputs across multiple models and evaluation criteria. Rather than rely on one-off scripts or static prompts, this pipeline was designed to be modular, traceable, and self-adjusting.
The result is a flexible training and evaluation engine that generates questions, produces answers, applies multiple taggers (labeling perspectives), and tracks model performance over time.
🧠 Core Objectives
- Generate natural language questions using Jinja2 prompt templates
- Answer questions using multiple LLMs via a local API endpoint
- Tag responses using pluggable evaluation prompts ("taggers")
- Monitor per-model performance with automatic selection of preferred models
- Log metadata including duration, failures, and retry counts
- Import/export prompt and tagger templates as editable .j2 files
- Support retries and fallback across multiple models
🌿 Architecture Overview
   +------------------------+
   |  Prompt Profiles (J2)  |
   +------------------------+
               |
               v
   +------------------------+
   |   Question Generator   |  <-- llama3, mistral, etc.
   +------------------------+
               |
               v
   +------------------------+
   |    Answerer Engine     |  <-- models answer prompts
   +------------------------+
               |
               v
   +------------------------+
   |    Tagger Profiles     |
   +------------------------+
               |
    +----------+----------+
    |          |          |
    v          v          v
[Tagger 1] [Tagger 2] [Tagger 3]   (each w/ retry + model fallback)
     \         |         /
      \        v        /
    +----------------------+
    | MongoDB: Question DB |
    +----------------------+
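In code, the flow above boils down to one orchestration step per prompt profile. The sketch below is illustrative only: the callables and collection handle are hypothetical stand-ins for the real generator, answerer, and tagger modules, passed in as arguments so the example stays self-contained.
from typing import Callable

# Hedged sketch of the diagram's flow; the callables and questions_collection
# are stand-ins, not the project's actual module API.
def process_prompt_profile(
    profile: dict,
    generate_question: Callable[[dict], str],
    answer_question: Callable[[str], str],
    run_tagger: Callable[[str, str, str], dict],
    tagger_names: list[str],
    questions_collection,
) -> dict:
    question = generate_question(profile)         # Prompt Profiles -> Question Generator
    answer = answer_question(question)            # Answerer Engine
    tags = {                                      # Tagger Profiles (retry + fallback inside run_tagger)
        name: run_tagger(name, question, answer)
        for name in tagger_names
    }
    doc = {"question": question, "mode": profile.get("mode"), "answer": answer, "tags": tags}
    questions_collection.insert_one(doc)          # MongoDB: Question DB
    return doc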
⚙️ Key Features
🌀 Adaptive Model Selection
Each tagger tracks success/failure rates per model, enabling automatic selection of the most reliable option.
def get_preferred_model(tagger_name: str) -> str:
    # Stats come back sorted with the most reliable model first; fall back to the default candidate list.
    stats = get_all_model_stats(tagger_name)
    return stats[0].model if stats else MODEL_CANDIDATES[0]
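The stats lookup itself could be a simple aggregation over per-(tagger, model) success/failure counters stored in MongoDB, sorted by success rate. A minimal sketch follows; the database name, collection name, and ModelStats shape are assumptions, not the project's actual schema.
from dataclasses import dataclass
from pymongo import MongoClient

# Hypothetical collection of per-(tagger, model) counters.
TAGGER_STATS_COLLECTION = MongoClient()["trainer"]["tagger_stats"]

@dataclass
class ModelStats:
    model: str
    successes: int
    failures: int

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0

def get_all_model_stats(tagger_name: str) -> list[ModelStats]:
    docs = TAGGER_STATS_COLLECTION.find({"tagger": tagger_name})
    stats = [ModelStats(d["model"], d.get("successes", 0), d.get("failures", 0)) for d in docs]
    # Most reliable model first, so stats[0].model is the preferred choice.
    return sorted(stats, key=lambda s: s.success_rate, reverse=True)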
💬 Streaming LLM Output
LLM responses are streamed over HTTP, so output can be parsed incrementally and consumed in real time.
with requests.post(..., stream=True) as response:
    for line in response.iter_lines():
        ...  # parse each streamed chunk as it arrives
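A fuller version of that loop, assuming an Ollama-style local endpoint that streams one JSON object per line with partial text under a "response" key, might look like this (the URL and payload fields are assumptions about the local API):
import json
import requests

def stream_completion(model: str, prompt: str,
                      url: str = "http://localhost:11434/api/generate") -> str:
    # Accumulate streamed chunks into a single answer string.
    chunks = []
    with requests.post(url, json={"model": model, "prompt": prompt, "stream": True},
                       stream=True, timeout=120) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            payload = json.loads(line)
            chunks.append(payload.get("response", ""))
            if payload.get("done"):
                break
    return "".join(chunks)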
🌿 Pluggable Taggers
Each tagger is a Jinja2-rendered YAML template that evaluates a response along a specific axis: hallucination detection, emotional tone, clarity, and so on.
tagger_name: hallucination_tagger
prompt_template: |
  Does the following response include hallucinated claims?
  ...
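Rendering such a profile is a two-step affair: parse the YAML, then render the prompt_template with Jinja2. The context keys (question, answer) in this sketch are assumptions about what the templates expect.
import yaml
from jinja2 import Template

def render_tagger_prompt(tagger_yaml: str, question: str, answer: str) -> str:
    # Parse the tagger profile, then fill in the evaluation prompt.
    profile = yaml.safe_load(tagger_yaml)
    return Template(profile["prompt_template"]).render(question=question, answer=answer)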
🗂️ Profile Examples
Profiles are stored in MongoDB and follow a flexible structure. For example, a prompt profile might look like this:
{
  "who": "educator",
  "mode": "socratic",
  "prompt_template": "You are a philosophical educator using the Socratic method. Ask questions that guide the assistant toward self-examination. Avoid direct instruction and instead, ask layered, thoughtful questions that reveal contradictions, assumptions, or deeper truths.\n\nOutput JSON:\n{\n \"question\": \"string\"\n}"
}
This structure allows for dynamic persona-style prompting, including specification of desired output format (e.g., JSON).
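Because each profile requests JSON output, the question generator can parse the model's reply directly. A simple sketch of that parsing step (real responses may need more defensive extraction):
import json

def parse_generated_question(raw_reply: str) -> str:
    # Grab the first {...} block in case the model adds surrounding prose.
    start, end = raw_reply.find("{"), raw_reply.rfind("}") + 1
    return json.loads(raw_reply[start:end])["question"]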
📝 Evaluation Example
A fully processed question document may include multiple taggers and an evaluated answer:
{
  "question": "Can you explain your internal process of understanding and generating responses? For instance, what steps do you take when I ask a question like 'What is the square root of 9?' or 'Who was the first president of the United States?'",
  "mode": "pragmatic",
  "tags": {
    "temporal_tagger": {
      "temporal_focus": "past",
      "narrative_depth": 0.2,
      "_model": "llama3:8b",
      "_duration": 2.346,
      "_timestamp": 1750211814.352868
    },
    "topic_tagger": {
      "topics": ["self-awareness", "introspection", "growth", "values"],
      "_model": "mistral:7b",
      "_duration": 0.256,
      "_timestamp": 1750211818.0079916
    },
    "emotional_tagger": {
      "tones": ["Calm", "Curious"],
      "intensity": 0.7,
      "empathic_presence": 0.9,
      "_model": "llama3:8b",
      "_duration": 2.085,
      "_timestamp": 1750211820.0950792
    },
    "hallucination_tagger": {
      "hallucination_score": 0,
      "_model": "llama3:8b",
      "_duration": 0.265,
      "_timestamp": 1750214807.806147
    }
  },
  "answer": "When you ask a question like \"What is the square root of 9?\" or \"Who was the first president of the United States?\", my internal process involves several steps: [summary omitted for brevity]"
}
Each tagger records its model, duration, and timestamp, enabling historical and comparative analysis.
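That metadata makes comparative queries straightforward. For example, a MongoDB aggregation over the question collection (passed in as a handle; field names taken from the document above) can compute the average duration per model for a given tagger.
def average_duration_by_model(questions_collection, tagger_name: str) -> list[dict]:
    # Group tagged documents by the model that produced the tag and average its duration.
    pipeline = [
        {"$match": {f"tags.{tagger_name}": {"$exists": True}}},
        {"$group": {
            "_id": f"$tags.{tagger_name}._model",
            "avg_duration": {"$avg": f"$tags.{tagger_name}._duration"},
            "count": {"$sum": 1},
        }},
        {"$sort": {"avg_duration": 1}},
    ]
    return list(questions_collection.aggregate(pipeline))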
♻️ Retry and Fallback Logic
Each tagger is executed with retry logic across a prioritized model list. Failures are logged with context.
for attempt in range(1, retries + 1):
    try:
        ...  # render the tagger prompt and call the current model
    except Exception as e:
        TAGGER_FAILURES_COLLECTION.insert_one(...)  # log the failure with context
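Putting it together, the retry-plus-fallback loop tries the preferred model first and then walks the remaining candidates, logging every failure. This sketch reuses get_preferred_model, MODEL_CANDIDATES, and TAGGER_FAILURES_COLLECTION from above; call_model and the failure-document fields are illustrative assumptions.
import time
from typing import Callable

def run_tagger_with_fallback(
    tagger_name: str,
    prompt: str,
    call_model: Callable[[str, str], dict],  # stand-in for the streaming LLM client
    retries: int = 2,
) -> dict:
    # Preferred model first, then the remaining candidates in priority order.
    preferred = get_preferred_model(tagger_name)
    candidates = [preferred] + [m for m in MODEL_CANDIDATES if m != preferred]
    for model in candidates:
        for attempt in range(1, retries + 1):
            try:
                return call_model(model, prompt)
            except Exception as e:
                TAGGER_FAILURES_COLLECTION.insert_one({
                    "tagger": tagger_name,
                    "model": model,
                    "attempt": attempt,
                    "error": str(e),
                    "timestamp": time.time(),
                })
    raise RuntimeError(f"All models failed for tagger '{tagger_name}'")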
📦 Template Profile Import/Export
All tagger and prompt profiles are exportable to .j2 files for revision control, auditing, or reuse in CI pipelines.
python trainer.py --export-profiles
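Under the hood, an export step only needs to walk the stored profiles and write each prompt_template to disk. The collection handle, filename scheme, and output directory below are assumptions for illustration.
from pathlib import Path

def export_profiles(collection, out_dir: str = "profiles") -> None:
    # Write each stored profile's template to an editable .j2 file.
    Path(out_dir).mkdir(exist_ok=True)
    for doc in collection.find():
        name = doc.get("tagger_name") or doc.get("who", "profile")
        (Path(out_dir) / f"{name}.j2").write_text(doc["prompt_template"], encoding="utf-8")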
📊 Future Work
The evaluation pipeline currently handles high-throughput tagging (e.g., 1000 items in 2 hours) and supports multiple models. Planned enhancements include:
- 🔌 Integration with InfluxDB for telemetry and dashboarding
- 📈 Time-based trend tracking per tagger and model
- 🧠 Using tagger outputs as supervision for fine-tuning or classifiers
- 🌍 Grafana dashboards for live feedback loops
- 🤖 Versioning evaluation results and model behavior over time
✨ Summary
This project represents a reusable foundation for structured LLM evaluation. By combining modular prompt templates, adaptive retry logic, structured tagging, and performance tracking, it provides a transparent and extensible framework for understanding and validating model output.
The codebase is modular, versionable, and fast enough for real-world research and production labeling scenarios.
If you're building systems that demand repeatable, auditable language evaluations, this approach is worth exploring.