How to Analyze and Fine-Tune Agent Reasoning Traces with the Hermes Dataset

A Coding Implementation to Parsing, Analyzing, Visualizing, and Fine-Tuning Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset

The lambda/hermes-agent-reasoning-traces dataset provides a granular look into how agentic models process multi-turn conversations. Analysis shows that identifying internal thinking versus external tool calls is critical for evaluating agent performance and error rates.

Why This Matters

While ideal models are expected to reason perfectly, technical reality reveals frequent malformed tool calls and reasoning loops. This implementation addresses the gap by providing structured parsers for tags like and <tool_call>, enabling engineers to measure actual performance metrics such as parallel width and error frequency in complex trajectories. Understanding these traces is essential for transitioning from basic prompt engineering to robust agentic system design.

Key Insights

Regex-based parsing for and <tool_call> tags allows for the separation of internal reasoning from external actions in models like Kimi and GLM-4 (Hermes Dataset, 2026).
Parallel execution analysis identifies the parallel width of an agent, measuring how many tool calls are generated within a single assistant turn to evaluate efficiency.
The dataset supports multi-turn trajectories, requiring the conversion of tool roles into user roles with specific prefixes for standard OpenAI-compatible fine-tuning.
Label masking is essential for Supervised Fine-Tuning (SFT), ensuring the model only learns from assistant-generated content while ignoring user and tool inputs.
Error rate analysis across trajectories helps developers identify common failure modes, such as malformed JSON or traceback errors in tool responses.

Working Examples

Regex-based parser to extract reasoning traces and tool calls from assistant messages.

import re, json
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "<malformed>", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}

Implementation of label masking for supervised fine-tuning, ensuring only assistant responses contribute to the loss function.

def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]

Practical Applications

Use Case: Debugging complex agentic workflows by replaying reasoning traces to identify where a model’s internal logic diverges from required tool outputs.
Pitfall: Failing to mask non-assistant tokens during training, which leads to the model attempting to predict tool outputs or user queries instead of its own reasoning.
Use Case: Benchmarking tool usage frequency and error rates across different model configurations like Kimi or GLM to optimize tool selection.
Pitfall: Improperly handling malformed JSON in tool calls, which can crash parsing pipelines without robust try-except blocks.

References:

https://www.marktechpost.com/2026/05/02/a-coding-implementation-to-parsing-analyzing-visualizing-and-fine-tuning-agent-reasoning-traces-using-the-lambda-hermes-agent-reasoning-traces-dataset/

On This Page

A Coding Implementation to Parsing, Analyzing, Visualizing, and Fine-Tuning Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset

Why This Matters

Key Insights

Working Examples

Practical Applications

Continue reading

Related Content

Anthropic Claude Code: Automating Complex Security Research with Agentic Reasoning

Advanced Agentic Workflows: Mastering Tool Combination and Context Circulation in Gemini API

Poolside AI Launches Laguna XS.2 and M.1: High-Performance Agentic Coding via MoE