Add execution logging #18

Open
jonathanbossenger wants to merge 1 commit into WordPress:trunk from jonathanbossenger:enhancement-logging

Conversation

@jonathanbossenger

This pull request introduces a comprehensive test execution logging system to WP-Bench. It adds a callback-based architecture for logging test lifecycle events (start, complete, error) to a file or console, with configuration options for enabling or disabling logging and for customizing log file paths. The logging is thread-safe, log files are timestamped to correlate with result files, and the callbacks are integrated into both the knowledge and execution test flows. Documentation is also added to explain usage, configuration, and troubleshooting.

The most important changes are:

Test Execution Logging System:

  • Added a callback-based logging architecture, including an abstract TestCallback base class and concrete implementations for file and console logging in callbacks.py. These log test start, completion (with score and duration), and error events in real-time, with immediate flush for visibility.
  • Integrated logging callbacks into the test execution flow in core.py for both knowledge and execution tests, ensuring all test lifecycle events are captured.

Configuration and File Management:

  • Added new configuration options to OutputConfig in config.py for enabling test logging (enable_test_logging) and specifying the log file path (test_log_path), with path resolution and timestamped log file naming for correlation with result files.
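
Based on the option names described above, the YAML configuration might look like this (values are illustrative; wp-bench.example.yaml in the PR holds the canonical defaults):

```yaml
output:
  # Enable real-time test lifecycle logging
  enable_test_logging: true
  # Base log path; a timestamp is appended so logs correlate with result files
  test_log_path: output/test_execution.log
```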

Documentation:

  • Added docs/test-execution-logging.md with detailed instructions on configuring, using, and troubleshooting the new logging system, including log format, use cases, and real-time monitoring tips.

These changes provide robust, configurable, and real-time insight into test execution for debugging, performance analysis, and error tracking.

@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: jonathanbossenger <psykro@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.


Copilot AI left a comment

Pull request overview

This pull request adds a comprehensive test execution logging system to WP-Bench, enabling real-time monitoring of test lifecycle events (start, complete, error) for debugging and performance analysis. The implementation uses a callback-based architecture with file and console logging options, integrated into both single-model and multi-model benchmark runs.

Changes:

  • Added callback-based logging architecture with TestCallback base class and FileLoggerCallback/ConsoleLoggerCallback implementations in new callbacks.py file
  • Integrated logging into BenchmarkRunner and SingleModelRunner classes with configuration-based setup and lifecycle event tracking
  • Added configuration options (enable_test_logging, test_log_path) to OutputConfig with path resolution support
  • Provided comprehensive documentation with usage examples, troubleshooting guides, and real-time monitoring instructions

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
python/wp_bench/callbacks.py New file introducing callback architecture with abstract base class and concrete file/console logger implementations for test lifecycle events
python/wp_bench/core.py Integrates callback setup and invocation in both BenchmarkRunner and SingleModelRunner, adding logging at test start, completion, and error points
python/wp_bench/config.py Adds enable_test_logging and test_log_path configuration fields to OutputConfig with path resolution support
wp-bench.example.yaml Documents new logging configuration options with example values
docs/test-execution-logging.md Comprehensive documentation covering configuration, log format, use cases, and troubleshooting

Comment on lines +109 to +112

```bash
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt

# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
```

Copilot AI Feb 5, 2026

The awk command to extract test IDs is incorrect. Based on the log format (line 54 in callbacks.py), the test_id is in the 5th field when splitting by '|', not the 8th. The command should be: awk -F'|' '{print $5}'. Line 112 should likewise use field 5.

Suggested change

```bash
# before
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt
# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt

# after
grep "START" output/test_execution.log | awk -F'|' '{print $5}' | sort > started.txt
# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk -F'|' '{print $5}' | sort > completed.txt
```
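The corrected extraction can be sanity-checked on a sample line. The line below is hypothetical (the authoritative format is defined in callbacks.py), constructed so that the test ID sits in the fifth pipe-delimited field as the review describes:

```shell
# Hypothetical log line; with -F'|' the fields are:
# $1 timestamp, $2 level, $3 event, $4 model, $5 test_id, $6 type
line='2023-12-16 14:30:52 | INFO | START | model=gpt-4o | test_id=wp_hooks_001 | type=knowledge'

# Field 5 yields the test ID (with surrounding padding spaces)
echo "$line" | awk -F'|' '{print $5}'
```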
Comment on lines +123 to +126

```bash
# Find tests that took longer than 30 seconds (30000ms)
grep "COMPLETE" output/test_execution.log | \
awk -F'|' '{print $5}' | \
awk '{if ($3 > 30000) print $0}'
```

Copilot AI Feb 5, 2026

The awk command for finding slow tests is incorrect. Field 5 contains the test_id, not the duration. To extract tests with duration > 30000ms, you need to use field 8 (duration_ms) and parse the value correctly. The command should be something like: grep "COMPLETE" output/test_execution.log | awk -F'|' '{split($8, a, "="); if (a[2] > 30000) print $0}'

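Under the same hypothetical line format (duration in the eighth pipe-delimited field as `duration_ms=<value>`), the reviewer's split-on-`=` filter behaves as expected:

```shell
# Hypothetical COMPLETE line; $8 is " duration_ms=45210"
line='2023-12-16 14:31:40 | INFO | COMPLETE | model=gpt-4o | test_id=wp_hooks_001 | type=knowledge | score=1.0 | duration_ms=45210'

# split($8, a, "=") puts the numeric part in a[2]; awk compares it numerically
echo "$line" | awk -F'|' '{split($8, a, "="); if (a[2] > 30000) print $0}'
```

The line is printed only when the duration exceeds the threshold.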
```python
if not self.config.output.enable_test_logging:
    return None

model_name = self.config.model.name if self.config.model else "unknown"
```

Copilot AI Feb 5, 2026

The code checks self.config.model.name if self.config.model else "unknown", but according to the HarnessConfig.get_models() method (config.py line 86), if neither model nor models is specified, it returns a default ModelConfig(). This means self.config.model could be None here, which is handled correctly. However, it would be more consistent to use the same pattern as SingleModelRunner or to get the model from get_models()[0] to ensure consistency.

Suggested change

```python
# before
model_name = self.config.model.name if self.config.model else "unknown"

# after
models = self.config.get_models()
first_model = models[0] if models else None
model_name = first_model.name if first_model and getattr(first_model, "name", None) else "unknown"
```
Comment on lines +14 to +28

```python
def on_test_start(self, test_id: str, test_type: str, model: str) -> None:
    """Called when a test starts execution."""
    pass

def on_test_complete(
    self, test_id: str, test_type: str, model: str, score: float, duration_ms: float
) -> None:
    """Called when a test completes successfully."""
    pass

def on_test_error(
    self, test_id: str, test_type: str, model: str, error: Exception
) -> None:
    """Called when a test fails with an error."""
    pass
```

Copilot AI Feb 5, 2026

The TestCallback class is defined as an ABC (Abstract Base Class) but its methods are not decorated with @abstractmethod. This means subclasses are not required to implement these methods. Since these are intended to be abstract methods that subclasses must override, they should be decorated with @abstractmethod to enforce the contract.

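To illustrate the point, here is a minimal standalone sketch (class names are illustrative, not the project's actual API) showing how `@abstractmethod` makes Python refuse to instantiate a subclass that skips an override:

```python
from abc import ABC, abstractmethod

class Callback(ABC):
    @abstractmethod
    def on_test_start(self, test_id: str) -> None:
        """Subclasses must implement this."""

class Incomplete(Callback):
    pass  # does not implement on_test_start

class Complete(Callback):
    def on_test_start(self, test_id: str) -> None:
        print(f"start {test_id}")

Complete().on_test_start("t1")   # works: the contract is satisfied
try:
    Incomplete()                 # raises TypeError: abstract method not implemented
except TypeError as e:
    print("refused:", e)
```

Without the decorator, `Incomplete()` would instantiate silently and the missing method would only surface at call time.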
Comment on lines +109 to +112

```bash
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt

# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
```

Copilot AI Feb 5, 2026

This awk command is also incorrect. It should use field 5 to extract test IDs: awk -F'|' '{print $5}'.

Suggested change

```bash
# before
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt
# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt

# after
grep "START" output/test_execution.log | awk -F'|' '{print $5}' | sort > started.txt
# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk -F'|' '{print $5}' | sort > completed.txt
```
Comment on lines +221 to +230
### Multiple model runs

When running multiple models sequentially, all logs are appended to the same file with the model name included in each entry. To separate logs per model, you can either:

1. Parse the log file by model name:
```bash
grep "model=gpt-4o" output/test_execution.log > gpt-4o.log
```

2. Or run models separately with different log files.

Copilot AI Feb 5, 2026

This section contradicts the actual implementation. The code creates separate log files per model (see SingleModelRunner._setup_callback in core.py lines 461-473), not a single file where logs are appended. This entire section should be updated to reflect that each model gets its own timestamped log file, matching the description in lines 45-54.

Comment on lines +462 to +471

```python
# e.g., test_execution_gpt-4o_20231216_143052.log
safe_model_name = model_name.replace("/", "-").replace(":", "-")
timestamped_log_path = _timestamped_path(log_path)
# Insert model name before timestamp: test_execution_20231216_143052.log -> test_execution_gpt-4o_20231216_143052.log
parts = timestamped_log_path.stem.rsplit("_", 2)
if len(parts) == 3:
    new_stem = f"{parts[0]}_{safe_model_name}_{parts[1]}_{parts[2]}"
else:
    new_stem = f"{timestamped_log_path.stem}_{safe_model_name}"
timestamped_log_path = timestamped_log_path.parent / f"{new_stem}{timestamped_log_path.suffix}"
```

Copilot AI Feb 5, 2026

The logic for inserting the model name into the timestamped log path assumes a specific format and uses rsplit with a fixed count of 2. While this works for the standard case, it may produce unexpected results if the base log filename doesn't contain underscores or has an unusual format. Consider a more robust approach, such as directly appending the model name before calling _timestamped_path, or using a regular expression to match the timestamp pattern more reliably.

Suggested change

```python
# before
# e.g., test_execution_gpt-4o_20231216_143052.log
safe_model_name = model_name.replace("/", "-").replace(":", "-")
timestamped_log_path = _timestamped_path(log_path)
# Insert model name before timestamp: test_execution_20231216_143052.log -> test_execution_gpt-4o_20231216_143052.log
parts = timestamped_log_path.stem.rsplit("_", 2)
if len(parts) == 3:
    new_stem = f"{parts[0]}_{safe_model_name}_{parts[1]}_{parts[2]}"
else:
    new_stem = f"{timestamped_log_path.stem}_{safe_model_name}"
timestamped_log_path = timestamped_log_path.parent / f"{new_stem}{timestamped_log_path.suffix}"

# after
# e.g., base: test_execution.log -> test_execution_gpt-4o_20231216_143052.log
safe_model_name = model_name.replace("/", "-").replace(":", "-")
# First append model name to the base filename, then add timestamp
base_with_model = log_path.with_name(
    f"{log_path.stem}_{safe_model_name}{log_path.suffix}"
)
timestamped_log_path = _timestamped_path(base_with_model)
```
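As a standalone sketch of the suggested approach: `_timestamped_path` below is a stand-in for the project's helper, assumed to append `_YYYYMMDD_HHMMSS` before the suffix. Appending the model name first means the logic makes no assumptions about underscores in the base filename:

```python
from datetime import datetime
from pathlib import Path

def _timestamped_path(path: Path) -> Path:
    # Stand-in for the project's helper: append _YYYYMMDD_HHMMSS before the suffix
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return path.with_name(f"{path.stem}_{stamp}{path.suffix}")

def model_log_path(log_path: Path, model_name: str) -> Path:
    # Sanitize the model name, append it to the base filename,
    # then timestamp the result; no rsplit on the stem required
    safe = model_name.replace("/", "-").replace(":", "-")
    base_with_model = log_path.with_name(f"{log_path.stem}_{safe}{log_path.suffix}")
    return _timestamped_path(base_with_model)

p = model_log_path(Path("output/test_execution.log"), "openai/gpt-4o")
print(p.name)  # e.g. test_execution_openai-gpt-4o_20231216_143052.log
```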
```python
self.logger.handlers.clear()

# Create handler with immediate flush
handler = logging.FileHandler(self.log_path, mode="a")
```

Copilot AI Feb 5, 2026

The FileLoggerCallback uses mode="a" (append) when opening the log file, but if multiple BenchmarkRunner or SingleModelRunner instances are created with the same log path (unlikely but possible), they could potentially write to the same file concurrently. While Python's logging module is thread-safe, multiple processes or multiple loggers with different handler objects writing to the same file could cause interleaved output. Consider adding a note in the documentation about this scenario, or adding file locking if concurrent access is a concern.

```python
from __future__ import annotations

import logging
from abc import ABC, abstractmethod
```

Copilot AI Feb 5, 2026

Import of 'abstractmethod' is not used.

Suggested change

```python
# before
from abc import ABC, abstractmethod
# after
from abc import ABC
```
```python
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path
from typing import Optional
```

Copilot AI Feb 5, 2026

Import of 'Optional' is not used.

Suggested change

```python
# remove (unused)
from typing import Optional
```