Pull request overview
This pull request adds a comprehensive test execution logging system to WP-Bench, enabling real-time monitoring of test lifecycle events (start, complete, error) for debugging and performance analysis. The implementation uses a callback-based architecture with file and console logging options, integrated into both single-model and multi-model benchmark runs.
Changes:
- Added callback-based logging architecture with `TestCallback` base class and `FileLoggerCallback`/`ConsoleLoggerCallback` implementations in new `callbacks.py` file
- Integrated logging into `BenchmarkRunner` and `SingleModelRunner` classes with configuration-based setup and lifecycle event tracking
- Added configuration options (`enable_test_logging`, `test_log_path`) to `OutputConfig` with path resolution support
- Provided comprehensive documentation with usage examples, troubleshooting guides, and real-time monitoring instructions
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| python/wp_bench/callbacks.py | New file introducing callback architecture with abstract base class and concrete file/console logger implementations for test lifecycle events |
| python/wp_bench/core.py | Integrates callback setup and invocation in both BenchmarkRunner and SingleModelRunner, adding logging at test start, completion, and error points |
| python/wp_bench/config.py | Adds enable_test_logging and test_log_path configuration fields to OutputConfig with path resolution support |
| wp-bench.example.yaml | Documents new logging configuration options with example values |
| docs/test-execution-logging.md | Comprehensive documentation covering configuration, log format, use cases, and troubleshooting |
```bash
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt

# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
```
The awk command to extract test IDs is incorrect. Based on the log format (line 54 in callbacks.py), the test_id is in the 5th field when splitting by '|', not the 8th. The command should be: awk -F'|' '{print $5}'. Similarly, line 112 should use field 5 as well.
Suggested change:

```diff
-grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt
-# Extract all test IDs that completed
-grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
+grep "START" output/test_execution.log | awk -F'|' '{print $5}' | sort > started.txt
+# Extract all test IDs that completed
+grep "COMPLETE" output/test_execution.log | awk -F'|' '{print $5}' | sort > completed.txt
```
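To see why the field separator matters, the suggested command can be checked against a sample log line. The exact log line layout below is an assumption based on this review comment (pipe-delimited, with the test ID in the fifth field):

```shell
# Hypothetical log line matching the pipe-delimited format described above;
# the exact fields are an assumption for illustration.
line='2023-12-16 14:30:52 | INFO | START | knowledge | test_wp_001 | model=gpt-4o'

# With '|' as the field separator, the test_id lands in field 5
# (gsub strips the surrounding padding spaces).
echo "$line" | awk -F'|' '{gsub(/ /, "", $5); print $5}'
```

With awk's default whitespace separator, `$5` would instead pick up an arbitrary token from the line, which is why the original `{print $8}` command returned the wrong value.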
```bash
# Find tests that took longer than 30 seconds (30000ms)
grep "COMPLETE" output/test_execution.log | \
  awk -F'|' '{print $5}' | \
  awk '{if ($3 > 30000) print $0}'
```
The awk command for finding slow tests is incorrect. Field 5 contains the test_id, not the duration. To extract tests with duration > 30000ms, you need to use field 8 (duration_ms) and parse the value correctly. The command should be something like: `grep "COMPLETE" output/test_execution.log | awk -F'|' '{split($8, a, "="); if (a[2] > 30000) print $0}'`
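A quick way to sanity-check the `split()` approach is to run it over a hypothetical `COMPLETE` line. The field layout here is an assumption based on the review comments (test ID in field 5, `duration_ms=<value>` in field 8):

```shell
# Hypothetical COMPLETE line; the layout is an assumption for illustration.
line='2023-12-16 14:31:37 | INFO | COMPLETE | knowledge | test_wp_001 | model=gpt-4o | score=0.85 | duration_ms=45000'

# split() breaks "duration_ms=45000" on '=' so a[2] holds the numeric value;
# adding 0 forces a numeric comparison despite the leading space.
echo "$line" | awk -F'|' '{split($8, a, "="); if (a[2] + 0 > 30000) print $0}'
```

A line with `duration_ms=1200` would produce no output, since 1200 is below the 30000ms threshold.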
```python
if not self.config.output.enable_test_logging:
    return None

model_name = self.config.model.name if self.config.model else "unknown"
```
The code checks self.config.model.name if self.config.model else "unknown", but according to the HarnessConfig.get_models() method (config.py line 86), if neither model nor models is specified, it returns a default ModelConfig(). This means self.config.model could be None here, which is handled correctly. However, it would be more consistent to use the same pattern as SingleModelRunner or to get the model from get_models()[0] to ensure consistency.
Suggested change:

```diff
-model_name = self.config.model.name if self.config.model else "unknown"
+models = self.config.get_models()
+first_model = models[0] if models else None
+model_name = first_model.name if first_model and getattr(first_model, "name", None) else "unknown"
```
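The fallback chain in the suggestion can be sketched in isolation. The config classes below are minimal stand-ins whose names mirror the PR but whose bodies are assumptions for illustration only:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal stand-ins for the real config classes; the real HarnessConfig
# and ModelConfig in config.py have more fields.
@dataclass
class ModelConfig:
    name: Optional[str] = None

@dataclass
class HarnessConfig:
    model: Optional[ModelConfig] = None
    models: Optional[List[ModelConfig]] = None

    def get_models(self) -> List[ModelConfig]:
        # Mirrors the behavior the review describes: fall back to a
        # default ModelConfig when neither field is set.
        if self.models:
            return self.models
        if self.model:
            return [self.model]
        return [ModelConfig()]

def resolve_model_name(config: HarnessConfig) -> str:
    """Resolve a display name via get_models(), defaulting to 'unknown'."""
    models = config.get_models()
    first = models[0] if models else None
    return first.name if first and getattr(first, "name", None) else "unknown"

print(resolve_model_name(HarnessConfig()))  # unknown
print(resolve_model_name(HarnessConfig(model=ModelConfig(name="gpt-4o"))))  # gpt-4o
```

Routing both runners through `get_models()` keeps the "no model configured" behavior in one place instead of duplicating the `else "unknown"` guard.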
```python
def on_test_start(self, test_id: str, test_type: str, model: str) -> None:
    """Called when a test starts execution."""
    pass

def on_test_complete(
    self, test_id: str, test_type: str, model: str, score: float, duration_ms: float
) -> None:
    """Called when a test completes successfully."""
    pass

def on_test_error(
    self, test_id: str, test_type: str, model: str, error: Exception
) -> None:
    """Called when a test fails with an error."""
    pass
```
The TestCallback class is defined as an ABC (Abstract Base Class) but its methods are not decorated with `@abstractmethod`. This means subclasses are not required to implement these methods. Since these are intended to be abstract methods that subclasses must override, they should be decorated with `@abstractmethod` to enforce the contract.
```bash
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt

# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
```
This awk command is also incorrect. It should use field 5 to extract test IDs: awk -F'|' '{print $5}'.
Suggested change:

```diff
-grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt
-# Extract all test IDs that completed
-grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
+grep "START" output/test_execution.log | awk -F'|' '{print $5}' | sort > started.txt
+# Extract all test IDs that completed
+grep "COMPLETE" output/test_execution.log | awk -F'|' '{print $5}' | sort > completed.txt
```
### Multiple model runs

When running multiple models sequentially, all logs are appended to the same file with the model name included in each entry. To separate logs per model, you can either:

1. Parse the log file by model name:

   ```bash
   grep "model=gpt-4o" output/test_execution.log > gpt-4o.log
   ```

2. Or run models separately with different log files.
This section contradicts the actual implementation. The code creates separate log files per model (see SingleModelRunner._setup_callback in core.py lines 461-473), not a single file where logs are appended. This entire section should be updated to reflect that each model gets its own timestamped log file, matching the description in lines 45-54.
```python
# e.g., test_execution_gpt-4o_20231216_143052.log
safe_model_name = model_name.replace("/", "-").replace(":", "-")
timestamped_log_path = _timestamped_path(log_path)
# Insert model name before timestamp: test_execution_20231216_143052.log -> test_execution_gpt-4o_20231216_143052.log
parts = timestamped_log_path.stem.rsplit("_", 2)
if len(parts) == 3:
    new_stem = f"{parts[0]}_{safe_model_name}_{parts[1]}_{parts[2]}"
else:
    new_stem = f"{timestamped_log_path.stem}_{safe_model_name}"
timestamped_log_path = timestamped_log_path.parent / f"{new_stem}{timestamped_log_path.suffix}"
```
The logic for inserting the model name into the timestamped log path assumes a specific format and uses rsplit with a fixed count of 2. While this works for the standard case, it may produce unexpected results if the base log filename doesn't contain underscores or has an unusual format. Consider a more robust approach, such as directly appending the model name before calling _timestamped_path, or using a regular expression to match the timestamp pattern more reliably.
Suggested change:

```diff
-# e.g., test_execution_gpt-4o_20231216_143052.log
-safe_model_name = model_name.replace("/", "-").replace(":", "-")
-timestamped_log_path = _timestamped_path(log_path)
-# Insert model name before timestamp: test_execution_20231216_143052.log -> test_execution_gpt-4o_20231216_143052.log
-parts = timestamped_log_path.stem.rsplit("_", 2)
-if len(parts) == 3:
-    new_stem = f"{parts[0]}_{safe_model_name}_{parts[1]}_{parts[2]}"
-else:
-    new_stem = f"{timestamped_log_path.stem}_{safe_model_name}"
-timestamped_log_path = timestamped_log_path.parent / f"{new_stem}{timestamped_log_path.suffix}"
+# e.g., base: test_execution.log -> test_execution_gpt-4o_20231216_143052.log
+safe_model_name = model_name.replace("/", "-").replace(":", "-")
+# First append model name to the base filename, then add timestamp
+base_with_model = log_path.with_name(
+    f"{log_path.stem}_{safe_model_name}{log_path.suffix}"
+)
+timestamped_log_path = _timestamped_path(base_with_model)
```
```python
self.logger.handlers.clear()

# Create handler with immediate flush
handler = logging.FileHandler(self.log_path, mode="a")
```
The FileLoggerCallback uses mode="a" (append) when opening the log file, but if multiple BenchmarkRunner or SingleModelRunner instances are created with the same log path (unlikely but possible), they could potentially write to the same file concurrently. While Python's logging module is thread-safe, multiple processes or multiple loggers with different handler objects writing to the same file could cause interleaved output. Consider adding a note in the documentation about this scenario, or adding file locking if concurrent access is a concern.
```python
from __future__ import annotations

import logging
from abc import ABC, abstractmethod
```
Import of 'abstractmethod' is not used.
Suggested change:

```diff
-from abc import ABC, abstractmethod
+from abc import ABC
```
```python
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path
from typing import Optional
```
Import of 'Optional' is not used.
Suggested change:

```diff
-from typing import Optional
```
This pull request introduces a comprehensive test execution logging system to WP-Bench. It adds a callback-based architecture for logging test lifecycle events (start, complete, error) to a file or console, with configuration options for enabling/disabling logging and customizing log file paths. The logging is thread-safe, timestamps log files to match results, and is integrated into both knowledge and execution test flows. Documentation is also added to explain usage, configuration, and troubleshooting.
The most important changes are:
Test Execution Logging System:
- `TestCallback` base class and concrete implementations for file and console logging in `callbacks.py`. These log test start, completion (with score and duration), and error events in real-time, with immediate flush for visibility.
- Logging integrated into `core.py` for both knowledge and execution tests, ensuring all test lifecycle events are captured. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Configuration and File Management:
- New fields on `OutputConfig` in `config.py` for enabling test logging (`enable_test_logging`) and specifying the log file path (`test_log_path`), with path resolution and timestamped log file naming for correlation with result files. [1] [2] [3]

Documentation:
- New `docs/test-execution-logging.md` with detailed instructions on configuring, using, and troubleshooting the new logging system, including log format, use cases, and real-time monitoring tips.

These changes provide robust, configurable, and real-time insight into test execution for debugging, performance analysis, and error tracking.