Pull request overview
This pull request adds a comprehensive test execution logging system to WP-Bench, enabling real-time monitoring of test lifecycle events (start, complete, error) for debugging and performance analysis. The implementation uses a callback-based architecture with file and console logging options, integrated into both single-model and multi-model benchmark runs.
Changes:
- Added callback-based logging architecture with `TestCallback` base class and `FileLoggerCallback`/`ConsoleLoggerCallback` implementations in new `callbacks.py` file
- Integrated logging into `BenchmarkRunner` and `SingleModelRunner` classes with configuration-based setup and lifecycle event tracking
- Added configuration options (`enable_test_logging`, `test_log_path`) to `OutputConfig` with path resolution support
- Provided comprehensive documentation with usage examples, troubleshooting guides, and real-time monitoring instructions
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| python/wp_bench/callbacks.py | New file introducing callback architecture with abstract base class and concrete file/console logger implementations for test lifecycle events |
| python/wp_bench/core.py | Integrates callback setup and invocation in both BenchmarkRunner and SingleModelRunner, adding logging at test start, completion, and error points |
| python/wp_bench/config.py | Adds enable_test_logging and test_log_path configuration fields to OutputConfig with path resolution support |
| wp-bench.example.yaml | Documents new logging configuration options with example values |
| docs/test-execution-logging.md | Comprehensive documentation covering configuration, log format, use cases, and troubleshooting |
```bash
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt

# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
```
The awk command to extract test IDs is incorrect. Based on the log format (line 54 in callbacks.py), the test_id is in the 5th field when splitting by '|', not the 8th. The command should be: awk -F'|' '{print $5}'. Similarly, line 112 should use field 5 as well.
Suggested change:

```diff
-grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt
-# Extract all test IDs that completed
-grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
+grep "START" output/test_execution.log | awk -F'|' '{print $5}' | sort > started.txt
+# Extract all test IDs that completed
+grep "COMPLETE" output/test_execution.log | awk -F'|' '{print $5}' | sort > completed.txt
```
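To see why the field separator matters, the suggested command can be checked against a sample log line. The exact log line layout below is an assumption based on this review comment (pipe-delimited, with the test ID in the fifth field):

```shell
# Hypothetical log line matching the pipe-delimited format described above;
# the exact fields are an assumption for illustration.
line='2023-12-16 14:30:52 | INFO | START | knowledge | test_wp_001 | model=gpt-4o'

# With '|' as the field separator, the test_id lands in field 5
# (gsub strips the surrounding padding spaces).
echo "$line" | awk -F'|' '{gsub(/ /, "", $5); print $5}'
```

With awk's default whitespace separator, `$5` would instead pick up an arbitrary token from the line, which is why the original `{print $8}` command returned the wrong value.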
```bash
# Find tests that took longer than 30 seconds (30000ms)
grep "COMPLETE" output/test_execution.log | \
  awk -F'|' '{print $5}' | \
  awk '{if ($3 > 30000) print $0}'
```
The awk command for finding slow tests is incorrect. Field 5 contains the test_id, not the duration. To extract tests with duration > 30000ms, you need to use field 8 (duration_ms) and parse the value correctly. The command should be something like: `grep "COMPLETE" output/test_execution.log | awk -F'|' '{split($8, a, "="); if (a[2] > 30000) print $0}'`
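A quick way to sanity-check the `split()` approach is to run it over a hypothetical `COMPLETE` line. The field layout here is an assumption based on the review comments (test ID in field 5, `duration_ms=<value>` in field 8):

```shell
# Hypothetical COMPLETE line; the layout is an assumption for illustration.
line='2023-12-16 14:31:37 | INFO | COMPLETE | knowledge | test_wp_001 | model=gpt-4o | score=0.85 | duration_ms=45000'

# split() breaks "duration_ms=45000" on '=' so a[2] holds the numeric value;
# adding 0 forces a numeric comparison despite the leading space.
echo "$line" | awk -F'|' '{split($8, a, "="); if (a[2] + 0 > 30000) print $0}'
```

A line with `duration_ms=1200` would produce no output, since 1200 is below the 30000ms threshold.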
```python
if not self.config.output.enable_test_logging:
    return None

model_name = self.config.model.name if self.config.model else "unknown"
```
The code checks self.config.model.name if self.config.model else "unknown", but according to the HarnessConfig.get_models() method (config.py line 86), if neither model nor models is specified, it returns a default ModelConfig(). This means self.config.model could be None here, which is handled correctly. However, it would be more consistent to use the same pattern as SingleModelRunner or to get the model from get_models()[0] to ensure consistency.
Suggested change:

```diff
-model_name = self.config.model.name if self.config.model else "unknown"
+models = self.config.get_models()
+first_model = models[0] if models else None
+model_name = first_model.name if first_model and getattr(first_model, "name", None) else "unknown"
```
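The fallback chain in the suggestion can be sketched in isolation. The config classes below are minimal stand-ins whose names mirror the PR but whose bodies are assumptions for illustration only:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal stand-ins for the real config classes; the real HarnessConfig
# and ModelConfig in config.py have more fields.
@dataclass
class ModelConfig:
    name: Optional[str] = None

@dataclass
class HarnessConfig:
    model: Optional[ModelConfig] = None
    models: Optional[List[ModelConfig]] = None

    def get_models(self) -> List[ModelConfig]:
        # Mirrors the behavior the review describes: fall back to a
        # default ModelConfig when neither field is set.
        if self.models:
            return self.models
        if self.model:
            return [self.model]
        return [ModelConfig()]

def resolve_model_name(config: HarnessConfig) -> str:
    """Resolve a display name via get_models(), defaulting to 'unknown'."""
    models = config.get_models()
    first = models[0] if models else None
    return first.name if first and getattr(first, "name", None) else "unknown"

print(resolve_model_name(HarnessConfig()))  # unknown
print(resolve_model_name(HarnessConfig(model=ModelConfig(name="gpt-4o"))))  # gpt-4o
```

Routing both runners through `get_models()` keeps the "no model configured" behavior in one place instead of duplicating the `else "unknown"` guard.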
```python
def on_test_start(self, test_id: str, test_type: str, model: str) -> None:
    """Called when a test starts execution."""
    pass

def on_test_complete(
    self, test_id: str, test_type: str, model: str, score: float, duration_ms: float
) -> None:
    """Called when a test completes successfully."""
    pass

def on_test_error(
    self, test_id: str, test_type: str, model: str, error: Exception
) -> None:
    """Called when a test fails with an error."""
    pass
```
The TestCallback class is defined as an ABC (Abstract Base Class) but its methods are not decorated with `@abstractmethod`. This means subclasses are not required to implement these methods. Since these are intended to be abstract methods that subclasses must override, they should be decorated with `@abstractmethod` to enforce the contract.
```bash
grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt

# Extract all test IDs that completed
grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
```
This awk command is also incorrect. It should use field 5 to extract test IDs: awk -F'|' '{print $5}'.
Suggested change:

```diff
-grep "START" output/test_execution.log | awk '{print $8}' | sort > started.txt
-# Extract all test IDs that completed
-grep "COMPLETE" output/test_execution.log | awk '{print $8}' | sort > completed.txt
+grep "START" output/test_execution.log | awk -F'|' '{print $5}' | sort > started.txt
+# Extract all test IDs that completed
+grep "COMPLETE" output/test_execution.log | awk -F'|' '{print $5}' | sort > completed.txt
```
### Multiple model runs

When running multiple models sequentially, all logs are appended to the same file with the model name included in each entry. To separate logs per model, you can either:

1. Parse the log file by model name:

   ```bash
   grep "model=gpt-4o" output/test_execution.log > gpt-4o.log
   ```

2. Or run models separately with different log files.
This section contradicts the actual implementation. The code creates separate log files per model (see SingleModelRunner._setup_callback in core.py lines 461-473), not a single file where logs are appended. This entire section should be updated to reflect that each model gets its own timestamped log file, matching the description in lines 45-54.
```python
# e.g., test_execution_gpt-4o_20231216_143052.log
safe_model_name = model_name.replace("/", "-").replace(":", "-")
timestamped_log_path = _timestamped_path(log_path)
# Insert model name before timestamp: test_execution_20231216_143052.log -> test_execution_gpt-4o_20231216_143052.log
parts = timestamped_log_path.stem.rsplit("_", 2)
if len(parts) == 3:
    new_stem = f"{parts[0]}_{safe_model_name}_{parts[1]}_{parts[2]}"
else:
    new_stem = f"{timestamped_log_path.stem}_{safe_model_name}"
timestamped_log_path = timestamped_log_path.parent / f"{new_stem}{timestamped_log_path.suffix}"
```
The logic for inserting the model name into the timestamped log path assumes a specific format and uses rsplit with a fixed count of 2. While this works for the standard case, it may produce unexpected results if the base log filename doesn't contain underscores or has an unusual format. Consider a more robust approach, such as directly appending the model name before calling _timestamped_path, or using a regular expression to match the timestamp pattern more reliably.
Suggested change:

```diff
-# e.g., test_execution_gpt-4o_20231216_143052.log
-safe_model_name = model_name.replace("/", "-").replace(":", "-")
-timestamped_log_path = _timestamped_path(log_path)
-# Insert model name before timestamp: test_execution_20231216_143052.log -> test_execution_gpt-4o_20231216_143052.log
-parts = timestamped_log_path.stem.rsplit("_", 2)
-if len(parts) == 3:
-    new_stem = f"{parts[0]}_{safe_model_name}_{parts[1]}_{parts[2]}"
-else:
-    new_stem = f"{timestamped_log_path.stem}_{safe_model_name}"
-timestamped_log_path = timestamped_log_path.parent / f"{new_stem}{timestamped_log_path.suffix}"
+# e.g., base: test_execution.log -> test_execution_gpt-4o_20231216_143052.log
+safe_model_name = model_name.replace("/", "-").replace(":", "-")
+# First append model name to the base filename, then add timestamp
+base_with_model = log_path.with_name(
+    f"{log_path.stem}_{safe_model_name}{log_path.suffix}"
+)
+timestamped_log_path = _timestamped_path(base_with_model)
```
```python
self.logger.handlers.clear()

# Create handler with immediate flush
handler = logging.FileHandler(self.log_path, mode="a")
```
The FileLoggerCallback uses mode="a" (append) when opening the log file, but if multiple BenchmarkRunner or SingleModelRunner instances are created with the same log path (unlikely but possible), they could potentially write to the same file concurrently. While Python's logging module is thread-safe, multiple processes or multiple loggers with different handler objects writing to the same file could cause interleaved output. Consider adding a note in the documentation about this scenario, or adding file locking if concurrent access is a concern.
```python
from __future__ import annotations

import logging
from abc import ABC, abstractmethod
```
Import of 'abstractmethod' is not used.
Suggested change:

```diff
-from abc import ABC, abstractmethod
+from abc import ABC
```
```python
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path
from typing import Optional
```
Import of 'Optional' is not used.
Suggested change:

```diff
-from typing import Optional
```
This pull request introduces a comprehensive test execution logging system to WP-Bench. It adds a callback-based architecture for logging test lifecycle events (start, complete, error) to a file or console, with configuration options for enabling/disabling logging and customizing log file paths. The logging is thread-safe, timestamps log files to match results, and is integrated into both knowledge and execution test flows. Documentation is also added to explain usage, configuration, and troubleshooting.
The most important changes are:
Test Execution Logging System:
- `TestCallback` base class and concrete implementations for file and console logging in `callbacks.py`. These log test start, completion (with score and duration), and error events in real-time, with immediate flush for visibility.
- Logging integrated into `core.py` for both knowledge and execution tests, ensuring all test lifecycle events are captured. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Configuration and File Management:
- New fields on `OutputConfig` in `config.py` for enabling test logging (`enable_test_logging`) and specifying the log file path (`test_log_path`), with path resolution and timestamped log file naming for correlation with result files. [1] [2] [3]

Documentation:
- New `docs/test-execution-logging.md` with detailed instructions on configuring, using, and troubleshooting the new logging system, including log format, use cases, and real-time monitoring tips.

These changes provide robust, configurable, and real-time insight into test execution for debugging, performance analysis, and error tracking.