Debug 救星！用 ObjWatch 助力复杂 Python 项目的高效代码理解与调试

2024-12-21

摘要

使用 ObjWatch 助力复杂 Python 项目的高效代码理解与调试。

源码链接

当前 Debug 痛点

在阅读和 debug 复杂的项目时，常常会遇到多达十几层的嵌套调用，不知道调用的顺序。
最难受的是在多进程场景下，在单个进程上的调试往往会导致其他进程等待超时，需要不断重复启动调试程序。
如果是用 print 大法，经常会漏掉一些函数的调用。这也太费时费力了，目前还没有找到集简单简洁好用于一身的调试库，于是花了个周末自己写了个工具，算是解决了这个痛点。

什么是 ObjWatch？

ObjWatch 是专为简化复杂项目的调试和监控而设计。通过实时追踪对象属性和方法调用，以及允许自定义 hook 来帮助开发者深入了解代码库。

快速用例

可以直接 pip install objwatch 安装，这里演示用例，需要 clone 源码

1
2
3

git clone https://github.com/aeeeeeep/objwatch
cd objwatch
python3 examples/example_usage.py

执行上面的代码就可以得到如下调用信息

[2025-01-04 19:15:13] [DEBUG] objwatch: Processed targets:
>>>>>>>>>>
examples/example_usage.py
<<<<<<<<<<
[2025-01-04 19:15:13] [WARNING] objwatch: wrapper 'BaseLogger' loaded
[2025-01-04 19:15:13] [INFO] objwatch: Starting ObjWatch tracing.
[2025-01-04 19:15:13] [INFO] objwatch: Starting tracing.
[2025-01-04 19:15:13] [DEBUG] objwatch: run main <-
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.__init__ <- '0':(type)SampleClass, '1':10
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.__init__ -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.increment <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value None -> 10
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 10 -> 11
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.increment -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.increment <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 11 -> 12
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.increment -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.increment <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 12 -> 13
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.increment -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.increment <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 13 -> 14
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.increment -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.increment <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 14 -> 15
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.increment -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.decrement <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 15 -> 14
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.decrement -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.decrement <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 14 -> 13
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.decrement -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: | run SampleClass.decrement <- '0':(type)SampleClass
[2025-01-04 19:15:13] [DEBUG] objwatch: | | upd SampleClass.value 13 -> 12
[2025-01-04 19:15:13] [DEBUG] objwatch: | end SampleClass.decrement -> None
[2025-01-04 19:15:13] [DEBUG] objwatch: end main -> None
[2025-01-04 19:15:13] [INFO] objwatch: Stopping ObjWatch tracing.
[2025-01-04 19:15:13] [INFO] objwatch: Stopping tracing.

代码中最核心的是这部分

# Using as a Context Manager with Detailed Logging
with objwatch.ObjWatch(['examples/example_usage.py']):
    main()

# Using the API with Simple Logging
obj_watch = objwatch.watch(['examples/example_usage.py'])
main()
obj_watch.stop()

我们可以同时通过上下文管理器和接口调用使用该工具。用例中我们指定了对 ‘examples/example_usage.py’ 文件的追踪，也就是只要有函数方法或变量是 examples/example_usage.py 中的，工具就会打印，这样就通过清晰的顺序层级日志记录，帮助可视化并监控这些嵌套的函数调用和对象互动。打印日志会包含如下几种执行类型：

run：开始执行函数或类的方法
end：函数或类的方法执行结束
upd：新创建一个变量
apd：(list, set, dict, tuple) 类型的变量中增加元素
pop：(list, set, dict, tuple) 类型的变量中减少元素

用例比较简单，如果是大型项目的执行，这个功能将非常有用。

整体功能

objwatch 提供了如下接口

targets（列表）：要监控的文件或模块。
exclude_targets（列表，可选）：要排除监控的文件或模块。
ranks（列表，可选）：在使用 torch.distributed 时跟踪的 GPU ids。
output（字符串，可选）：写入日志的文件路径。
output_xml（字符串，可选）：用于写入结构化日志的 XML 文件路径。如果指定，将以嵌套的 XML 格式保存追踪信息，便于浏览和分析。
level（字符串，可选）：日志级别（例如 logging.DEBUG，logging.INFO，force 等）。
simple（布尔值，可选）：启用简化日志模式，格式为 "DEBUG: {msg}"。
wrapper（FunctionWrapper，可选）：自定义包装器，用于扩展追踪和日志记录功能。
with_locals（布尔值，可选）：启用在函数执行期间对局部变量的追踪和日志记录。
with_module_path（布尔值，可选）：控制是否在日志中的函数名称前添加模块路径前缀。

重点：自定义包装器扩展

ObjWatch 提供了 FunctionWrapper 抽象基类，允许用户创建自定义包装器，以扩展和定制库的追踪与日志记录功能。通过继承 FunctionWrapper，开发者可以实现特定项目需求的定制化行为，这些行为将在函数调用和返回时执行，提供更专业的监控。

FunctionWrapper 类

FunctionWrapper 类定义了两个必须实现的核心方法：

wrap_call(self, func_name: str, frame: FrameType) -> str：
该方法在函数调用开始时触发，接收函数名和当前的帧对象，帧对象包含了执行上下文信息，包括局部变量和调用栈。在此方法中可以提取、记录或修改信息，在函数执行前进行处理。
wrap_return(self, func_name: str, result: Any) -> str：
该方法在函数返回时触发，接收函数名和返回的结果。在此方法中可以记录、分析或修改信息，函数执行完成后进行处理。
wrap_upd(self, old_value: Any, current_value: Any) -> Tuple[str, str]：
该方法在变量更新时触发，接收旧值和当前值。可用于记录变量的变化，分析其变化过程，从而跟踪和调试变量状态的变化。

有关帧对象的更多信息，请参考官方 Python 文档。

TensorShapeLogger

这是我根据我的使用场景实现的一个自定义包装器示例，代码在 objwatch/wrappers.py 文件中。这个包装器会自动记录指定模块中所有函数方法调用中输入输出的 tensor shape，还有变量值变量的情况，这在帮助理解复杂的分布式框架的执行逻辑时非常有用。

class TensorShapeLogger(FunctionWrapper):
    """
    TensorShapeLogger extends FunctionWrapper to log the shapes of torch.Tensor objects.
    """

    @staticmethod
    def _process_tensor_item(seq: List[Any]) -> Optional[List[Any]]:
        """
        Process a sequence to extract tensor shapes if all items are torch.Tensor.

        Args:
            seq (List[Any]): The sequence to process.

        Returns:
            Optional[List[Any]]: List of tensor shapes or None if not applicable.
        """
        if torch is not None and all(isinstance(x, torch.Tensor) for x in seq):
            return [x.shape for x in seq]
        else:
            return None

    def wrap_call(self, func_name: str, frame: FrameType) -> str:
        """
        Format the function call information, including tensor shapes if applicable.

        Args:
            func_name (str): Name of the function being called.
            frame (FrameType): The current stack frame.

        Returns:
            str: Formatted call message.
        """
        args, kwargs = self._extract_args_kwargs(frame)
        call_msg = self._format_args_kwargs(args, kwargs)
        return call_msg

    def wrap_return(self, func_name: str, result: Any) -> str:
        """
        Format the function return information, including tensor shapes if applicable.

        Args:
            func_name (str): Name of the function returning.
            result (Any): The result returned by the function.

        Returns:
            str: Formatted return message.
        """
        return_msg = self._format_return(result)
        return return_msg

    def wrap_upd(self, old_value: Any, current_value: Any) -> Tuple[str, str]:
        """
        Format the update information of a variable, including tensor shapes if applicable.

        Args:
            old_value (Any): The old value of the variable.
            current_value (Any): The new value of the variable.

        Returns:
            Tuple[str, str]: Formatted old and new values.
        """
        old_msg = self._format_value(old_value)
        current_msg = self._format_value(current_value)
        return old_msg, current_msg

    def _format_value(self, value: Any, is_return: bool = False) -> str:
        """
        Format a value into a string, logging tensor shapes if applicable.

        Args:
            value (Any): The value to format.
            is_return (bool): Flag indicating if the value is a return value.

        Returns:
            str: Formatted value string.
        """
        if torch is not None and isinstance(value, torch.Tensor):
            formatted = f"{value.shape}"
        elif isinstance(value, log_element_types):
            formatted = f"{value}"
        elif isinstance(value, log_sequence_types):
            formatted_sequence = EventHandls.format_sequence(value, func=TensorShapeLogger._process_tensor_item)
            if formatted_sequence:
                formatted = f"{formatted_sequence}"
            else:
                formatted = f"(type){value.__class__.__name__}"
        else:
            formatted = f"(type){value.__class__.__name__}"

        if is_return:
            if isinstance(value, torch.Tensor):
                return f"{value.shape}"
            elif isinstance(value, log_sequence_types) and formatted:
                return f"[{formatted}]"
            return f"{formatted}"
        return formatted

创建和集成自定义包装器

要创建一个自定义包装器，请按照以下步骤操作：

继承 FunctionWrapper：定义一个新的类，继承自 FunctionWrapper，并实现 wrap_call 和 wrap_return 方法，以定义您的自定义行为。
使用自定义包装器初始化 ObjWatch：在初始化 ObjWatch 时，通过 wrapper 参数传入您的自定义包装器对象。这将把您的自定义追踪逻辑集成到 ObjWatch 的追踪过程中。

通过利用自定义包装器，您可以增强 ObjWatch 的功能，捕捉额外的上下文信息，执行专业的日志记录，或与其他监控工具集成，从而为您的 Python 项目提供更全面和定制化的追踪解决方案。

使用示例

例如，TensorShapeLogger 可以按如下方式集成：

from objwatch.wrappers import TensorShapeLogger

# 使用自定义的 TensorShapeLogger 初始化 ObjWatch
obj_watch = objwatch.ObjWatch(['your_module.py'], simple=False, wrapper=TensorShapeLogger))
with obj_watch:
    main()

# 也可以
obj_watch = objwatch.watch(['your_module.py'], simple=False, wrapper=TensorShapeLogger)
main()
obj_watch.stop()

使用自定义包装器的示例

推荐阅读 tests/test_torch_train.py 文件。该文件包含了一个完整的 PyTorch 训练过程示例，展示了如何集成 ObjWatch 进行监控和日志记录。

注意事项

⚠️ 性能警告

ObjWatch 在调试环境中使用时会对程序的性能产生一定影响。因此，建议仅在调试和开发阶段使用。
先写这么多，后续有时间再更，觉得有用可以给个 star。

目前该库还在积极更新中，如果有任何疑问或建议，欢迎在评论区或仓库 issues 留言。