Python libraries that support ML tracking

1. Background

Requirement: 1.6.0 experiment data comparison

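2. ML tracking libraries

2.1 TensorBoard

Whether through callbacks or manual writes, TensorFlow can record data to a specified log directory.

Once the directory is given with the --logdir option, TensorBoard recursively scans its contents and displays the recorded data.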

 --logdir PATH         Directory where TensorBoard will look to find TensorFlow event files that it can display.
                        TensorBoard will recursively walk the directory structure rooted at logdir, looking for
                        .*tfevents.* files. A leading tilde will be expanded with the semantics of Python's
                        os.expanduser function.
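
To check ahead of time which run directories such a recursive scan would find, the walk can be approximated with the standard library. This is only an illustrative sketch and not TensorBoard's own implementation: the function name find_event_dirs is hypothetical, and following symlinks is an assumption based on the help text recommending symlink trees.

```python
import os
from pathlib import Path


def find_event_dirs(logdir):
    """Return the sorted sub-directories (relative to logdir) that hold event files.

    A directory counts as a run if at least one of its files has "tfevents"
    in its name, mirroring the `.*tfevents.*` pattern from the help text.
    """
    root = Path(logdir).expanduser()  # expand a leading "~" like os.expanduser
    runs = set()
    # followlinks=True is an assumption: it lets symlinked run folders be visited
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        if any("tfevents" in name for name in filenames):
            runs.add(os.path.relpath(dirpath, root))
    return sorted(runs)
```

Calling find_event_dirs("~/experiments") on a hypothetical folder returns one entry per directory that contains event files, which roughly corresponds to the runs offered for comparison in the UI.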

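In the TensorBoard web UI, experiments can then be compared by ticking the run checkboxes (marked with a red box in the screenshot below):

So if a user's experiment data shares a common parent folder, TensorBoard's visualization works without extra setup. If that condition cannot be met, the --logdir_spec option is worth trying. Judging from the help text below, however, TensorBoard discourages using --logdir_spec.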

 --logdir_spec PATH_SPEC
                        Like `--logdir`, but with special interpretation for commas and colons: commas separate
                        multiple runs, where a colon specifies a new name for a run. For example: `tensorboard
                        --logdir_spec=name1:/path/to/logs/1,name2:/path/to/logs/2`. This flag is discouraged and can
                        usually be avoided. TensorBoard walks log directories recursively; for finer-grained control,
                        prefer using a symlink tree. Some features may not work when using `--logdir_spec` instead of
                        `--logdir`.
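
The help text above recommends a symlink tree over --logdir_spec. A minimal sketch of building such a tree with the standard library follows. The function name build_symlink_tree and the run names are hypothetical, and the sketch assumes a platform where the current user may create directory symlinks.

```python
from pathlib import Path


def build_symlink_tree(runs, tree_root):
    """Create tree_root/<run name> symlinks that point at scattered log directories.

    `runs` maps a run name to an existing log directory. The resulting
    tree_root can then be passed to `tensorboard --logdir`.
    """
    tree = Path(tree_root)
    tree.mkdir(parents=True, exist_ok=True)
    for name, log_dir in runs.items():
        link = tree / name
        if link.is_symlink():
            link.unlink()  # replace a stale link left by an earlier call
        link.symlink_to(Path(log_dir).resolve(), target_is_directory=True)
    return tree
```

For example, build_symlink_tree({"name1": "/path/to/logs/1", "name2": "/path/to/logs/2"}, "tb_tree") followed by `tensorboard --logdir tb_tree` would cover the same two runs as the --logdir_spec example in the help text.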

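2.2 MLflow Tracking

Usage is simple: call functions such as log_metric, log_param and log_artifacts: https://mlflow.org/docs/latest/quickstart.html#using-the-tracking-api

Automatic logging is provided for the mainstream frameworks.

Results can be viewed in the web UI and also queried through the Python, REST, R and Java APIs.

This section of the documentation covers how to run your own MLflow tracking server, including the supported storage backends: https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers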
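
As an illustration of the REST route, the sketch below builds a call to the tracking server's runs/search endpoint using only the standard library. The helper names, the server address http://127.0.0.1:5000 and the experiment ID "0" are assumptions for the example. Note that a server started with extra options (authentication, a reverse proxy) may need a different URL or headers.

```python
import json
import urllib.request


def build_search_runs_request(tracking_uri, experiment_ids, max_results=100):
    """Build (but do not send) a POST request for the MLflow runs/search endpoint."""
    url = tracking_uri.rstrip("/") + "/api/2.0/mlflow/runs/search"
    body = json.dumps(
        {"experiment_ids": list(experiment_ids), "max_results": max_results}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_runs(tracking_uri, experiment_ids):
    """Send the request and return the list of runs from the JSON response."""
    request = build_search_runs_request(tracking_uri, experiment_ids)
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("runs", [])
```

With a server running locally, search_runs("http://127.0.0.1:5000", ["0"]) would return the runs of the default experiment as plain dictionaries, which a platform backend could consume without going through the web UI.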

2.3 Aim

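This library can also send logs to a remote server, but the documentation currently marks this as an experimental feature: https://aimstack.readthedocs.io/en/latest/using/remote_tracking.html

2.5 Neptune's comparison of different tools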

https://neptune.ai/blog/best-ml-experiment-tracking-tools

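The page contains a large table that compares Neptune, Weights & Biases, Comet, Sacred & Omniboard, MLflow, TensorBoard, Guild AI, Polyaxon, ClearML, Valohai, Pachyderm, Kubeflow, Verta.ai, SageMaker Studio and DVC Studio in terms of main features, pricing, open vs. closed source, service model, intrusiveness, interfaces and more.

Quite a few of them are paid products.

2.6 A list of ML tracking libraries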

https://github.com/srush/awesome-ml-tracking

3. Comparison

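	Pros	Cons
TensorBoard	1. Has a UI 2. Supports automatic and manual logging 3. Mainstream, active community	1. No convenient official remote server solution 2. Parsing the recorded data is inconvenient; third-party Python libraries exist but are fairly niche
MLflow tracking	1. Has a UI 2. Supports automatic and manual logging 3. Solid remote server support 4. Has REST and SDK interfaces 5. Active community 6. Excellent documentation	1. The Aim documentation mentions that the UI becomes sluggish with a few hundred experiments, though this should not affect the REST API
Aim	1. Has a UI 2. Supports automatic and manual logging 3. Remote server support available (marked experimental in the docs, see 2.3) 4. Dedicated to ML tracking, with a focus on performance 5. Has an SDK interface 6. Active community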

3.1 Feature check

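	MLflow tracking	Aim
Live updates	Yes, refreshes every 10 s	Yes, refreshes every 10 s; must be opened from the metrics page with live update turned on
Experiment comparison	Yes	Yes
Zooming into a region	Yes	Yes
URL prefix	mlflow server --static-prefix	aim up --base-path
Support for TensorFlow and other frameworks	Automatic logging	Automatic logging
Parsing TensorBoard data	No	No, but a conversion tool is provided
Remote server	Supported (environment variables such as MLFLOW_TRACKING_URI)	Experimental (environment variables such as AIM_SERVER_DEFAULT_HOST)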

3.2 Running

I have both MLflow tracking and Aim up and running; they can be viewed on my setup.

3.2.1 Aim

aim init
aim server
aim up -h 0.0.0.0

import random
import time

from aim import Run

# Connect to the remote tracking server started with `aim server`
run_inst = Run(repo='aim://127.0.0.1:53800')

# Save inputs, hparams or any other `key: value` pairs
run_inst['hparams'] = {
    'learning_rate': 0.01,
    'batch_size': 32,
}

# Track metrics
for i in range(360):
    run_inst.track(random.random() * 10 + i, name='metric_name')
    time.sleep(1)
    print('in progress: ', i)

3.2.2 MLflow tracking

mlflow server --static-prefix /abcdef

$env:MLFLOW_TRACKING_URI='http://127.0.0.1:5000'

Script: my_tracking.py

import os
from random import random, randint
import time
from mlflow import log_metric, log_param, log_artifacts

if __name__ == "__main__":
    # Log a parameter (key-value pair)
    log_param("param1", randint(0, 100))

    # Log a metric; metrics can be updated throughout the run
    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    # Log an artifact (output file)
    os.makedirs("outputs", exist_ok=True)
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")
    log_artifacts("outputs")

    for i in range(335):
        log_metric("metricI", random() * 10 + i)
        time.sleep(1)
        print('in progress, i: ', i)

3.3 Screenshots

3.3.1 Aim screenshot

3.3.2 MLflow screenshot

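4. Challenges

4.1 How to link platform tasks to runs in the third-party library

When a user creates a task on the platform, a new Aim run or MLflow run is created as a result. How can the correspondence between the two sides be established?

For Aim, its support for specifying the run hash could be used. This would also require a thin wrapper around the Python library, plus training users on how to use the library. This approach may place too many restrictions on users.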
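
A minimal sketch of what such a wrapper might look like is shown below. It derives a deterministic run hash from the platform task ID, so the platform can recompute the hash later without storing a separate mapping. The function names, the 24-character length and the SHA-256 derivation are all assumptions made for illustration, and whether the installed Aim version accepts a caller-chosen run_hash for a brand-new run needs to be verified before relying on this.

```python
import hashlib


def task_id_to_run_hash(task_id, length=24):
    """Derive a stable hex string from a platform task ID (hypothetical scheme)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    return digest[:length]


def open_run_for_task(task_id, repo):
    """Open the Aim run that belongs to a platform task.

    Assumption: Run(run_hash=...) can be used with a hash chosen by the caller.
    The import is kept local so the hash helper works without Aim installed.
    """
    from aim import Run  # third-party dependency

    return Run(run_hash=task_id_to_run_hash(task_id), repo=repo)
```

User code would then call open_run_for_task("task-42", "aim://127.0.0.1:53800") instead of constructing Run directly, which is exactly the kind of restriction on users mentioned above.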

Edited Aug 10, 2022 by ran.lu