Last edited by ran.lu Nov 17, 2022

[Issue] aimstack is sometimes very slow

Where ai-arts calls aim:


	aimMetrics, err := GetAimMetrics(configs.GetAimServerInnerEndpoint(orgName), runHash, runInfo.CollectMetricNames())
	if err != nil {
		apiErr = exports.RaiseAPIErrorAuto(exports.AIARTS_UNKNOWN_ERROR, err.Error())
		return
	}

apt-get update ; apt-get install -y curl

The endpoint called is: "/api/runs/%s/metric/get-batch"

http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9

curl -X POST -d '[{"name":"test_1"}]' http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/metric/get-batch

{"detail":[{"loc":["body"],"msg":"value is not a valid list","type":"type_error.list"}]}

A call that works:

curl -H "Content-Type: application/json"  -X POST -d '[{"context": "", "name":"test_1"}]'  http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/metric/get-batch
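The likely difference between the failing and the working call is the `Content-Type` header: `curl -d` sends `application/x-www-form-urlencoded` by default, so the server would not parse the body as a JSON list. Below is a minimal Python sketch of the working request using only the standard library. `build_get_batch_request` is a hypothetical helper name, `http://aim.example` is a placeholder base URL, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_get_batch_request(base_url, run_hash, metrics):
    """Build (but do not send) a POST to /api/runs/<hash>/metric/get-batch.

    `metrics` is a list of {"name": ..., "context": ...} dicts, as in the
    working curl call.
    """
    url = f"{base_url}/api/runs/{run_hash}/metric/get-batch"
    body = json.dumps(metrics).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # Declare the body as JSON; curl -d alone would declare a form body.
        headers={"Content-Type": "application/json"},
    )


# Placeholder base URL (assumption); substitute the real aim endpoint.
req = build_get_batch_request(
    "http://aim.example",
    "df12e1de38b141fb9e804161",
    [{"context": "", "name": "test_1"}],
)
```

Passing `req` to `urllib.request.urlopen` with a `timeout` argument would send it.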

Currently this takes 2 steps:

  1. Look up which metrics exist

Example response:

{"params":{"hparams":{"batch_size":2,"learning_rate":0.001}},"traces":{"logs":[],"images":[],"audios":[],"distributions":[],"figures":[],"texts":[],"metric":[{"context":{"gpu":0},"name":"__system__gpu_memory_percent"},{"context":{"gpu":0},"name":"__system__gpu_power_watts"},{"context":{"gpu":0},"name":"__system__gpu_temp"},{"context":{"gpu":0},"name":"__system__gpu"},{"context":{},"name":"__system__cpu"},{"context":{},"name":"__system__disk_percent"},{"context":{},"name":"__system__memory_percent"},{"context":{},"name":"__system__p_memory_percent"},{"context":{},"name":"metric_name"},{"context":{},"name":"test_1"},{"context":{},"name":"test_2"},{"context":{},"name":"test_3"},{"context":{},"name":"test_4"},{"context":{},"name":"test_5"}]},"props":{"name":"Run: 30(train-2b20f4a1-5172-4844-b853-ea868f0c6e36)","description":null,"experiment":{"id":"bb3237a5-00a1-4c26-955f-31ab85ad02ea","name":"default"},"tags":[],"archived":false,"creation_time":1668672386.153556,"end_time":1668672414.478266,"active":false,"notes":0}}

  2. Fetch the values of those metrics

Example response:

[{"name":"test_1","context":{},"values":[0.5087581830665021,0.1352001147880273,0.4123761320648158,0.9422590223948614,0.15437425998701015,0.16548192405524198,0.979551255021762,0.24343191495056993,0.925818780988329,0.32060684813732665],"iters":[0,1,2,3,4,5,6,7,8,9]}]

Step 1 is:

curl -H "Content-Type: application/json"  -X GET -d '[{"context": "", "name":"test_1"}]'  http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/info
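How the two steps connect can be sketched as follows: the `traces.metric` list from the `/info` response becomes the request body for `get-batch`. This is only an illustration. `metric_requests_from_info` is a hypothetical helper, skipping the `__system__` metrics is an assumption about what the caller wants, and passing `context` through unchanged assumes the server accepts it in that form.

```python
def metric_requests_from_info(info, include_system=False):
    """Turn a step-1 /info response into a step-2 get-batch request body."""
    requests = []
    for trace in info.get("traces", {}).get("metric", []):
        name = trace["name"]
        # In the example response, resource metrics are prefixed "__system__".
        if not include_system and name.startswith("__system__"):
            continue
        requests.append({"name": name, "context": trace.get("context", {})})
    return requests


# Shortened copy of the example /info response.
sample_info = {
    "traces": {
        "metric": [
            {"context": {"gpu": 0}, "name": "__system__gpu"},
            {"context": {}, "name": "__system__cpu"},
            {"context": {}, "name": "test_1"},
            {"context": {}, "name": "test_2"},
        ]
    }
}
batch_body = metric_requests_from_info(sample_info)
```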

My feeling is that step 1 is probably the slow one.
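To confirm that guess rather than rely on it, each step could be timed separately. A minimal sketch: `timed` is a hypothetical helper, and the two `fake_*` functions are stand-ins for the real HTTP calls so the snippet runs offline. Their sleep durations are arbitrary and say nothing about the real latencies.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn, print how long it took, and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Stand-ins for the real requests (assumption: replace with actual calls).
def fake_get_info(run_hash):
    time.sleep(0.05)
    return {"traces": {"metric": [{"name": "test_1", "context": {}}]}}


def fake_get_batch(run_hash, metrics):
    time.sleep(0.01)
    return [{"name": m["name"], "values": [], "iters": []} for m in metrics]


run_hash = "df12e1de38b141fb9e804161"
info, t_info = timed("step 1 /info", fake_get_info, run_hash)
batch, t_batch = timed(
    "step 2 /metric/get-batch", fake_get_batch, run_hash, info["traces"]["metric"]
)
```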

I think step 1 would be easy to cache.
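One way such a cache might look, as a sketch under stated assumptions: results are kept per run hash, entries for active runs expire after a TTL (their metric list can still grow), and a run whose `props.active` is `false` is assumed never to gain new metrics. `RunInfoCache` and `fake_fetch` are hypothetical names, and the 30-second TTL is arbitrary.

```python
import time


class RunInfoCache:
    """Cache step-1 results per run hash, with a TTL for runs still active."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch      # callable: run_hash -> info dict
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # run_hash -> (stored_at, info)

    def get(self, run_hash):
        entry = self._entries.get(run_hash)
        if entry is not None:
            stored_at, info = entry
            finished = info.get("props", {}).get("active") is False
            # Assumption: a finished run no longer gains new metric names.
            if finished or self._clock() - stored_at < self._ttl:
                return info
        info = self._fetch(run_hash)
        self._entries[run_hash] = (self._clock(), info)
        return info


calls = []


def fake_fetch(run_hash):
    # Stand-in for the real /info request; records each call.
    calls.append(run_hash)
    return {"props": {"active": True}, "traces": {"metric": []}}


now = [0.0]
cache = RunInfoCache(fake_fetch, ttl_seconds=30.0, clock=lambda: now[0])
cache.get("abc")   # miss: fetches
cache.get("abc")   # hit: still within the TTL
now[0] = 31.0
cache.get("abc")   # expired: fetches again
```

The sketch is not thread-safe; a concurrent caller would need a lock around `get`.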

Part II

further observation

How does aim's own live-update do it?

The endpoint it requests is:

https://test-3-172.apulis.com.cn/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/search/metric?p=500&q=%28run.hash+in+%5B%22df12e1de38b141fb9e804161%22%5D%29+and+%28%28metric.name+%3D%3D+%22test_5%22%29+or+%28metric.name+%3D%3D+%22metric_name%22%29+or+%28metric.name+%3D%3D+%22test_1%22%29+or+%28metric.name+%3D%3D+%22test_2%22%29+or+%28metric.name+%3D%3D+%22test_3%22%29+or+%28metric.name+%3D%3D+%22test_4%22%29%29&report_progress=False
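Decoded, the `q` parameter of that request is a single expression: `(run.hash in ["..."]) and ((metric.name == "...") or ...)`. So live-update fetches all selected metrics in one search call rather than in two steps. A sketch that rebuilds the same URL shape: `build_search_query` and `build_search_url` are hypothetical helpers, and the `p` and `report_progress` values are simply copied from the observed request, without any claim about what they mean.

```python
from urllib.parse import urlencode


def build_search_query(run_hash, metric_names):
    """Build the query expression seen in the live-update request."""
    names = " or ".join(f'(metric.name == "{n}")' for n in metric_names)
    return f'(run.hash in ["{run_hash}"]) and ({names})'


def build_search_url(base_url, run_hash, metric_names):
    params = {
        "p": 500,                    # copied from the observed request
        "q": build_search_query(run_hash, metric_names),
        "report_progress": "False",  # copied from the observed request
    }
    return f"{base_url}/api/runs/search/metric?{urlencode(params)}"


# Placeholder base URL (assumption); substitute the real aim endpoint.
url = build_search_url(
    "http://aim.example", "df12e1de38b141fb9e804161", ["test_5", "test_1"]
)
```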

thought:

Inside aim, this assignment that I added may be causing a write to happen on every read:

                trace.run.name = 'Run: ' + translate_name(trace.run.hash)
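If that suspicion is right, one option would be to compute the display name for the response without assigning it back to the run. The sketch below only models the idea. `FakeRun` is a stand-in whose `name` setter counts simulated writes (whether aim's real setter persists to storage is the hypothesis here, not a confirmed fact), and `translate_name` is a placeholder because the real function is not shown.

```python
class FakeRun:
    """Stand-in for a run object whose `name` setter persists to storage."""

    def __init__(self, run_hash, name):
        self.hash = run_hash
        self._name = name
        self.writes = 0  # counts simulated storage writes

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        self._name = value
        self.writes += 1  # a persisting setter would hit the database here


def translate_name(run_hash):
    # Placeholder: the real translate_name logic is not shown in these notes.
    return f"translated-{run_hash}"


def display_name(run):
    """Compute the name for the response without assigning to the run."""
    return "Run: " + translate_name(run.hash)


run = FakeRun("df12e1de38b141fb9e804161", "original")
label = display_name(run)  # read-only path: run.writes stays 0
```

The caller would then put `label` into the response payload instead of mutating `trace.run`.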