[Issue] aimstack is sometimes very slow
Where ai-arts calls aim:
aimMetrics, err := GetAimMetrics(configs.GetAimServerInnerEndpoint(orgName), runHash, runInfo.CollectMetricNames())
if err != nil {
    apiErr = exports.RaiseAPIErrorAuto(exports.AIARTS_UNKNOWN_ERROR, err.Error())
    return
}
apt-get update ; apt-get install -y curl
The endpoint called is: "/api/runs/%s/metric/get-batch"
curl -X POST -d '[{"name":"test_1"}]' http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/metric/get-batch
{"detail":[{"loc":["body"],"msg":"value is not a valid list","type":"type_error.list"}]}
A call that works:
curl -H "Content-Type: application/json" -X POST -d '[{"context": "", "name":"test_1"}]' http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/metric/get-batch
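A rough Python sketch of the working call above, for comparison with the failing one. The failing call most likely broke because `curl -d` sends a form-encoded Content-Type by default, so the body was never parsed as a JSON list. `build_get_batch_request` and the `http://aim.example` base URL are hypothetical names for illustration, and the `"context": ""` value simply mirrors the working curl call. The request is built but not sent.

```python
import json
import urllib.request


def build_get_batch_request(base_url, run_hash, metric_names):
    """Build (but do not send) a POST request for the get-batch endpoint."""
    url = f"{base_url}/api/runs/{run_hash}/metric/get-batch"
    # Mirrors the working curl call: one object per metric, with a "context" field.
    body = [{"context": "", "name": name} for name in metric_names]
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Declare the body as JSON explicitly; the default would be form encoding.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical base URL; substitute the real aim endpoint prefix.
req = build_get_batch_request("http://aim.example", "df12e1de38b141fb9e804161", ["test_1"])
print(req.get_method(), req.full_url)
```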
Currently this takes 2 steps:
- Look up which metrics exist
Example response:
{"params":{"hparams":{"batch_size":2,"learning_rate":0.001}},"traces":{"logs":[],"images":[],"audios":[],"distributions":[],"figures":[],"texts":[],"metric":[{"context":{"gpu":0},"name":"__system__gpu_memory_percent"},{"context":{"gpu":0},"name":"__system__gpu_power_watts"},{"context":{"gpu":0},"name":"__system__gpu_temp"},{"context":{"gpu":0},"name":"__system__gpu"},{"context":{},"name":"__system__cpu"},{"context":{},"name":"__system__disk_percent"},{"context":{},"name":"__system__memory_percent"},{"context":{},"name":"__system__p_memory_percent"},{"context":{},"name":"metric_name"},{"context":{},"name":"test_1"},{"context":{},"name":"test_2"},{"context":{},"name":"test_3"},{"context":{},"name":"test_4"},{"context":{},"name":"test_5"}]},"props":{"name":"Run: 30(train-2b20f4a1-5172-4844-b853-ea868f0c6e36)","description":null,"experiment":{"id":"bb3237a5-00a1-4c26-955f-31ab85ad02ea","name":"default"},"tags":[],"archived":false,"creation_time":1668672386.153556,"end_time":1668672414.478266,"active":false,"notes":0}}
- Look up the values of those metrics
Example response:
[{"name":"test_1","context":{},"values":[0.5087581830665021,0.1352001147880273,0.4123761320648158,0.9422590223948614,0.15437425998701015,0.16548192405524198,0.979551255021762,0.24343191495056993,0.925818780988329,0.32060684813732665],"iters":[0,1,2,3,4,5,6,7,8,9]}]
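How the first response feeds the second request can be sketched like this. It is only an illustration in Python, not the ai-arts code: `metric_names_from_info` is a hypothetical helper, and skipping names that start with `__system__` is an assumption about what the caller wants.

```python
def metric_names_from_info(info, include_system=False):
    """Return metric names listed under traces.metric in the first response."""
    names = []
    for trace in info.get("traces", {}).get("metric", []):
        name = trace["name"]
        # Assumption: names starting with "__system__" are resource metrics
        # that the caller may not want to fetch.
        if not include_system and name.startswith("__system__"):
            continue
        if name not in names:  # the same name can appear under several contexts
            names.append(name)
    return names


# Trimmed from the first example response above.
sample_info = {
    "traces": {
        "metric": [
            {"context": {"gpu": 0}, "name": "__system__gpu"},
            {"context": {}, "name": "__system__cpu"},
            {"context": {}, "name": "metric_name"},
            {"context": {}, "name": "test_1"},
        ]
    }
}
print(metric_names_from_info(sample_info))
```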
Step 1 is:
curl -H "Content-Type: application/json" -X GET -d '[{"context": "", "name":"test_1"}]' http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/info
My hunch is that step 1 is the slow one.
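One way to check that hunch is to time the two steps separately. A small sketch, assuming the two HTTP calls are wrapped in functions; `timed`, `fake_get_info`, and `fake_get_batch` are hypothetical, and the fakes stand in for the real requests.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds), printing the timing."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Stand-ins for the two HTTP calls; replace with the real requests.
def fake_get_info():
    time.sleep(0.02)
    return {"traces": {"metric": []}}


def fake_get_batch():
    return []


info, t_info = timed("step 1 (/info)", fake_get_info)
batch, t_batch = timed("step 2 (get-batch)", fake_get_batch)
```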
I think step 1 would be easy to cache.
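A minimal sketch of what caching step 1 could look like, written in Python for illustration rather than as the ai-arts implementation. `RunInfoCache` is a hypothetical name, the 30-second TTL is an arbitrary assumption, and `fetch` stands for whatever function performs the real /info call.

```python
import time


class RunInfoCache:
    """Cache /info responses per run hash for a limited time."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch    # callable: run_hash -> info dict
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # run_hash -> (expires_at, info)

    def get(self, run_hash):
        now = self._clock()
        entry = self._entries.get(run_hash)
        if entry is not None and now < entry[0]:
            return entry[1]    # still fresh: skip the slow call
        info = self._fetch(run_hash)
        self._entries[run_hash] = (now + self._ttl, info)
        return info
```

A TTL is used because a running job can still add new metrics. The example response includes `"active": false` in `props`, so a finished run could plausibly be cached for much longer.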
Part II
further observation
How does their live-update do it?
The endpoint it requests is:
https://test-3-172.apulis.com.cn/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/search/metric?p=500&q=%28run.hash+in+%5B%22df12e1de38b141fb9e804161%22%5D%29+and+%28%28metric.name+%3D%3D+%22test_5%22%29+or+%28metric.name+%3D%3D+%22metric_name%22%29+or+%28metric.name+%3D%3D+%22test_1%22%29+or+%28metric.name+%3D%3D+%22test_2%22%29+or+%28metric.name+%3D%3D+%22test_3%22%29+or+%28metric.name+%3D%3D+%22test_4%22%29%29&report_progress=False
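Decoded, the `q` parameter in that URL is a single expression that filters by run hash and metric name, so one request returns all the metrics. A sketch that rebuilds the same query string in Python: the function names are hypothetical, the `p` and `report_progress` values are copied from the URL above, and the way several run hashes would be joined is an assumption, since only one hash appears in the captured request.

```python
from urllib.parse import urlencode, parse_qs


def build_metric_search_query(run_hashes, metric_names):
    """Build the `q` expression seen in the live-update request."""
    # Assumption: several hashes are joined as a comma-separated list literal.
    hashes = ", ".join(f'"{h}"' for h in run_hashes)
    names = " or ".join(f'(metric.name == "{n}")' for n in metric_names)
    return f"(run.hash in [{hashes}]) and ({names})"


def build_search_params(run_hashes, metric_names, p=500):
    """URL-encode the query string, with p and report_progress as in the URL."""
    q = build_metric_search_query(run_hashes, metric_names)
    return urlencode({"p": p, "q": q, "report_progress": "False"})


params = build_search_params(["df12e1de38b141fb9e804161"], ["test_5", "metric_name"])
print(params)
```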
thought:
Inside aim, this assignment of mine may cause a write to happen during a read:
trace.run.name = 'Run: ' + translate_name(trace.run.hash)
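To illustrate the suspicion, here is a toy model of a property setter that persists on assignment. It does not reflect aim's actual internals: `FakeRun`, its `writes` counter, and this `translate_name` are stand-ins invented for the example. The point is only that computing the display name without assigning it back keeps the read path free of writes.

```python
class FakeRun:
    """Stand-in for a run object whose name setter persists to storage."""

    def __init__(self, run_hash, name):
        self.hash = run_hash
        self._name = name
        self.writes = 0  # counts simulated storage writes

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        self._name = value
        self.writes += 1  # a real setter might hit the database here


def translate_name(run_hash):
    # Hypothetical stand-in for the translate_name used in the note.
    return run_hash[:8]


def display_name(run):
    """Compute the display name without assigning it back to the run."""
    return "Run: " + translate_name(run.hash)
```

With this model, `run.name = 'Run: ' + translate_name(run.hash)` bumps `writes` on every read, while `display_name(run)` leaves it at zero.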