[Issue] aimstack is sometimes very slow

Where ai-arts calls aim:


	aimMetrics, err := GetAimMetrics(configs.GetAimServerInnerEndpoint(orgName), runHash, runInfo.CollectMetricNames())
	if err != nil {
		apiErr = exports.RaiseAPIErrorAuto(exports.AIARTS_UNKNOWN_ERROR, err.Error())
		return
	}

apt-get update ; apt-get install -y curl

The endpoint called is: "/api/runs/%s/metric/get-batch"

http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9

curl -X POST -d '[{"name":"test_1"}]' http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/metric/get-batch

{"detail":[{"loc":["body"],"msg":"value is not a valid list","type":"type_error.list"}]}

A call that works:

curl -H "Content-Type: application/json"  -X POST -d '[{"context": "", "name":"test_1"}]'  http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/metric/get-batch
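For reference, a minimal Python sketch of the same working call. It only builds the request and does not send it. My assumption (not verified against the aim source) is that the `Content-Type: application/json` header is what lets the server parse the body as a JSON list, since plain `curl -d` sends form-encoded data by default. The helper name `build_get_batch_request` is made up for this note.

```python
import json
import urllib.request

# Base URL and run hash copied from the curl commands above.
AIM_BASE = "http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9"
RUN_HASH = "df12e1de38b141fb9e804161"


def build_get_batch_request(base, run_hash, metrics):
    """Build (but do not send) the POST request for metric/get-batch."""
    url = f"{base}/api/runs/{run_hash}/metric/get-batch"
    body = json.dumps(metrics).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Assumption: this header is what makes the body parse as a JSON list.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_get_batch_request(AIM_BASE, RUN_HASH, [{"context": "", "name": "test_1"}])
# To actually send it: urllib.request.urlopen(req, timeout=10).read()
```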

Currently the flow has 2 steps:

Step 1: query which metrics the run has

Example response:

{"params":{"hparams":{"batch_size":2,"learning_rate":0.001}},"traces":{"logs":[],"images":[],"audios":[],"distributions":[],"figures":[],"texts":[],"metric":[{"context":{"gpu":0},"name":"__system__gpu_memory_percent"},{"context":{"gpu":0},"name":"__system__gpu_power_watts"},{"context":{"gpu":0},"name":"__system__gpu_temp"},{"context":{"gpu":0},"name":"__system__gpu"},{"context":{},"name":"__system__cpu"},{"context":{},"name":"__system__disk_percent"},{"context":{},"name":"__system__memory_percent"},{"context":{},"name":"__system__p_memory_percent"},{"context":{},"name":"metric_name"},{"context":{},"name":"test_1"},{"context":{},"name":"test_2"},{"context":{},"name":"test_3"},{"context":{},"name":"test_4"},{"context":{},"name":"test_5"}]},"props":{"name":"Run: 30(train-2b20f4a1-5172-4844-b853-ea868f0c6e36)","description":null,"experiment":{"id":"bb3237a5-00a1-4c26-955f-31ab85ad02ea","name":"default"},"tags":[],"archived":false,"creation_time":1668672386.153556,"end_time":1668672414.478266,"active":false,"notes":0}}

Step 2: query the values of those metrics

Example response:

[{"name":"test_1","context":{},"values":[0.5087581830665021,0.1352001147880273,0.4123761320648158,0.9422590223948614,0.15437425998701015,0.16548192405524198,0.979551255021762,0.24343191495056993,0.925818780988329,0.32060684813732665],"iters":[0,1,2,3,4,5,6,7,8,9]}]
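A rough sketch of how the first response can feed the second request: take the `traces.metric` list from the info response and turn it into the body for get-batch. The sample data is a trimmed copy of the example above. Skipping the `__system__` metrics and the `wanted` filter are my own assumptions for illustration (I have not checked what `CollectMetricNames()` actually does), and I am also assuming get-batch accepts the `context` object exactly as info returns it.

```python
import json

# Trimmed copy of the first example response (only the fields used here).
INFO_RESPONSE = json.loads("""
{"traces": {"metric": [
  {"context": {"gpu": 0}, "name": "__system__gpu"},
  {"context": {}, "name": "__system__cpu"},
  {"context": {}, "name": "metric_name"},
  {"context": {}, "name": "test_1"},
  {"context": {}, "name": "test_2"}
]}}
""")


def user_metrics(info, wanted=None):
    """Pick metric descriptors out of an info response.

    Skips "__system__" metrics (assumption); if `wanted` is given,
    keeps only metrics whose name is in that set.
    """
    out = []
    for m in info.get("traces", {}).get("metric", []):
        name = m["name"]
        if name.startswith("__system__"):
            continue
        if wanted is not None and name not in wanted:
            continue
        out.append({"context": m["context"], "name": name})
    return out


# Body for the metric/get-batch POST.
batch_body = user_metrics(INFO_RESPONSE, wanted={"test_1", "test_2"})
print(json.dumps(batch_body))
```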

Step 1 is the following request:

curl -H "Content-Type: application/json"  -X GET -d '[{"context": "", "name":"test_1"}]'  http://0.0.0.0:80/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/df12e1de38b141fb9e804161/info

My impression is that step 1 is probably the slow one.
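This is only a guess so far, so it is worth timing the two requests separately. A small sketch of a timing helper; `fetch_info` and `fetch_batch` are hypothetical stand-ins that just sleep, to be replaced with the real HTTP calls.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn, print how long it took, and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.1f} ms")
    return result, elapsed


# Hypothetical stand-ins for the two HTTP calls; replace with real requests.
def fetch_info():
    time.sleep(0.05)
    return {"traces": {"metric": []}}


def fetch_batch():
    time.sleep(0.01)
    return []


info, t_info = timed("step 1 (info)", fetch_info)
batch, t_batch = timed("step 2 (get-batch)", fetch_batch)
```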

I think the step 1 result would be easy to cache.
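A minimal sketch of what such a cache could look like: an in-memory map keyed by run hash with a time-to-live. This is not the ai-arts implementation (that side is Go); `TTLCache` and `fetch_run_info` are names invented here, and the 30 second TTL is an arbitrary placeholder. It is not thread-safe as written.

```python
import time


class TTLCache:
    """Tiny in-memory cache: key -> (expiry_time, value)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self._items.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the slow call
        value = fetch(key)
        self._items[key] = (now + self.ttl, value)
        return value


calls = []


def fetch_run_info(run_hash):
    # Hypothetical stand-in for the step 1 request.
    calls.append(run_hash)
    return {"traces": {"metric": [{"context": {}, "name": "test_1"}]}}


fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

first = cache.get_or_fetch("df12e1de38b141fb9e804161", fetch_run_info)
second = cache.get_or_fetch("df12e1de38b141fb9e804161", fetch_run_info)  # cache hit
fake_now[0] = 31.0  # pretend 31 seconds have passed
third = cache.get_or_fetch("df12e1de38b141fb9e804161", fetch_run_info)  # expired, refetched
```

The TTL matters because a run that is still training can gain new metrics. The info response carries an `active` flag, so finished runs (`"active": false`) could probably be cached much longer than active ones.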

Part II

further observation

How does aim's own live-update do it?

The endpoint it requests is:

https://test-3-172.apulis.com.cn/endpoints/aim/eyJwb3J0Ijo4MCwic2VydmljZSI6ImFpbXN0YWNrLmFwdWxpcy5zdmMuY2x1c3Rlci5sb2NhbCJ9/api/runs/search/metric?p=500&q=%28run.hash+in+%5B%22df12e1de38b141fb9e804161%22%5D%29+and+%28%28metric.name+%3D%3D+%22test_5%22%29+or+%28metric.name+%3D%3D+%22metric_name%22%29+or+%28metric.name+%3D%3D+%22test_1%22%29+or+%28metric.name+%3D%3D+%22test_2%22%29+or+%28metric.name+%3D%3D+%22test_3%22%29+or+%28metric.name+%3D%3D+%22test_4%22%29%29&report_progress=False
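Decoded, the `q` parameter is a plain query expression over `run.hash` and `metric.name`. A sketch that rebuilds the same query string from a run hash and a list of metric names; `build_metric_query` is a name made up for this note, and the parameter values (`p=500`, `report_progress=False`) are simply copied from the URL above without knowing exactly what they control.

```python
from urllib.parse import parse_qs, urlencode


def build_metric_query(run_hashes, metric_names):
    """Build the query expression seen in the `q` parameter above."""
    hashes = ", ".join(f'"{h}"' for h in run_hashes)
    names = " or ".join(f'(metric.name == "{n}")' for n in metric_names)
    return f"(run.hash in [{hashes}]) and ({names})"


q = build_metric_query(
    ["df12e1de38b141fb9e804161"],
    ["test_5", "metric_name", "test_1", "test_2", "test_3", "test_4"],
)
# urlencode uses quote_plus, so spaces become "+" as in the captured URL.
query_string = urlencode({"p": 500, "q": q, "report_progress": "False"})
print(q)
print(query_string)
```

So the UI fetches all selected metrics for the run in a single search request, instead of the info + get-batch pair.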

thought:

Inside aim, this assignment that I added may cause a write to happen during a read:

                trace.run.name = 'Run: ' + translate_name(trace.run.hash)
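To illustrate the suspicion with a toy model (this is NOT aim's real run class, just a stand-in): if `name` is a property whose setter persists the value, then assigning it inside a read path turns every read into a write. Computing the label separately avoids that. `FakeRun`, `display_name` and this version of `translate_name` are all hypothetical.

```python
class FakeRun:
    """Toy stand-in for a run object; NOT aim's real class."""

    def __init__(self, run_hash):
        self.hash = run_hash
        self._name = None
        self.writes = 0  # counts simulated storage writes

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # A real backend might take a lock or open a write transaction here.
        self.writes += 1
        self._name = value


def translate_name(run_hash):
    # Hypothetical stand-in for the translate_name helper in the line above.
    return run_hash[:8]


def display_name(run):
    """Compute the label without modifying the run object."""
    return "Run: " + translate_name(run.hash)


run = FakeRun("df12e1de38b141fb9e804161")
run.name = "Run: " + translate_name(run.hash)  # same pattern as above: one write
label = display_name(run)                      # read-only alternative: no write
```

If the real setter behaves like this, returning the computed label in the response instead of assigning it to `trace.run.name` would keep the read path free of writes.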