[[171][AIARTS][Baixin on-site] git push fails in Code Development] https://www.tapd.cn/42483287/bugtrace/bugs/view?bug_id=1142483287001007608
Remote connection
Connect to the cluster:
ssh -p 8886 root@220.194.147.64 100Trust!@
ssh root@192.3.21.217 100trust!@
Connect to the Code Development environment as a regular user:
ssh -p 34221 org-dev@192.3.21.217 Nwo7jvCW
Process
org-dev@dev-d09e033f-989f-45f2-858c-ab591110b265-vxsld:~/code$ git status
On branch master
Your branch is ahead of 'origin/master' by 2 commits.
(use "git push" to publish your local commits)
nothing to commit, working tree clean
org-dev@dev-d09e033f-989f-45f2-858c-ab591110b265-vxsld:~/code$ git push
Counting objects: 98, done.
Delta compression using up to 192 threads.
Compressing objects: 100% (94/94), done.
remote: error: file write error: Bad file descriptor
remote: fatal: unable to write loose object file
error: remote unpack failed: unpack-objects abnormal exit
error: failed to push some refs to 'git@gitea-ssh.apulis:gitea/ouc68z1NTiC02V7Z0mTAbg.git'
Adding text files seems to work fine?
Suspected culprit: the large checkpoint file (ls -lh code/pretrained_model/ssd_80C_500E.ckpt)?
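To check the large-file suspicion, a sketch like the following could list every file in the working tree above a size threshold. The helper name `find_large_files` and the 10 MiB default are assumptions made for illustration; they are not part of the original investigation.

```python
import os


def find_large_files(root, min_bytes=10 * 1024 * 1024):
    """Return (size, path) pairs for files under root of at least min_bytes, largest first."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store; only the working tree matters here.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # broken symlink or unreadable file
            if size >= min_bytes:
                hits.append((size, path))
    return sorted(hits, reverse=True)
```

Calling `find_large_files(".")` from `~/code` would show whether the checkpoint is the only large object among the pending commits.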
org-dev@dev-d09e033f-989f-45f2-858c-ab591110b265-vxsld:~/code$ git commit -m "try add code"
[master 2a54d46] try add code
69 files changed, 7027 insertions(+)
create mode 100644 code/Dockerfile
create mode 100644 code/create_data.py
create mode 100644 code/demo.jpg
create mode 100644 code/eval.py
create mode 100644 code/export.py
create mode 100644 code/infer.py
create mode 100644 code/infer/convert/convert_om.sh
create mode 100644 code/infer/data/classes.json
create mode 100644 code/infer/data/classes_id.json
create mode 100644 code/infer/data/coco_ssd_mobile_net_v2.name
create mode 100644 code/infer/data/config.cfg
create mode 100644 code/infer/data/ssd-mobilenet-v2.aipp
create mode 100644 code/infer/data/ssd_mobile_net_v2_aipp.pipeline
create mode 100644 code/infer/data/ssd_mobile_net_v2_no_aipp.pipeline
create mode 100644 code/infer/mxbase/CommandFlagParser.h
create mode 100644 code/infer/mxbase/FunctionTimer.h
create mode 100644 code/infer/mxbase/MxBaseInfer.cpp
create mode 100644 code/infer/mxbase/MxBaseInfer.h
create mode 100644 code/infer/mxbase/MxImage.cpp
create mode 100644 code/infer/mxbase/MxImage.h
create mode 100644 code/infer/mxbase/MxUtil.cpp
create mode 100644 code/infer/mxbase/MxUtil.h
create mode 100644 code/infer/mxbase/SSDInfer.cpp
create mode 100644 code/infer/mxbase/SSDInfer.h
create mode 100644 code/infer/mxbase/SSDPostProcessor.h
create mode 100644 code/infer/mxbase/build.sh
create mode 100644 code/infer/mxbase/main.cpp
create mode 100644 code/infer/mxbase/run_test.py
create mode 100644 code/infer/sdk/mxpi/MxpiSSDMobileNetV2PostProcessor.cpp
create mode 100644 code/infer/sdk/mxpi/MxpiSSDMobileNetV2PostProcessor.h
create mode 100644 code/infer/sdk/mxpi/build_mxpi.sh
create mode 100644 code/infer/sdk/sample/ResultProcess.h
create mode 100644 code/infer/sdk/sample/aipp.cpp
create mode 100644 code/infer/sdk/sample/build_aipp.sh
create mode 100644 code/infer/sdk/sample/build_no_aipp.sh
create mode 100644 code/infer/sdk/sample/no_aipp.cpp
create mode 100644 code/mindspore_hub_conf.py
create mode 100644 code/modelzoo_level.txt
create mode 100644 code/on_platform/modelarts/README.md
create mode 100644 code/on_platform/modelarts/__init__.py
create mode 100644 code/on_platform/modelarts/start.py
create mode 100644 code/on_platform/plat_cfg.yaml
create mode 100644 code/requirements.txt
create mode 100644 code/scripts/docker_start.sh
create mode 100644 code/scripts/run_distribute_self_defined_train.sh
create mode 100644 code/scripts/run_distribute_train.sh
create mode 100644 code/scripts/run_distribute_train_gpu.sh
create mode 100644 code/scripts/run_eval.sh
create mode 100644 code/scripts/run_eval_gpu.sh
create mode 100644 code/serve_desc.template
create mode 100644 code/src/__init__.py
create mode 100644 code/src/anchor_generator.py
create mode 100644 code/src/box_utils.py
create mode 100644 code/src/config.py
create mode 100644 code/src/config_ssd300.py
create mode 100644 code/src/config_ssd_mobilenet_v1_fpn.py
create mode 100644 code/src/dataset.py
create mode 100644 code/src/eval_utils.py
create mode 100644 code/src/init_params.py
create mode 100644 code/src/lr_schedule.py
create mode 100644 code/src/mobilenet_v1_fpn.py
create mode 100644 code/src/ssd.py
create mode 100644 code/train.py
create mode 100644 code/transformer/ext.proto
create mode 100644 code/transformer/ext_pb2.py
create mode 100644 code/transformer/postprocess.py
create mode 100644 code/transformer/preprocess.py
create mode 100644 code/transformer/serve.yaml
create mode 100644 code/version.ini
The "try add infer" commit also runs into problems:
org-dev@dev-d09e033f-989f-45f2-858c-ab591110b265-vxsld:~/code$ git push
Counting objects: 10, done.
Delta compression using up to 192 threads.
Compressing objects: 100% (9/9), done.
remote: fatal: error when closing loose object file: I/O error
error: remote unpack failed: unpack-objects abnormal exit
error: failed to push some refs to 'git@gitea-ssh.apulis:gitea/ouc68z1NTiC02V7Z0mTAbg.git'
org-dev@dev-d09e033f-989f-45f2-858c-ab591110b265-vxsld:~/code$ ls -lh infer
total 25M
-rw-r--r-- 1 org-dev domainusers 180 Nov 21 18:28 eval_result.json
drwxr-xr-x 2 org-dev domainusers 4.0K Nov 21 18:28 export
-rw-r--r-- 1 org-dev domainusers 295K Nov 21 18:28 predictions.json
-rw-r--r-- 1 org-dev domainusers 2.5K Nov 21 18:28 serve_desc.template
-rw-r--r-- 1 org-dev domainusers 2.5K Nov 21 18:37 serve_desc.yaml
-rw-r--r-- 1 org-dev domainusers 13M Nov 21 18:28 ssd.air
-rw------- 1 org-dev domainusers 12M Nov 21 18:37 ssd.om
drwxr-xr-x 2 org-dev domainusers 4.0K Nov 21 18:28 transformer
git config --global http.postBuffer 524288000
The same operation was performed against https://gitlab.apulis.com.cn/ran.lu/psutil.git, also adding this zip file, and the problem did not occur.
(https://gitlab.apulis.com.cn/ran.lu/psutil/-/tree/feat/test-add-zip)
[pid 523107] 14:11:09.606926 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=17, si_uid=1000} ---
Edit the configuration file to add log settings:
cd /data/aistudio/nfs/apulis/pvc/aiplatform-gitea-data
ls gitea/conf/
app.ini
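The configuration change does not seem to take effect.
After changing the secret, it did take effect. However, no new useful logs appeared.
sshd process: root 1039605 0.0 0.0 4292 3384 ? Ss 15:19 0:00 sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups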
strace -f -T -tt -e trace=all -p 1039605
[pid 1207696] 15:46:28.089729 write(3, "$\207)(\356\327\306\267\271\256\v\233\255S\351\352\376[\324\330\370\32\254\332\360\216\312\252\33\220\260\336"..., 4096) = -1 EBADF (Bad file descriptor) <0.000053>
[pid 1207696] 15:46:28.089936 write(2, "error: file write error: Bad fil"..., 45) = 45 <0.000046>
[pid 1207438] 15:46:28.090051 <... read resumed> "error: file write error: Bad fil"..., 128) = 45 <9.879668>
[pid 1207696] 15:46:28.090106 write(2, "fatal: unable to write loose obj"..., 41 <unfinished ...>
[pid 1207438] 15:46:28.090167 write(1, "0032\2", 5 <unfinished ...>
Found 2 occurrences of EBADF:
ioctl(-1, TIOCGPGRP, 0xfffffebfa84c) = -1 EBADF (Bad file descriptor) <0.000029>
pid: 1237051
This file descriptor:
1237051 15:51:01.193577 openat(AT_FDCWD, "/data/git/gitea-repositories/gitea/ouc68z1ntic02v7z0mtabg.git/./objects/incoming-oEoffE/b3/tmp_obj_XwYObm", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0444) = 3 <0.033109>
EBADF fd is not a valid file descriptor or is not open for writing.
https://linux.die.net/man/2/write
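Finding the failing calls by eye in a long strace log is error-prone. The following is a minimal sketch of a filter that extracts every syscall that returned -1 with a given errno. The regular expression assumes the line shapes seen in the excerpts above (an optional `[pid N]` or bare `N` prefix, an optional `-tt` timestamp); the name `failed_syscalls` is hypothetical.

```python
import re

FAILED_CALL = re.compile(
    r"^(?:\[pid\s+(?P<pid1>\d+)\]\s+|(?P<pid2>\d+)\s+)?"   # optional "[pid N]" or "N" prefix
    r"(?:\d{2}:\d{2}:\d{2}\.\d+\s+)?"                       # optional -tt timestamp
    r"(?P<syscall>\w+)\(.*\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)\b"
)


def failed_syscalls(lines, errno_name="EBADF"):
    """Return (pid, syscall) for each completed call that failed with errno_name."""
    found = []
    for line in lines:
        m = FAILED_CALL.match(line.strip())
        if m and m.group("errno") == errno_name:
            # pid is None when the line carries no pid prefix.
            found.append((m.group("pid1") or m.group("pid2"), m.group("syscall")))
    return found
```

Lines marked `<unfinished ...>` are skipped because they carry no return value; their `<... resumed>` counterparts would need separate handling.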
mmap
void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);
int munmap(void *addr, size_t length);
1237051 15:51:01.226765 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff82a9c000 <0.000043>
1237051 15:51:01.226908 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f790000 <0.000027>
1237051 15:51:01.227002 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f77f000 <0.000025>
1237051 15:51:01.227093 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f76e000 <0.000025>
1237051 15:51:01.227182 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f75d000 <0.000025>
These add up to 286720 bytes.
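The sum can be reproduced from the trace lines themselves. This is a small sketch, assuming the mmap length is always the second argument as in the excerpt above; `total_mmap_bytes` is a hypothetical helper name.

```python
import re

# Capture the second argument (the length) of each mmap() call.
MMAP_LEN = re.compile(r"\bmmap\(\s*[^,]+,\s*(\d+)\s*,")


def total_mmap_bytes(lines):
    """Sum the length argument of every mmap() call found in the strace lines."""
    total = 0
    for line in lines:
        m = MMAP_LEN.search(line)
        if m:
            total += int(m.group(1))
    return total


SAMPLE = [
    "1237051 15:51:01.226765 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff82a9c000 <0.000043>",
    "1237051 15:51:01.226908 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f790000 <0.000027>",
    "1237051 15:51:01.227002 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f77f000 <0.000025>",
    "1237051 15:51:01.227093 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f76e000 <0.000025>",
    "1237051 15:51:01.227182 mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7f75d000 <0.000025>",
]
print(total_mmap_bytes(SAMPLE))  # 8192 + 4 * 69632 = 286720
```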
67159 lines, 76406 lines
Probably not a case of exceeding some line-count limit…
NFS is also a suspect, because dmesg contains many error logs.
https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat
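Before digging into NFS statistics, it helps to confirm which filesystem actually backs the repository directory. The sketch below resolves a path against the text of `/proc/mounts` (whitespace-separated fields: device, mount point, filesystem type, ...). The function name `fs_type_for` is hypothetical, and escaped spaces in mount points (`\040`) are not handled.

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mountpoint, fstype) of the deepest mount that contains path, or None."""
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        prefix = mountpoint.rstrip("/") + "/"
        if path == mountpoint or (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same mount point win, since it shadows the earlier one.
            if best is None or len(mountpoint) >= len(best[0]):
                best = (mountpoint, fstype)
    return best
```

Inside the gitea container this could be called as `fs_type_for("/data/git/gitea-repositories", open("/proc/mounts").read())`; a result whose type is `nfs` or `nfs4` would confirm that the loose-object writes go over NFS.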
Reading the gitea code
var (
    allowedCommands = map[string]models.AccessMode{
        "git-upload-pack":    models.AccessModeRead,
        "git-upload-archive": models.AccessModeRead,
        "git-receive-pack":   models.AccessModeWrite,
        lfsAuthenticateVerb:  models.AccessModeNone,
    }
)
m.PostOptions("/git-receive-pack", repo.ServiceReceivePack)
git-receive-pack - Receive what is pushed into the repository. Invoked by git send-pack and updates the repository with the information fed from the remote end. This command is usually not invoked directly by the end user.
Indeed, git-receive-pack is invoked:
root@ubuntu:~# cat tmp_strace.log | grep -i receive
1236483 15:50:56.633331 newfstatat(AT_FDCWD, "/bin/git-receive-pack", 0x400162c338, 0) = -1 ENOENT (No such file or directory) <0.000029>
1236483 15:50:56.633420 newfstatat(AT_FDCWD, "/usr/bin/git-receive-pack", <unfinished ...>
1236569 15:50:56.640285 execve("/usr/bin/git-receive-pack", ["git-receive-pack", "gitea/ouc68z1ntic02v7z0mtabg.git"], 0x400167ec30 /* 24 vars */ <unfinished ...>
1236459 15:51:03.606437 write(2, "Received disconnect from 172.20."..., 76) = 76 <0.000016>
Fei switched the data volume to hostPath, after which git push of large files worked:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/edge
            operator: DoesNotExist
  containers:
  - env:
    - name: SSH_LISTEN_PORT
      value: "22"
    - name: SSH_PORT
      value: "22"
    - name: GITEA_APP_INI
      value: /data/gitea/conf/app.ini
    - name: GITEA_CUSTOM
      value: /data/gitea
    - name: GITEA_WORK_DIR
      value: /data
    - name: GITEA_TEMP
      value: /tmp/gitea
    - name: TMPDIR
      value: /tmp/gitea
    - name: POSTGRES_PASSWORD
      value: vault:secret/data/postgres#POSTGRES_PASSWORD
    image: harbor.internal.cn:8443/internal/aistudio/infra/gitea:aistudio-v1.7.1-rc0
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 10
      initialDelaySeconds: 200
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: http
      timeoutSeconds: 1
    name: gitea
    ports:
    - containerPort: 22
      name: ssh
      protocol: TCP
    - containerPort: 3000
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: http
      timeoutSeconds: 1
    resources: {}
    securityContext: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: temp
    - mountPath: /data
      name: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xjsk2
      readOnly: true
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --serviceCluster
    - gitea.$(POD_NAMESPACE)
    - --proxyLogLevel=warning
    - --proxyComponentLogLevel=misc:error
    - --log_output_level=default:info
    - --concurrency
    - "2"
    env:
    - name: JWT_POLICY
      value: first-party-jwt
    - name: PILOT_CERT_PROVIDER
      value: istiod
    - name: CA_ADDR
      value: istiod.istio-system.svc:15012
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: CANONICAL_SERVICE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-name']
    - name: CANONICAL_REVISION
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-revision']
    - name: PROXY_CONFIG
      value: |
        {}
    - name: ISTIO_META_POD_PORTS
      value: |-
        [
        {"name":"ssh","containerPort":22,"protocol":"TCP"}
        ,{"name":"http","containerPort":3000,"protocol":"TCP"}
        ]
    - name: ISTIO_META_APP_CONTAINERS
      value: gitea
    - name: ISTIO_META_CLUSTER_ID
      value: Kubernetes
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_METAJSON_ANNOTATIONS
      value: |
        {"checksum/config":"f274be7bf3c100a034ac4b3fc396fe970118128b1b3d912a9cbc07f65bcc52d5","checksum/ldap":"00b7af41c86021efd76987f55a6e6aa17a497e98f2a48b9c2f71d5c0295ed342","checksum/oauth":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","traffic.sidecar.istio.io/excludeInboundPorts":"22","traffic.sidecar.istio.io/excludeOutboundPorts":"22","vault.security.banzaicloud.io/vault-addr":"https://vault.kube-system:8200","vault.security.banzaicloud.io/vault-tls-secret":"vault-tls"}
    - name: ISTIO_META_WORKLOAD_NAME
      value: gitea
    - name: ISTIO_META_OWNER
      value: kubernetes://apis/apps/v1/namespaces/apulis/statefulsets/gitea
    - name: ISTIO_META_MESH_ID
      value: cluster.local
    - name: TRUST_DOMAIN
      value: cluster.local
    image: harbor.internal.cn:8443/internal/istio/proxyv2:1.9.4
    imagePullPolicy: Always
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1337
      runAsNonRoot: true
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/istio
      name: istiod-ca-cert
    - mountPath: /var/lib/istio/data
      name: istio-data
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /etc/istio/pod
      name: istio-podinfo
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xjsk2
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: gitea-0
  initContainers:
  - command:
    - /usr/sbin/init_directory_structure.sh
    env:
    - name: GITEA_APP_INI
      value: /data/gitea/conf/app.ini
    - name: GITEA_CUSTOM
      value: /data/gitea
    - name: GITEA_WORK_DIR
      value: /data
    - name: GITEA_TEMP
      value: /tmp/gitea
    - name: POSTGRES_PASSWORD
      value: vault:secret/data/postgres#POSTGRES_PASSWORD
    image: harbor.internal.cn:8443/internal/aistudio/infra/gitea:aistudio-v1.7.1-rc0
    imagePullPolicy: IfNotPresent
    name: init-directories
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /usr/sbin
      name: init
    - mountPath: /tmp
      name: temp
    - mountPath: /etc/gitea/conf
      name: config
    - mountPath: /data
      name: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xjsk2
      readOnly: true
  - command:
    - /usr/sbin/configure_gitea.sh
    env:
    - name: GITEA_APP_INI
      value: /data/gitea/conf/app.ini
    - name: GITEA_CUSTOM
      value: /data/gitea
    - name: GITEA_WORK_DIR
      value: /data
    - name: GITEA_TEMP
      value: /tmp/gitea
    - name: GITEA_ADMIN_USERNAME
      value: gitea
    - name: GITEA_ADMIN_PASSWORD
      value: BD2IHvf1jEVuK5J6
    - name: POSTGRES_PASSWORD
      value: vault:secret/data/postgres#POSTGRES_PASSWORD
    image: harbor.internal.cn:8443/internal/aistudio/infra/gitea:aistudio-v1.7.1-rc0
    imagePullPolicy: IfNotPresent
    name: configure-gitea
    resources: {}
    securityContext:
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /usr/sbin
      name: init
    - mountPath: /tmp
      name: temp
    - mountPath: /data
      name: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xjsk2
      readOnly: true
  - args:
    - istio-iptables
    - -p
    - "15001"
    - -z
    - "15006"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - '*'
    - -d
    - 15090,15021,22,15020
    - -o
    - "22"
    image: harbor.internal.cn:8443/internal/istio/proxyv2:1.9.4
    imagePullPolicy: Always
    name: istio-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xjsk2
      readOnly: true
  nodeName: 192.3.21.217
  nodeSelector:
    series: a310
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1337
  serviceAccount: default
  serviceAccountName: default
  subdomain: gitea
  terminationGracePeriodSeconds: 60
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 100
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 100
  volumes:
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - emptyDir: {}
    name: istio-data
  - downwardAPI:
      defaultMode: 420
      items:
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels
        path: labels
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations
        path: annotations
      - path: cpu-limit
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: limits.cpu
      - path: cpu-request
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: requests.cpu
    name: istio-podinfo
  - configMap:
      defaultMode: 420
      name: istio-ca-root-cert
    name: istiod-ca-cert
  - name: init
    secret:
      defaultMode: 511
      secretName: gitea-init
  - name: config
    secret:
      defaultMode: 420
      secretName: gitea
  - emptyDir: {}
    name: temp
  - hostPath:
      path: /opt/gitea
      type: ""
    name: data
  - name: default-token-xjsk2
    secret:
      defaultMode: 420
      secretName: default-token-xjsk2
Script used to try to reproduce the problem:
import os
import time


def write_pid_to_pidfile(pidfile_path):
    """ Write the PID in the named PID file, then keep appending data.

    Get the numeric process ID (“PID”) of the current process
    and write it to the named file as a line of text.
    """
    # Remove any leftover file first: O_EXCL fails if the file already exists.
    try:
        os.remove(pidfile_path)
    except FileNotFoundError:
        pass
    # Flags and mode mirror the openat() call seen in the strace output
    # (the trace used O_RDWR; O_WRONLY is enough for this write test).
    open_flags = (os.O_CREAT | os.O_EXCL | os.O_WRONLY | os.O_LARGEFILE)
    open_mode = 0o444
    pidfile_fd = os.open(pidfile_path, open_flags, open_mode)
    pidfile = os.fdopen(pidfile_fd, 'w')
    # According to the FHS 2.3 section on PID files in /var/run:
    #
    # The file must consist of the process identifier in
    # ASCII-encoded decimal, followed by a newline character. For
    # example, if crond was process number 25, /var/run/crond.pid
    # would contain three characters: two, five, and newline.
    pid = os.getpid()
    pidfile.write("%s\n" % pid)
    # Write 33 batches of 1000 x 4 KiB (about 129 MiB in total), one batch per second.
    for i in range(33):
        for rep in range(1000):
            pidfile.write('a' * 4096)
        print('i:', i)
        time.sleep(1)
    pidfile.close()


write_pid_to_pidfile('test')
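The two failed pushes reported errors at different stages ("file write error" in one, "error when closing loose object file" in the other), so a variant of the reproduction that reports which stage fails may be more informative. The sketch below uses raw file descriptors with the flags from the traced openat() call (O_LARGEFILE is left out for portability). The name `staged_write` and the explicit fsync step are assumptions: fsync is added only to try to surface write errors that a network filesystem may defer, and is not something the trace shows git doing.

```python
import errno
import os


def staged_write(path, total_bytes, chunk=4096):
    """Create path exclusively with mode 0444, write total_bytes, fsync and close.

    Returns None on success, otherwise (stage, errno_name) for the first failure.
    """
    def name(exc):
        return errno.errorcode.get(exc.errno, str(exc.errno))

    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o444)
    except OSError as exc:
        return ("open", name(exc))

    stage = "write"
    try:
        buf = b"a" * chunk
        written = 0
        while written < total_bytes:
            # os.write may write fewer bytes than requested, so count what it reports.
            written += os.write(fd, buf[: total_bytes - written])
        stage = "fsync"
        os.fsync(fd)
    except OSError as exc:
        try:
            os.close(fd)
        except OSError:
            pass
        return (stage, name(exc))

    try:
        os.close(fd)
    except OSError as exc:
        return ("close", name(exc))
    return None
```

Running it once against a directory on the NFS-backed volume and once against a hostPath directory, with a size similar to the failing objects, would show whether the failure follows the storage backend.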
A successful git push:
root@nfs-nginx-5d98bcd8cb-hzgks:/tmp/ouc68z1ntic02v7z0mtabg# git push
root@127.0.0.1's password:
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 192 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 70.02 MiB | 17.23 MiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
To 127.0.0.1:/data/git/gitea-repositories/ouc68z1ntic02v7z0mtabg.git
bdb2b02..ac8d7bd master -> master
Caused by the git version?
git in the Code Development pod is version 2.17; the server side is version 2.30.
But would a version mismatch explain why hostPath works while NFS fails?