selkies-gstream hit 100% CPU #33

Open
mathico2 opened this issue Jan 4, 2024 · 32 comments

Comments


mathico2 commented Jan 4, 2024

Wishing you a Happy New Year! I've set up a node using Standard_NC16as_T4_v3 in AKS. However, we're encountering an issue where the pod runs for a few minutes with CPU usage under 9% for the selkies-gstreamer process, but then usage abruptly jumps to over 100%, causing the pod to freeze and become inaccessible. Could you please advise whether anything was missed while deploying this app in the Azure environment?

image
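To pin down when the jump from under 9% to over 100% happens, it can help to log the CPU usage of the process at a fixed interval. The following is a minimal sketch, assuming a Linux host or container with /proc available; the function names, the 5-second interval, and the 90% threshold are illustrative choices and not part of Selkies-GStreamer.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The process name (field 2) is wrapped in parentheses and may contain
    # spaces, so cut at the last ')' before splitting the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime


def cpu_percent(ticks_before: int, ticks_after: int,
                seconds: float, ticks_per_second: int) -> float:
    """CPU usage over the interval; 100.0 means one fully busy core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * seconds)


def watch(pid: int, interval: float = 5.0, threshold: float = 90.0) -> None:
    """Print one timestamped CPU sample per interval until interrupted."""
    hz = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        previous = parse_cpu_ticks(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            current = parse_cpu_ticks(f.read())
        pct = cpu_percent(previous, current, interval, hz)
        flag = "  <-- spike" if pct >= threshold else ""
        print(f"{time.strftime('%H:%M:%S')} pid={pid} cpu={pct:.1f}%{flag}")
        previous = current


# Usage (the PID is a placeholder; find the real one with `pgrep -f selkies`):
# watch(1234)
```

The per-process counters in /proc/<pid>/stat cover all threads of the process, so a multithreaded process can report more than 100%. The timestamps can then be matched against the container and kernel logs.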
Member

ehfd commented Jan 5, 2024

I'm not sure what the issue might be (my guess is that it's something in the GStreamer backend), but try the new container that will be built in the next 10 minutes, because there was a new Selkies-GStreamer release with a new GStreamer patch version.

Author

mathico2 commented Jan 5, 2024

Hello ehfd, could you please provide the URL for the new release so I can update the image in the deployment?
I guess you are referring to the link below:

https://github.com/selkies-project/selkies-gstreamer/pkgs/container/selkies-gstreamer%2Fgstreamer

Author

mathico2 commented Jan 5, 2024

@ehfd I used the newly created selkies-gstreamer Docker image, but unfortunately it doesn't even open the app. Could you please advise which Docker image I need to push to Azure Container Registry and then use for the deployment?

Member

ehfd commented Jan 5, 2024

No, you can use ghcr.io/selkies-project/nvidia-egl-desktop:22.04 or 20.04.
I built a new image with the new release.

Author

mathico2 commented Jan 5, 2024 via email

Author

mathico2 commented Jan 5, 2024

I just tried the new Docker image but am still encountering the same issue. This is the deployment YAML file used for this deployment in AKS:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: egl
spec:
  replicas: 1
  selector:
    matchLabels:
      app: egl
  template:
    metadata:
      labels:
        app: egl
    spec:
      hostname: egl
      # Uncomment the below line to disable network isolation for WebRTC connectivity, may show an error if disallowed by the cluster
      # hostNetwork: true
      containers:
      - name: egl
        image: cazaw1232conregistry.azurecr.us/nvidia-egl-desktop:22.04
        env:
        - name: TZ
          value: "UTC"
        - name: SIZEW
          value: "1920"
        - name: SIZEH
          value: "1080"
        - name: REFRESH
          value: "60"
        - name: DPI
          value: "96"
        - name: CDEPTH
          value: "24"
        # Keep to default unless you know what you are doing with VirtualGL, VGL_DISPLAY should be set to either egl[n], or /dev/dri/card[n] only when the device was passed to the container
        #- name: VGL_DISPLAY
        #  value: "egl"
        # Choose either value: or secretKeyRef: but not both at the same time
        - name: PASSWD
          value: "mypasswd"
        #  valueFrom:
        #    secretKeyRef:
        #      name: my-pass
        #      key: my-pass
        # Uncomment this to enable noVNC, disabling selkies-gstreamer and ignoring all its parameters except BASIC_AUTH_PASSWORD, which will be used for authentication with noVNC, BASIC_AUTH_PASSWORD defaults to PASSWD if not provided
        #- name: NOVNC_ENABLE
        #  value: "true"
        # Additional view-only password only applicable to the noVNC interface, choose either value: or secretKeyRef: but not both at the same time
        #- name: NOVNC_VIEWPASS
        #  value: "mypasswd"
        #  valueFrom:
        #    secretKeyRef:
        #      name: my-pass
        #      key: my-pass
        ###
        # selkies-gstreamer parameters, for additional configurations see lines that start with "parser.add_argument" in https://github.com/selkies-project/selkies-gstreamer/blob/master/src/selkies_gstreamer/__main__.py
        ###
        # Change WEBRTC_ENCODER to x264enc, vp8enc, or vp9enc if you are using software fallback without allocated GPUs or your GPU doesn't support H.264 (AVCHD) under the NVENC - Encoding section in https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new
        - name: WEBRTC_ENCODER
          value: "nvh264enc"
        - name: WEBRTC_ENABLE_RESIZE
          value: "false"
        - name: ENABLE_BASIC_AUTH
          value: "true"
        - name: ENABLE_HTTPS_WEB
          value: "false"
        # Volume mount trusted HTTPS certificate to new path for no web browser warnings
        #- name: HTTPS_WEB_CERT
        #  value: /etc/ssl/certs/ssl-cert-snakeoil.pem
        #- name: HTTPS_WEB_KEY
        #  value: /etc/ssl/private/ssl-cert-snakeoil.key
        # Defaults to PASSWD if unspecified, choose either value: or secretKeyRef: but not both at the same time
        #- name: BASIC_AUTH_PASSWORD
        #  value: "mypasswd"
        #  valueFrom:
        #    secretKeyRef:
        #      name: my-pass
        #      key: my-pass
        ###
        # Uncomment below to use a TURN server for improved network compatibility
        ###
        #- name: TURN_HOST
        #  value: "turn.example.com"
        #- name: TURN_PORT
        #  value: "3478"
        # Provide only TURN_SHARED_SECRET for time-limited shared secret authentication or both TURN_USERNAME and TURN_PASSWORD for legacy long-term authentication, but do not provide both authentication methods at the same time
        #- name: TURN_SHARED_SECRET
        #  valueFrom:
        #    secretKeyRef:
        #      name: turn-shared-secret
        #      key: turn-shared-secret
        #- name: TURN_USERNAME
        #  value: "username"
        # Choose either value: or secretKeyRef: but not both at the same time
        #- name: TURN_PASSWORD
        #  value: "mypasswd"
        #  valueFrom:
        #    secretKeyRef:
        #      name: turn-password
        #      key: turn-password
        # Change to tcp if the UDP protocol is throttled or blocked in your client network, or when the TURN server does not support UDP
        #- name: TURN_PROTOCOL
        #  value: "udp"
        # You need a valid hostname and a certificate from authorities such as ZeroSSL (Let's Encrypt may have issues) to enable this
        #- name: TURN_TLS
        #  value: "false"
        stdin: true
        tty: true
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        resources:
          limits:
            memory: 64Gi
            cpu: "16"
            nvidia.com/gpu: 1
          requests:
            memory: 100Mi
            cpu: 100m
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /cache
          name: egl-cache-vol
        - mountPath: /home/user
          name: egl-root-vol
        - mountPath: /dev/dri
          name: drm
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: egl-cache-vol
        emptyDir: {}
      #  persistentVolumeClaim:
      #    claimName: egl-cache-vol
      - name: egl-root-vol
        emptyDir: {}
      #  persistentVolumeClaim:
      #    claimName: egl-root-vol
      - name: drm
        emptyDir: {}

image

Member

ehfd commented Jan 6, 2024

Can you follow the procedure at https://github.com/selkies-project/selkies-gstreamer#install-the-packaged-version-on-a-standalone-machine-or-cloud-instance outside Kubernetes or any container, on the same VM instance? I want to check whether it's a hardware issue or a container issue.

Member

ehfd commented Feb 2, 2024

Similar condition (both on Azure): selkies-project/docker-nvidia-glx-desktop#50

Member

ehfd commented Feb 2, 2024

@justinbowes

Selkies-GStreamer goes directly to NVENC, so I doubt this is VirtualGL. Do you have any leads? I'm asking because you were also using Azure's GPUs.


justinbowes commented Feb 2, 2024

@mathico2 On Azure, for virtual workstation applications you might try the GRID driver (which is supported on the T4 instances -- see the exception here: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup).

Beyond that, I'd be trying to isolate the layer in which the issue occurs. What is the output of nvidia-smi encodersessions and nvidia-smi dmon, outside of containers, while this is happening?

Also worth checking the kernel messages to see if the driver is complaining.
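The checks above can be gathered in one pass and saved for comparison between the host and the container. This is a rough sketch, assuming `nvidia-smi` and `dmesg` are on the PATH of the machine where it runs; the `-c 5` sample count on `encodersessions` is an assumption that should be confirmed with `nvidia-smi encodersessions -h`, and the helper names and the 30-second timeout are illustrative.

```python
import shutil
import subprocess

# The two nvidia-smi commands named above, plus dmesg for kernel messages.
# "-c 5" asks for five samples so the monitoring commands terminate; for
# encodersessions this flag is an assumption (check the subcommand's -h).
COMMANDS = {
    "encodersessions": ["nvidia-smi", "encodersessions", "-c", "5"],
    "dmon": ["nvidia-smi", "dmon", "-c", "5"],
    "kernel": ["dmesg"],
}


def run(cmd: list[str], timeout: float = 30.0) -> str:
    """Run one command and return its combined output, or a short note."""
    if shutil.which(cmd[0]) is None:
        return f"[{cmd[0]} not found on PATH]"
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        # Safety net in case a monitoring command does not exit by itself.
        return f"[{' '.join(cmd)} timed out after {timeout:.0f}s]"
    return proc.stdout + proc.stderr


def collect(commands: dict[str, list[str]] = COMMANDS) -> dict[str, str]:
    """Return a mapping from check name to captured output."""
    return {name: run(cmd) for name, cmd in commands.items()}


# Usage: run while the CPU spike is happening, outside of containers.
# for name, output in collect().items():
#     print(f"===== {name} =====\n{output}")
```

Reading kernel messages with `dmesg` may require root privileges on some systems; in that case the error text is captured in the output instead of the log.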

Author

mathico2 commented Feb 2, 2024 via email

Member

ehfd commented Feb 2, 2024

I agree that we need a comparison inside and outside the container to make further progress.

Member

ehfd commented Mar 12, 2024

@mathico2 Any follow ups?

Author

mathico2 commented Mar 12, 2024 via email

Member

ehfd commented Mar 24, 2024

https://github.com/selkies-project/selkies-gstreamer#install-the-packaged-version-on-a-standalone-machine-or-cloud-instance

I need you to follow the procedures here outside of a container within the same instance to continue.

Author

mathico2 commented Mar 24, 2024 via email

Member

ehfd commented May 1, 2024

Leads to: python-xlib/python-xlib#242

Member

ehfd commented May 5, 2024

Just to confirm: did you happen to use an international keyboard layout? @mathico2

Author

mathico2 commented May 5, 2024 via email

Member

ehfd commented Jun 22, 2024

@mathico2 We have a new series of containers.

Member

ehfd commented Jun 25, 2024

@mathico2 We have a new release.

Author

mathico2 commented Jun 25, 2024 via email

Author

mathico2 commented Jun 25, 2024 via email

Author

mathico2 commented Jun 25, 2024 via email

Author

mathico2 commented Jun 25, 2024 via email

Author

mathico2 commented Jun 25, 2024 via email

Author

mathico2 commented Jun 25, 2024 via email

Member

ehfd commented Jun 26, 2024

I also updated the egl/xgl.yml files at the same time. Some variables were changed for the new release, so you can compare them against your file and update it.

Docker Hub: Why? Doesn't ghcr.io work the exact same way?

Member

ehfd commented Jul 3, 2024

There's a new docker-compose.yml as well.

Member

ehfd commented Jul 10, 2024

I've fixed everything that, to my knowledge, might cause this.
