If you see the service (and probably a "session" also) but no pod, then the "kvdi-manager" pod is probably off complaining about something in its logs. If you could paste those it would be helpful 😄 . (I do know I need a debugging doc or something)
I guess I'll add that pod restarts are probably a separate matter, you'd want to check out the logs on the containers in that pod too. If it's related to the `kvdi-proxy` container, that is actually currently undergoing a heavy rewrite anyway and will have a new release sometime tonight or tomorrow. If it's the `desktop` container, it would be something in the image causing an issue.
Well, now that I have a little more debug info, I can at least eliminate my own faults :)
But now I can see that although a single desktop is trying to load, the manager is handling more than one, and the PVCs are in some cases being used by two pods for a few seconds.
Looking at the logs of the manager, I see this exception:
ERROR controller-runtime.manager.controller.session Reconciler error {"reconciler group": "desktops.kvdi.io", "reconciler kind": "Session", "name": "hananm-5zgq5", "namespace": "kvdi", "error": "Pod \"hananm-5zgq5\" not found"}
The name of the new pod is hananm-pttxg so it seems like there is a mess there....
Well yes, the manager is responsible for... almost all the non-user-facing magic. You can scale it if you need to, but it uses leader election currently, so that would really only cover you in case of pod and/or node failure.
This kinda drifts into the controller-runtime a little bit, but the TLDR is that the API (and the helm chart) just create custom Kubernetes objects, and the manager handles turning those into actual things. It's a nice separation of responsibilities, but it does make the log a bit of a nightmare to parse if it's doing a lot of work. Grep is your friend because every log event will contain an identifier unique to the session or whatever it's relevant to.
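For example, something like this would isolate the events for that session (the manager pod name below is just a placeholder, use whatever `kubectl get pods -n kvdi` shows for yours):

```sh
# Placeholder pod name; grep for the session name to see only its reconcile events
kubectl logs -n kvdi kvdi-manager-xxxxxxxxxx-xxxxx | grep hananm-5zgq5
```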
But now I can see that although a single desktop is trying to load, the manager is handling more than one, and the PVCs are in some cases being used by two pods for a few seconds.
In most circumstances this should be fine. A benefit of the controller-runtime here is that all errors auto-retry, so it doesn't have to worry about silly race conditions like this, among other things. If you ever see an error in the log, it got retried further down, and most of the time was fine. Where you'd find yourself in a bad situation is if it gets stuck in a loop on the same error.
EDIT: Or I guess I'll add for clarity, in the case of a pod not being found, that was probably an error from it trying to delete a pod it didn't think it needed anymore. But chances are some other environmental factor got to it first. The next time the manager retries whatever job got it to that point, it acts based on what the environment looks like then, so it would not even see a pod it thinks it needs to delete. If that made sense 😛
I think I can add to your last epiphany here: (after a few hours of debugging and reading the code)
I changed the PVC to use NFS, still the same issue. The manager just loops and the pod keeps terminating. I see no reason anywhere for this.
Sometimes the pod does load and I can get to the desktop, but it takes a very long while.
I will say that kvdi creates a /tmp mount/volume internally, so it probably wouldn't let you do an additional separate tmp mount. You would have to choose a different path. But that would not make sense for a pod to get created and then destroyed. This I want to hear more about.
What other info can I get/supply?
I now also see a lot of errors from the proxy (when the pod is actually ready) about the display port not being able to connect due to a connection refused error.
Regarding the /tmp (and also /home), it would be nice to be able to set them in kvdi prior to launching. The logic is that I want each user to have a persistent volume for both of them. This way the installations and setups will persist between sessions, and I will be able to supply larger tmp folders.
Yea, it seems the main matter here is I either need to document better, or make more configurable, the volumes that kvdi uses internally for each desktop. `/home` and `/tmp` are among them. Those two in particular might be difficult to make configurable at the moment, home more so than tmp, just due to a lot of transitive dependencies.
As for the connection refused in the proxy: that is benign or bad depending on the context. It's common for the proxy to be ready before the desktop is, so a first connection attempt would throw that error, but the UI should know to retry. If it continues to happen, it's more likely an issue on the VNC side.
I'm supposing one thing I can easily do in the interim, potentially as part of the next release I'm in the middle of, is a sort of BYOV for home directories. The way it works right now is each session either gets an `emptyDir` or the PVC defined by the `userdataSpec` in the VDICluster configuration. There could be room for a third option here, similar to what was described in #23, like a `userdataSelector` or something to select existing claims.
What I'd be curious for your thoughts on is whether it would be easier for it to try to match a label on PVCs containing the username (seems like a cleaner implementation in my head), or a sort of "name template" as proposed in the other issue.
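To make the label idea concrete, a pre-existing home claim might look something like this (purely illustrative; the `kvdiUser` label key and the matching behavior are assumptions here, nothing is implemented yet):

```yaml
# Hypothetical sketch: a user's home PVC the manager could discover by label
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hananm-home
  labels:
    kvdiUser: hananm   # the manager would match claims carrying the session's username
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```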
I have like 2/3 of a release done that will be a shot at addressing these issues. In case you care for an update:

- A `userdataSelector` in the `VDICluster` configuration to allow for grabbing $HOME from pre-existing PVCs matching a pattern
- `/tmp` volumes provided in the template override the ones that get created internally

For the last bit, which would probably be needed, I'm still thinking on it. That is to say you want the `/tmp` volumes to be unique to each user, so the template where you specify the volumes needs to work in a way that makes that possible. I'm stuck between two ideas:

- A `spec.render: true` option to `Templates` or something, such that all the values get passed through a Go template with the username before later being processed into a pod.
- A `spec.volumes` in the `Template` with a block that allows for either specifying a `corev1.Volume` (like now), OR a selector like the one used for the `userdataSelector` mentioned above.

I'm going to keep thinking on it a bit more, but if you do have any thoughts I'm more than happy to hear them.
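As a rough illustration of the first idea only (nothing here exists yet, the `render` flag and its behavior are hypothetical; the `{{ .Session.User.Name }}` syntax is borrowed from the existing envTemplates):

```yaml
# Hypothetical sketch: spec.render is not an implemented field
apiVersion: desktops.kvdi.io/v1
kind: Template
metadata:
  name: ubuntu-xfce
spec:
  render: true   # would pass the spec through a Go template with session values
  volumes:
    - name: user-tmp
      persistentVolumeClaim:
        claimName: "{{ .Session.User.Name }}-tmp"   # would render to e.g. hananm-tmp
  desktop:
    image: ghcr.io/tinyzimmer/kvdi:ubuntu-xfce4-latest
    volumeMounts:
      - name: user-tmp
        mountPath: /tmp
```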
I'll add that the weird bug you had with the pods terminating randomly is also not lost on me. I think someone else mentioned similar behavior once, and it finally happened to me once and I can't for the life of me figure out why.
I threw down tons of logging statements to hopefully get a better idea of where the behavior gets triggered, but I have since been unable to reproduce it again.
I have upgraded to 0.3.1 (couldn't perform a helm upgrade and had to re-install, btw). I tried to mount a /tmp in the template without success. Without the /tmp addition, the pod is finally running but I can't connect to it. The proxy pod shows:
2021-02-28T05:56:36.998Z ERROR kvdi_proxy.10.234.156.157:51598 Error during request {"error": "dial unix /var/run/kvdi/display.sock: connect: no such file or directory"}
I tried to mount a /tmp in the template without success.
Can you provide manager errors and/or pod specs that get generated when this happens?
The proxy pod shows: 2021-02-28T05:56:36.998Z ERROR kvdi_proxy.10.234.156.157:51598 Error during request {"error": "dial unix /var/run/kvdi/display.sock: connect: no such file or directory"}
This usually means the VNC server isn't starting on that socket inside the container. There are a lot of things that could cause this, but for starters I'd make sure you've built off a recent desktop image. It would be helpful to see your template too, to see if any parameters might be wrong.
It might be worth adding that `/run` and `/var/run` are similar to `tmp` in that kvdi needs and manages those internally. If you are overriding those mount points it could lead to other trouble.
I am not. Looking inside /run/kvdi in the proxy container, it's empty at the moment.
Could you edit the comment and wrap the yaml in a code block (three backticks)? It's hard to digest like this.
Can you provide manager errors and/or pod specs that get generated when this happens?
I can't get any logs about the reason for the pod termination.
```yaml
kind: Template
metadata:
  name: hananm-agplenus
spec:
  desktop:
    allowRoot: true
    image: 'nexus-registry.prod.evogene.host/repository/docker-hosted/kvdi-dev-vm:latest'
    imagePullPolicy: Always
    resources: {}
    # envTemplates:
    #   USERNAME: "{{ .Session.User.Name }}"
    env:
      - name: UNITY_SITE
        value: agplenus
    volumeMounts:
      - mountPath: /cpbclouds/agplenus
        name: cpbclouds-agplenus
      - mountPath: /cloud-home
        name: cloud-home
      # - mountPath: /tmp
      #   name: cloud-tmp
    resources:
      limits:
        cpu: '4'
        memory: 32Gi
      requests:
        cpu: '4'
        memory: 16Gi
  proxy:
    allowFileTransfer: true
    resources: {}
  tags:
    applications: minimal
    desktop: xfce4
    os: ubuntu
  volumes:
    - name: cpbclouds-agplenus
      nfs:
        path: /cpbclouds/agplenus
        server: agplenus-stor.evogrid.internal
    - name: cloud-home
      persistentVolumeClaim:
        claimName: hananm-pvc-nfs
    # - name: cloud-tmp
    #   persistentVolumeClaim:
    #     claimName: hananm-tmp
```
So it won't be until later in the day that I'll be able to continue to help you debug this thoroughly, but I was not able to reproduce on my first attempt.
I'm using `local-path-provisioner` and created a PVC and template like this:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tinyzimmer-tmp
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 4Gi
---
apiVersion: desktops.kvdi.io/v1
kind: Template
metadata:
  name: ubuntu-xfce
spec:
  volumes:
    - name: tinyzimmer-tmp
      persistentVolumeClaim:
        claimName: tinyzimmer-tmp
  desktop:
    image: ghcr.io/tinyzimmer/kvdi:ubuntu-xfce4-latest
    imagePullPolicy: IfNotPresent
    allowRoot: true
    volumeMounts:
      - name: tinyzimmer-tmp
        mountPath: /tmp
  proxy:
    allowFileTransfer: true
  tags:
    os: ubuntu
    desktop: xfce4
    applications: minimal
```
And everything worked okay. `fdisk` is lying because it's a local-path, but you get the point:
The fact that `systemd` needed to write stuff there when it first started does raise a question for you to explore. Are you sure your volumes have the correct filesystem permissions? kvdi will do what it can internally, but I'm sure there are tons of other variables in that space.
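If you want a quick way to check from outside while a session pod is up, something like this should work (the pod name is a placeholder):

```sh
# -c desktop targets the desktop container; check ownership/permissions on the mounts
kubectl exec -it <desktop_pod_name> -n kvdi -c desktop -- ls -ld /tmp /home
```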
And for whatever it's worth, verifying my server version with my new fancy CLI 😛
```
[tinyzimmer@DESKTOP-5H08KBL-wsl kvdi]$ kvdictl version
Client Version: v0.3.1-5-g8c8d841
Git Commit: 8c8d8417e7c9ba1e9e15f3362f79ac92bcd802a4
Server Version: v0.3.1
Git Commit: b794da98f2dc2ede4effbbb9268579d2bfa4ba68
```
I can see files being written to the mounted /tmp, so the permissions are there. The problem is again with the proxy:
2021-02-28T06:42:17.041Z ERROR kvdi_proxy.10.234.156.157:45566 Error during request {"error": "dial unix /var/run/kvdi/display.sock: connect: no such file or directory"}
The proxy is a separate matter from the tmp volume thing. And the best way to figure out what is wrong for you (since I can't reproduce) is to exec into the `desktop` container and inspect the `display.service` to see why it isn't starting. The proxy isn't doing much other than proxying the VNC listening on whatever socket inside the desktop (they share a `/run` dir). In this case the proxy is just telling you there is no `/var/run/kvdi/display.sock`. The answer to why there isn't one lies somewhere in your desktop image.
I am confused by that last comment though, that is to say you figured out the `/tmp` volume thing?
I am confused by that last comment though, that is to say you figured out the /tmp volume thing?
It seems like the PVC missed a variable. I have re-applied the PV/PVC YAMLs and it now seems to work (I can't verify since I still don't have a working desktop) ;(
The answer to why there isn't one lies somewhere in your desktop image. My image is based on the docker image you provided and I am just installing a bunch of applications there. None of them is in any way related to VNC, but I don't know for sure.
I do know that this thing did work a few days ago and I can't seem to find the reason for all of this...
I appreciate your devoted support btw, I hope that my mess here helps to promote kvdi as well :)
At least we are making progress 😄
If it worked a few days ago, just to remove the variable, I'd maybe rebuild your image. I don't know for sure when you built it, but there was a bug (from a breaking change in `/usr/local/sbin/init` in the ubuntu image) last weekend that I patched real quick.
You could verify this without rebuilding, by seeing if the one inside your image looks like this. Specifically, that part about falling back to `VNC_SOCKET_ADDR` would look different if you are on an image that had the breaking change.
If it is in fact up to date and you are still having the problem, then yea we'd want to `exec` into a failing desktop while the UI is spinning to debug. Something like `kubectl exec -it <desktop_name> -c desktop -- bash` (the `-c desktop` is important). From there you could look at the systemd `display.service` or try starting it manually to see what the problem is.
I don't see this service at all
```
sudo service --status-all
 [ - ]  alsa-utils
 [ - ]  apparmor
 [ + ]  apport
 [ + ]  avahi-daemon
 [ - ]  bluetooth
 [ + ]  cron
 [ + ]  cups
 [ + ]  cups-browsed
 [ + ]  dbus
 [ - ]  gdm3
 [ - ]  hwclock.sh
 [ - ]  kmod
 [ - ]  lightdm
 [ + ]  network-manager
 [ - ]  nfs-common
 [ - ]  plymouth
 [ - ]  plymouth-log
 [ - ]  pppd-dns
 [ - ]  procps
 [ - ]  pulseaudio-enable-autospawn
 [ + ]  rpcbind
 [ - ]  saned
 [ - ]  udev
 [ - ]  x11-common
 [ - ]  xpra
```
I should clarify it's a user unit, so `systemctl --user ...`. In a working container:
Which also means I suppose at a root shell you'll need to `su <username>` first 😛 . Good ol UNIX sysadmining.
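Putting the steps together, roughly (the pod name and username below are just placeholders from this thread):

```sh
kubectl exec -it <desktop_pod_name> -c desktop -- bash   # the -c desktop is important
su hananm                                                # display.service is a user unit
systemctl --user status display                          # see why it isn't starting
systemctl --user restart display                         # or try starting it manually
```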
```
systemctl status --user display
● display.service - Xvnc Display
     Loaded: loaded (/etc/xdg/systemd/user/display.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2021-02-28 07:36:14 UTC; 4min 12s ago
   Main PID: 126 (Xvnc)
     CGroup: /kubepods/burstable/pod1a064545-e4dc-4fc5-987b-902b574a85fb/a1ed7003445ef667f760c80ac9dd327c3b46279fadcbdfaffdf2dfc877b53c82/user.slice/user-9000.slice/user@9000.service/display.service
             └─126 /usr/bin/Xvnc :10 -rfbunixpath /run/user/9000/vnc.sock -SecurityTypes None

Feb 28 07:36:14 hananm-agplenus-zdt4c systemd[120]: Started Xvnc Display.
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: Xvnc TigerVNC 1.10.0 - built Apr 9 2020 06:49:31
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: Copyright (C) 1999-2019 TigerVNC Team and many others (see README.rst)
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: See https://www.tigervnc.org for information on TigerVNC.
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: Underlying X server release 12008000, The X.Org Foundation
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: Sun Feb 28 07:36:14 2021
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: vncext: VNC extension running!
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: vncext: Listening for VNC connections on /run/user/9000/vnc.sock (mode 0600)
Feb 28 07:36:14 hananm-agplenus-zdt4c Xvnc[126]: vncext: created VNC server for screen 0
```
It seems like the rfbunixpath /run/user/9000/vnc.sock is not the same... due to the fact that it is running in a different user space... But the proxy looks for it elsewhere.
So this tells me you are probably built off an old base. `/run/user/9000/vnc.sock`, specifically the `vnc.sock`, is what changed (otherwise this would be working). I renamed it to `display.sock` since I am working on adding SPICE support. It was one of the things that changed in that breaking change I described. `/var/run` should be symlinked to `/run` so that part shouldn't matter.
I have a feeling you might even get around this in your template though without rebuilding (though I think that would still be easier). Set `proxy.socketAddr: unix:///var/run/kvdi/display.sock` and it could work. But it looks like the core issue here is that your desktop image isn't responding to the right environment variables.
Or the other way around is to set that value to `unix:///run/user/9000/vnc.sock`. That would probably compensate for the desktop ignoring it, but the proxy would use that value instead.
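If you want to try that workaround instead of rebuilding, the template tweak would look roughly like this (untested; it just points the proxy at the socket your stale image is actually listening on):

```yaml
spec:
  proxy:
    allowFileTransfer: true
    socketAddr: unix:///run/user/9000/vnc.sock   # where the old base image puts the VNC socket
```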
Ok, but my Dockerfile is using the line `FROM ghcr.io/tinyzimmer/kvdi:ubuntu-xfce4-latest`, and I built my image and pushed it...
But when did you do that?
Every hour or so :)
Check the environment variables on the pod, there should be a `DISPLAY_SOCK_ADDR`, defaulting to `/var/run/kvdi/display.sock`. Then we gotta make sure your desktop is doing the right thing with it. In the base image's init script, the part that configures `display.service` is this:
```sh
find /etc/default -type f -exec \
  sed -i \
    -e "s|%USER%|${USER}|g" \
    -e "s|%UNIX_SOCK%|${DISPLAY_SOCK_ADDR}|g" \
    -e "s|%USER_ID%|${UID}|g" \
    -e "s|%HOME%|${HOME}|g" {} +
```
There should be a file `/etc/default/kvdi` that is used as the environment for `display.service`. It should include whatever `DISPLAY_SOCK_ADDR` was set to on the pod. Somehow in this process your VNC server is coming up with a different value... one that used to be valid a while ago, so color me even more perplexed.
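If it helps, a couple of quick checks against a running desktop pod (the pod name is a placeholder):

```sh
# Confirm the env var the desktop should be acting on, and what ended up in the unit's env file
kubectl exec <desktop_pod_name> -c desktop -- env | grep DISPLAY_SOCK_ADDR
kubectl exec <desktop_pod_name> -c desktop -- cat /etc/default/kvdi
```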
Ok, it seems that there is a Docker bug here. The kvdi image I can find with `docker images | grep tiny` is 2 weeks old...
I am pulling the image manually now, let's see if it helps. I was under the impression that the FROM line in the Dockerfile would do this by default.
Ha you just saved me from what was about to be a mental breakdown
EDIT: Hit close by mistake, I'll leave open until we are sure that was the issue.
Ha you just saved me from what was about to be a mental breakdown
EDIT: Hit close by mistake, I'll leave open until we are sure that was the issue.
I am not sure it's the answer, but I will try that.
To my understanding, FROM will not pull if the image is present already. There is an extra flag you can pass to `docker build` to do that:
`--pull`  Always attempt to pull a newer version of the image
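For example, with your image name from above substituted in, a rebuild that forces a fresh pull of the base would be something like:

```sh
docker build --pull -t nexus-registry.prod.evogene.host/repository/docker-hosted/kvdi-dev-vm:latest .
docker push nexus-registry.prod.evogene.host/repository/docker-hosted/kvdi-dev-vm:latest
```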
Ok, the issue is now solved for a first run. The second run gave me a connection refused error, the third went fine... The /tmp is mounted. :)
Can you please supply an example of how I should mount the home directories? Another question that came up is about being able to connect to the same desktop and same session from two tabs... Is it possible? When I try to do this I get: Cannot read property 'spec' of undefined
`Cannot read property 'spec' of undefined` is definitely some latent bug in the UI, I'm sure, unrelated to the actual issue. Right now locks are taken for every display/audio stream so that there can only be one of each at any time per desktop. I can see maybe looking into making this optional, but it would be a bit involved. Those same locks are used when the API is querying session status (but that could be adapted).
For the new `userdataSelector` in the VDICluster config (so your helm values, or you can `kubectl edit vdicluster <kvdi probably>`) you'd do one of two things:
```yaml
userdataSelector:
  matchName: "${USERNAME}-pvc"  # would match PVCs in the session namespace like hananm-pvc in your case
  matchLabel: "kvdiUser"        # would match PVCs in the session namespace that have a label kvdiUser=hananm
```
But I never tested it thoroughly, so you may still find more bugs. I also might move it around a bit more in a future release.
userdataSelector: matchName: "${USERNAME}-pvc" # would match PVCs in the session namespace like hananm-pvc in your case matchLabel: "kvdiUser" # would match PVCs in the session namespace that have a label kvdiUser=hananm
Great! I will happily test that! So if my PVC is hananm-blah I need to use matchName: "${USERNAME}-blah"? No changes in the template YAML?
Correct
But just use the matchName
```
helm upgrade kvdi tinyzimmer/kvdi --version v0.3.1 -n kvdi --set vdi.spec.appNamespace=kvdi --set vdi.spec.userdataSpec.userdataSelector.matchName=\$\{USERNAME\}-nfs-home
Error: UPGRADE FAILED: error validating "": error validating data: ValidationError(VDICluster.spec.userdataSpec): unknown field "userdataSelector" in io.kvdi.app.v1.VDICluster.spec.userdataSpec
```
I have a few users, they are all using the same docker image and the same desktop template (with very small diffs), and they all did work.
2 hours ago, one user showed me that starting a session via the UI results in the "Waiting" screen. Looking at the K8S dashboard, I can see that a service is there but no pod at all. I can't tell if there are logs anywhere, but it's just strange.
Other users can use the same template but it will take a while to start, with lots of pod restarts...
Any clue what is going on?