A case study in holistic computer system support
A user emailed me directly, asking me to look at why his webapp wasn’t starting. He was having trouble even looking at the logs with journalctl.
I had built the servers for the user, so he always tries to shortcut the support process: instead of opening a ticket in the team queue, he assigns it directly to me. I emailed the user back saying, “I’ve assigned it to the team queue for you.”
Well, within 5 minutes the level 1 support guy reached out to me saying he couldn’t find anything wrong. And yes, while nothing was wrong on the OS and filesystem level, the app still wasn’t starting.
The user, in the meantime, had remembered the service account that runs the app, so he switched user to that account and read the logs. The account didn’t have permission on the parent directories on the nfs mount point.
I set aside the tasks I had scheduled for this afternoon, and took a look. Aside from some setgid on the parent directory without g+rx, nothing looked amiss. The directory was owned by root.root which didn’t seem different from the other contents on the filesystem.
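For this kind of check, it helps to see the owner and mode of every directory from the target up to the root in one listing. Here is a minimal sketch of how that could be done; the function name show_path_perms is my own invention for illustration, and it assumes an absolute path. On systems with util-linux, namei -l gives a similar view.

```shell
#!/bin/sh
# Hypothetical helper: print owner, group, and mode for a path and each
# of its parents, deepest first. Expects an absolute path.
show_path_perms() {
    p=$1
    while :; do
        ls -ld -- "$p"
        # Stop at the root (or at "." if a relative path was passed by mistake).
        case $p in
            /|.) break ;;
        esac
        p=$(dirname "$p")
    done
}

# Example, using the path from the story:
# show_path_perms /mnt/appdir/subdir1
```

Running it as the service account rather than as root also shows whether that account can traverse each level at all.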
My manager approved just fixing the problem now and sorting out what caused it later. So the user and I ran chown -R appuser.appuser /mnt/appdir/subdir1 and then the application could get started again.
I checked the sudo history on the two application servers to see if any users ran sudo chown or sudo chmod, even though sudo was locked down enough to prevent that. No sudo activity looked related, so I then turned to the filesystem itself.
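A search like this can be reduced to a small filter over the sudo log lines. The sketch below is only an illustration, not the exact commands I ran: the function name filter_sudo_perm_changes is hypothetical, and where the sudo entries live depends on the distribution (the journal, /var/log/secure, or /var/log/auth.log are common places).

```shell
#!/bin/sh
# Hypothetical helper: read sudo log lines on stdin and keep only those
# whose COMMAND= field mentions chown, chmod, or chgrp.
filter_sudo_perm_changes() {
    grep -E 'COMMAND=.*(chown|chmod|chgrp)'
}

# Examples (log sources are assumptions; adjust for the system at hand):
# journalctl _COMM=sudo --since "7 days ago" | filter_sudo_perm_changes
# grep 'sudo' /var/log/secure | filter_sudo_perm_changes
```

An empty result only rules out sudo on that host; it says nothing about other machines that mount the same export.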
It is nfs, so it could be affected by other systems. After using mount | grep /mnt/appdir and showmount -e $SERVERNAME | grep $EXPORTNAME I had a list of the IP addresses that have access to this export from the nfs server.
I did a reverse dns lookup to get the hostnames, because those are easier to recognize than a whole series of IP addresses. Immediately I recognized the systems as belonging to the other container proof of concept.
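The export-to-hostname step can be sketched as two small functions. The names export_clients and resolve_clients are hypothetical, and the sketch assumes the client list holds plain IP addresses; exports can also list hostnames, networks, or wildcards, which a reverse lookup will not resolve.

```shell
#!/bin/sh
# Hypothetical helper: read `showmount -e SERVER` output on stdin and
# print one client per line for the export path given as $1.
export_clients() {
    awk -v want="$1" '$1 == want {
        n = split($2, c, ",")
        for (i = 1; i <= n; i++) print c[i]
    }'
}

# Hypothetical helper: read IP addresses on stdin and print "IP<TAB>name",
# using getent for the reverse lookup; prints "unresolved" when there is none.
resolve_clients() {
    while read -r ip; do
        name=$(getent hosts "$ip" | awk '{print $2; exit}')
        printf '%s\t%s\n' "$ip" "${name:-unresolved}"
    done
}

# Example, using the variables from the story:
# showmount -e "$SERVERNAME" | export_clients "$EXPORTNAME" | resolve_clients
```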
Before I even started checking bash history on those hosts, I reached out to the container research guy and told him that I thought a reckless chown on an nfs mount point had affected the app running on the original servers. He said yes, it was him, and he “was trying to set the permissions on the logs directory, but the container mounted the entire nfs.”
I thanked him for being honest, and reminded him that chown -R is serious business. He claimed that this is a further reason for having his own nfs mount point. Additionally, kubernetes has a known security flaw where an entire mount is available to a container even when only a path within it is mounted.
This second user made an additional rookie mistake last week when he changed network settings on a system over a network connection before confirming he had console access. Of course, it went offline and he had to get assistance fixing it.
Be very, very careful when you execute a chown -R! And you probably shouldn’t be doing that inside a container with any bind mounts…
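One way to be careful is to look before you leap. The sketch below is my own illustration, not a standard tool: a hypothetical preview_chown function that reports which filesystem the target lives on, warns if the target is itself a mount point, and lists the entries whose owner would actually change, without changing anything. It assumes the intended owner is an existing user name or a numeric uid.

```shell
#!/bin/sh
# Hypothetical helper: preview the effect of `chown -R OWNER DIR`.
# $1 = target directory, $2 = intended owner (user name or numeric uid).
preview_chown() {
    dir=$1
    owner=$2
    # Warn when the target is itself a mount point (mountpoint is from util-linux).
    if command -v mountpoint >/dev/null 2>&1 && mountpoint -q "$dir"; then
        echo "WARNING: $dir is a mount point" >&2
    fi
    # Show which filesystem backs the target, e.g. an nfs export.
    df -P "$dir" | awk 'NR == 2 { print "filesystem: " $1 }' >&2
    # List entries not already owned by the intended owner,
    # without descending into other filesystems.
    find "$dir" -xdev ! -user "$owner" -print
}

# Example, using the names from the story:
# preview_chown /mnt/appdir/subdir1 appuser | less
```

If the filesystem line shows a shared export, or the list is much longer than expected, that is the moment to stop and ask who else mounts it.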