Quoting Elizabeth K. Joseph (lyz@princessleia.com):
In case anyone was curious as to what happened with this, I finally had some time to sit down on site this evening and do some debugging.
Nice detective work.
Some background as to how the guest logins work in Lubuntu: A guest-XXXXX (random characters) user is created upon login, which is used throughout the session. It is then deleted when the user logs out.
After some red herrings in the auth logs (mostly PAM errors around KDE and Gnome keyrings), I did some digging in the lightdm logs. Eventually I noticed the UID of the guest account trying to be created was the same every time a login attempt was made: 999. Odd. So I looked in /etc/passwd and noticed that there were hundreds of guest-XXXXX accounts. That's no good!
Turns out, at some point the /etc/subgid.lock file got stuck in an existing state (wasn't deleted when the lock concluded), which meant the command to delete the user was not completing successfully upon logout. Users were piling up and never being deleted. Once the UIDs hit 999 it was failing to create new guest users, so the login would fail. A quick mv (rm didn't work) of the subgid.lock file and a script to delete all the guest accounts got us going again.
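For the archives, here is a rough sketch of how that pile-up could be spotted and cleaned up. It is not the script used on site; the function names are made up, and the only thing taken from the description above is the "guest-" prefix. It prints userdel commands as a dry run rather than deleting anything.

```shell
#!/bin/sh
# Sketch: list guest-XXXXX accounts from a passwd-format file.
# The "guest-" prefix comes from the description above; the function
# names are invented for this example.

# Print "name uid" for every account whose login name starts with "guest-".
list_guest_accounts() {
    passwd_file="${1:-/etc/passwd}"
    awk -F: '$1 ~ /^guest-/ { print $1, $3 }' "$passwd_file"
}

# Dry run: print one userdel command per leftover guest account
# instead of executing anything. Review the output before running it.
print_guest_removals() {
    list_guest_accounts "${1:-/etc/passwd}" | while read -r name _uid; do
        printf 'userdel -r %s\n' "$name"
    done
}
```

Something like `list_guest_accounts | wc -l` gives the count, and `print_guest_removals` gives a list to review. Since the deletions were failing because of the stale lock in the first place, the lock file presumably has to be out of the way before any of the printed commands will succeed.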
Next time you encounter that situation, I'd be curious what
rm -fv /etc/subgid.lock
reports. The '-f' is for force, which honestly won't help here, because all it does, IIRC, is suppress confirmation prompts and omit error reporting if the target doesn't exist -- I think. The '-v' is probably more useful: verbose reporting of what rm encounters when it tries to take the requested action.
GNU rm is (again, IIRC) just a wrapper around the unlink(2) syscall, which removes the specified name (hardlink) for a file, socket, FIFO, or device node, or in the case of a symlink, removes the symlink itself. So, basically it's about the same as the unlink(1) command except a bit more featureful.
Ordinarily, I would expect 'rm' (or unlink) to fail only because there's a read-only mount status in the way (obviously not the case, here), or hardware-level blocking (obviously not the case, here), or the immutable flag having been set on the file or its directory (highly unlikely in this case), or ownership / rights issues on the containing directory, since the directory is what unlink actually modifies. But I'm not going to hazard a guess, except, gremlins? ;-> I'm intrigued, anyway.
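If it happens again, something along these lines would rule the usual suspects in or out before touching anything. It's only a sketch: diagnose_unlink is a name I made up, it assumes lsattr (e2fsprogs) and findmnt (util-linux) are installed and skips them if not, and it reads state without modifying the target.

```shell
#!/bin/sh
# Sketch: read-only checks for the usual reasons unlink(2) fails.
# diagnose_unlink is an invented name; it never modifies the target.
diagnose_unlink() {
    target="$1"
    parent=$(dirname "$target")

    echo "== permissions (unlink needs write + search on the directory) =="
    ls -ld "$parent" "$target" 2>&1 || true

    echo "== attribute flags ('i' = immutable, 'a' = append-only) =="
    if command -v lsattr >/dev/null 2>&1; then
        lsattr -d "$parent" "$target" 2>&1 || true
    else
        echo "lsattr not installed"
    fi

    echo "== mount options (look for 'ro') =="
    if command -v findmnt >/dev/null 2>&1; then
        findmnt -n -o TARGET,OPTIONS -T "$parent" 2>&1 || true
    else
        echo "findmnt not installed"
    fi
    return 0
}
```

Run as root, e.g. `diagnose_unlink /etc/subgid.lock`, and compare the output against what `rm -fv` complains about.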
As you suggest, the real long-term fix is a bug report on someone's buggy code in useradd or in something calling adduser. I gather that the latter is a known problem: https://askubuntu.com/questions/459080/useradd-cannot-lock-etc-subuid-try-ag...
(Note someone's suggestion, in the cited case, that something might be running multiple instances of useradd simultaneously.)
However, that sort of contention over /etc/subgid.lock ought to show up in fuser / lsof, which you say doesn't check out -- so I'm back to being intrigued.
I'm considering my options to get us out of this reoccurring issue in the future. I'm thinking of just a cron job on each machine that checks for a subgid.lock file sticking around for more than a couple days and moving it out of the way, but I'll sleep on it. More clever suggestions welcome ;)
Well, not really. If the unlink syscall (basis of /bin/rm) isn't working, then I don't know of a different way of making the file completely go away. You might think of mv'ing it to a small filesystem (like a ramfs) and then blowing away and re-creating the filesystem -- but unfortunately /bin/mv uses rename() only when moving/renaming the file within the same filesystem. For a cross-filesystem mv, rename() fails with EXDEV, so mv instead copies the file to the destination and then does an unlink() on the original, which puts you right back at the unlink problem.
Not that you don't know this already, but a more-satisfactory solution would be to figure out what's bugging /bin/rm.
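That said, if you do go the cron route as a stopgap, a minimal sketch might look like the following. The function name, the ".stale" suffix, and the two-day default are my assumptions (the last one loosely based on your "couple days"). It renames within the same directory, so it relies on rename() continuing to work where unlink() didn't, as your mv did. It also does nothing to check whether a live process still holds the lock.

```shell
#!/bin/sh
# Sketch: move a lock file aside if it has been sitting around too long.
# move_stale_lock, the ".stale" suffix, and the default age are assumptions.
move_stale_lock() {
    lock="${1:-/etc/subgid.lock}"
    max_days="${2:-2}"

    # Nothing to do if the lock file is absent.
    [ -e "$lock" ] || return 0

    # find -mtime +N matches files last modified more than N whole days ago.
    if [ -n "$(find "$lock" -maxdepth 0 -mtime +"$max_days" 2>/dev/null)" ]; then
        # Same-directory rename, so mv uses rename(2) rather than copy + unlink.
        mv -- "$lock" "$lock.stale.$(date +%Y%m%d%H%M%S)"
    fi
}
```

Called daily from cron as root with no arguments, it would leave a fresh lock alone and rename one older than two days, keeping the renamed file around as evidence for the eventual bug report.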