Here’s a quick one.
One of our servers recently had a problem: its disk usage kept growing to illogical, unmanageable proportions. Each time, this would trigger an OpsGenie alert and wake me up, so in a groggy state I’d delete some extraneous log files (we log a lot), promise myself to look into it tomorrow, and go back to bed.
I don’t have the best of memories at the best of times. This happened a few times before I looked into the problem.
At first I thought we were simply logging “too much”. We do need pretty extensive logging, so my proposed fix was going to be increasing the server’s disk space. Then I saw a massive difference between the output of
sudo du -h --total --max-depth=1 / and
df -h --total /: a 14GB difference on a 25GB server. du adds up the files it can reach through the directory tree, while df reports what the filesystem itself counts as used, so the two should roughly agree.
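To check another machine for the same kind of gap, a rough sketch like the one below compares the two numbers directly. The function name disk_gap_kib is made up for illustration. It assumes a du that supports -x, which keeps the total on one filesystem so that other mounts don't skew the result.

```shell
#!/bin/sh
# Print how many KiB df counts as used on a filesystem beyond what du can find.
# Usage: disk_gap_kib PATH   (hypothetical helper, not from the original post)
disk_gap_kib() {
    target="$1"
    # df -Pk: POSIX output format in 1024-byte blocks; column 3 is "Used".
    df_used=$(df -Pk "$target" | awk 'NR == 2 { print $3 }')
    # du -skx: one summary total in KiB, without crossing onto other filesystems.
    du_used=$(du -skx "$target" 2>/dev/null | awk '{ print $1 }')
    echo $((df_used - du_used))
}
```

Run as root against a mount point, for example disk_gap_kib /, so that du can read every directory. A small gap is normal because of filesystem metadata and reserved blocks; a gap of several gigabytes is worth a closer look.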
A friend suggested that there might be processes holding onto deleted files.
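sudo lsof -nP +L1 confirmed this: 9 weeks of deleted log files! (+L1 lists open files with a link count below 1, in other words files that have been unlinked but are still held open.) Checking the PIDs, it turned out that Sidekiq was holding onto these deleted files like they were its little zombie children. This was most likely caused by our daily logrotate task: when a rotated log is deleted while a running process still has it open, the disk space is not released until that process closes the file or exits.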
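To see which process is responsible for most of the space, the lsof output can be totalled per PID. This is only a sketch: sum_deleted_by_pid is a hypothetical name, and it assumes the common column layout in which the size is the seventh field. Some lsof versions add or leave out columns, so check the header line first.

```shell
#!/bin/sh
# Read lsof output on stdin and print "COMMAND PID BYTES", largest total first.
# Assumes the size is in column 7, the usual layout for lsof +L1 output.
sum_deleted_by_pid() {
    awk 'NR > 1 && $7 ~ /^[0-9]+$/ { bytes[$1 " " $2] += $7 }
         END { for (k in bytes) printf "%s %.0f\n", k, bytes[k] }' |
        sort -k3,3nr
}
```

A possible invocation is sudo lsof -nP -s +L1 | sum_deleted_by_pid, where -s asks lsof to show file sizes rather than offsets. Rows whose seventh field is not a plain number are skipped.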
Restarting Sidekiq released the files and fixed the issue, and my proposed solution going forward is likely to be a regular cron job that restarts Sidekiq.
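A minimal sketch of what that cron job could call is below. It rests on assumptions that are not from our actual setup: that Sidekiq runs as a systemd unit, and that the unit is named sidekiq. The DRY_RUN switch is there only so the command can be checked before it is trusted with a real restart.

```shell
#!/bin/sh
# Assumption: Sidekiq is managed by systemd; "sidekiq" is a placeholder unit name.
SIDEKIQ_UNIT="${SIDEKIQ_UNIT:-sidekiq}"

# Restart the unit, or with DRY_RUN=1 just print the command that would run.
restart_sidekiq() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "systemctl restart $SIDEKIQ_UNIT"
    else
        systemctl restart "$SIDEKIQ_UNIT"
    fi
}
```

Saved in a script that ends with a call to restart_sidekiq, a root crontab entry such as 0 4 * * * /usr/local/bin/restart-sidekiq.sh (a hypothetical path and schedule) would run it daily at 04:00.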