Part 1 of this saga can be found here.
Good monitoring doesn’t come ready to use out of the box. It takes a lot of work: deciding what to monitor, what thresholds to set, and when and how to alert. Alert Fatigue is a thing, and the last thing you want is to miss an important alert because you silenced your phone to actually get a full night’s sleep. But monitoring and alerting, if configured in a sensible way, can save hours of downtime, user complaints and the IT practitioner’s sanity.
I had been searching for monitoring utopia for more than a decade. I tried Zenoss, Nagios and Zabbix on the FOSS side, and briefly toyed with PRTG and others before being turned down by management due to cost. In every case, it took a lot of work to set up. Between agent installation and updates, SNMP credential management, SNMPWALKing to find the right OIDs to monitor, and figuring out what and when to alert, monitoring in a mixed environment takes a lot of TLC to get right.
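The OID hunt usually starts with a walk of the device's MIB tree. A minimal sketch, assuming SNMPv2c; the community string and address below are placeholders for your own device:

```shell
# Walk the standard MIB-2 "system" subtree to see what the device exposes
snmpwalk -v2c -c public 192.0.2.10 .1.3.6.1.2.1.1

# Once you've spotted a useful value, fetch it directly by OID --
# sysUpTime here, one of the standard MIB-2 objects
snmpget -v2c -c public 192.0.2.10 .1.3.6.1.2.1.1.3.0
```

Walking the full tree on a busy switch can take a while; starting from a known subtree keeps the output manageable.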
In my mini datacenter build, I had the opportunity to build my monitoring system the way I wanted. I could have made a business case for commercial software, but this build and move had already taken a significant chunk of money (remember that the entire warehousing and shipping/receiving operation had also been moved and expanded). And besides, having tested several of the commercial products, I knew that the initial setup probably would have been faster, but the ongoing upkeep, tweaking thresholds, adding and removing sensors, and configuring monitoring, was going to be substantial regardless of the system I chose.
With that in mind, I chose Zabbix. I had used Nagios in the past, but I preferred a GUI-based configuration in this case. In an environment where I was deploying a lot of similar VMs or containers, a config file-based product like Nagios would probably make more sense. Having chosen Zabbix, the question was where to install it. At the time, Zabbix didn’t have a cloud-based product, and installing it within the environment didn’t make sense, as various combinations of equipment and Internet failure could make it impossible to receive alerts.
After looking at AWS and Azure, I went with simple and cheap: DigitalOcean. Their standard $5/month CentOS droplet had 512MB of RAM, one vCPU, 25GB of disk space, a static IP and a terabyte of transfer a month. I opted for the extra $1/month for snapshotting/backups, which would make patching less risky.
The first step was to set up and lock down communications. I went with a CentOS VM at each of the company’s two locations and installed the Zabbix proxy. The Zabbix proxies were configured for active mode, meaning that they communicated outbound to the Zabbix server at DigitalOcean. pfSense was configured to allow the outbound traffic, and DigitalOcean’s web firewall restricted the Zabbix server from receiving unsolicited traffic from any IPs other than the two office locations. Along with the built-in CentOS firewall, it was guaranteed to keep the bad guys out. I also secured the Zabbix web interface with Let’s Encrypt, because why wouldn’t I?
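The active-proxy arrangement boils down to a few lines of proxy configuration. A minimal sketch, with a placeholder server address and proxy name (the real file has many more tuning options):

```shell
# /etc/zabbix/zabbix_proxy.conf -- minimal active-mode sketch.
# Hostname must match the proxy name registered in the Zabbix frontend;
# the server IP below stands in for the DigitalOcean droplet.
ProxyMode=0            # 0 = active: the proxy initiates the connection
Server=203.0.113.10    # the Zabbix server droplet
Hostname=office1-proxy
```

Because the proxy dials out, only outbound port 10051 needs to be allowed through pfSense; nothing inbound is required at the office side.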
Next up was configuring monitoring. Zabbix comes with a Windows monitoring template. After installing the agent on all of the Windows hosts and VMs, I configured monitoring by template. I found it best to duplicate the base template and use the duplicate for each specific function. For instance, one copy of the Windows template was used to monitor the DNS/DHCP/AD servers. In addition to the disk space, CPU usage and other normal monitoring, it would monitor whether the DHCP and DNS services were running. Another copy of the template was tweaked for the VM hosts, with, for instance, saner disk space checks. Linux monitoring, including of the pfSense boxes, was configured similarly. Ping checks of the external IPs were done from the Zabbix host itself.
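The service checks are ordinary agent items added to the duplicated template. A sketch of what they might look like; the Windows service short names (DHCPServer, DNS) are the usual ones on a domain controller, but verify them with `sc query` on the host itself:

```shell
# Agent item keys on the duplicated Windows template
# (service.info is the Zabbix agent key for Windows service state;
#  it returns 0 when the service is running)
service.info[DHCPServer,state]
service.info[DNS,state]

# A matching trigger (Zabbix 5.4+ expression syntax; older versions
# use the {host:key.func()} form): fire when DNS isn't running
#   last(/Windows DC template/service.info[DNS,state])<>0
```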
Environmental monitoring was also important due to the small closet size and the building's lack of generator support. SNMP came to the rescue here. Fortunately, I had two Tripp Lite UPSes and one APC UPS in that little server closet. Using templates I found online, I was able to monitor battery charge level and temperature, remaining battery time, power status, humidity, and battery/self-test failures. The Tripp Lite units had an option for an external temperature/humidity sensor, and I was able to find the remote sensors on eBay for less than $30 each. I mounted one in front of the equipment in the rack to measure intake air temperature, and the other dangling in the airflow of the A/C to measure its output temperature. That way, I would be alerted if the A/C unit failed, and I could also monitor its cycle time and compare how cold the output air was against historical data.
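The kind of values those templates poll can be spot-checked by hand first. A sketch against the standard UPS-MIB (RFC 1628); not every vendor implements it (Tripp Lite and APC also ship their own enterprise MIBs), so treat these OIDs as a starting point and snmpwalk the unit to confirm what it actually exposes. The address and community string are placeholders:

```shell
# upsEstimatedMinutesRemaining -- runtime left on battery
snmpget -v2c -c public 192.0.2.20 .1.3.6.1.2.1.33.1.2.3.0

# upsBatteryTemperature -- in degrees Celsius per the MIB
snmpget -v2c -c public 192.0.2.20 .1.3.6.1.2.1.33.1.2.7.0
```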
The primary infrastructure to monitor was the Hyper-V cluster. Fortunately, I found a Hyper-V and CSV template online that I was able to tweak to work. Not only did it monitor the CSVs for free disk space and connectivity, it could alert if either of the hosts had zero VMs in active status–an indication that the host had rebooted or otherwise lost its VMs. A Dell template monitored server systems and could report a disk or fan failure.
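The zero-running-VMs check is the kind of thing a custom agent item handles well. A sketch of how it could be done with a UserParameter on each Hyper-V host; the key name hyperv.vms.running is made up here, though the template I used did something equivalent:

```shell
# zabbix_agentd.conf fragment on each Hyper-V host: count VMs in the
# Running state via PowerShell, so a trigger can fire on a zero result
UserParameter=hyperv.vms.running,powershell -NoProfile -Command "(Get-VM | Where-Object {$_.State -eq 'Running'}).Count"
```

A host that suddenly reports 0 here has either rebooted or dropped its VMs, which is exactly the condition worth waking up for.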
No monitoring system is complete without graphs and alerts, so I built a custom “screen”, which is Zabbix terminology for a status dashboard. I created a 1080×1920 portrait screen with battery and temperature graphs, a regularly updated picture of the interior of the room provided by an old Axis camera, and a list of active alerts and warnings. I mounted a 1080p monitor in portrait orientation in my office and used an old PC to display the screen.
Finally, I tackled the issue of Alert Fatigue. At a previous employer that had a 24-hour phone staff, I would receive eight to ten phone calls in the middle of the night, and of those calls, maybe only one a month would actually need my attention that night. I vowed to tweak all of the unnecessary alerts out of my new monitoring system.
I used a generic IT Google Apps email account on the Zabbix server to send the alerts to my email. I then set my Gmail to parse the alert and, if the alert status was “Disaster”, forward it as a text message to my phone to wake me up. I then went through all of my alerts and determined whether they were critical enough to get a “disaster” rating. A VM that had just rebooted wasn’t critical. The Windows service that recorded surveillance on the NVR not running was critical. The power going out wasn’t critical. The remaining battery dropping below four hours, indicating that the power had been out for over an hour, was critical. By setting up my monitoring this way, I would still be awakened for actual issues that needed immediate correction, but less critical issues could be handled during waking hours. I also tiered my issues to indicate when a problem was progressing. For instance, I would get an email alert if the room temperature reached 75 degrees, a second at 80 degrees, and a phone alert at 85 degrees. That would give me plenty of time to drive in or remote in and power down equipment.
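The tiering maps naturally onto Zabbix trigger severities: three triggers on the same temperature item, each at its own threshold. A sketch, where the host name and item key (env.temp) are placeholders and the expression syntax shown is Zabbix 5.4+ (older versions use the {host:key.func()} form):

```shell
# One temperature item, three triggers at escalating severities
#   Warning  (email only):     last(/ServerRoom/env.temp)>75
#   High     (second email):   last(/ServerRoom/env.temp)>80
#   Disaster (text to phone):  last(/ServerRoom/env.temp)>85
```

The Disaster-severity trigger is the only one the Gmail filter forwards as a text, so the lower tiers stay in the inbox until morning.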
Many alerts I disabled completely. The Windows default template alerts if CPU usage reaches something like 70%. Do I care if my VMs reach 70% CPU? If I don’t care at all, I can turn the alert off completely. If I’m really concerned about some runaway process eating CPU for no reason, I can tweak that setting so that I’m not alerted until the CPU exceeds 95% for 15 minutes continuously. At any rate, that’s not going to be flagged as a “disaster.”
The Zabbix droplet worked great. There was no noticeable downtime in the two years I ran it. I was able to run about 2,000 sensors on that droplet with overhead to spare, even with one vCPU and 512MB of RAM. (DigitalOcean has since increased that base droplet to 1GB of RAM.) I probably would have replaced my Gmail-to-text kludge with PagerDuty if I had known better, as it can follow up alerts with automated phone calls. At any rate, I slept much better knowing my environment was well-monitored.
Next time: Lessons learned, or “What would I do differently today?”