VMworld 2019: Day 0

I don’t have to do a lot of business travel, so when I do, I try to make the most of it by finding something local to do. Whether it’s touring the city, going to a museum or just sampling the local cuisine, I do my best to get in some “me time.”

And so I found myself booking a flight to my second VMworld in San Francisco. I wasn’t interested in seeing the usual sights–I lived there for over two decades and spent plenty of time in the Bay Area at the handlebars of a Honda Helix.

One thing I hadn’t seen, however, was the Computer History Museum in Mountain View. I typically arrive at these events the day before so I can beat the crowds, check in to my hotel and get a good night’s sleep and a shower before the big event. In this case, I flew in to San Jose, caught a quick Lyft and bought my ticket. The CHM offers a free bag check, so I dropped off my big duffel and began wandering.

This museum is massive, and I got there just as it was opening. The walking path winds through the exhibits in chronological order, starting with abacuses and slide rules. Following the path leads to mechanical and electromechanical calculators typically used in finance and engineering, as well as mechanical cash tills.

Charles Babbage’s Difference Engine (3/4 scale replica)

Next up are mechanical tabulators and calculators that use punched cards. On display are a multitude of machines that sort, count and duplicate punched cards, along with several early analog computers. Finally, the first electronic computers come into view. While most artifacts are pieces of early computers or peripherals (including a massive UNIVAC control console), the jewel of this part of the exhibit is the JOHNNIAC, a very early vacuum tube computer built in the early 1950s by the RAND Corporation for its own internal use. Besides being one of the longest-lived early computers (with over 50,000 hours on its hour meter), it was upgraded several times during its lifetime, gaining core memory and additional instructions.

RAND Corporation’s JOHNNIAC, complete in its art deco enclosure.
The massive control console for the UNIVAC I, circa 1951.
UNIVAC I Supervisory Control Printer, circa 1951. Essentially a kegerator’s worth of parts was needed to run the UNIVAC’s output through a Remington typewriter mounted on top.

In the Memory and Storage exhibit, the entire history of storage technology is on display, from vacuum tubes and drum storage to hard drives and flash. Of interest is the progression of hard drives from washing-machine sized peripherals weighing hundreds of pounds to CompactFlash-sized microdrives. Other artifacts on exhibit here include game cartridges, Zip disks, optical disks and floppies.

A massive IBM RAMAC 350 drive, circa 1956. The first production hard drive, it holds 5 million characters (3.5MB). Human for scale.

The Supercomputers exhibit contains something I didn’t think I would ever get to see: a Cray 1 supercomputer. The familiar ‘sixties piece of furniture’ look is there, but the unit was a lot smaller than I thought it would be. Several other supercomputers, including a piece of a Cray 2, are on display here.

The Cray 1, with one ‘seat cushion’ and inner panel removed to see the guts.

Past the supercomputers, a multitude of minicomputers are on display. Among them are the iconic PDP-8 and PDP-11 systems from DEC, as well as the DEC VAX and other systems.

The IBM System/32 midrange business computer. It consisted of the computer, a tiny screen, keyboard and printer built into a desk.

I was running a bit late at this point, so I sped through the rest of the exhibits. The personal computer section was more my era, so aside from the recurring waves of nostalgia, I had seen most of it before. The usual suspects were present (Commodore PET, VIC-20 and 64, Atari 800, early Apples and Macintoshes, etc.), as well as an original PC clone made by PC’s Limited, the little company Michael Dell founded and later renamed after himself. There were also dozens of artifacts I remember from my childhood: the Speak & Spell, the Heathkit HERO robot, the Tomy robot, the ColecoVision and more. It’s a must-see for any Gen Xer.

The Milton Bradley Big Trak, a programmable toy vehicle from the eighties.

The exhibit concludes with the dot-com bust of 2000 and perhaps the poster child of the dot-com bubble: the Pets.com sock puppet. Many thought of Pets.com as the prime example of the hubris of dot-com executives who believed they could profitably ship cat litter and 50-pound bags of dog food to customers anywhere in the US for free.

Pets.com is dead; long live Pets.com.

Now to the best bit. CHM has two fully functioning IBM 1401 computers, and every Wednesday and Saturday, docents, all of whom worked on these systems back in the day, fire them up and run demonstrations. On this visit, one of the docents was Paul Laughton, best known as the author of Apple DOS, Atari DOS and Atari BASIC. (Paul’s wife wrote the Assembler/Editor for Atari home computers.) As a lifelong Atari fan, I was a little tongue-tied, but we talked quite a bit about the past. He then let visitors type their names onto punch cards at a punching station and fed the cards into the 1401 for a souvenir printout. I did get a picture of their PDP-1, though there were no demos that day. (The PDP-1 is turned on one Saturday a month, and visitors can play Spacewar! against each other.)

One of the two functioning IBM 1401 computers at the CHM.
Paul Laughton, retired software engineer, giving us a demonstration of the punch station.
A punching station. Computer operators typed machine code and data into the punch machine, which punched it onto cards to be fed into the 1401. CHM has three of these, arranged as they typically would have been in production.
CHM’s DEC PDP-1 minicomputer. This one is in full operating condition and once a month, visitors can battle each other in Spacewar!

There’s an entire additional wing of the museum dedicated to programming languages and software culture, but I bade my farewell and grabbed a Lyft into San Francisco to get ready for Day 1 of VMworld. It was an item checked off my bucket list and a lot of fun.

Exporting Data from Bugzilla to a Local Folder

I was recently tasked with decommissioning an old Bugzilla instance. The company had long since moved to Jira’s SaaS offering, and the old Bugzilla VM had been left to languish in read-only mode. With the installation now two full versions behind, the decision was made to export the Bugzilla data and import it into Jira rather than continue to update and support the aging software.

Unfortunately, there are not a lot of great options for moving data into an active Jira Cloud instance. It’s trivial to connect a local Jira instance to Bugzilla and import all of its data, but that importer is not available in Jira Cloud. That left us with three ways of getting Bugzilla data into Jira Cloud:

  1. Export the Bugzilla data as CSV and import it into Jira Cloud.  Attachments cannot be imported; they must be placed on a separate web server and the URLs inserted into the CSV before import into Jira.  In our testing, notes were not coming across either, but given the unpalatability of running a separate web server just for attachments, we didn’t pursue this option further.
  2. Export the existing Jira Cloud instance and dump it into a local Jira instance.  Use the Jira import tool to merge in the Bugzilla data, and then export the local Jira instance and re-import it back into a Jira Cloud instance.  This would involve a substantial amount of work and downtime on a very busy Jira Cloud instance, and would involve a bit more risk than we were willing to take on.
  3. Export Bugzilla into a clean local Jira instance, and optionally then export it to a separate Cloud instance.  This would involve paying for an additional Jira Cloud instance, and it would be an extra system to manage and secure.

Because this is legacy data (regular use of Bugzilla ended several years ago), I was given the green light to export the roughly 30,500 bugs to searchable PDF files and store them on a shared local folder.  We would then be able to decommission the Bugzilla VM entirely.

Pressing CTRL-P 30,000 times

Bugzilla has a print preview mode for tickets.  It includes all of the ticket’s data, including the header fields and the date-stamped notes.  Fortunately, it turns out that Bugzilla, being an older “Web 2.0” product, will pull up any ticket by number in print preview mode straight from the URL.  The format is:

https://bugzilla.example.com/show_bug.cgi?format=multiple&id=2000

where ‘2000’ is the bug number. With that in mind, it was time to generate some PDFs.

After experimenting with Google Chrome’s ‘headless’ mode, I was able to write a quick batch script to iterate through the tickets and ‘print’ them to PDF. Here’s the code:

for /l %%i in (1001,1,2000) do (
  mkdir c:\bugs\bug%%i
  "c:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --headless --disable-gpu --user-data-dir="c:\users\tim\appdata\local\google\chrome\user data" --ignore-certificate-errors --disable-print-preview --print-to-pdf="c:\bugs\bug%%i\bug%%i.pdf" "https://bugzilla.example.com/show_bug.cgi?format=multiple&id=%%i"
)

This code is a bit messy since the verbosity of the third line makes it very long. The ‘for’ in the first line specifies the range of bug numbers to loop through. In this case, I’m printing bugs 1001 through 2000. The second line creates a folder for the bug. I put each bug in its own folder so that the attachments can be put in the same folder as their corresponding PDF.

The third line calls Google Chrome in ‘headless’ mode. I used the ‘--disable-gpu’ and ‘--disable-print-preview’ options to fix issues I was having running this in a Server 2016 VM. The ‘%%i’ is the bug number and is passed in the URL to bring up the bug in Bugzilla. Note that I used ‘--ignore-certificate-errors’ because the cert had expired on the Bugzilla instance (omit it if you’re not using TLS). Possibly due to the version of Bugzilla we were running, headless Chrome would lose its cookie after a few connections. I resolved this by temporarily turning on anonymous access in Bugzilla so the bugs could be viewed without having to log in. (This was an internal system, so there was no risk in having it open for a couple of hours.)

While I could easily just plug in ‘1’ and ‘30500’ into the ‘for’ loop and let it run for a few days, this batch file was taxing only one core on my local system and barely registering at all on the Bugzilla host. Since I had eight cores available, I duplicated the batch file and ran ten copies simultaneously, each pulling down 3,000 bugs. This allowed the local system to run at full load and convert more than 30,000 bugs to PDF overnight.
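If you’d rather not open ten command prompts by hand, a tiny launcher batch file can kick them all off at once. This is just a sketch: the per-range file names below are made up, and each one is simply a copy of the script above with a different range in its ‘for’ statement.

rem Hypothetical launcher: start each range-specific copy of the export script
rem in its own window so the ranges run in parallel.
start "bugs 1-3050"     cmd /c export_0001_3050.bat
start "bugs 3051-6100"  cmd /c export_3051_6100.bat
start "bugs 6101-9150"  cmd /c export_6101_9150.bat
rem ...and so on for the remaining ranges.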

Attach, mate.

Our version of Bugzilla stores its attachments in a database table, so they had to be extracted separately. Fortunately, Stack Overflow came to the rescue. The SQL script below parses the attachments table and generates another SQL file that extracts the attachments one at a time.

use bugs;
select concat('SELECT ad.thedata into DUMPFILE  \'/bugs/bug'
, a.bug_id
, '/bug'
, a.bug_id 
, '___'
, ad.id
, '___'
, replace(a.filename,'\'','')
, '\'  FROM bugs.attachments a, bugs.attach_data ad where ad.id = a.attach_id'
, ' and ad.id = '
, ad.id
,';') into outfile '/bugs/attachments.sql'
from bugs.attachments a, bugs.attach_data ad where ad.id = a.attach_id;

I created a temporary directory at /bugs and made sure MariaDB had permissions to write to the directory. I saved this as ‘parse.sql’ in that directory and then ran it with:

mysql -uroot -ppassword < /bugs/parse.sql
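
Each line that ends up in /bugs/attachments.sql should come out looking roughly like the following (the bug ID, attachment ID and filename here are made up):

SELECT ad.thedata into DUMPFILE '/bugs/bug1234/bug1234___5678___screenshot.png' FROM bugs.attachments a, bugs.attach_data ad where ad.id = a.attach_id and ad.id = 5678;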

After looking at the freshly created /bugs/attachments.sql to verify that the paths and filenames looked good, I edited it to insert a ‘use bugs;’ line at the top of the file. Then I wrote a quick shell script to create 30,500 directories matching the ones created by my PDF script above:

#!/bin/bash
for i in {1..30500}
do
  mkdir /bugs/bug$i
done

After running that script, I verified that all my directories were created and gave write permission to them to the ‘mysql’ user. It was then time to run my attachments.sql file:

mysql -uroot -ppassword < /bugs/attachments.sql

A quick ‘du --si’ in the /bugs folder verified that there were indeed files in some of the folders. After confirming that the attachments were named correctly and corresponded to the numbered folder they were in, I used ‘find’ to prune any empty directories. This isn’t strictly necessary, but it means fewer folders to parse later.

cd /bugs
find . -type d -empty -delete

Putting it all together

The final step in this process was to merge the attachments with their corresponding PDF files. Because I used the same directory structure for both, I could now use FileZilla to transfer and merge the folders. I connected it to the VM over SFTP, highlighted the 30,500 folders containing the attachments, and dragged them over to the folder containing the 30,500 folders created during the PDF processing. FileZilla dutifully merged all of the attachments into the PDF folders. Once completed, I spot-checked to verify that everything looked good.

I wouldn’t have used this technique for recent or active data, but for old legacy bugs that are viewed a couple of times a year, it worked. The files are now stored on a limited-access read-only file share and can be searched easily for keywords. It also allowed us to remove one more legacy system from the books.

Working In the Not-Yet-Hot Aisle

“This is no longer a vacation.  It’s a quest.  It’s a quest for fun.” -Clark W. Griswold

The 48U enclosures and the in-row CRACs are in place and bolted together, but there’s no noise except the shrill shriek of a chop saw in the next room.  Drywall dust coats every surface despite the daily visits of the kind folks with mops and wet rags.  The lights overhead are working, but the three-cabinet UPS and zero-U PDUs are all lifeless and dark.

Even in this state, the cabinets are being virtually filled.  In a recently stood-up NetBox implementation back at HQ, top-of-rack switches are, contrary to their name, being placed in the middle of the enclosures.  Servers are being virtually installed, while in the physical world, blanking panels are being snapped into place and patch panels are being installed by the cabling vendor, leaving gaping holes where there will soon be humming metal boxes with blinkenlights on display.  Some gaps are bridged with blue painters’ tape, labeled in permanent marker with the eventual resting place of ISP-provided equipment.

During the first couple of weeks after everything is bolted down, we’re pretty much limited to planning and measuring because the room is packed with contractors running Cat6, terminating fiber runs, plumbing the CRACs, putting batteries in the UPS, connecting the generator and wiring up the fire system–it’s barely controlled chaos.  Within a couple of weeks, the pace slows a bit; it’s still a hardhat-required zone, but the fiber runs to the IDFs are done and being tested, patch panels are being terminated and the fire system has long since been inspected and signed off by the city.  A couple more weeks and we know it’s time to get serious: the sticky mat gets stuck down inside the door, the hardhat rules are rescinded and the first CRACs are fired up.

Thus begins the saga of a small band of intrepid SysAdmins working to turn wrinkled printouts, foam weatherstripping, hundreds of cage nuts, blue painter’s tape and a couple hundred feet of Velcro into a working data center.  This marks the first time I’ve worked in a hot-aisle/cold-aisle data center, much less put one together.  This is something I’ve wanted to do for years, but there’s remarkably little detailed information on the web about this process; the nitty gritty of data center design and construction is usually delegated to consultants who like to keep their trade a closely-guarded secret, and indeed, we consulted with a company on the initial design and construction of our little box of heaven.

The concept of hot-aisle/cold-aisle containment is pretty straightforward and detailed in hundreds of white papers on the Internet: server and network equipment uses fans to pull cool air in one side of the unit and blow heat out the other.  Therefore, if you can turn your data center into two compartments, one that directs all of the cooled air from your A/C into the cold intake side of the equipment, and one that directs all of the heated air from your equipment back into the A/C return, you increase the efficiency and reduce the cost of running your A/C, and you keep hot exhaust air from recirculating back into the equipment intakes.  Ultimately, if done right, you can turn up the temperature in the cold aisle, further reducing your costs, because there are no “hot spots” where equipment is picking up hotter exhausted air.  The methods for achieving this vary greatly.

And more importantly, it turns out that there are some caveats that can either significantly increase the initial cash outlay or reduce overall efficiency.

Stay tuned as I dig into the details of this new project.


Linux Stories: The Radio Ripper

It was early 2005.  My Southern California commute had grown to 65 miles one way, averaging three hours a day.  I was growing tired of top 40 music and news radio.  After a particularly late night at work, I was listening to my local NPR station when the broadcast for the local minor league baseball team came on.  While I had never been a sports nut, I found that the broadcaster (Mike Saeger, now of the San Antonio Missions) interwove the narrative of the game with fascinating stories about the team, the players and baseball’s past.  Unfortunately, the broadcast typically started at 7:00 or 7:30 in the evening, meaning that I would catch the first 30 minutes at best before arriving home.

My goal became to record the games to listen to on my commute.  At first I considered piping a radio into a sound card, but I couldn’t get the station in the mountains where I lived.  I eventually figured out that the NPR station broadcasting the games had a website with a live feed.  Upon further digging, I found that the feed used RealPlayer’s streaming audio format.

I looked into automated recording of the feed and was unimpressed with the options on the Windows platform.  After some searching, I found that mplayer on Linux could play a RealPlayer feed, and more importantly, it could record it to a PCM-coded WAV file.  Thus began my quest to build a recording rig.  My goal was to record the game to a WAV file and convert it into a low-quality MP3 that I could play from my chewing-gum-pack-sized Creative MuVo.

I chose Fedora Core as my distro, mainly because most of the relevant forum posts and blogs I found on the topic had a Fedora or Red Hat focus.  I downloaded Fedora Core 3 to a CD and looked through the scrap pile at work to find a suitable machine.  I wanted something with low power consumption, since California’s electricity rates were astronomical and this machine would be running 24/7.  I settled on a Dell Latitude C600 laptop.  It had a 750MHz mobile Pentium III and 192MB of RAM.  It had been relegated to the scrap heap because it had a weak battery, some missing keys and a broken screen, none of which would be an issue for my purposes.  More importantly, with the lid closed, it drew only 15 watts at full tilt and 12 watts at idle with the 60GB drive spun down.  The weak battery still lasted 30 minutes, which was enough to ride out most power interruptions.

I’m a command line guy–I had to be dragged kicking and screaming into Windows–so I deselected the GUI and went with a minimal install.  All told, the running OS took less than a gigabyte of disk space and 48 megs of the 192 megs of available RAM, leaving plenty for my tools.  I installed mplayer, sox and lame to handle the audio processing.  My shell script would spawn mplayer in the background, wait four hours, and then signal it to terminate.  Sox would then be launched to re-encode the stream into low bit-rate mono, which was then piped into lame to create the final MP3.  The encoding process would take several hours but still complete in time for me to download the file to my MP3 player the next morning.  Scheduling was done by looking at the team’s schedule at the beginning of the week and using a series of ‘at’ commands to run the script at the appropriate days and times.
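
The original script is long gone, but a minimal sketch of that pipeline would look something like this (the stream URL and file names are placeholders, and mplayer’s PCM output syntax has varied a bit between versions):

#!/bin/bash
# Rough reconstruction of the recording pipeline: capture the RealPlayer
# stream to a WAV file, then re-encode it to a low bit-rate mono MP3.
STREAM="rtsp://stream.example-npr-station.org/live"   # placeholder URL
OUT="game-$(date +%Y%m%d)"

mplayer -really-quiet -ao pcm:file="$OUT.wav" "$STREAM" &
PID=$!
sleep 4h        # let the broadcast run its course
kill "$PID"     # stop the capture

# Downsample to 22.05kHz mono and pipe straight into lame for the final MP3
sox "$OUT.wav" -c 1 -t wav - rate 22050 | lame -b 32 - "$OUT.mp3"
rm "$OUT.wav"

Queuing a week’s worth of games was then just a handful of one-liners along the lines of ‘echo /home/tim/record.sh | at 18:55 Friday’.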

As baseball season drew to a close, I looked for alternatives and settled on downloading NPR weekend shows like Car Talk and Wait Wait–shows that were not yet available as podcasts.  Those shows were available at NPR’s website, again in RealPlayer format.  However, the filenames changed each week, and the shows were broken into segments that had to be downloaded individually.  I was able to use wget to get the HTML for the page and grep to parse out the filenames into a text file.  I then fed the text file into mplayer, one line at a time, to download the segments.  Finally, I used sox to encode and concatenate the individual files into a single file, which was converted to MP3 by lame.
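
Again from memory, the scrape-and-stitch step looked roughly like the following. The URL and link pattern are purely illustrative; the real pages pointed at RealMedia files whose names changed every week.

#!/bin/bash
# Rough sketch: scrape the show page for stream links, download each
# segment with mplayer, then concatenate and encode with sox and lame.
wget -qO page.html "https://www.npr.org/programs/example-show/"   # placeholder URL
grep -oE 'http://[^"]+\.rm' page.html > segments.txt              # illustrative pattern

n=0
while read -r url; do
  n=$((n+1))
  mplayer -really-quiet -ao pcm:file="$(printf 'seg%02d.wav' "$n")" "$url"
done < segments.txt

# Concatenate the segments, downmix to mono and hand the result to lame
sox seg*.wav -c 1 -t wav - rate 22050 | lame -b 32 - show.mp3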

After about a year, I recognized a mistake I had made: I chose Fedora Core without knowing its support policy.  Fedora Core 3 only received security patches for about 16 months from its release, meaning updates ended less than a year after I installed it.  There was an unofficial process for upgrading to FC4, but it broke a lot of things and took a lot of cleanup.  After that upgrade, I left the system at FC4 until it was decommissioned.

This was my first real-world experience with Linux, and it really helped me to feel comfortable with the operating system.  While I have used Debian, Ubuntu, SuSE and other Linuxes, this experience with Fedora drove me to choose CentOS (or Red Hat if OS support is required) as my go-to Linux server OS.

Homelab 2018 – The hardware

Like many, I’ve dreamed of having a miniature datacenter in my basement.  My inspiration came from a coworker who had converted his garage into a datacenter, complete with multiple window A/C units, a tile floor, racks and cabinets full of equipment and his own public Class C, from back when individuals could still request and receive one.

I lived in a house in the mountains that had a fairly high crawlspace underneath, and I dreamed of someday scooping it out level, pouring cement and putting in a short server cabinet and a two-post rack. To that end, I had a lot of equipment that I had saved from the landfill over the previous couple of years: old second-generation ProLiants, a tape changer, a QNAP that had been discarded after a very expensive data recovery operation and an AlphaServer 4100.

However, I learned two important lessons that changed my plans: first, that home ownership can be very expensive, and something as simple as pouring a floor was beyond both my abilities and my budget; and second, that electricity in California is ruinously expensive for individuals.  While commercial customers generally get good prices regardless of their usage, it doesn’t take many spinning hard drives before the higher residential tiers kick in at 35+ cents per kWh.  I think my coworker got away with it because he was doing web hosting on the side out of his garage at a time when web hosting still actually paid something.

Once I realized the sheer cost of having the equipment turned on, to say nothing of the cost of air conditioning the basement when my poorly insulated house was already costing over $300 a month to cool during the summer, I gave up on my dreams and donated most of the equipment to that coworker.  My homelab ended up being a single-core Dell laptop “server” with a broken screen that ran Fedora and consumed 12 watts.

Fast forward to 2016, and I realized that I needed to make a change.  Working as the sole IT admin in an SMB meant that I always had zero to near-zero budget and had to assemble solutions from whatever discarded hardware and free software I could cobble together.  While that does provide some valuable experience, modern companies are looking for experience with VMware or Hyper-V on SAN or virtual SAN storage, not an open-source NAS running QEMU on an old desktop.

I looked at building that homelab I had always wanted, but electricity costs had only gone up in the intervening 15 years, and I was now in a transitional state of four people living in a two-bedroom apartment.  I wasn’t willing to sacrifice what little tranquility we had at home by having a couple of R710s screaming in the living room.  Thus, I decided to build my own “servers.”

I had already built my “primary” desktop specifically to move some workloads off of my laptop.  It was an i5-4570S desktop with 6GB of RAM and a discarded video card that I used for light gaming and running one or two small VMs in VirtualBox.  My goal was to build two or three compact desktops I could run trialware on to familiarize myself with vCenter, SCCM and other technologies I was interested in.  By keeping the machines compact, I could fit them under my desk, and by using consumer-grade hardware, I could keep the cost, noise and power usage down.

To save space, I chose the Antec VSK2000-U3 mini/micro-ATX case.  Looking back, this was a huge mistake: these cases are about the size of a modern SFF Dell, but it is a pain finding motherboards and heatsinks that fit.  However, they did the job as far as fitting into the available space under the desk.  They use a TFX form factor power supply, common in a lot of Dell desktop and SFF machines, so used power supplies are cheap and plentiful on eBay.

When choosing motherboards, my goal was to find a mini- or micro-ATX board with four RAM slots so I could eventually upgrade each machine to 32GB using 8GB sticks–not as easy a task as one might think.  The first board, a Gigabyte mini-ATX model with an i5-3470S CPU, I found on Craigslist.  Due to the location of the CPU on the board, I couldn’t fit it in the case without the heatsink hitting the drive tray, so I ended up swapping it with the board in my home machine, as my nice Gigabyte motherboard and i5-4570S fit the Antec case.

Thinking I was clever, I chose an eBay take-out from an Optiplex 9010 SFF as my second motherboard.  It had four RAM slots and was cheaper than any of the other options.  However, I soon found out that Dell engineered that sucker to fit their case and no others.  The proprietary board wouldn’t accept a standard heatsink, so I ended up getting the correct Dell heatsink/fan combination from eBay, which fit the case perfectly and used a heat pipe setup to eject the heat out of the back of the case.  I also had to get the Dell rear fan and front panel/power button assembly to get the system to power on without complaining.  Fortunately, the Dell rear fan fit the front of the Antec case where Antec had provided their own fan, so no hacking was needed.  Finally, the I/O shield in the Optiplex is riveted into the Dell case and can’t be purchased separately, so I’m running this one without a shield.  The system runs an i5-3570K that I pulled from another dead machine rescued from the trash.


Optiplex 9010 SFF heatsink, bottom view.  The end of the fan shroud on the left stops just millimeters from the rear grille of the Antec case, like they were made for each other.

Once the homelab was up and running, I upgraded RAM when I could afford it.  The two homelab machines and my desktop started out with 8GB each and now have 24GB each.  To further save electricity, I keep them powered down when not in use.  (My primary desktop stays on, as it runs Plex and other household services.)  Right now, each system has a hard drive and a 2.5″ SSD.  (They have an empty optical drive bay, so additional drives can fit with a little work.)  I picked up some four-port Intel gigabit NICs (HP NC364T) since the onboard Realtek NICs on the Gigabyte boards aren’t supported by ESXi.

So the big question: will these work with ESXi 6.7?  In short, no.  They run 6.5 Update 2 just fine, but the 3570K machine crashes when booting 6.7, possibly because that CPU lacks VT-d support.  Still, both run 6.0 and 6.5 without complaint, which will get me through my VCP studies.  For the price, power consumption, heat and noise, they do the job just fine for now.

The Littlest Datacenter Part 6: Lessons Learned

For the first post of this long saga, click here.

It’s been a year since I moved on from the company running on the Littlest Datacenter, and about two years since it was built.  As I mentioned, I built it to be as self-sufficient, flat, simple and maintainable as possible, first because I had duties beyond being the IT guy and dropping everything to hack on junk equipment wasn’t going to cut it; second because I was the only IT guy and I wanted to be able to take vacations and sleep through the night without the business falling apart; and third, because I knew that, regardless of whether I stayed with that company or not, the IT function would eventually be given to an MSP or a junior admin.

Looking back at the setup, here are some lessons learned:

Buy Supermicro carefully:  The default support Supermicro offers is depot repair.  That means you’re deracking your server, boxing it up and paying to ship it back to them, and the turnaround can be anywhere from one to six weeks.  This sucks because Supermicro offers a lot of flexible and reliable hardware choices for systems that fall outside the mainstream.  For instance, my Veeam server fit sixteen 3.5″ hard drives and two 2.5″ SSDs for less than half the cost of the equivalent Dells and HPs, and it supported enterprise drives that didn’t come with the Dell/HP tax.  Just be sure to add on the onsite warranty or carry spare parts.

You’re gonna need more space:  And not just disk space.  I ended up adding 8TB of storage to my hosts to handle the high-resolution cameras covering the additional shipping tables that went in a year after the initial build.  Fortunately I had extra drive bays, but any further expansion will involve a larger tape changer and SAS expansion shelves for the hosts.

Cheaper can sometimes be better:  For a simple two-host Windows cluster, Starwind saved the company a good six figures.  It’s no Nimble, but it was fast, bulletproof and affordable.  And like I said before, Supermicro really saved the day on the D2D backup server.

A/C is the bane of every budget datacenter:  The SRCOOL12K I used did the job, but it was loud and inefficient.  I really should have pushed for the 12,000 BTU mini-split, even though it would have taken more time and money.

So is power:  I probably could have bought the modular Symmetra LX for what I paid for the three independent UPSes.  The independent units are less of a single point of failure than a monolith like the Symmetra, but I could have added enough power modules and batteries to the Symmetra to achieve my uptime goal and also power the A/C unit–something that the individual UPSes could not do.

SaaS all of the things:  Most of our apps were already in the cloud, but I implemented the PBX locally because it was quite a bit cheaper given the number of extensions.  I’m now thoroughly convinced that in a small business, hosting your own PBX is only slightly less stupid than hosting your own Exchange Server.  Until you get to a thousand extensions and can afford to bring on a dedicated VoIP guy, let someone else deal with it.  The same goes for monitoring–I would have gladly gone with hosted Zabbix had it been available at the time, and with PagerDuty for alerting.

Expect your stuff to get thrown out:  My artisanally-crafted monitoring system went out the window when the MSP came in.  Same for my carefully locked down pfSense boxes.  Just expect that an MSP is going to have their own managed firewalls, remote support software, antivirus, etc.

Don’t take it personally:  Commercial pilots and railroad engineers describe the inevitable result of any government accident investigation: “They always blame the dead guy.”  That crude sentiment also applies to IT: no matter what happens after you leave, you’re going to get blamed for it.  Despite my careful documentation and training of my replacement, I hadn’t even left when I started getting phone calls about outages, and they were basically all preventable.  The phone system was rebooted in the middle of the day.  A Windows Server 2003 box was shut down, even though it hosted the PICK application the owner still insisted on keeping around.  The firewalls were replaced without examining the existing rules first, plunging my monitoring system into darkness and causing phone calls to have one-way audio.  I answered calls and texts for two weeks, and then stopped worrying about them and focused solely on my present and future.

Write about it: Even if nobody reads your blog, outlining what you did and why, and what worked and what didn’t, will help you make better recommendations in the future.  And if someone does read it, it might help them as well.


The Littlest Datacenter Part 5: Monitoring

Part 1 of this saga can be found here.

Good monitoring doesn’t come ready to use out of the box.  It takes a lot of work: deciding what to monitor, setting thresholds, and choosing when and how to alert.  Alert Fatigue is a thing, and the last thing you want is to miss an important alert because you silenced your phone to finally get a full night’s sleep.  But monitoring and alerting, if configured in a sensible way, can save hours of downtime, head off user complaints and preserve the IT practitioner’s sanity.

I had been searching for monitoring utopia for more than a decade.  I tried Zenoss, Nagios and Zabbix on the FOSS side, and briefly toyed with PRTG and others before being turned down by management due to cost.  In every case, it took a lot of work to set up.  Between agent installation and updates, SNMP credential management, SNMPWALKing to find the right OIDs to monitor, and figuring out what and when to alert, monitoring in a mixed environment takes a lot of TLC to get right.

In my mini datacenter build, I had the opportunity to build my monitoring system the way I wanted.  I could have made a business case for commercial software, but this build and move had already taken a significant chunk of money–remember that the entire warehousing and shipping/receiving operation had also been moved and expanded.  Besides, having tested several of the commercial products, I knew that while the initial setup probably would have been faster, the ongoing work–tweaking thresholds, adding and removing sensors, and configuring monitoring–was going to take significant effort regardless of the system I chose.

With that in mind, I chose Zabbix.  I had used Nagios in the past, but I preferred a GUI-based configuration in this case; in an environment deploying a lot of similar VMs or containers, a config-file-based product like Nagios would probably make more sense.  Having chosen Zabbix, the question was where to install it.  At the time, Zabbix didn’t have a cloud-based product, and installing it within the environment didn’t make sense, as various combinations of equipment and Internet failure could make it impossible to receive alerts.

After looking at AWS and Azure, I went with simple and cheap: DigitalOcean.  Their standard $5/month CentOS droplet had 512MB of RAM, one vCPU, 25GB of disk space, a static IP and a terabyte of transfer a month.  I opted for the extra $1/month for snapshotting/backups, which would make patching less risky.

The first step was to set up and lock down communications.  I went with a CentOS VM at each of the company’s two locations and installed the Zabbix proxy.  The Zabbix proxies were configured for active mode, meaning that they communicated outbound to the Zabbix server at DigitalOcean.  pfSense was configured to allow the outbound traffic, and DigitalOcean’s web firewall restricted the Zabbix server from receiving unsolicited traffic from any IPs other than the two office locations.  Along with the built-in CentOS firewall, it was guaranteed to keep the bad guys out.  I also secured the Zabbix web interface with Let’s Encrypt, because why wouldn’t I?
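
On the proxy side, active mode only takes a few lines in zabbix_proxy.conf. A minimal sketch (the IP and hostname are placeholders, and parameter names can vary a little between Zabbix versions):

# Run as an active proxy: the proxy initiates all connections to the server
ProxyMode=0
# Public IP of the Zabbix server droplet
Server=203.0.113.10
# Must match the proxy name defined in the Zabbix frontend
Hostname=office1-proxy
# Seconds between configuration pulls from the server
ConfigFrequency=300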

Next up was configuring the checks themselves.  Zabbix comes with a Windows monitoring template, so after installing the agent on all of the Windows hosts and VMs, I assigned monitoring by template.  I found it best to duplicate the base template and tailor the duplicate for each specific function.  For instance, one copy of the Windows template was used for the DNS/DHCP/AD servers; in addition to the usual disk space and CPU checks, it watched whether the DHCP and DNS services were running.  Another copy was tweaked for the VM hosts, with, for instance, more sensible disk space checks.  Linux monitoring, including the pfSense boxes, was configured similarly.  Ping checks of the external IPs were done from the Zabbix host itself.
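
The service checks are just extra items on the copied template, boiling down to item keys along these lines (illustrative only; the exact keys and Windows service names depend on the Zabbix agent and OS versions in play):

# Extra items on the copied Windows template (illustrative)
service.info[DNS,state]          # DNS Server service; 0 means running
service.info[DHCPServer,state]   # DHCP Server service; 0 means running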

Environmental monitoring was also important due to the small closet size and the lack of generator support for the building.  SNMP came to the rescue here.  Fortunately, I had two Tripp Lite UPSes and one APC UPS in that little server closet.  Using templates I found online, I was able to monitor battery charge level and temperature, remaining battery time, power status, humidity and battery/self-test failures.  The Tripp Lite units had an option for an external temperature/humidity sensor, and I was able to find the remote sensors on eBay for less than $30 each.  One I mounted in front of the equipment in the rack to measure intake air temperature, and the other I hung in the airflow of the A/C to measure its output temperature.  That way, I would be alerted if the A/C unit failed, and I could also track its cycle time and compare how cold the output air was against historical data.
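
When there isn’t a ready-made template, the usual starting point is to walk the management card and see what it exposes, with something like the commands below (the IP and community string are placeholders, and the interesting OIDs live under each vendor’s enterprise subtree):

# Sanity check against the standard system MIB
snmpwalk -v1 -c public 192.0.2.50 .1.3.6.1.2.1.1
# Then hunt through the vendor (enterprise) subtree for battery and temperature OIDs
snmpwalk -v1 -c public 192.0.2.50 .1.3.6.1.4.1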

The primary infrastructure to monitor was the Hyper-V cluster.  Fortunately, I found a Hyper-V and CSV template online that I was able to tweak into shape.  Not only did it monitor the CSVs for free disk space and connectivity, it could alert if either of the hosts had zero VMs in active status–an indication that the host had rebooted or otherwise lost its VMs.  A Dell template monitored the server hardware and could report a disk or fan failure.

No monitoring system is complete without graphs and alerts, so I built a custom “screen”, which is Zabbix terminology for a status dashboard.  I created a 1080×1920 portrait screen with battery and temperature graphs, a regularly updated picture of the interior of the room provided by an old Axis camera, and a list of active alerts and warnings.  I mounted a 1080p monitor in portrait orientation in my office and used an old PC to display the screen.

Finally, I tackled the issue of Alert Fatigue.  At a previous employer that had a 24-hour phone staff, I would receive eight to ten phone calls in the middle of the night, and of those calls, maybe only one a month would actually need my attention that night.  I vowed to tweak all of the unnecessary alerts out of my new monitoring system.

I used a generic IT Google Apps email account on the Zabbix server to send the alerts to my email.  I set up Gmail to parse each alert and, if its status was “Disaster”, forward it as a text message to my phone to wake me up.  Then I went through all of my alerts and determined whether they were critical enough to earn a “Disaster” rating.  A VM that had just rebooted wasn’t critical.  The Windows service that recorded surveillance on the NVR not running was critical.  The power going out wasn’t critical.  The remaining battery dropping below four hours, indicating that the power had been out for over an hour, was critical.  By setting up my alerting this way, I would still be awakened for actual issues that needed immediate correction, but less critical issues could be handled during waking hours.  I also tiered my alerts to indicate when a problem was progressing.  For instance, I would get an email alert if the room temperature reached 75 degrees, a second at 80 degrees, and a phone alert at 85 degrees.  That would give me plenty of time to drive in or remote in and power down equipment.

Many alerts I disabled completely.  The Windows default template alerts if CPU usage reaches something like 70%.  Do I care if my VMs reach 70% CPU?  If I don’t care at all, I can simply turn the trigger off.  If I’m really concerned about some runaway process eating CPU for no reason, I can tweak the trigger so that I’m not alerted until the CPU exceeds 95% for 15 minutes continuously.  At any rate, that’s not going to be flagged as a “disaster.”
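
In the Zabbix trigger syntax of that era, such a check looks roughly like this (the host name and item key are placeholders; the actual key depends on the template in use):

# Fire only if average CPU utilization stays above 95% for 15 minutes
{app-server-01:system.cpu.util.avg(15m)}>95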

The Zabbix droplet worked great.  There was no noticeable downtime in the two years I ran it, and I was able to run about 2,000 sensors on it with overhead to spare, even with one vCPU and 512MB of RAM.  (DigitalOcean has since increased that base droplet to 1GB of RAM.)  I probably would have replaced my Gmail-to-text kludge with PagerDuty if I had known better, as it can follow up alerts with automated phone calls.  At any rate, I slept much better knowing my environment was well-monitored.

Next time:  Lessons learned, or “What would I do differently today?”