The Littlest Datacenter Part 6: Lessons Learned

For the first post of this long saga, click here.

It’s been a year since I moved on from the company running on the Littlest Datacenter, and about two years since it was built.  As I mentioned, I built it to be as self-sufficient, flat, simple and maintainable as possible: first, because I had duties beyond being the IT guy, and dropping everything to hack on junk equipment wasn’t going to cut it; second, because I was the only IT guy and I wanted to be able to take vacations and sleep through the night without the business falling apart; and third, because I knew that, regardless of whether I stayed with that company or not, the IT function would eventually be given to an MSP or a junior admin.

Looking back at the setup, here are some lessons learned:

Buy Supermicro carefully:  The default support Supermicro offers is depot repair.  That means you’re deracking your server, boxing it up and paying to ship it back to them for repair.  Repair can take anywhere from one to six weeks.  That’s a shame, because Supermicro offers a lot of flexible and reliable hardware choices for systems that fall outside the mainstream.  For instance, my Veeam server fit sixteen 3.5″ hard drives and two 2.5″ SSDs for less than half the cost of the equivalent Dells and HPs, and it supported enterprise drives that didn’t come with the Dell/HP tax.  Just be sure to add on the onsite warranty or carry spare parts.

You’re gonna need more space:  And not just disk space.  I ended up adding 8TB more disk space to my hosts to handle the high-resolution cameras covering the additional shipping tables added a year after the initial build.  Fortunately I had extra drive bays, but any more expansion will involve a larger tape changer and SAS expansion shelves for the hosts.

Cheaper can sometimes be better:  For a simple two-host Windows cluster, Starwind saved the company a good six figures.  It’s no Nimble, but it was fast, bulletproof and affordable.  And like I said before, Supermicro really saved the day on the D2D backup server.

A/C is the bane of every budget datacenter:  The SRCOOL12K I used did the job, but it was loud and inefficient.  I really should have pushed for the 12,000 BTU mini-split, even though it would have taken more time and money.

So is power:  I probably could have bought the modular Symmetra LX for what I paid for the three independent UPSes.  The independent units are less of a single point of failure than a monolith like the Symmetra, but I could have added enough power modules and batteries to the Symmetra to achieve my uptime goal and also power the A/C unit–something that the individual UPSes could not do.

SaaS all of the things:  Most of our apps were already in the cloud, but I implemented the PBX locally because it was quite a bit cheaper due to the number of extensions.  I’m now thoroughly convinced that in a small business, hosting your own PBX is only slightly less stupid than hosting your own Exchange Server.  Until you get to a thousand extensions and can afford to bring on a dedicated VoIP guy, let someone else deal with it.  Same goes for monitoring–I would have gladly gone with hosted Zabbix if it was available at the time.  Same with PagerDuty for alerting.

Expect your stuff to get thrown out:  My artisanally-crafted monitoring system went out the window when the MSP came in.  Same for my carefully locked down pfSense boxes.  Just expect that an MSP is going to have their own managed firewalls, remote support software, antivirus, etc.

Don’t take it personally:  Commercial pilots and railroad engineers describe the inevitable result of any government accident investigation: “They always blame the dead guy.”  That crude sentiment also applies to IT: no matter what happens after you leave, you’re going to get blamed for it.  After carefully documenting and training my replacement, I hadn’t even left when I started getting phone calls about outages, and they were basically all preventable.  The phone system was rebooted in the middle of the day.  A Windows Server 2003 box was shut down, even though it hosted the PICK application the owner still insisted on keeping around.  The firewalls were replaced without examining the existing rules first, plunging my monitoring system into darkness and causing phone calls to have one-way audio.  I answered calls and texts for two weeks, and then stopped worrying about them and focused solely on my present and future.

Write about it: Even if nobody reads your blog, outlining what you did and why, and what worked and what didn’t, will help you make better recommendations in the future.  And if someone does read it, it might help them as well.


The Littlest Datacenter Part 5: Monitoring

Part 1 of this saga can be found here.

Good monitoring doesn’t come ready to use out of the box.  It takes a lot of work: deciding what to monitor, what thresholds to set, and when and how to alert.  Alert Fatigue is a thing, and the last thing you want is to miss an important alert because you silenced your phone to actually get a full night’s sleep.  But monitoring and alerting, if configured in a sensible way, can save hours of downtime, head off user complaints and preserve the IT practitioner’s sanity.

I had been searching for monitoring utopia for more than a decade.  I tried Zenoss, Nagios and Zabbix on the FOSS side, and briefly toyed with PRTG and others before management turned them down due to cost.  In every case, it took a lot of work to set up.  Between agent installation and updates, SNMP credential management, SNMPWALKing to find the right OIDs to monitor, and figuring out what and when to alert, monitoring in a mixed environment takes a lot of TLC to get right.

In my mini datacenter build, I had the opportunity to build my monitoring system the way I wanted.  I could have made a business case for commercial software, but this build and move had already taken a significant chunk of money–remember that the entire warehousing and shipping/receiving operation had also been moved and expanded.  And besides, having tested several of the commercial products, I knew the initial setup probably would have been faster, but the ongoing work–tweaking thresholds, adding and removing sensors, and fine-tuning checks–was going to be substantial regardless of the system I chose.

With that in mind, I chose Zabbix.  I had used Nagios in the past, but I preferred a GUI-based configuration in this case.  In an environment where I was deploying a lot of similar VMs or containers, a config file-based product like Nagios would probably make more sense.  Having chosen Zabbix, the question was where to install it.  At the time, Zabbix didn’t have a cloud-based product, and installing it within the environment didn’t make sense, as various combinations of equipment and Internet failure could make it impossible to receive alerts.

After looking at AWS and Azure, I went with simple and cheap: DigitalOcean.  Their standard $5/month CentOS droplet had 512MB of RAM, one vCPU, 25GB of disk space, a static IP and a terabyte of transfer a month.  I opted for the extra $1/month for snapshotting/backups, which would make patching less risky.

The first step was to set up and lock down communications.  I went with a CentOS VM at each of the company’s two locations and installed the Zabbix proxy.  The Zabbix proxies were configured for active mode, meaning that they communicated outbound to the Zabbix server at DigitalOcean.  pfSense was configured to allow the outbound traffic, and DigitalOcean’s cloud firewall kept the Zabbix server from accepting unsolicited traffic from any IPs other than the two office locations.  Along with the built-in CentOS firewall, that combination went a long way toward keeping the bad guys out.  I also secured the Zabbix web interface with Let’s Encrypt, because why wouldn’t I?
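For reference, the active-proxy piece boils down to a handful of lines in zabbix_proxy.conf on each of those CentOS VMs.  This is a minimal sketch rather than my actual config; the server and proxy names are placeholders, and the frequencies are just reasonable values:

# /etc/zabbix/zabbix_proxy.conf (only the relevant settings)
ProxyMode=0                    # 0 = active: the proxy initiates all connections outbound
Server=zabbix.example.com      # the Zabbix server droplet at DigitalOcean (placeholder name)
Hostname=office1-proxy         # must match the proxy name defined in the Zabbix frontend
ConfigFrequency=300            # how often to pull monitoring configuration from the server (seconds)
DataSenderFrequency=1          # how often to push collected values up to the server (seconds)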

Next up was configuring monitoring.  Zabbix comes with a Windows monitoring template.  After installing the agent on all of the Windows hosts and VMs, I configured monitoring by template.  I found it best to duplicate the base template and use the duplicate for each specific function.  For instance, one copy of the Windows template was used to monitor the DNS/DHCP/AD servers.  In addition to the disk space, CPU usage and other normal monitoring, it would check whether the DHCP and DNS services were running.  Another copy of the template was tweaked for the VM hosts, with, for instance, more sane disk space checks.  Linux monitoring was configured similarly, as was monitoring of the (FreeBSD-based) pfSense boxes.  Ping checks of the external IPs were done from the Zabbix host itself.
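As a concrete sketch of what one of those cloned templates contained, the AD/DNS/DHCP copy added Windows service checks on top of the stock items.  The template and service names here are illustrative rather than my exact configuration, and the triggers use the older {host:key.function()} syntax that was current at the time:

Item:    service.info[DNS,state]           (0 = running)
Item:    service.info[DHCPServer,state]    (0 = running)
Trigger: {Template Windows AD:service.info[DNS,state].last()}<>0
Trigger: {Template Windows AD:service.info[DHCPServer,state].last()}<>0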

Environmental monitoring was also important due to the small closet size and lack of generator support for the building.  SNMP came to the rescue here.  Fortunately, I had two Tripp Lite and one APC UPS in that little server closet.  Using templates I found online, I was able to monitor battery charge level and temperature, remaining battery time, power status, humidity and battery/self test failures.  The Tripp Lite units had an option for an external temperature/humidity sensor, and I was able to find the remote sensors on eBay for less than $30 each.  One I mounted in front of the equipment in the rack to measure intake air temperature, and the other I mounted dangling in the airflow of the A/C to measure its output temperature.  That way, I would be alerted if the A/C unit failed, as well as monitor its cycle time and see how cold the output air was compared to historical data.

The primary infrastructure to monitor was the Hyper-V cluster.  Fortunately, I found a Hyper-V and CSV template online that I was able to tweak to work.  Not only did it monitor the CSVs for free disk space and connectivity, it could alert if either of the hosts had zero VMs in active status–an indication that the host had rebooted or otherwise lost its VMs.  A Dell template monitored server systems and could report a disk or fan failure.
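I don’t recall exactly how that template implemented the zero-running-VMs check, but a minimal sketch of one way to wire it up is a custom agent key backed by PowerShell, plus a trigger on the result.  The key name is made up for illustration, and it assumes the Zabbix agent has rights to query Hyper-V:

# zabbix_agentd.conf on each Hyper-V host (hypothetical custom key)
UserParameter=hyperv.vms.running,powershell -NoProfile -Command "(Get-VM | Where-Object State -eq Running).Count"

# Trigger expression (older syntax): fire if a host reports zero running VMs
{HV-HOST-1:hyperv.vms.running.last()}=0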

No monitoring system is complete without graphs and alerts, so I built a custom “screen”, which is Zabbix terminology for a status dashboard.  I created a 1080×1920 portrait screen with battery and temperature graphs, a regularly updated picture of the interior of the room provided by an old Axis camera, and a list of active alerts and warnings.  I mounted a 1080p monitor in portrait orientation in my office and used an old PC to display the screen.

Finally, I tackled the issue of Alert Fatigue.  At a previous employer that had a 24-hour phone staff, I would receive eight to ten phone calls in the middle of the night, and of those calls, maybe only one a month would actually need my attention that night.  I vowed to tweak all of the unnecessary alerts out of my new monitoring system.

I used a generic IT Google Apps email account on the Zabbix server to send the alerts to my email.  I then set my GMail to parse the alert and, if the alert status was “Disaster”, forward it as a text message to my phone to wake me up.  I then went through all of my alerts and determined whether they were critical enough to get a “disaster” rating.  A VM that had just rebooted wasn’t critical.  The Windows service that recorded surveillance on the NVR not running was critical.  The power going out wasn’t critical.  The remaining battery dropping below four hours, indicating that the power had been out for over an hour, was critical.  By setting up my monitoring this way, I would still be awakened for actual issues that needed immediate correction, but less critical issues could be handled during waking hours.  I also tiered my issues to indicate when a problem was progressing.  For instance, I would get an email alert if the room temperature reached 75 degrees, a second at 80 degrees, and a phone alert at 85 degrees.  That would give me plenty of time to drive in or remote in and power down equipment.
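In Zabbix terms, the temperature tiering was just three triggers of increasing severity on the same intake-temperature item.  Roughly like this, using the older trigger syntax and placeholder host and item names:

{UPS-RACK:temp.intake.last()}>75    severity: Warning    (email only)
{UPS-RACK:temp.intake.last()}>80    severity: Average    (email only)
{UPS-RACK:temp.intake.last()}>85    severity: Disaster   (email parsed and forwarded as a text)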

Many alerts I disabled completely.  The Windows default template alerts if CPU usage reaches something like 70%.  Do I care if my VMs reach 70% CPU?  If I don’t care at all, I can turn the alert off completely.  If I’m really concerned about some runaway process eating CPU for no reason, I can tweak that setting so that I’m not alerted until the CPU exceeds 95% for 15 minutes continuously.  At any rate, that’s not going to be flagged as a “disaster.”

The Zabbix droplet worked great.  There was no noticeable downtime in the two years I ran it.  I was able to run about 2,000 sensors on that droplet with overhead to spare even with one vCPU and 512MB of RAM.  (DigitalOcean has since increased that base droplet to 1GB of RAM.)  I probably would have replaced my GMail-to-text kludge with PagerDuty if I had known better, as it can follow up alerts with automated phone calls.  At any rate, I slept much better knowing my environment was well-monitored.

Next time:  Lessons learned, or “What would I do differently today?”


The Littlest Datacenter Part 4: Environmental and Security

You can find part 1 of this saga, including the backstory, here.

Strip malls and low-rent office/retail buildings present a number of challenges to the fledgling IT-focused company.  From poor electrical infrastructure to lack of security to lack of options for broadband, the small business IT guy has his work cut out for him.  Add to that the fact that management typically chooses space based on price, and you’re lucky if you even get the address before moving day, much less a voice in the selection process.

And so I found myself cramming 100TB of disk into a closet large enough for a 42U cabinet and a 2-post.  And when I say large enough, I mean JUST large enough.  I was able to convince them to put in a door wide enough that I could roll the cabinet out on its wheels to get behind it.  To add to the fun, this closet was in the office space, so noise was a factor.

Fortunately, I had a say in how the room was built.  My goal in building the room was to maximize the use of space, keep the noise and cold air in, and provide a modicum of security; after all, sensitive equipment would be in this room.  To that end, the builder put fiberglass insulation into the walls, and then doubled the drywall on the outside facing the office. The inside walls were drywalled and then covered in plywood lagged to the studs, providing a strong base for mounting wall-based equipment and further reducing sound.  The roof was a lid consisting of solid steel sandwiched between plywood above and drywall below.  A steel door with a push button combination lock and steel frame completed the room.

As this was a warehousing operation, shipping was the most important function of the business.  Delayed shipments could result in financial penalties.  Since shipping was a SaaS function, my goal was to provide the business with enough power and Internet connectivity to complete the bulk of the day’s shipping even in the event of a power outage.  Installation of a generator was impossible due to the location, so I had to settle for batteries.  I ended up with five UPSes in total.  One Tripp Lite 3kVA and one APC 3kVA split the duties in the server cabinet, and one Tripp Lite 3kVA UPS kept the 2-post (with PoE switch for the cameras) and wall-mounted equipment alive.  I also had a 1,500VA unit at each pair of shipping tables to power the shipping stations (Dell all-in-ones) and label printers.  Additional battery packs were added to each unit so that a total uptime of about five hours could be achieved.  That gave plenty of time to either finish the shipping day or make a decision about renting a portable generator for longer outages.  So far, there has been only one significant outage during a shipping day, but the production line was able to work through it without a hitch.

Cooling for the server room was provided by a Tripp Lite SRCOOL12K portable air conditioner.  The exhaust was piped into the area above the drop ceiling.  While this did the job, I would have preferred a dual-hose unit with a variable-speed inverter compressor for more efficiency.  We investigated a mini-split, but due to property management requirements, it would have taken months and cost many thousands of dollars.  The server equipment could go for well over an hour before heat buildup became an issue, which was enough time to open the door and use a fan.  Equipment could also be shut down remotely, further reducing heat production.

In addition to the physical security, infrastructure security had to be considered as well.  To that end, I deployed a physically separate network for the surveillance, access control and physical security systems.  Endpoints ran with antivirus and GPO-enforced firewalls and auto-patching.  Ninite Pro took care of keeping ancillary software up to date.  As all of the company equipment was wired, the wireless network was physically segmented from the rest of the network for BYOD and customer use.  pfBlocker was deployed on the pfSense firewalls to block incoming and outgoing traffic to countries where we did not do business, and outbound traffic was limited initially to ports 80 and 443, with additional ports added on an as-needed basis.  Finally, I deployed Snort on the firewalls themselves and in various VMs to catch any intrusions if they happened.

Coming up: Monitoring and lessons learned.


The Littlest Datacenter Part 3: Backup

Part 1 of this saga can be found here.

I had an interesting backup conundrum.  I had 26TB of utterly incompressible and deduplication-proof surveillance video data that needed to be backed up.  As this video closely recorded the fulfillment operation and was used to combat fraud by “customers”, it needed to be accessible for 90 days after recording.  Other workloads that needed to be backed up included the infrastructure (AD/DNS/DHCP) servers, a local legacy file server, the PBX, and a few other miscellaneous VMs.  Those, however, totaled less than 2TB and were easily backed up using a multitude of options.

The video data was difficult to cloudify.  The initial 26TB alone was massive, and it also changed at a rate of about 50GB per hour during shipping hours.  And in the case of an actual hardware failure, getting that 26TB back down and into production would have been equally difficult.

This, combined with a limited budget, led me to a conclusion I didn’t want to reach: tape.  I needed to get a copy of the data out of the server room, and having a 52TB array sitting under somebody’s desk wasn’t appealing.  One thing I did know was that management considered an offsite copy unnecessary; in the event of a full-site disaster (fire, earthquake, civil unrest), the least of our worries would be defending shipments that had already been made.  With that in mind, I decided to go with a disk-to-disk-to-tape option.

When it comes to buying server hardware with massive amounts of disk, I generally look at Supermicro.  I had been quoted a few Dell servers with 4TB drives, but the cost was breathtaking.  I wanted a relatively lightly powered box with a ton of big drives at a reasonable price, and I got it.  I picked up a new 16-bay box with a single 8-core CPU, an LSI RAID controller and 64GB of RAM for less than used Dell gear.  For drives, I went with 6TB enterprise-class SAS drives: four from WD, four from HGST, four from Seagate and four from Toshiba.  I configured these drives in a RAID-60 with two of each model of drive in each half of the RAID-60.  That way, if I had a bad batch of Seagate drives (which NEVER happens, of course), I could lose all four and still have a running, if degraded, array.  This RAID arrangement gave me about 72TB usable–enough for two full backups and a number of incrementals.
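For anyone checking the math on that usable figure:

The 16 drives are split into two RAID-6 spans of 8 drives each.
Each span gives up two drives’ worth of capacity to parity: (8 – 2) x 6TB = 36TB usable per span.
The two spans are then striped together (the RAID-0 part of RAID-60): 2 x 36TB = 72TB usable, before formatting overhead.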

For tape duties, I picked up a new-old-stock Dell PowerVault 124T.  This LTO-5 SAS changer was chosen at a time when the initial build called for only 16TB for video.  Holding 16 tapes, its raw uncompressed capacity was about 24TB, and it eventually took all 16 tapes to hold a single full backup.

Veeam was chosen to handle the backup duties because, at the time, it was the only solution I could find that could do VM-level backups AND handle SAS tape libraries.  Backups were made nightly to the local disk repository, with full copies to tape occurring weekly.  The tapes were then stored in a fireproof safe in a steel-and-concrete vault at the other end of the building.

Coming soon: Environmental, monitoring and security.

The Littlest Datacenter Part 2: Internet and Firewalls

Part 1 of this saga can be found here.

As mentioned before, this was a SaaS-focused business.  Most of the vital business functions, including ordering, shipping and receiving, pricing, accounting and customer service, were SaaS.  That meant that a rock-solid Internet connection was required.  But again, a small business runs on a small budget.  Combined with the fact that the business was in a strip mall, we were lucky to get Internet at all.

Fortunately, we were able to get Fios for a reasonable cost and installed reasonably quickly.  Previously the business had been running IPCop on a tiny fanless Jetway PC, but I felt we had outgrown IPCop, and the Jetway box, though still working, was a bit underpowered for what I needed.  I settled on pfSense as my firewall of choice, but I didn’t want to run it on desktop hardware.

Fortunately, Lenovo had a nearly perfect solution for my budget: the RS140 server.  It was a 1U rackmount server with a four-core Xeon E3 processor with AES-NI for fast crypto, and it came with 4GB of RAM for a hair over $400.  The price was so good I bought two.  I fitted each out with an additional 4GB of RAM and two SSDs: a 240GB from SanDisk and a 240GB from Intel.  There was a bit of consternation when I discovered that the server came with no drive trays, but I was able to put the SSDs in 3.5″ adapters and mount them directly into the chassis with no drilling.

The SanDisk and Intel SSDs in each server were configured in software RAID-1 using the onboard motherboard RAID, and the integrated IPMI was finicky but good enough that I could remotely KVM into the boxes if need be.  The servers were then configured into an active/passive pair using the pfSense software, and I used a new HPe 8-port switch to connect them to the Fios modem.

The firewalls worked so well I bought a matching pair for the other location and connected the two sites with an IPSec tunnel so they could share files more securely.

You may ask why I used hardware for the firewalls instead of virtualizing them.  The answer is, I initially did virtualize them in Hyper-V.  However, I just wasn’t comfortable with the idea of running my firewalls on the same hardware as my workloads.  There have been rumors of ways to escape a VM and compromise the host, and indeed recent revelations about hypervisor compromise through buggy virtual floppy drivers (VENOM) and side-channel data leakage a la Spectre and Meltdown have confirmed my suspicions about virtualized firewalls.

Coming soon: Backup, environmental, monitoring and security.

The Littlest Datacenter Part 1: Compute and Storage

I was tasked with building a datacenter.  Okay, not really.  The company was expanding into a low-cost strip mall, which meant limited connectivity options, no power redundancy and strict rules regarding modifications.  It also meant that I was limited to two racks in a tiny closet in the middle of an office space.  Finally, as always, there was minimal budget.

The Requirements

The COO was very SaaS-focused for business applications.  As the sole IT person (with additional ancillary duties), I was happy to oblige.  File storage, office applications, email, CRM, shipping and accounting functions were duly shipped off to folks who do that kind of thing for a living, leaving me with a relatively small build: AD/DNS/DHCP, phone system and surveillance.  While the systems I was replacing used independent servers that replicated VMs between them, failover was a decidedly more… manual process than I wanted.  Because the business was penalized for missing shipping deadlines, systems needed to be redundant and self-healing to the extent possible within the thin budget.  Finally, I knew that I would eventually be handing off the environment to either a managed service provider or a junior admin, so everything needed to be as simple and self-explanatory as possible.

The infrastructure VM (AD, DNS, etc.) and ancillary VMs were pretty straightforward.  The elephant in the room was the surveillance system.  Attached to 27 high-resolution surveillance cameras, it would have to store video for 90 days for most of the cameras for insurance reasons.  Once loaded with 90 days of video, it would consume 26TB of disk space and average about 50GB/hour of disk churn during business hours.

The Software

Because of costs, I settled on Hyper-V as my VM solution.  As it’s included with the Windows licenses I was already buying, it was cost-effective, and it had live migration, storage migration, backup APIs, remote replication and failover capabilities.  Standard licensing allows two Windows VMs to run on one license, further reducing costs.

Next to consider was the storage solution.  As I mentioned, the existing server pair consisted of two independent Hyper-V systems, with one active and one passive.  Hyper-V replication kept the passive host up to date, but in the event of a failure or maintenance, failing over and failing back was a long and arduous process.  I opted for shared storage to allow HA.  Rather than roll my own shared storage, I decided to buy.

After talking with several vendors, I settled on Starwind vSAN.  I had used their trialware with good results, and it had good reviews from people who had chosen it.  As it ran on two independent servers with independent copies of the data, it protected against disk failure as well as host, backplane, operating system, RAID controller and motherboard failure.  Starwind sold a turnkey appliance, which was an OEM-branded but very familiar Dell T630 tower server, so I ordered two; that was substantially cheaper than sourcing the servers and vSAN software separately, and about a sixth of the cost of an equivalent pair of Dell servers and a separate SAN.

The Hardware

I settled on a pair of midrange Xeons with 12 cores each–24 cores or 48 threads per host.  This was enough to process video on all of the cameras while leaving plenty of overhead for other tasks.  The T630 is an 18-bay unit with a rack option.  Dual gigabit connections went to the dedicated camera switch, while another pair went to the core switches.  For Starwind, a dual-port 10 gigabit card was installed in each host.  One port on each was used for Starwind iSCSI traffic, and the other for Starwind sync traffic.  Both were redundant in software, and they were direct-connected between the hosts with TwinAx.  Storage for each host consisted of sixteen 4TB Dell drives and two 200GB solid state drives for Starwind’s caching.

In an effort to reduce complexity, I went with a flat network.  Two HPe switches provided redundant gigabit links to the teamed server NICs and the other equipment in the rack.  Stacked and dual-uplinked HPe switches connected the workstations and ancillary equipment to the core.

The Operating System and VMs

Windows Server 2012R2 standard provided the backbone, with Starwind vSAN running on top.  Two Windows VMs powered the AD infrastructure server and the surveillance recording server.  I later purchased an additional Windows license and built a second DC/DNS/DHCP VM running on the second host.

Coming Soon:  Firewalls, backup, environmental, monitoring and security


The Datastores That Would Not Die

As part of a recent cleanup of our vSphere infrastructure, I was tasked with removing disused datastores from our Nimble CS500.  The CS500 had been replaced with a newer generation all-flash Nimble, and the VMs had been moved a couple of months ago.  Now that the new array was up and had accumulated some snapshots, I was cleaning up the old volumes to repurpose the array.  I noticed, however, that even though all of the files were removed from the datastores, there were still a lot of VMs that “resided” on the old volumes.

VMware addresses this in a KB (2105343) titled “Virtual machines show two datastores in the Summary tab when all virtual machine files are located on one datastore.” It suggests that the VM is pointing to an ISO that no longer exists on the old datastore.

After looking at the configs, I realized that, sure enough, some of the VMs were still pointing to an ISO file that was no longer on that datastore.  Easy, right?  Except that when I set the optical drive on one of the test VMs back to “Client Device,” it was still pointing at the old datastore.

Looking through the config again, I noticed that the Floppy Drive setting is missing from the HTML5 client.  I fired up the Flex client and set the floppy drive to “Client Device” as well.  Still no go.  For the few VMs that were pointing at a nonexistent ISO, setting the optical drive back to “Client Device” worked, but for VMs that were pointing at a nonexistent floppy, changing the floppy to “Client Device” wasn’t working.  A bug in the floppy handling?  Perhaps.

I created a blank floppy image on one of my new datastores and pointed the VM’s floppy to that new image.  Success!  The VM was no longer listing the old datastore, and I could then set the floppy to “Client Device.”  After checking out other VMs, I realized that I had over 100 VMs that had some combination of optical drive or floppy drive pointing at a non-existent file on the old datastores.  PowerCLI to the rescue!

# Name of the VM to fix, passed as the first argument
$vm = $args[0]

# Grab the VM's CD and floppy drive objects
$cd = Get-CDDrive -VM $vm
$floppy = Get-FloppyDrive -VM $vm

# Point the floppy at the blank image first (works around the "Client Device" issue),
# then detach media from both the floppy and the CD drive
Set-FloppyDrive -Floppy $floppy -FloppyImagePath "[datastorename] empty-floppy.flp" -StartConnected:$false -Confirm:$false
Set-FloppyDrive -Floppy $floppy -NoMedia -Confirm:$false
Set-CDDrive -CD $cd -NoMedia -Confirm:$false
Simply save this as a .ps1 file and pass it the name of the VM (in quotes if it contains spaces). It will get the current floppy and CD objects from the VM, set the floppy to the blank floppy image previously created, and then set both the CD and floppy to use “NoMedia.” This was a quick and dirty script, so you will have to install PowerCLI and do your own Connect-VIServer first. Once connected, however, you can either manually specify VMs one at a time or modify the script to get VM names from a file or from vCenter itself, as sketched below.  All of these settings can be changed while the VM is running, so there is no need to schedule downtime to run this script.
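If you would rather hit every affected VM in one pass, something like the following works once you are connected; the datastore name and script filename here are placeholders:

# Run the fix against every VM still registered on the old datastore (hypothetical names)
Get-VM -Datastore (Get-Datastore "OldDatastore01") | ForEach-Object { .\fix-vm-media.ps1 $_.Name }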

After all of this work, I found that there were still a few VMs that were showing up on the old datastores.  Another quick Google search revealed that any VM with snapshots taken while the CD or floppy was mounted would still show up on that datastore.  Drat!  After clearing out the snapshots, I finally freed up the datastores and was able to delete them using the Nimble Connection Manager.

So now a little root cause analysis: why were there so many machines with a nonexistent CD and floppy mounted?  After seeing that they were all Windows 2016 VMs, I went back to our templates and realized that the tech who built the 2016 template left the Windows ISO (long since deleted) and floppy image (mounted so he could F6-load the paravirtualized SCSI driver during OS installation) mounted when he created the template.  I converted the template to a VM, removed the two mounts (using the same two-step method for the floppy) and converted it back to a template.
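For what it’s worth, the same template round-trip can be scripted from PowerCLI.  This is a rough sketch with a made-up template name, using the same two-step floppy fix as above:

# Convert the template to a VM, detach the stale media, then convert it back (hypothetical name)
$t = Get-Template "Win2016-Template" | Set-Template -ToVM
Set-FloppyDrive -Floppy (Get-FloppyDrive -VM $t) -FloppyImagePath "[datastorename] empty-floppy.flp" -StartConnected:$false -Confirm:$false
Set-FloppyDrive -Floppy (Get-FloppyDrive -VM $t) -NoMedia -Confirm:$false
Set-CDDrive -CD (Get-CDDrive -VM $t) -NoMedia -Confirm:$false
Set-VM -VM $t -ToTemplate -Confirm:$false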

With that job done, I’m continuing to plug away at converting our other vCenters from 6.5 to 6.7U1.  Have a great day, everyone!

The Dell PowerEdge FX2: A Dead End?

A small crowd was gathered by the server enclosure at the edge of the Dell EMC VMworld booth, gawking at the blinkenlights on the newly announced MX7000 blade chassis.  I dodged the eye candy and the suited salesdroids surrounding it and instead searched the booth for their PowerEdge FX2 display.  Alas, the entire booth was dedicated to the 7 rack unit MX7000 and its array of server, storage and networking building blocks.

When the FX2 was announced in 2014, it held a lot of promise.  A 2U enclosure with horizontally-mounted blades, the FX2 could hold up to eight dual-processor Xeon servers, consolidating power supplies, cooling and management.  The FC430 blade crammed two Xeons and 8 DIMM slots into a quarter-width sled.  The FC630 blade offered three times as much RAM capacity, access to another PCIe chassis slot, and a “real” PERC controller in a half-width sled, and the FC830 was a full-width quad-Xeon sled with more DIMM slots and extra drive bays.

With the introduction of the 14th generation Dell PowerEdge lineup, the FC630 half-width blade was replaced by the FC640, but the FC830 and FC430 were never updated.

Seeing the writing on the wall, I grabbed one of the Dell EMC guys that was floating around the booth and asked him about the future of the FX2.  He told me that the FC430 and FC830 wouldn’t be updated because the “processors were too big”, and that he couldn’t comment on the future of the FX2 platform further.

Now, I’ve been in IT for a while, but that explanation just seemed strange.  What did he mean by the “processors are too big?” They need a bigger heatsink because they run hotter?  They’re a physically bigger die?  After digging a bit, the answer appears to be “both of the above.”

The Broadwell-based Xeon E5s that underpin the 13th generation PowerEdge use the LGA2011-3 socket, with a package measuring roughly 52.5mm by 45mm.  However, the Skylake-based “Purley” Xeons that power the 14th generation servers use a much larger socket called LGA3647, so-called because it has over 1,600 more pins than the old 2011-3.  Those extra pins are used for additional memory channels, among other features.

Those extra pins mean that the actual socket had to grow–in this case, to 76mm by 56.5mm.  That’s nearly double the board real estate of the previous generation CPU package.  The FC430 blade was already so tight that it gave up DIMM slots and the option of having a decent RAID controller to fit in the space allotted.  There is simply no room to fit a Purley Xeon without radical rework like breaking off a lot of support hardware onto a daughterboard, which complicates manufacturing, repair and cooling.

A PowerEdge FC430 blade.

So, I now understand the demise of the FC430, but what about the future of the FX2 as a platform?

The folks over at Wccf Tech have posted Intel documents about the Xeon roadmap.  The Cooper Lake and Ice Lake chips, expected in 2019 and 2020 respectively, are expected to use a socket called LGA4189.  With 15% more pins than the LGA3647, I would expect the socket to be 15-20% larger as well.  While that doesn’t sound like much, it may be just too much to fit in a dense enough blade for the FX2 to continue to make sense.

Heat will also be a factor.  The same source above shows that Ice Lake will have a TDP of “up to 230W.”  That’s 75 watts higher than the hottest processor Dell offers in the FC640.  To handle the more powerful Ice Lake processors will probably take larger heatsinks and better cooling than is possible with any of the current FX2 form factors.

So how will these larger and hotter processors affect the blade landscape?  Dell knows the answer, and the answer lies in the MX7000.  Dell’s newest blade architecture, just released to the public in the past couple of months, is your traditional big box with vertical blades in the same vein as the old M1000e.  However, unlike the M1000e, which offered full-height and half-height blades, the MX7000 only offers full-height single-width (2 CPU) and double-width (4 CPU) options.  Again, the helpful salesguy at the Dell EMC booth said that they were “unlikely” to be able to offer half-height blades because the days of compact 125-watt CPUs are over.

So, are blades still worth it?  The M1000e could fit 16 servers into 10U and the FX2 can still fit 20 current generation servers into 10U.  The MX7000 can only fit 8 servers into 7U, and Lenovo can fit 14 servers into 10U.  The FX2 offers very high density in a form factor that is more flexible because it can be purchased and racked in smaller increments than the larger blade systems, but it appears that the days may be numbered for that form factor.
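Putting those density numbers on a common footing (servers per rack unit, using the counts above):

M1000e:  16 servers / 10U = 1.6 per U
FX2:     20 servers / 10U = 2.0 per U
MX7000:   8 servers /  7U ≈ 1.1 per U
Lenovo:  14 servers / 10U = 1.4 per U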


The End of an Era?

I was heartbroken to read about the demise of Weird Stuff Warehouse, a Silicon Valley institution.

I remember when they were just called Weird Stuff and were located in a commercial storefront near Fry’s in Milpitas.  They had glass display cases with a few dozen parts for sale, such as hard drives and peripheral cards. Once in a while, they would have something crazy, like a giant minicomputer hard drive with a spindle motor that looked like it belonged in a washing machine.  We mostly visited just to see what was new, though I do remember when they had trash cans full of ping pong balls that they were selling by the bagful.  We bought a few dozen to throw at each other at the office.

Imagine my surprise, then, when I was assigned a weeklong project in Milpitas a few years ago, a good decade after I had moved to SoCal.  My GPS took me to a nondescript warehouse entrance.  When I walked inside, it was like a massive museum.  Stack after stack of 30-year-old hard drives, cards, motherboards, power supplies, test equipment, industrial equipment, cables, wires, displays, servers, switches, cabinets, modem banks… I spent every evening after work walking up and down the aisles, admiring and sometimes touching the Silicon Valley of my youth.

With my (and my 17-year-old son’s) excitement building about the upcoming Vintage Computer Festival West in Mountain View this summer, I Googled Weird Stuff so that my son, too, could experience the fruits of Silicon Valley on those shelves.  Alas, it turns out that Google itself contributed to the death of this institution.  The search giant bought the building, and Weird Stuff Warehouse closed its doors and sold its inventory to a company that, as far as I can tell, doesn’t have a retail presence.

In the light of recent events involving Facebook, Uber and other companies, there’s a growing sentiment that Silicon Valley is not what it used to be.  I can’t speak to that myself; I moved out of the Bay Area almost two decades ago and haven’t followed it as closely as I used to.  But it seems that Silicon Valley, which used to be about inventing and building better stuff (hence the “silicon” in the name) has forgotten its roots a bit in its bid to grab some of that VC gold rush money.  Perhaps Silicon Valley needs to get back to building more weird stuff instead.