Working in the Hot Aisle: Sealing It Up

“The hot stays hot and the cool stays cool.” –McDonald’s McDLT commercials

Part 1 of this saga can be found here.

We were all just a bunch of starry-eyed kids with our plan for the future: rack some equipment, buy back-to-front cooled switches, snap in some filler blanks, and the world would be a beautiful place.  We were soon in for a big reality check.  Let’s start with a simple task: buying some switches.

It turns out that it’s actually difficult to find switches that suck air in the back and blow it out the front.  In fact, it’s even fairly difficult to find switches that blow front to back.  Many (most?) switches pull air from the left side of the switch and blow it out the right or vice versa.  In a contained data center, that means that the switches are pulling in hot air from the hot aisle and blowing out even hotter air.  In fact, there are diagrams on the Internet showing rows of chassis switches mounted in 2-post racks where the leftmost switch is getting relatively cool air and blowing warmer air into the switch to its right.  This continues down the line until you get to the switch at the other end of the row, which is slowly melting into a pile of slag.  Needless to say, this is not good for uptime.

There are companies that make various contraptions for contained-aisle use.  These devices have either passive ducts or active fan-fed ducts that pull air from the cold aisle and direct it into the intake side of the switch.  Unfortunately, switch manufacturers can’t even agree on which side of the switch to pull air from or where the grilles on the chassis are located.  Alas, this means that unless somebody makes a cooler specific to your chassis, you have to figure out which contraption is closest to your needs.  In our case, we were dealing with a Cisco 10-slot chassis with right-to-left cooling.  No contraption fit it correctly, so we used an APC 2U switch cooler, which pulls air from the front and blows it up along the intake side of the switch in the hot aisle.  While not as energy efficient as contraptions with custom-fitted ducts that enclose the intake side of the switch, it works well enough and includes redundant fans and power inputs.

For the top-of-rack and core switches, only the Cisco Nexus line offered back-to-front cooling options (among Cisco switches, that is).  That was fine since we were looking at Nexus anyway, but it’s unfortunate that it’s not an option on Catalyst switches.  Front-to-back cooling is an option, but then the switch ports end up in the cold aisle, meaning that cables must be passed through the rack and into the hot aisle.  It can work, but it’s not as clean.

However, buying back-to-front cooled switches is but the beginning of the process.  The switches are mounted to the back of the cabinet and are shorter than the cabinet, leaving a gap around the back of the switch where it isn’t sealed to the front of the cabinet.  Fortunately, the contraption industry has a solution for that as well.  In our case, we went with the HotLok SwitchFix line of passive coolers.  These units are expandable: two fitted rectangles of powder-coated steel telescope to close the gap between the switch and the cabinet.  They come in several depth ranges to fit different combinations of rack and switch depth, and they typically mount inside the switch rails leading to the intake side of the switch.  Nylon brush “fingers” allow power and console cables to pass between the switch and the SwitchFix and into the hot aisle.

The rear of the switch as viewed from the cold aisle side. The SwitchFix bridges the gap between the rear of the switch and the front of the rack.

While this sounds like an ideal solution, in reality the heavy-gauge steel was difficult to expand and fit correctly, and we ended up using a short RJ-45 extension cable to bring the console port out of the SwitchFix and into the cold aisle for easy switch configuration.  The price was a little heart-stopping as well, though it was still better than cobbling together homemade plastic and duct-tape contraptions to do the job.

With the switches sorted, cable managers became the next issue.  The contractor provided standard 2U cable managers, but they had massive gaps in the center for cables to pass through, which is great for a 2-post telco rack but not so great for a sealed cabinet.  We ended up using some relatively flat APC 2U cable managers and placed a flat steel 2U filler plate behind them, spaced out with #14 by 1.8″ nylon spacers from Grainger.  With the rails fully forward in the cabinet, the front cover of the cable manager just touched the door but didn’t scrape or significantly flex.

Once the racks are in place and the equipment is installed, the rest of the rack needs to be filled to prevent mixing of hot and cold air. There are a lot of options, from molded plastic fillers that snap into the square mounting holes to powder-coated sheets of steel with holes drilled for mounting. Although the cost was significantly higher, we opted for the APC 1U snap-in fillers. Because they didn’t need screws, cage nuts or tools, they were easy to install and easy to remove. With the rails adjusted all the way up against the cabinet on the cold aisle side, no additional fillers were needed around the sides.

With every rack unit filled with switches, servers, cable managers, telco equipment and snap-in fillers, sealing the remaining gaps was our final issue to tackle from an efficiency perspective.  While the tops of the cabinets were enclosed by the roof system, there was still a one-inch gap underneath the cabinets that let cold air through.  Even though the gap under the cabinet was only an inch high, our 18 cabinets had gaps equivalent to about three square feet of open space!  We bought a roll of magnetic strip to attach to the bottom of the cabinets to block that airflow, reduce dust intrusion and clean up the look.
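For the curious, that “three square feet” figure is simple arithmetic. Here’s a quick back-of-the-envelope check in Python, assuming standard 600mm-wide cabinets (two of ours are wider 750mm network enclosures, so the real number is a touch higher):

```python
# Rough area of the under-cabinet gap.
# Assumptions: 18 cabinets, standard 600 mm (~23.6 in) cabinet width, 1 in gap.
CABINETS = 18
CABINET_WIDTH_IN = 600 / 25.4   # 600 mm converted to inches
GAP_HEIGHT_IN = 1.0

gap_sq_in = CABINETS * CABINET_WIDTH_IN * GAP_HEIGHT_IN
gap_sq_ft = gap_sq_in / 144     # 144 square inches per square foot

print(f"Total gap: {gap_sq_in:.0f} sq in (~{gap_sq_ft:.1f} sq ft)")
# Prints: Total gap: 425 sq in (~3.0 sq ft)
```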

Lessons Learned

There’s no other way of saying this: this was a lot of work. A lot of stuff had to be purchased after racking began. There are a lot of gotchas to consider when planning this, and the biggest one is simply being able to seal everything up. Pretty much the entire compute equipment industry has standardized on front-to-back cooling, which makes using servers in a contained-aisle environment simple. Unfortunately, switch manufacturers are largely just not on board. I don’t know if it’s because switching equipment is typically left out of such environments, or if the vendors just don’t see enough demand for it in the enterprise and small business markets, but cooling switches involves an awful lot of random metal and plastic accessories with high markups and slow shipping times.

However, I have to say that having equipment sitting at rock-stable temperatures is a huge plus. We were able to raise the server room temperatures, and we don’t have the hot-spot issues that cause fans to spin up and down throughout the day. Our in-row chillers run much less than the big CRAC units in the previous data center, even though there is much more equipment in there today. The extra work helped build a solid foundation for expansion.

Working in the Hot Aisle: Power and Cooling

Part 1 of this saga can be found here.

An hour of battery backup is plenty of time to shut down a dozen servers… until it isn’t.  The last thing you want to see is that clock ticking down while a couple thousand Windows virtual machines decide to install updates before shutting down.

We were fortunate that one of the offices we were consolidating was subleased from a manufacturer that had not only APC in-row chillers they were willing to sell, but also a lightly used generator.  Between those and a new Symmetra PX UPS, we were on our way to breathing easy when the lights went out.  The PX provides several hours of power and is backed by the generator, which also backs up the chillers in the event of a power outage.  The PX is a marvel of engineering, but it is also a single point of failure.  We witnessed this firsthand with an older Symmetra LX, which had a backplane failure a couple of years earlier that took down everything.  With that in mind, we opted to put two PDUs in each server cabinet: one fed from the UPS and generator, and one fed from city power with a massive power conditioner in front of it.  These circuits also extend into the IDFs so that building-wide network connectivity stays up in the event of a power issue.

Most IT equipment comes with redundant power supplies, so splitting the load is easy: one power supply goes to each PDU.  For the miscellaneous equipment with a single power supply, an APC 110V transfer switch handles the switching duties.  A 1U rackmount unit, it is basically a UPS with two inputs and no batteries, and it seamlessly switches from one source circuit to the other when a voltage drop is detected.

As mentioned, cooling duties are handled by APC In-Row chillers, two in each aisle.  They are plumbed to rooftop units and are backed by the generator in case of power failure.  Temperature sensors on adjacent cabinets provide readings that help them work as a group to optimize cooling, and network connectivity allows monitoring via SNMP and/or dedicated software.  Since we don’t yet need the cooling power from all four units, we will be programming them to run on a schedule to balance running hours across the units.
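Since the units answer standard SNMP, you don’t strictly need the dedicated software to keep an eye on them. As a rough illustration, here is a minimal polling sketch in Python using pysnmp; the address and community string are placeholders, and the OID shown is the generic MIB-II sysDescr rather than an actual temperature sensor, which you would look up in APC’s PowerNet MIB:

```python
# Minimal SNMP poll of an in-row unit (sketch). Replace the address,
# community string and OID with real values from your environment.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

CHILLER_IP = "192.168.10.21"      # placeholder management address
COMMUNITY = "public"              # placeholder read community
OID = "1.3.6.1.2.1.1.1.0"         # MIB-II sysDescr; swap in a PowerNet MIB
                                  # temperature OID for real monitoring

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData(COMMUNITY, mpModel=1),   # SNMP v2c
        UdpTransportTarget((CHILLER_IP, 161)),
        ContextData(),
        ObjectType(ObjectIdentity(OID)),
    )
)

if error_indication:
    print(f"SNMP error: {error_indication}")
elif error_status:
    print(f"SNMP error status: {error_status.prettyPrint()}")
else:
    for var_bind in var_binds:
        print(" = ".join(x.prettyPrint() for x in var_bind))
```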

Cooling in the IDFs is handled by the building’s chiller, with an independent thermostat-controlled exhaust fan as backup. As each IDF basically hosts just one chassis switch, cooling needs are easily handled in this manner. As users are issued laptops that can ride out most outages, we were able to sidestep having to provide UPS power to work areas.

Next time:  Keeping the hot hot and the cool cool.

Working in the Hot Aisle: Choosing a Design

Part 1 of this saga can be found here.

As mentioned in the previous post, there are several ways to skin the airflow separation cat. Each has its advantages and disadvantages.

Initially we considered raised floor cooling. Dating back to the dawn of computing, raised floor datacenters consist of chillers that blow cold air into the space under the raised floor. Floor tiles are replaced with perforated tiles in front of or underneath the enclosures to allow cold air into the equipment. Hot air from the equipment is then returned to the A/C units above the floor. The raised floor also provides a hidden space to route cabling, although discipline is more difficult since the shame of poor cabling is hidden from view. While we liked the clean look of raised floors with the cables hidden away, the cost is high and the necessary entry ramp at the door would have taken up too much floor space in our smaller datacenter.

We also looked at a hot aisle design that uses the plenum space above a drop ceiling as the return. Separation is achieved with plastic panels above the enclosures, and the CRAC units are typically placed at one or both ends of the room. Because this was a two-row layout in a relatively tight space, it was difficult to find a location for the CRACs that would avoid creating hot spots.

The decision became a lot easier when we found out that one of the spaces we were vacating had APC in-row chillers that we could pick up at a steep discount. The in-row units are contained within a single standard rack enclosure, so they are ideal for a hot aisle/cold aisle configuration. They solved the hot spot issues, as they could be placed in both rows. They also use temperature probes integrated into the nearby cabinets to maintain target temperatures efficiently.

APC In-Row RD. APC makes a kit to extend the height to 48U if used with tall racks. (Photo by APC)

With the cooling situation sorted, we turned our attention to containment. We opted for the Schneider/APC EcoAisle system, which provided a roof and end-doors to our existing two-row enclosure layout to create a hot aisle and a cold aisle. The equipment fans pull in cooler air from the cold aisle and exhaust hot air into the hot aisle, while the in-row chillers pull hot air from the hot aisle and return chilled air back into the cold aisle.

There are two options for this configuration.  A central cold aisle can be used, with the rest of the room serving as the hot aisle.  This can reduce energy consumption, since only the central aisle is cooled and the rest of the room therefore doesn’t need to be sealed as tightly against air leaks.

The second option, which we ended up choosing, was a central hot aisle.  In our case, the exterior cold aisles gave us more room to rack equipment, and using the entire room as the cold aisle gives us a much larger volume of cool air, meaning that in the case of cooling system failure, we have more time to shut down before temperatures become dangerous to the equipment.

The central hot aisle is covered by a roof of lightweight translucent plastic insulating panels, which reduce heat loss while letting light in. (The system includes integrated LED lighting as well.) The roof is tied into the fire system: if the fire alarm activates, the electromagnets that hold the roof panels in place release, and the panels fall in so the sprinklers can do their job.  We can also easily remove one or more panels to work in the cable channel above.

APC EcoAisle configuration. Ours is against the wall and has one exit door. (Photo by APC.)

Our final design consists of eleven equipment racks, three UPS racks and four chiller racks.  This leaves plenty of room for growth, and in the case of unexpected growth, an adjacent room, accessible by knocking out a wall, doubles the capacity.

We decided on 48U racks, both to increase the amount of equipment we can carry and to raise the EcoAisle’s roof. To accommodate networking, one enclosure in each aisle is a 750mm “network width” enclosure that provides extra room for cable management. Since the in-row CRACs are only 42U tall, Schneider provides a kit that bolts to the top of the units to add 6U to their height.

Next time:  Power and cooling

VMworld 2019 Conclusion: vCommunity

I’m back in my house, in my chair, with my family. I’m so thankful for them and happy and contented to be back home.

But the whirlwind of the past few days hasn’t faded yet from my mind. There are so many notes, so much information, and I so want to fire up the homelab before it all disappears, but I just can’t. My brain needs to unclench first.

This was my first time at VMworld in San Francisco. Unlike the Vegas venue last year, Moscone Center is broken up into three distinct buildings, all separated by busy streets. That means it takes 10 to 15 minutes to get from, say, the third floor of Moscone West to Moscone South. Because most events and sessions are scheduled for 30 or 60 minutes, you end up leaving one event early or arriving late to the next. While there was some of this in Vegas, events there were usually only separated by a floor and a short walk. It’s something I’ll have to take into account when scheduling sessions next year, and it meant that I had to duck out early or skip some sessions altogether.

The sessions I attended were all excellent. I tried to keep my sessions to topics that would help in my current situation, i.e. wrangling thousands of random virtual machines that are in various stages of use or abandonment, managing and deploying VMs with as few touches as possible, and trying to automate the hell out of all of it. To that end, I focused on the code-related sessions as much as possible, and I was not disappointed. VMware and the community are hard at work solving problems like mine with code, and it’s great to have such a variety of tools ready for incorporation into my own workflows.

Additionally, I attended great sessions on vSphere topics such as VMware Tools, ESXi performance, certificate management and vMotion. These not only gave me insight into how these functions work under the hood, but also hinted at new technologies being planned for ESXi and vCenter to make these products work better. This was a great relief, as I’ve been concerned for a while that vSphere would slip into a slower, maintenance-only product cycle as the push toward cloud increases. I’m happy that VMware continues to invest heavily in its on-prem products.

If there was one word that summed up the overarching theme of this VMworld, it’s Kubernetes. From the moment Pat Gelsinger stepped onto the stage, Kubernetes was the topic at hand. Kubernetes integration will involve a complete rearchitecting of ESXi, and as someone who sees my customers experimenting with using containers for their build processes, I’m happy that VMware is going to make this easier (and faster) to do and manage in the future.

Let’s face it though. Most of the sessions were recorded and will be made available after the show. This is true of most major software trade shows, and if sessions were the only reason to attend, one could reasonably just stay home and watch videos in their pajamas.

It’s the interactions that matter. Being able to ask questions and get clarifications is very important, and I found that valuable for certain topics and sessions. However, the most important thing that you don’t get sitting at home is the interaction with the community.

Last year was my first VMworld. I didn’t know anybody, and I didn’t really know what to do to get the most out of the show. I scheduled sessions to fill every time slot, even if the product or topic wasn’t interesting or relevant. The time that I wasn’t in a session was spent roaming the conference floor collecting swag from vendors. By the time I was done, I had fifteen t-shirts, a jacket, a drone, a game console and more pens and hats than I would ever use. I did attend a couple of parties and met a couple of great people (like Scott Lowe, who I only realized later was the proprietor of the Full Stack Journey podcast I had been listening to on the plane.)

I didn’t have the real community experience until the Dallas VMUG UserCon a few months later. There I met great people like Ariel Sanchez Mora and the leaders of both the Dallas and Austin VMUGs. But it was a talk Tim Davis gave on community that really made me realize how important my participation would be to me. I dropped a few bucks on WordPress.com, threw out and replaced my old Twitter account, and started participating. I’ve since been attending both the Cedar Park and Austin VMUG meetings as well as the Dallas UserCon, and it’s been great having a group of peers to talk with and bounce ideas off of.

Contrast last year’s VMworld with this year’s. This VMworld I collected two shirts (both from VMUG events) and no vendor swag. I visited a couple of vendors that I specifically wanted to talk to but otherwise just did a little browsing. Instead of filling my calendar and stressing out about catching them all, I strategically picked topics that will be relevant to me in the next year.

Most of my downtime between sessions was spent at the blogger tables at the VMware Social Media and Community booth outside the Solutions Exchange. It was a little darker than the VMUG area and had a lot less foot traffic. There, I could recharge my batteries (literally), blog about the event, catch up on some of the other topics I’ve been working on, and chat with the other community members who rotated in and out as they went to their sessions and events. I also got to drop in on vBrownbag sessions being given by community members, which provided some good background learning while I was working.

So what do I plan to have accomplished by VMworld 2020?

  1. Build out those automation workflows, VMware integrations and tools that I need to better manage the clusters under my purview.
  2. Blog about it.
  3. Work with my customers to get their automated ephemeral build and test environments working so they don’t have to rely on sharing and snapshotting their current environments.
  4. Blog about it.
  5. Earn my VCP.
  6. Blog about it.
  7. Have a presentation ready for vBrownbag.
  8. Blog about it.

Thanks to everyone I met this year; it was great.

VMworld 2019: Day 5

First things first: some logistics. I picked a less-than-desirable hotel to save some money. The hotel… okay, let’s call it what it is. The MOTEL in question is on Harrison Street in SoMa. It’s an area where a one-bedroom loft still sells for $885,000 (I looked one up), but every other parking meter has automotive glass shards all around it. This was the cheapest accommodation within walking distance of the venue where I wouldn’t have to share a bathroom with strangers, but it still cost more than twice what I paid in Las Vegas last year.

I got up at about 6:00 this morning. This was not by choice; it was a couple of gunshots a few blocks away that woke me, and I wasn’t going back to sleep. That was after the people in the room next door chased away a guy who was trying to burgle a truck in the parking lot, not once, but twice. There was also a shouted discussion at around 5:00am between two men about their upbringing and manliness. Needless to say, I didn’t get involved, and I don’t know if the gunshots an hour later were related.

So I packed up my belongings and took a Lyft to the Intercontinental, which is next door to the venue and has reverted back to regular non-gouging pricing for my last night in San Francisco.

There’s a different vibe at VMworld today. The halls have thinned out substantially, and goodbyes are being exchanged. While there are some familiar faces among the die-hards remaining, it’s very clear that VMworld is winding down. That doesn’t stop me from making the most of my time though. I have two back-to-back sessions to attend late in the morning.

Seating is easy to find on Thursday.

The first session was presented by William Lam and Michael Gasch. The VMware team has collaborated on an event engine called VEBA (the VMware Event Broker Appliance), an appliance that brokers events from vCenter to OpenFaaS to allow event-driven functions. They demonstrated using VEBA and functions to trigger actions from events, such as posting a message in Slack when a critical VM is powered off.
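To make the idea concrete, here is a hypothetical sketch of what the core of such a function might look like, assuming the OpenFaaS Python template (a handle(req) entry point) and a Slack incoming webhook. The event field names and the environment variable are illustrative assumptions; the real VEBA examples ship with their own structure.

```python
# handler.py -- hypothetical OpenFaaS-style function body (illustrative only).
import json
import os

import requests

# Assumed to be provided via the function's environment/secrets.
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")


def handle(req):
    """Receive a vCenter event as JSON and post a summary to Slack."""
    event = json.loads(req)

    # The exact payload layout depends on how the event broker wraps the
    # vCenter event; FullFormattedMessage is used here as an assumption.
    message = event.get("FullFormattedMessage", "A vCenter event fired")

    if SLACK_WEBHOOK_URL:
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)

    return json.dumps({"status": "posted", "text": message})
```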

The second session dove into automating vSphere with PowerCLI, VMware’s PowerShell module. The demo was put on by Kyle Ruddy, who focuses heavily on using code to automate VMware.

I then made my way back to the VMware community booth, where I could comfortably catch up on my various feeds, plug in to charge, and type. I actually fly out in the morning since trying to leave for the airport at 4:00pm on a Thursday to fly out tonight is the perfect way to end up stuck overnight at an airport. Besides, I need the sleep.

VMworld 2019: Day 4

With the general sessions behind me, today would be all about breakout sessions and networking. My first session was at the VMware {CODE} alcove. Knowing that it can be hard to see the screen from the back, I got in line early to grab a front-row seat. The session did not disappoint. David Stamen has written a comprehensive suite of PowerShell scripts to automate the management of VCSA patches, including querying for available patches, downloading, staging and deploying. Tasks can be run against multiple vCenters at once, so vast swaths of them can be updated with a couple of commands.
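Under the hood, tooling like this drives the appliance’s update API. As a rough Python illustration (the session itself was all PowerShell), the sketch below queries a VCSA for pending updates; the endpoint paths and response fields are assumptions based on my reading of the vSphere Automation REST API, so verify them against your vCenter’s API explorer before relying on anything like this.

```python
# Hypothetical sketch: list pending VCSA updates over the REST API.
# Hostname, credentials, endpoints and response fields are assumptions.
import requests

VCSA = "vcsa01.example.com"              # placeholder hostname
USER = "administrator@vsphere.local"     # placeholder credentials
PASSWORD = "********"

# 1. Open an API session; the returned token authenticates later calls.
resp = requests.post(
    f"https://{VCSA}/rest/com/vmware/cis/session",
    auth=(USER, PASSWORD),
    verify=False,                        # lab only: skips TLS verification
)
token = resp.json()["value"]

# 2. Ask the appliance for pending updates.
pending = requests.get(
    f"https://{VCSA}/rest/appliance/update/pending",
    params={"source_type": "LOCAL_AND_ONLINE"},
    headers={"vmware-api-session-id": token},
    verify=False,
)

for update in pending.json().get("value", []):
    print(update)
```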

In the second session, Kyle Ruddy and JJ Asghar demonstrated how to automate template builds and modifications using Packer, Jenkins, Ansible and InSpec. This was another one of those whirlwind sessions that covered a lot of ground in a very short time. Realistically, it was more of a proof-of-concept demo than something one could take home and immediately implement, but they do provide the scripts they use, and some blog searching could probably help cobble together such a setup.

The third session covered the mechanics of SSL certificates in vSphere, including best practices for maintaining certificates, and a tech preview hinted at some utilities to make certificate management easier in a future release. The last session of the day was a great one about VMware Tools: what they are, what they consist of and how to manage updates and maintenance. Again, a tech preview hinted at future technologies to make updating and using VMware Tools easier.

The evening ended with beer and Billy Idol at the Bill Graham Civic Auditorium. What could be better?

VMworld 2019: Day 3

This time I was smart. Rather than being crammed into a room with thousands of other people to watch the Tuesday keynote on a screen, I made my way to the Social Media and VMware Community area in the lounge and grabbed a seat at the blogging table. There I could watch the keynote while blogging, chatting with others and keeping my devices topped off.

A chilly foggy evening in San Francisco from about 400 feet. Picture by me.

The second general session expanded upon the previous day’s session and presented examples of fictional companies using VMware’s Tanzu offerings to produce their fictional products. The presentation explained how Tanzu could cover all of the theoretical technological needs of a business, from IoT to edge to datacenter to networking.

My first session of the day was a deep dive into the mechanics of vMotion on vSphere. Detailed hints for tuning vSphere to optimize the vMotion process were also provided, including NIC and vmkernel settings. The presenters also hinted at future functionality that would allow vSphere to be more intelligent about handling local, remote and storage vMotions, eliminating the need to tweak these settings by hand.

Back at the Social Media and VMware Community booth, I watched Ariel Sanchez Mora give a presentation on how to start and run mini local hackathons. As someone who is working to get more involved with the Austin-area community as well as improve my coding skills and solve some real issues in the enterprise, I am looking into doing this soon. Right after, Gabriel Maentz presented a talk about using vRealize and Terraform to improve automation in the enterprise.

Ariel Sanchez Mora presenting at vBrownbag.

Finally, I attended a briefing by HPE about some new initiatives, including GreenLake, which uses a consumption-based model to provide a fully managed platform that extends from on-prem to cloud providers like Google Cloud. More directly applicable to me, though, were improved Nimble VVol support coming in the near future and Nimble Cloud Volumes, which allow replicating data between cloud providers and Nimble arrays with no egress charges. Also discussed was something I had heard about previously from my current Nimble rep: Nimble dHCI, which automatically provisions Nimble storage arrays and HPE ProLiant servers into the dHCI “stack,” reducing operator work while providing a single support point for issues anywhere in the stack. And unlike some other solutions, standard HPE arrays and servers can be ordered to grow the stack, rather than requiring custom hardware.

After a show and drinks, I hit the hay, preparing for a busy Wednesday.