About Tim Clevenger

I'm a 25+ year veteran of the IT industry who enjoys modern and vintage technology.

Let’s Automate VMware Part 2: Building Linux Templates with Packer

With Packer ready to go, the next step was to specify the installation options for the template. In CentOS, that's achieved with a text file called a kickstart file. The easiest way to create a kickstart file is to manually install CentOS in a throwaway VM using the desired options. Once the installation completed, I downloaded the /root/anaconda-ks.cfg file from the VM and customized it using the CentOS documentation. My manual install included automatic partitioning, disabling kdump, choosing a minimal installation set, enabling the network adapter and creating a root password and an ansible user for future use. After downloading the file, I added a few options and made some changes, all commented below.


# Autopartition as lvm without home partition (CHANGED)
ignoredisk --only-use=sda
autopart --type=lvm --nohome

# Partition clearing information
clearpart --none --initlabel

# Use graphical install
graphical

# Use CDROM installation media
cdrom

# Keyboard layouts
keyboard --vckeymap=us --xlayouts='us'

# System language
lang en_US.UTF-8

# Network information - disable ipv6 (CHANGED)
network  --bootproto=dhcp --device=eth0 --noipv6 --activate
network  --hostname=localhost.localdomain

repo --name="AppStream" --baseurl=file:///run/install/repo/AppStream

# Root password
rootpw --iscrypted $6$poUOPKJL.FwIJlYW$6qcoGwXwoxcusg92P7vQTDg6naxmYovIXfrU6Ck6pRV.LoDWxYu0GzmFWnrQHF5kBaF6rIx2t0C6IlUO0qXXO.

# Don't run the Setup Agent on first boot (CHANGED)
firstboot --disable

# Enable firewall with ssh allowed (ADDED)
firewall --enabled --ssh

# Do not configure the X Window System
skipx

# System services (ADDED VMTOOLS)
services --disabled="chronyd"
services --enabled="vmtoolsd"

# System timezone (not UTC in VM environment) (CHANGED)
timezone America/Chicago

# Add ansible service account and add to 'wheel' group
user --groups=wheel --name=ansible --password=$6$D5GqhqMWaRMWrrGP$WVktoxs2QAhyEtozouyUEwjMFShVGEXUvKJ0hG89LW38jU1Uax5tCkFvp4O3z0piP3NiSV34XBn6qQk3RwfQ5. --iscrypted --gecos="ansible"

# Reboot after installation (ADDED)
reboot

# Package selection (minimal environment plus open-vm-tools)
%packages
@^minimal-environment
open-vm-tools
%end

%addon com_redhat_kdump --disable --reserve-mb='auto'
%end

%anaconda
pwpolicy root --minlen=6 --minquality=1 --notstrict --nochanges --notempty
pwpolicy user --minlen=6 --minquality=1 --notstrict --nochanges --emptyok
pwpolicy luks --minlen=6 --minquality=1 --notstrict --nochanges --notempty
%end

I added a few packages, including open-vm-tools, to the install list. Any additional packages can be installed later with Ansible, but getting VMware Tools installed up front is important.

Once I created the file, I had to figure out how to get it into a place where the CentOS installer could reach it. Packer has the ability to drop the file onto a virtual “floppy disk” and mount it to the VM, but CentOS 8 no longer supports floppy disks during boot. There are many other options, including FTP, HTTP and NFS, so I put Nginx on my automation machine and dropped the Kickstart file into the html folder.
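Before kicking off a build, it's worth confirming the installer will actually be able to fetch the kickstart file over HTTP. Here's a self-contained sketch of that check: python3's built-in web server stands in for Nginx, the one-line file content and port 8713 are stand-ins, and against the real automation host you would simply curl the URL used later in the Packer template.

```shell
# Sanity-check that a kickstart file is reachable over HTTP.
# python3's http.server stands in for Nginx so this runs anywhere;
# the file content below is a one-line stand-in.
mkdir -p /tmp/kstest
cd /tmp/kstest
printf 'lang en_US.UTF-8\n' > ks_centos8_minimal.cfg

python3 -m http.server 8713 >/dev/null 2>&1 &
srv=$!
sleep 1

# The CentOS installer performs essentially this fetch at boot time.
fetched=$(curl -fsS http://localhost:8713/ks_centos8_minimal.cfg)
kill $srv
echo "$fetched"
```

Against the lab web server, the equivalent one-liner would be `curl -fsS http://automation.lab.clev.work/ks_centos8_minimal.cfg`.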

Packer's legacy configuration language is JSON. HashiCorp was in the process of moving to HCL2, but as HCL2 was still in beta at the time of this blog post, I stayed with JSON. Here is my initial configuration file for CentOS on vSphere.

{
  "variables": {
    "vcenter-server": "vcsa.lab.clev.work",
    "vcenter-username": "automation@vsphere.local",
    "vcenter-password": "VMware1!",
    "vcenter-datacenter": "datacenter",
    "vcenter-cluster": "cluster1",
    "vcenter-datastore": "storage1",
    "vcenter-folder": "Templates",
    "vm-name": "CentOS_8_Minimal",
    "vm-cpu": "2",
    "vm-ram": "2048",
    "vm-disk": "81920",
    "vm-network": "VM Network",
    "ks-path": "http://automation.lab.clev.work/ks_centos8_minimal.cfg",
    "iso-path": "[storage2] iso/CentOS-8.1.1911-x86_64-dvd1.iso"
  },
  "builders": [
    {
      "type": "vsphere-iso",
      "vcenter_server": "{{user `vcenter-server`}}",
      "username": "{{user `vcenter-username`}}",
      "password": "{{user `vcenter-password`}}",
      "insecure_connection": "true",
      "datacenter": "{{user `vcenter-datacenter`}}",
      "cluster": "{{user `vcenter-cluster`}}",
      "datastore": "{{user `vcenter-datastore`}}",
      "folder": "{{user `vcenter-folder`}}",
      "ssh_username": "root",
      "ssh_password": "RootPW1#",
      "convert_to_template": "false",
      "vm_name": "{{user `vm-name`}}",
      "notes": "Built by Packer at {{timestamp}}",
      "guest_os_type": "centos8_64Guest",
      "CPUs": "{{user `vm-cpu`}}",
      "RAM": "{{user `vm-ram`}}",
      "RAM_reserve_all": "false",
      "disk_controller_type": "pvscsi",
      "disk_size": "{{user `vm-disk`}}",
      "disk_thin_provisioned": "true",
      "network_card": "vmxnet3",
      "network": "{{user `vm-network`}}",
      "iso_paths": ["{{user `iso-path`}}"],
      "boot_command": [
        "linux ks={{user `ks-path`}}<enter>"
      ]
    }
  ]
}
It looks a little intimidating at first, but the layout is pretty straightforward. The first section is called "variables" and it allows me to define variables that will be used later in the code. Most of these are self-explanatory: they define the settings for the VM, such as disk size and network, as well as how it will be deployed in vCenter, such as which datacenter, cluster and datastore it will be written to. Also note the ks-path, which is where the CentOS installer will find its Kickstart file, and the iso-path, which is where I uploaded the CentOS ISO.

The second section is called “builders” and passes those variables on to the builder, which actually does the work. Any insertion is encased in double sets of braces, and the variable name is further encased in backticks. Again, everything is fairly straightforward until we get to the boot_command field. Here, the builder will actually tell vCenter to type the characters into the VM, interrupting the boot and instructing grub to boot the installer and grab the Kickstart file from the web server.

So, with the template saved in the packer directory as centos8_minimal.json, I was ready to build.

./packer build centos8_minimal.json

Because I was interested, I ran this command and immediately jumped into vCenter. The “CentOS_8_Minimal” VM appeared almost instantly and I launched a remote console into it. There I could see the boot options literally being typed into the window. The installer then booted and proceeded with a completely hands-off graphical install. It then rebooted the VM and finished.

The process was so seamless that I ran it again under the "time" command to see how long the entire process took.

I fired up the VM and verified that everything was in place and that both the root and ansible users were created correctly. After deleting the VM and changing “convert_to_template” to “true”, I reran Packer and it dutifully rebuilt the VM and converted it to a template of the same name.

Packer has the ability to install additional software, run scripts and perform other tasks on a freshly prepared VM before shutting it down, using “provisioners.” My goal is to manage my infrastructure using Ansible, so I won’t be using provisioners to perform that task here; however, I don’t want my VMs to be deployed from a hopelessly out-of-date template, so I added in a quick provisioner:

"provisioners": [
  {
    "type": "shell",
    "inline": [
      "sudo yum -y upgrade"
    ]
  }
]
The output of the yum upgrade was copied into the output from Packer, so I could verify that both the build and the upgrade were successful. With that, I had a basic, clean, freshly upgraded CentOS 8 image that was templated and ready to deploy in vCenter.

The final task was to schedule regular redeployments. I wanted my deployments to happen on Saturdays, and since Windows patches come out on the second Tuesday, I decided to rebuild all of my templates on the second Saturday of each month. Since my automation VM runs Linux, I used cron to schedule it. Cron, however, has no concept of "every second Saturday of the month." Since the second Saturday always falls between the eighth and the fourteenth of the month, inclusive, I set cron to run every day during that week and proceed only if the day was a Saturday:

0 2 8-14 * * [ `date +\%u` = 6 ] && /opt/automation/packer build -force /opt/automation/centos8_minimal.json
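The guard in that entry can be unpacked into a small standalone function for testing (a sketch assuming GNU date; the function name is mine). `date +%u` prints the ISO weekday, where 6 is Saturday, and the day-of-month range 8-14 is what pins it to the second week.

```shell
#!/bin/sh
# Succeeds only when the given date is the second Saturday of its month:
# the second Saturday always has a day-of-month between 8 and 14 inclusive,
# and date +%u prints the ISO weekday (1 = Monday ... 6 = Saturday).
is_second_saturday() {
  dow=$(date -d "$1" +%u)
  dom=$(date -d "$1" +%-d)
  [ "$dow" -eq 6 ] && [ "$dom" -ge 8 ] && [ "$dom" -le 14 ]
}

is_second_saturday 2020-06-13 && echo "2020-06-13: build"
is_second_saturday 2020-06-06 || echo "2020-06-06: skip (first Saturday)"
```

The crontab handles the day-of-month range itself (the `8-14` field), so only the weekday test needs to appear in the command.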

This crontab entry builds at 2:00am every second Saturday. The “-force” option overwrites the previous template with the newly built one. In the next post, I will build Windows Server templates in Packer.

Let’s Automate VMware Part 1: Introduction and Packer Prerequisites

Just a little warning up front: I’ve never done this before. I’m taking a journey into slowly replacing myself with a small shell script, or at least with some automation tools. I’m completely new to this and working my way through it with the official documentation, a little bit of Googling and my cobbled-together vSphere 6.7 homelab. Feel free to join me on this journey, but please don’t blame me if you try this in your environment and get eaten by a grue.

Ultimately, the goal of this project is to completely automate my homelab, from hardware provisioning to hypervisor installation, cluster management, image management, package management, log aggregation and backups. Stretch goals include extending into the cloud, a CI/CD pipeline and "employee" self-service.

The first part of this journey was building base images so I would have something to operate, back up and manage. This seemed like a logical place to start: it's low-hanging fruit that would be easy to implement in any environment where VMs are deployed regularly. But rather than hand-building some base CentOS and Windows templates, it made sense to delve into using Packer to build those templates for me. So in the next couple of blog posts, I will use Packer to build base Linux and Windows images that I can use as templates in my VMware environment, and then schedule Packer to keep those images up-to-date.

While I could run Packer straight from my desktop, ultimately I wanted to have this code running on a dedicated virtual machine in my environment. I created a barebones Linux VM based on CentOS 8 that I named ‘automation’ as the base for all of this activity. While I’m using CentOS 8 for this, feel free to use whatever *nix you want or even Windows–Packer itself is platform agnostic.

A quick note up front: some Red Hat derivatives, including CentOS 8, already ship a program called "packer", which is related to cracklib, a library for checking passwords for good entropy. It's actually a link to another binary, so I bypassed the name collision altogether by issuing an "unlink /usr/sbin/packer" command without any regard for whether it might break the rest of the system. Do this at your own peril.

Installation was straightforward. I downloaded the Linux x64 binary from the Packer website (https://www.packer.io/downloads.html) and unzipped it into /usr/local. Running /usr/local/packer without any arguments proved that it was working. You can also find prebuilt packages for Packer or compile it yourself.

My next step was to get permissions to my vSphere cluster. Packer uses the vSphere API, and the permissions needed can be found in Packer’s documentation for the vSphere builders. I fired up the vSphere client to create a user called “automation” and assign it to the Administrators group. I then logged into the vCenter using PowerCLI in PowerShell:

Connect-VIServer -server vcsa.lab.clev.work

After allowing the vCenter certificate, I entered the credentials for the new 'automation' user into the popup dialog box and connected. To test my new user, I connected my datastore called "storage2" to a local PowerShell drive called "isostore", and then uploaded the latest CentOS ISO to the "iso" folder on that datastore.

New-PSDrive -Location (Get-Datastore "storage2") -Name isostore -PSProvider VimDatastore -Root "\"
Copy-DatastoreItem -Item d:\isos\CentOS-8.1.1911-x86_64-dvd1.iso -Destination isostore:\iso\

Note that this process is slow; the upload took about 20 minutes in my homelab and the progress bar in PowerCLI did not move during the copy. Fortunately, I could see the file growing in the vSphere client as the upload proceeded. In the next post, I use my freshly-uploaded ISO to build my first Linux image from my ‘automation’ VM.

Automatically Disable Inactive Users in Active Directory

While Microsoft provides the ability to set an expiration date on an Active Directory user account, there's no built-in facility in Group Policy or Active Directory to automatically disable a user who hasn't logged in within a defined period of time. This is surprising, since many companies have such a policy and some information security standards such as PCI require it.

There are software products on the market that provide this functionality, but for my homelab, my goal is to do this on the cheap. Well, not the cheap so much as the free. After reading up on the subject, I found that this is not quite as straightforward as it may seem. For instance, Active Directory doesn't actually provide very good tools out of the box for determining when a user last logged on.

The Elusive Time Stamp

Active Directory actually provides three different timestamps for determining when a user last logged on, and none of them are awesome. Here are the three available AD attributes:

lastLogon – This provides a time stamp of the user's last logon, with the caveat that it is not a replicated attribute. Each domain controller retains its own version of this attribute with the last timestamp at which the user logged onto that particular domain controller. This means that any script that uses this attribute must pull it from every domain controller in the domain and then use the most recent of those timestamps to determine the actual last logon.

lastLogonTimeStamp – This is a replicated version of the lastLogon timestamp. However, it is not replicated immediately. To reduce domain replication traffic, the replication frequency depends on a domain attribute called msDS-LogonTimeSyncInterval: the value of lastLogonTimeStamp is replicated after a random interval of up to five days less than that setting. msDS-LogonTimeSyncInterval is unset by default, which makes it fall back to 14 days, so out of the box lastLogonTimeStamp is replicated somewhere between 9 and 14 days after the previous replicated value. Needless to say, this is not useful for our purposes. In addition, this attribute is stored as a 64-bit signed numeric value that must be converted to a proper date/time to be useful in PowerShell.

lastLogonDate – There are a lot of blogs that will state that this is not a replicated timestamp. Technically that’s true; it’s a copy of lastLogonTimeStamp that the domain controller has converted to a standard date/time for you. Otherwise, it’s the same as lastLogonTimeStamp and has the same 9-to-14-day replication delay.
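As a side note on the conversion mentioned above: the 64-bit value is a Windows FILETIME, counting 100-nanosecond intervals since January 1, 1601 (UTC). In PowerShell, [DateTime]::FromFileTime() does the conversion for you; here is the same arithmetic sketched in shell (the function name is mine, and GNU date is assumed):

```shell
#!/bin/sh
# lastLogonTimeStamp is a Windows FILETIME: 100-nanosecond intervals since
# 1601-01-01 UTC. Divide by 10^7 to get seconds, then subtract the
# 11,644,473,600 seconds between 1601-01-01 and the Unix epoch (1970-01-01).
filetime_to_date() {
  epoch=$(( $1 / 10000000 - 11644473600 ))
  date -u -d "@$epoch" +%Y-%m-%dT%H:%M:%SZ
}

filetime_to_date 132223104000000000   # a FILETIME for 2020-01-01 00:00:00 UTC
```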

Another caveat for all three of these attributes is that the timestamp is updated for many logon operations, including interactive and network logons and those passed from another service such as RADIUS or another Kerberos realm. Also, there’s a Kerberos operation called “Service-for-User-to-Self” (“S4u2Self”) which allows services to request Kerberos tickets for a user to perform group and access checks for that user without supplying the user’s credentials. For more information, see this blog post.

Caveats aside, this is what we have to work with. For the purposes of my relatively small domain, I’m comfortable with increasing the replication frequency of lastLogonTimeStamp. Be aware before using this in production that it will increase replication traffic, especially during periods when many users are logging in simultaneously; domain controllers will be replicating this attribute daily instead of every 9 to 14 days.

As none of these caveats applied in my homelab, I launched Active Directory Users and Computers. Under “View”, I selected “Advanced Features” to expose the attributes I needed to view or change. I then right-clicked my domain and selected “Properties.” The msDS-LogonTimeSyncInterval was “not set” as expected, so I changed it to “1” to ensure that the timestamp was replicated daily for all users.

I then created the PowerShell script below and saved it to a directory.

# disableUsers.ps1  
# Set msDS-LogonTimeSyncInterval (days) to a sane number.  By
# default lastLogonDate only replicates between DCs every 9-14 
# days unless this attribute is set to a shorter interval.

# Also, make sure to create the EventLog source before running, or
# comment out the Write-EventLog lines if no event logging is
# needed.  Only needed once on each machine running this script.
# New-EventLog -LogName Application -Source "DisableUsers.ps1"

# Remove "-WhatIf"s before putting into production.

Import-Module ActiveDirectory

$inactiveDays = 90
$neverLoggedInDays = 90

# Cutoff dates derived from the day counts above
$disableDaysInactive = (Get-Date).AddDays(-$inactiveDays)
$disableDaysNeverLoggedIn = (Get-Date).AddDays(-$neverLoggedInDays)

# Identify and disable users who have not logged in in x days

$disableUsers1 = Get-ADUser -SearchBase "OU=Users,OU=Demo Accounts,DC=lab,DC=clev,DC=work" -Filter {Enabled -eq $TRUE} -Properties lastLogonDate, whenCreated, distinguishedName | Where-Object {($_.lastLogonDate -lt $disableDaysInactive) -and ($_.lastLogonDate -ne $NULL)}

$disableUsers1 | ForEach-Object {
   Disable-ADAccount $_ -WhatIf
   Write-EventLog -Source "DisableUsers.ps1" -EventId 9090 -LogName Application -Message "Attempted to disable user $_ because the last login was more than $inactiveDays days ago."
}

# Identify and disable users who were created x days ago and never logged in.

$disableUsers2 = Get-ADUser -SearchBase "OU=Users,OU=Demo Accounts,DC=lab,DC=clev,DC=work" -Filter {Enabled -eq $TRUE} -Properties lastLogonDate, whenCreated, distinguishedName | Where-Object {($_.whenCreated -lt $disableDaysNeverLoggedIn) -and ($_.lastLogonDate -eq $NULL)}

$disableUsers2 | ForEach-Object {
   Disable-ADAccount $_ -WhatIf
   Write-EventLog -Source "DisableUsers.ps1" -EventId 9091 -LogName Application -Message "Attempted to disable user $_ because user has never logged in and $neverLoggedInDays days have passed."
}

You may notice two blocks of similar code. The lastLogonDate is null for newly created accounts that have never logged in. Rather than have them all handled in a single block, I created two separate handlers for accounts that have logged in and those that haven’t. This might be useful for some organizations that want to disable inactive accounts after 90 days but disable accounts that have never logged in after only 14 or 30 days.

Note also that I have included event logging in this script. This is completely optional, but if you are bringing Windows logs into a SIEM like Splunk, it's useful to have custom scripts write to the Windows event logs so that the security team can track and act on these events. To log Windows events from a PowerShell script, you need to register an event source. This is a simple one-time command on each machine running the script. Here's the command I used to register my script:

New-EventLog -LogName Application -Source "DisableUsers.ps1"

This gives my script the ability to write events into the Application log, and the source will show as “DisableUsers.ps1”. The LogName can be used to log events to a different standard Windows log, or to even create a completely separate log. I also created two event IDs, 9090 and 9091, to log the two event types from my script. I did a quick Google search to make sure that these weren’t already used by Windows, but duplicate IDs are fine from different sources.

This script by default has a “-WhatIf” switch on each Disable-ADAccount command. This is so the script can be tested against the domain to make sure it’s behaving in an expected manner. During the run, the script will display the accounts it would have disabled, and the event log will generate an event for each disablement regardless of the WhatIf switch. Once thoroughly tested, the “-WhatIf”s can be removed to make the script active. The SearchBase should also be changed to an appropriate OU in your domain. Note that this script does not discriminate; any users in its path are subject to being disabled, including service accounts and the Administrator account if they have not logged in during the activity period.

Finally, it's time to put the script into play. It's as easy as creating a daily repeating task that launches powershell.exe with the arguments "-ExecutionPolicy Bypass c:\path\to\DisableUsers.ps1". If this is run on a domain controller, it can be run as the NT AUTHORITY\System user so that no credentials need to be stored or updated. I ran this in my homelab test domain and viewed the event log. Sure enough, several of my test accounts were disabled. Note that I used the weasel words "attempted to" in the log; this is because I don't actually do any checking to verify that the account was disabled successfully.

I did see one blog post where the author added a command to update the description field of an account so that administrators could see at a glance that it was auto-disabled. I didn’t do that here, but if you wanted to add that, here’s a sample command you can add to the script:

Set-ADUser $_ -Description ("Account Auto-Disabled by DisableUsers.ps1 on $(Get-Date)")

Hopefully this will help others to work around a glaring oversight by Microsoft. Please drop me a comment if you have any suggestions for improvement; I’m not a Powershell coder by trade and I’m always looking for tips.

Building and Populating Active Directory for my Homelab

There are a multitude of Active Directory how-to blog posts and videos out there. This is my experience putting together a quick and dirty Active Directory server, populating it with OUs and users, and getting users randomly logged in each day. I’m going to skip the standard steps for installing AD as it’s well documented in other blogs, and it’s the populating of AD that’s more interesting to me.

Why put in all this work, you ask? I'm doing some experimentation in AD, and rather than blow up a production environment, I want a working and populated AD at home. I also need users to "log in" randomly so I can use the login activity to write some scripts. So, here's how I did it.

Step 1 was to actually install Active Directory. Again, there are numerous blogs about how to do this. Since I was running Server 2016 in evaluation mode, I wasn’t going to do a lot of customizing since it was just going to expire eventually anyway. I had already configured 2016 for DNS and DHCP for my lab network, so installing AD was as easy as adding the role and management utilities from the Server Manager. Utilizing the “Next” button repeatedly, I created a new forest and domain for the lab.

The next step was to populate the directory. For this, I went with the “CreateDemoUsers” script from the Microsoft TechNet Gallery. I downloaded the PowerShell script and accompanying CSV and executed the script. In minutes, I had a complete company including multiple OUs filled with over a thousand user accounts and groups. Each user had a complete profile, including title, location and manager.

With my sample Active Directory populated, my next goal was to record random logins for these users. I needed this initially for some scripts I was writing, but my eventual goal was to generate regular activity to bring into Splunk. For my simulated login activity, I wrote a script to pick a user at random and execute a dummy command as that user so that a login was recorded on their user record. I began by modifying the “Default Domain Controller policy” in Group Policy Management. I added “Domain Users” to “Allow log on locally.” (This is found in Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment.) This step isn’t necessary if the script is running from a domain member, but since I didn’t have any computers added to the domain, this would allow me to run the script directly on the domain controller.

Here is the script that I created. I put it in c:\temp so it could be run by any user account.

# Login.ps1

$aduser = Get-ADUser -Filter * | Select-Object -Property UserPrincipalName | Sort-Object {Get-Random} | Select-Object -First 1
$username = $aduser.UserPrincipalName
$password = ConvertTo-SecureString 'Password1' -AsPlainText -Force

$credential = New-Object System.Management.Automation.PSCredential $username, $password
Start-Process hostname.exe -Credential $credential

I wanted my user base as a whole to log in on average once every 90 days; with roughly 1,000 users, that works out to about 11 logins per day. Easy enough: I created a scheduled task to run at 12:00am every day and repeat every 2 hours, for 12 runs per day. That was close enough for my purposes.
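The back-of-the-envelope math, as a quick sketch (the user count of ~1,000 is assumed from the CreateDemoUsers run above):

```shell
#!/bin/sh
# ~1,000 users, each logging in once per 90 days on average:
users=1000
interval_days=90
logins_per_day=$(( users / interval_days ))   # integer division, ~11

# A task repeating every 2 hours from midnight fires 12 times per day,
# close enough to the ~11-login target.
runs_per_day=$(( 24 / 2 ))

echo "$logins_per_day logins/day wanted, $runs_per_day task runs/day"
```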

Since I was running this on a lone DC in a lab, I ran it as NT AUTHORITY\SYSTEM so I didn’t have to mess with passwords. The task runs powershell.exe with the arguments -ExecutionPolicy Bypass c:\temp\login.ps1.

After saving the task, I right-clicked and clicked “Run.” Because I chose hostname.exe as the execution target, the program opened and immediately closed. While this briefly flashed on my screen when testing, nothing appears on the screen when running this as a scheduled task. A quick look through the event log confirmed that user “LEClinton” logged in, and a look at Lisa Clinton’s AD record shows that she did indeed log in.

While my eventual goal is to script all of the installation and configuration, this was enough to unblock the rest of the work I needed to do, as well as provide a semi-active AD environment for further testing and development.

Homelab Again

“We’re putting the band back together.”

Five o’clock. Time to shut down the ol’ laptop and go watch some TV. Or not. I feel the urge to create again. I have a bunch of hardware laying around that’s gathering dust, and it’s time to assemble it into something I can use to learn and share.

I’m not looking for anything flashy here. Money is tight and the future is uncertain for me and millions of others, so while I’d love a few old Dell servers or some shiny new NUCs, I’m sure I can cobble together something useful out of my closet for free. Initially I was going to put together a single machine and set up nested virtualization, but my best available equipment only has four cores and frankly, I feel like it’s cheating anyway. I’m a hardware junkie, so I cobble.

With a day's work, I have scraped together a cluster of machines. One is from my original homelab build: a Dell Optiplex 9010 motherboard and i5 processor bought off eBay and transplanted into a slim desktop case. It has four DIMM slots, so it's going to be one of my ESXi hosts. The other VM host is a Dell Optiplex 3020 SFF i5 desktop. Alas, it only has two DIMM slots, so it maxes out at 16GB with the sticks I have. I also manage to scrape together a third system with eight gigs of RAM and a lowly Core 2 Duo. Transplanted into a Dell XPS tower case formerly occupied by a Lynnfield i7, it has enough drive bays that I can drop in a one terabyte SATA SSD and three 800GB SAS SSDs in RAID-0 on an LSI HBA. This system takes the role of an NFS-based NAS running on CentOS 8 and has no problems saturating a gigabit link.

It’s not all roses–my networking is all gigabit–but it totals eight cores, 48GB of usable RAM and over 3TB of flash storage. A few hours after that, I have a two-node ESXi 6.7 cluster configured and vCenter showing all green. I toyed briefly with installing 7.0, but I want to give that a little more time to bake, and besides, if what I’m planning doesn’t directly translate then I’ve done something wrong.

What are my goals here? Well, I want to work on a “let’s automate VMware” series. Not satisfied with the kind of janky scripts I’ve cobbled together to automate some of my uglier processes, I want to learn how to do this the right way. That certainly means Terraform, some kind of configuration management (Ansible, Salt, Puppet, Chef or a combination), building deployment images with Packer, and figuring out how to safely do all of this with code that I’m storing in Github. Possible future projects might involve integrating with ServiceNow, Splunk, monitoring (Zabbix, Solarwinds) or other business tools, and using the resulting tooling to deploy resources into AWS, Azure, DigitalOcean and/or VMware Cloud.

Why do this on my own? While I do have the resources at work, I feel more comfortable doing this on my own time and at my own pace. Plus, by using personal time and resources, I feel safer sharing my findings publicly with others. While my employer has never given me any indication that it would be an issue, I'd just rather build and refine my skills on my own time.

So stay tuned; there’s more content to come from the homelab!

How to Disable Screen Resolution Auto-Resizing on ESXi Virtual Machines (for Linux this time)

It seemed like a simple request. All I wanted was for my Linux virtual machine to stay at the 1920×1080 resolution I specified. VMware Tools, however, had different ideas and resized my VM's screen resolution every time I resized the window in the remote console. Normally this is no biggie, but users were connecting into this box with VNC and the VNC server was freaking out when the resolution changed. Google has dozens of blog posts and forum questions about resolving this issue in Windows VMs and in VMware Fusion/Workstation, but I couldn't find anything about fixing this in ESXi 6.7 even after an hour of searching.

Finally I dug into my local filesystem and found the culprit: a dynamic library called libresolutionKMS.so. On my CentOS 7 VM, it was located in /usr/lib64/open-vm-tools/plugins/vmsvc. Like the Windows equivalent, moving or deleting this file and rebooting was sufficient to stop this issue in its tracks. In my case, I moved it to my home directory rather than deleting it, in case anything broke, but the move stopped the auto-resizing without affecting any other operations, and VMware still reported the VM tools as running.

As this is Linux, open-vm-tools is updated by my package manager, so this file will need to be moved or deleted each time after open-vm-tools is updated. Given the lack of any other way to disable this behavior, this seems like a workable kludge for my situation.

Working in the Hot Aisle: Sealing It Up

"The hot stays hot and the cool stays cool." -McDonald's McDLT commercials

Part 1 of this saga can be found here.

We were all just a bunch of starry-eyed kids with our plan for the future:  Rack some equipment, buy back-to-front cooled switches, snap in some filler blanks and the world would be a beautiful place.  We soon were in for a big reality check.  Let’s start with a simple task: buying some switches.

It turns out that it’s actually difficult to find switches that suck air in the back and blow it out the front.  In fact, it’s even fairly difficult to find switches that blow front to back.  Many (most?) switches pull air from the left side of the switch and blow it out the right or vice versa.  In a contained data center, that means that the switches are pulling in hot air from the hot aisle and blowing out even hotter air.  In fact, there are diagrams on the Internet showing rows of chassis switches mounted in 2-post racks where the leftmost switch is getting relatively cool air and blowing warmer air into the switch to its right.  This continues down the line until you get to the switch at the other end of the row, which is slowly melting into a pile of slag.  Needless to say, this is not good for uptime.

There are companies that make various contraptions for contained-aisle use.  These devices have either passive ducts or active fan-fed ducts that pull air from the cold aisle and duct it into the intake side of the switch.  Unfortunately, switch manufacturers can’t even agree on which side of the switch to pull air from or where the grilles on the chassis are located.  Alas, this means that unless somebody is making a cooler specific to your chassis, you have to figure out which contraption is closest to your needs.  In our case, we were dealing with a Cisco 10-slot chassis with right-to-left cooling.  There were no contraptions that fit correctly, so an APC 2U switch cooler was used.  This cooler pulls air from the front and blows it up along the intake side of the switch in the hot aisle.  While not as energy efficient as contraptions with custom fitted ducts that enclose the input side of the switch, it works well enough and includes redundant fans and power inputs.

For the top-of-rack and core switches, only the Cisco Nexus line offered back-to-front cooling options (among Cisco switches, that is).  That's fine since we were looking at Nexus anyway, but it's unfortunate that it's not an option on Catalyst switches.  Front-to-back cooling is an option, but then the switch ports end up in the cold aisle, meaning that cables must be passed through the rack and into the hot aisle.  It can work, but it's not as clean.

However, buying back-to-front cooled switches is but the beginning of the process.  The switches are mounted to the back of the cabinet and are shorter than the cabinet, meaning that there is a gap around the back of the switch where it’s not sealed to the front of the cabinet.  Fortunately, the contraption industry has a solution for that as well.  In our case, we went with the HotLok SwitchFix line of passive coolers.  These units are expandable; they use two fitted rectangles of powder-coated steel that telescope to close the gap between the switch and the cabinet.  They come in different ranges of depths to fit different rack depth and switch depth combinations and typically mount inside the switch rails leading to the intake side of the switch.  Nylon brush “fingers” allow power and console cables to pass between the switch and the SwitchFix and into the hot aisle.

The rear of the switch as viewed from the cold aisle side. The SwitchFix bridges the gap between the rear of the switch and the front of the rack.

While this sounds like an ideal solution, in reality the heavy gauge steel was difficult to expand and fit correctly, and we ended up using a short RJ-45 extension cable to bring the console port out of the SwitchFix and into the cold aisle for easy switch configuration.  The price was a little heart-stopping as well, though it was still better than cobbling together homemade plastic and duct-tape contraptions to do the job.

With the switches sorted, cable managers became the next issue.  The contractor provided standard 2U cable managers, but they had massive gaps in the center for cables to pass through–great for a 2-post telco rack, but not so great for a sealed cabinet.  We ended up using some relatively flat APC 2U cable managers and placed a flat steel 2U filler plate behind them, spaced out with #14 by 1.8″ nylon spacers from Grainger.  With the rails fully forward in the cabinet, the front cover of the cable manager just touched the door, but didn’t scrape or significantly flex.

Once the racks are in place and equipment is installed, the rest of the rack needs to be filled to prevent mixing of hot and cold air. There are a lot of options, from molded plastic fillers that snap into the square mounting holes to powder-coated sheets of steel with holes drilled for mounting. Although the cost was significantly higher, we opted for the APC 1U snap-in fillers. Because they didn’t need screws, cage nuts or tools, they were easy to install and easy to remove. With the rails adjusted all the way up against the cabinet on the cold aisle side, no additional fillers were needed around the sides.

With every rack unit filled with switches, servers, cable managers, telco equipment and snap-in fillers, sealing the remaining gaps was our final issue to tackle from an efficiency perspective.  While the tops of the cabinets were enclosed by the roof system, there was still a one-inch gap underneath the cabinets that let cold air through.  Even though the gap under the cabinet was only an inch high, our 18 cabinets had gaps equivalent to about three square feet of open space!  We bought a roll of magnetic strip to attach to the bottom of the cabinets to block that airflow, reduce dust intrusion and clean up the look.
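The three-square-feet figure above is easy to sanity-check with quick arithmetic. This sketch assumes a typical 24-inch (600 mm) cabinet width, which is an assumption rather than a measurement from our racks:

```python
# Rough check of the under-cabinet gap area described above.
# Assumed: 18 cabinets, each ~24 inches wide, with a 1-inch gap underneath.
CABINETS = 18
WIDTH_IN = 24   # typical 600 mm rack width (assumed)
GAP_IN = 1

total_sq_in = CABINETS * WIDTH_IN * GAP_IN
total_sq_ft = total_sq_in / 144  # 144 square inches per square foot
print(total_sq_ft)  # 3.0
```

Even a one-inch gap, multiplied across a whole row, leaks as much air as a missing 2U filler panel in every cabinet.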

Lessons Learned

There’s no other way of saying this: this was a lot of work. A lot of stuff had to be purchased after racking began. There are a lot of gotchas that have to be considered when planning this, and the biggest one is just being able to seal everything up. Pretty much the entire compute equipment industry has standardized on planning their equipment around front-to-back cooling, which makes using them in a contained-aisle environment simple. Unfortunately, switch manufacturers are largely just not on board. I don’t know if it’s because switching equipment is typically left out of such environments, or if they just don’t see enough of it in the enterprise and small business markets, but cooling switches involves an awful lot of random metal and plastic accessories that have high markups and slow shipping times.

However, I have to say that having equipment sitting at rock-stable temperatures is a huge plus. We were able to raise the server room temperatures, and we don’t have the hot-spot issues that cause fans to spin up and down throughout the day. Our in-row chillers run much less than the big CRAC units in the previous data center, even though there is much more equipment in there today. The extra work helped build a solid foundation for expansion.

Working in the Hot Aisle: Power and Cooling

Part 1 of this saga can be found here.

An hour of battery backup is plenty of time to shut down a dozen servers… until it isn’t.  The last thing you want to see is that clock ticking down while a couple thousand Windows virtual machines decide to install updates before shutting down.

We were fortunate that one of the offices we were consolidating was subleased from a manufacturer who had not only APC in-row chillers that they were willing to sell, but a lightly used generator.  Between those and a new Symmetra PX UPS, we were on our way to breathing easy when the lights went out.  The PX provides several hours of power and is backed by the generator, which also backs up the chillers in the event of a power outage.  The PX is a marvel of engineering, but it is also a single point of failure.  We witnessed this firsthand with an older Symmetra LX, which had suffered a backplane failure a couple of years earlier that took down everything.  With that in mind, we opted to go with two PDUs in each server cabinet: one fed from the UPS and generator, and one fed from city power with a massive power conditioner in front of it.  These circuits also extend into the IDFs so that building-wide network connectivity stays up in the event of a power issue.

Most IT equipment comes with redundant power supplies, so splitting the load is easy: one power supply goes to each PDU.  For the miscellaneous equipment with a single power supply, an APC 110V transfer switch handles the switching duties.  A 1U rackmount unit, it is basically a UPS with two inputs and no batteries, and it seamlessly switches from one source circuit to the other when a voltage drop is detected.

As mentioned, cooling duties are handled by APC In-Row chillers, two in each aisle.  They are plumbed to rooftop units and are backed by the generator in case of power failure.  Temperature sensors on adjacent cabinets provide readings that help them work as a group to optimize cooling, and network connectivity allow monitoring using SNMP and/or dedicated software.  Since we don’t yet need the cooling power from all four units, we will be programming the units to run on a schedule to balance running hours between the units.
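The run-hour balancing we plan to do is conceptually simple: whenever only some of the units need to run, pick the ones with the fewest accumulated hours. A minimal sketch of that policy, with hypothetical unit names and hour counts (this is my own illustration, not APC's actual scheduler logic):

```python
# Hypothetical run-hour balancing: choose which chillers to run so that
# accumulated hours stay roughly even across the group.
def pick_units(run_hours: dict[str, float], needed: int) -> list[str]:
    """Return the `needed` units with the fewest accumulated hours."""
    return sorted(run_hours, key=run_hours.get)[:needed]

# Example: four units, two needed to meet the current load
hours = {"chiller-1": 120.0, "chiller-2": 95.5,
         "chiller-3": 130.0, "chiller-4": 101.0}
print(pick_units(hours, 2))  # ['chiller-2', 'chiller-4']
```

In practice we would implement this as a fixed calendar schedule in the units themselves, but the effect is the same: the least-used units take the next shift.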

Cooling in the IDFs is handled by the building’s chiller, with an independent thermostat-controlled exhaust fan as backup. As each IDF basically hosts just one chassis switch, cooling needs are easily handled in this manner. As users are issued laptops that can ride out most outages, we were able to sidestep having to provide UPS power to work areas.

Next time:  Keeping the hot hot and the cool cool.

Working In the Hot Aisle: Choosing a Design

Part 1 of this saga can be found here.

As mentioned in the previous post, there are several ways to skin the airflow separation cat. Each has its advantages and disadvantages.

Initially we considered raised floor cooling. Dating back to the dawn of computing, raised floor datacenters consist of chillers that blow cold air into the space under the raised floor. Floor tiles are replaced with perforated tiles in front of or underneath the enclosures to allow cold air into the equipment. Hot air from the equipment is then returned to the A/C units above the floor. The raised floor also provides a hidden space to route cabling, although discipline is more difficult since the shame of poor cabling is hidden from view. While we liked the clean look of raised floors with the cables hidden away, the cost is high and the necessary entry ramp at the door would have taken up too much floor space in our smaller datacenter.

We also looked at a hot aisle design that uses the plenum space above a drop ceiling as the return. Separation is achieved with plastic panels above the enclosures, and the CRAC units are typically placed at one or both ends of the room. Because this was a two-row layout in a relatively tight space, it was difficult to find a location for the CRACs that would avoid creating hot spots.

The decision became a lot easier when we found out that one of the spaces we were vacating had APC in-row chillers that we could pick up at a steep discount. The in-row units are contained within a single standard rack enclosure, so they are ideal for a hot aisle/cold aisle configuration. They solved the hot spot issues, as they could be placed into both rows. They also use temperature probes integrated into the nearby cabinets to keep temperatures comfortable efficiently.

APC In-Row RD. APC makes a kit to extend the height to 48U if used with tall racks. (Photo by APC)

With the cooling situation sorted, we turned our attention to containment. We opted for the Schneider/APC EcoAisle system, which provided a roof and end-doors to our existing two-row enclosure layout to create a hot aisle and a cold aisle. The equipment fans pull in cooler air from the cold aisle and exhaust hot air into the hot aisle, while the in-row chillers pull hot air from the hot aisle and return chilled air back into the cold aisle.

There are two options for this configuration.  A central cold aisle can be used, with the rest of the exterior space used as a hot aisle.  This could possibly reduce energy consumption, as only the central aisle is cooled and therefore the room doesn’t need to be sealed as well against air leaks.

The second option, which we ended up choosing, was a central hot aisle.  In our case, the exterior cold aisles gave us more room to rack equipment, and using the entire room as the cold aisle gives us a much larger volume of cool air, meaning that in the case of cooling system failure, we have more time to shut down before temperatures become dangerous to the equipment.
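The extra-time argument is worth a back-of-envelope estimate: ride-through scales linearly with the volume of cooled air, which is exactly why treating the whole room as the cold aisle helps. Every number in this sketch is an illustrative assumption, not a measurement from our room:

```python
# Back-of-envelope ride-through: how long the room's air can absorb the
# IT heat load before exceeding an allowable temperature rise. Ignores
# walls, equipment thermal mass and any residual airflow, so it is a
# conservative lower bound.
AIR_DENSITY = 1.2   # kg/m^3
AIR_CP = 1005.0     # J/(kg*K), specific heat of air

def ride_through_s(volume_m3: float, load_w: float, delta_t_k: float) -> float:
    """Seconds until the room air warms by delta_t_k under load_w."""
    return volume_m3 * AIR_DENSITY * AIR_CP * delta_t_k / load_w

# e.g. an assumed 450 m^3 room, 30 kW load, 10 K allowable rise
print(round(ride_through_s(450, 30_000, 10)))  # ~181 seconds
```

Even with generous numbers the air volume alone buys only minutes, so doubling the cooled volume meaningfully extends the window for an orderly shutdown.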

The central hot aisle is covered by a roof consisting of lightweight translucent plastic insulating panels, which reduce heat loss while allowing light in. (The system includes integrated LED lighting as well.) The roof system is connected to the fire system, so in the case of activation of the fire alarm, electromagnets that hold the roof panels in place will release, causing the roof panels to fall in so the sprinklers can do their job.  We can also easily remove one or more panels to work in the cable channel above.

APC EcoAisle configuration. Ours is against the wall and has one exit door. (Photo APC).

Our final design consists of eleven equipment racks, three UPS racks and four chiller racks.  This leaves plenty of room for growth, and in the case of unexpected growth, an adjacent room, accessible by knocking out a wall, doubles the capacity.

We decided on 48U racks, both to increase the amount of equipment we can carry and to raise the EcoAisle’s roof. To accommodate networking, one enclosure in each aisle is a 750mm “network width” enclosure to provide extra room for cable management. As the in-row CRACs were only 42U high, Schneider provides a kit that bolts to the top of the units to add 6U to their height.

Next time:  Power and cooling

VMworld 2019 Conclusion: vCommunity

I’m back in my house, in my chair, with my family. I’m so thankful for them and happy and contented to be back home.

But the whirlwind of the past few days hasn’t faded yet from my mind. There are so many notes, so much information, and I so want to fire up the homelab before it all disappears, but I just can’t. My brain needs to unclench first.

This was my first time at VMworld in San Francisco. Unlike the Vegas show last year, Moscone Center is broken up into three distinct buildings, all separated by busy streets. What this means is that it takes 10 to 15 minutes to get from, say, the third floor of Moscone West to Moscone South. Because most events and sessions are scheduled for 30 or 60 minutes, it means leaving an event early or arriving late to the next one. While there was some of this in Vegas, usually events were only separated by a floor and a short walk. It’s something I have to take into account when scheduling sessions next year, and it meant that I had to duck out early or skip some sessions altogether.

The sessions I attended were all excellent. I tried to keep my sessions to topics that would help in my current situation, i.e. wrangling thousands of random virtual machines that are in various stages of use or abandonment, managing and deploying VMs with as few touches as possible, and trying to automate the hell out of all of it. To that end, I focused on the code-related sessions as much as possible, and I was not disappointed. VMware and the community are hard at work solving problems like mine with code, and it’s great to have such a variety of tools ready for incorporation into my own workflows.

Additionally, I attended great sessions on vSphere topics such as VMware Tools, ESXi performance, certificate management and vMotion. These not only gave me insight into how these functions work under the hood, they hinted at new technologies being planned for ESXi and vCenter to make these products work better. This was a great relief, as I’ve been concerned for a while that vSphere would go into a more slow maintenance-only product cycle as the push toward cloud increases. I’m happy that VMware continues to invest heavily in its on-prem products.

If there was one word that summed up the overarching theme of this VMworld, it’s Kubernetes. From the moment Pat Gelsinger stepped onto the stage, Kubernetes was the topic at hand. Kubernetes integration will involve a complete rearchitecting of ESXi, and as someone who sees my customers experimenting with using containers for their build processes, I’m happy that VMware is going to make this easier (and faster) to do and manage in the future.

Let’s face it though. Most of the sessions were recorded and will be made available after the show. This is true of most major software trade shows, and if sessions were the only reason to attend, one could reasonably just stay home and watch videos in their pajamas.

It’s the interactions that matter. Being able to ask questions and get clarifications is very important, and I found that valuable for certain topics and sessions. However, the most important thing that you don’t get sitting at home is the interaction with the community.

Last year was my first VMworld. I didn’t know anybody, and I didn’t really know what to do to get the most out of the show. I scheduled sessions to fill every time slot, even if the product or topic wasn’t interesting or relevant. The time that I wasn’t in a session was spent roaming the conference floor collecting swag from vendors. By the time I was done, I had fifteen t-shirts, a jacket, a drone, a game console and more pens and hats than I would ever use. I did attend a couple of parties and met a couple of great people (like Scott Lowe, who I only realized later was the proprietor of the Full Stack Journey podcast I had been listening to on the plane.)

I didn’t have the real community experience until the Dallas VMUG UserCon a few months later. There I met great people like Ariel Sanchez Mora and the leaders of both the Dallas and Austin VMUGs. But it was a talk Tim Davis gave on community that really made me realize how important my participation would be to me. I dropped a few bucks on WordPress.com, threw out and replaced my old Twitter account and started participating. I’ve since been attending both the Cedar Park and Austin VMUG meetings as well as the Dallas UserCon, and it’s been great having a group of peers to talk with and bounce ideas off of.

Contrast last year’s VMworld to this year’s. This VMworld I collected two shirts (both from VMUG events) and no vendor swag. I visited a couple of vendors that I specifically wanted to talk to but otherwise just did a little browsing. Instead of filling my calendar and stressing out about catching them all, I strategically picked topics that would be relevant for me in the next year.

Most of my downtime between sessions was spent at the blogger tables at the VMware Social Media and Community booth outside the Solutions Exchange. It was a little darker than the VMUG area, and with a lot less foot traffic. There, I could recharge my batteries (literally), blog about the event, catch up on some of the other topics I’ve been working on, and chat with the other community members who rotated in and out as they went to their sessions and events. I also got to drop in on vBrownbag sessions being given by community members, which provided some good background learning while I was working.

So what do I plan to have accomplished by VMworld 2020?

  1. Build out those automation workflows, VMware integrations and tools that I need to better manage the clusters under my purview.
  2. Blog about it.
  3. Work with my customers to get their automated ephemeral build and test environments working so they don’t have to rely on sharing and snapshotting their current environments.
  4. Blog about it.
  5. Earn my VCP.
  6. Blog about it.
  7. Have a presentation ready for vBrownbag.
  8. Blog about it.

Thanks to everyone I met this year; it was great.