Brokenhaze

The thoughts of a SysAdmin

How Stack Exchange gets the most out of HAProxy

with 2 comments

At Stack Exchange we like to do two, well no, three things. One, we love living on the bleeding edge and making use of awesome new features in software. Two, we love configuring the hell out of everything we run, which leads to three – getting the absolute most performance out of the software that we run.

HAProxy is no exception to this. We have been running HAProxy since just about day one. We have been on the 1.5dev branch of code since almost the day it came out.

Of course, most people would ask why you would do that – you open yourself up to a whole lot of issues with dev code. The answer, of course, is that there are features we want in this dev code. The less selfish answer is that we want to make the internet a better place for everyone. What better way to do that than running bleeding edge code and finding the issues for you?

I’m going to go through our HAProxy setup and how we are using the features. I would highly recommend reading through the HAProxy documentation for the version you are running. There is a ton of good information in there.

[Figure: HAProxy Flow - ERD]

This is a high level overview of what our network looks like from the cloud to the web front ends. Yes, there is a lot more to us serving you a request, but this is enough for this post.

The basics are that a request comes into our network from the internet. Once it passes our edge routers it goes on to our load balancers. These are CentOS 6 Linux boxes running HAProxy 1.5dev. The request comes into our HAProxy load balancers and, depending on what tier it comes into, is processed and sent to a backend. After the packet makes its way through HAProxy it gets routed to one of the web servers in our IIS farm.

One of the reasons that HAProxy is so damn good at what it does is that it is single minded, as well as (mostly) single threaded. This has led it to scale very, very well for us. One of the nice things about the software being single threaded is that we can buy a decent sized multi-core server and, as things need more resources, we just split them out to their own tier – which is another HAProxy instance, using a different core.

Things get a bit more interesting with SSL, as there is a multi-threaded bit in there to handle the transaction overhead. Going deeper into the how of HAProxy's threading is out of the scope of this post though, so I'll just leave it at that.

Phew, we’ve got the introductory stuff out of the way now. Let’s dive into what our HAProxy config actually looks like!

The first bit is our global defaults, and some setup – users, a bit of tuning, and some other odds and ends. All of these options are very well documented in the HAProxy docs, so I won't bore you by explaining what each one of them does.

For this post all but one example (our websocket config) comes out of what we call "Tier 1". This is our main tier; it's where we serve the Q&A sites and other critical services out of.

userlist stats-auth
    group admin users <redacted>
    user supa_dupa_admin insecure-password <redacted>
    group readonly users <redacted>
    user cant_touch_this insecure-password <redacted>

global
    daemon
    stats socket /var/run/haproxy-t1.stat level admin
    stats bind-process 1
    maxconn 100000
    pidfile /var/run/haproxy-t1.pid
    log 127.0.0.1 local0
    log 10.7.0.17 local0
    tune.bufsize 16384
    tune.maxrewrite 1024
    spread-checks 4
    nbproc 4

defaults
    errorfile 503 /etc/haproxy-shared/errors/503.http
    errorfile 502 /etc/haproxy-shared/errors/502.http
    mode http
    timeout connect 15s
    timeout client 60s
    timeout server 150s
    timeout queue 60s
    timeout http-request 15s
    timeout http-keep-alive 15s
    option redispatch
    option dontlognull
    balance source

Nothing all that crazy here, some tweaks for scale, setting up some users, timeouts, logging options and default balance mode. Generally you want to tune maxconn and your timeout values to your environment and your application. Other than that the defaults should work for 98% of the people out there.

Now that we have our defaults set up, let's look a little deeper into the really interesting parts of our configuration. I will point out things that we use that are only available in 1.5dev as I go.

First, our SSL termination. We used to use Nginx for our SSL termination, but as our deployment of SSL grew we knew that SSL support was coming to HAProxy, so we waited for it to come out and then went in whole hog.

listen ssl-proxy-1
    bind-process 2 3 4
    bind 198.51.100.1:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.2:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.3:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.4:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.5:443 ssl crt /etc/haproxy-shared/ssl/misc.pem
    bind 198.51.100.6:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.7:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.8:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.9:443 ssl crt /etc/haproxy-shared/ssl/wc-san.pem
    bind 198.51.100.10:443 ssl crt /etc/haproxy-shared/ssl/misc.pem
    mode tcp
    server http 127.1.1.1:80 send-proxy
    server http 127.1.1.2:80 send-proxy
    server http 127.1.1.3:80 send-proxy
    server http 127.1.1.4:80 send-proxy
    server http 127.1.1.5:80 send-proxy

    maxconn 100000

This – native SSL termination in HAProxy – is a 1.5dev feature.

[Figure: HAProxy - Core Detail]

Once again, this is a pretty simple setup. The gist of what is going on here is that we set up a listener on port 443. It binds to the specified IP addresses as an SSL port using the specified certificate file in PEM format – specifically the full chain, including the private key. This is actually a very clean way to set up SSL since you just have one file to manage, and one config line to write when setting up an SSL endpoint.
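
In case you're wondering what actually goes into one of those PEM files: it's just the certificate, any intermediate chain certs, and the private key concatenated into a single file. Something along these lines works, with made-up file names here for illustration:

cat wc-san.crt intermediate-ca.crt wc-san.key > /etc/haproxy-shared/ssl/wc-san.pem
chmod 600 /etc/haproxy-shared/ssl/wc-san.pem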

The next thing it does is set the target server to itself (127.1.1.1, .2, .3, etc.) using the send-proxy directive, which tells the process to use the proxy protocol so that we don't lose some of that tasty information when the packet gets shipped to the plain HTTP front end.

Now hold on a second! Why are you using multiple localhost proxy connections?! Ahh, good catch. Most people probably won't run into this, but it's because we were running out of source ports when we only used one proxy connection. We ran into something called source port exhaustion. The quick story is that you can only have ~65k ip:port to ip:port connections. This wasn't an issue before we started using SSL since we never got close to that limit.

What happened when we started using SSL? Well we started proxying a large amount of traffic via 127.0.0.1. I mean we do have a feeewww more than 65k connections.

Total: 581558 (kernel 581926)
TCP:   677359 (estab 573996, closed 95478, orphaned 1237, synrecv 0, timewait 95475/0), ports 35043

So the solution here is to simply load balance between a bunch of IPs in the 127.0.0.0/8 space, giving us ~65k more source ports per entry.
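
If you want to see how close you are to that ceiling on your own boxes, the ephemeral port range the kernel hands out plus a socket summary (the same ss -s output shown above) gives you a quick read:

cat /proc/sys/net/ipv4/ip_local_port_range
ss -s

On a stock CentOS 6 box the default range is 32768 to 61000, so in practice a single source IP gets you closer to ~28k usable ports than the theoretical ~65k.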

The final thing I want to point out about the SSL front end is that we use the bind-process directive to limit the cores that that particular front end is allowed to use. This allows us to have multiple HAProxy instances running and not have them stomp all over each other on a multi-core machine.

Our HTTP Frontend

The real meat of our setup is our http frontend. I will go through this piece by piece and at the end of this section you can see the whole thing if you would like.

frontend http-in
    bind 198.51.100.1:80 name stackexchange
    bind 198.51.100.2:80 name careers
    bind 198.51.100.3:80 name careers.sstatic.net
    bind 198.51.100.4:80 name openid
    bind 198.51.100.5:80 name misc
    bind 198.51.100.6:80 name stackexchange
    bind 198.51.100.7:80 name careers
    bind 198.51.100.8:80 name careers.sstatic.net
    bind 198.51.100.9:80 name openid
    bind 198.51.100.10:80 name misc
    bind 127.1.1.1:80 accept-proxy name http-in
    bind 127.1.1.2:80 accept-proxy name http-in
    bind 127.1.1.3:80 accept-proxy name http-in
    bind 127.1.1.4:80 accept-proxy name http-in
    bind 127.1.1.5:80 accept-proxy name http-in
    bind-process 1

Once again, this is just setting up our listeners, nothing all that special or interesting here. Here is where you will find the binding that our SSL front end sends to with the accept-proxy directive. Additionally, we give them a name so that they are easier to find in our monitoring solution.

stick-table type ip size 1000k expire $expire_time store gpc0,conn_rate($some_connection_rate)

## Example from HAProxy Documentation (not in our actual config)##
# Keep track of counters of up to 1 million IP addresses over 5 minutes
# and store a general purpose counter and the average connection rate
# computed over a sliding window of 30 seconds.
stick-table type ip size 1m expire 5m store gpc0,conn_rate(30s)

The first interesting piece is the stick-table line. What is going on here is that we are tracking the connection rate of incoming IPs to this frontend in a stick-table, along with gpc0 (General Purpose Counter 0). The example from the HAProxy docs on stick-tables explains this pretty well.

    log global
    

    capture request header Referer               len 64
    capture request header User-Agent            len 128
    capture request header Host                  len 64
    capture request header X-Forwarded-For       len 64
    capture request header Accept-Encoding       len 64
    capture response header Content-Encoding     len 64
    capture response header X-Page-View          len 1
    capture response header X-Route-Name         len 64
    capture response header X-Account-Id         len 7
    capture response header X-Sql-Count          len 4
    capture response header X-Sql-Duration-Ms    len 7
    capture response header X-AspNet-Duration-Ms len 7
    capture response header X-Application-Id     len 5
    capture response header X-Request-Guid       len 36
    capture response header X-Redis-Count        len 4
    capture response header X-Redis-Duration-Ms  len 7
    capture response header X-Http-Count         len 4
    capture response header X-Http-Duration-Ms   len 7
    capture response header X-TE-Count           len 4
    capture response header X-TE-Duration-Ms     len 7

rspidel ^(X-Page-View|Server|X-Route-Name|X-Account-Id|X-Sql-Count|X-Sql-Duration-Ms|X-AspNet-Duration-Ms|X-Application-Id|X-Request-Guid|X-Redis-Count|X-Redis-Duration-Ms|X-Http-Count|X-Http-Duration-Ms|X-TE-Count|X-TE-Duration-Ms):

We are mostly doing some setup for logging here. What is happening is that as a request comes in or a response goes out we capture some specific headers using capture request or capture response, depending on the direction. HAProxy then takes those headers and inserts them into the syslog message that is sent to our logging solution. Once we have captured the headers that we want off the response, we use rspidel to strip them from the response sent to the client. rspidel uses a simple regex to find and remove the headers.

The next thing that we do is set up some ACLs. I'll just show a few examples here since we have quite a few.

acl source_is_serious_abuse src_conn_rate(http-in) gt $some_number
acl api_only_ips src -f /etc/haproxy-shared/api-only-ips
acl is_internal_api path_beg /api/
acl is_area51 hdr(host) -i area51.stackexchange.com
acl is_kindle hdr_sub(user-agent) Silk-Accelerated

I would say that the first ACL here is one of the more important ones we have. Remember that stick-table we set up earlier? Well, this is where we use it. The ACL source_is_serious_abuse matches if your IP's connection rate tracked in the http-in table is greater than $some_number. I will show you what we do with this shortly when I get to the routing in the config file.

The next few ACLs are just examples of different ways that you can set up ACLs in HAProxy. For example, we check to see if your user agent has 'Silk-Accelerated' in it. If it does we put you in the is_kindle ACL.

Now that we have those ACLs set up, what exactly do we use them for?

    tcp-request connection reject if source_is_serious_abuse !source_is_google !rate_limit_whitelist
    use_backend be_go-away if source_is_abuser !source_is_google !rate_limit_whitelist

The first thing we do is deal with those connections that make it onto our abuse ACLs. The first line just denies the connection if you are bad enough to hit our serious abuse ACL – unless you have been whitelisted or are Google. The second one is a softer failure that throws up a 503 error if you are just a normal abuser – once again, unless you are Google or whitelisted.
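
We don't show the be_go-away backend in this post, but the usual way to get that kind of 'soft' 503 out of HAProxy is a backend with no servers in it, so anything routed there just gets served the 503 errorfile. A minimal sketch of what that might look like (an assumption on my part, not our actual config):

backend be_go-away
    mode http
    errorfile 503 /etc/haproxy-shared/errors/503.http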

The next thing we do is some request routing. We send different requests to different server backends.

    use_backend be_so_crawler if is_so is_crawler
    use_backend be_so_crawler if is_so is_crawler_ua
    use_backend be_so if is_so
    use_backend be_stackauth if is_stackauth
    use_backend be_openid if is_openid

    default_backend be_others

What this is doing is matching against ACLs that were set up above and sending you to the correct backend. If you don't match any of the ACLs you get sent to our default backend.

An Example Backend

Phew! That’s a lot of information so far. We really do have a lot configured in our HAProxy instances. Now that we have our defaults, general options, and front ends configured what does one of our backends look like?

Well they are pretty simple beasts. Most of the work is done on the front end.

backend be_others
    mode http
    bind-process 1
    stick-table type ip size 1000k expire 2m store conn_rate($some_time_value)
    acl rate_limit_whitelist src -f /etc/haproxy-shared/whitelist-ips
    tcp-request content track-sc2 src
    acl conn_rate_abuse sc2_conn_rate gt $some_value
    acl mark_as_abuser sc1_inc_gpc0 gt $some_value
    tcp-request content reject if conn_rate_abuse !rate_limit_whitelist mark_as_abuser

    stats enable
    acl AUTH http_auth(stats-auth)
    acl AUTH_ADMIN http_auth_group(stats-auth) $some_user
    stats http-request auth unless AUTH
    stats admin if AUTH_ADMIN
    stats uri /my_stats
    stats refresh 30s

    option httpchk HEAD / HTTP/1.1\r\nUser-Agent:HAProxy\r\nHost:serverfault.com

    server ny-web01 10.7.2.101:80 check
    server ny-web02 10.7.2.102:80 check
    server ny-web03 10.7.2.103:80 check
    server ny-web04 10.7.2.104:80 check
    server ny-web05 10.7.2.105:80 check
    server ny-web06 10.7.2.106:80 check
    server ny-web07 10.7.2.107:80 check
    server ny-web08 10.7.2.108:80 check
    server ny-web09 10.7.2.109:80 check

There really isn't too much to our back ends. We set up some administrative auth at the beginning. The next thing we do is, I think, the most important part: we specify with option httpchk where we want to connect when doing a check on the host to see if it's up.

In this instance we are just checking '/', but a lot of our back ends have a '/ping' route that gives more information about how the app is performing for our monitoring solutions. To check those routes we simply change 'HEAD /' to 'HEAD /ping'.
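
So a backend that has a '/ping' route would just swap its check line for something like this (same form as the check above, not copied from our config):

    option httpchk HEAD /ping HTTP/1.1\r\nUser-Agent:HAProxy\r\nHost:serverfault.com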

Final Words

Man, that sure was a lot of information to write, and process. But using this setup has given us a very stable, scalable and flexible load balancing solution. We are quite happy with the way that this is all set up, and it has been running smoothly for us.

Written by George Beech

March 25th, 2014 at 4:39 pm

Fun With PowerShell, WS-MAN, and Dell Servers

without comments

Recently I've been playing with using the WS-MAN protocol to gather information (and eventually run updates) on our Dell servers. It has actually been a fairly interesting project after I got through the pretty high learning curve to get started using WS-MAN.

First, what is WS-MAN? It's a management standard developed by the DMTF. What it really boils down to is giving us the ability to access and manipulate CIM providers via HTTP calls.

One of the interesting things Dell did with their systems in the past two generations (Gen 11 and 12) is to add something they call the Lifecycle Controller. They did not really make much information available on what you can do with it, or even how to really use it.

Recently I have been exploring what you can do with the Lifecycle Controller. And, quite honestly, you can do a ton of good stuff with it. Everything from getting system information to setting boot options, all the way up to updating all of the firmware on your box. This is all done through the WS-MAN protocol.

First I would suggest doing some reading so you can get the basic concepts of WS-MAN.

Phew, got through all that?

Let's start off with a nice code snippet that I have been working on, and then step through what it is doing.

$DELL_IDS = @{
    "20137" = "DRAC";
    "18980" = "LCC";
    "25227" = "DRAC";
    "28897" = "LCC";
    "159" = "BIOS"
    }

$pass = ConvertTo-SecureString "ThisIsMyPassword" -AsPlainText -Force
$creds = new-object System.Management.Automation.PSCredential ("root", $pass)
$wsSession = New-WSManSessionOption -SkipCACheck -SkipCNCheck

$svc_details = @{}

$base_subnet = "192.168.99."
$addrs = @(1..254)
foreach ($ip in $addrs)
{
    $base_subnet + $ip
    $s = [System.Net.Dns]::GetHostByAddress($base_subnet+$ip).HostName
    $fw_info = Get-WSManInstance 'cimv2/root/dcim/DCIM_SoftwareIdentity' -Enumerate -ConnectionURI https://$s/wsman -SessionOption $wsSession -Authentication basic -Credential $creds
    $svr_info = Get-WSManInstance 'http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_SystemView' -Enumerate -ConnectionURI https://$s/wsman -SessionOption $wsSession -Authentication basic -Credential $creds

    $svc_details.Add($s, @{})
    if($svr_info -eq $null)
    {
        $svc_details[$s].Add("Generation", "unknown probably 11G")
    }
    else
    {
        $svc_details[$s].Add("Generation", $svr_info.SystemGeneration.Split(" ")[0])
    }
    foreach ($com in $fw_info)
    {
        $DELL_IDS.ContainsKey($com.ComponentID)
        if($DELL_IDS.ContainsKey($com.ComponentID))
        {
            # need to see if I can update this to account for the different
            # way drac6 and 7's format this string
            $inst_state = $com.InstanceID.Split("#")[0].Split(":")[1]
            if (($inst_state -ne "PREVIOUS") -AND ($inst_state -ne "AVAILABLE"))
            {
                $svc_details[$s].Add($DELL_IDS[$com.ComponentID], $com.VersionString)
            }
        }
    }
}

The first part of this code is simply a hash table of Dell component IDs and an associated easy-to-remember name matching them with the component. How did I get those? Well, I queried the cimv2/root/dcim/DCIM_SoftwareIdentity namespace and parsed the output by hand to grab those IDs. They match up to BIOS, LCC v1, LCC v2, iDRAC 6 and iDRAC 7.
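
If you want to pull that list for yourself, it's the same Get-WSManInstance call used in the script – just select a few properties and read through the output. A sketch (ElementName comes from the CIM_SoftwareIdentity schema, so treat that property name as an assumption):

Get-WSManInstance 'cimv2/root/dcim/DCIM_SoftwareIdentity' -Enumerate `
    -ConnectionURI https://$s/wsman -SessionOption $wsSession `
    -Authentication basic -Credential $creds |
    Select-Object ComponentID, ElementName, VersionString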

$pass = ConvertTo-SecureString "ThisIsMyPassword" -AsPlainText -Force
$creds = new-object System.Management.Automation.PSCredential ("root", $pass)
$wsSession = New-WSManSessionOption -SkipCACheck -SkipCNCheck

This next section of code sets up our environment for Get-WSManInstance. First we need to convert our plaintext password into a secure string, then create a PSCredential object to use later so we don't have to enter our username and password over and over. Finally, we set up a new WS-MAN session options object so that it doesn't error out on the self-signed certificates we are using. If you are using fully trusted certificates on your DRACs you can skip this step and not specify the -SessionOption $wsSession flag later.

$fw_info = Get-WSManInstance 'cimv2/root/dcim/DCIM_SoftwareIdentity' -Enumerate -ConnectionURI https://$s/wsman -SessionOption $wsSession -Authentication basic -Credential $creds
$svr_info = Get-WSManInstance 'http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_SystemView' -Enumerate -ConnectionURI https://$s/wsman -SessionOption $wsSession -Authentication basic -Credential $creds

Note: You can specify either the DCIM path or the full schema URL; I'm showing both ways here. For the $svr_info variable, 'cimv2/root/dcim/DCIM_SystemView' would also work.

Now, we move on to the meat of what we are doing. These two lines grab the system information that we want to parse. $fw_info contains an XML object with all of the installed components as exposed by the DCIM_SoftwareIdentity endpoint, and the $svr_info variable contains an XML object that has some interesting system information – such as Server Generation, Express Service Code, Service Tag, and so on. I use these two pieces of information to parse out the Generation, DRAC, BIOS, and LCC firmware versions.

# need to see if I can update this to account for the different
# way drac6 and 7's format this string
$inst_state = $com.InstanceID.Split("#")[0].Split(":")[1]
if (($inst_state -ne "PREVIOUS") -AND ($inst_state -ne "AVAILABLE"))
{
    $svc_details[$s].Add($DELL_IDS[$com.ComponentID], $com.VersionString)
}

One last tricky bit. When you get back the versions that are installed, you will actually have two different versions: one that is the active version and one that is the rollback version. Unfortunately you need to parse the string to figure out which is which. And different DRACs use different string formats:

  • Drac6: DCIM:INSTALLED:PCI:14E4:1639:0236:1028:5.0.13
  • Drac7: DCIM:INSTALLED#802__DriverPack.Embedded.1:LC.Embedded.1
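
For what it's worth, the Split("#")[0].Split(":")[1] trick from the snippet above pulls the state out of both of these samples, since the state sits right after the leading 'DCIM' in each:

"DCIM:INSTALLED:PCI:14E4:1639:0236:1028:5.0.13".Split("#")[0].Split(":")[1]              # INSTALLED
"DCIM:INSTALLED#802__DriverPack.Embedded.1:LC.Embedded.1".Split("#")[0].Split(":")[1]    # INSTALLED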

Once I have this information in my two-dimensional array I can create reports and manipulate the information to tell me exactly what version each of my servers is at.
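
As a rough idea of what those reports can look like – this is a minimal sketch, not part of the original script – you can just walk the hashtable and print one line per server:

foreach ($server in $svc_details.Keys)
{
    # one line per server, e.g. "<hostname>: Generation=..., DRAC=..., BIOS=..."
    $parts = $svc_details[$server].GetEnumerator() | ForEach-Object { "$($_.Key)=$($_.Value)" }
    "{0}: {1}" -f $server, ($parts -join ", ")
}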

Sweet! Step one to automating the update of our firmware is complete! Next up: figure out how to automate the deployment and installation of new firmware.

Written by George Beech

February 26th, 2014 at 10:17 pm

Puppet Workflow with Vagrant

without comments

I've spent a good deal of my time working with puppet over the last few years. And like most other things, I've spent some time trying to optimize my workflow to avoid the annoying parts. Just like anything else, developing a good workflow for puppet has taken some time, and I'd like to share the workflow that we have come to use over at Stack Exchange, which seems to be pretty darn good.

Until recently our puppet dev workflow wasn't horrible, however it was a bit painful. Normally you would see something like this (PG version):

commit -> wait for CI to pick up commit -> build failed -> damn -> fix typo -> commit -> wait for build -> build failed -> grrr -> fix new error -> commit … and on and on.

O, that way madness lies; let me shun that;
No more of that.
-William Shakespeare

Our new way of doing things is much, much saner, which has in general raised team spirits!

The Dev Tier

We currently use Vagrant to do local dev work on new modules and changes to production modules. I have to say, once it clicked that I could use Vagrant for puppet it was like that first sip of rocket-fueled coffee in the morning. Everything suddenly became clear.

We have a Vagrantfile that uses two boxes – one is the client and the other is the server. They are actually separate boxes. The client is a bog-standard CentOS 6.4 minimal install. The server is the same but with all the puppetmaster bits set up and ready to go. This includes Apache + Passenger, and all the CA goodness. It's in a state just before the first run of puppetmasterd that creates the certificates.

First let's look at the master config. There isn't much special here but I do want to point out a few things.

# Setup the Puppet master
  config.vm.define :master do |master|
    master.vm.box = "centos64-puppetm"
    master.vm.box_url = "http://<internal_server>/vagrant/vagrant-sei-puppetm-centos.x64.vb.box"
    master.vm.hostname = "master.local"
    master.vm.synced_folder "../../puppet-dev", "/etc/puppet"
    master.vm.synced_folder "../../scripts", "/root/scripts"
    master.vm.network :private_network, ip: "172.28.19.20"
    master.vm.provision :shell, :path => "master.sh"
    # Customize the actual virtual machine
    master.vm.provider :virtualbox do |vb|
      # Uncomment this, and adjust as needed to add memory to vm
      vb.customize ["modifyvm", :id, "--memory", 2048]
      # Because Virtual box is stupid - change the default nat network
      vb.customize ["modifyvm", :id, "--natnet1", "192.168.0.0/16"]
    end
  end

The first two lines to highlight are the ones setting up synced folders. One maps to the local puppet development repo on your disk, and the second to a folder of our utility scripts.

This is actually where the real magic is. With our puppet development folder on our local host machine mounted at /etc/puppet, we can work in our local environment with all of the editing tools that we have come to love, and all we have to do is simply save our work and it is active in our Vagrant environment.

The next thing you will see highlighted is the master.vm.provision stanza. This is telling Vagrant to use a shell (bash) provisioner and to run the script it finds (master.sh, relative to the Vagrantfile) on the host as the box is starting up. There are a bunch of provisioners available for Vagrant. I'm using a shell provisioner here because I just want to do a couple of very, very basic things to get the box in working order.

service ntpd stop
ntpdate pool.ntp.org
service ntpd start
service httpd stop
service puppetmaster start
service puppetmaster stop
service httpd start

As you can see, there is not much going on here. Syncing the time, stopping Apache/Passenger, starting then immediately stopping puppetmaster to generate the certificates, then starting Apache/Passenger back up and we are ready to go.

The last two lines simply modify the base VM to add more memory (Vagrant defaults to allocating 512MB) and change the default NAT network. We had to do the second one because VirtualBox defaults to using 10.2.0.0/24 for the NAT network – which is a production network for us.

Our client configuration is almost exactly the same except the puppet folder is synced to /root/puppet for convenience and it pulls a different box down.

The provisioning script is also very basic for the client.

echo "172.28.19.20 puppet" >> /etc/hosts
service ntpd stop
ntpdate pool.ntp.org
service ntpd start
service puppet stop
rm -rf /var/lib/puppet/ssl/
service puppet start

The two biggest pieces here are lines one and six. Line one adds a hosts entry for the master puppet server so the client can bootstrap itself. And the sixth line makes sure any certs that might have been there are destroyed and lets puppet recreate them.
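
With both boxes described, the day-to-day loop ends up looking something like this (the module path and the puppet agent invocation are assumptions on my part; adjust for your own setup) – bring the boxes up, edit in your normal editor on the host, save, and kick off a run on the client:

vagrant up master client1
# edit something like ../../puppet-dev/modules/ntp/manifests/init.pp in your editor of choice, then:
vagrant ssh client1 -c 'sudo puppet agent --test'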

You can expand below to see the full source of our Vagrantfile

# -*- mode: ruby -*-
# vi: set ft=ruby :

# Vagrantfile which sets up one puppet master and one puppet client.
# Assumes "puppet-dev" and "scripts" repos are cloned into the same
# base directory.

Vagrant.configure("2") do |config|
  # All Vagrant configuration is done here. The most common configuration
  # options are documented and commented below. For a complete reference,
  # please see the online documentation at vagrantup.com.

  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "centos64"

  # Setup the Puppet master
  config.vm.define :master do |master|
    master.vm.box = "centos64-puppetm"
    master.vm.box_url = "http://<internal_server>/vagrant/vagrant-sei-puppetm-centos.x64.vb.box"
    master.vm.hostname = "master.local"
    master.vm.synced_folder "../../puppet-dev", "/etc/puppet"
    master.vm.synced_folder "../../scripts", "/root/scripts"
    master.vm.network :private_network, ip: "172.28.19.20"
    master.vm.provision :shell, :path => "master.sh"
    # Customize the actual virtual machine
    master.vm.provider :virtualbox do |vb|
      # Uncomment this, and adjust as needed to add memory to vm
      vb.customize ["modifyvm", :id, "--memory", 2048]
      # Because Virtual box is stupid - change the default nat network
      vb.customize ["modifyvm", :id, "--natnet1", "192.168.0.0/16"]
    end
  end

  # Setup the Puppet client. You can copy and modify this stanza to allow for
  # multiple client, just change all instances of 'client1' to another term
  # such as 'client2'
  config.vm.define :client1 do |client1|
    client1.vm.box = "centos64"
    client1.vm.box_url = "http://<internal_server>/vagrant/vagrant-sei-centos64.x64.vb.box"
    client1.vm.hostname = "client1"
    # Make puppet-dev accessable from the client for easier copying.
    client1.vm.synced_folder "../../puppet-dev", "/root/puppet"
    client1.vm.network :private_network, ip: "172.28.19.21"
    #client1.vm.network :forwarded_port, guest: 8100, host: 8100
    client1.vm.provision :shell, :path => "client.sh"
    client1.vm.provider :virtualbox do |vb|
      # Uncomment this, and adjust as needed to add memory to vm
      vb.customize ["modifyvm", :id, "--memory", 2048]
      # Because Virtual box is stupid - change the default nat network
      vb.customize ["modifyvm", :id, "--natnet1", "192.168.0.0/16"]
    end
  end

  if false
    config.vm.define :client2 do |client2|
      client2.vm.box = "centos64"
      client2.vm.box_url = "http://ny-man02.ds.stackexchange.com/vagrant/vagrant-sei-centos64.x64.vb.box"
      client2.vm.hostname = "client2"
      # Make puppet-dev accessable from the client for easier copying.
      client2.vm.synced_folder "../../puppet-dev", "/root/puppet"
      client2.vm.network :private_network, ip: "172.28.19.22"
      client2.vm.provision :shell, :path => "client.sh"
      # Customize the actual virtual machine
      client2.vm.provider :virtualbox do |vb|
        # Uncomment this, and adjust as needed to add memory to vm
        vb.customize ["modifyvm", :id, "--memory", 2048]
        # Because Virtual box is stupid - change the default nat network
        vb.customize ["modifyvm", :id, "--natnet1", "192.168.0.0/16"]
      end
    end
  end
 
end

So, what exactly does doing a dev setup like this help you with? It keeps you from having to constantly push to your testing environment to test every change. Which is huge – especially when you have a habit of missing typos that puppet-lint doesn't catch. You can smoke test on a local VM and iterate extremely quickly. Avoid madness, don't spam internal chat with build messages. It's a win-win-win to me. For those that don't know, my handle on the SE network is Zypher … so Nick is talking about me descending into madness.

[Image: puppet-madness]

Test and Prod

I’m going to talk about the test and prod environments together here. The only functional difference between the two is the puppet module code that is run in those environments so for the purposes of this exercise they are the same.

What happens after you are done working locally and have a working change? Well, that is simple: you push the change set(s) from your local clone to our mercurial server. You can obviously swap mercurial for any VCS you like.
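
Nothing fancy on that end; from the local clone it's the usual commit-and-push (mercurial shown here, and the commit message is just an example):

hg commit -m "ntp: fix typo in template"
hg push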

Once the changes have been pushed up to the server, our CI server (TeamCity) will pull the changes down and "build" them. Since nothing is really compiled with puppet, what it is actually doing is running through a battery of tests; if they pass, it pushes the changes out to our puppet masters:

  1. Run a bash script that checks the changeset against puppet validate (a rough sketch of this step follows the list)
  2. Run a bash script that generates the puppet docs for the changeset
  3. Push the changes to the puppet master servers
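
The validate step doesn't need to be anything clever. A rough sketch of the idea (not our actual CI script, and it checks every manifest in the checkout rather than just the changeset):

#!/bin/bash
set -e
# Fail the build if any manifest in the checkout doesn't parse.
find /path/to/puppet-checkout -name '*.pp' -print0 | xargs -0 puppet parser validate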

We have a bot in our internal chat that will post a message with the success/failure of the build and a link to the build info page inside TeamCity. Once you have a successful build, your change is live in Test or Prod depending on which repo you pushed to.

That’s about it. There is nothing all that special or groundbreaking here. All the magic really happens at the local dev stage with Vagrant. The sanity checks on the server side before deployment are really just to make extra sure since you worked all the bugs out, right?

I'm sure that there is yet more room for improvement to our process, but at this point I'm happy to say the descent into madness has been staved off for now.

Written by George Beech

July 25th, 2013 at 10:00 am

Time Flies

without comments

Excuse me for a bit while I get a little sentimental. Tomorrow (the 27th of September) marks my two year anniversary with Stack Exchange – it was still called Stack Overflow Internet Services when I joined! I realized this about two or three weeks ago when I was setting up my trip to Denver for the Opening Party. It just amazes me how quickly time flies when you are having fun.

It’s amazing to me how much the company has grown, how much has been accomplished, and the amazing people I’ve met while working here. It’s a little bit awe inspiring to look back at the day I started and what the company looked like then vs where we are at now.

On my first day the space at 55 Broadway hadn't been renovated yet; they had done some demo, but that was about it. It was a puke green color, and there was a lot of work that needed to be done. I think there were 10 people in the office when I started. We lived in one half of the office while they worked on the other half building things, then we would switch sides; it was just a mess of wires strewn across the floor and a bunch of desks clumped together (along with some lucky people that got to share the few offices!)

Now, we are rapidly running out of room at 55 Broadway and have opened an office in Denver, with other great plans to expand and grow.

As far as the sites, we’ve gone from around the 300 mark to flirting with breaking the top 100 on Quantcast since I’ve been here. There have been many fun challenges, and lots of hard work to get us to where we are today.

As I sit here at 10,000 ft. writing this all I can think of is how this is exactly what I envisioned all these years ago when I decided that at least once in my life I wanted to work for a start up. How awesome it is to have a dream come true, and have it be exactly what you thought it would be.

Written by George Beech

September 26th, 2012 at 11:16 pm

Getting the most out of your Sysadmin team

without comments

Disclaimer: For most of my career I've been what you would call an "Individual Contributor"; these are not observations from a managerial perspective, but from a team member perspective.

In the past 10 years or so I've worked under pretty much all your basic boss types. I've seen how people react to different stimuli and managerial styles. Just like our programmer cousins, we sysadmins can be a peculiar type of person, who isn't easily managed in the "traditional" style that is still pervasive throughout the business world. We are the type of people that are really good at analyzing a problem, and thinking of – sometimes very creative – ways to get around the problem.

When I look back at all of the situations that I have been the most productive in, and have seen others be the most productive in, one theme keeps coming up over and over again. That is: let your smart people think and have free rein while NOT overloading them. The latter part is really important; there is a tendency to find one or two really good people and put a supporting cast around them, thinking that the supporting cast will help out with some of the lower level things. While this seems like a great idea, in practice it never really turns out this way. You end up with only the most minor of duties being taken care of, and your rockstar having to go back and still do a lot of the low level things that take away from what you want them working on. Or, even worse, they have to go back and FIX problems caused by someone who doesn't know their limits.

Ideally you want a team of sysadmins that complement each other, so they can support each other when needed but don't step all over other people's toes. This doesn't mean you need a team of all-star generalists; it means you need a team of people that are good at what they do, know their limits and, most importantly, want to grow as professionals, not just stick to the same old things that they know.

So, once you have assembled this team, the question now becomes: how do I get the most out of them? How do I motivate them into doing the best that they can do and help them when they are struggling? These are not easy answers at all, and my ideas may not fit all situations; they are just some strategies that I have seen work really well in the teams I have been a part of that have worked well together.

Workload

The most dangerous thing for a team is to have people taking on too much work at once. The person in charge of the team – be it a team lead, manager, director, senior, etc. – should be aware of how much people are taking on. Additionally, the rest of the team should make sure that one person isn't taking on too much. If you see someone else on your team taking on too much, grab something from them and help out. A sign that a team is working well is when people are helping out and moving workload around as needed, without management intervention. Since the person in charge has a better idea of everything that is going on and how much work people are putting in, they should also keep an eye on the big picture and move work around as needed if people aren't doing it themselves.

Trust them to GTD

After people taking on too much work for themselves, the second biggest killer of productivity is constantly having to update someone on how things are going with Project X. Having someone bug you every day doesn't help you get things done; all it does is distract you from actually doing those things. Now, this doesn't mean that things should never be checked up on, but there should be a set schedule of when everyone checks in and gives status updates. This is what weekly meetings are good for. Outside of a normal time to check up on how things are going with your team there should be very little "how's xyz going? Have you done ABC yet?" Basically, constantly asking how things are going outside of a normal time feels like micromanagement, and also possibly like you don't trust those people. Both of these things are very bad for morale and productivity.

Realistic Priorities

Make sure that the team has a set of realistic priorities. Very rarely are things "not that important, get to them when you can", although they always seem to come up. This just means that they will never get done – there is always something more important that comes up. Setting realistic priorities also allows your smart guys to prioritize things correctly. Nothing sucks more than coming in first thing and getting jumped on by everyone over a task that had a low priority, but has now shut down the company. Alright, that may be an exaggeration, but surprises in the priority of things do not help anyone.

Communicate Decisions Permanently

When big(ish) decisions are made, don't leave it up to ephemeral communication methods – IM, group chat, voice – someone should write them down and send an email to the team with a descriptive title. That way people can ignore it if they want or don't really care about that piece, but you can always easily go back and do a quick email search to find out why something was done the way it was. This should not always fall to one person, but to the person actively working on that project. At a minimum these emails should contain what the decision was, why it was made that way, and who was involved. This helps prevent those "why did we do it this way, let's change it" moments and the loops of the same questions coming up over and over, annoying everyone.

Listen

Not only should the management of the team be listening and understanding what is going on, but the team should be listening to each other. If there are complaints or points of concern they should be acted on, not just swept under the rug or shot down out of hand. Remember the team is full of smart people; they don't have concerns for no reason. Sometimes they may be misguided or the person has missed something, but that is OK – we are human, we miss things. To constantly have your concerns disregarded, or even worse ignored, is not only detrimental to productivity but downright demoralizing.

Limit Surprises

Everyone on the team should be working to limit the amount of surprises to the rest of the team. You don't want someone to be surprised by a new setup that then affects what they are working on. This really boils down to having good communication on the team. Everyone should make sure that, as often as they can, other people on the team are not surprised by what they are working on.

Conclusion

Basically, what this really boils down to is that there needs to be trust in your team from above, and trust inside of your team between members. Really, these are the same tenets that make EVERY team work, not just sysadmin teams. If you want to get the best out of your smart guys, let them do their thing, help them when you see them struggling, and listen to their complaints and act on them.

Written by George Beech

June 4th, 2012 at 10:00 am