<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://todd.wells.ws/assets/xslt/rss.xslt" ?>
<?xml-stylesheet type="text/css" href="http://todd.wells.ws/assets/css/rss.css" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Todd Wells</title>
		<description>Blog about DevOps in the Cloud.  Special focus on ChatOps and SaltStack.</description>
		<link>http://todd.wells.ws/</link>
		<atom:link href="http://todd.wells.ws/feed.xml" rel="self" type="application/rss+xml" />
		
			<item>
				<title>If Corona Were the Flu</title>
				<link>http://todd.wells.ws/blog/if-corona-were-the-flu/</link>
				<pubDate>Sat, 04 Apr 2020 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;Trying to understand the coronavirus and how it is affecting society is an exercise in frustration. Sifting through all of the news, blogs, tweets, and gossip to get to accurate and useful data is hard. We are being told “Social Distance”, “Stay at Home”, “Flatten the Curve”, but no one is presenting clear data to back up these statements. So I dig in and try to understand the data for myself. This is my attempt to present the data in a picture that will help people realize why we need to take all these extreme measures.&lt;/p&gt;

&lt;h3 id=&quot;flu-hospitalization-and-mortality&quot;&gt;Flu Hospitalization and Mortality&lt;/h3&gt;
&lt;p&gt;So the first picture asks: what if corona were the flu?&lt;/p&gt;

&lt;p&gt;Let’s start with &lt;a href=&quot;https://www.cdc.gov/flu/about/burden/preliminary-in-season-estimates.htm&quot;&gt;flu data from the CDC&lt;/a&gt;. I have picked numbers within the ranges specified, toward the top of each range.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;50,000,000 infections (15% of US population)&lt;/li&gt;
  &lt;li&gt;600,000 hospitalizations (1.2% of infections)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;60,000&lt;/strong&gt; deaths (0.12% of infections)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;covid-19-same-infection-rate-as-flu-low-mortality&quot;&gt;COVID-19 Same Infection Rate as Flu, Low Mortality&lt;/h3&gt;
&lt;p&gt;And now imagine we leave the infection rate the same but change the hospitalization and death rates to those of COVID-19. Based on the &lt;a href=&quot;https://www.cdc.gov/mmwr/volumes/69/wr/mm6912e2.htm?s_cid=mm6912e2_w#T1_down&quot;&gt;COVID-19 data from CDC&lt;/a&gt;, the hospitalization and death percentages (using the lowest percentage in each range) are 21% and 1.8% respectively. I’m actually going to lower the death rate further, to only 1% for now. With that, here is what COVID-19 would look like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;50,000,000 infections (15% of US population)&lt;/li&gt;
  &lt;li&gt;10,500,000 hospitalizations (21% of infections)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;500,000&lt;/strong&gt; deaths (1% of infections)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;covid-19-conservative-infection-rate-low-mortality&quot;&gt;COVID-19 Conservative Infection Rate, Low Mortality&lt;/h3&gt;
&lt;p&gt;Based on the assumption that COVID-19 had the same infection rate as the flu, and that our medical system could handle the increased hospitalization rate, the US would lose half a million people. But those are not valid assumptions.&lt;/p&gt;

&lt;p&gt;COVID-19 is much more contagious than flu.  After watching &lt;a href=&quot;https://www.youtube.com/watch?v=PWzbArPgo-o&quot;&gt;Ninja Nerd Science&lt;/a&gt; I am going with a conservative estimate of twice as contagious, doubling our infections.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;100,000,000 infections (30% of US population)&lt;/li&gt;
  &lt;li&gt;21,000,000 hospitalizations (21% of infections)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;1,000,000&lt;/strong&gt; deaths (1% of infections)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;covid-19-conservative-infection-rate-conservative-mortality-for-overloaded-hospitals&quot;&gt;COVID-19 Conservative Infection Rate, Conservative Mortality for Overloaded Hospitals&lt;/h3&gt;
&lt;p&gt;And our medical system cannot handle that patient load, so the death rate will go up as people cannot get the medical treatment they require. Let’s also make a conservative change and put the death rate back at the 1.8% from the CDC data.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;100,000,000 infections (30% of US population)&lt;/li&gt;
  &lt;li&gt;21,000,000 hospitalizations (21% of infections)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;1,800,000&lt;/strong&gt; deaths (1.8% of infections)&lt;/li&gt;
&lt;/ul&gt;
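&lt;p&gt;The arithmetic behind these scenarios is just one multiplication per line. A minimal sketch (my own helper, not CDC code, using the rates quoted above) makes it easy to re-check or re-run with different assumptions:&lt;/p&gt;

```python
# A small helper (my own, not CDC code) for the scenario arithmetic above,
# using the infection counts and rates quoted from the CDC pages.
def scenario(infections, hosp_rate, death_rate):
    """Compute hospitalizations and deaths from an infection count."""
    return {
        "infections": infections,
        "hospitalizations": round(infections * hosp_rate),
        "deaths": round(infections * death_rate),
    }

flu            = scenario(50_000_000, 0.012, 0.0012)   # flu baseline
covid_low      = scenario(50_000_000, 0.21, 0.01)      # flu infection rate
covid_2x       = scenario(100_000_000, 0.21, 0.01)     # twice as contagious
covid_overload = scenario(100_000_000, 0.21, 0.018)    # overloaded hospitals

print(flu["deaths"])             # 60000
print(covid_overload["deaths"])  # 1800000
```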

&lt;h3 id=&quot;comments&quot;&gt;Comments&lt;/h3&gt;
&lt;p&gt;Even the conservative picture of COVID-19 without any “extreme measures” is pretty grim. The actual picture if we did not try to reduce the infection rate and prevent hospitals from being completely overloaded would likely be much worse.&lt;/p&gt;

&lt;p&gt;I am feeling a little stir crazy after only two weeks of “Stay at Home”. And practicing “Social Distance” when I do have to go out for food is awkward, physically difficult in the typical grocery store, and emotionally draining. But isn’t the cost of staying home worth it to save over a million lives?&lt;/p&gt;

&lt;p&gt;And how many lives we save depends on how well we do “Stay at Home” and “Social Distance”. To achieve what the CDC is projecting (100,000 to 200,000 deaths) we have to reduce the infection rate well below that of the flu, which means potentially months of even stricter “Stay at Home” and “Social Distance”. Do I want to do this? No. Do I think it necessary? Yes.&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/if-corona-were-the-flu/</guid>
			</item>
		
			<item>
				<title>AWS MySQL Choices:  MySQL, RDS, Aurora oh my</title>
				<link>http://todd.wells.ws/blog/aws-mysql-choices/</link>
				<pubDate>Sat, 04 Feb 2017 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;My migration to AWS is to the point where I have to decide on what database to use. Or should I say, what version of a database to use. Our current DB is MySQL, so I want to stick with compatible options. And there are three: self-managed MySQL, RDS MySQL, and RDS Aurora. Which to choose?&lt;/p&gt;

&lt;p&gt;My Ubuntu servers running self-managed MySQL perform their job just fine. Our only significant DB-related outage in the last year was running out of space (oops, my bad). So from a basic reliability standpoint there is no problem with self-managed MySQL. The issues come in when you add backups &amp;amp; replication, which are a given for most organizations. Developing the processes and procedures for performing backups and for creating slaves took work, and it takes ongoing work. Replication issues have caused me to rebuild two MySQL slaves this year alone and many more last year. So I have that process down, but I would rather not have to deal with it at all. The story is similar with backups. Time spent fixing backup issues and replication failures accounts for over 50% of production maintenance. I want to outsource all of this backup &amp;amp; replication maintenance, and both RDS MySQL and RDS Aurora do that. Backups … check the box. Creating a slave in another AZ … check the box. Oh, and let’s add in encryption … check the box.&lt;/p&gt;

&lt;p&gt;So RDS MySQL or Aurora?&lt;/p&gt;

&lt;p&gt;RDS is MySQL with all of the automation/maintenance I am looking for already in place. The 20-30% cost increase over an EC2 instance of the same size is well worth it for not having to do all that maintenance. But in the end it is just MySQL servers with a lot of automation built around them. Aurora, on the other hand, is built differently: 6X data replication, a dedicated storage subsystem, rapid replica creation. Maybe call it Oracle RAC Lite. Which to me is a good thing. I managed Oracle RAC for many years and the maintenance effort required was way higher than for a similar number of MySQL servers and slaves. RAC was cumbersome, required specialized DBAs to keep it running, and of course cost many arms and legs.&lt;/p&gt;

&lt;p&gt;Technically Aurora looks to be a better solution than RDS MySQL, so the decision comes down to price. Can I afford Aurora? When I first looked at Aurora the answer was no, but that was primarily because there were no smaller instance sizes. An r3.large at about $210/month was small enough for my production instances, but too much (and too costly) for development/test instances and special-purpose slaves. And I would rather not manage PROD in Aurora and Dev/Test in RDS MySQL. Looking more recently I see that a t2.medium is now an option, at a $60/month price point. So I need to re-evaluate.&lt;/p&gt;

&lt;p&gt;And that re-evaluation led me to go over the Aurora documentation much more thoroughly. When I did, I found this way down at the bottom of the pricing page:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;I/O Rate	$0.200 per 1 million requests
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Wow, good catch. No longer is a DB server a fixed cost (add up instance and storage costs). Looking online I see many people have been surprised by a large and unexpected I/O cost on their first bill. Not going to make that mistake. An AWS support chat asking for help estimating Aurora I/O based on my current MySQL servers did not give me much information on how they might differ, so I am going to assume MySQL IOPS and Aurora I/O requests will be about the same. A quick calculation shows me that an average of 100 IOPS sustained over a month adds about $50 in I/O costs. My monitoring (New Relic) shows me only % utilization, so I guess I have to go old school: install and enable sysstat. I’ll let you know how it turns out.&lt;/p&gt;
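&lt;p&gt;The quick calculation is simple enough to sketch. The $0.200 per million requests rate is from the pricing page; the one-to-one mapping of MySQL IOPS to Aurora I/O requests is my assumption, as noted above:&lt;/p&gt;

```python
# Back-of-the-envelope Aurora I/O cost estimate. Assumes Aurora I/O
# requests track MySQL IOPS one-to-one (my assumption, not AWS guidance).
PRICE_PER_MILLION_REQUESTS = 0.200   # from the Aurora pricing page
SECONDS_PER_MONTH = 3600 * 24 * 30

def monthly_io_cost(avg_iops):
    """Monthly I/O charge for a sustained average request rate."""
    requests = avg_iops * SECONDS_PER_MONTH
    return requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS

print(round(monthly_io_cost(100), 2))  # 51.84 -- about $50/month
```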
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/aws-mysql-choices/</guid>
			</item>
		
			<item>
				<title>Can SaltStack do AWS, what's in a name</title>
				<link>http://todd.wells.ws/blog/aws-whats-in-a-name/</link>
				<pubDate>Fri, 27 Jan 2017 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;Two days after &lt;a href=&quot;https://github.com/saltstack-formulas/aws-formula&quot;&gt;aws-formula&lt;/a&gt; went live, I found it was broken. Because of names. More precisely, the collision of names in different VPCs. Aren’t VPCs completely independent? Well yes, except for names, it appears.&lt;/p&gt;

&lt;p&gt;The normal methodology for a salt state is to check if something needs to be updated and then update it if it does. It appears that this methodology breaks down in the AWS state/module. I think this is due partly to how AWS is structured and partly to difficulties with the boto module.&lt;/p&gt;

&lt;p&gt;The aws-formula uses object names to create all objects. One reason for this is the lack of a mechanism for capturing the result of one state (create Internet Gateway) and using part of that captured value (the Internet Gateway ID) as an input to another state (create Route Table). Ansible’s &lt;code class=&quot;language-text highlighter-rouge&quot;&gt;Registered Variable&lt;/code&gt; can accomplish this, but it is somewhat messy: the registered variable has to be manipulated in a command to parse out the desired part of the output, and that parsed-out value captured in another variable before being used in subsequent commands. Using names instead of IDs for all AWS objects is a much cleaner methodology for creating a VPC and all its required components. It removes all the complexity of parsing IDs out of state results.&lt;/p&gt;

&lt;p&gt;AWS uses a tag to capture the name for objects.  In VPCs, this includes: Subnets, Route Tables, Internet Gateways, Network ACLs, Security Groups.   Multiple objects of these types can have the same name. Here is an example:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/aws-route-table-duplicate-name.png&quot; alt=&quot;AWS routing tables in same VPC with the same name&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The boto modules appear to get caught by this as well. I have two VPCs in the same region. Neither has a routing table named my_route_table.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/aws-two-vpc.png&quot; alt=&quot;two vpc&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/aws-no-route-table.png&quot; alt=&quot;no my_route_table &quot; /&gt;&lt;/p&gt;

&lt;p&gt;Yet when I run the following states, instead of both creating a route table, only one does.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;boto_vpcProdEast5_my_route_table&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;boto_vpc.route_table_present&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;my_route_table&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;vpc_name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;vpcProdEast5&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;keyid&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;XXXXXXXXXXXXXXX&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;region&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;us-east-2&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;boto_vpcProdEast6_my_route_table&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;boto_vpc.route_table_present&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;my_route_table&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;vpc_name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;vpcProdEast6&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;keyid&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;XXXXXXXXXXXXXXX&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;region&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;us-east-2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The results show that the second state makes no changes because there is already a routing table by that name present: the one in the other VPC.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;s&quot;&gt;----------&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;ID&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;boto_vpcProdEast5_my_route_table&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;boto_vpc.route_table_present&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;Name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;my_route_table&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Result&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;True&lt;/span&gt;
     &lt;span class=&quot;s&quot;&gt;Comment&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Route table my_route_table created.&lt;/span&gt;
     &lt;span class=&quot;s&quot;&gt;Started&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;04:52:35.891737&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;Duration&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;1887.427 ms&lt;/span&gt;
     &lt;span class=&quot;s&quot;&gt;Changes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
              &lt;span class=&quot;s&quot;&gt;----------&lt;/span&gt;
              &lt;span class=&quot;s&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
                  &lt;span class=&quot;s&quot;&gt;----------&lt;/span&gt;
                  &lt;span class=&quot;s&quot;&gt;route_table&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
                      &lt;span class=&quot;s&quot;&gt;rtb-1224827b&lt;/span&gt;
              &lt;span class=&quot;na&quot;&gt;old&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
                  &lt;span class=&quot;s&quot;&gt;----------&lt;/span&gt;
                  &lt;span class=&quot;s&quot;&gt;route_table&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
                      &lt;span class=&quot;s&quot;&gt;None&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;----------&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;ID&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;boto_vpcProdEast6_my_route_table&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;boto_vpc.route_table_present&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;Name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;my_route_table&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Result&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;True&lt;/span&gt;
     &lt;span class=&quot;s&quot;&gt;Comment&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Route table my_route_table (rtb-1224827b) present.&lt;/span&gt;
     &lt;span class=&quot;s&quot;&gt;Started&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;04:52:37.779406&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;Duration&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;239.461 ms&lt;/span&gt;
     &lt;span class=&quot;s&quot;&gt;Changes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;

&lt;span class=&quot;s&quot;&gt;Summary for local-core-api-prod-1&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;------------&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;Succeeded&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;2 (changed=1)&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;Failed&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;    &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;------------&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;Total states run&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;Total run time&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;   &lt;span class=&quot;s&quot;&gt;2.127 s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since the state requires both the routing table name and the VPC it is in, I would expect the above to work: the combination of vpc_name and name (the routing table name) is unique. But apparently that combination is not what gets checked. The method route_table_present in states/boto_vpc.py is called, which eventually calls _find_resources in modules/boto_vpc.py, and _find_resources uses only the name tag to filter the list of routing tables. Until that filter also includes the VPC, we will not be able to use the same object name in multiple VPCs.&lt;/p&gt;
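&lt;p&gt;To illustrate, here is roughly what the lookup needs to do, as a standalone sketch with made-up route table data (this is not the actual Salt module code):&lt;/p&gt;

```python
# Sketch of filtering route tables by BOTH the Name tag and the VPC,
# instead of by the Name tag alone as _find_resources currently does.
# The dicts below are simplified, made-up stand-ins for the boto results
# (rtb-1224827b appears in the state output above; the rest is invented).
def find_route_tables(tables, name, vpc_name):
    return [
        t for t in tables
        if t["vpc_name"] == vpc_name and t["tags"].get("Name") == name
    ]

tables = [
    {"id": "rtb-1224827b", "vpc_name": "vpcProdEast5",
     "tags": {"Name": "my_route_table"}},
    {"id": "rtb-99aa00bb", "vpc_name": "vpcProdEast6",
     "tags": {"Name": "my_route_table"}},
]

# Filtering on the name alone would match both tables; adding the
# VPC disambiguates them.
match = find_route_tables(tables, "my_route_table", "vpcProdEast6")
print(match[0]["id"])  # rtb-99aa00bb
```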

&lt;p&gt;In the past I have sometimes used globally unique names … Security Groups are an example. Those needed to be globally unique since a Security Group can be referenced across a VPC Peering connection. I guess I need to make this more general. So for now, making all VPC object names globally unique is going to be the SaltStack best practice for boto_vpc states.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/saltstack-formulas/aws-formula&quot;&gt;aws-formula&lt;/a&gt; has been updated to include this best practice. All states automatically append the VPC name to all object names so you don’t have to clutter the pillar with repeated references to the VPC name.&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/aws-whats-in-a-name/</guid>
			</item>
		
			<item>
				<title>Can SaltStack do AWS, the VPC</title>
				<link>http://todd.wells.ws/blog/can-saltstack-do-aws/</link>
				<pubDate>Sun, 22 Jan 2017 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;We started on Rackspace back when it was still trying to compete with AWS in IaaS. As a startup, Rackspace worked just fine. But if a company grows and needs more while its IaaS vendor is not growing its offering, there comes a tipping point. Security was our tipping point: we could not get the security we needed without either expensive dedicated hardware or a way too convoluted network architecture. And so we are moving to a new IaaS vendor, AWS.&lt;/p&gt;

&lt;p&gt;To get started at AWS I created a VPC by hand: subnets, routing tables, gateways, security groups. Add in a salt server and I am able to use Salt to spin up and manage my AWS servers. And I thought, like many others before me, “I can turn this infrastructure into a CloudFormation template”. And I did, all 1793 lines of it, in JSON. Since everything I do in SaltStack is YAML, I would much rather use YAML. The AWS export tool provides JSON, but CloudFormation will take either YAML or JSON. A web translator and a half hour later I have a YAML file, only 1214 lines long. Dropping the 588 lines that contained only opening or closing brackets/braces makes it more readable, but it is still 1214 lines I have to go through and change references into CloudFormation variables so that I can use this as a template for another VPC. Which could take days (per a friend who did this same thing, it took 2 days).&lt;/p&gt;

&lt;p&gt;Should I invest two days in this, with the result being a template that creates exactly one VPC design? Or should I try to use the native SaltStack states? When I looked at this a couple of years ago the answer was to go with the CloudFormation template. And when I look now I don’t find much out there to indicate people are using states. I find a few questions on how to do a VPC using salt states, but it seems like more people are using boto_cfn to run their CloudFormation template from within SaltStack. And there are no SaltStack formulas that actually create a VPC and its base components like routing tables, gateways, etc.&lt;/p&gt;

&lt;p&gt;Are there no formulas because people cannot make it work? Or is it just that no one has taken the time to write one? Looking at the SaltStack documentation, it appears the hard-working SaltStack staff and volunteers have gotten all the pieces in place to make states possible. So it looks like option two. I guess I will have to try to write one.&lt;/p&gt;

&lt;p&gt;And a couple of days later I have a formula that works. Well, it works 95%. I get an error when trying to use nat_gateway_subnet_name inside a boto_vpc.route_table_present. So a SaltStack issue was created and I found a workaround until that error gets fixed. The workaround means a two-run solution: the first formula run creates all the pieces; then copy the NAT Gateway ID into the pillar and re-run the formula to add the NAT Gateway as the default for the internal subnets’ routing table rules.&lt;/p&gt;

&lt;p&gt;Two days invested in creating an aws-formula is much better than creating an inflexible CloudFormation Template. And the &lt;a href=&quot;https://github.com/saltstack-formulas/aws-formula&quot;&gt;aws-formula&lt;/a&gt; is now live on salt-stack-formulas. The sample pillar shows the creation of a three tier, three Availability Zone VPC.  It creates:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Key Pairs&lt;/li&gt;
  &lt;li&gt;VPC&lt;/li&gt;
  &lt;li&gt;Internet Gateway&lt;/li&gt;
  &lt;li&gt;Subnets ( web, app, db X 3 AZ)&lt;/li&gt;
  &lt;li&gt;NAT Gateways in the web subnet of each AZ&lt;/li&gt;
  &lt;li&gt;Public and private routing tables for each AZ&lt;/li&gt;
  &lt;li&gt;Add default routes with public and private routing tables ( Internet Gateway for public, NAT Gateway for private )&lt;/li&gt;
  &lt;li&gt;Associate Subnets with Routing Tables&lt;/li&gt;
  &lt;li&gt;Security Groups ( Web, App, Salt-master, openVPN, ipsec VPN)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are no servers in this, but those will all be created by Salt. I sure wish we had VPC peering between regions; that would allow this to really be complete. But alas, not yet. So those IPsec-based VPN servers still have to be created by Salt and the routes then added to the AWS pillar.&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/can-saltstack-do-aws/</guid>
			</item>
		
			<item>
				<title>Can Ops use Dev Techniques Part 3, Deployment Workflow</title>
				<link>http://todd.wells.ws/blog/dev-use-ops-technique-part3/</link>
				<pubDate>Tue, 13 Sep 2016 00:00:00 +0000</pubDate>
				<description>&lt;h4 id=&quot;piece-3-deployment-workflow&quot;&gt;Piece 3: Deployment Workflow&lt;/h4&gt;

&lt;p&gt;The last piece of the development “process” I want to look at is the Deployment Workflow: what is the process for getting code from the developer onto a physical server where customers can use it? Generally speaking, companies have a set of environments that code flows through before it gets into production. Let’s keep this simple with three environments:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;development ( dev )&lt;/li&gt;
  &lt;li&gt;quality assurance ( qa )&lt;/li&gt;
  &lt;li&gt;production ( prod )&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After code is written it gets deployed to a dev server where the newly created features are tested individually to make sure they work as expected.  Then the code is promoted to a qa server where the overall application is tested to make sure new changes did not affect the rest of the application in ways we did not expect.  And once this is validated, the code goes to a prod server.&lt;/p&gt;

&lt;p&gt;Can this also work for the operations automation code? We first need to work through differences in environment definitions. My automation server deploys code to dev, qa, and prod, so this server is partly prod, partly qa, partly dev. Anyone in Sysadmin or DevOps is used to this definitional difficulty; the environment lines are not so clear cut when it comes to Ops servers. But it is clear that our automation code needs to be tested, so it is better to create a new set of environment definitions for the Ops servers/services. Here are mine:&lt;/p&gt;

&lt;p&gt;My SaltStack automation server has three environments; let’s call them test, stage, and prod. Test is where I try out the SaltStack pillars and states I am updating. I create servers just for the purpose of testing these changes. Since this is only for testing salt code, the developers do not have any access to this environment. I use this environment for both testing individual features and testing the overall application, but since it is done separately from the “real world” it is roughly the equivalent of dev.&lt;/p&gt;

&lt;p&gt;Once tested there, SaltStack code is moved to the stage environment. All dev environment servers are in this automation environment, so the automation code is now in a real world with users and real code being run. From an environment standpoint this is roughly equivalent to qa. It also means that either an automation change or a dev code change could be responsible for a Dev feature not working. I have not found a way to work around this, and I don’t think we really need to, because this is not really different from front end and back end engineers testing their new features together and hitting a problem: the issue has to be researched to determine which team’s code the problem is in.&lt;/p&gt;

&lt;p&gt;Now that the SaltStack code is tested in the “real world”, it gets promoted to prod. The SaltStack prod environment contains both the qa and prod servers. Using the same automation code keeps qa and prod as similar as possible, ensuring the validity of our qa testing before our Dev code goes to prod.&lt;/p&gt;

&lt;p&gt;Using Ops-specific environments, the automation code workflow looks very similar to the Dev code workflow. And so the answer to “can Ops use the same process as Dev” is, theoretically, yes.&lt;/p&gt;

&lt;p&gt;To make git flow work practically for my automation code, I use branches corresponding to the described environments.  I have three directories on the salt server.  Each directory is a copy of the automation repository set to a different branch.  Then in my master config I use Salt &lt;strong&gt;file_roots&lt;/strong&gt; to connect a directory to a salt environment and the &lt;strong&gt;top&lt;/strong&gt; file to connect servers to the appropriate environment.  The result of using a branch per environment is that my git flow has three persistent branches ( develop, stage, master ) instead of the typical two ( develop, master ).&lt;/p&gt;

&lt;p&gt;Just one more part of this series to go, where I look at the questions “Will Ops use the Dev processes?” and “Why is it important for Ops to do so?”&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/dev-use-ops-technique-part3/</guid>
			</item>
		
			<item>
				<title>Can Ops use Dev Techniques Part 2, Code Workflow</title>
				<link>http://todd.wells.ws/blog/dev-use-ops-technique-part2/</link>
				<pubDate>Thu, 08 Sep 2016 00:00:00 +0000</pubDate>
				<description>&lt;h4 id=&quot;piece-2-code-workflow&quot;&gt;Piece 2: Code Workflow&lt;/h4&gt;

&lt;p&gt;Another piece of the the development “process” the the code workflow.  We have bunch of developers writing code with the goal of providing new and hopefully useful features to our customers.  So we need a way to organize how all these separate pieces of code get merged into that finished product.  This is code management.  After a couple of iterations, code management now typically uses a distributed code repository where multiple people can edit a piece of code and changes are added to the code base and then merged with other versions.  the most common tool for code management is &lt;a href=&quot;https://git-scm.com/&quot;&gt;git&lt;/a&gt;. But it is a tool with much flexibility, so like the previous post a framework/methodology is needed.  With git these are usually called &lt;strong&gt;branching strategies&lt;/strong&gt;.  And a popular one that our organization uses is &lt;strong&gt;git flow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With git flow, one or more developers work on a branch created for a particular feature. These branches can be deployed to servers and tested. Once the feature is complete it is merged into the develop branch along with other features. At some point, usually at the end of a sprint, a release branch is created to test all the new features together. There will likely be changes needed to get them all working, which are made on the release branch. Once it is fully functional and thoroughly tested, the release branch is merged into production (master) and also back into develop.&lt;/p&gt;

&lt;p&gt;So can this development code workflow be used by operations too?  Well, if the ops staff ( or the operations focused members of the DevOps team, if you like those semantics better ) is doing everything by hand ( or with a loosely organized assortment of bash scripts ), then the answer is no. So let’s assume that Ops is using an automation tool to replace manually typed commands with “recipes” for creating servers, customizing the configuration and services available on those servers and then deploying our applications to those servers.  Most of the automation tools ( like my current one, &lt;a href=&quot;https://saltstack.com/&quot;&gt;Saltstack&lt;/a&gt; ) have these instructions written in files organized into a directory structure. Sounds a lot like “code”, hence the term Infrastructure as Code ( IaC ).&lt;/p&gt;

&lt;p&gt;If Ops has automation code and more than one person working on it, then Ops is in the same code management boat as development.  So it seems reasonable that they manage it just like Dev.&lt;/p&gt;

&lt;p&gt;Unlike Scrum, which really requires the full team to participate to make it work, a code workflow like git flow can be used by a few people or even a single person.  I found it pretty easy to incorporate into my daily practice and it is working well for my automation code. There are some ops specific issues you will have to work through, like how to securely put passwords and other sensitive data into the automation repository, but those are doable details.&lt;/p&gt;

&lt;p&gt;The next post covers the last piece:  Deployment Workflow&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/dev-use-ops-technique-part2/</guid>
			</item>
		
			<item>
				<title>DevOps, Plums and Apricots or Pluots. Can Ops use Dev Techniques?</title>
				<link>http://todd.wells.ws/blog/devops-plums-apricots-or-pluots/</link>
				<pubDate>Tue, 06 Sep 2016 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;Is DevOps shoving plums and apricots into a bowl and pretending they are the same thing or is it breeding something new and giving it a new name like pluot?  When DevOps was coined in 2009, it was a label for a bottom up movement:  people in the IT trenches trying to figure out how to make things work better.  Agile programming had changed development and Ops was having trouble keeping up.  In the last 6 years the term has definitely gone corporate but retains its roots in the trenches.   So as a trench dweller I have to wrestle with the practicalities of DevOps every day.  Today I am wanting to see if the DevOps theory that Ops can use the same processes as Dev really works in practice.&lt;/p&gt;

&lt;p&gt;When I search Google for “devops definition” the first result is &lt;a href=&quot;https://theagileadmin.com/what-is-devops/&quot;&gt;The Agile Admin&lt;/a&gt;.  There are two definitions listed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support.&lt;/li&gt;
  &lt;li&gt;DevOps is also characterized by operations staff making use of many of the same techniques as developers for their systems work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Can an operations staff managing servers, creating automation, supporting CI/CD solutions, and dealing with middle of the night outages use the same techniques as developers?  And if Ops can, will they?  Why is it important that they do? Let’s start by tearing development into some pieces to see if each piece can also be used by operations.  Today is piece one.&lt;/p&gt;

&lt;h4 id=&quot;piece-1-development-methodologiesframeworks&quot;&gt;Piece 1: Development Methodologies/Frameworks&lt;/h4&gt;
&lt;p&gt;One piece of the development “process” is the methodology and/or framework used. At their essence, these methods/frameworks are ways to organize how work gets done. Many organizations are using some form of Agile, frequently Scrum. For a team using Scrum, business requirements get turned into features that are broken down into small pieces.  The team prioritizes those pieces and selects a subset they think can be completed within a defined period of time.  The selected pieces are divided amongst the staff assigned to the project, who work on and hopefully complete them within that time period. At the end of the time period, the team reviews the results and makes plans for the next defined period of time. And the process repeats.&lt;/p&gt;

&lt;p&gt;Can the above work for Ops? As an Ops person, I don’t have a Product Manager or a Product Owner determining business requirements for the infrastructure. But I can look at the infrastructure as it exists today and make a list of what pieces are missing and what can be improved.  This list and feedback from management allow me to organize and prioritize my work.  And I do need to organize my work.  So there is nothing that prevents me, AKA Ops, from using something like Scrum.&lt;/p&gt;

&lt;p&gt;Am I using Scrum for my Ops work?  At this point, not really. I work daily with the developers in a DevOpsy way, but we are still trying to figure out the whole sprint thing after a period of rapid growth. Teams using a framework like Scrum have to go from framework to practice by trying things and seeing what works and what does not, then trying other things.  We tried a new tool for organizing our sprints that did not work well, and Scrum suffered.  So we are changing that tool and re-booting our Scrum process.&lt;/p&gt;

&lt;p&gt;I have been here before and know that this is a normal part of starting with a framework and ending up with functional practices.  It is messy, it takes time, but it eventually gets figured out.  And I think the end result is much better than starting with a more rigid methodology and trying to conform our practices to that methodology.  And so I think I will be back to using Scrum soon.&lt;/p&gt;

&lt;p&gt;The next piece:  Code Workflow.  Stay tuned.&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/devops-plums-apricots-or-pluots/</guid>
			</item>
		
			<item>
				<title>Chatops at 6 months. How has it changed development. Getting There</title>
				<link>http://todd.wells.ws/blog/chatops-at-6-months-getting-there/</link>
				<pubDate>Wed, 24 Aug 2016 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;Using ChatOps to move from Ops-Serve to Self-Serve obviously took some technical work.  But as stated before I am focusing on the developer in this series, so I want to look at what it took to transition the development team to using this new tool.&lt;/p&gt;

&lt;p&gt;Changing our thinking and consequently our actions always takes work.  So as the implementer/expert/guru of something new I need to have a plan.  What can I do to help my team members in their process of thinking change? For me it means getting to the aha moment, where something clicks and I get the first glimpse of the bigger picture.  That glimpse helps drive me forward in learning and exploring, leading to thinking change.  It might not be the same aha for everyone, but I think we usually have a tipping point where our driving force changes and becomes more internal.&lt;/p&gt;

&lt;p&gt;An important thing to realize and plan for is the process of change takes time.  I implemented ChatOps and told the team about the new tool and what it could do.  Some people were excited and started trying it out right away. But not everyone did. Both are to be expected and both are fine. It is much better to give people the freedom to move at their own pace and to get to their personal aha moment than to force something on them, particularly forcing it on them “for their own good.”  Since this tool was designed to help my team mates, I want them to see the value in it and use it because it is valuable.&lt;/p&gt;

&lt;p&gt;Now the tool is launched, the team knows about it and some are trying it out.  Time to put extra effort into those early adopters. For me this generally meant three things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Talking with them and learning what they want to do, then giving them example commands to do it.&lt;/li&gt;
  &lt;li&gt;Looking at the log of ChatOps commands to see what they were doing and giving them suggestions on how to do it better.&lt;/li&gt;
  &lt;li&gt;Looking at the log of invalid commands to see what they were not able to do because the syntax was unclear.  I would help them with the syntax and then try to update the ChatOps help so the syntax was clearer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Just doing these three things over a month or two helped the early adopters see the value in the tool and how it could improve their workflow.  This is thinking change.  It was usually followed by a change in action, which in this case was incorporating ChatOps into their workflow.  Doing the above also led to thinking change in me.  When I learned how people wanted to do something, my brain would churn on how to change existing commands or add new ChatOps commands to make it easier, and then I would go make the changes.  And those improvements helped the early adopters to aha faster and to ask if it could do xxxx.  And this circle continued.&lt;/p&gt;

&lt;p&gt;Once there were a few people using ChatOps as part of their development workflow, my role became much less active.  These early adopters used ChatOps as they collaborated, meaning the other team members got practical demonstrations of ChatOps.  It did not take many interactions like that before most of the remaining people started trying out ChatOps. And for the most part all I had to do was help people with syntax and make occasional suggestions on better ways to do things.&lt;/p&gt;

&lt;p&gt;Of course there were some people who did not use it for many months.  Some were “too busy” and did not really try to use it until they had breathing space.  Others might have a role where creating development server environments was not part of their normal job, or someone else always did it so they did not have to. Whatever the case this is to be expected.  At this point most everyone has transitioned their thinking and workflow to the new tool, so a few late adopters not really using it is not a concern.&lt;/p&gt;

&lt;p&gt;One significant item not mentioned above that was key to ChatOps success:  Fast Bug Fixes. Developers had sympathy for me because they are familiar with new implementations and finding bugs in new code.  But they are also under pressure to perform in their jobs.  So if I give them a tool and they find it does not work as needed due to bugs that are not fixed quickly, they might decide it is not worth the time and effort to learn a new tool AND to work around the bugs in that tool. That would set the adoption process back with those important early developers. And so any time they discovered a bug I would give it high priority and usually have it fixed the same day.&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/chatops-at-6-months-getting-there/</guid>
			</item>
		
			<item>
				<title>Chatops at 6 months. How has it changed development. After</title>
				<link>http://todd.wells.ws/blog/chatops-at-6months-after/</link>
				<pubDate>Fri, 19 Aug 2016 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;In my &lt;a href=&quot;/blog/chatops-at-6months-before/&quot;&gt;last post&lt;/a&gt; I looked at where the development process was before ChatOps was discovered and implemented. My how things have changed in these six months.  Developers can do a lot more now.&lt;/p&gt;

&lt;h4 id=&quot;a-developer-can--with-a-single-slack-command-&quot;&gt;A developer can ( with a single Slack command )&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Get a list of development servers&lt;/li&gt;
  &lt;li&gt;Get a list of development environments&lt;/li&gt;
  &lt;li&gt;Get details for all servers in a development environment&lt;/li&gt;
  &lt;li&gt;Create a single server fully configured, on the correct code branch.&lt;/li&gt;
  &lt;li&gt;Create a full environment with all needed servers fully configured and on the correct code branch. Web servers would be automatically pointed at the api server in the set.&lt;/li&gt;
  &lt;li&gt;Deploy new code branches to a server&lt;/li&gt;
  &lt;li&gt;Terminate a single server&lt;/li&gt;
  &lt;li&gt;Terminate an environment&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;a-developer-can-sometimes-not&quot;&gt;A developer can sometimes not&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Get the syntax for the Slack command correct&lt;/li&gt;
  &lt;li&gt;Troubleshoot why a particular server build failed&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;a-developer-can-not&quot;&gt;A developer can not&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Update the environment file ( .env ) for an application&lt;/li&gt;
  &lt;li&gt;Deploy a new application&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;server-testing-process&quot;&gt;Server Testing Process&lt;/h4&gt;
&lt;p&gt;Like before, after working on code locally the developer moves that code to a server environment.  But the process of doing that is much simpler now.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Figure out what servers and code branches they needed for testing.&lt;/li&gt;
  &lt;li&gt;Run a Slack command to create the needed environment.&lt;/li&gt;
  &lt;li&gt;Proceed with testing.&lt;/li&gt;
  &lt;li&gt;Run a Slack command to terminate the environment&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;thinking-process&quot;&gt;Thinking Process&lt;/h4&gt;

&lt;p&gt;In the last few weeks I have been seeing the thinking process of many developers getting close to this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Creating a server development environment is easy&lt;/li&gt;
  &lt;li&gt;I can wait until I am ready for testing to create a server environment&lt;/li&gt;
  &lt;li&gt;If I need to make any changes, just terminate the environment and create a new one.&lt;/li&gt;
  &lt;li&gt;When I am done testing I can terminate the environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I say &lt;strong&gt;getting close to&lt;/strong&gt; because thinking and habits don’t always change quickly.  Some of the developers are fully at the above thinking process.  Others are somewhere between the before ChatOps thinking and the above thinking.  Transitioning people’s thinking process, even if the change is to their benefit, still takes effort and time.  The transition of my developers’ thinking processes was no different.  The next post will describe the transition process and what is needed to make sure it is successful.&lt;/p&gt;

&lt;h4 id=&quot;can-not-notes&quot;&gt;“Can Not” Notes&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Troubleshoot why a particular server build failed&lt;/strong&gt; was initially hard for a developer, as ChatOps would return all the SaltStack failed states. This meant a simple error like an incorrect branch would return 50+ lines of SaltStack JSON output due to all the states that failed because they were dependent on the git state. Filtering out these dependent failures cut out the noise.  Now a developer can pretty easily interpret the most common failure, a typo in the branch to deploy, without understanding SaltStack or even knowing that the error message comes from SaltStack.  Here is an example:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;pi&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;git_|-xxxxxxx-repo_xxxxxxxxx_|-git@github.com:xxxxxxxx/xxxxxxxx.git_|-latest&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;   &quot;comment&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;No&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;revision&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;matching&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'hotfix/x.xx.xx'&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;exists&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;the&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;remote&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;repository&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;   &quot;name&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;git@github.com:xxxxxxx/xxxxxxx.git&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;   &quot;start_time&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;20:34:56.757768&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;   &quot;result&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;   &quot;duration&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;489.839&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;   &quot;__run_num__&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;   &quot;changes&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;{}&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Failures not caused by command syntax problems are usually not as clear and require Ops help to troubleshoot.  Fortunately these do not happen very often.&lt;/p&gt;
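
&lt;p&gt;The dependent-failure filtering can be sketched like this.  It is only an illustration, not the actual ChatOps bot code: the sample Salt output is invented, and it assumes the jq utility is available.  Salt marks states that failed only because of a failed requisite with a comment starting with “One or more requisite failed”, so dropping those leaves the root cause:&lt;/p&gt;

```shell
# Sketch: reduce Salt highstate JSON to the root-cause failures by dropping
# states that failed only because a requisite failed. Sample data is invented.
cat > /tmp/salt_result.json <<'EOF'
{
  "git_|-repo_|-git@github.com:org/app.git_|-latest": {
    "result": false,
    "comment": "No revision matching 'hotfix/1.2.3' exists in the remote repository"
  },
  "service_|-app_|-app_|-running": {
    "result": false,
    "comment": "One or more requisite failed: deploy.repo"
  },
  "file_|-conf_|-/etc/app.conf_|-managed": {
    "result": true,
    "comment": "File /etc/app.conf is in the correct state"
  }
}
EOF

# Keep entries where result is false and the comment is not the
# boilerplate Salt emits for requisite-chained failures.
jq 'with_entries(select(.value.result == false
      and (.value.comment | startswith("One or more requisite failed") | not)))' \
   /tmp/salt_result.json
```

&lt;p&gt;Running it prints only the git state with the real error, similar to the example output above.&lt;/p&gt;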

&lt;p&gt;&lt;strong&gt;Deploying new applications&lt;/strong&gt; and &lt;strong&gt;updating environment files&lt;/strong&gt; are purposely not within the developer’s abilities.  The server development environment is an intermediate step between development and production. Therefore it is controlled to ensure it is as similar as possible to production, so that the tests run in this environment are valid tests of what will happen when the code gets deployed to production.&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/chatops-at-6months-after/</guid>
			</item>
		
			<item>
				<title>Chatops at 6 months. How has it changed development. Before</title>
				<link>http://todd.wells.ws/blog/chatops-at-6months-before/</link>
				<pubDate>Mon, 15 Aug 2016 00:00:00 +0000</pubDate>
				<description>&lt;p&gt;Yesterday one of my developers created a server environment to test a feature branch and had a problem: one of the 4 servers in the environment did not get built. A short investigation showed a typo on the ChatOps command was the cause. When I explained what happened and suggested building just that one server the developer replied “I will just terminate the environment and recreate it.” Terminating and recreating servers is a normal/reasonable option for a situation like this, but for some reason that statement stuck in the back of my mind. A day later I realized why: it was said so casually. To the developer accomplishing what he said was easy. One Slack command would terminate the 3 servers and another command would build the 4 servers again. Why mess around with fixing a development environment that did not get built correctly the first time when you can create it again with 2 Slack commands.&lt;/p&gt;

&lt;p&gt;The above experience was the catalyst for me to look at the impact ChatOps has had on development and the developer’s thinking.  This is not going to discuss the technical aspects of making it happen.  That will have to be saved for another time.  But suffice it to say that ChatOps requires a good IT automation package ( I use &lt;a href=&quot;https://saltstack.com/&quot;&gt;SaltStack&lt;/a&gt; ) and a lot of work automating install/deployment to allow ChatOps to really shine.&lt;/p&gt;

&lt;p&gt;My “discovery” of ChatOps was 6 months ago and my &lt;a href=&quot;/blog/chatops-fomo/&quot;&gt;first post&lt;/a&gt; was five months ago.  At that time, the developer was really not able to do that much.&lt;/p&gt;

&lt;h4 id=&quot;a-developer-could&quot;&gt;A developer could&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;ssh to a development server and change the code branch&lt;/li&gt;
  &lt;li&gt;commit to a branch and a cron job would deploy the new branch&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;a-developer-could-not&quot;&gt;A developer could not&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Get a list of the current development servers&lt;/li&gt;
  &lt;li&gt;Get details about a development server ( app installed, code branch, API server pointed at)&lt;/li&gt;
  &lt;li&gt;Create a new development server&lt;/li&gt;
  &lt;li&gt;Configure the new server as a particular role ( web server, app server, … )&lt;/li&gt;
  &lt;li&gt;Install/configure an application&lt;/li&gt;
  &lt;li&gt;Setup the application to use a particular development version of an API.&lt;/li&gt;
  &lt;li&gt;Terminate an un-needed development server&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;server-testing-process&quot;&gt;Server Testing Process&lt;/h4&gt;

&lt;p&gt;A web engineer would work on code locally and when they were ready to test it in a server environment follow a process like this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Figure out what servers and code branches they needed for testing.&lt;/li&gt;
  &lt;li&gt;Ask the Ops team member what development servers were currently built/active.&lt;/li&gt;
  &lt;li&gt;Ask around to see if any of the development web servers were currently unused.
    &lt;ul&gt;
      &lt;li&gt;If yes, ssh and change to the correct branch ( if they were command line savvy, otherwise ask the Ops team member )&lt;/li&gt;
      &lt;li&gt;If no, either wait until it was available or ask Ops to create a new server.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Ask around to see if any of the development api servers were currently unused &lt;strong&gt;or&lt;/strong&gt;, if all were in use, whether one of them was on the needed branch of the api.
    &lt;ul&gt;
      &lt;li&gt;If yes, ask Ops to point the web server at that api server. And if it was unused, ssh to the server and change the branch ( if they were command line savvy, otherwise ask Ops )&lt;/li&gt;
      &lt;li&gt;If no, either wait until it was available or ask Ops to create a new server.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Proceed with testing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process could take 30 minutes if all the planets aligned ( other developers responded promptly, servers were available, the Ops team member was available to repoint web servers ).  But I think on average it would take between several hours and two days.  And so a developer would have to plan ahead.  Back then I would frequently have developers say to me something like “I have a feature I am going to need to test next week.  Can you help me get the correct servers set up?”&lt;/p&gt;

&lt;h4 id=&quot;thinking-process&quot;&gt;Thinking Process&lt;/h4&gt;

&lt;p&gt;Based on my interactions with the developers during that time, the typical developer thought process was:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Getting a new feature onto a server is hard&lt;/li&gt;
  &lt;li&gt;I always need help from a busy Ops person&lt;/li&gt;
  &lt;li&gt;I have to coordinate with 15 other developers&lt;/li&gt;
  &lt;li&gt;Maybe I don’t need to test this feature on a server environment.  An untested hotfix might be easier.&lt;/li&gt;
  &lt;li&gt;It took so much work to get this environment set up as I want it, so I am going to hold onto it even if I won’t be doing any testing in it for a few weeks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My &lt;a href=&quot;/blog/chatops-at-6months-after/&quot;&gt;next post&lt;/a&gt; will show how all of this has changed and the benefits it has provided to the developer and development workflow.&lt;/p&gt;
</description>
				<guid isPermaLink="true">http://todd.wells.ws/blog/chatops-at-6months-before/</guid>
			</item>
		
	</channel>
</rss>
