As Director of Engineering, my job is to help build and facilitate the company culture for the engineering team. Culture is where you spend your days and your efforts. Culture should not be confused with perks. Perks are free drinks and standing desks. Culture is ultimately what makes top performers leave or stay at companies.
You can usually assess a team’s culture starting right at the development environment. How much time have they invested in automation? How long do builds stay broken? Does anyone care when a build is broken? Do they have test automation running? Have they centralized logging? How do they know a new code push doesn’t introduce a regression? Can someone check out a project, follow the README and be up and running, or do they have to chase down someone to help troubleshoot? How quickly can we safely get from a laptop commit to production? What are all the steps involved there, and are they automated?
Gaps in any of these areas add friction to the one thing engineers really want to do… ship product. The more you separate an engineer from production, the worse your engineering culture tends to be; it becomes a culture of bare-minimum, throw-it-over-the-wall mentalities. That statement is based on my experience over nearly 20 years in software, across startups and enterprises of various sizes and scales (so your mileage will vary).
Engineers want to be productive and not be held back by issues that pop up day to day. Ultimately, it’s management's job to hold the team to a higher standard, remove roadblocks and enforce behaviors until they become part of the culture where engineers can self-regulate. You want new hires to come in and just say “ok so this is how this place works and everyone is on the same page, good”. You don’t want mixed messages for people coming in, as they will settle into their patterns within the first month.
To me it's analogous to a chef vs a line cook. A chef keeps a clean kitchen, with food properly stored and prepared. The chef knows the food cost, knows inventory, knows how the kitchen runs and what everyone does. The chef treats the kitchen (their dev environment) as a pristine place where the end product begins its journey. You wouldn’t expect a three-star Michelin restaurant to serve meals from a kitchen with dirty floors and grease all over the walls. Line cooks, on the other hand, are usually implementers. During work hours they take tickets, make some food (hopefully to spec) and go home. In a startup you need chefs. You want a team who wants to remove roadblocks that slow down getting ideas to production. You want a team of chefs who want to know when things are broken and know where to go to fix them. You want a team of chefs who treat their development environment and their teammates’ time with respect by keeping that environment operational. An engineer with a chef mentality will work a Saturday to automate something that will save their teammates 10 minutes of friction.
So what are some things you can do on your team to start down the path of excellence? As an overall goal start with the statement:
“Aim to minimize the friction that prevents developers from coding as efficiently as possible.”
Create a mission statement that documents the team’s expectations and standards, and hold the team accountable for them. What are the 5-10 things you value most as a team? Give it to new hires to read so they understand your expectations for the development life cycle, and hold them to it.
Sustaining Engineer role that rotates:
It is important in large, complex systems that as many people as possible know how all the cogs fit together. Where do I find error logs? Who owns which component? How do I monitor all the servers from a central place? A test just failed; how do I know who recently pushed code? Why does that thing talk to this thing?
These questions are very time consuming if every issue that comes up draws on the entire team’s capacity. Create a rotating sustaining engineer role to run point on broken builds, failed tests, daily log/error reports, running releases, troubleshooting, and the first pass on production issues. I’ve seen this become extremely valuable for building the team’s knowledge of how the overall system works, the number of components we have, how to read logs, how to troubleshoot, etc. We have built internal tools to provide this information from a central place for easy diagnostics.
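A rotation like this can be computed rather than tracked by hand. The sketch below is a minimal illustration, assuming a weekly rotation; the roster names, the `ROSTER` constant, the start date and the `sustaining_engineer` function are all hypothetical and not taken from our internal tools.

```python
from collections.abc import Sequence
from datetime import date

# Hypothetical roster; the names are placeholders for illustration only.
ROSTER = ("alice", "bob", "carol", "dave")

# Assumed start of the rotation (a Monday), so each shift runs Monday to Sunday.
ROTATION_START = date(2024, 1, 1)


def sustaining_engineer(
    day: date,
    roster: Sequence[str] = ROSTER,
    start: date = ROTATION_START,
) -> str:
    """Return who runs point for the week that contains `day`."""
    weeks_elapsed = (day - start).days // 7  # floor division also handles earlier dates
    return roster[weeks_elapsed % len(roster)]


if __name__ == "__main__":
    print(sustaining_engineer(date.today()))
```

Because the schedule is derived from the date, anyone on the team can work out who is on point without consulting a shared calendar.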
Source Control Standards:
As your team grows you’ll want to make sure you have some standard for commits. If you use a ticketing system, would it help to put the ticket number as the first item in the commit so you can write a tool to auto-generate release notes? Do you accept commits like “had to change this thing”? Source control is your window into the past, so you’ll want some thought there to make sure people understand WHY something changed. It’s easy to see the change, but the intent is usually the critical part. It also becomes more important as you need to start sharing release notes with customers and supporting audits.
Are you able to go back in time and recreate your cluster based on a known deployment? In large-scale systems weird bugs come up: what works with one combination of software versions fails when one component moves to the next minor revision. Are you able to take component A at version 1.2.3, component B at version 3.1.0 and component C at 1.0.1, load them up on a test cluster and troubleshoot outside of the production environment? You should aim to be able to recreate an environment that you deployed a month ago.
What good are logs if no one looks at them? Early in your cycle, define standards for error logging, get central logging into place for your various environments, and create reports and alerts to find abnormalities or error conditions. Everyone looks at their logs as they develop and release a feature, but what about 2 months later? Who’s looking at the logs? Probably no one, so make sure to set up alerting early on. PagerDuty is an ideal application for this, making sure there is a point person assigned to triage when issues come up.
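As a rough illustration of the release-notes idea, here is a minimal sketch that groups commit subjects by a leading ticket key. The `ENG-123` key format, the `TICKET_PREFIX` pattern and the `release_notes` function are assumptions made for the example, not a description of any particular ticketing system.

```python
import re

# Assumed convention: the commit subject starts with a ticket key such as "ENG-123",
# followed by a colon and/or whitespace and then the description.
TICKET_PREFIX = re.compile(r"^([A-Z]+-\d+)[:\s]+(.+)$")


def release_notes(commit_subjects):
    """Group commit subjects by ticket key; collect subjects without a key separately."""
    by_ticket = {}
    untracked = []
    for subject in commit_subjects:
        subject = subject.strip()
        match = TICKET_PREFIX.match(subject)
        if match:
            by_ticket.setdefault(match.group(1), []).append(match.group(2))
        else:
            untracked.append(subject)
    return by_ticket, untracked


if __name__ == "__main__":
    subjects = [
        "ENG-101: Retry uploads on timeout",
        "ENG-101 Add backoff to retry loop",
        "had to change this thing",
    ]
    notes, untracked = release_notes(subjects)
    for ticket, changes in notes.items():
        print(ticket)
        for change in changes:
            print(f"  - {change}")
    print("Commits without a ticket:", untracked)
```

The subjects could come from something like `git log --pretty=format:%s` between two release tags. The list of untracked commits is useful in its own right, since it shows where the standard is not being followed.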
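One way to make that possible is to record the full set of component versions every time you deploy. The sketch below is a hedged, in-memory illustration; the `DeploymentHistory` class and its methods are hypothetical, and a real system would persist this somewhere durable.

```python
from bisect import bisect_right
from datetime import datetime


class DeploymentHistory:
    """Keeps a snapshot of every component's version for each deployment."""

    def __init__(self):
        self._times = []      # deployment timestamps, in ascending order
        self._manifests = []  # component -> version mapping for each timestamp

    def record(self, deployed_at, versions):
        """Store a full snapshot; deployments must be recorded in chronological order."""
        if self._times and deployed_at < self._times[-1]:
            raise ValueError("deployments must be recorded in chronological order")
        self._times.append(deployed_at)
        self._manifests.append(dict(versions))

    def manifest_at(self, when):
        """Return the versions that were live at `when`."""
        index = bisect_right(self._times, when) - 1
        if index < 0:
            raise LookupError("no deployment recorded at or before that time")
        return dict(self._manifests[index])


if __name__ == "__main__":
    history = DeploymentHistory()
    history.record(datetime(2024, 1, 1), {"A": "1.2.3", "B": "3.1.0", "C": "1.0.1"})
    history.record(datetime(2024, 2, 1), {"A": "1.2.4", "B": "3.1.0", "C": "1.0.1"})
    print(history.manifest_at(datetime(2024, 1, 15)))
```

With a manifest like this plus a package repository that retains old builds, a test cluster can be loaded with exactly the combination that was live on a given day.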
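A simple version of such a report can be sketched as follows. The log line format (`<timestamp> <LEVEL> <component> <message>`), the function names and the threshold are assumptions for illustration; the output would be handed to whatever paging or alerting tool you use.

```python
from collections import Counter


def error_counts(log_lines, levels=("ERROR", "FATAL")):
    """Count log lines per component for the given levels.

    Assumed line format: '<timestamp> <LEVEL> <component> <message>'.
    Lines that do not fit the format are ignored.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[1] in levels:
            counts[parts[2]] += 1
    return counts


def components_to_alert(counts, threshold):
    """Return the components whose error count reached the threshold, sorted by name."""
    return sorted(name for name, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    lines = [
        "2024-01-01T00:00:00Z ERROR ingest timeout talking to queue",
        "2024-01-01T00:00:01Z INFO ingest processed batch",
        "2024-01-01T00:00:02Z ERROR ingest timeout talking to queue",
        "2024-01-01T00:00:03Z ERROR api bad request",
    ]
    print(components_to_alert(error_counts(lines), threshold=2))
```

Running something like this on a schedule means the logs get looked at 2 months after release, even when nobody remembers to do it.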
This will definitely be one of the better mandates for the team. It spreads shared knowledge and lets experts in a language shape developers who may be new to that language. It makes sure features and concepts are spread among the team. Many bugs and issues have been caught at this phase. It also gives you confidence your ideas have been vetted.
Unit tests cover specific use cases and use mocks/stubs for most interactions. Most systems have complex interactions outside of themselves, so it’s critical to write full-stack integration tests that exercise even the simplest cases: send raw data in and get nicely processed data back from the APIs. If everything passes and no errors are scraped from the logs, assume processing succeeded. Start with broad strokes and work your way in. With limited resources, try to get the most test coverage for your time invested. Invest early on in API monitoring checks that run in production repeatedly if your APIs are private.
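The "raw data in, processed data out, no errors in the logs" check can be sketched as below. This is only an outline under stated assumptions: `smoke_test` is a hypothetical helper, and `FakePipeline` is a stand-in so the example runs on its own; in a real test those three callables would talk to your actual APIs and log store.

```python
def smoke_test(submit, fetch_result, read_error_logs, raw, expected):
    """Send raw data in, compare the processed output, and check the logs for errors.

    Returns a list of failure descriptions; an empty list means the check passed.
    """
    job_id = submit(raw)
    actual = fetch_result(job_id)
    errors = read_error_logs()
    failures = []
    if actual != expected:
        failures.append(f"expected {expected!r}, got {actual!r}")
    if errors:
        failures.append(f"{len(errors)} error(s) found in logs")
    return failures


class FakePipeline:
    """In-memory stand-in for the real system, used only to make the sketch runnable."""

    def __init__(self):
        self.results = {}
        self.error_log = []

    def submit(self, raw):
        job_id = len(self.results) + 1
        try:
            self.results[job_id] = {"total": sum(int(x) for x in raw.split(","))}
        except ValueError as exc:
            self.results[job_id] = None
            self.error_log.append(str(exc))
        return job_id

    def fetch_result(self, job_id):
        return self.results[job_id]

    def read_error_logs(self):
        return list(self.error_log)


if __name__ == "__main__":
    system = FakePipeline()
    print(smoke_test(system.submit, system.fetch_result, system.read_error_logs,
                     raw="1,2,3", expected={"total": 6}))
```

Passing the system in as plain callables keeps the check itself independent of transport, so the same test can run against dev and, as a monitoring check, against production.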
Continuous deployment and testing:
One of your goals should be to find issues as soon as possible in your development pipeline. The sooner you find a bug, the cheaper it is to fix it. To that end, I’ll walk through what happens when you check in a line of code in our team’s repository.
- code is checked in with an issue number at the beginning of the commit message to support automated release note generation
- the check-in kicks off a build server plan that builds the project and runs the unit tests
- assuming the local tests pass, a process is kicked off that builds the code into deployable packages and uploads them to a package repo
- the role those changes affected is determined, and the running servers in dev with that role are identified; this kicks off a rolling restart in dev that installs the new packages
- once the servers have been restarted, automated smoke tests run through the environment, testing a number of known scenarios to protect against regressions
- if tests fail during any part of that process, alerts are emailed out and the sustaining engineer (SE) runs point on getting it fixed
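The shape of that flow, where each stage runs only if the previous one succeeded and any failure triggers an alert, can be sketched as follows. This is an illustration rather than our build server configuration; `run_pipeline`, the stage names and the `alert` callable are hypothetical.

```python
def run_pipeline(stages, alert):
    """Run (name, step) pairs in order and stop at the first failure.

    Returns (completed_stage_names, failed_stage_name_or_None).
    `alert` is any callable that takes a message, e.g. something that sends email.
    """
    completed = []
    for name, step in stages:
        try:
            step()
        except Exception as exc:
            alert(f"stage '{name}' failed: {exc}")
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    def ok():
        pass

    def failing_smoke_tests():
        raise RuntimeError("known scenario regressed")

    stages = [
        ("build and unit tests", ok),
        ("package and upload", ok),
        ("rolling restart in dev", ok),
        ("smoke tests", failing_smoke_tests),
    ]
    print(run_pipeline(stages, alert=print))
```

Stopping at the first failed stage is what keeps a bad build from reaching later environments, and the alert names the stage so whoever runs point knows where to start looking.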
Metrics On Everything:
This is pretty standard, but make sure your application has proper instrumentation that’s logged. For example, use Coda Hale’s Metrics library and stream to statsd for historical trending. Make sure you have metrics on server health. Provide alerts and dashboards that summarize the important information. Try to put counters on everything that runs, and expose JSON API endpoints from which you can extract the counters for alerting and graphical UI analysis.
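To show the idea of counters that can be read back as JSON, here is a minimal sketch. It is not the API of Coda Hale’s Metrics library or of statsd; `CounterRegistry` and its methods are hypothetical names, and a real service would serve `to_json()` from an HTTP endpoint.

```python
import json
import threading
from collections import defaultdict


class CounterRegistry:
    """Thread-safe named counters that can be dumped as JSON."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, name, amount=1):
        """Add `amount` to the named counter, creating it at zero if needed."""
        with self._lock:
            self._counts[name] += amount

    def snapshot(self):
        """Return a plain-dict copy of the current counter values."""
        with self._lock:
            return dict(self._counts)

    def to_json(self):
        """Serialize the counters with sorted keys so the output is stable."""
        return json.dumps(self.snapshot(), sort_keys=True)


if __name__ == "__main__":
    metrics = CounterRegistry()
    metrics.increment("requests")
    metrics.increment("requests")
    metrics.increment("errors", 3)
    print(metrics.to_json())
```

An alerting job or dashboard can poll that JSON and compare successive snapshots to derive rates.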
Automate all the things! If you find yourself doing the same things daily or weekly or you have to do 10 steps to get code to your dev environment, spend the time to automate it. Heavy investments in automation and repeatability continue to pay off over time.
As people on my team can attest, failing tests, broken builds and unstable dev environments are my hot-button topics. That is because I see them as a window into how you think about production. If you can’t keep your own house in order, then I lose faith that you have the ability to make the best decisions for production environments. Be a chef.