Time to finish what we started: the last four suggestions on DevOps and Horizontal Scaling.
Manual and Automated Code Review
Manual code reviews are easy — GitHub, Gerrit, and a host of other tools support this workflow. The gist is that everyone’s pull requests need to be seen and reviewed by a senior developer/manager/etc. before being merged into the main development branch. The idea is that with many eyes on the code, problems can be spotted and solved early while in context rather than later when the source of the issue can be harder to find. This is the first stage of code reviews.
The second (and, in my eyes, final, as it's not horribly complicated) stage of code reviews is automated code review. This sounds (for now, until AI takes over) more complicated than it really is: ideally through a Jenkins job or some other task runner, run a series of tools like pylint, JSLint, Codacy, SonarQube, or even Valgrind over the PR. These tools produce all kinds of output, so they'll need to be set up to provide the most useful feedback to the developer who made the PR. They perform style analysis, static analysis, and even memory-usage analysis, and can find harder-to-spot problems or hotspots for null pointers, memory leaks, and other issues.
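To make the automated stage concrete, here's a rough sketch of the kind of script a CI job could run against a PR. It assumes pylint is installed, and the base branch name is a placeholder for whatever your main development branch is:

```python
#!/usr/bin/env python
"""Sketch of an automated-review step a CI job might run on a PR."""
import subprocess
import sys

BASE_BRANCH = "origin/develop"  # placeholder for your development branch

def changed_python_files():
    """List the .py files this PR changed relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", BASE_BRANCH, "--", "*.py"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f]

def main():
    files = changed_python_files()
    if not files:
        print("No Python files changed; skipping lint.")
        return 0
    # Run pylint over just the changed files; a nonzero exit code
    # fails the CI build, which blocks the merge until it's addressed.
    return subprocess.run(["pylint", *files]).returncode

if __name__ == "__main__":
    sys.exit(main())
```

In practice, most of these tools ship their own CI integrations (SonarQube and Codacy both hook directly into GitHub PRs), so a hand-rolled script like this is just the simplest possible starting point.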
As your team grows and every developer is handling a different kind of complexity, manual and automated code reviews help make sure that every change is reviewed and understood before it's deployed, both promoting high-quality code and ensuring that if something were to go wrong, at least two developers understand any particular part of the codebase. Redundancy is your friend.
Scalability through PaaS & Non-monolithic Architecture
For a very large number of systems, it makes a lot of sense to use a PaaS (platform-as-a-service) provider like Heroku, Google App Engine, or Engine Yard. You don't need to know how to administer Ubuntu/Debian or Windows Server, and you don't need to understand Docker or lxd or rkt or any number of new containerization tools. These cloud PaaS providers just work.
(As an aside, you might look at Heroku's pricing and think, "but hosting a small server on AWS and managing it myself will save me $200 per month!" In the end, though, I'm sure you will spend far more on employee hours spent setting up, debugging, and maintaining those servers than the $200 you save each month.)
In addition to relieving the knowledge pressure of self-managed cloud servers (or on-premises machines), using a PaaS removes the specialized "ops/IT/deployments person" role, so if something isn't working right, more of your team will be effective at diagnosing and solving the problem. That's a good thing!
Paired with this is the use of a non-monolithic architecture. This one actually adds complexity, but the payback from the investment can be significant for some engineering teams. I'll go into this further later in this section.
“Level One” of this concept is simply migrating your web services and backend resources over to a PaaS (I’ll use Heroku as an example, due to my familiarity with it). All of a sudden, all of the components in your deployments are accessible from one place, and often come with monitoring dashboards or inspection tools baked right in. Settings are configured via environment variables (12factor.net) and logs are all piped into one nice display. Cool!
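To make the environment-variable configuration concrete, here's a minimal sketch in Python. DATABASE_URL is a real Heroku convention; the other names and defaults are just illustrative:

```python
import os

# Twelve-factor style config: every deploy-specific setting comes from
# the environment, so the same codebase runs unchanged in review apps,
# staging, and production.
DATABASE_URL = os.environ["DATABASE_URL"]  # required: fail fast if unset
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379")  # illustrative
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
```

The point is that nothing deploy-specific lives in the code: Heroku sets these values per app, so promoting a build through your environments changes configuration without changing a line of source.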
Getting to “Level Two” involves integrating your environments across something like a “pipeline” in Heroku. This may exist for your PaaS provider or you may have to create the abstraction yourself, but the results are fantastic. With Heroku’s Pipelines feature, you can integrate “Review Apps” (where GitHub pull requests are automatically built and deployed to allow for manual testing and usage before merging into the development branch), continuous integration (previously via an outside tool, though Heroku has since released its own CI feature), and staged deployment. With these tools integrated, it becomes easy for developers (and honestly any other stakeholder) to see and understand the flow of how new code and bug fixes make it down the pipe. It also makes it much easier for new engineers to join and hit the ground running.
“Level Three” starts to introduce microservices or another non-monolithic architecture. Let me start by saying microservices aren’t for every organization; in fact, I highly recommend against them for small organizations or for prototypes/new projects. In cases like those, a monolithic architecture is going to keep things simple and keep your feature deployment cycle times much shorter.
Adopting a non-monolithic architecture encourages a number of good practices: decoupling of components, increased testability, and system isolation, to name a few. It also increases the complexity of deployments and the number of places where things can go wrong. For a highly functional software engineering team, the benefits can often outweigh the drawbacks, especially for a team with engineers devoted to the infrastructure itself (particularly when cost and other factors move you away from PaaS offerings). With microservices in a large project, new engineers can become acquainted with complex functionality more quickly, because they (hopefully) no longer need to understand far-reaching implications between systems or comprehend huge modules that span a wide swath of functional domains. It also becomes easier to roll out new functionality: as the coupling between components decreases, the risk of a change to one function breaking another decreases as well.
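To sketch what that decoupling buys you, here's a hypothetical, minimal service in Python using Flask. The "pricing" domain, route, and data are all made up for illustration; real services would back this with their own datastore:

```python
# A hypothetical "pricing" microservice: it owns one narrow domain and
# exposes it over HTTP, so other services depend on this contract
# rather than on its internal code. Requires: pip install flask
from flask import Flask, jsonify

app = Flask(__name__)

# In a real service this would live in a database owned by this
# service alone; it's hardcoded here to keep the sketch self-contained.
PRICES = {"basic": 10_00, "pro": 50_00}  # cents

@app.route("/price/<plan>")
def get_price(plan):
    if plan not in PRICES:
        return jsonify(error="unknown plan"), 404
    return jsonify(plan=plan, cents=PRICES[plan])

if __name__ == "__main__":
    app.run(port=5001)
```

The win is the contract: other teams call the endpoint and never touch this service's internals, so its implementation can change (or be redeployed) independently.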
Logging, Instrumentation, and Monitoring
Logging seems pretty simple, and it is. Unless you have specific logging requirements, like those imposed by HIPAA, you should pretty much spit out a lot of logs (but not too many… I know that’s vague) and retain them at least long enough that you’re unlikely to need them again (and if you are likely to, compress them and save them somewhere cheap). Instrumentation and monitoring get a little more complex. Tools like New Relic or Dynatrace are instrumental (ha!) to doing this successfully: they provide agents and dashboards that work with all types of projects to analyze performance and detect problems that would otherwise go unnoticed.
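As a minimal sketch of the "log plenty, keep it cheap" idea in Python, here's a basic setup with rotation; the filename, sizes, and retention counts are arbitrary placeholders, not recommendations:

```python
import logging
from logging.handlers import RotatingFileHandler

# Rotate at roughly 10 MB and keep 30 old files; a cron job or log
# shipper can then compress rotated files and move them to cheap
# storage. (On a PaaS like Heroku you'd log to stdout instead and let
# a log drain such as LogEntries handle retention.)
handler = RotatingFileHandler("app.log", maxBytes=10_000_000, backupCount=30)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(name)s: %(message)s"
))

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("service started")
logger.warning("payment retry #%d for order %s", 2, "A1234")
```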
“Level One” of logging, instrumentation, and monitoring is simply having these tools hooked up. On Heroku, I often use the free tiers of LogEntries and New Relic to start. The benefits you get from simply ensuring that these are hooked up correctly are immense: you can immediately start seeing slow requests/queries and performance bottlenecks in your infrastructure, and you can begin to trace anomalous events back to specific sections of logs for better context when hunting down issues.
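To show the kind of signal even basic instrumentation surfaces, here's a hypothetical sketch that times requests and logs the slow ones. An APM agent like New Relic's does this (and far more) automatically once installed, so treat this only as an illustration of the idea, with an arbitrary threshold:

```python
import logging
import time

logger = logging.getLogger("perf")
SLOW_THRESHOLD_SECONDS = 0.5  # arbitrary illustrative cutoff

class TimingMiddleware:
    """Hypothetical WSGI middleware that logs slow requests.

    Approximate: it measures until the app returns its response
    iterable, which is close enough for a sketch.
    """
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD_SECONDS:
                logger.warning(
                    "slow request: %s %s took %.3fs",
                    environ.get("REQUEST_METHOD"),
                    environ.get("PATH_INFO"),
                    elapsed,
                )
```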
“Level Two” (the final level, much like my view on code reviews) of this concept involves extending these tools to full-system coverage and configuring them for automated alerts (what I would call “true monitoring”). Instead of going into the New Relic control panel when something is going wrong, it’s far better to be warned by New Relic (via Slack, email, or a host of other options) that something is slowing down or about to break.
LogEntries and New Relic can both be set up to warn you about specific conditions: Apdex score alerts in New Relic can warn you of overall app sluggishness, while New Relic’s Key Transactions feature can be configured to warn you of slowness or errors in particular processes, such as customer checkout or PBX/customer service initial response. LogEntries has a number of built-in pattern recognizers for Heroku, and Stack Overflow and Google are full of many, many more for detecting anomalous behavior or otherwise unseen errors.
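Products like New Relic evaluate alert conditions server-side and notify Slack themselves, but as a sketch of what push-based alerting boils down to, here's a hypothetical check that posts to a Slack incoming webhook when a metric crosses a threshold; the webhook URL, metric, and cutoff are all placeholders:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
APDEX_ALERT_THRESHOLD = 0.85  # illustrative cutoff, not a recommendation

def post_to_slack(text):
    """Send a message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def check_apdex(current_score):
    """Hypothetical check: warn the team when Apdex dips."""
    if current_score < APDEX_ALERT_THRESHOLD:
        post_to_slack(
            f"Apdex dropped to {current_score:.2f}; something is slowing down."
        )

# In practice the monitoring product runs this evaluation for you;
# the sketch only shows the shape of an automated alert.
check_apdex(0.79)
```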
The benefits of these concepts for the horizontal scaling (and general operation) of a team are plentiful: democratized error handling and response (the whole team has the tools to analyze even the most technically complicated issues); continuous measurement and monitoring of performance (sluggishness introduced by new features or new code is caught early and quickly); and performance quantification (product managers and executives can better understand the impact of technical debt and architectural refactoring on the quality of the product).
Architecture Review Team
The last item on my list of things for building out a “DevOps culture” and a horizontally scalable team doesn’t have levels: the Architecture Review Team. At large enough organizations, I think a team needs to exist composed of senior engineers from sub-teams and/or software architects, along with engineering managers (i.e. people with budgets).
Ideally, all participants have a strong understanding of the impact that spending engineering time has on the overall organization (i.e. it’s expensive), since software architecture at large firms is a balancing act of finding the best architecture at the appropriate cost. The team should meet as often as once a month or as infrequently as every six months (though I like the happy medium of quarterly) to discuss recent and upcoming major initiatives around new features, the state of the bug tracker, the state of the performance monitoring tools, and the infrastructure budget (money spent on servers, software, etc.).
The goal of the meeting is to make sure everyone understands the macro state of the system, and to allow for discussions about issues like tech debt, refactoring, hiring, and infrastructure changes. It should also prompt discussions about future software initiatives: architects and team leads will have a good idea of the scope of new initiatives and whether they will have a deleterious effect on the existing system without additional planning or a round of refactoring first.
How do you do it?
As I mentioned in the last post: careful planning and close work with the engineering team and outside product owners/users are critical to implementing these steps successfully. This is actually something I love to consult with companies about; shoot me an email at z@z11k.com to get a bit more information. I love traveling and will gladly come to you, wherever you are, to help your company succeed in implementing these tools.