OfferZen's Best Practices for Low Maintenance Infrastructure

As OfferZen grew, our infrastructure needs grew with it. While setting up a more solid system to manage infrastructure, we developed best practices that now guide all of our work. Here are our most important ones.

Rocket with icons indicating infrastructure best practices

I’m a Senior Platform Engineer in OfferZen’s platform squad. The platform squad is responsible for managing OfferZen’s IT infrastructure: everything that connects someone using www.offerzen.com with the code and services developed by OfferZen. This consists of software, networking components and data storage on cloud infrastructure. When OfferZen began scaling, we needed an infrastructure system that was less manual and required less human attention and time.

We wanted to build a system that would require next-to-zero maintenance in the future, so we could focus on solving other important problems. To do this well, we drew on best practices from different areas. Here’s why we did that, and which practices helped us the most.

Why best practices?

It always makes sense to follow the suggestions from those who took the time to understand and develop a product. For example, when my car manufacturer says it’s best to fill up with diesel, I don’t do the experiments to see if I can use petrol instead. I believe that the manufacturer knows what the best practice is.

The same thing goes for software projects: If I integrate with a service, I want to use it in the way it was designed to be used. Having a list of best practices from the creators of that service means I can be sure I get the full benefits of the service by using it as intended. Best practices are an efficient and proven way to improve software projects.

To improve our infrastructure, we referenced community best practices, the OfferZen way of doing things and each platform engineer’s experience of best practices. That way, we could develop a core set of best practices that optimised for everyone’s knowledge across web development, IT and network management, and embedded engineering.

Here are the seven most important best practices we developed to guide us and that we still use today:

DRY (Don’t Repeat Yourself)

The idea of DRY is simple: You shouldn’t have to update multiple things when only one change occurs. If the same piece of knowledge lives in two places in your code and you need to change it, you might forget to change it in both places.

In infrastructure land, we found that defining things once and reusing them creates consistency and means we don’t have to repeat ourselves.
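
In Terraform, defining things once and reusing them can look like the sketch below. The tag values and bucket names are hypothetical and are not taken from our repositories. The point is only that the shared values live in one place:

```hcl
# Hypothetical names throughout, for illustration only.
locals {
  # Defined once, reused by every resource below.
  common_tags = {
    team       = "platform"
    managed_by = "terraform"
  }

  bucket_names = ["logs", "backups"]
}

# One resource block creates a bucket per name instead of one block per bucket.
resource "aws_s3_bucket" "this" {
  for_each = toset(local.bucket_names)

  # S3 bucket names are globally unique, so a real prefix would need to be more specific.
  bucket = "example-${each.key}"
  tags   = local.common_tags
}
```

Adding a bucket or changing a tag is then a one-line change in locals rather than an edit in several resource blocks.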

For example, we used DRY for a template to create new infrastructure GitHub repositories. We created a template with an automated task to scaffold code for a new project. Additionally, we created a mechanism to update the scaffolded code from a central place because we wanted to iterate on the shared code. We have now created over 20 repositories using this template. It’s saved us time and has made us less prone to errors.

Diagram showing code scaffolding

However, following the DRY principle blindly has its pitfalls. Repetition isn’t always bad, and abstraction isn’t always the best idea.

I agree with Sandi Metz, who said, “Duplication is far cheaper than the wrong abstraction”. Getting a feel for the right level of abstraction comes with experience, and not all things should be abstracted.

Infrastructure as Code (IaC)

IaC means managing and provisioning infrastructure through code instead of through manual processes. We wanted to have a reproducible, reliable and on-demand way to provision and manage our infrastructure while reaping the benefits of using code, hence infrastructure as code.

At first, our cloud setup on Amazon Web Services (AWS) was done by clicking buttons on their user interface, but as we scaled the business and teams, we realised that this approach could quickly become error-prone.

We now use Terraform to manage our resources using IaC. For instance, if I wanted to create an AWS user before, I would have to create a user, a policy with the correct permissions and a role on the AWS console. In contrast, by using IaC, I can add the following code to my repo to create a new user with all the correct permissions and policies in place:

module "iam_user_my_user" {
  source  = "t-reg.offerzen.dev/offerzen/iam-user/aws"

  email     = "my email"
  full_name = "my name"
  username  = "my username"

  permissions = local.default_permissions
}

This allows me to follow quality control processes such as peer reviews, code checks and testing.
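
The internals of the iam-user module aren’t shown here. As a rough sketch of the kind of resources a module like this could wrap, assuming a simplified interface with hypothetical username and policy_arns inputs (not the real module’s variables), it might look something like this:

```hcl
# Hypothetical module internals; the real iam-user module is not shown in this article.
variable "username" {
  type        = string
  description = "IAM user name to create."
}

variable "policy_arns" {
  type        = set(string)
  description = "ARNs of managed policies to attach to the user."
  default     = []
}

# The user itself.
resource "aws_iam_user" "this" {
  name = var.username
}

# One attachment per policy ARN, so permissions are declared as data.
resource "aws_iam_user_policy_attachment" "this" {
  for_each = var.policy_arns

  user       = aws_iam_user.this.name
  policy_arn = each.value
}

output "user_arn" {
  description = "ARN of the created user."
  value       = aws_iam_user.this.arn
}
```

Wrapping the resources in a module means the caller only supplies a few inputs, and the steps that used to be manual in the console are reviewed once in the module.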

Automate manual tasks

While we developed the new infrastructure, we had some tasks that we needed to do manually. Once the new infrastructure was in place, we could automate these tasks. We wrote scripts to automate them so we could increase efficiency, keep things consistent and reduce the margin for error.

We introduced Task, a task runner, so that we could capture complicated terminal commands in one place, keep them in version control and call them easily.

I mentioned a way to scaffold and update DRY code. Here it is in practice:

=> task --list
task: Available tasks for this project:
* infra:scaffold: 		Scaffolds infrastructure for a new project
* infra:upgrade: 		Upgrades the shared Terraform and infrastructure files

By running task infra:scaffold, I get the boilerplate code I need to get started, and if I need to update the code, I can run task infra:upgrade. Behind the scenes of these simple commands are 48 lines of terminal commands.

In this way, I don’t have to remember all 48 lines, reducing cognitive load and opportunities for errors.

Consider the best practices provided by services

When working with a service, we consider the best practices it recommends. We don’t blindly follow all best practices but rather understand and adjust them to our needs. Using a service’s best practices gives you access to helpful resources. For example, we chose to follow the recommended AWS best practice for setting up accounts because we could get guidance and help from the community and AWS.

We followed the AWS whitepaper Organizing Your AWS Environment Using Multiple Accounts, which provides best practices for organising your overall AWS environment.

The main idea was the separation of concerns. We wanted to separate things to ensure that each AWS account does one thing and does it well. We separated permissions, costs and environments but still allowed them to communicate with each other.
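
Accounts themselves can be declared as code. The sketch below is an assumption about how that could look with the AWS provider’s aws_organizations_account resource, not a copy of our setup. Apart from Sandbox, the account names are hypothetical, and so are all the email addresses:

```hcl
# Hypothetical account names and email addresses, for illustration only.
locals {
  accounts = {
    sandbox    = "aws-sandbox@example.com"
    staging    = "aws-staging@example.com"
    production = "aws-production@example.com"
  }
}

# One AWS account per entry, each with a single, clear purpose.
resource "aws_organizations_account" "this" {
  for_each = local.accounts

  name  = each.key
  email = each.value
}
```

Keeping the list of accounts in one map makes the overall structure easy to read and review in a pull request.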

As you can see in the diagram of our account structure below, each block is an account that has a clear purpose and is well-documented. For example, Sandbox is a playground account where anyone has access to play with AWS services.

Diagram showing AWS account structure

Consistent and clean code

Everyone on the platform squad appreciates working with clean and easy-to-understand code, so it is no surprise that we have baked linting, formatting and code conventions into our projects.

For linting and formatting, we use terraform fmt, which ships with Terraform, and tflint, a linter from the Terraform community:

terraform fmt
tflint

We run both as pre-commit hooks and make our pull requests fail when either tflint or terraform fmt doesn’t pass.

We created our own Terraform Conventions Guide for code conventions and styling, which we developed while writing the code. It includes:

File naming conventions
Folder structures
Style guidelines
Terraform resource naming conventions
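
Some conventions can be checked by tooling instead of by memory. As a minimal sketch, assuming tflint’s bundled terraform ruleset and its terraform_naming_convention rule, a .tflint.hcl file could enforce a naming style like this. The choice of snake_case is an illustration, not a statement of what our guide prescribes:

```hcl
# .tflint.hcl (hypothetical; the rules we enable are not listed in this article)
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

# Require snake_case names for resources, variables, outputs and similar blocks.
rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}
```

With a file like this in the repository, running tflint would flag a resource named in a different style.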

Having a conventions guide helps us maintain consistent code standards without having to think about them. Working with clean and consistent code is a pleasure because I know exactly where to look for something, and I know what it will look like. Since I’m also a Ruby on Rails developer, I appreciate convention over configuration!

Review apps

Making changes to infrastructure can be scary because it can cause bad experiences for our users. Having a way to test changes provides confidence and is one way to make it less scary. We do this by using review apps.

A review app is an isolated environment that runs a complete application based on code changes. This is done before the changes are made live. It’s a best practice we have in most software projects at OfferZen.

For example, we have nginx running as a reverse proxy. Making changes to its configuration can result in an entire site not routing as expected. We can now spin up a review app built on the code in a pull request to test whether the changes work when plugged into the end-to-end system and avoid major issues.

We use GitHub Actions and comments to spin these apps up and down:

Screenshot of deploying changes using GitHub Actions

Having a way to test changes consistently gives us confidence before making changes live.

Aim for a good developer experience

Happy developers tend to be more productive, and having a good developer experience leads to happy developers.

Creating tools, processes, and documentation so developers can easily get started and contribute is how we ensure a good developer experience.

In our infrastructure project, we created auto-generated sections in READMEs and a series of posts on the topic with diagrams and guides. In addition to everything mentioned above, CLI tools contribute to a great developer experience.

Of all the projects I’ve worked with, the infrastructure project is my favourite by far due to the great developer experience - that’s why I wanted to share it with you!

Conclusion

Infrastructure is a critical part of the survival of OfferZen, and mistakes can be expensive. That’s why we went about developing a solid infrastructure system. Throughout the development of this project, we established a set of best practices for ourselves that we continue to carry over to other software projects.

These best practices make our system more reliable and a joy to work with. As a result, we’ve had only two issues with the system since making it live in March 2021, which has freed us up to focus on delivering other important features for developers and the business.

I hope you can take some of these principles with you and that they make your projects a joy to work with!

Madelein enjoys setting up end-to-end software projects, threading quality and efficiency throughout the lifecycle of a project in her role as Senior Platform Engineer at OfferZen.