Cloud infrastructure end-to-end or conformance testing

The rapid growth of cloud computing outpaced standardization, leaving the DevOps approach somewhat in the dark. The “you build it, you run it” approach became the de facto standard in the industry. Unfortunately, it didn’t scale well and required specialized skills. Moreover, finding people with this expertise was challenging and expensive.

According to multiple State of the Cloud surveys, 91% of businesses use public cloud, 76% of companies are already multi-cloud, and the trend is only upward. The amount of information a person can retain has its limits, yet the cognitive load keeps increasing with emerging technologies, tools, frameworks, and new methodologies. That means a lot of code is being actively produced and maintained, which is why the right infrastructure testing strategy is even more crucial today. We believe we are doing everything we can to provide the best possible standard. But are we?

In this article, we’ll introduce you to infrastructure testing, provide test cases, and offer advice on how to set up your tests to achieve sustainability and resilience in the cloud.

Why is infrastructure testing a new practice?

To assure quality, we write tests alongside all of the application code. So why isn’t writing tests for infrastructure a standard practice? Thinking about edge cases for infrastructure requires not only vast knowledge of the proper way to use it but also an understanding of the underlying problems of the chosen solutions. As the Cloud Team at VirtusLab, we would like to share with you a few long-term observations and our solution for infrastructure testing.

This will help you to pave the way and give some insights to upgrade your infrastructure testing strategy. See how your organization can benefit from it.

These are our observations on end-to-end infrastructure testing:

Infrastructure testing takes a long time and includes templating, which introduces more complexity. We do not want random errors during resource provisioning.
Static code analysis tools find bugs in the configuration before a much longer test deployment starts, which saves a lot of time in case of a misconfiguration.
End-to-end tests (E2E) can produce resources that affect consecutive tests, so it’s important to provide necessary sandboxing and reduce potential blast radius.
New versions of Infrastructure as Code tooling (e.g. Terraform providers) and cloud APIs are released regularly, and some contain breaking changes. It’s not always possible to foresee how they will affect dependent resources without prior testing.
Properly written tests, with design patterns in mind, can be easily reused regardless of the cloud provider.
Properly structured and verbose tests can reduce the troubleshooting time from hours to minutes.

The right way to start the infrastructure testing flow

Infrastructure is defined in configuration files, which come in many different formats, such as YAML or JSON. Each of these formats has tools for checking syntax and indentation.

The infrastructure testing process can be as simple as generating actual files from templates and evaluating the correctness of the outcome in terms of values and formatting. It can also be as complicated as dry-running configuration files with their dependencies.
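The simple end of that spectrum, rendering a template and checking the outcome, could look like the sketch below. It relies only on Go's standard text/template package. The template text, the KeyVaultVars type, and all values are hypothetical and serve purely as an illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// KeyVaultVars holds hypothetical values for an illustrative Key Vault module.
type KeyVaultVars struct {
	Name           string
	Location       string
	SoftDeleteDays int
}

// tfvarsTemplate is an illustrative template for a Terraform variables file.
const tfvarsTemplate = `name = "{{ .Name }}"
location = "{{ .Location }}"
soft_delete_days = {{ .SoftDeleteDays }}
`

// renderTfvars renders the template with the given values.
func renderTfvars(vars KeyVaultVars) (string, error) {
	tmpl, err := template.New("tfvars").Parse(tfvarsTemplate)
	if err != nil {
		return "", fmt.Errorf("parsing template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, vars); err != nil {
		return "", fmt.Errorf("rendering template: %w", err)
	}
	return buf.String(), nil
}

// missingLines returns every expected line that is not present,
// as an exact line, in the rendered output.
func missingLines(rendered string, expected []string) []string {
	present := make(map[string]bool)
	for _, line := range strings.Split(rendered, "\n") {
		present[line] = true
	}
	var missing []string
	for _, line := range expected {
		if !present[line] {
			missing = append(missing, line)
		}
	}
	return missing
}

func main() {
	rendered, err := renderTfvars(KeyVaultVars{Name: "kv-e2e", Location: "northeurope", SoftDeleteDays: 7})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	missing := missingLines(rendered, []string{`name = "kv-e2e"`, `soft_delete_days = 7`})
	fmt.Println("missing lines:", len(missing))
}
```

Comparing whole lines rather than substrings keeps the check strict about formatting as well as values.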

We can choose from a variety of linters and code snippets that take care of different aspects of code quality:

Simple linters – standalone or IDE-integrated, it doesn’t matter. They provide basic syntax error detection and ensure readability. Examples of simple linters are built-in terraform fmt and terraform validate.
More advanced static code analysis tools – these are tests that check for security vulnerabilities, which are a result of misconfiguration, e.g. checkov or tfsec for Terraform configuration files.

Actual encounter of different parts

Everything looks good on paper, but things mostly fail when dependencies on other components start to play a role. The following two types of tests need to create the infrastructure – end-to-end tests and conformance tests. What’s the difference between them?

Let’s take a look.

In end-to-end infrastructure testing, we create real components directly via the cloud providers’ API. They are ephemeral in the context of a particular test. We then check if the actual configuration aligns with the desired one and if everything results in the expected behavior.

The most common test cases consist of checking whether:

The created resources provide necessary connectivity between each other and between the users and the resource, e.g. setting and getting Key Vault secrets from a command line.
Secondary resources were created, e.g. managed identities, which are not really the “main course” in a module but definitely need to be in place to ensure we have access to the created resource.
Basic operations specific to that resource can be handled without error, e.g. CRUD operations on a database.
And, of course, if the created resource has the test configuration applied.

On the other hand, applications have specific needs to conform to in terms of security, availability, and connectivity. These are the things that can’t be tested thoroughly with E2E tests, because we need to recreate the environment in which the application is running 1:1 to be sure. In these cases, conformance testing can be used. These tests are run inside of the runtime environment, which is Kubernetes in the context of this article. Some things look different inside of the cluster, e.g. the access policies are different for the resources running inside of it than they are for users and other managed identities. Moreover, using conformance testing, we can mock failures of dependent resources and check the efficiency of the disaster recovery tactics.

Let’s take a quick look at the differences between end-to-end and conformance tests below:

	End-to-end Infrastructure testing	Conformance testing
Location	Infrastructure is created in the cloud, and tests are executed outside the Kubernetes cluster.	Infrastructure is created in the cloud, and tests are run in the Kubernetes cluster, creating Kubernetes-specific resources.
Scope	Checks standalone resources, their configurations, and their connectivity, outside the Kubernetes cluster scope. Performs create, update, and delete actions on resources.	Cluster-specific tests. Checks whether workloads have the expected policies, access rights, connectivity to external resources (even in separate workspaces, like cloud environments), role assignments, DNS resolution, security, and stability in case of disruptions.
Use-cases	Run to test if the infrastructure is created correctly.	Run after creation and update of the infrastructure to check if it is created/updated correctly, or periodically to check its health or its behaviour in case of disruptions.
State after testing	Infrastructure is destroyed at the end of the test.	Infrastructure can remain undisrupted or disrupted. For periodic checks, only non-disruptive tests should be run, and only additional resources created specifically for testing purposes are deleted.

End-to-end infrastructure testing in depth

E2E infrastructure testing can be divided into given-when-then parts, like any other test. The only difference is what these parts comprise. The table below gives an overview.

Given	Establishing environment details; stating resource-specific values; passing both to templating functions
When	Rendering resource templates; creating resources in the cloud
Then	Verifying correctness of the output configurations; verifying existence of the resources; verifying resource-specific operations (set, get, create, update, etc.); destroying resources in the cloud
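The given-when-then flow can be sketched in Go as a small scenario runner that always attempts the final destroy step, even when an earlier step fails. The Infra interface, the fakeInfra stand-in, and the output key are hypothetical. A real implementation would wrap the actual provisioning calls behind Apply and Destroy.

```go
package main

import (
	"errors"
	"fmt"
)

// Infra is a hypothetical abstraction over creating and destroying
// the resources under test.
type Infra interface {
	Apply() (map[string]string, error)
	Destroy() error
}

// errUnexpectedOutput is a demo error used by the verification step below.
var errUnexpectedOutput = errors.New("unexpected output")

// runScenario applies the infrastructure, runs the verification against
// its outputs, and always attempts to destroy the resources afterwards.
func runScenario(infra Infra, verify func(outputs map[string]string) error) (err error) {
	defer func() {
		// Keep the first error, but never skip the cleanup.
		if destroyErr := infra.Destroy(); destroyErr != nil && err == nil {
			err = fmt.Errorf("destroy: %w", destroyErr)
		}
	}()

	outputs, err := infra.Apply()
	if err != nil {
		return fmt.Errorf("apply: %w", err)
	}
	if err := verify(outputs); err != nil {
		return fmt.Errorf("verify: %w", err)
	}
	return nil
}

// fakeInfra is an in-memory stand-in used only for this illustration.
type fakeInfra struct {
	outputs   map[string]string
	destroyed bool
}

func (f *fakeInfra) Apply() (map[string]string, error) { return f.outputs, nil }

func (f *fakeInfra) Destroy() error {
	f.destroyed = true
	return nil
}

func main() {
	infra := &fakeInfra{outputs: map[string]string{"vault_name": "kv-e2e"}}
	err := runScenario(infra, func(outputs map[string]string) error {
		if outputs["vault_name"] != "kv-e2e" {
			return errUnexpectedOutput
		}
		return nil
	})
	fmt.Println("error:", err, "destroyed:", infra.destroyed)
}
```

Registering the cleanup before the apply step means that a partially created environment is still removed when provisioning fails halfway.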

Our team at VirtusLab has created a custom E2E framework, written in Go, to encapsulate test logic for Terraform modules. It uses two modes:

Local development – infrastructure can be tested remotely but using code from a local machine. The infrastructure won’t be deleted, so we can inspect it in the cloud. It is very useful for finding the root cause of a test failure. For example, sometimes secondary resources weren’t created because of a networking problem or a provider was left misconfigured after an upgrade.
E2E tests in the continuous integration (CI) pipeline – carried out using code from the remote repository. The infrastructure will be deleted at the end. This mode is adapted for full automation.
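As an illustration of how such a switch might be wired, the sketch below selects the mode from an environment variable and derives the cleanup decision from it. The variable name E2E_MODE and its values are assumptions made for this example and do not describe the actual framework.

```go
package main

import (
	"fmt"
	"os"
)

// Mode describes how an E2E run treats the created infrastructure.
type Mode string

const (
	ModeLocal Mode = "local" // keep resources for inspection
	ModeCI    Mode = "ci"    // destroy resources at the end
)

// parseMode maps a raw value (for example from the hypothetical E2E_MODE
// variable) to a Mode. An empty value defaults to CI so that nothing is
// left behind by accident.
func parseMode(value string) (Mode, error) {
	switch Mode(value) {
	case ModeLocal:
		return ModeLocal, nil
	case ModeCI, "":
		return ModeCI, nil
	default:
		return "", fmt.Errorf("unknown E2E mode %q", value)
	}
}

// shouldDestroy reports whether the infrastructure is removed after the test.
func shouldDestroy(mode Mode) bool {
	return mode == ModeCI
}

func main() {
	mode, err := parseMode(os.Getenv("E2E_MODE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("mode=%s destroy=%t\n", mode, shouldDestroy(mode))
}
```

Defaulting to the destructive CI behaviour is a deliberate choice in this sketch: keeping resources has to be requested explicitly.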

In each of these modes, resource templates are rendered from the actual module directories. We can pass the necessary configuration variables to test specific use cases. All rendered Terraform files are then applied to the actual cloud environment by running the init, plan, and apply steps. By calling everything from code, using libraries like Terratest, and wrapping them in additional logic, we get a custom library in which all the necessary variables are available in code and ready to use, including the runtime values from Terraform outputs. We can easily use them for calling actions on the resources. That way, we can compare the desired configuration with the actual remote state in the cloud. An additional advantage is that we can ensure that the providers and the cloud API don’t inject any default values that would break our desired outcome.
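The comparison of desired and actual configuration can be sketched as a plain map diff. The function below is a simplified illustration, not the code of our library. It assumes both sides are already flattened into string maps, and the attribute names in the usage example are only examples.

```go
package main

import (
	"fmt"
	"sort"
)

// diffConfig compares the desired settings with the values read back from
// the cloud and returns one message per mismatch, sorted for stable output.
// Keys that exist only in actual are ignored.
func diffConfig(desired, actual map[string]string) []string {
	var diffs []string
	for key, want := range desired {
		got, ok := actual[key]
		switch {
		case !ok:
			diffs = append(diffs, fmt.Sprintf("%s: missing, want %q", key, want))
		case got != want:
			diffs = append(diffs, fmt.Sprintf("%s: got %q, want %q", key, got, want))
		}
	}
	sort.Strings(diffs)
	return diffs
}

func main() {
	desired := map[string]string{
		"sku_name":                 "standard",
		"purge_protection_enabled": "true",
	}
	// Pretend the provider silently applied a different default.
	actual := map[string]string{
		"sku_name":                 "standard",
		"purge_protection_enabled": "false",
	}
	for _, d := range diffConfig(desired, actual) {
		fmt.Println(d)
	}
}
```

A mismatch reported here for a value we did set explicitly is a typical sign that a provider or API default overrode the desired configuration.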

Infrastructure testing scenarios differ for each resource. We can check network connectivity via different protocols, the success of the CRUD operations, the existence of secondary resources and configurations, etc. Because creating and updating infrastructure and changing its configuration all take time to propagate, polling resources and their status is a must.
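A minimal polling helper, assuming a check function that reports whether the resource has reached the expected state, might look like this. The names pollUntil and ErrTimeout are our own for this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTimeout is returned when the condition is not met in time.
var ErrTimeout = errors.New("condition not met before timeout")

// pollUntil calls check immediately and then once per interval until it
// reports true, returns an error, or the next attempt would exceed the timeout.
func pollUntil(timeout, interval time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return ErrTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	err := pollUntil(time.Second, 10*time.Millisecond, func() (bool, error) {
		attempts++
		// Pretend the resource becomes ready on the third status call.
		return attempts >= 3, nil
	})
	fmt.Println("attempts:", attempts, "error:", err)
}
```

Returning an error from check aborts the loop at once, which is useful for failures that will not resolve by waiting, such as an authorization error.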

The simplest example can be creating a Key Vault.

Let’s use the given-when-then diagram from above for better understanding.

Given	Pass subscription details; create values for Terraform Key Vault module variables; pass both to Terratest
When	Render Terraform templates; create resources in the cloud
Then	Verify correctness of the output configurations; verify existence of the resources; create a secret and get it from the Key Vault; deploy a VM and check if the connection to Key Vault is private; destroy resources in the cloud
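The "create a secret and get it from the Key Vault" step can be expressed as a round-trip check against a small interface. The sketch below uses a hypothetical SecretStore interface and an in-memory fake so that it runs without any cloud access. In a real test the interface would be backed by the cloud SDK or CLI.

```go
package main

import (
	"errors"
	"fmt"
)

// SecretStore is a hypothetical abstraction over a secret vault.
type SecretStore interface {
	SetSecret(name, value string) error
	GetSecret(name string) (string, error)
}

// fakeVault is an in-memory stand-in used only for this illustration.
type fakeVault struct {
	secrets  map[string]string
	readOnly bool
}

func newFakeVault(readOnly bool) *fakeVault {
	return &fakeVault{secrets: make(map[string]string), readOnly: readOnly}
}

func (v *fakeVault) SetSecret(name, value string) error {
	if v.readOnly {
		return errors.New("vault is read-only")
	}
	v.secrets[name] = value
	return nil
}

func (v *fakeVault) GetSecret(name string) (string, error) {
	value, ok := v.secrets[name]
	if !ok {
		return "", fmt.Errorf("secret %q not found", name)
	}
	return value, nil
}

// verifySecretRoundTrip writes a secret and reads it back, failing if
// either operation errors or the value differs.
func verifySecretRoundTrip(store SecretStore, name, value string) error {
	if err := store.SetSecret(name, value); err != nil {
		return fmt.Errorf("setting secret %q: %w", name, err)
	}
	got, err := store.GetSecret(name)
	if err != nil {
		return fmt.Errorf("getting secret %q: %w", name, err)
	}
	if got != value {
		return fmt.Errorf("secret %q does not match the written value", name)
	}
	return nil
}

func main() {
	fmt.Println("writable vault:", verifySecretRoundTrip(newFakeVault(false), "e2e-test", "dummy"))
	fmt.Println("read-only vault:", verifySecretRoundTrip(newFakeVault(true), "e2e-test", "dummy"))
}
```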

Does your application environment conform to the expectations?

In addition to the end-to-end test, conformance tests are run to check the infrastructure’s overall health and troubleshoot problematic features in runtime. They usually run in Kubernetes, in isolation, in dedicated namespaces with test resources. Such tests must be non-disruptive for existing infrastructure as well.

The most popular tool used for such tests is VMware Tanzu’s Sonobuoy.

As Sonobuoy is designed to run in-cluster tests, we extend its use case to running a variety of different types of tests such as in-cluster connectivity, testing authorization, and lifecycle management. This gives us a nice baseline for future customization.

Some applications are deployed in the cluster but can be considered part of the infrastructure. These include, for example, monitoring and logging apps such as Splunk or Thanos, CI/CD solutions like ArgoCD or Jenkins, certificate authorities, or even webhooks you created for the developers. These solutions need to be resilient, and conformance tests are ideal for testing their lifecycle.
During runtime, these resources create temporary subresources, the existence of which needs to be checked for proper ecosystem functioning.
Internal and external cluster connectivity is often a tricky combination of many network rules, which must be thoroughly checked to ensure security. Connections to dependent infrastructure resources like image repositories and storage entities require both connectivity and permission check-ups.
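The connectivity half of such a check can be as small as a TCP dial with a timeout, run from a pod inside the cluster. The sketch below is a generic illustration using only the Go standard library. The local listener stands in for a real dependency such as an image repository, and the permission check is out of scope here.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkTCP reports whether a TCP connection to address can be
// established within the timeout.
func checkTCP(address string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", address, timeout)
	if err != nil {
		return fmt.Errorf("connecting to %s: %w", address, err)
	}
	return conn.Close()
}

// startLocalListener opens a listener on a free loopback port. It is only
// a stand-in for a real dependency, so the example runs anywhere.
func startLocalListener() (address string, stop func(), err error) {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return listener.Addr().String(), func() { _ = listener.Close() }, nil
}

func main() {
	address, stop, err := startLocalListener()
	if err != nil {
		fmt.Println("cannot start listener:", err)
		return
	}
	fmt.Println("reachable while listening:", checkTCP(address, time.Second) == nil)
	stop()
	fmt.Println("reachable after stop:", checkTCP(address, time.Second) == nil)
}
```

The same function can assert the negative case: for an endpoint that network rules should block, the test expects checkTCP to return an error.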

How to structure a conformance test

In the case of extended conformance tests, the infrastructure already exists, so we only need to either create a working connection to a resource and trigger some logic, or create a secondary resource to do it for us. As with the E2E test, this can be nicely divided into setup, assessment, and teardown phases.

To make this concrete, the example below shows a test case that checks the ability to pull images from Azure Container Registry into the Azure Kubernetes Service cluster. We can divide this simple test into the following steps:

In the Setup block, we create a separate namespace to limit the blast radius and create a special pod that will try to pull the image. Creating a separate namespace for a test case can greatly simplify removing all of the created resources and ensure we start without any old residue that can affect the test logic.
If the pod reaches the running status, the image has been successfully downloaded, and we assume that the cluster components have the necessary permissions set in the repository. The test logic is placed in the middle part of the code.
When the test ends, we clean up the namespace.

import (
	"context"
	"fmt"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/e2e-framework/klient/wait"
	"sigs.k8s.io/e2e-framework/klient/wait/conditions"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

// The helpers createNSForTest, deleteNSForTest and buildAzureResourceNameWithoutHyphens,
// as well as podName, contextNamespaceNameKey and testEnvironment,
// are defined elsewhere in the test package.

// TestJobPullFromACR checks that the cluster can pull an image from Azure Container Registry.
func TestJobPullFromACR(t *testing.T) {
	f := features.New("Pull from ACR").
		Setup(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			// Create a dedicated namespace to limit the blast radius of the test.
			var err error
			ctx, err = createNSForTest(ctx, cfg, t)
			require.NoError(t, err)
			namespace := fmt.Sprint(ctx.Value(contextNamespaceNameKey))

			t.Logf("Creating pod with testing image")
			pod := buildPullACRImagePod(namespace)
			assert.NoError(t, cfg.Client().Resources(namespace).Create(ctx, pod))

			t.Logf("Pod %s/%s scheduled", namespace, podName)

			return ctx
		}).
		Assess("Pull image from ACR", func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			namespace := fmt.Sprint(ctx.Value(contextNamespaceNameKey))

			// Poll every second, for up to a minute, until the pod is running.
			// A running pod means the image was pulled successfully.
			pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: namespace}}
			err := wait.For(
				conditions.New(cfg.Client().Resources()).PodRunning(pod),
				wait.WithImmediate(),
				wait.WithTimeout(time.Minute),
				wait.WithInterval(time.Second),
			)
			if assert.NoError(t, err) {
				t.Logf("Pod reached phase 'Running'")
			} else {
				t.Logf("Pod didn't reach phase 'Running': %s", err.Error())
			}

			return ctx
		}).
		Teardown(func(ctx context.Context, t *testing.T, config *envconf.Config) context.Context {
			// Deleting the namespace removes every resource the test created.
			assert.NoError(t, deleteNSForTest(ctx, config, t))
			return ctx
		}).Feature()

	testEnvironment.Test(t, f)
}

// buildPullACRImagePod returns a pod whose only container uses an image stored in ACR.
func buildPullACRImagePod(namespaceName string) *corev1.Pod {
	acrName := buildAzureResourceNameWithoutHyphens("eun", "containerregistry")
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      podName,
			Namespace: namespaceName,
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{
					Name:  "pull-from-acr-container",
					Image: fmt.Sprintf("%s.azurecr.io/conformance-testing:latest", acrName),
				},
			},
		},
	}
}

Putting it all together

Having gone through all these test types, we are ready to put together a solution that will make our infrastructure as resilient as possible. After changes are made to the code base, the CI pipeline should run all the test types sequentially, starting with linting and static code analysis and continuing through the E2E and conformance tests. When we are ready to deploy, the CD pipeline should run the non-disruptive conformance tests to ensure that the environment is functioning properly after the update.
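The ordering idea, cheap checks first and a stop at the first failure, can be sketched as a tiny stage runner. In a real pipeline this logic usually lives in the CI configuration. The Stage type and the succeed and failWith helpers are hypothetical stand-ins for invoking the real tools.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is one step of the pipeline, ordered from cheapest to most expensive.
type Stage struct {
	Name string
	Run  func() error
}

// runStages executes the stages in order and stops at the first failure,
// so a lint error never triggers a long E2E deployment.
func runStages(stages []Stage) ([]string, error) {
	var completed []string
	for _, stage := range stages {
		if err := stage.Run(); err != nil {
			return completed, fmt.Errorf("stage %q failed: %w", stage.Name, err)
		}
		completed = append(completed, stage.Name)
	}
	return completed, nil
}

// succeed and failWith are stand-ins for calling the real tools.
func succeed() error { return nil }

func failWith(msg string) func() error {
	return func() error { return errors.New(msg) }
}

func main() {
	completed, err := runStages([]Stage{
		{Name: "lint", Run: succeed},
		{Name: "static analysis", Run: failWith("misconfiguration found")},
		{Name: "e2e", Run: succeed},
		{Name: "conformance", Run: succeed},
	})
	fmt.Println("completed:", completed)
	fmt.Println("error:", err)
}
```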

Closing words

Testing cloud infrastructure comes with many hardships, but it definitely pays off. Following the steps presented in this article should get you started. There are many design details specific to each infrastructure that need to be treated separately, but they all fall into one of the main categories you need to remember.

For end-to-end tests, check: