How to create a reference architecture with Kubernetes on Azure: an extensive guide
Bartosz Janda
Solution Architect
Published: May 15, 2024 | 15 min read
According to a recent study, 91% of respondents agreed that Kubernetes had brought significant advantages to their company. These benefits span many areas, including improvements in operations and business, increased developer productivity, and better IT efficiency. The Kubernetes tooling ecosystem constantly evolves.
It presents new ideas and solutions that enhance existing development and deployment models through abstraction layers, changing the way companies manage their applications. Alongside this progress, however, organisations face a growing level of difficulty. Kubernetes itself is a complex system, and managing it on Azure adds another layer of difficulty. The various components, services, and configurations involved can be intimidating, especially for users who are new to Kubernetes and Azure.
In addition, Azure Kubernetes Service (AKS) is still evolving: Microsoft constantly introduces new features and integrations to the ecosystem. That is why a solid architecture design serves as the fundamental building block for creating reliable and scalable systems.
VirtusLab is a company well-versed in cloud-native technologies, with a wealth of experience serving many clients and a deep understanding of the common difficulties related to infrastructure. Based on this experience, we want to show you how to pinpoint the most suitable technological options and scalable design strategies tailored to your projects.
This guide provides the knowledge and practical advice to effectively design, implement, and scale Kubernetes-based infrastructure on the Azure cloud platform. It serves as an invaluable resource for engineering teams, enabling them to focus on driving business value rather than becoming entangled in the intricacies of solution selection.
In today’s landscape, it’s crucial to create a smooth development and deployment process, guarantee security, and implement efficient monitoring practices. This necessitates the mastery of various tools and solutions. Before you can effectively design, implement, and operate Kubernetes clusters on the Azure cloud platform, you will need to understand the technology, design principles, and core concepts used in reference architectures. But no need to worry.
We will go through every detail so that you can create your own Kubernetes infrastructure on Azure. Let's start with an environment that provides scalability and accessibility, and supports change. Establishing and managing a production-ready Kubernetes environment poses challenges, especially where scalability is concerned. Let's have a look at a reference architecture designed explicitly for Kubernetes on Azure.
Let’s break it down to grasp the reference architecture clearly.
1. The control plane manages, configures, and operates the environment.
2. Environments should function independently and self-sufficiently. They need to offer tools and solutions for development, deployment, management, and monitoring.
3. Secure connectivity is essential for cloud resources, as it prevents unauthorised access.
4. Third-party solutions, like Thanos, Splunk, and others, can handle monitoring and logging.
Every organization is different and adheres to unique business rules and processes. This translates into divergent needs and technologies used. If your company wants to maintain its market share and leadership position, it must quickly adjust to match specific environments.
You can effectively respond to unique conditions once you employ a flexible baseline for cloud infrastructure and adhere to strong design principles. This enables you to make rapid adjustments while ensuring a strong and stable foundation for your operations.
So, what design principles are we talking about?
Based on our cooperation with clients, we have distilled design principles that let you establish an infrastructure adapted to your specific needs.
Embrace open standards in technology
By relying on open standards, you ensure compatibility and interoperability with other technologies. You steer clear of an opinionated approach that could hinder future modifications, and you enable existing solutions to remain flexible and adaptable to evolving needs.
Implement end-to-end testing strategy
An end-to-end testing strategy is a critical component of any infrastructure development process. To uphold the highest quality standards for your products, devise internal solutions for testing infrastructure, covering provisioning, drift detection, backward incompatibility, and more. By utilising these solutions, you guarantee that the infrastructure remains up-to-date, secure, and reliable.
Adopt a modern GitOps approach
By utilising a modern GitOps approach, you achieve standardised and automated deployment strategies, as well as robust disaster recovery capabilities for applications. This empowers your business to maintain resilience and agility, even in the face of unexpected events.
Leverage modular architecture
From day one, you should prioritise flexible and modular architecture. This allows you to tailor solutions to meet your specific requirements.
Promote an internal open contribution model
Fostering community-driven solutions and nurturing a developer-friendly culture within organisations enables innovation. Avoid becoming a bottleneck for infrastructure setup, and introduce new changes so that knowledge is shared and distributed effectively.
Scale operations through automation and runbooks
While people scale linearly, automation holds the key to exponential scalability regarding infrastructure lifecycle management and deployment processes. By leveraging automation and employing runbooks, you can operationalise critical knowledge and streamline processes, leading to more efficient and scalable operations.
Once you have a structure for your design process in place, you can move on to the next stage.
You need to understand the core concepts used in a reference architecture to design a scalable Kubernetes-based architecture on Azure efficiently. We have divided the concepts by necessity and order of implementation.
The foundation
Before creating or implementing anything else, you should lay down a solid foundation; otherwise, repairing or changing the infrastructure later becomes much harder. In line with the design principles, start with a modular infrastructure.
Modular infrastructure automation with Terraform
Design for high availability
Scalable networking
Kubernetes networking considerations
Must-have best practices
Make sure you establish a modern, secure, and automated infrastructure.
GitOps deployment in Kubernetes
Secure access to your application
Consistent authentication and access management
Workload Identity
Automated secrets management
Daily support and maintenance
Create an easy-to-maintain environment with some best practices in mind.
Security and compliance by design
Release management
Monitoring and logging
Infrastructure end-to-end testing
Large-scale infrastructure
Once you have created a secure and automated environment, consider scaling it to 20+ instances. This is mainly needed at the enterprise level to ensure operational excellence.
Managing large-scale infrastructure with a control plane
Modular infrastructure automation with Terraform
The foundation of any architecture lies in provisioning infrastructure assets. Rather than manually setting up and keeping track of resources, we advise adopting automation and defining all assets as code. Consider utilizing a collection of Terraform modules that enable the configuration and management of cloud components.
These modules handle essential elements, including networking, DNS zones, Key Vault, Container Registry, and AKS clusters. Through the separation of modules, the management of infrastructure becomes more flexible. Each module can be modified independently, tested separately, versioned, and effortlessly reused in various scenarios.
Assigning a version to each module and offering thorough documentation encompassing usage explanations, examples, and migration guidelines for implementing significant changes is crucial. To facilitate seamless collaboration and accessibility, we suggest storing these modules in separate git repositories.
This guarantees effortless access, development, and sharing of the modules. Furthermore, housing the modules in distinct git repositories empowers developers to clone a repository, grasp its internal mechanisms, make adjustments, and contribute to the module. This fosters collaboration, stimulates enhancements, and facilitates knowledge exchange within the development community.
To enhance lifecycle management, instead of using Bash wrappers, you could utilize Terragrunt. It efficiently handles dependencies between modules and deploys them sequentially. It also enables you to separate components into autonomous units, while still being able to deploy the entire environment with a single command.
Securing Terraform Environments and Secrets Management
Maintaining a well-structured code is crucial for distinguishing between environments and reusable Terraform modules. Here are some general guidelines to follow.
Environments consist of resources created across multiple regions.
The environment.hcl file contains configuration values specific to each environment, which are reused across regions.
For each region, there is a terragrunt.hcl file that holds region-specific configuration and a collection of Terragrunt modules.
Modules related to a particular component, like AKS, are grouped together and stored in the aks folder.
Each Terragrunt module points to a versioned Terraform module located in the TerraformModules directory.
Utilising semantic versioning for modules simplifies the migration of environments.
It allows testing a new Terraform module in a specific region without impacting other environments or regions.
While the source code of Terraform modules can be placed directly in the TerraformModules folder, a more effective approach is to reference modules stored in an external Terraform registry or a git repository.
In this case, the Terraform code in the TerraformModules folder acts as a wrapper, providing a standardized set of variables and outputs, such as tags, subscription ID, and region.
By adhering to these guidelines, developers can ensure a more organized and manageable code structure for their Terraform projects.
```
├── Environments
│   ├── environment_1
│   │   ├── environment.hcl
│   │   ├── region_1 (northeurope)
│   │   │   ├── terragrunt.hcl
│   │   │   ├── aks
│   │   │   │   ├── aks
│   │   │   │   │   └── terragrunt.hcl
│   │   │   │   ├── application-gateway
│   │   │   │   │   └── terragrunt.hcl
│   │   │   │   ├── argocd
│   │   │   │   │   └── terragrunt.hcl
│   │   │   │   └── ...
│   │   │   ├── container-registry
│   │   │   │   └── terragrunt.hcl
│   │   │   ├── key-vault
│   │   │   │   └── terragrunt.hcl
│   │   │   ├── private-links
│   │   │   │   └── terragrunt.hcl
│   │   │   ├── subnet
│   │   │   │   └── terragrunt.hcl
│   │   │   └── ...
│   │   ├── region_2 (westeurope)
│   │   │   └── ...
│   │   └── region_3 (swedencentral)
│   │       └── ...
│   └── environment_2
│       └── ...
├── TerraformModules
│   ├── README.md
│   ├── aks-1.0.0-0
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── application-gateway-1.1.0-0
│   │   └── ...
│   ├── argocd-1.0.0-0
│   │   └── ...
│   └── ...
```
Terraform state files must be kept secure. Store them in a secure location with controlled and limited access. It is vital to refrain from keeping any secrets, tokens, or passwords within the state files. Instead, consider utilizing Key Vault, a secure storage solution for safeguarding various secrets.
By configuring your application to retrieve secrets from the Key Vault directly, you enhance security and maintain a clear separation between sensitive information and infrastructure. When generating secrets, it is best to leverage Terragrunt hooks instead of relying solely on Terraform resources.
Terragrunt hooks are shell scripts that execute custom code, providing the flexibility to generate new SSH or TLS key pairs, create random passwords, or retrieve secrets from alternative sources and upload them securely to the Key Vault. For example, Terragrunt executes a set of configured hooks each time it runs. A hook can check whether a secret exists in the Azure Key Vault. If it fails to find one, or if the secret is about to expire, it creates a new secret and uploads it back to the Azure Key Vault.
As soon as you adhere to these best practices, you reinforce the security posture of your infrastructure. You centralize and manage secrets in a secure and controlled manner. This approach ensures that sensitive information remains protected and minimizes the risk of exposure through Terraform state files.
Design for high availability
Uninterrupted application availability is paramount for most businesses. A well-designed infrastructure should handle expected and unforeseen events, including regional failures, application errors, and upgrade complications. We suggest deploying Kubernetes instances across two distinct regions and multiple availability zones to improve high availability.
This configuration provides improved resilience, ensuring continuity if one region fails. You can designate one region as a failover to take traffic if the primary region becomes unavailable. Alternatively, both regions can operate simultaneously, distributing and handling user traffic in an active-active manner.
You will achieve optimal performance and cost efficiency by scaling your infrastructure in response to fluctuating user traffic or sudden increases in load. Overprovisioning resources can result in unnecessary expenses, particularly during periods of low application demand. Azure load balancers and application gateways provide automated scaling capabilities, but it is equally crucial for your application to scale dynamically to match the current demand.
Kubernetes offers built-in scaling capabilities that cater to various scaling scenarios. It allows scaling both applications and underlying nodes, addressing common scaling requirements efficiently.
The Kubernetes cluster autoscaler
Adjusts the number of nodes based on their usage and available pod resources.
Horizontal Pod Autoscaling (HPA)
Enables the scaling of pods within Kubernetes deployments. In its basic form, HPA scales the number of pods based on CPU or memory usage. HPA can be integrated with external providers like Prometheus Adapter for more advanced setups, allowing you to scale pods based on custom metrics such as HTTP request rate.
Kubernetes-based event-driven autoscaling (KEDA)
Provides advanced scaling options based on external factors. For example, you can scale based on the number of elements in a queue system or the results of an SQL query. KEDA even includes a cron-like scaler to scale down development environments after working hours.
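To make this concrete, here is a minimal HPA sketch that scales a hypothetical web-api Deployment on average CPU utilisation; the workload name and thresholds are illustrative placeholders, not part of the reference architecture.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                    # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2                   # keep a baseline for availability
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas above 70% average CPU
```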
By leveraging these scaling solutions, you can effectively manage resource allocation, optimise performance, and ensure cost-effective scalability for your applications in Kubernetes.
Scalable networking
Create a robust and scalable network infrastructure to maintain seamless operations. Once live systems are in place, network configurations are difficult to modify; for example, expanding the network to increase the number of available IP addresses often requires deleting an entire subnet and its associated resources, leading to cumbersome network migrations.
A network architecture that meets business requirements is crucial for forward-thinking organizations. It allows flexibility for future changes, such as preventing IP address exhaustion, and helps avoid the challenges above.
There are several key considerations for designing a well-crafted network architecture.
1. Network layout
Each environment should have at least two subnets.
The public subnet serves as the entry point for applications accessible over the internet or by other applications within the network. This layer leverages managed services like application gateways, public and internal load balancers.
On the other hand, private subnets are dedicated to hosting resources such as AKS workloads, virtual machines, and managed services. Access to the private subnet should be restricted, with only load balancers from the public subnet having access to the workloads.
2. Private connectivity
Applications should minimize reliance on the public internet to connect Azure resources such as databases, storage accounts, or container registries. Instead, private links can be established to keep network traffic within the Azure virtual network.
By leveraging private links, applications can securely access Azure-provided resources, including Azure Kubernetes Service, Key Vault, and Container Registry. These services should ideally reside in a dedicated private subnet with no direct internet access, ensuring a more robust security posture.
3. Network segmentation
For low-tier applications where tight network isolation is not required, splitting the virtual network (VNet) into multiple environments is possible. Each environment can have its own subnets, separated by Network Security Groups (NSGs). This approach works well for smaller projects or development environments where size and scale are less of a concern.
4. Hub and spoke topology
Adopting a hub and spoke topology in larger networks is worth considering. The network hub serves as a central point of connectivity, linking your on-premises network to various virtual networks (spokes) within Azure. Utilise spokes to isolate and manage workloads separately, even across different subscriptions, representing different environments such as production and non-production.
The network hub typically includes a firewall (Azure firewall service or a third-party solution), an Azure VPN Gateway for secure connections to on-premises networks via VPN, and an Azure ExpressRoute Gateway for establishing private direct connections to offices.
By carefully designing and implementing these networking principles, you can ensure a resilient and secure network infrastructure that supports scalability, isolation, and efficient connectivity between your resources in Azure.
Kubernetes networking considerations
Let's dive into the critical realm of Kubernetes networking, where a well-designed network setup is vital for the smooth operation of pods within your cluster. Each pod demands a unique IP address, and Kubernetes equips you with the Container Network Interface (CNI) to configure various network solutions.
Kubenet vs. Azure CNI
In the vast landscape of network solutions, two stand out as commonly used and fully supported options in AKS: Kubenet and Azure CNI. Both bring their unique strengths and considerations to the table, and selecting the right one depends on several key factors outlined below.
Kubenet: Navigating the Overlay
Kubenet takes the approach of creating a virtual overlay network within Kubernetes. Here, pods are assigned IP addresses from this network, while Kubernetes nodes draw their IP addresses from the Azure virtual network subnet. A Network Address Translation (NAT) solution steps in to facilitate communication beyond the cluster's borders.
To manage pod communication effectively, you'll rely on a route table and User Defined Routing (UDRs). While Kubenet can reduce the number of required IP addresses in Azure subnets, it introduces additional complexity in network configuration and management. It's essential to bear in mind that Kubenet clusters are ideally suited for up to 400 nodes, with no support for Windows nodes.
Azure CNI: Direct to the Source
Azure CNI takes a more direct approach, assigning an IP address directly from the Azure virtual network subnet to each pod. Nodes reserve IP addresses from the subnet to allocate them to pods. With Azure CNI, pods effortlessly communicate with other resources and fellow pods, requiring minimal additional configuration.
This approach does call for careful network planning, often necessitating larger subnets specifically designated for AKS clusters. However, the setup and management process generally prove simpler, and the beauty of Azure CNI is its compatibility with Windows nodes.
A Detailed Comparison
For an in-depth comparison of these two options, consult the table below, which offers a comprehensive breakdown of their features and nuances:
| | Kubenet | Azure CNI |
| --- | --- | --- |
| Pod-to-pod communication | no (an additional hop is required) | yes |
| Pods per node (default/maximum) | 110/250 | 30/250 |
| Maximum number of nodes | 400 | n/a |
| Linux nodes | yes | yes |
| Windows nodes | no | yes |
| Network policy support | yes (Calico only) | yes (Calico and Azure Network Policy) |
| Multiple clusters in the same subnet | yes | no |
| Additional requirements | routes and UDRs (User Defined Routing) | no |
In navigating the Kubernetes networking landscape, making an informed choice between Kubenet and Azure CNI is crucial. Each option brings its advantages to the forefront, and by considering your unique requirements and priorities, you'll chart a course that ensures your network setup aligns seamlessly with your cluster's needs.
Once the foundation is built, we can focus on improvement.
GitOps deployment in Kubernetes
Before we dive into the heart of the matter, let's take a moment to explore some essential technologies at our disposal.
Continuous Integration and Continuous Delivery (CI/CD) have evolved into the industry's gold standard for managing applications in today's software development landscape. Tools like Jenkins, Azure DevOps, and GitHub Actions excel at handling both the building and deployment of applications. However, a game-changing strategy involves separating these responsibilities and embracing a dedicated delivery tool. This approach empowers your teams to deploy smaller, incremental changes swiftly across various environments.
Enter GitOps, a revolutionary approach to deploying cloud-native applications in this context. GitOps leverages familiar developer tools, placing Git at the core of its operations. It stores your entire release history within a repository, and deploying a new application version becomes as simple as making a change in that repository. This approach grants you unparalleled visibility through the commit history, making reviews effortless and offering the safety net of reverting changes if needed.
Now, let's explore some of the outstanding GitOps solutions available in the market. Flux and ArgoCD stand out as the most popular choices, both proudly part of the Cloud Native Computing Foundation (CNCF) Graduated projects. They have earned their stripes, boasting stability and suitability for production use, all while thriving with vibrant and active communities backing their development efforts.
Among these stellar options, we wholeheartedly recommend ArgoCD, primarily for its extensive feature set. Here's a glimpse of what makes ArgoCD shine:
Jsonnet Support: In addition to Helm, ArgoCD offers robust support for Jsonnet templates. This versatility proves invaluable when deploying monitoring tools like Prometheus, Grafana, and Alertmanager, all of which rely on Jsonnet. This flexibility empowers you with greater customization and control over managing these essential components.
Graphical User Interface (GUI): ArgoCD goes above and beyond with a powerful GUI that presents a visual representation of your deployed applications. This intuitive interface equips your teams to effortlessly monitor application statuses, investigate issues, and review logs. It's a game-changer that simplifies operations and elevates the user experience.
Plugin System: ArgoCD is purpose-built for seamless integration with third-party plugins like SOPS and Image Updater. This robust plugin ecosystem extends ArgoCD's capabilities, making customization a breeze and enabling the incorporation of additional features when needed.
In the GitOps pattern, pull-based deployments are the name of the game. An operator, such as ArgoCD or Flux, deployed within your environment, diligently scans for changes in the Environment repository. This repository houses declaratively described application configurations. Any modification triggers the deployment process. The declarative configuration arms Kubernetes with all the necessary information to perform precise updates to your applications.
But that's not all; you can also monitor changes in an image repository. When a new image version meets specific criteria, an operator can synchronize updates between your application and the Environment repository simultaneously.
In this ecosystem, it's essential to recognize the value of deploying a dedicated ArgoCD operator for each Kubernetes cluster, independent of other clusters and ArgoCD instances. No external service needs access to your cluster credentials: ArgoCD relies on Kubernetes service account credentials to obtain the permissions it needs, effectively preventing a "god mode" design in which a single instance gains access to all clusters.
Lastly, the “App of Apps” pattern deserves a special mention. This approach enables the deployment of multiple applications within a single cluster, simplifying management significantly. A single ArgoCD Application object takes charge of deploying multiple ArgoCD Applications, streamlining the process for you.
To leverage this pattern, you simply register an “applications” object, allowing you to deploy multiple applications with ease. Adding a new application to your environment is a breeze – just introduce a new file to the environment repository. The Argo CD operator will automatically enroll and deploy your application to Kubernetes.
Beyond application deployment, this pattern proves invaluable for setting up core cluster components, such as an ingress controller, certificate manager, monitoring solutions, secrets managers, and policy agents, to name a few. It’s a comprehensive approach that ensures a robust foundation for your applications.
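As a sketch of the pattern, a single parent Application can point at a directory of child Application manifests in the environment repository; the repository URL and paths below are placeholders, not a prescribed layout.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: applications               # the parent "app of apps"
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/environment-repo.git  # placeholder repository
    targetRevision: main
    path: applications             # directory holding child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                  # remove resources deleted from Git
      selfHeal: true               # revert manual drift
```

Dropping a new Application manifest into that directory is then all it takes to enroll another workload or platform component.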
So, as you embark on your journey through modern software development, keep these powerful tools and strategies in mind. They’ll be your allies in building, deploying, and managing applications with efficiency, security, and flexibility.
Secure access to your application
Your application's security is a top priority, especially when it comes to safeguarding sensitive and proprietary data from any unauthorized access or tampering. In the Kubernetes landscape, the go-to method for exposing your services is through an Ingress controller. Now, let's dive into the two most popular choices available to you on the Azure platform:
Application Gateway Ingress Controller (AGIC): AGIC is a standout option that is custom-tailored for Azure. It seamlessly manages your Azure Application Gateway and integrates beautifully with Kubernetes. What's more, it opens up doors to using powerful tools like cert-manager to handle TLS certificates, sourced either from Azure Key Vault or dynamically from providers like Let's Encrypt. Beyond that, opting for Application Gateway brings a host of additional advantages, including configuring a Web Application Firewall (WAF), enabling autoscaling, and ensuring zone redundancy. It's a comprehensive choice for those looking for top-notch security and scalability.
NGINX Ingress Controller: NGINX, a trusted name in the field, brings its Ingress Controller into play. This controller sets up an Azure Load Balancer and deploys a set of NGINX pods within your Kubernetes cluster, efficiently handling incoming traffic. This option shines in scenarios with lower traffic volumes, where advanced traffic policies might not be crucial, and it typically comes at a more budget-friendly cost compared to AGIC. The Load Balancer it creates can be configured with either public or private IP addresses, with the latter being particularly handy for facilitating secure communication between services within private networks. Like AGIC, NGINX Ingress Controller also supports TLS termination, utilizing dynamic certificates managed by cert-manager. Additionally, it allows you to leverage TLS certificates from Azure Key Vault, seamlessly integrating with solutions like the Secrets Store CSI driver.
Both AGIC and NGINX Ingress Controller offer robust capabilities for secure ingress management, but the choice ultimately hinges on your specific application requirements and the features and functionalities that matter most to you.
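For instance, with the NGINX Ingress Controller and cert-manager in place, exposing a service over TLS can be as simple as the following sketch; the host name, ClusterIssuer, and backend Service are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-api
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical ClusterIssuer
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: web-api-tls      # cert-manager stores the issued certificate here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-api      # hypothetical backend Service
                port:
                  number: 80
```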
Consistent authentication and access management
Securing access to Azure resources and Kubernetes clusters is paramount for your organization's safety. It ensures that only the right individuals can access what they need while keeping unauthorized parties at bay. Here's what you need to know:
1 Azure Active Directory (Azure AD) Integration: To establish robust access control, it's crucial to rely solely on Azure Active Directory identities. This forms the foundation of your security. Moreover, consider the benefits of integrating with external identity and access management solutions like OneLogin and Okta. These integrations not only bolster security but also streamline the authentication process.
2 User-Managed Identities: Whenever possible, opt for user-managed identities. These identities grant access to Azure APIs and resources, including virtual machines, without the hassle of manually managing or rotating secrets. Azure takes care of these tasks automatically, making your life easier and your environment more secure. This approach also eliminates the need for sharing sensitive credentials, such as service principal's secrets or SSH keys, among team members.
3 Individualized Access: Empower each user to perform tasks using their unique identity. Reduce reliance on tokens, private keys, client secrets, plain usernames, and passwords to the absolute minimum, embracing a granular access model. If you require automated secret rotation, consider custom integrations to handle this.
4 Securing Static Secrets: Safeguard all static secrets in dedicated key vaults equipped with fine-grained access control. For custom integrations, ensure that automatic secret rotation is part of the implementation.
5 Azure Role-Based Access Control (RBAC): Manage resource access through the Azure Role-Based Access Control (RBAC) model. Limit access to authorized Azure Active Directory groups for resource groups or services. Tailor access levels to different user groups, offering read-only access for auditing application configurations, developer access for deploying and managing applications, and admin access for expanding the environment with additional Azure services.
6 Azure Privileged Identity Management (PIM): To become a member of the admin group, consider leveraging Azure Privileged Identity Management (PIM). This tool grants users temporary access to the admin group and all its associated permissions. Importantly, it requires approval from other team members or managers before elevated permissions are granted, adding an extra layer of security and accountability.
By implementing these practices, you'll not only fortify your access control but also ensure the safety of your Azure resources and Kubernetes clusters, aligning your organization with best-in-class security standards.
Workload Identity
Azure Kubernetes Service clusters bring a powerful feature to the table: ServiceAccount token volume projection. This feature represents a significant leap forward in token generation for pods. Unlike traditional Kubernetes Service Account tokens, these volume-projected tokens offer a range of benefits, including expiration dates, compatibility with OIDC (OpenID Connect), and unique tokens for each pod. This innovation sets the stage for identity federation, enabling seamless token exchange with Azure.
Now, let's explore Azure AD Workload Identity, a mechanism that establishes a trusted relationship between your application and an identity within Azure. This identity can take the form of an Azure AD application or a user-assigned managed identity. When deployed in Kubernetes, your applications can tap into the remarkable capabilities of identity federation. This means your application can leverage short-lived federated identity tokens, which can then be exchanged for an Azure access token. With this Azure access token in hand, your application gains secure access to a wide range of cloud resources.
Here's a simplified breakdown of the flow:
1. Kubernetes token to AAD token request: the Kubernetes token is sent to request an AAD (Azure Active Directory) token for your application.
2. AAD validation and token issuance: Azure AD validates the incoming token and issues a token tailored to your application.
3. Accessing cloud resources: your application utilizes the AAD token to securely access cloud resources.
By following this flow, you eliminate the need to handle and store client credentials or certificates to access Azure APIs from your application. The token, readily available from Kubernetes, is automatically generated and rotated.
Plus, the Azure SDK seamlessly supports this process, enabling developers to easily incorporate it into their application code without the need for custom modifications or automation controllers. It's a smart, secure, and efficient way to manage tokens and access cloud resources within your Azure Kubernetes Service environment.
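To give a feel for the setup, here is a minimal sketch: the ServiceAccount is annotated with the managed identity's client ID, and the pod opts in via a label. All names and the client ID below are placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-api                    # hypothetical service account
  namespace: production
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"  # placeholder managed identity client ID
---
apiVersion: v1
kind: Pod
metadata:
  name: web-api
  namespace: production
  labels:
    azure.workload.identity/use: "true"  # injects the projected token and Azure env vars
spec:
  serviceAccountName: web-api
  containers:
    - name: app
      image: example.azurecr.io/web-api:1.0.0  # placeholder image
```

With this in place, the Azure SDK can exchange the projected token for an Azure access token automatically, with no stored secrets.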
Automated secrets management
Securing your sensitive data is non-negotiable. It's paramount to store all your secrets in a highly secure location like Azure Key Vault. Secrets must stay far away from your source code and any inappropriate places. Plain text storage, which anyone can access, is simply not an option. To achieve this, implementing granular access control is your go-to move. This ensures that users and applications only access what they absolutely need.
Access Control Done Right: RBAC and PIM
When configuring permissions, opt for the Azure Role-Based Access Control (RBAC) model over access policies. RBAC offers a unified access model that works seamlessly across various resources. Plus, it employs the same API, making your life simpler. To add an extra layer of control, consider Azure Privileged Identity Management (PIM). PIM enables you to grant time-based access and manage privileged identities efficiently. It even allows you to deny access to specific principals when necessary. Furthermore, you can finely configure access to selected secrets and keys within your key vault.
Secret Rotation: Keeping Things Fresh
Regular secret rotation is a must. Your applications should seamlessly adapt to new secrets when they're regenerated, without needing developer intervention or application redeployment. Achieving this involves fetching secrets directly from sources like Azure Key Vault, AWS Secrets Manager, or HashiCorp Vault. However, this approach can tightly couple your application with the secrets provider, potentially causing headaches down the road.
Kubernetes to the Rescue
Thankfully, the Kubernetes ecosystem offers solutions to tackle this challenge. The preferred method is having your application read secrets from files and monitor changes in the file system. Tools like the Azure Key Vault Provider for Secrets Store CSI Driver and Vault Agent Injector can automatically update secret values in the file whenever changes occur.
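As an illustration, a SecretProviderClass for the Azure provider might look like the sketch below; the vault name, tenant ID, and secret name are placeholders.

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: web-api-keyvault
spec:
  provider: azure
  parameters:
    keyvaultName: "example-kv"     # placeholder Key Vault name
    tenantId: "<tenant-id>"        # placeholder tenant ID
    objects: |
      array:
        - |
          objectName: db-password  # secret to mount as a file
          objectType: secret
```

A pod then mounts this class through a CSI volume, and the driver keeps the file in sync as the secret rotates.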
It's worth noting that you cannot change environment variables once the application has started. If an application needs secrets provided as environment variables, or it is unable to read new secret values from a file, then you need to restart the Kubernetes container.
Tools like Reloader automatically watch for changes in Kubernetes ConfigMaps and Secrets objects. When an update occurs, it performs a rolling update of Kubernetes Deployments.
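A sketch of how this looks in practice: annotating a hypothetical Deployment lets Reloader roll it whenever a referenced Secret changes. All names and the image are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api                    # hypothetical workload
  annotations:
    reloader.stakater.com/auto: "true"  # roll the Deployment when its ConfigMaps/Secrets change
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: example.azurecr.io/web-api:1.0.0  # placeholder image
          envFrom:
            - secretRef:
                name: web-api-secrets  # the Secret Reloader watches
```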
By leveraging CI/CD solutions like GitHub Actions or Azure DevOps, organizations can streamline the process of secret rotation, ensuring enhanced security and reducing manual effort.
Exploring Secure Alternatives for Your Sensitive Data
Consider tools like Mozilla SOPS or Bitnami Sealed Secrets. Unlike storing sensitive data in the cloud, this approach keeps everything neatly alongside your source code, but with an added layer of security through encryption managed by key management systems, such as Azure Key Vault.
Your Control, Your Security
With this setup, only developers holding access to the encryption key—safely stored in Azure Key Vault—can interact with these secrets. This tight control ensures that your sensitive information remains in the right hands. Moreover, this coupling of secrets with your code enables lightning-fast application updates.
Transparency and Accountability
But that's not all. Leveraging this method, your Git commit history becomes a treasure trove of information. It reveals who made changes to the secrets and precisely when those changes occurred. This transparency and accountability are invaluable for efficiently managing and tracking secrets throughout the entire development process.
By considering these secure alternatives, you're protecting your sensitive data and streamlining your development workflow, all while ensuring complete visibility and control over your secrets.
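As a sketch of the SOPS route, a .sops.yaml at the repository root can tell SOPS which files to encrypt and which Key Vault key to use; the vault URL and key version below are placeholders.

```yaml
# .sops.yaml - placeholder vault and key
creation_rules:
  - path_regex: secrets/.*\.yaml
    encrypted_regex: ^(data|stringData)$   # encrypt only the secret payload fields
    azure_keyvault: https://example-kv.vault.azure.net/keys/sops-key/00000000000000000000000000000000
```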
Security and compliance by design
In the ever-evolving landscapes of Azure and Kubernetes, compliance is your compass, guiding you toward robust security and adherence to standards. Let's explore the wealth of solutions available to fortify your compliance efforts.
Azure's Defender for Cloud
First, enter Microsoft Defender for Cloud (formerly known as Azure Security Center), your central hub for safeguarding cloud-based applications. Its standout feature? The security score, an insightful tool that evaluates your resources for security vulnerabilities and condenses them into a single, actionable score. This functionality extends across workloads, storage, containers, and databases, offering a comprehensive view of your infrastructure's security. Armed with this visibility, you're better equipped to plan your next steps with confidence.
Kubernetes Security and Compliance Tools
Now, within the Kubernetes ecosystem, you'll discover an array of security and compliance tools. Two stalwarts stand out:
Gatekeeper: Leveraging the power of Open Policy Agent (OPA), Gatekeeper provides a universal solution for managing policies across your infrastructure and applications. It seamlessly integrates with Kubernetes, Envoy, Kafka, and custom applications, allowing you to evaluate policies comprehensively.
Kyverno: Kyverno takes a user-friendly approach, employing familiar YAML syntax for policy definition. This simplicity eases the learning curve and policy maintenance while delivering essential features like object validation, object mutation, test execution, auditing, and metric disclosure. The choice between Gatekeeper and Kyverno often comes down to personal preference, but both are indispensable for elevating security and enhancing the developer experience within the platform.
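To illustrate Kyverno's YAML-based approach, here is a minimal policy sketch that rejects pods using the ':latest' image tag; it is an example policy, not a mandated baseline.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce   # reject non-compliant pods instead of just auditing
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Container images must use a pinned tag, not ':latest'."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```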
Securing Applications and Dependencies
Beyond infrastructure, securing applications is paramount. Today's applications rely on numerous dependencies, including external libraries and frameworks. While these speed up development, they also introduce the potential risk of malicious code or security flaws. Regularly scanning both your application and its dependencies for known vulnerabilities is essential.
Docker images, a critical component of modern applications, require similar scrutiny. These images contain operating system binaries, standard libraries, runtime environments, and more, all of which must undergo vulnerability and malicious software scans. Leading continuous integration and container registry solutions often offer built-in scanning capabilities, but adding an extra layer of protection through solutions like Trivy, Grype, or Clair can be a smart move.
Code and Infrastructure as Code (IaC) Analysis
Your security journey extends to code and IaC. Static code analysis tools play a vital role in ensuring that your infrastructure remains correctly configured and adheres to industry-standard best practices. Consider integrating tools like Checkov, tfsec, Terrascan, and TFLint to bolster your confidence in your code and IaC.
Security Benchmarks and Regulatory Compliance
Lastly, don't overlook security benchmarks and regulatory compliance. Azure Policy supports cloud-based compliance validation, while kube-bench serves as a trusty tool for assessing the security status of your Kubernetes clusters. Standards like the CIS Microsoft Azure Foundations Benchmark, the Azure Security Benchmark, and the CIS Kubernetes Benchmarks are your allies in ensuring adherence to critical compliance requirements.
Release management
Prepare for a seamless transition in release management by keeping a few key principles in mind. Let's explore these considerations together:
Syncing with Kubernetes Release Cycles
First and foremost, your infrastructure lifecycle should harmonize closely with the Kubernetes release cycle. This means carefully aligning the timing of your infrastructure updates with Kubernetes updates. Each Kubernetes release brings new APIs, features, and crucial fixes, enhancing stability, compatibility, and security, among other aspects.
Leveraging Cloud Provider Kubernetes Versions
Cloud providers, such as Azure Kubernetes Service (AKS), offer Kubernetes versions tailored to their platforms. These versions are designed for smooth integration with the cloud environment. When you decide to update your clusters promptly upon a fresh Kubernetes release, you seize an opportunity to enhance other parts of your setup that may require adjustments to align with the new Kubernetes version.
Unlocking Automatic Cluster Upgrades with AKS
AKS streamlines cluster management with automatic cluster upgrade channels, an excellent option for teams independently managing their clusters. These channels free you from the burdens of manual cluster upgrades. Instead, you select one of the available channels, and your cluster upgrades itself automatically.
Consideration for External Components
However, there's an essential caveat to bear in mind. If your cluster hosts numerous components, each with specific Kubernetes version requirements, the automatic upgrade feature may not be suitable. External components like Argo CD, Flux CD, Cert-Manager and Prometheus may have precise Kubernetes version dependencies that demand attention.
Holistic Kubernetes Upgrades for Large Deployments
For extensive deployments, planning Kubernetes upgrades that encompass all components is advisable. To streamline this process, specialized tools and methods are at your disposal.
Automation with Terraform and Terragrunt
Tools like Terraform and Terragrunt, both part of the Infrastructure as Code landscape, can automate upgrades for both your clusters and the components within them.
GitOps Solutions for Management and Automation
GitOps solutions, such as Argo CD and Flux CD, further simplify and automate the upgrade process, enhancing efficiency and reliability.
Testing for Confidence
To ensure everything runs smoothly, you can employ Terratest to independently validate Terraform modules. Conformance testing ensures that the entire platform update proceeds as planned, delivering the desired results.
For more in-depth insights into testing, refer to this article on End-to-End Testing in Terraform, where you'll find additional valuable information. With these considerations and tools in your arsenal, you'll navigate release management with confidence, ensuring that your Kubernetes ecosystem remains resilient and up-to-date.
Mandatory steps for software releases
When it comes to smooth software releases, there are some important guidelines we suggest you follow:
Version Control System (VCS):
Releases and their dependencies should be versioned in a VCS to track changes and maintain a history of releases.
Changelog:
Create an automated changelog that lists all changes, including new features and breaking changes. This provides users with transparency about what's happening with each release.
Release Information and Documentation:
Maintain a dedicated place (e.g., a release information page or documentation) for users to find information about releases, documentation, migration steps, and upcoming deployment timelines. This helps users stay informed about changes and how they may affect their systems.
Announcements:
Send announcements to all relevant teams when a release is ready. Communication is key to ensuring everyone is aware of upcoming changes.
Cluster Upgrades:
Define a specific time for upgrading clusters. If an application team has multiple clusters in different regions, upgrade them separately.
Perform automated validation to ensure that each component of the cluster is functioning correctly.
If a cluster exposes an application through a load balancer, remove it from the load balancer pool during the upgrade and add it back once validation is successful.
Handling Breaking Changes:
Create migration scripts for each breaking change to be executed in a specific order during updates. This ensures a smooth transition for users.
Gathering Feedback:
Use each release as an opportunity to gather feedback, identify problems, and collect potential improvements. Document any issues reported during this process.
Infrastructure Update Pace:
Recognize that different parts of the infrastructure may take longer to update than others. Determine how and when to address any problems that arise during updates.
Runbooks:
Prepare runbooks for various actions, whether automated or manual. Runbooks should provide step-by-step instructions to achieve specific goals.
Application teams may have their own runbooks for tasks related to their applications, such as validation.
Bug Fixes:
In between minor releases, consider the need for bug fixes. Follow the same release process but avoid introducing disruptive changes, new features, or requirements for application teams to modify their code.
These guidelines promote a structured and collaborative approach to software release management, ensuring that updates are well-documented, communicated effectively, and carried out smoothly.
Implementing GitOps Workflow with ArgoCD for Kubernetes Deployment
In the management of Kubernetes deployments, it is essential to establish a clear workflow and segregate responsibilities effectively. This includes deploying separate ArgoCD instances for the platform team and the application team, each with distinct purposes.
1 Segregated ArgoCD Instances:
The platform team operates its dedicated ArgoCD instance, which is responsible for deploying platform components using the "app of apps" pattern.
A separate ArgoCD instance is allocated for the application team's use, enabling them to deploy applications and additional Kubernetes components, such as custom log and metrics forwarders.
2 GitOps Workflow for Continuous Deployment:
The GitOps workflow is adopted, facilitating continuous deployment through a Git repository.
Any changes made trigger automatic deployments to the relevant ArgoCD instance.
All configuration variables are maintained within Git repositories, offering a robust foundation for Disaster Recovery (DR) implementation.
3 Disaster Recovery (DR) Implementation:
In the event of a disaster, ArgoCD can swiftly redeploy the entire environment, encompassing both platform components and the team's applications.
GitOps automation streamlines this recovery process, ensuring rapid restoration that can be completed within minutes.
4 Managing Complexity with Automation:
While the solution offers a wealth of features and options, it also introduces complexity.
Automation addresses most scenarios efficiently, but some situations may necessitate manual intervention, particularly when integrating with external systems lacking automation capabilities.
This approach to Kubernetes deployment management empowers teams to maintain clear responsibilities, automate deployment processes, and leverage GitOps for enhanced efficiency and Disaster Recovery readiness.
Monitoring and logging
Today's software systems have evolved into intricate, distributed, and dynamic entities, making the task of monitoring and managing them an impressive challenge. While cloud-based solutions offer a certain degree of visibility into infrastructure and applications, the presentation and granularity of data can vary compared to dedicated monitoring solutions. As complexity grows, so do costs. To effectively monitor both core components and applications, integrating external monitoring and logging services at the foundational platform level becomes imperative.
The Power of Separation
To maximize effectiveness, it's wise to separate infrastructure monitoring from application monitoring. The platform team, entrusted with overseeing the infrastructure layer, should focus their visibility on core Kubernetes components and infrastructure-related services. This targeted approach ensures that they can maintain the stability and reliability of the underlying infrastructure without being overwhelmed by application-specific details.
Empowering Application Teams
Conversely, application teams responsible for their own applications should harness their unique monitoring solutions. This autonomy allows them to exercise full control over monitoring and logging, tailored precisely to their applications' distinctive requirements. By doing so, they can proactively address issues, fine-tune performance, and gain invaluable insights to drive their application's success.
With this strategic separation of responsibilities, you'll effectively navigate the complexities of modern software systems, optimizing both infrastructure and application monitoring for enhanced efficiency and effectiveness.
Enhance Visibility and Efficiency with Azure Monitoring Services
When it comes to optimizing your Azure monitoring strategy, tailoring it to your specific needs is key. Let's embark on this journey together:
Broadening Your Horizons with Azure Monitoring Services
Begin by integrating various Azure monitoring services to gain better visibility into your underlying resources. Azure Status offers a high-level overview of your cloud's health across different regions. Taking it a step further, Azure Service Health and Azure Resource Health provide more detailed insights.
Azure Service Health: This service offers a personalized perspective, focusing on the services and regions relevant to your organization. It provides valuable insights into affected services, outages, and planned maintenance activities.
Azure Resource Health: Dive deeper by gaining status updates specific to individual resources, such as virtual machines.
Azure Monitor's Suite of Tools
Azure Monitor boasts a suite of tools for monitoring, logging, and tracing. While the option to send AKS control plane logs or Application Gateway diagnostic logs to Azure Monitor's Log Analytics workspaces exists, it's worth noting that this comes at a significant cost, despite the robust log analysis interface it offers.
The Cost-Effective Analytics Approach
For those seeking a more cost-effective solution with analytics capabilities, integrating with another solution is the favored path. Azure Storage account serves as a basic storage solution, but it lacks advanced analytics capabilities. Therefore, considering alternative options is prudent, emphasizing both cost-effectiveness and analytics provision.
Container Insights or Affordable Alternatives
Azure presents Container Insights as an add-on that gathers logs and metrics from pods, transmitting them to analytics workspaces. However, similar capabilities can often be achieved at a more budget-friendly price point, either by leveraging a combination of tools like Prometheus, Fluent Bit, and Elasticsearch, or by exploring off-the-shelf third-party solutions such as Splunk, New Relic, or Datadog.
Empowering Infrastructure and Platform Teams
For the infrastructure team, standardizing platform monitoring tools like Prometheus with Grafana dashboards is a smart choice. Many existing Kubernetes components, such as Cert-manager, ArgoCD, and Strimzi, support Prometheus metrics, making it an ideal candidate for monitoring Kubernetes and other platform components.
Moreover, the platform team can opt to centralize logs and metrics from various Kubernetes clusters, environments, or regions. The Thanos project offers a robust solution for moving metrics to a central location, facilitating easy visualization and data management.
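As a small example of this standardisation, platform components that expose Prometheus metrics can be scraped declaratively with a ServiceMonitor, assuming the Prometheus Operator is installed; the names and labels below are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-api
  namespace: monitoring
  labels:
    release: prometheus            # must match the Prometheus serviceMonitorSelector (assumption)
spec:
  selector:
    matchLabels:
      app: web-api                 # Services carrying this label are scraped
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics                # named Service port exposing /metrics
      interval: 30s
```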
Empowering Application Teams
For application teams, the focus shifts to leveraging monitoring solutions that seamlessly integrate with applications, offering features like data visualization, distributed tracing, application performance metrics, error statistics, and more. These integrations may require minor adjustments to both applications and the platform.
By separating monitoring stacks and empowering different teams to deploy solutions tailored to their needs, you enable greater flexibility and independence, allowing each team to make the choices that best suit their requirements.
By adopting this approach, you'll optimize your Azure monitoring strategy to ensure that it aligns perfectly with your unique organizational needs and goals.
Infrastructure end-to-end testing
You should test production-ready software before it goes into production. Implement different levels of testing, such as unit, integration, and end-to-end testing, to validate application logic and behaviour at different levels. Apply the same approach to the infrastructure: it is essential to validate the configuration and working order of all components.
You can use two main types of infrastructure testing:
End-to-end testing verifies the quality of the built infrastructure and is typically performed once the entire infrastructure has been assembled. Frequently, it necessitates the creation and subsequent deletion of the complete environment post-testing.
Conformance testing centres on assessing the accuracy of the established infrastructure. It verifies the connectivity and the well-being of platform elements. This type of testing sets up the necessary resources for its purposes but refrains from altering existing applications or services. Importantly, it should be feasible to conduct this testing on the live infrastructure without causing any disturbances.
The Terratest framework provides comprehensive support for Infrastructure as Code (IaC) testing, offering versatility beyond its primary functions of Terraform and Terragrunt module testing. This tool's utility extends to validating Docker images, Packer artifacts, Kubernetes objects, and cloud services across AWS, Azure, and GCP. Its optimal use case involves serving as an end-to-end solution, encompassing the entire infrastructure lifecycle, including creation, testing, and teardown.
This testing approach is particularly valuable when scrutinising individual components, as it enables the setup of a dedicated infrastructure solely for a specific module. Furthermore, it facilitates the evaluation of various behaviours and configurations.
When integrating these tests into the CI/CD pipeline, exercise caution: running the entire suite is time-intensive. Ideally, these tests should be executed on demand (triggered by a Git pull request) or on a periodic schedule.
Comprehensive Compliance Validation
For compliance testing purposes, Sonobuoy, a VMware-maintained project, emerges as a prominent solution for conformance assessments within Kubernetes environments. Sonobuoy empowers users to craft and execute custom test cases within Kubernetes clusters.
These tests can encompass diverse scenarios, like verifying the correct operation of applications exposed via an Ingress controller, DNS domain resolution correctness, TLS certificate issuance, and proper handling of HTTP requests. These tests are scripted in Go and deployed as Docker containers, offering considerable flexibility in the scope of assessments.
In contrast to end-to-end testing, conformance testing can be conducted at any time. These tests are well-suited for validating the status of the platform and can be applied to production environments. Typically, they should be run after each cluster upgrade, but they can also be scheduled at regular intervals.
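For orientation, a Sonobuoy plugin is described by a small YAML manifest that wraps a test image; the sketch below follows the shape produced by `sonobuoy gen plugin`, with a hypothetical ingress-check image that writes JUnit results to the standard results directory.

```yaml
# plugin.yaml - hypothetical conformance check
sonobuoy-config:
  driver: Job                      # run once as a Job
  plugin-name: ingress-check
  result-format: junit
spec:
  name: plugin
  image: example.azurecr.io/ingress-check:1.0.0  # placeholder test image
  command: ["/run.sh"]             # test script writing results below
  volumeMounts:
    - mountPath: /tmp/sonobuoy/results
      name: results
```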
Managing multiple environments becomes more challenging with each new instance. As the number of clusters and teams grows, you might spend more time on maintenance than on platform enhancements.
Infrastructure as Code (IaC) plays a pivotal role in ensuring reproducibility and ease of managing infrastructure. By representing infrastructure configurations in a straightforward and legible format, it becomes feasible to create identical environments effortlessly and implement updates seamlessly. Moreover, these configurations can be reused across various environments, effectively resolving numerous challenges and offering consistent, drift-free setups.
Managing large-scale infrastructure as code with a control plane
When you integrate IaC into pipelines, it amplifies the level of flexibility. Environments can be dynamically generated, thoroughly tested, employed for experimentation, and automatically decommissioned as needed. Pipelines also prove invaluable for maintenance tasks and automated infrastructure upgrades, streamlining the support process and significantly reducing administrative workload.
However, as the number of Kubernetes clusters grows, new complexities emerge that necessitate an additional layer of abstraction known as the Control Plane. This Control Plane, akin to Kubernetes' role in managing clusters and their components, assumes responsibility for overseeing entire environments, encompassing Kubernetes clusters, network configurations, cloud resources, and more.
The Control Plane functions as a superior abstraction layer, positioned above environments and platforms, serving as a centralized hub for managing multiple environments. It continuously monitors the health of Kubernetes clusters, validating the status of Kubernetes components and external integrations utilized by the platform. Additionally, it collects performance metrics from all instances and issues alerts in the event of anomalies. These alerts are automatically relayed to the platform team and application teams, ensuring swift awareness of any issues. All tasks pertaining to the platform are orchestrated by the Control Plane.
To effectively scale the number of managed environments, a proportional increase in the number of Control Planes is imperative.
Crucially, each environment should maintain independence from any Control Plane component to minimize the potential impact of Control Plane failures. In such scenarios, only a limited number of environments would temporarily lack monitoring and management capabilities.
Nonetheless, these environments would continue to function normally while the Control Plane undergoes maintenance. The sole delay would be in automated infrastructure changes, which would resume once the Control Plane is operational. This approach empowers application and development teams to carry on their work without disruptions.
The control plane paradigm enhances the visibility and management of underlying environments, providing platform teams and developers with a powerful tool for monitoring and managing instances.
In conclusion
Creating a strong and scalable architecture involves using various tools, understanding concepts, and having different skills. Once you've done it successfully, you will be able to scale up, make sure things are available when you need them, and keep everything safe. Both developers and businesses reap the benefits of an easily manageable and adaptable infrastructure foundation.
In this journey, the importance of harnessing Kubernetes on Azure for organizational empowerment becomes clear. This dynamic combination aids in the navigation of complexity and instills confidence in tackling intricate tasks effectively.
Delivering quality in design principles further ensures that the architecture is robust, reliable, and capable of withstanding the demands of modern applications. It lays the foundation for a modular infrastructure automation with Terraform, streamlining the deployment process and enhancing efficiency.
To meet the challenges of high availability and scalability, Kubernetes networking considerations play a crucial role. By adopting GitOps deployment and implementing secure access measures, organizations can ensure a consistent and well-managed authentication and access management system.
Security and compliance remain central to every stage of the architecture, with automated management of secrets contributing to a robust and protected environment. Integrating release management practices ensures seamless updates and deployments while adhering to security measures.
Monitoring, logging, and infrastructure end-to-end testing fall into the realm of continuous improvement, allowing organizations to gain valuable insights and optimize their architecture continually. Finally, managing large-scale infrastructure with a control plane ensures smooth operations as the organization grows.
To sum it all up, the journey from complexity to confidence is a transformative one, where organizations harness the power of cutting-edge technologies and best practices to create a resilient and future-proof architecture. By adopting these methodologies and concepts, businesses can remain agile and thrive in the ever-evolving landscape of technology.