We live in a globalized world where communication plays an important role and almost everything is connected to the internet.
Telecommunication companies must provide nonstop, 24/7 services. In recent times, we have seen working from home adopted on a massive scale, and customer support teams encouraging people to rely more on online, self-service channels.
AWS provides several tools that help make services resilient. But first, we need to understand some core concepts: Horizontal Scaling, Vertical Scaling, Availability Zones, Resources, and Services.
Horizontal Scaling vs. Vertical Scaling
Vertical scaling means giving a single instance more resources (CPU, memory, disk), while horizontal scaling means adding or removing instances. AWS provides monitoring tools and scaling policies that show how much of its hardware resources an instance is using, and let you decide what to do when that usage rises or falls over time. Always monitor the service’s state with metrics such as processor usage, memory usage, and available disk space.
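As a rough illustration of the kind of decision a monitoring policy encodes, here is a minimal sketch in plain Python. It does not call any AWS API: the `Metrics` class, the `scaling_decision` function, and the 80%/20% thresholds are hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """A snapshot of one instance's resource usage, as percentages (0-100)."""
    cpu: float
    memory: float
    disk: float


def scaling_decision(m: Metrics, high: float = 80.0, low: float = 20.0) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' for one metrics snapshot.

    The thresholds are illustrative assumptions, not AWS defaults.
    """
    busiest = max(m.cpu, m.memory, m.disk)
    if busiest >= high:
        return "scale_out"  # at least one resource is close to exhaustion
    if busiest <= low:
        return "scale_in"   # every monitored resource is mostly idle
    return "hold"


print(scaling_decision(Metrics(cpu=91.0, memory=40.0, disk=55.0)))  # scale_out
print(scaling_decision(Metrics(cpu=10.0, memory=15.0, disk=12.0)))  # scale_in
```

In a horizontal setup, "scale_out" would translate into adding instances; in a vertical one, into moving to a bigger instance type.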
Regions and Zones[1]:
“AWS Instances are hosted in multiple locations worldwide. These locations are composed of Regions, Availability Zones, Local Zones, and Wavelength Zones. Each Region is a separate geographic area.”
Availability Zones are isolated physical locations within a Region, each backed by one or more data centers. This matters when deploying services on the platform if we want reliability: the instances behind a service should not all sit in a single zone. It’s also considered best practice to place instances in the Region closest to the clients or end users who will consume the service.
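The idea of not keeping every instance in one zone can be sketched as a simple round-robin placement. This is only a model in plain Python: `spread_across_zones` is a hypothetical helper, and the instance names and zone names are example values.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign instances to zones round-robin so no single zone holds them all."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return {instance_id: next(zone_cycle) for instance_id in instance_ids}


placement = spread_across_zones(
    ["web-1", "web-2", "web-3", "web-4"],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
print(placement)
print(Counter(placement.values()))  # how many instances landed in each zone
```

With four instances and three zones, losing any one zone still leaves at least two instances running.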
To eliminate single points of failure (SPOFs), we can use Active Redundancy or Standby Redundancy. In the Active Redundancy pattern, a load balancer sits in front of two running instances in different Availability Zones.
In Standby Redundancy, on the other hand, there are two instances of the same service, but only one is running. When the primary instance goes offline, the system starts the standby instance in another Availability Zone to maintain the system’s desired state.
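The two redundancy patterns can be compared with a small in-memory model. This is a sketch under stated assumptions, not how an AWS load balancer is implemented: `Instance`, `ActiveRedundancy`, and `StandbyRedundancy` are hypothetical classes, and health is just a boolean flag set by hand.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    name: str
    zone: str
    running: bool = True
    healthy: bool = True


class ActiveRedundancy:
    """A load balancer in front of instances that are all running."""

    def __init__(self, instances):
        self.instances = instances
        self._turn = count()

    def route(self):
        # Only running, healthy instances receive traffic (round-robin).
        candidates = [i for i in self.instances if i.running and i.healthy]
        if not candidates:
            raise RuntimeError("no healthy instance available")
        return candidates[next(self._turn) % len(candidates)]


class StandbyRedundancy:
    """One running primary plus a stopped standby in another zone."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def route(self):
        if not (self.primary.running and self.primary.healthy):
            # Fail over: start the standby and promote it to primary.
            self.standby.running = True
            self.primary, self.standby = self.standby, self.primary
        return self.primary


a = Instance("app-a", "zone-1")
b = Instance("app-b", "zone-2")
lb = ActiveRedundancy([a, b])
print(lb.route().name, lb.route().name)  # app-a app-b
a.healthy = False
print(lb.route().name)                   # app-b

primary = Instance("db-primary", "zone-1")
standby = Instance("db-standby", "zone-2", running=False)
pair = StandbyRedundancy(primary, standby)
print(pair.route().name)                 # db-primary
primary.healthy = False
print(pair.route().name)                 # db-standby
```

The trade-off shows up in the model: active redundancy pays for two running instances all the time, while standby redundancy pays only for one but needs a start-up step during failover.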
You can also apply architecture patterns to ensure reliability across Regions. Some of the usual patterns are the “Pilot Light” architecture, the “Active-Active” configuration (DNS-level intelligence runs health checks and assigns each user a cookie that redirects them to the correct or nearest Region, based on latency or on the health check itself), and the “Active/Standby” configuration[5].
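The routing decision in an Active-Active configuration can be sketched as "pick the healthy Region with the lowest latency." The function below is a hypothetical helper in plain Python, and the latency figures and health flags are made-up example data, not measurements.

```python
def pick_region(regions):
    """Return the name of the healthy region with the lowest latency."""
    healthy = [r for r in regions if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r["latency_ms"])["name"]


# Example data only: latencies as seen from one hypothetical client.
regions = [
    {"name": "us-east-1", "latency_ms": 120, "healthy": True},
    {"name": "sa-east-1", "latency_ms": 35, "healthy": False},
    {"name": "eu-west-1", "latency_ms": 210, "healthy": True},
]
print(pick_region(regions))  # us-east-1
```

Here the nearest Region is skipped because it fails its health check, so the client is sent to the next-best healthy one.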
Resources and Services[3]:
In AWS terminology, there are different kinds of resources: Instances (one instance represents a virtual server), Security groups (virtual firewall rules, comparable to an access control list, or ACL), Volumes (a disk or partition), Key pairs, Load balancers, Snapshots, and Elastic IPs. In simple words, a resource is any piece of infrastructure that is created to deploy a service.
A service can have multiple resources. To achieve loose coupling, it is highly recommended to implement microservices using queues, topics, and a distributed cache. If you cannot adopt this scheme, AWS supports hybrid networking with legacy systems over a VPN, so you can keep “on-premises” infrastructure alongside some services in the cloud (perhaps using a bootstrapping script to set the initial configuration of the legacy system).
To keep services correct between upgrades and minimize the risk of errors, we should consider several deployment strategies: canary, blue/green, in-place, and rolling.
Canary deployment: a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody[3]. When a new feature is ready to go to production, this is a useful way to evaluate users’ reaction to it, because only a small portion of users notice the change at first.
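One common way to select the "small subset of users" is to hash each user ID into a stable bucket, so the same user always sees the same version. The sketch below assumes that approach; `bucket`, `choose_version`, and the 5% figure are illustrative choices, not something prescribed by AWS.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the canary, the rest to stable."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{n}" for n in range(1000)]
share = sum(choose_version(u, 5) == "canary" for u in users) / len(users)
print(f"canary share: {share:.1%}")
```

Because the bucket depends only on the user ID, raising `canary_percent` from 5 to 20 keeps the original canary users on the new version and simply adds more.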
Blue/green strategy: creates two independent infrastructure environments. The first one, “blue,” contains the older (current) version of the application or component, while “green” contains the newest version. Once green passes its health checks, traffic is shifted to it by pointing the DNS record at green’s load balancer using Route 53. The advantage of this strategy is that rollback is quicker than with the other strategies. Its drawback is the cost of running two infrastructure environments simultaneously.
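The cut-over and rollback logic can be modeled in a few lines. This sketch does not touch Route 53: `BlueGreenRouter` is a hypothetical class that stands in for the DNS record, and the health-check result is passed in as a plain boolean.

```python
class BlueGreenRouter:
    """Tracks which environment currently receives production traffic."""

    def __init__(self):
        self.live = "blue"

    def cut_over(self, green_is_healthy: bool) -> str:
        # Traffic moves to green only if its health check has passed.
        if green_is_healthy:
            self.live = "green"
        return self.live

    def roll_back(self) -> str:
        # Blue is still running, so rollback is a single switch.
        self.live = "blue"
        return self.live


router = BlueGreenRouter()
print(router.cut_over(green_is_healthy=False))  # blue
print(router.cut_over(green_is_healthy=True))   # green
print(router.roll_back())                       # blue
```

The quick rollback comes from the fact that nothing in blue is modified during the deployment.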
In-place strategy: this is a cheaper strategy, but it involves service downtime because the new version is deployed onto the current infrastructure. In simple words, we shut down the application running on each resource before installing the new code there. Once the new version has been deployed onto every infrastructure resource, the deployment is complete. Another drawback of this strategy is its slower rollback.
Rolling strategy: this kind of deployment gradually replaces all elements of the infrastructure. It is similar to blue/green: resources are taken offline and, one by one, replaced with new resources running the latest version of the code. The difference from blue/green is that the old and new code share the same network infrastructure or environment. It shares a drawback with the in-place strategy: only a limited number of resources can be worked on at a time to prevent downtime, and rollback can be slower and riskier.
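A rolling deployment can be sketched as a loop that replaces the fleet in small batches, so most resources keep serving traffic at every step. The function name `rolling_deploy`, the fleet layout (a dict of resource name to version), and the batch size are assumptions made for this example.

```python
def rolling_deploy(fleet, new_version, batch_size=1):
    """Replace the fleet batch by batch and return a snapshot after each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    names = list(fleet)
    snapshots = []
    for start in range(0, len(names), batch_size):
        # Only this batch is offline; the rest of the fleet keeps serving.
        for name in names[start:start + batch_size]:
            fleet[name] = new_version
        snapshots.append(dict(fleet))
    return snapshots


fleet = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}
for step in rolling_deploy(fleet, "v2", batch_size=1):
    print(step)
```

The intermediate snapshots show old and new versions running side by side, which is why a rollback in the middle of a rolling deployment is slower and riskier than a blue/green switch.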
This was a summary of the methods, patterns, and strategies you can use to achieve high availability by applying the operational excellence and reliability pillars. As a result, application outages can be minimized while deployments are automated and NOC[4] personnel get the tools to monitor and maintain the services 24/7.
References
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2] regions reference AWS patterns
[3] Paul M. Duvall, Steve Matyas, Andrew Glover. Continuous Integration: Improving Software Quality and Reducing Risk. https://books.google.com.ar/books?id=PV9qfEdv9L0C
[4] NOC: Network Operations Center.
[5] https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf