
# Design Principles

## Well-Architected Pillars

*(Well-Architected pillars diagram)*
Source: Microsoft official documentation on mission-critical workloads (https://learn.microsoft.com/en-us/azure/well-architected/mission-critical/mission-critical-design-principles)

**NOTE**: The official documentation lists the design considerations under each of the five pillars of the framework. The following tables explain the considerations that have been adopted in implementing this solution.

## 1. Design for Reliability

| Design principle | Considerations |
| --- | --- |
| Active/Active design | The system uses Azure as the secondary region to which overflow traffic is diverted. This helps address the availability requirements of the current on-premises system. All the components (public IP address, Virtual Machine Scale Sets and Azure SQL DB) have been made zone-redundant to improve the availability of the stamp within the region. This solution does not represent a true active-active model but an active/hot-standby model. The Azure region that serves as the secondary region handling the excess application traffic can also be configured as the failover target of the primary region, i.e., on-premises. |
| Blast radius reduction and fault isolation | A) Cascading failures should be avoided wherever possible. FMEA (Failure Mode and Effects Analysis) and chaos engineering can help in understanding the ways in which the system could fail. The design has to ensure that the system either degrades gracefully or handles failures through other possible means. **Note**: FMEA and chaos engineering have not been performed in this solution implementation. B) Implementation of **resiliency cloud design patterns** at the application level, such as the retry and circuit-breaker patterns, can also help reduce the blast radius. **Note**: These patterns have not been implemented in the sample app used (a minimal retry/circuit-breaker sketch is shown after this table). |
| Observe application health | Application-specific logs are collected in the regional Application Insights instances, and the infrastructure logs are collected in the regional Log Analytics workspaces. Monitoring the health of the application is implemented by using the **reliability health model**, which suggests maintaining the statuses of all the services/components in a system and deciding the health of the system based on the aggregate health of the components (a health-aggregation sketch is shown after this table). **Note**: The health model has not been implemented in this solution. The official documentation has a reference implementation for an AKS-based workload. For a workload similar to the one used in this solution, refer to https://github.com/gsriramit/AzWellArchitected-Reliability/tree/main/HealthModelSystem |
| Drive automation | As a best practice, the deployment of the stamps, regional monitoring resources and global resources is done through ARM templates. At this point a PowerShell script is used to run the deployment; the plan is to use GitHub Actions to create deployment pipelines. **Note**: Refer to the official documentation on the best practices to include *unit and performance tests* in the pipeline. This helps in understanding the performance changes in the app with each deployment. |
| Design for self-healing | Azure services are by default designed to self-heal; Azure manages the way the services operate at times of failure. E.g. 1) The Azure public load balancer is provisioned by default with multiple nodes handling customer traffic. When one or more nodes (across data centers) degrade, the monitoring platform detects the issue and *spins up additional nodes to maintain the SLA attached to the SKU of the chosen service*. 2) Based on the concept of **shared responsibility** for reliability, we are expected to design the services to be highly available. In our solution, VMSS should be configured with a minimum of 2 instances so that the platform spins up new instances if the current set of instances fails. |
| Design for scale-out | The scale unit in this solution is represented by all components in a single stamp (in the Azure region). The components in the scale unit are expected to scale together when there is an increase in the traffic handled. VMSS is configured for auto-scaling. The Azure SQL DB Business Critical tier is chosen with an appropriate SKU that supports sufficient read and write scale; however, scaling up Azure SQL DB has to be automated in response to a monitored performance bottleneck caused by Azure SQL DB. **Note**: If Cosmos DB had been the backend, then scaling the scale unit would include increasing the *request units (RUs)*. Also, if the solution had used Application Gateway instead of the Azure external load balancer, it should be configured with auto-scaling enabled. This helps maintain the HA of the entire chain, i.e., the scale unit. |
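
The retry and circuit-breaker patterns called out under "Blast radius reduction and fault isolation" were not implemented in the sample app. The following is a minimal, dependency-free Python sketch of how they could wrap an outbound call such as a database or API request; the `call_sql_db` function and the thresholds are illustrative assumptions, not part of the solution.

```python
import time
import random

class CircuitBreaker:
    """Opens after a number of consecutive failures and rejects calls
    until a cool-down period has elapsed (illustrative thresholds)."""

    def __init__(self, failure_threshold=5, reset_timeout_sec=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout_sec = reset_timeout_sec
        self.failure_count = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a trial request after the cool-down.
        return (time.monotonic() - self.opened_at) >= self.reset_timeout_sec

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()


def call_with_retry(operation, breaker, max_attempts=3, base_delay_sec=0.5):
    """Retry with exponential backoff and jitter, guarded by the breaker."""
    for attempt in range(1, max_attempts + 1):
        if not breaker.allow_request():
            raise RuntimeError("Circuit open - failing fast instead of calling the dependency")
        try:
            result = operation()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay_sec * (2 ** (attempt - 1)) + random.uniform(0, 0.1))


# Hypothetical usage around a flaky dependency call.
breaker = CircuitBreaker()

def call_sql_db():
    # Placeholder for a real query against the regional Azure SQL DB.
    raise TimeoutError("simulated transient failure")

try:
    call_with_retry(call_sql_db, breaker)
except Exception as exc:
    print(f"Request failed after retries: {exc}")
```

In a real implementation, libraries such as Polly (.NET) or tenacity (Python) provide production-grade versions of these patterns.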
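
The reliability health model mentioned under "Observe application health" aggregates per-component statuses into an overall system state. Below is a minimal sketch of that aggregation idea, assuming hypothetical component names and a simple worst-state-wins rule; the repositories referenced above contain fuller implementations.

```python
from enum import IntEnum

class HealthState(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2

# Hypothetical component statuses, e.g. derived from App Insights /
# Log Analytics queries for each service in the stamp.
component_health = {
    "public-load-balancer": HealthState.HEALTHY,
    "vmss-app-tier": HealthState.DEGRADED,   # e.g. high CPU or failed health probes
    "azure-sql-db": HealthState.HEALTHY,
}

def aggregate_health(components: dict) -> HealthState:
    """Simple worst-state-wins aggregation; weighted scoring is an alternative."""
    return max(components.values(), default=HealthState.UNHEALTHY)

overall = aggregate_health(component_health)
print(f"Stamp health: {overall.name}")  # -> DEGRADED
```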

## 2. Design for Performance Efficiency

| Design principle | Considerations |
| --- | --- |
| Design for scale-out | This has been addressed as the last consideration in the "Design for Reliability" section. |
| Automation for hyperscale | A) Scale operations in this solution are based on the health of the application. Only one App Insights metric has been used to identify the threshold of traffic that the on-premises stamp can handle; in reality, multiple metrics, or a combination of factors, could determine the time to burst to the cloud. B) Scale-out and scale-in of the Virtual Machine Scale Sets are fully automated based on the metric highlighted in the previous point. The change of the Traffic Manager endpoint weights is also automated, using a Logic App that is triggered when a threshold determined by one or more of the app metrics/health signals is crossed (a sketch of this burst decision is shown after this table). |
| Continuous validation and testing | Not performed in this implementation. |
| Reduce overhead with managed compute services | VMSS has been used to host the app component to illustrate the case of a typical on-premises workload (that has been modernized). The considerations in this context are: A) Azure SQL DB has been used for quicker implementation; this should ideally have been an Availability Group on Azure VMs that extends from on-premises to the cloud. B) If the on-premises app had been modernized to use containers, i.e., at least the stateless components (UI, API and other services), then the logical option would have been an AKS cluster as the burst destination of the app. Since Azure manages the AKS cluster, this would have made the operational part a lot easier than managing our own Kubernetes setup. |
| Baseline performance and identify bottlenecks | The baselining process has been detailed in the Design Areas page. The process should address several questions about the system. In this solution, the following have been considered: A) The peak load that the on-premises stamp can handle with its current configuration (compute and database capacity). This can be determined using a load test whose profile combines all the different user flows that the app usually has, in the right proportion; e.g., the app could receive 75% write traffic and 25% read traffic on regular days, and this profile needs to be reflected in the load tests. B) The configuration of the infrastructure that needs to be provisioned on Azure to handle the burst. This can be determined by repeating the exercise in the previous point. However, the main concept to consider is that **the baselining process should only identify the optimum capacity of the resources required to handle 60-70% of the expected traffic** (the expected traffic in this case could be 1 million requests per hour); the resources should be configured to scale out beyond that point. (Note: the scale-out point can be expressed in terms of resource usage levels or application-specific metrics.) A capacity calculation sketch is shown after this table. |
| Model capacity | This has been addressed as a part of the baselining process. |
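
The burst trigger described under "Automation for hyperscale" is implemented with a Logic App in this solution; the Python sketch below only illustrates the decision logic, assuming a hypothetical metric feed and a hypothetical `update_traffic_manager_weights` helper (the real weight change would be made against the Traffic Manager endpoints via ARM or the Azure SDK).

```python
# Illustrative thresholds - the on-premises stamp is assumed to handle
# roughly 70% of its measured capacity before bursting to Azure.
ONPREM_CAPACITY_RPS = 200          # assumed requests/sec the on-prem stamp sustains
BURST_THRESHOLD = 0.7              # burst when utilization crosses 70%

def should_burst(current_rps: float, p95_latency_ms: float) -> bool:
    """Combine more than one signal, not just raw traffic volume."""
    utilization = current_rps / ONPREM_CAPACITY_RPS
    return utilization >= BURST_THRESHOLD or p95_latency_ms > 800

def compute_weights(current_rps: float) -> tuple[int, int]:
    """Split Traffic Manager weights between the on-prem and Azure endpoints."""
    overflow_rps = max(current_rps - ONPREM_CAPACITY_RPS * BURST_THRESHOLD, 0)
    azure_share = min(overflow_rps / max(current_rps, 1), 0.9)
    azure_weight = max(int(azure_share * 100), 1)   # endpoint weights must be >= 1
    return 100 - azure_weight, azure_weight

def update_traffic_manager_weights(onprem_weight: int, azure_weight: int) -> None:
    # Hypothetical helper: in this solution the Logic App performs the
    # Traffic Manager endpoint weight update.
    print(f"on-prem={onprem_weight}, azure={azure_weight}")

# Example evaluation cycle (metrics would come from App Insights).
current_rps, p95_latency_ms = 180.0, 950.0
if should_burst(current_rps, p95_latency_ms):
    update_traffic_manager_weights(*compute_weights(current_rps))
```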
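
To make the 60-70% baselining rule under "Baseline performance and identify bottlenecks" concrete, the following sketch works through the arithmetic with assumed numbers (1 million requests/hour expected and a hypothetical per-instance throughput); the real figures come from the load tests described above.

```python
import math

EXPECTED_REQUESTS_PER_HOUR = 1_000_000   # expected peak traffic
BASELINE_FRACTION = 0.65                 # provision for ~60-70% of the peak
RPS_PER_INSTANCE = 50                    # assumed, from single-instance load tests
WRITE_RATIO, READ_RATIO = 0.75, 0.25     # load-test profile from the text

expected_rps = EXPECTED_REQUESTS_PER_HOUR / 3600   # ~278 req/s
baseline_rps = expected_rps * BASELINE_FRACTION    # ~180 req/s

# Instances provisioned up front; anything beyond this is handled by scale-out.
baseline_instances = math.ceil(baseline_rps / RPS_PER_INSTANCE)
peak_instances = math.ceil(expected_rps / RPS_PER_INSTANCE)

print(f"Load profile: {WRITE_RATIO:.0%} writes / {READ_RATIO:.0%} reads")
print(f"Provision {baseline_instances} instances for ~{baseline_rps:.0f} req/s,")
print(f"autoscale up to {peak_instances} instances for ~{expected_rps:.0f} req/s")
```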

## 3. Design for Operational Excellence

| Design principle | Considerations |
| --- | --- |
| Loosely coupled components | As this solution aims to illustrate a highly available design, application architectural aspects have not been considered. |
| Automate build and release processes | The solution does not have build and release pipelines yet. However, the approach is to have separate pipelines for infrastructure and app deployments. There is one school of thought that keeps the infrastructure and app deployments as stages in the same pipeline (reference from the Design Areas - TBD); on the contrary, having separate pipelines can help the app and cloud platform teams get the corresponding changes done through independent channels. |
| Developer agility | Not implemented. |
| Quantify operational health | Instrumentation has been added to the app instances so that the application logs are sent to the regional Application Insights instance (an instrumentation sketch is shown after this table). The infrastructure logs would also be configured (not done at this point) to be collected in the regional Log Analytics workspaces. |
| Rehearse recovery and practice failure | BC and DR capabilities are not the main intentions of this solution. However, as indicated in the Reliability section, the next logical step would be having Azure as the failover target of the on-premises stamp. This can be exercised through chaos engineering experiments in which the entire on-premises stamp goes down and failover happens to Azure. While performing these exercises, the feedback should contain data on a) whether the failover happened successfully and the system was able to maintain the availability SLO, b) whether the selected method of DR was able to satisfy the expected RPO and RTO, and c) whether the failover was completely automated or required human intervention, and if so, whether that can be avoided. |
| Embrace continuous operational improvement | The health model has not been implemented in this system. A reference to a health model implementation is provided in the Reliability section of this page, and the official documentation has a very good reference implementation too. |
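
One possible way to wire the application logs to the regional Application Insights instance, as described under "Quantify operational health", is sketched below. It assumes the OpenCensus Azure exporter (`opencensus-ext-azure`) and a placeholder connection string; the newer Azure Monitor OpenTelemetry distro is an equally valid option, and the custom dimensions shown are illustrative.

```python
import logging

# pip install opencensus-ext-azure
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("burst-app")
logger.setLevel(logging.INFO)

# Placeholder connection string - in the solution this would point to the
# regional Application Insights instance and ideally come from Key Vault.
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000"
))

# custom_dimensions show up as queryable properties on the trace in App Insights.
logger.info(
    "order processed",
    extra={"custom_dimensions": {"region": "primary-onprem", "user_flow": "checkout"}},
)
```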

## 4. Design for Security

**Note**: The considerations in the following table form a very small subset of the security measures to be implemented for any system running on the cloud.

| Design principle | Considerations |
| --- | --- |
| Monitor the security of the entire solution and plan incident responses | Per the design principles stated for mission-critical workloads, a Log Analytics workspace has been configured per regional deployment and one for all the global resources. These workspaces would serve as the base for Microsoft Sentinel. By design, each Sentinel resource operates atop one Log Analytics workspace. Having multiple Sentinel workspaces while keeping monitoring, threat hunting and incident response in a single place can be achieved through the Sentinel multiple-workspace architecture: https://learn.microsoft.com/en-us/azure/sentinel/extend-sentinel-across-workspaces-tenants#microsoft-sentinel-multiple-workspace-architecture |
| Model and test against potential threats | This solution does not implement penetration testing, threat hunting and other offensive/defensive security activities. |
| Identify and protect endpoints | The implementation of network security, among other measures, has been elaborated in the Design Areas page. |
| Protect against code level vulnerabilities | This is beyond the scope of this solution. However, the suggestions would be: A) Implement DevSecOps to shift security left and start it from the code development stage. The design for DevSecOps would include testing the code for security issues and vulnerabilities (SAST) and would extend throughout the build and release pipelines; scanning container images for security issues, as well as the running containers, becomes a necessary part of the DevSecOps design. B) Use Azure WAF to examine the incoming HTTP(S) requests and look for the OWASP Top 10 vulnerabilities. The current architecture can be modified to have Azure Application Gateway with WAF v2 handle the regional traffic (instead of the external load balancer). |
| Automate and use least privilege | The principle of least privilege based on the Zero Trust approach has been implemented in a few places in this solution. The measures include a) network security groups that permit only targeted east-west and north-south traffic, b) Bastion hosts to remote into the VMSS instances with the applicable network and identity access controls, and c) usage of private endpoints for the applicable PaaS services so that access to them over the internet is prohibited. |
| Classify and encrypt data | This solution does not implement data classification; the suggestion is to use Azure Purview to classify the data based on the security considerations. The Azure SQL DB used in this solution comes with Transparent Data Encryption enabled by default, which addresses the encryption-at-rest requirement, and all the data exchanged in the N-S and E-W flows uses TLS 1.2 to meet the data-in-transit requirement. Additionally, for a full-blown implementation, secrets, certificates and keys are to be maintained in Azure Key Vault, protected at both the identity (RBAC) and network (private endpoint) levels (a Key Vault retrieval sketch is shown after this table). |
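
To illustrate the Key Vault recommendation above, here is a minimal sketch of an app instance retrieving a secret with the Azure SDK for Python. The vault name and secret name are assumptions; access is expected to be granted through RBAC and to reach the vault over its private endpoint rather than the public internet.

```python
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Assumed vault and secret names - the VMSS instances would authenticate
# with their managed identity via DefaultAzureCredential.
VAULT_URL = "https://contoso-mc-kv.vault.azure.net"

client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())

# Resolved over the vault's private endpoint when run inside the VNet.
sql_connection_string = client.get_secret("sql-connection-string").value
```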

## 5. Design for Cost Optimization

| Design principle | Considerations |
| --- | --- |
| Calculate the trade-off of the increase in the overall cost of the solution that comes from making it highly available and secure (the basic requirements of mission-critical workloads) | 1) High-availability measures (redundant infrastructure, multi-region deployments in an active/warm-standby mode and zone-redundant instances) are implemented so as to achieve availability SLO requirements of 99.9% and higher. This causes the overall cost to go up, but the trade-off should be accepted and signed off for mission-critical workloads. 2) As is the case with reliability, making the system highly secure is also reflected in the cost of the system (ingestion of infrastructure and application logs, configuring network security including DDoS protection, and optionally Azure Firewall for egress traffic scanning, etc.). **Note:** The business should understand the *loss of business and reputation* when a critical system goes down; this should be documented as the justification for the high cost of the system. |
| Automated deployments | Use of IaC and deployment pipelines to reduce the manual work in spinning up new stamps when needed. |
| Cost governance | Cost governance measures implemented through Azure policies and/or initiatives (resource restrictions, cost-specific tags and deployment hierarchies to map costs to the appropriate cost centers). |
| Cost reviews | Proactive and reactive cost reviews to understand the current spend and make improvements as seen necessary. **Note:** The suggested best practice is to use the Advisor recommendations to understand the areas of cost optimization. |
| Cost optimization | Use the general measures to optimize the cloud spend. In this solution, **even though the infrastructure components in the line of the mission-critical workload cannot be stopped**, other components outside the flow of the app, if any, can be stopped and started through automation (a start/stop sketch is shown after this table). Serverless components can be used wherever possible; this solution uses Azure Logic Apps to execute the automation sequence, and they run on the Consumption tier to reduce costs. |
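
As a sketch of the start/stop automation mentioned above, the snippet below deallocates (and later restarts) VMs that sit outside the mission-critical request path, using the Azure SDK for Python. The subscription ID, resource group and VM names are assumptions, and the same idea could equally be driven from the Logic App or an Automation runbook.

```python
# pip install azure-identity azure-mgmt-compute
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Assumed identifiers - only non-critical, out-of-band components belong here.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-burst-tooling"
NON_CRITICAL_VMS = ["vm-build-agent", "vm-reporting"]

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def stop_non_critical():
    """Deallocate to stop compute billing (disks and public IPs may still incur cost)."""
    for vm in NON_CRITICAL_VMS:
        compute.virtual_machines.begin_deallocate(RESOURCE_GROUP, vm).wait()

def start_non_critical():
    for vm in NON_CRITICAL_VMS:
        compute.virtual_machines.begin_start(RESOURCE_GROUP, vm).wait()

if __name__ == "__main__":
    stop_non_critical()   # e.g. scheduled at the end of business hours
```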