Author – Prashant Bharadwaj, Cloud Engineer at GreatLearning
This is it; you have done your homework, followed the recommended best practices, listened to the experts, and now your entire infrastructure has been successfully migrated to the cloud. Congratulations!
But is this where the work ends? Absolutely not! Just as an on-prem setup requires continuous monitoring to keep the wheels turning, so does a cloud-based setup.
Welcome to the world of Cloud Operations, or CloudOps.
By its very nature, the cloud adds an extra layer of complexity to CloudOps compared with traditional IT operations. The main challenges are as follows.
Servers are virtual: In a traditional IT setup, physical access to the data centre is vital to troubleshooting and fixing common issues with the hardware servers. However, in a cloud environment, any and all physical troubleshooting will be performed by the cloud vendor itself. For other tasks, you will have to rely on the set of cloud-native tools and interfaces provided by the vendor to manage the servers.
Scalability: Unlike physical servers, which have a fixed capacity, cloud infrastructure can scale up or down based on the needs of its consumers. Extra care must be taken to ensure that scaling is performed in a controlled manner so that resources are not wasted.
Be proactive, not reactive: If a customer has to face a fault in your cloud deployments, you have already failed. Monitoring tools should be used to track and analyse infrastructure metrics, spot any underlying issues, and correct them before they propagate to the customer. These activities should typically be automated for maximum efficiency.
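To make the idea of controlled scaling concrete, here is a minimal sketch in Python. It assumes a proportional rule of the kind many autoscalers use, where the fleet grows or shrinks according to how far the observed load is from a target. The function name, the 60% CPU target, and the minimum and maximum instance counts are all hypothetical values chosen for illustration, not settings from any particular cloud vendor.

```python
import math


def desired_instances(current, cpu_percent, target_percent=60.0,
                      min_instances=2, max_instances=10):
    """Return a bounded instance count for the observed average CPU load.

    All defaults are illustrative assumptions, not vendor recommendations.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")

    # Proportional rule: scale the fleet by how far the load is from the target.
    proposed = math.ceil(current * cpu_percent / target_percent)

    # Clamp the result so a metric spike cannot provision unbounded capacity
    # and a quiet period cannot scale the service down to nothing.
    return max(min_instances, min(max_instances, proposed))


if __name__ == "__main__":
    print(desired_instances(current=4, cpu_percent=90.0))
```

The upper bound is what keeps a sudden spike from turning into wasted spend, while the lower bound keeps a baseline of capacity available.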
Now that we understand the technical challenges behind CloudOps practices, let us have a look at the principles behind these practices which can help mitigate these issues.
Abstraction: Cloud deployments are typically distributed and stateless, and can span a large number of resources deployed all over the world. These resources need to be placed behind a layer of management tools that abstracts the process of monitoring them. This layer usually consists of cloud management tools, which let you provision and delete resources, and monitoring tools, which let you track resource performance so that you can spot and rectify problems preemptively.
Policy Enforcement: Cloud security needs to be taken very seriously to prevent unauthorised use of and access to resources. To enforce this, strict policies need to be put in place to limit the activities that users and applications can perform in the cloud. Every security policy should follow the “principle of least privilege”.
Automation: Routine tasks, such as security management, fault correction, and resource deployment, should typically be automated. The reason is that humans cannot manually manage and react to events at the scale of the cloud. Automating these tasks will reduce the element of human error and improve response times.
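The principle of least privilege lends itself to a small automated check. The sketch below is an assumption-laden illustration rather than any vendor's policy engine: the role names, the `service:action` permission strings, and both helper functions are hypothetical. It shows two ideas, denying anything that has not been granted explicitly, and flagging roles whose grants contain wildcards.

```python
# Hypothetical role-to-permission mapping; all names are illustrative only.
POLICIES = {
    "backup-job": {"storage:read", "snapshot:create"},
    "web-app": {"storage:read"},
}


def is_allowed(role, action, policies=POLICIES):
    """Default deny: permit an action only if it was granted explicitly."""
    return action in policies.get(role, set())


def overly_broad_roles(policies):
    """Return the roles that hold wildcard grants, sorted by name."""
    return sorted(
        role
        for role, actions in policies.items()
        if any("*" in action for action in actions)
    )


if __name__ == "__main__":
    print(is_allowed("web-app", "storage:read"))
    print(is_allowed("web-app", "snapshot:create"))
    print(overly_broad_roles({"admin": {"*"}, "web-app": {"storage:read"}}))
```

A check like `overly_broad_roles` is the sort of routine task that could run automatically on every policy change instead of relying on manual review.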
Responsibilities of a CloudOps engineer
Out-of-box deployments: A CloudOps engineer should be able to provide deployment environments for varying kinds of applications as per customer requirements. For example, if a user wishes to deploy a Java application, then a complete deployment environment with all the required dependencies, such as Maven or Spring, should be made available to them.
Resource allocation: A CloudOps engineer should know how to deploy resources, such as servers or network components, when required. Typically, there will be a collection of Infrastructure as Code (IaC) scripts that can be run at a moment’s notice to provide the required resources.
Performance: Checking KPIs on the computing, network, or storage resources being run is a key responsibility of a CloudOps engineer. Metrics such as database read/write speeds and CPU performance should be within the limits guaranteed by the cloud provider’s SLAs. This also extends to checking application health and taking corrective measures if required.
Compliance: Your organisation’s host country or your clients’ host countries might have compliance laws in place relating to the deployment and usage of resources. You, as a CloudOps engineer, must be familiar with these laws and regulatory frameworks and ensure that your cloud infrastructure meets those requirements. Failure to comply might result in heavy legal penalties for you and your organisation.
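The performance responsibility above, checking KPIs against agreed limits, can be sketched as a small Python routine. This is only an illustration under stated assumptions: the metric names, the threshold values, and the `KpiLimit` type are hypothetical, and real limits would come from your provider’s SLA documents and your monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KpiLimit:
    """Acceptable range for one metric; either bound may be left open."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def kpi_breaches(observed, limits):
    """Return the names of metrics that are missing or outside their limits."""
    breaches = []
    for name, limit in limits.items():
        value = observed.get(name)
        if value is None:
            # A metric with no data is reported too, since it cannot be verified.
            breaches.append(name)
            continue
        too_low = limit.minimum is not None and value < limit.minimum
        too_high = limit.maximum is not None and value > limit.maximum
        if too_low or too_high:
            breaches.append(name)
    return breaches


if __name__ == "__main__":
    # Hypothetical limits and readings, for illustration only.
    limits = {
        "db_read_mbps": KpiLimit(minimum=100.0),
        "cpu_percent": KpiLimit(maximum=85.0),
    }
    print(kpi_breaches({"db_read_mbps": 80.0, "cpu_percent": 40.0}, limits))
```

Treating a missing metric as a breach is a deliberate design choice here: a KPI that cannot be measured cannot be shown to meet its limit.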
Based on the above responsibilities, we can now talk about some job roles that are relevant to this domain.
Cloud Operations Engineer: In this role, you will be responsible for managing all aspects of a data centre, including servers, storage, and supporting systems. You will require knowledge of cloud infrastructure components such as VPCs, load balancers, WAF, and Route 53.
Operations Support Engineer: You will be in charge of maintaining databases and migrating them from on-premises to cloud setups. In addition to having a basic understanding of SQL Server DDL and DML commands, you will also need to be familiar with the IaaS and PaaS services associated with Azure, including Azure VMs, Cosmos DB, Azure Functions, etc. You will also require skills in infrastructure automation using Terraform and CloudFormation.
Senior Software Engineer – CloudOps: You will be providing operational support for provisioning and maintenance of the cloud infrastructure, along with technical support to the organisation and associated business partners. IaC is an important component of this role, so you will need to be proficient in Terraform and CloudFormation. You will also be responsible for maintaining container platforms, so knowledge of Docker and Kubernetes will be an advantage.
Senior Cloud Operations Analyst: This job role will primarily focus on application deployment in the cloud. You will need strong proficiency in IaaS components, such as VMs, virtual networks, and storage services. In addition, you will be required to maintain backups and snapshots of resources, as well as set up security policies for users using Azure Security Center.
Cloud Network Operations: You will be required to perform activities related to network management, as well as capacity and cost optimisation for existing infrastructure. This will require a strong understanding of networking services such as VPCs and subnets, load balancers and target groups, Route 53, and WAF.
As we have seen, CloudOps is a different beast altogether! It is an area of continuous operations and continuous improvement. These operations may seem like an afterthought, but their proper implementation is the key to success for you and your organisation.