A Practical Guide to Adopting Public Cloud Service Level Agreements

By Steve Sherwood, Cloud Practice Lead, Business Aspect

Public cloud offers a vast and expanding array of pay-as-you-go services, from basic compute and storage to machine learning and advanced analytics, and even a customer contact centre. More and more organisations are recognising the flexibility of public cloud and are considering how to leverage these services. Service performance and availability are key, so it’s not surprising that I have had a number of conversations with clients recently regarding public cloud service level agreements (SLA’s). What was surprising was that each conversation was unique: one client was seeking selection and adoption advice, another questioned the cost versus benefit, and another challenged their relevance entirely.

In this article I address some of the concerns that surround cloud-based SLA’s, offer practical advice for determining which SLA’s deserve organisational resources, and suggest when you can set them aside.

My approach will not suit every organisation. Naturally, compliance with internal and external policy requirements will take precedence. My aim is to get you thinking about public cloud SLA’s earlier in the planning process and to provide guidance on assessing their value.

This article discusses SLA’s offered by Microsoft Azure, Amazon Web Services (AWS) and Google Cloud Platform (GCP). Whilst SLA’s are also relevant to Software-as-a-Service (SaaS), SaaS SLA’s are as varied as the SaaS products and providers in the market, and are beyond the scope of this article.

Introducing Public Cloud SLA’s

It is apparent to me that many organisations struggle to recognise or appreciate public cloud service level agreements. This may result from not identifying them as a consideration in operational planning, or perhaps from not realising they are fundamentally different to traditional SLA’s, which are negotiated during contract creation. Public cloud SLA’s are globally standardised ‘one size fits all’ agreements, and therefore very difficult to negotiate to suit individual customer requirements.

Characteristics of Public Cloud SLA’s:

  • SLA’s are a global, one-size-fits-all, non-negotiable service level commitment.
  • The burden of proof when making a claim rests with you, the customer.
  • They provide relatively poor credits when service levels are not met.
  • There are many exclusions, which vary between cloud service providers.
  • The SLA may not apply if certain customer minimum obligations are not met, e.g. implementing a minimum number of virtual machines in an availability set.

It should come as no surprise that the Cloud Service Providers (CSPs) will generally only offer to meet service levels they know to be well within their capability.

So is there any value in public cloud SLA’s at all? Well yes, but you need to consider their relevance and value to your organisation differently to traditional SLA’s that incentivised supplier performance (with a carrot or a stick).

Locating the SLA’s

CSP SLA’s are readily accessible on their respective websites. Azure SLA’s are here, AWS SLA’s are here, and if you’re considering Google Cloud when it enters Australia later in 2017, then look here. SLA’s are relevant to any cloud service model, i.e. Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS); however, CSP’s do not necessarily provide an SLA for every service or product they offer. If you compare the number of SLA’s provided by each CSP, you might be surprised by the variation: Microsoft provides the most, AWS the fewest, and Google sits in the middle. How much does this ‘SLA count’ matter? In fact, not much.

In addition to individual SLA’s, you should also review the CSP’s ‘Terms of Service’ as general terms are likely to take precedence over individual SLA terms. Azure terms are here, AWS are here, and GCP are here.

Inside the SLA’s – What to Consider

Generally, each SLA is structured around the following areas. Examine these carefully, as purposefully placed terms such as ‘aggregated’, ‘contiguous’, or ‘total’ can make a significant difference to how the CSP calculates the availability of the service.

  • What is Covered, e.g. all or part of the service, described in more precise terminology.
  • What is Excluded, e.g. customer activities.
  • Definition of Downtime, i.e. the criteria that must be met for the service to be considered unavailable.
  • Calculation of Uptime, e.g. 100% minus downtime.
  • Calculation of service Credits when service levels are not met, e.g. 10% of the monthly spend.
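
To make the uptime and credit calculations concrete, here is a minimal Python sketch. The credit tiers and the monthly spend are hypothetical figures chosen for illustration only; each CSP publishes its own thresholds and percentages per service, so check the relevant SLA before relying on any number here.

```python
# Hypothetical credit tiers: (uptime below this %, credit as % of monthly spend).
# Ordered from most to least severe. Real tiers differ per provider and service.
CREDIT_TIERS = [(99.0, 25.0), (99.95, 10.0)]


def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Monthly uptime as 100% minus the share of minutes counted as downtime."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def service_credit(uptime: float, monthly_spend: float) -> float:
    """Return the credit owed under the first (most severe) tier that applies."""
    for threshold, credit_pct in CREDIT_TIERS:
        if uptime < threshold:
            return monthly_spend * credit_pct / 100.0
    return 0.0


minutes_in_month = 30 * 24 * 60  # a 30-day month
uptime = uptime_percent(minutes_in_month, downtime_minutes=60)
print(f"uptime {uptime:.2f}%, credit ${service_credit(uptime, 200.0):.2f}")
```

With these assumed tiers, a 60-minute outage in a 30-day month gives roughly 99.86% uptime, and on a $200 monthly spend the credit would be $20.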

It’s important to truly understand each area within the SLA. Consider Google Cloud’s SLA for its Compute Engine as an example. Downtime is defined as the “Loss of external connectivity or persistent disk access for all running Instances that are hosted across two or more zones in the same region combined with the inability to launch replacement Instances in any zone in that region.” This describes a scenario far worse than you may anticipate; “Loss of external connectivity” to “all running instances” across “two or more zones”, and “the inability to launch replacement instances”. That’s not good, and when all engines on the plane have stopped, claiming a credit won’t be top of mind.
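
Expressing the quoted definition as a condition shows how narrow it is. The sketch below is one reading of the quoted wording only, not Google’s measurement method; the ZoneStatus class and its fields are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ZoneStatus:
    """Hypothetical snapshot of one zone in a region."""
    name: str
    running_instances: int
    reachable_instances: int      # instances with external connectivity and disk access
    can_launch_replacement: bool


def counts_as_downtime(zones: List[ZoneStatus]) -> bool:
    """Apply the three conditions in the quoted definition, all of which must hold."""
    hosting = [z for z in zones if z.running_instances > 0]
    # Instances must be hosted across two or more zones in the region.
    if len(hosting) < 2:
        return False
    # Connectivity or disk access must be lost for ALL running instances.
    all_lost = all(z.reachable_instances == 0 for z in hosting)
    # Replacement instances cannot be launched in ANY zone of the region.
    none_launchable = not any(z.can_launch_replacement for z in zones)
    return all_lost and none_launchable


region = [
    ZoneStatus("zone-a", running_instances=3, reachable_instances=0, can_launch_replacement=False),
    ZoneStatus("zone-b", running_instances=2, reachable_instances=1, can_launch_replacement=False),
]
print(counts_as_downtime(region))  # one instance is still reachable
```

Under this reading, a single reachable instance, or a single zone that can still launch replacements, is enough for the period not to count as downtime.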

The basic SLA structure is where the similarities end, as each CSP takes a different approach to the offer and design of its SLA’s. Azure offers an SLA for almost every service, and while this may be attractive to organisations that perceive public cloud SLA’s to drive service quality, you’ll come to realise the SLA’s can do little to incentivise CSP service availability. On the other hand, AWS offers comparatively few SLA’s, but AWS demonstrates its commitment to service availability by hosting its online shopping empire Amazon.com on AWS, using the same platform and services as your business. You can imagine the lost revenue if Amazon.com were offline for only minutes, even seconds!

Perhaps the one shared incentive across all CSP’s, and the biggest incentive overall in a highly competitive market, is to not be in the news for the wrong reasons. Lost credibility in maintaining availability, integrity or confidentiality could be terminal.

Pragmatic Steps to SLA Adoption

The following steps will help you determine which SLA’s matter. You might consider it as an initial framework where steps can be added or modified, initially or over time, determined by your unique circumstances and cloud maturity. By adopting an objective approach, you will have a justified and auditable decision process.

Step 1 – Determine SLA Relevance

For all SLA’s offered by the CSP:

  • Discard the SLA’s for services you do not use.
  • Discard the SLA’s for services used but considered low risk (e.g. services just in test/dev).

Step 2 – Assess SLA Monitoring Complexity

As I mentioned earlier, one of the characteristics of cloud SLA’s is that you carry the burden of proof when making a claim. In other words, you need to substantiate the claim with evidence. This requires that you monitor, log and alert on the specific service components identified in the SLA that determine its downtime. This can be a considerable undertaking. For example, the Azure Storage SLA includes eight transaction types (e.g. PUT, GET, COPY), each with criteria that describe the maximum processing time threshold, along with four exclusions.

At this point you are not considering the cost of monitoring (i.e. cost/benefit). It may be that you simply do not have the tools, capability or expertise to monitor it. If you cannot adequately monitor the service, you will not have the evidence to substantiate a claim.

  • Discard the SLA’s you cannot monitor.

You may revisit this step as organisational capability matures.
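
As a rough illustration of what collecting evidence involves, the sketch below filters a transaction log down to the entries that would count against a service level. The thresholds, the transaction types shown, and the Transaction fields are all hypothetical; the real criteria and exclusions are set out in each SLA.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical maximum processing times in seconds, per transaction type.
MAX_SECONDS = {"PUT": 2.0, "GET": 2.0, "COPY": 90.0}


@dataclass
class Transaction:
    timestamp: str
    kind: str
    seconds: float
    succeeded: bool
    excluded: bool = False  # True when an SLA exclusion applies to this entry


def counts_against_sla(tx: Transaction) -> bool:
    """A transaction counts if it is covered, not excluded, and failed or ran too long."""
    if tx.excluded or tx.kind not in MAX_SECONDS:
        return False
    return (not tx.succeeded) or tx.seconds > MAX_SECONDS[tx.kind]


def collect_evidence(log: List[Transaction]) -> List[Transaction]:
    """Keep only the entries you could cite when substantiating a claim."""
    return [tx for tx in log if counts_against_sla(tx)]


log = [
    Transaction("2017-06-01T10:00:00Z", "GET", 0.4, True),
    Transaction("2017-06-01T10:00:05Z", "GET", 3.1, True),        # too slow
    Transaction("2017-06-01T10:00:09Z", "PUT", 0.2, False),       # failed
    Transaction("2017-06-01T10:00:12Z", "PUT", 5.0, False, True), # excluded
]
for tx in collect_evidence(log):
    print(tx.timestamp, tx.kind, tx.seconds, tx.succeeded)
```

Even in this simplified form, you need timestamps, durations, outcomes and exclusion flags for every transaction, which gives a sense of the monitoring capability required.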

Step 3 – Evaluate SLA Benefit

This considers cost vs. benefit. At this step you have identified a relevant SLA that you are able to monitor and log to support a claim, but is the effort worth it? This is a commercial decision comparing the cost of meeting the SLA requirements to the SLA benefit from a successful claim. It’s worth revisiting this over time as the equation changes with improved toolsets and increased automation reducing the resource effort and cost.

  • Determine the actual or estimated monthly cost of the service.
  • Based on the SLA credit calculation, predict the likely benefit. Model a worst-case scenario. With the low price of many cloud services you might be surprised how small credits are.
  • Compare the potential resource effort to potential SLA credit.
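
The comparison in the three points above can be sketched in a few lines of Python. Every figure here (service cost, maximum credit percentage, hours and hourly rate) is a hypothetical placeholder; substitute your own costs and the credit terms from the relevant SLA.

```python
def worst_case_credit(monthly_service_cost: float, max_credit_pct: float) -> float:
    """Largest credit obtainable in a month, as a share of the service cost."""
    return monthly_service_cost * max_credit_pct / 100.0


def monthly_effort_cost(hours: float, hourly_rate: float, tooling: float = 0.0) -> float:
    """Cost of monitoring, logging and preparing claims for this SLA."""
    return hours * hourly_rate + tooling


def claim_is_worthwhile(credit: float, effort: float) -> bool:
    """Purely financial test: the credit must exceed the effort spent to earn it."""
    return credit > effort


credit = worst_case_credit(monthly_service_cost=400.0, max_credit_pct=25.0)
effort = monthly_effort_cost(hours=4, hourly_rate=90.0)
print(f"credit ${credit:.2f} vs effort ${effort:.2f}: "
      f"worthwhile={claim_is_worthwhile(credit, effort)}")
```

In this assumed example, even the worst-case credit of $100 is well below the $360 of monthly effort, so the SLA would not pass this step on financial grounds alone.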

Result

What you’re left with is a selection of relevant SLA’s that you are able to monitor effectively and efficiently, and that return a worthwhile benefit. In addition, you have an auditable decision process for excluding certain SLA’s. It is entirely possible that you are left with no SLA’s. If so, don’t be alarmed: you have the reasons why.
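
The three steps can be read as a simple filter that records a reason for every decision. The following Python sketch is one possible way to capture that; the SlaCandidate fields and the example services are hypothetical, and your own process may add or reorder steps.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SlaCandidate:
    service: str
    in_use: bool
    low_risk: bool
    can_monitor: bool
    worst_case_credit: float
    monthly_effort_cost: float


def discard_reason(sla: SlaCandidate) -> Optional[str]:
    """Return why an SLA is discarded, or None if it survives all three steps."""
    if not sla.in_use:
        return "Step 1: service not used"
    if sla.low_risk:
        return "Step 1: service considered low risk"
    if not sla.can_monitor:
        return "Step 2: cannot monitor"
    if sla.worst_case_credit <= sla.monthly_effort_cost:
        return "Step 3: credit does not justify effort"
    return None


def triage(candidates: List[SlaCandidate]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return the kept service names and an audit trail of every decision."""
    kept, audit = [], []
    for sla in candidates:
        reason = discard_reason(sla)
        audit.append((sla.service, reason or "kept"))
        if reason is None:
            kept.append(sla.service)
    return kept, audit


kept, audit = triage([
    SlaCandidate("queue", False, False, True, 50.0, 10.0),
    SlaCandidate("dev-vm", True, True, True, 50.0, 10.0),
    SlaCandidate("storage", True, False, False, 50.0, 10.0),
    SlaCandidate("database", True, False, True, 500.0, 120.0),
])
for service, decision in audit:
    print(f"{service}: {decision}")
```

The audit list is the useful output: it documents why each SLA was kept or set aside, and an empty kept list is a legitimate result.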

Key Takeaways

  • Understand public cloud SLA’s, but do not rely on them.
  • Understand any customer obligations so as to not void the SLA.
  • Develop an objective decision process to identify which SLA’s matter, which don’t, and why.
  • Architect the cloud solution around the SLA requirements and limitations, i.e. architect for availability, architect for cost. There are many whitepapers and design patterns to help.
  • If financial compensation is important, consider cyber risk insurance.
