SLOs should be easy, say hi to Sloth

As in other fields, every year the technology world has a few buzzwords that get repeated more than others. Some examples:
- 2017: Cryptocurrency and blockchain
- 2018: Observability and tracing
- 2019: Service mesh
- 2020: GitOps
And this year, the fancy word is SLO.
I’m sure you have been hearing about service level objectives lately (and not only because of SLOConf). SLOs aren’t new, and if you are reading this, most likely you are already familiar with the SLI, SLO, SLA, and error budget lingo.
Buzzword aside, SLOs are awesome. If you are not using them yet, explore them if you can; I’m sure you will get value from them.
Next I’ll talk about how I ended up implementing Sloth, so feel free to skip directly to the Sloth section if you want.
A long long time ago…
I don’t know exactly when SLOs were created, but I was introduced to them when I became fascinated by SRE culture: I read about them for the first time in chapter 4 of the SRE book, and some time later continued with the SRE workbook.
For me, SLOs are not (only) a tool or framework but a way of thinking about and applying observability (Ding! 2018), a language… in a word, culture. However, every time I have tried explaining them to friends, family, or coworkers across multiple companies, the same thing happens:
Everyone understands SLOs, the terms, and the purpose… but many struggle to apply them, to understand the process, or to put them into practice.
(2018–2019) Service Level Operator
It was 2018 and I was working at Spotahome as a platform engineer/SRE/someone who writes code and does ops. We wanted to start applying SLOs in the company. There wasn’t much tooling out there, and we wanted to make it easy so people didn’t need to think about all of the existing SLO terms.

We created a Kubernetes operator that defined the SLOs of a service as a CRD. We named it Service Level Operator. The CRD was simple: give me a query for the total events (e.g. requests) and a query for the error events (e.g. 5xx requests), and the operator would calculate a ratio (0–1) at regular intervals and expose it to Prometheus using some metric conventions.
The operator worked well, and we had very simple dashboards that gave us insights quickly. However, there were no period window metrics, alerts, or SLO metadata… we lacked a lot of the SLO pieces, but as a first step it was very nice.
(2019–2020) Asadito
At the end of 2019 I moved to a new company called Cabify. Some teams there were already applying and implementing SLOs. However, creating the SLO Prometheus rules involved a lot of error-prone manual tasks. We even started writing Prometheus unit tests to validate these manually created rules! It was pure toil.

I thought about how we could improve the process and started a new project inside the company: Asadito, a CLI that generates all the Prometheus rules required for the SLI recording rules and the SLO-based alerts (multiwindow, multi-burn-rate). The main idea was simplicity and flexibility, to improve SLO adoption and the SLO usage we had at the time.
In other words, I took some of the experience I had gained creating the Service Level Operator and evolved it to fit this company’s needs.
Multiple teams started using the tool, and suddenly we had standardized SLOs around the company (at least among the teams that wanted to use it): you could discover the SLOs of every team, we had a generic SLO Grafana dashboard, a repository with easy-to-understand SLO specs…
It was a very good evolution of what I had done back in 2018 with the Service Level Operator, and it worked successfully.

(2021) Sloth
I’m no longer at either of those companies. It’s 2021, and I still think that people have a hard time implementing SLOs, and it shouldn’t be like this. I also think I contributed positively in this area at those past companies, so…
After the experience of the Service Level Operator and Asadito, I wanted something similar that would be available for everyone and, like everything I develop in my free time, open source.
Say hi to Sloth!

Sloth easily generates SLOs for Prometheus based on a spec/manifest that scales and is easy to understand and maintain.
Is it a CLI? Is it a Kubernetes controller/operator? It’s both; it adapts to you, so you can choose how you want to use it.
The Sloth spec tries to remove all the difficult configuration parts that make people struggle when applying SLOs in practice: strange error budget formulas, different time windows, different kinds of visualization graphs… (As a quick example of the kind of math it hides: with a 99.9% objective over 30 days, the error budget is 0.1%, roughly 43 minutes of total downtime.)
Sloth focuses on simplicity, ships with safe defaults, and aims to work for 90–95% of people (that’s marketing!). I’m sorry if you are in the other 5–10%; most likely you have already created your own tool or custom SLO rules anyway.
Let’s see an SLO example and what Sloth would offer you.
Spec
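Here is a sketch of what a Sloth manifest can look like, following the prometheus/v1 spec format (the service, metric, and alert names are illustrative):

version: "prometheus/v1"
service: "myservice"
labels:
  owner: "myteam"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "SLO based on the availability of HTTP request responses."
    sli:
      events:
        # {{.window}} is expanded by Sloth to each required SLO window.
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    alerting:
      name: MyServiceHighErrorRate
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning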
As can be seen in the spec, we define multiple SLOs for a service. The idea behind this is that every service should have its own clean and simple SLO manifest, which has two purposes:
- Having SLOs per service.
- Documented service SLOs.
Each SLO has an ID, an objective/target (e.g. 99.9%), an SLI (in this example an events SLI type; there is also a raw SLI type), and finally an optional alerting block to enable/disable the automatic generation of multiwindow, multi-burn-rate alerts.
The SLI Prometheus queries require a templated variable, {{.window}}, that Sloth expands in the generated rules to the required SLO windows (e.g. 5m, 1h, 3d…).
Validation
When Sloth generates the SLO rules, it validates the spec, including the SLI queries, options… so the feedback loop for an incorrect SLO is as fast as possible and you don’t end up with a wrong SLO or spend three hours creating a simple one.
Usage (CLI)
If you use it as a regular CLI, for example for GitOps (Ding! 2020):
sloth generate -i ./myservice-slo.yml
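By default the generated rules are printed to stdout (at least in the current versions), so for a GitOps flow you can simply redirect them into a rules file and commit it:
sloth generate -i ./myservice-slo.yml > ./myservice-rules.yml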
Usage (Kubernetes controller/operator)
If Sloth is running as a controller, you can submit a similar manifest (it’s a CRD) with kubectl; the Sloth controller will generate the rules and store them as Prometheus rules for prometheus-operator.
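As a sketch (the CRD group and kind are taken from the Sloth docs at the time of writing; the names and namespace are illustrative), the Kubernetes flavor wraps the same spec:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: sloth-slo-myservice
  namespace: monitoring
spec:
  service: "myservice"
  slos:
    # Same SLO definitions as in the standalone manifest above.
    - name: "requests-availability"
      objective: 99.9
      sli:
        events:
          error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
          total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))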
You can run the controller with:
sloth kubernetes-controller
Get the SLOs created in the Kubernetes cluster and their state:
kubectl get slos --all-namespaces
Prometheus rules
The Prometheus rules that Sloth generates are standardized. This makes all SLOs easy to discover, ensures they don’t lack information, and gives you a uniform SLO system.
The rules can be categorized into 3 areas:
- SLI recording rules.
- SLO metadata recording rules.
- Multiwindow, multi-burn-rate alert rules.
You can get all the metrics that Sloth has generated using this Prometheus query:
count({sloth_id!=""}) by (__name__)
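To give you an idea of what to expect (metric names as documented at the time of writing; check the Sloth documentation for the exact list), you should see metrics along these lines:
- slo:sli_error:ratio_rate5m (one SLI recording rule per window: 5m, 30m, 1h, 2h, 6h, 1d, 3d, 30d)
- slo:objective:ratio, slo:error_budget:ratio, slo:time_period:days (SLO metadata)
- slo:current_burn_rate:ratio, slo:period_burn_rate:ratio (burn rates used by the alerts)
- sloth_slo_info (SLO information metric)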
Dashboard
As said before, all SLOs share the same uniform implementation, so we can create a generic Grafana dashboard that shows the status of every SLO:

As you already know, Sloth removes the tricky configuration parts of SLOs and assumes safe defaults, like a monthly (30d) time window. Thanks to this, the dashboard comes with an error budget burndown chart that shows how the monthly budget is doing so you can make decisions based on it (I’m thinking of also adding a 7d/week option; use the comments or GitHub issues if you think this would be interesting).

Alerts
We already talked about multiwindow, multi-burn-rate alerts. Sloth implements the approach that Google describes, which is good enough for most cases, as it tracks both slow and fast error budget burn. Sloth will create two types of alerts (each of which can be enabled or disabled):
- Page (critical) alerts: pay attention right now.
- Ticket (warning) alerts: something is not right; no need to worry now, but take a look when you have some time.
You don’t need to think about how to set up the alerts; as with everything else, Sloth sets safe default time windows for these different cases.
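To give an intuition of what these alerts check, here is the idea from Google’s SRE workbook expressed as PromQL. This is a sketch using Sloth-style recording rule names, not the literal rules Sloth emits, and the sloth_id value is illustrative:

# With a 99.9% objective, the error budget is 0.001.
# Page when the budget burns ~14.4x faster than allowed,
# over both a long (1h) and a short (5m) window:
(slo:sli_error:ratio_rate1h{sloth_id="myservice-requests-availability"} > (14.4 * 0.001))
and
(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability"} > (14.4 * 0.001))

Burning at 14.4x the allowed rate means a whole 30-day budget would be gone in about two days, hence the page.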
Future
And what’s the future of Sloth?
OpenSLO
Recently, OpenSLO was published as a standardized SLO spec. Funnily enough, it was created at the same time Sloth was being developed, without either project knowing about the other.
At this moment I’m trying to adapt the Sloth spec to the OpenSLO spec. I don’t think it will be 100% compatible (e.g. OpenSLO doesn’t cover alerting), but even if we end up supporting OpenSLO in Sloth, Sloth’s own spec will always be maintained and remain a first-class citizen, thanks to its simplicity and full feature set.
Update (2021/06/30): OpenSLO support available
SLI plugins
I’m very excited about this feature. It will be available soon, so stay tuned in the coming weeks…
Update (2021/06/10): SLI plugins available with the latest Sloth release
SLI plugins are simple Go files that anyone can develop outside Sloth. Sloth loads them on start (by passing a flag that tells it where to read them from on the filesystem). From that moment on, any SLO can reference these SLI plugins in the SLI part of the spec instead of writing a Prometheus query. You can imagine the power of this extensibility; some examples (there is a plugin sketch after this list):
- Share common SLIs in a company-level repository, maintained by a team or a group of experienced people.
- Community-based SLI repositories (common ones, frameworks, libs, apps).
- Make complex Prometheus SLI queries easy.
- Avoid query repetition across multiple services of a team or company…
- Use SLI plugins as examples to learn from and to create your own.
- Get validation and safety for the SLI queries.
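As a taste, here is a minimal sketch following the plugin contract described in the Sloth docs at the time of writing (the plugin ID, metric name, and service option are illustrative):

package availability

import (
	"context"
	"fmt"
)

const (
	// SLIPluginVersion is the plugin spec version this plugin implements.
	SLIPluginVersion = "prometheus/v1"
	// SLIPluginID is the ID that SLO specs use to reference this plugin.
	SLIPluginID = "myorg/availability"
)

// SLIPlugin returns the raw error-ratio Prometheus query for a service.
// The options map comes from the SLO spec that references the plugin.
func SLIPlugin(ctx context.Context, meta, labels, options map[string]string) (string, error) {
	service := options["service"] // Hypothetical plugin option.
	if service == "" {
		return "", fmt.Errorf("service option is required")
	}
	return fmt.Sprintf(`
sum(rate(http_requests_total{service=%q,code=~"5.."}[{{.window}}]))
/
sum(rate(http_requests_total{service=%q}[{{.window}}]))
`, service, service), nil
}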
Sloth will have a common-SLI repository that people can contribute to and use to set up SLOs even more easily than before.
Conclusion
I covered very quickly what Sloth can do for you; in the Sloth documentation you can read about all the features in more depth.
In a few words, I like to say that Sloth:
Makes applying SLOs simple and easy.
If you can, give it a try with some services. There are prebuilt binaries: just download one, create a simple spec, and generate your Prometheus rules. In less than five minutes you can have SLOs for a service up and running :)
If you can give me feedback or opinions, or if you are already using it yourself or at your company… I would appreciate it a lot! This kind of feedback really motivates me. You can use the comments below or GitHub issues :)
Thanks for reading and happy SLOing!

Oh, I forgot…
Blockchain! (Ding! 2017). Service mesh! (Ding! 2019).