Scalability:Projections Team

Projections

This team is deprecated in favour of Scalability:Observability. Handbook updates to follow.

This team focuses on forecasting & projection systems that enable development engineering to understand system growth (planned and unplanned) for their areas of responsibility. Error Budgets and Stage Group Dashboards are examples of successful projects that have provided development teams information about how their code runs on GitLab.com.

As Dedicated becomes more mature, we will expand our remit to include projection activities for this platform.

We use metrics to inform our decisions. We contribute to the observability of the system by maintaining saturation metrics and by improving the observability tools that help us understand how the system responds to load.

Team Members

The following people are members of the Scalability:Projections team:

Person	Role
Rachel Nienaber	Senior Engineering Manager, Scalability:Projections
Bob Van Landuyt	Staff Backend Engineer, Scalability
Igor Wiedler	Staff Site Reliability Engineer, Scalability
Kennedy Wanyangu	Engineering Manager, Scalability:Practices
Liam McAndrew	Engineering Manager, Scalability:Frameworks
Matt Smiley	Staff Site Reliability Engineer, Scalability

Team Responsibilities

We are responsible for Capacity Planning, Error Budgets and Infrastructure Cost Data.

Capacity Planning

We maintain and improve the Capacity Planning process that is described in the Infrastructure Handbook. This is a controlled activity covered by SOC 2. Please see this issue for further details.

The goal of this process is to predict and prevent saturation incidents on GitLab.com.

Issues are kept in the capacity planning issue tracker. Where an issue is needed to improve metrics to support this process, we raise an issue in the Scalability group tracker with the label of Saturation Metrics.

We develop and release Tamland, our saturation forecasting tool for capacity planning. Tamland creates the capacity warning issues automatically.

Capacity Planning Triage Rotation

The triage rotation is maintained in a PagerDuty Schedule: https://gitlab.pagerduty.com/schedules#PRMDCJG

The responsibility for reviewing Tamland reports rotates among all members of the Scalability:Projections team.

The rotation lasts for a minimum of two weeks. There is flexibility in the schedule to allow for OOO and on-call responsibilities. If you need to adjust your shift, please find another team member to take it and add the override to the schedule.

The rotation cycle is this long to provide exposure to the wide variety of capacity warnings that occur and to enable each person to gain context on the components that we monitor. The handover day is Thursday, which leaves time for any sync calls needed so that the review of the capacity planning issues can still be completed by the end of Monday.

Triage Duties

The triage duties are:

Review the handover issue and close it when you're ready to get started with triage
Review all capacity planning issues that are past their due date, or in the Open column (which means they do not have a capacity-planning:: workflow label). The saturation labels can help in choosing which issues to review first, if there are many with the same due date.
For each item, check if the warning still applies and follow up with the DRI or other engineers for an updated status.
Set appropriate due dates for the next review.
Raise any significant concerns through the SaaS Availability weekly standup (currently on Tuesdays in the UTC afternoon) by adding them to the meeting agenda.
- An issue is a significant concern if it is a critical non-horizontally-scalable resource at risk of imminent saturation or an issue that needs additional attention from leadership.
- If something is really pressing, raise it with the Engineering Manager, who will escalate to leadership as appropriate.
Check that Tamland is running and generating output, and bring it to the team's attention if it is not. Check the scheduled pipelines on ops for this. There should be a daily job populating the cache and a weekly job for Tamland, both running without failures.
Create a handover issue once your shift comes to an end and hand information over to the next person taking it.
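
The selection rule in the second duty can be sketched in a few lines. This is a minimal illustration, not the team's tooling: the `Issue` record, its field names and the helper functions are hypothetical, and only the `capacity-planning::` label prefix comes from the text above. The sketch picks issues that are past their due date or that carry no workflow label, then orders them by due date.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Scoped workflow label prefix mentioned in the triage duties.
WORKFLOW_PREFIX = "capacity-planning::"


@dataclass
class Issue:
    """Hypothetical, simplified view of a capacity planning issue."""
    title: str
    due_date: Optional[date] = None
    labels: List[str] = field(default_factory=list)


def has_workflow_label(issue: Issue) -> bool:
    """True if the issue carries any capacity-planning:: workflow label."""
    return any(label.startswith(WORKFLOW_PREFIX) for label in issue.labels)


def needs_review(issue: Issue, today: date) -> bool:
    """An issue needs review if it is overdue or has no workflow label."""
    overdue = issue.due_date is not None and issue.due_date < today
    return overdue or not has_workflow_label(issue)


def triage_queue(issues: List[Issue], today: date) -> List[Issue]:
    """Select issues needing review, earliest due date first, undated last."""
    selected = [issue for issue in issues if needs_review(issue, today)]
    return sorted(
        selected,
        key=lambda issue: (issue.due_date is None, issue.due_date or date.max),
    )
```

In practice the saturation labels would act as a further tie-breaker among issues with the same due date; that step is left out here because it depends on how those labels are ranked.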

Make sure to set aside at least half a work day during each week in your rotation to go through the items in the Capacity Planning board. Consider re-scheduling one of your shifts if it coincides with another rotation (e.g. EOC on-call duties). When your rotation is finished, you need to provide handover notes in the #infra_capacity-planning channel for the incoming person.

Some tips to help you to get started on duties:

For items that need further monitoring, attach the current forecast to your comment. The forecast will change over the following weeks, and earlier forecasts cannot be viewed afterwards.
Assign the issue to the Engineering Manager for the team that owns the service.
You will often want to run the underlying query of the saturation component to get more context on its current state. You can either:
- Follow the Grafana alert dashboard, copy over the query from the Explore page, and experiment with it in Thanos.
- Search for component: <component_name> (e.g. component: disk_space) in the runbooks project. The underlying recording rule can be found in rules/autogenerated-saturation.yml (example for component: disk_space).
Searching for the component or alert name in the Capacity Planning Issue Tracker can be a good source of information: some recurring saturation forecasts share the same or similar causes, and past issues show how similar warnings have been investigated.
Some events can reduce the accuracy of Tamland's forecast. For example, an infrastructure change can shift the baseline of a saturation metric, or a one-off usage spike can make the forecast overly pessimistic. If you suspect that one of these situations is causing unwarranted saturation alerts, you can try adding trend_changepoints and ignore_outliers entries to the forecast parameters. For more details, see Tuning the forecast.
Sometimes Tamland generates alerts for services whose saturation forecast is generally trending downwards, but for which Tamland's confidence interval (the light blue area in prediction graphs) still includes the possibility of saturation. If your investigation yields no reason to suspect that the saturation risk is real, you can label the issue ~capacity-planning::monitor and move its due date out by one week or more.
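
The last tip can be made concrete with a small sketch. It is an illustration under stated assumptions, not Tamland's actual logic: the `Forecast` record, the `triage_hint` function, its return values and the sample numbers are all hypothetical. It separates forecasts whose median line reaches the saturation threshold from those where only the upper edge of the confidence interval does while the median trends downwards, which is the case where the monitor label may be appropriate after investigation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Forecast:
    """Hypothetical forecast: one utilization ratio (0 to 1) per future day."""
    median: List[float]  # central forecast line
    upper: List[float]   # upper edge of the confidence interval


def triage_hint(forecast: Forecast, threshold: float) -> str:
    """Return a rough hint for the triager; it does not replace investigation."""
    median_breach = any(value >= threshold for value in forecast.median)
    upper_breach = any(value >= threshold for value in forecast.upper)
    trending_down = forecast.median[-1] < forecast.median[0]

    if median_breach:
        return "investigate"
    if upper_breach and trending_down:
        # Only the widening confidence interval touches the threshold.
        return "monitor-candidate"
    if upper_breach:
        return "investigate"
    return "ok"


# Made-up numbers: the median falls while the interval widens past 0.90.
example = Forecast(
    median=[0.70, 0.68, 0.66, 0.64],
    upper=[0.80, 0.84, 0.88, 0.91],
)
print(triage_hint(example, threshold=0.90))
```

The hint only narrows down where to look. Whether the risk is real still depends on the investigation described above.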

Error Budgets

We maintain the Error Budgets process that is described in the Engineering Handbook.

Issues are kept in the Scalability group tracker with the label of Category::Error Budgets.

We maintain the metrics used to generate the Error Budgets and we ensure that the reports are published on time.

We advocate for improving the SLOs for Stage Groups and we provide support to help them achieve this. Providing the Stage Groups with data about how their feature categories operate on GitLab.com enables them to make good choices about how to efficiently improve the reliability, availability and performance of their feature categories.

Indicators

The Scalability group is an owner of several performance indicators that roll up to the Infrastructure department indicators:

The Service Maturity model, which covers GitLab.com's production services.
The forecasting project named Tamland, which generates and displays utilization and saturation reports.

These are combined to enable us to better prioritize team projects.

An overly simplified example of how these indicators might be used, in no particular order:

Service Maturity - provides detail on how trustworthy the data we receive from the observability stack is for a given service; the lower the level, the more we need to focus on improving the service's observability
Tamland reports - provide a forecast for a specific service

Between these different signals, we have a relatively (im)precise view into the past, present and future to help us prioritise scaling needs for GitLab.com.
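
As a rough sketch of how the two signals might be read together, the snippet below maps each service to a suggested focus. Everything in it is an assumption made for illustration: the `ServiceSignals` record, the `MIN_TRUSTED_LEVEL` cut-off and the returned strings are hypothetical and do not describe an existing tool or an agreed policy.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a made-up cut-off below which the service's data is not
# considered trustworthy enough to act on a forecast.
MIN_TRUSTED_LEVEL = 2


@dataclass
class ServiceSignals:
    """Hypothetical combination of the two indicators for one service."""
    name: str
    maturity_level: int                 # Service Maturity level
    days_to_saturation: Optional[int]   # from a forecast; None if no breach is predicted


def suggested_focus(signals: ServiceSignals) -> str:
    """Low maturity first: a forecast is only as good as the data behind it."""
    if signals.maturity_level < MIN_TRUSTED_LEVEL:
        return "improve observability"
    if signals.days_to_saturation is not None:
        return "plan scaling"
    return "no action"


print(suggested_focus(ServiceSignals("example-service", maturity_level=1, days_to_saturation=30)))
```

The ordering of the checks reflects the idea that a low maturity level calls for better observability before the forecast is relied upon.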