Solving Gitaly Memory Spikes: Why Cgroup v2 is Critical for GitLab on Kubernetes

Understanding Gitaly and Kernel Memory Consumption in Kubernetes on Self-Hosted GitLab

A self-hosted GitLab deployment on EKS saw pod memory spike by 35.6GB during Gitaly repository backups because of Linux kernel page caching. The Gitaly process itself used only 195MB, but the kernel did not reclaim the active_file cache, which put the pod at risk of OOM kills.

Why This Matters

In Kubernetes environments, the node's shared Linux kernel manages RAM, and under Cgroup v1 it has no granular understanding of pod-specific cache requirements. This creates a dangerous gap: a single backup job can leave more than 35GB of the node's memory held as active cache, because the kernel assumes the files will be reused soon. Critical services can be starved even though the Gitaly process has already returned to its 195MB baseline.

Key Insights

Gitaly gRPC operations for backups trigger massive kernel page cache spikes that persist as active_file long after the job completes.
Cgroup v1 independent hierarchies fail to communicate memory pressure effectively, leading to 35.6GB of cache being held indefinitely in a 37GB pod.
Cgroup v2 exposes Pressure Stall Information (PSI) at /sys/fs/cgroup/memory.pressure, a per-cgroup pressure signal that lets cache be released automatically once memory pressure is detected.
The Linux kernel has no native concept of a pod or container, viewing the page cache as a global node resource regardless of pod limits.
Gitaly RSS memory usage remained stable at 195MB while total pod usage ballooned due to kernel-level file caching of .git bundles.
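
Before planning a migration, it helps to confirm which cgroup version a node or container actually sees. The sketch below is a minimal check that relies on one known property of the unified hierarchy: the cgroup v2 root contains a cgroup.controllers file. The function name cgroup_version and its root parameter are illustrative assumptions, and a hybrid setup (v2 mounted under a subdirectory) would be reported as v1 here.

```python
from pathlib import Path


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 for a unified cgroup v2 mount, 1 otherwise, None if root is missing.

    Hypothetical helper: it only checks for cgroup.controllers at the mount
    root, which exists on cgroup v2 but not on a v1 (or hybrid) root.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return None  # not Linux, or the cgroup filesystem is not mounted here
    if (root_path / "cgroup.controllers").is_file():
        return 2
    return 1
```

Calling cgroup_version() inside the Gitaly pod or on the worker node indicates whether the PSI file discussed here should be present at all.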

Working Examples

Cgroup v1 memory breakdown showing the discrepancy between RSS and active file cache.

cache: 38829035520 # 36.2 GB !!!!!
rss: 204779520 # 195 MB
inactive_file: 568246272 # 542 MB
active_file: 38260654080 # 35.6 GB !!!!!
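
The annotations above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that parses memory.stat-style text and converts the counters to binary units. It assumes the raw cgroup v1 file uses space-separated "key value" lines (the colons and comments above look like the author's annotations, so the parser tolerates both); parse_memory_stat and summarize are hypothetical helper names.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def parse_memory_stat(text):
    """Parse 'key value' or 'key: value # note' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing annotations
        if not line:
            continue
        key, value = line.replace(":", " ").split()[:2]
        stats[key] = int(value)
    return stats


def summarize(stats):
    """Convert the counters discussed in the article into readable figures."""
    return {
        "cache_gib": round(stats["cache"] / GIB, 1),
        "rss_mib": round(stats["rss"] / MIB),
        "inactive_file_mib": round(stats["inactive_file"] / MIB),
        "active_file_gib": round(stats["active_file"] / GIB, 1),
        # Share of the page cache that sits on the active list
        "active_share_of_cache": round(stats["active_file"] / stats["cache"], 3),
    }


SAMPLE = """\
cache 38829035520
rss 204779520
inactive_file 568246272
active_file 38260654080
"""

summary = summarize(parse_memory_stat(SAMPLE))
```

With the sample values, active_file accounts for about 98.5% of the cache, which is why the pod looks full while the process RSS stays near 195MB.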

Cgroup v2 Pressure Stall Information (PSI) interface.

# cgroup v2 exposes:
/sys/fs/cgroup/memory.pressure
# Content:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
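
For monitoring, the PSI file can be read and parsed directly. In the kernel's PSI format, the "some" line reports the share of time at least one task was stalled on memory, "full" reports time when all non-idle tasks were stalled, the avg fields are percentages over 10, 60, and 300 second windows, and total is cumulative stall time in microseconds. The sketch below parses that text; parse_psi, under_pressure, and the 10.0 threshold are illustrative assumptions, not part of any GitLab or Kubernetes API.

```python
def parse_psi(text):
    """Parse PSI text into {'some': {...}, 'full': {...}} with numeric values."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        metrics = {}
        for field in fields:
            name, _, value = field.partition("=")
            # 'total' is an integer microsecond counter; avg* are percentages
            metrics[name] = int(value) if name == "total" else float(value)
        result[kind] = metrics
    return result


def under_pressure(psi, threshold=10.0):
    """Hypothetical check: True if the 10s 'some' average exceeds threshold (%)."""
    return psi.get("some", {}).get("avg10", 0.0) > threshold


SAMPLE_PSI = """\
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""

psi = parse_psi(SAMPLE_PSI)
```

On a live node the input would come from reading /sys/fs/cgroup/memory.pressure instead of SAMPLE_PSI.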

Practical Applications

Use Case: Migrate EKS worker nodes to Cgroup v2 so the kernel can automatically reclaim active_file cache when memory pressure is detected. Pitfall: Relying on Cgroup v1, which only limits container usage and does not give the kernel the context it needs for global cache reclamation.
Use Case: As a low-effort temporary fix, run a privileged CronJob that manually drops caches after the daily gitlab-toolbox-backup run. Pitfall: Raising pod memory limits as an emergency measure, which only delays OOM kills without addressing the underlying kernel behavior.
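
The manual cache drop in the second use case comes down to a single write to /proc/sys/vm/drop_caches, where writing 1 frees the page cache, 2 frees dentries and inodes, and 3 frees both. The sketch below shows what the CronJob's container could execute; it is an assumption about how such a job might be written, not the author's script. It must run as root with a writable /proc (hence the privileged pod), and it affects the whole node, not only the Gitaly pod. The function name drop_page_cache and its parameters are hypothetical.

```python
import os


def drop_page_cache(path="/proc/sys/vm/drop_caches", mode="1"):
    """Flush dirty pages, then ask the kernel to drop clean caches.

    mode "1" = page cache, "2" = dentries and inodes, "3" = both.
    The path parameter exists only so the function can be exercised
    against a regular file in tests.
    """
    if mode not in {"1", "2", "3"}:
        raise ValueError(f"unsupported drop_caches mode: {mode!r}")
    os.sync()  # dirty pages cannot be dropped, so write them back first
    with open(path, "w") as handle:
        handle.write(mode)


# Example invocation inside the privileged job (not executed here):
#     drop_page_cache()
```

Mode 1 is the narrowest option that addresses the active_file growth described here; the other modes also discard dentry and inode caches that unrelated workloads on the node may still benefit from.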

References:

https://dev.to/camilacodes/understanding-gitaly-and-kernel-memory-consumption-in-kubernetes-on-self-hosted-gitlab-2je3
