Solving Gitaly Memory Spikes: Why Cgroup v2 is Critical for GitLab on Kubernetes

These articles are AI-generated summaries. Please check the original sources for full details.

Understanding Gitaly and Kernel Memory Consumption in Kubernetes on Self-Hosted GitLab

A self-hosted GitLab deployment on EKS saw pod memory spike by 35.6 GB during Gitaly repository backups because of Linux kernel page caching. The Gitaly process itself used only 195 MB; the rest was active_file page cache that the kernel did not reclaim, leaving the pod exposed to OOM kills.

Why This Matters

In Kubernetes environments, every pod on a node shares one Linux kernel, and under Cgroup v1 that kernel manages node RAM without a granular understanding of pod-specific cache requirements. This creates a dangerous gap: a single backup job can leave roughly 36 GB of a 37 GB pod occupied by active page cache, potentially starving critical services. The kernel keeps the cache because it assumes the files will be reused soon, even though the Gitaly process has already returned to its 195 MB baseline.

Key Insights

  • Gitaly gRPC operations for backups trigger massive kernel page cache spikes that persist as active_file long after the job completes.
  • Cgroup v1 independent hierarchies fail to communicate memory pressure effectively, leading to 35.6GB of cache being held indefinitely in a 37GB pod.
  • Cgroup v2 exposes Pressure Stall Information (PSI) at /sys/fs/cgroup/memory.pressure, so memory pressure is measured per cgroup and the kernel can release cache automatically once pressure is detected.
  • The Linux kernel has no native concept of a pod or container, viewing the page cache as a global node resource regardless of pod limits.
  • Gitaly RSS memory usage remained stable at 195MB while total pod usage ballooned due to kernel-level file caching of .git bundles.

Working Examples

Cgroup v1 memory breakdown showing the discrepancy between RSS and active file cache.

cache: 38829035520 # 36.2 GB !!!!!
rss: 204779520 # 195 MB
inactive_file: 568246272 # 542 MB
active_file: 38260654080 # 35.6 GB !!!!!
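
The breakdown above can be reproduced with a few lines of Python. This is a minimal sketch, not the tooling used in the original investigation: the function name `parse_memory_stat` is hypothetical, and the sample text simply reuses the four counters shown above. It accepts both the annotated `key: value # comment` form shown here and the plain `key value` form the kernel writes to a cgroup v1 `memory.stat` file.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Sample values copied from the breakdown above.
SAMPLE = """\
cache: 38829035520 # 36.2 GB !!!!!
rss: 204779520 # 195 MB
inactive_file: 568246272 # 542 MB
active_file: 38260654080 # 35.6 GB !!!!!
"""


def parse_memory_stat(text):
    """Parse memory.stat-style text into a {counter_name: bytes} dict."""
    stats = {}
    for line in text.splitlines():
        # Drop trailing annotations such as "# 35.6 GB".
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Tolerate an optional colon after the counter name.
        parts = line.replace(":", " ").split()
        if len(parts) != 2:
            continue
        name, value = parts
        stats[name] = int(value)
    return stats


stats = parse_memory_stat(SAMPLE)
cache_share = stats["active_file"] / stats["cache"]

print(f"rss:         {stats['rss'] / MIB:8.1f} MiB")
print(f"active_file: {stats['active_file'] / GIB:8.1f} GiB")
print(f"active_file share of cache: {cache_share:.1%}")
```

With these numbers, active_file accounts for more than 98% of the cached bytes, while RSS is a small fraction of a gibibyte. That gap is the discrepancy the article describes.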

Cgroup v2 Pressure Stall Information (PSI) interface.

# cgroup v2 exposes:
/sys/fs/cgroup/memory.pressure
# Content:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
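
Each PSI line reports the share of recent wall-clock time in which tasks were stalled on memory: `some` means at least one task was stalled, `full` means all non-idle tasks were stalled, the `avg` fields are percentages over 10, 60, and 300 second windows, and `total` is the cumulative stall time in microseconds. The sketch below parses that format. The function names and the 10% threshold are illustrative assumptions, not values taken from the article.

```python
# Sample content copied from the interface shown above.
SAMPLE = """\
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""


def parse_pressure(text):
    """Parse PSI text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        entry = {}
        for pair in pairs:
            name, _, value = pair.partition("=")
            # "total" is an integer count of microseconds; averages are percentages.
            entry[name] = int(value) if name == "total" else float(value)
        result[kind] = entry
    return result


def under_pressure(psi, threshold_pct=10.0):
    """Hypothetical check: is the 10-second 'some' average above a chosen threshold?"""
    return psi["some"]["avg10"] > threshold_pct


psi = parse_pressure(SAMPLE)
print(psi["some"])
print("under pressure:", under_pressure(psi))
```

On a live system the same parser could be fed the contents of the `memory.pressure` file read from inside the pod, assuming the node runs cgroup v2.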

Practical Applications

  • Use Case: Migrate EKS worker nodes to Cgroup v2 so the kernel can automatically reclaim active_file cache when memory pressure is detected. Pitfall: Relying on Cgroup v1, which only limits container usage without giving the kernel the context it needs for global cache reclamation.
  • Use Case: Implement a privileged CronJob to manually drop caches after daily gitlab-toolbox-backup runs as a low-effort temporary fix. Pitfall: Increasing pod memory limits as an emergency measure, which only delays OOM kills without addressing the root kernel behavior.
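
Before planning a migration, it helps to confirm which cgroup version a node or pod is actually running. The following is a rough sketch under stated assumptions: the function name `detect_cgroup_version` is hypothetical, and the heuristic relies on the unified v2 hierarchy exposing a `cgroup.controllers` file at its mount point, whereas v1 mounts each controller (such as `memory`) as its own subdirectory. Hybrid setups are not handled.

```python
from pathlib import Path


def detect_cgroup_version(root=Path("/sys/fs/cgroup")):
    """Guess the cgroup version from the layout under the given mount point."""
    root = Path(root)
    # The v2 unified hierarchy has a cgroup.controllers file at its root.
    if (root / "cgroup.controllers").is_file():
        return "v2"
    # v1 mounts one directory per controller, e.g. /sys/fs/cgroup/memory.
    if (root / "memory").is_dir():
        return "v1"
    return "unknown"


if __name__ == "__main__":
    print("cgroup version:", detect_cgroup_version())
```

Run inside a Gitaly pod or on the worker node itself, a result of `v1` would indicate that the `memory.pressure` interface described above is not available there.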
