GitLab | Raleigh, NC | July 2018 - Present
Site Reliability Engineer
Operate one of the largest GitLab installations, ensuring stability and performance at global scale for millions of daily users.
- Led the automation and scaling of GitLab's core infrastructure, transitioning from a manual, VM-based environment to a dynamic, auto-scaling system, significantly increasing platform capacity and improving resource utilization.
- Drove zero-downtime deployments for GitLab.com by implementing advanced release strategies and automated rollback procedures, keeping the platform highly available and reliable.
- Reduced cloud storage costs for container images by collaborating with the Container Registry team on data migration and storage optimization, earning a discretionary bonus nomination.
- Developed and implemented automated escalation procedures for critical incidents, reducing MTTR by hours.
- Mentored junior SREs and engineers on best practices for infrastructure monitoring, incident management, and deployment processes, fostering a culture of reliability and knowledge sharing.