Editor’s note: Today we hear from the Lowe’s SRE team. They share how they have been able to increase the number of releases they can support by adopting Google’s Site Reliability Engineering (SRE) framework and leveraging their partnership with Google Cloud.

At Lowe’s, we’ve made significant progress in our multiyear technology transformation. To modernize our systems and build new capabilities for our customers and associates, we leverage Google’s SRE framework and Google Cloud, which helps us meet their needs faster and more effectively. With these efforts, we’ve been able to go from one release every two weeks to 20+ releases daily, a 300x increase. Our SRE transformation didn’t happen overnight, though. Every step along the way brought some challenges. But looking back, we are excited to see how much we have accomplished for our customers as a result.

Back in 2018, before adopting SRE practices, we were more reactive than proactive, following an “eyes on glass” approach. On-call structures and incident management were inefficient, with too many repetitive and manual tasks, resulting in operational toil. Production concerns were not surfaced into the product roadmap, which resulted in delays in making fixes.

Bootstrapping SRE at Lowe’s

As we moved from on-prem to Google Cloud, we decided to move from a monolithic to a microservices-based architecture. And to better manage this new architecture, we embarked on an SRE journey. Then as COVID-19 hit, we really had to accelerate this journey as customers increasingly moved to online ordering and delivery to meet their Total Home Improvement needs. To do so, we followed four key principles that allowed us to meet changing customer needs quickly and release fast and reliably.

As we moved from traditional Ops to an SRE ecosystem, our biggest opportunity was reducing toil, so that engineers can spend time on activities that drive business impact and customer outcomes. We think of toil as work that is manual, repetitive, tactical, and devoid of enduring value, but automatable. So, to tackle toil, we focused on automating away the need for manual intervention. As an example, we made sure engineers were not the first point of contact for any alert. Any triage or resolution that an engineer can perform, a machine can be trained to perform as well. We used supervised and unsupervised learning techniques to automate our toil. With a long-term goal of “no toil,” our SREs work on identifying and reducing toil to a manageable level across the organization.

Our goal is to maximize the engineering velocity of developer teams while keeping products reliable. We want an engagement model where product, SRE and development teams are closely aligned. A key way we’ve been able to create this alignment is by having our SREs embedded into domain and product teams. Each domain has an SRE, who is involved at the beginning stages of product development to ensure that the domain stakeholders are in alignment with the SRE initiatives.