Nirmitee.io - Lessons from a Platform Engineering team

On a recent client account, we decided to field a platform engineering / enablement (PE) team to streamline work across the account. The account had multiple application teams supported by the PE team.

The PE team had a lot of responsibilities, primary among them being:

  • Understand and support application teams, and insulate them from potential bureaucracy and delays in provisioning accounts, infrastructure, access and more.
  • Accelerate infrastructure provisioning and continuous delivery via "paved roads" so that application teams can focus on their specific areas of responsibility.
  • Ensure consistency and reduce duplication of infrastructure code across application teams.
  • Plan for and help application teams with implementing aspects such as observability, backups, disaster recovery, high-availability etc.

This is a story based on lessons learned from that PE team effort.

As it happened ...

Team setup

There were two main PE team setup approaches considered:

  1. Centralized PE team with infrastructure specialists
  2. Infrastructure specialists embedded within each application team

In our case, we chose to go with a centralized team, while encouraging internal open-source contributions from application team members and the wider client organization. This meant that all members of the PE team worked closely together, making it easier to share knowledge and to help team members who were new to the infrastructure world. It also helped reduce duplicated work across application teams, since all platform work was centralized and visible.

However, this centralization had downsides related to prioritization, understanding teams' needs and more, as detailed later in this article. It also left the application teams depending on an over-stretched and often-delayed platform engineering team for the changes they needed in order to move quickly. The application teams felt a lack of ownership, since the PE team was ultimately responsible for provisioning all the infrastructure and even for writing and maintaining the code for the Continuous Delivery pipelines!

Lesson learned: While there were benefits to the centralized approach, we were strongly considering moving to a hybrid model: a tiny centralized team to ensure consistency and shared learnings, with the decentralized, embedded mode of working as the default for the rest of the team. This would likely have scaled better and would have been better for the application teams in the long run. A build-then-extract model (extracting the platform from work done in the application teams), rather than a build-then-use model (led by the platform team), could have worked out better.

If centralizing early, it makes sense to use some of that time to agree on naming conventions, Terraform module structures, testing strategy and more, so that duplication and waste are reduced once application teams start onboarding.
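
As a rough illustration of the kind of convention worth agreeing on up front (the variables, tagging scheme and resource names below are hypothetical, not the ones we used), a shared Terraform naming and tagging scheme might look something like this:

```hcl
# Hypothetical shared convention: every application team derives resource
# names and tags from the same inputs, so code can be reused across teams.
variable "team" {
  description = "Short name of the application team, e.g. payments"
  type        = string
}

variable "environment" {
  description = "Deployment environment: dev, qa or sit"
  type        = string
}

locals {
  # Agreed naming convention: <team>-<environment>-<component>
  name_prefix = "${var.team}-${var.environment}"

  common_tags = {
    Team        = var.team
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket" "artifacts" {
  bucket = "${local.name_prefix}-artifacts"
  tags   = local.common_tags
}
```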

Pete Hodgson, in his article "How platform teams get stuff done", details other approaches we were considering.

When should the PE team start?

There were discussions about starting platform engineering work a few weeks earlier than the rest of the application teams. The idea was to give the platform team a head start, so that the beginnings of a platform would be ready for the application teams to use as soon as they started.

The challenge with starting the PE team earlier than the application teams is that it might not be clear what needs to be built without input from the application teams. There is a definite risk of the platform team building unnecessary platform infrastructure that might need rebuilding later.

In our case, the PE team had a little time before the application teams started, but wasn't able to use that time effectively. It wasn't clear that giving the PE team more time would have helped much. It was also a new client, and understanding their ways of working and learning to work together with them took time. As a consequence, when the application teams started, they were immediately blocked, waiting on the PE team, which had become a bottleneck, to provide what they needed.

Lesson learned: This one is tricky. It is likely that the right move is to give the PE team a head start, but only when adequate product ownership, well-versed in infrastructure and building platforms, is available. Without that, the PE team will not be able to prepare adequately for when the application teams start.

Staffing a product owner

We knew we wanted to bring analysis and product-thinking into the PE team, to help the team decide when and what to build. However, this requires product ownership or product management that understands what building a platform entails. It needs someone who is familiar with infrastructure and the cloud, who can work with application teams to understand their needs, who can prioritize those needs across application teams, and who can plan for observability, security considerations and other concerns that the application teams might not know to focus on.

Initially, we worked with a strong business analyst who wasn't familiar with infrastructure and cloud ecosystems. Unfortunately, without that knowledge it was challenging for that person to be effective, since every conversation and every story that needed to be written went deep into infrastructure, platforms and the cloud.

We decided that it was more important for the product owner (PO) of this team to have that infrastructure domain knowledge than to have a deep grasp of product-thinking aspects such as experimentation, value, product strategy and more. We ended up asking an experienced infrastructure engineer, who was capable of and interested in learning some product management skills, to take on the role. This helped the PE team (through the PO) and the application teams collaborate more effectively, since they were able to speak the same language and understand each other's issues better.

We also started tracking planned stories and ad-hoc support requests separately and were able to find patterns in those requests, so that we could write stories to address issues that came up.

Lesson learned: Lean towards a PO with platform and infrastructure knowledge for this kind of team, even if their product management skills aren't as strong as those of a "real" PO. Maybe consider pairing a PO with an infrastructure engineer for this role.

Challenges of prioritization

Since there was only one PE team, one PO and multiple application teams, we faced many challenges with prioritizing the needs of the application teams. These included:

  1. Local maxima: Some teams needed much more from the PE team than others. But, in our attempt to be equitable, we ended up picking some lower-priority items from the teams that needed less. Had we optimized globally across the account, we would likely have prioritized differently. This caused some application teams to become disillusioned with the PE team, since they were blocked and it seemed to them that other teams got priority over them.
     
  2. Even though the PE team held all the infrastructure expertise, we went to the application teams to ask what they needed, in what infrastructure detail, and when they needed it, because the PE team did not know enough about what the teams were planning to build from an application perspective. Some teams were (naturally!) out of their depth here, since infrastructure was not their specialty.

Lesson learned: Collaborate more with application teams. Designate people from the application teams to work with members of the PE team, so that both sides can learn from each other and collaborate better. Help different application teams differently based on their needs.

Reactive vs. Proactive work

At the beginning, the PE team was in a reactive mode, running to catch up with the needs of the multiple application teams. The team had no choice but to be in that mode, since application teams were blocked. As time progressed, the PO issue was addressed as mentioned above, and the team started to understand how to operate more efficiently. We started to notice similar needs across application teams. Since all the infrastructure was provisioned via code, this helped us reuse code across application teams and start unblocking them.

As this happened, we started to understand what the teams were trying to achieve and to anticipate their needs. We started being able to carve out time to think about security needs across application teams and how to address them within the Continuous Delivery pipelines, what kind of support the platform can provide for running end-to-end tests within the pipeline, how we could architect our infrastructure-as-code better and more.

There are a couple of stories below, which go into more detail.

Lesson learned: Be reactive at the beginning if needed, but never forget that the PE team is there to build a platform, not just to provision infrastructure.

Reactive vs. Proactive: Bootstrapping infrastructure

The first story of this kind of work involves automating the initial bootstrapping infrastructure. The initial roles and identities for access, Amazon S3 buckets and pipelines were created manually. This was technical debt: the team was building on shaky foundations that were prone to human error whenever we needed to make changes.

The team formed a plan to automate these and implemented the changes as detailed in this post ("Bootstrapping infrastructure: The chicken-and-egg problem"). This helped increase the PE team's efficiency in on-boarding new application teams and new team members onto the platform.
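
The linked post has the details; as a hedged sketch (the resource names and account ID below are illustrative, not the account's actual setup), the bootstrap resources were of this general shape: the bucket that holds Terraform state and a role that the pipelines assume to provision everything else.

```hcl
# Illustrative bootstrap resources: the Terraform state bucket and an IAM
# role that CI pipelines assume. These are exactly the resources that are
# awkward to manage, because they must exist before the first pipeline run.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "example-org-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_iam_role" "pipeline_provisioner" {
  name = "example-pipeline-provisioner"

  # Trust policy deliberately simplified; in practice it would be scoped
  # to the CI system's identity rather than the whole account.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::111111111111:root" }
      Action    = "sts:AssumeRole"
    }]
  })
}
```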

Reactive vs. Proactive: Secure connections across AWS accounts

The PE team noticed a repeated pattern of needs from application teams around connecting securely to services running in different AWS accounts. This used to be managed via time-consuming support tickets to the network security team to open ports through the firewall, manually created jumphosts, and other workarounds.

The team implemented a generic cross-AWS-account tunneling solution using AWS SSM, as detailed in this post ("Tunneling across AWS accounts"). This approach could be used within pipelines to run end-to-end tests, and to connect securely to AWS services with auditing and observability in place. The PE team was able to offer this as a service, document it, and point teams to the documentation so that they could set up their own instances of it within their pipelines.
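
The linked post covers the actual design; as a hedged sketch of one ingredient (the role name, tag key and tag value are invented), the permission that lets a pipeline role start SSM sessions only against designated tunnel endpoints might look roughly like this, with the session itself started from the pipeline via the AWS CLI's SSM port-forwarding support:

```hcl
# Illustrative only: allow a pipeline role to open SSM sessions, restricted
# to instances tagged as tunnel endpoints in the target account.
data "aws_iam_policy_document" "ssm_tunnel" {
  statement {
    effect    = "Allow"
    actions   = ["ssm:StartSession"]
    resources = ["arn:aws:ec2:*:*:instance/*"]

    condition {
      test     = "StringEquals"
      variable = "ssm:resourceTag/Purpose" # hypothetical tag key
      values   = ["tunnel-endpoint"]       # hypothetical tag value
    }
  }
}

resource "aws_iam_role_policy" "ssm_tunnel" {
  name   = "allow-ssm-tunnel"
  role   = "example-pipeline-provisioner" # hypothetical pipeline role name
  policy = data.aws_iam_policy_document.ssm_tunnel.json
}
```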

Pipeline design and responsibilities

As mentioned earlier, the PE team ended up being responsible for all pipelines across the application teams! This was partly a consequence of how the teams were set up, but also due to a lack of documentation and knowledge-sharing from the PE team. Additionally, the Continuous Delivery tool used was new to everyone, including the PE team, which added complexity and increased the time taken to create the pipelines.

Some application team members were interested in how the pipelines were set up and felt confident in their ability to modify them. Others were completely hands-off and expected the PE team to take care of failures in their pipelines. This caused more bottlenecks, where the PE team had to step in to help, slowing down other work that was also high-priority.

We started documenting our designs better and pointing team members to other application teams' pipelines and to examples we had created, so that they could borrow code from them. We also started setting expectations with the application teams that they were responsible for their pipelines, with the PE team collaborating with and supporting them.

This started working well, but came with other challenges around governance -- for instance, around ensuring that security checks run in the pipeline, and aren't turned off, even by mistake. We considered templates and other enforcement mechanisms, but decided to leave it low-touch for the moment, handling those concerns via education, collaboration and increasing trust between the teams.

Lesson learned: The PE team being responsible for pipelines across teams does not scale. However, a completely hands-off approach without any governance capability brings its own challenges. We need to be clear about responsibilities between the PE team and application teams and the interfaces provided by the PE team, while putting in unobtrusive guardrails that help guide application teams without getting in their way.

Infrastructure as Code design

The two final lessons are about Infrastructure as Code (IaC) design.

First: Kief Morris, of "Infrastructure as Code" book fame, has written about an anti-pattern called "Snowflakes as code". In the PE team, we ended up with a version of this, which brought along all the problems mentioned in Kief's post (not repeated here). It took us a while to fix, working slowly to undo mistakes in how we had modeled our IaC.
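
Kief's post explains the anti-pattern properly; very roughly, the smell is a separately copied and tweaked set of infrastructure code per environment or per team that drifts over time. A hedged sketch of the shape we moved towards instead (the module path and variables are invented for illustration):

```hcl
# Instead of one copied-and-tweaked stack per environment ("snowflakes as
# code"), a single parameterized definition is instantiated per environment,
# and differences live in variables rather than in forked code.
variable "environment" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.small"
}

module "service_stack" {
  # Hypothetical shared module; each environment supplies only its inputs.
  source        = "../modules/service-stack"
  environment   = var.environment
  instance_type = var.instance_type
}
```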

Second: To prevent application teams from being blocked waiting for the PE team, we allowed them to spin up infrastructure manually, with the promise that we would convert it to IaC as soon as possible. As a result, the PE team was always in catch-up mode, trying to automate infrastructure that was being brought up quickly. Sometimes we didn't even know the exact settings that had been used while bringing up the infrastructure. This caused further delays and issues when trying to propagate services across environments such as Dev, QA and SIT. Tools such as terraform import and terraformer sometimes helped, but it was likely a self-inflicted wound that could have been avoided.
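
As a small example of what that catch-up looked like, assuming a Terraform version recent enough to support import blocks (older versions would run the terraform import command instead; the bucket name here is invented):

```hcl
# Bring a manually created bucket under Terraform management.
# On older Terraform versions this would instead be:
#   terraform import aws_s3_bucket.reports example-team-reports
import {
  to = aws_s3_bucket.reports
  id = "example-team-reports"
}

resource "aws_s3_bucket" "reports" {
  bucket = "example-team-reports"
  # Any settings chosen manually in the console still have to be
  # reverse-engineered into code, which is where the delays came from.
}
```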

Lessons learned: Be aware of good practices around IaC design, and don't repeat costly mistakes. If you are forced to allow manual provisioning of infrastructure, do it alongside a member of the PE team to share context, or take very good notes. We could also have convinced the application teams that we would all move more quickly if members of their teams helped the PE team with IaC as well.

Conclusion

Platform engineering teams can act as accelerants, but can also be blockers if not set up right. Bringing product thinking, the right capabilities and an attitude of collaboration with their customers (i.e. the application teams) into the team is essential to a successful platform engineering effort.

At the end of the day, there were many moving parts not mentioned here that led the team to make the decisions detailed above. A similar team in a similar set of circumstances would likely have made similar decisions. This is an attempt to learn from the challenges we faced, in the hope that next time we will try different approaches, or at least be aware of the potential issues with the decisions we make.