March 13, 2024
How to start Platform Engineering practice to improve Developer Experience
Kapil Gupta
TL;DR: Platform engineering removes the complexity of managing cloud environments and assets for developers, allowing them to focus on building applications faster and with less friction. These applications and capabilities deliver business value and directly/indirectly contribute to business outcomes. Imagine an all-in-one, self-service developer toolkit that takes care of the grunt work, leaving developers free to code and innovate. That's the magic of platform engineering. This article talks about what Platform Engineering is, why it is important for better developer experience, how to prepare and start the practice, and finally what are some of the core tenets of Platform Engineering.
Two key terms appear in the title: "Platform Engineering" and "Developer Experience". But first, let's take a step back and ask why we should care about these terms at all.
Part A — What organizations care about most? It's Customers
For any organization of any size, this is the most fundamental question on the road to success. "Customer Centricity" and "Customer Obsession" are no longer new terms, and almost all successful organizations keep them at the nucleus of their business strategy.
Such focus not only brings in more revenue and better business outcomes but also builds a trusted and loyal customer base. A bad customer experience or engagement can directly hurt revenue, stock price, and Net Promoter Score (NPS) or, worse, customers may decide not to do business with the company.
Organizations want to engage their customers more by delivering new, meaningful features and capabilities that are easy to use and reliable. Measuring the quality of customer experience requires constant feedback. Here a customer can be a consumer or a partner. For example, a retail business's customers are shoppers, a gaming company's customers are gamers, and so on.
Part B — Engineering at the center of IT
[Now think of this from the IT point of view] What do the majority of your customers most care about?
- A top-tier, feature-rich mobile app OR the fact that you are running the most modern VM infrastructure?
- Features in your product catalog that are different from or more advanced than your competitors' OR the fact that your organization runs the most advanced Kubernetes clusters in its datacenters?
- A highly engaging customer experience OR the most advanced self-built machine learning platforms?
So for an organization the most valuable IT assets are the APIs, Apps, Websites, products, etc. that internal or external users use. Infrastructure, Security, IT Operations, Automation, etc. are the functions that help application engineering teams deliver these capabilities in secure and scalable ways.
Organizations have to react quickly to ever-changing customer behavior patterns and adjust their engagement accordingly. This is one of the key reasons why application delivery velocity matters.
Keeping customers at the center of the business makes the business successful; keeping "Developers" at the center of IT makes IT successful, because they are the ones delivering direct value to customers most of the time.
It becomes essential to make sure Developers do not face friction in their day-to-day tasks, especially when they follow well-established and known patterns.
What are some of the challenges Developers face while delivering business features?
- Growing complexity of IT environment: The IT operating environment is not getting any simpler. One example of this is the CNCF landscape (https://landscape.cncf.io/). There are 20+ sections in this eye chart, and picking at least one icon from each section means developers need to know all of them. Operating in a multi-cloud environment adds another level of complexity. This puts a lot of unnecessary cognitive load on developers without producing much business value. Whether you run your own K8s environment or a tight VM environment will probably mean nothing to the consumer. For example, as a consumer, do you know or even care which infrastructure google.com, your retail webpage, or your banking app runs on, or which security tools they use? Most probably not, as long as you are getting all the features in time.
- Dependency on other teams (approvals and others): It is common for Developers to leave the development flow because they depend on outside teams for approvals or work. For example, the Security team needs to review the code, the network team needs to provision a load balancer, the infrastructure team needs to provision a namespace/pod, DBAs need to create a table, and so on. This results in wait time or non-development work and is generally referred to as "development friction and delays". The more friction, the slower the delivery. With the DevOps movement, we tried to create several inner loops and ended up shifting a lot more responsibilities to developers. Developers who were supposed to deliver business features also started taking on operational and automation responsibilities. Again, this adds little to the business capabilities delivered.
- IT Processes: Organizations often try to fit years-old IT processes into the modern cloud-native operating environment. While these processes served the on-prem working environment very well, many of them slow down development in Cloud. More on this later.
Poor developer experience, friction, and unnecessary cognitive load may demotivate developers or, worse, push them to leave the organization. The reverse is true too: organizations that value developer experience and productivity tend to attract better developers, have a better chance of keeping them longer, and produce better products for the business.
This brings us to the first important question of this article:
What kind of developer experience do you want to give to developers in your organization and who should build that experience?
So let's summarize the section:
- The most valuable assets that IT delivers to the business are the capabilities and features that customers/consumers/partners can directly or indirectly use.
- These assets are built by the engineering teams, and any friction in feature delivery can result in direct/indirect business impacts.
- The landscape of modern IT is complex, and learning it end to end is a cognitive load for developers.
Now let's look at the other side of IT: the governance and operating model.
Part C — Centralized Cloud Governance and Operating Model
The term "Governance" is almost always considered a "bad" word in modern working environments and is often treated as a synonym for "control that stops you from using what you want to build your products and applications". In my opinion, this is not a fair assessment; there were reasons for setting up Governance this way. Here are some of the reasons I saw:
Simplified IT portfolio: Managing the technology lifecycle in an enterprise is a long and tough task. Not governing it introduces many challenges.
Cost Control: Bringing a technology into an on-prem environment can cost hundreds of thousands to millions of dollars in licensing. Long-term commitments are generally needed for cost efficiency. Once a technology was introduced, organizations did not want to invest further in a competing technology with similar capabilities.
Supportability: Stability is always top of mind when introducing a technology in a large organization. "Who is going to jump on a call at 3 AM on Sunday when things go wrong?" I heard this question hundreds of times when I was responsible for introducing new tech in a large Fortune 50 organization. It is a 100% fair question. Supportability was a key requirement, so many centralized specialist teams were built to support individual IT components, and developers depended on them to make any change in the environment.
Full stack ownership: When you own a datacenter you are fully responsible for the whole stack, from data center security to servers, platforms, network devices, and so on. You have to support all of it, which pushes organizations to pick technologies carefully after long evaluation processes. Such evaluations can take months before developers get a chance to try anything out.
Many such factors pushed organizations to adopt a heavy-handed centralized Governance and Operating Model framework that not only resulted in long change cycles and "Controls" but also forced developers to use certain languages, runtimes, databases, platforms, etc. Many times in my past life, I struggled to get even a simple architecture change approved because of long manual reviews.
Some of the fundamental promises of Cloud are agility, a "pay as you go" cost model, and the independence/flexibility for developers to use the tech they want. Any governance model that breaks these promises probably will not work for Cloud. Also, with cloud adoption and automation, organizations do not have to choose between agility/innovation and Governance (consistency, cost, and security); they can have both.
The biggest question for organizations adopting Cloud at scale is whether the Cloud Governance Model should be centralized or decentralized, and what impact that choice has on developer experience.
Part D — Challenges with centralized Governance and Provisioning model
Many times, organizations continue to use a centralized infrastructure team to build all Cloud components for the application development teams. This includes networks, projects/folders (Google Cloud), IAM permissions, GKE clusters, namespaces, GCS, PubSub, and many more. For example, when an application team needs a GCS bucket, they go to this centralized team, which provisions the bucket (with some Infrastructure as Code) for the application developers to use. Generally, these centralized teams have a limited number of professionals who learn and support the cloud initiatives.
Most of the time, the number of components in the cloud grows at a rapid pace, while the centralized "cloud team" gets very few new resources over time (if any). Now they have to support both the existing on-prem workloads and the cloud. The gap between the demand for new cloud components and the professionals available to support the environment keeps growing.
At a certain point, the organization has to choose: either slow down (or worse, stop) cloud adoption and application change agility, or rethink the operating model. Sometimes the cost of waiting is too high for the business.
This is generally a time when organizations pivot towards building a "Platform Engineering" team to support developer experience and application change agility.
So let's go into it…
Part E — Platform Engineering & Developer Experience
What do developers need?
- Experiment easily
- Deploy code/applications whenever needed without worrying too much about infrastructure, security details, or compliance
- Not have to wait for approvals for something that is done frequently/regularly
- Observe the behavior of the code they deployed across the environments
- SLOs/SLAs are met for the end consumer (people or code)
- A "hidden" security setup that prevents unintended behavior and operations and is enforced consistently across environments (guardrails)
- The right level of access is granted
- Budgets are set to avoid unintended costs, and the cost of the environment can be observed
- Enforced consistency to deploy and manage assets in the cloud
- Identify and correct drift from the intended desired state (this goes beyond what is done in Kubernetes)
- Audit the resources deployed in the Cloud environments
To make this happen you need "something" to hide the complexity of all the layers of the Cloud Cake so that Developers can focus only on the "Cloud Workloads" layer and easily build the products, applications, and APIs they need to deliver for the business.
That "something" is where "Platform Engineering" comes in. Its main focus is to make developers' lives easy and remove friction from the way of building things. The more friction it reduces, the better the adoption.
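One of the needs listed above, identifying drift from the intended desired state, can be illustrated with a small sketch. This is not how any particular tool does it; the resource names, fields, and the `find_drift` helper are all hypothetical, and a real platform would read the actual state from the cloud provider's APIs rather than from a hard-coded dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    resource: str
    field: str
    desired: object
    actual: object


def find_drift(desired: dict, actual: dict) -> list[Drift]:
    """Compare declared settings with observed settings and list the differences."""
    drifts = []
    for resource, want in sorted(desired.items()):
        have = actual.get(resource)
        if have is None:
            # Declared but not deployed.
            drifts.append(Drift(resource, "<resource>", "present", "missing"))
            continue
        for field, value in sorted(want.items()):
            if have.get(field) != value:
                drifts.append(Drift(resource, field, value, have.get(field)))
    # Deployed but never declared: also a form of drift.
    for resource in sorted(set(actual) - set(desired)):
        drifts.append(Drift(resource, "<resource>", "absent", "present"))
    return drifts


# Hypothetical example data; names and fields are illustrative only.
desired_state = {
    "bucket/orders-archive": {"public_access": False, "location": "us-central1"},
    "topic/order-events": {"retention_days": 7},
}
actual_state = {
    "bucket/orders-archive": {"public_access": True, "location": "us-central1"},
    "bucket/scratch": {"public_access": False},
}

for d in find_drift(desired_state, actual_state):
    print(f"{d.resource}: {d.field} should be {d.desired!r}, found {d.actual!r}")
```

With the sample data, the sketch reports three findings: a changed setting, a declared resource that is missing, and a deployed resource that was never declared. Correcting the drift (rather than only reporting it) is a separate design decision that depends on how much automation the organization is comfortable with.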
What is Platform Engineering
Platform engineering is the practice of designing, building, and maintaining the tools and workflows that empower software development teams. It essentially paves the way for self-service capabilities, allowing developers to focus on building applications without getting bogged down in the complexities of underlying infrastructure or processes (change management, approvals, compliance, etc.).
Here are some key aspects of platform engineering in cloud:
- Focus on developers: The core objective of Platform Engineering is to streamline the development and deployment workflows for internal developers by providing them with a standardized, secure, and scalable platform. This frees them from repetitive tasks and allows them to concentrate on core functionalities of the applications they're building.
- Internal Developer Platform (IDP): The IDP acts as a one-stop shop for developers, bringing together various tools and technologies in an integrated way. This unified platform improves efficiency by reducing the need for developers to switch between disparate tools and manage them independently. The IDP should enable developers to stay in the development workflow.
- Cloud-native principles: Platform engineering heavily borrows from the principles of cloud-native development, which emphasizes building and deploying applications specifically designed for the cloud environment. This ensures optimal utilization of cloud resources and facilitates scalability.
- Collaboration and automation: Platform engineers work closely with other teams like DevOps and SRE to establish and build efficient workflows. This fosters collaboration and eliminates bottlenecks in the development process.
Platform engineering in cloud plays a pivotal role in software delivery, enabling organizations to enhance developer productivity and shorten development lifecycles, which ultimately helps achieve business goals.
Part F — Pillars of Platform Engineering
Think of Cloud as a layered cake, like the OSI Model. Each layer of this cake serves a different purpose, has different complexity levels and needs different skill sets to work with. All these layers are glued together with the Governance and Operating model. Let's dig into these layers.
- "The Building" Cloud Foundation: This layer consists of the foundational components that make your cloud environment work. For example, Hybrid Connectivity, Identity Synchronization, VPC structure, Resource Hierarchy/Organization structure, Billing, Organization Policies, IAM, etc.
- "The Ingredients" Cloud Services: These are the services that a cloud service provider (CSP) such as Google Cloud offers to its customers. For example, Google Compute Engine, Google Kubernetes Engine, Cloud SQL, Cloud Run, etc.
- "The Recipes" Cloud Workloads: These are the applications/products that can be built in the cloud using one or more services. For example, a developer may create a REST API written in Go on top of GKE. Developers can use these recipes to build internal or customer-facing business applications.
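To make the relationship between "ingredients" and "recipes" concrete, here is a minimal sketch that models a recipe as a named combination of cloud services and checks it against the set of services the platform offers. The `Recipe` type, the `unknown_ingredients` helper, and the "Mainframe batch" entry are hypothetical and exist only for illustration; the foundation layer is left out to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass

# "The Ingredients": the example services named in the list above.
CLOUD_SERVICES = frozenset({
    "Google Compute Engine",
    "Google Kubernetes Engine",
    "Cloud SQL",
    "Cloud Run",
})


@dataclass(frozen=True)
class Recipe:
    """ "The Recipes": a workload composed of one or more cloud services."""
    name: str
    ingredients: tuple[str, ...]


def unknown_ingredients(recipe: Recipe,
                        services: frozenset[str] = CLOUD_SERVICES) -> list[str]:
    """Return the ingredients that are not part of the offered services."""
    return sorted(set(recipe.ingredients) - services)


# The example from the text: a REST API written in Go on top of GKE.
go_rest_api = Recipe("Go REST API", ("Google Kubernetes Engine",))

# A hypothetical recipe that relies on something the platform does not offer.
reporting_job = Recipe("Reporting job", ("Cloud SQL", "Mainframe batch"))

print(unknown_ingredients(go_rest_api))
print(unknown_ingredients(reporting_job))
```

The first recipe uses only offered services, so nothing is flagged; the second is flagged for its one unsupported ingredient. A check like this is one simple way a platform team could keep recipes aligned with the services it actually supports.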
Part G — How to start building a Platform to support developers
A general tendency is to start looking for a commercial or open source tool or hire a consulting company to build something. This may not be the right first step.
Step 1: Developer Survey:
It is critical to first understand the current state of developer experience in your organization and the expected future state, so that you see the true landscape. Start with a survey of the developers and operators in the organization. The survey should answer questions such as:
- What is the current state of developer experience in your organization?
- Where are the biggest pain points for Developers?
- What are the most common application patterns used?
- Which areas offer quick wins?
- What are the biggest frictions in adopting cloud?
- How much time do they spend waiting for infrastructure/security/networking to do their things before developers can build/deploy/test/scale the code and create related cloud assets (CloudSQL, PubSub, Pods in GKE, Cloud Run, DataFlow, BigQuery Datasets, etc.)?
- What kind of experience are developers looking for to build/deploy code and related cloud assets? For example, are they looking for a self-service portal (click-ops) to deploy assets in the cloud, do they prefer standardized code templates, or do they want an external team to do this for them?
- How many times do developers need to leave their workflows to deploy in the cloud (both code and assets)?
I highly recommend being very transparent in this survey. The questions should reflect the true current state and where developers would like to see improvements. If needed, keep the answers anonymous and record only persona information such as App Developer-L1, App Developer-Architect, Operations-L2, etc., whichever makes sense for your organization.
This survey should be limited to the developers who are hands-on-keyboard and the people who directly work with them, such as product owners, product managers, release engineers, testers, operators, etc. Avoid including people leaders unless they are directly involved in day-to-day development activities.
NOTE: Understanding the future state (along with the current state) is important for gauging the appetite for change and what it will take to get there. Is it easy, hard, or extremely hard?
The DORA Quick Check survey could be a good baseline for building your own survey that fits your organization's requirements and environment.
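Once anonymized responses come in, even a very small script can turn them into something leadership can read. The sketch below assumes a hypothetical response format of (persona, hours spent waiting on other teams per week); the sample figures are invented for illustration, and your own survey will likely capture different fields.

```python
from collections import defaultdict
from statistics import median

# Hypothetical anonymized responses: (persona, hours per week spent waiting
# on infrastructure/security/networking). Values are made up for illustration.
responses = [
    ("App Developer-L1", 6.0),
    ("App Developer-L1", 10.0),
    ("App Developer-Architect", 3.0),
    ("Operations-L2", 1.0),
    ("App Developer-L1", 8.0),
]


def median_wait_by_persona(rows):
    """Group responses by persona and return the median wait for each group."""
    grouped = defaultdict(list)
    for persona, hours in rows:
        grouped[persona].append(hours)
    return {persona: median(hours) for persona, hours in sorted(grouped.items())}


for persona, hours in median_wait_by_persona(responses).items():
    print(f"{persona}: median wait {hours} h/week")
```

The median is used here instead of the mean so that one unusually bad week reported by a single respondent does not dominate a small sample.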
Step 2: Strategic Alignment and Sponsorship
Once there is some clarity on the current state of the developers and how they want to adopt cloud, get leadership alignment and sponsorship to start the platform engineering practice. This sponsorship will help define priorities and pull together the team resources needed to build the practice. It will also identify the budgetary requirements and whether there is enough appetite to support the initiative. Organizations often try to hire new talent that already has platform engineering experience, which is great, but also look inward for existing people who are developer advocates and are enthusiastic about building new things. Often, a team restructure is required to bring the right people together. This step is necessary to avoid confusion and organizational misalignment down the line.
Step 3: Biggest bang for the buck
From the developer survey, identify the 80:20 rule in development, i.e. what are developers building most often that Platform Engineering can make easy? For example, the developer survey may reveal that deploying a microservice in Go, building/scheduling DataFlow jobs, creating a CloudSQL Postgres DB/table, and creating a BQ dataset and querying it account for 80% of developer workflows. Making these tasks easy might bring the most value for relatively little investment. This should define which "Workflows" and "Golden Paths" need to be built and exposed to developers through the IDP.
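The 80:20 analysis can be sketched in a few lines. The example below assumes the survey asked each respondent to name the tasks they perform most often; the task names and counts are invented for illustration, and the `tasks_covering` helper is hypothetical.

```python
from collections import Counter


def tasks_covering(task_mentions, share=0.8):
    """Return the most-mentioned tasks that together reach `share` of all mentions."""
    counts = Counter(task_mentions)
    total = sum(counts.values())
    selected, covered = [], 0
    for task, n in counts.most_common():
        if covered / total >= share:
            break
        selected.append(task)
        covered += n
    return selected


# Hypothetical survey data: one entry per time a task was mentioned.
mentions = (
    ["deploy Go microservice"] * 4
    + ["create CloudSQL Postgres DB"] * 3
    + ["schedule DataFlow job"] * 2
    + ["provision a VM"] * 1
)

for task in tasks_covering(mentions):
    print(task)
```

With this sample data, the first three tasks account for 9 of 10 mentions, so they are the candidates for the first Golden Paths, while the rarely mentioned task can wait.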
Step 4: Internal Developer Platform, Workflows and Golden Path
The capabilities of an IDP can be as wide as you like, but be practical in defining its scope. The IDP should focus on the "Application View", since it is developer focused. Cloud Consoles already give a good view of the infrastructure and assets deployed in Cloud, but developers generally do not focus on that view.
Here are some of the aspects you should look for in an IDP:
- Self-service provisioning: IDPs should empower developers with self-service capabilities, allowing them to request and provision cloud resources such as Google Cloud projects (with all allowed APIs enabled), PubSub, apps in Cloud Run, scheduled Dataflow jobs, BigQuery datasets, etc. without relying on external teams. This streamlines the process, eliminates waiting periods, and fosters developer autonomy. Self-service does not mean skipping necessary approvals; rather, those approvals should be part of the workflows that self-service triggers.
- Standardized workflows: IDPs can define and enforce standardized workflows for development, testing, and deployment. These workflows can incorporate best practices and security standards, automate repetitive tasks and approvals, and ensure consistency. This reduces errors, improves code quality, and simplifies collaboration. Do not try to integrate every organizational process into the IDP; first ask whether the process is really necessary in the Cloud. Many times on-prem processes are carried over to Cloud. Rethink and reevaluate them, as they can become bottlenecks to progress and to cloud adoption in general. Focus only on the workflows that are pushing developers away from developing and deploying in the cloud, and automate those.
- Approval workflows: While promoting self-service, IDPs should also implement approval workflows for critical tasks or resource requests exceeding predefined allowed limits. This ensures proper governance and maintains control while empowering developers within reasonable boundaries. For example, getting a new Google Cloud project should be a standardized workflow with approvals.
Workflow templates are also sometimes referred to as Golden Paths.
- Integration with version control and CI/CD pipelines: The IDP should integrate with the continuous integration and continuous delivery (CI/CD) platforms that you are using. Most modern platforms can integrate with an external IDP. This enables developers to trigger automated builds, tests, and deployments directly from the IDP and monitor their progress and history, accelerating the software delivery process. Integration with version control systems is critical to allow easy rollback to previous reliable versions of code or infrastructure configurations if needed. This provides a safety net and fosters experimentation without the fear of irreversible mistakes.
- Automated notifications and tracking: The IDP should provide notifications to keep developers informed about the status of their requests, approvals, and deployments.
- Portal: A developer portal is essential; it is the front end of the Developer Platform. The portal stitches together everything that developers interact with. For example, Google Cloud Console is the portal for Google Cloud Platform that developers and operators interact with. The portal plays a critical role in the success of the IDP.
- Other Capabilities: The IDP should also provide a view into application ownership, total spend, etc. Using Labels in the Google Cloud environment together with the Billing data can enable this in your IDP. The IDP should also support documentation that lives alongside the application.
Overall, IDPs act as central hubs for managing development workflows, offering a structured and efficient way for developers to build, test, and deploy applications. They empower developers with self-service capabilities while maintaining governance and fostering collaboration, ultimately leading to faster development cycles and increased productivity.
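The interplay between self-service and approval workflows described above can be sketched as a simple routing rule: requests that follow a known workflow and stay within predefined limits go straight through, and everything else is sent for review. The resource types, budget limits, and the `route` function below are hypothetical; a real IDP would hand the decision to its workflow engine rather than just return a value.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVED = "auto-approved"
    NEEDS_APPROVAL = "needs approval"


@dataclass(frozen=True)
class Request:
    requester: str
    resource_type: str
    monthly_budget_usd: int


# Hypothetical self-service workflows and the budget each may use without review.
SELF_SERVICE_LIMITS_USD = {
    "pubsub-topic": 200,
    "bigquery-dataset": 500,
    "cloud-run-service": 300,
}

# Critical requests that always go through an approval workflow,
# such as a new Google Cloud project.
ALWAYS_REVIEW = {"gcp-project"}


def route(request: Request) -> Decision:
    """Decide whether a self-service request can proceed without human review."""
    if request.resource_type in ALWAYS_REVIEW:
        return Decision.NEEDS_APPROVAL
    limit = SELF_SERVICE_LIMITS_USD.get(request.resource_type)
    if limit is None:
        # No standardized workflow exists yet, so a person should take a look.
        return Decision.NEEDS_APPROVAL
    if request.monthly_budget_usd > limit:
        return Decision.NEEDS_APPROVAL
    return Decision.AUTO_APPROVED


requests = [
    Request("dev-a", "pubsub-topic", 50),
    Request("dev-b", "bigquery-dataset", 900),
    Request("dev-c", "gcp-project", 100),
]
for r in requests:
    print(f"{r.requester} -> {r.resource_type}: {route(r).value}")
```

Keeping the limits in plain data makes them easy to review and adjust as the platform team learns which requests are routine and which genuinely need a second pair of eyes.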
Make it Easy
Adoption is inversely proportional to friction. The more friction Platform Engineering removes, the better the adoption, and vice versa. Platform Engineering owners in an organization should be the biggest developer advocates, with the primary goal of removing friction.
Conclusion
Developers and Engineering in your organization deliver business-facing value (internal or external). The cloud-native technology landscape is getting more complex, so developers must not only do regular development work but also keep learning things that do not directly contribute to business value. This adds to developers' cognitive load and can reduce productivity and outcomes. Platform Engineering is a relatively new practice with a lot of potential to help developer productivity and business outcomes, directly and indirectly. Its sole focus is enabling developers and hiding complex processes and practices. Platform Engineering has several components, and there are ways to start small.
Further reading on using KRM as the building block for platforms, the operating model, and Golden Paths:
- Light the way ahead: Platform Engineering, Golden Paths, and the power of self-service
- The Modernization Imperative: Shifting left is for suckers. Shift down instead
- Build a platform with KRM: Part 1 — What's in a platform?
- Build a platform with KRM: Part 2 — How the Kubernetes resource model works
- A framework to build Cloud Operating Model and Governance — Part I
- A framework to build Cloud Operating Model and Governance — Part II
Happy Platform Engineering