LOTE #13: Dave Sudia on Kubernetes Local Dev, Building a PaaS, and Platform Personas
In the thirteenth episode of the Ambassador Livin’ on the Edge podcast, Dave Sudia, Senior DevOps Engineer at GoSpotCheck, discusses creating an effective local developer experience for Kubernetes, migrating away from Heroku and building a Kubernetes-based platform as a service (PaaS), and how his team developed an understanding of all of the personas involved with creating a platform.
Be sure to check out the additional episodes of the "Livin' on the Edge" podcast.
Key takeaways from the podcast included:
- GoSpotCheck initially ran their field management software applications on Heroku’s platform as a service (PaaS). This platform served the organisation well during the early stages of their business. As the organisation grew and the volume of users increased, the GoSpotCheck team migrated the underlying platform to Kubernetes for both scalability and cost reasons.
- The Heroku developer experience was very effective for engineers wanting to release code fast. The operational overhead of the PaaS was also minimal.
- Kubernetes is a good foundation on which to build a platform, but the GoSpotCheck team had to assemble various open source components in order to replicate the PaaS-like developer experience and continuous delivery capabilities they required.
- Treating the platform as a product is essential for success. It must be designed appropriately, with all users identified and their requirements understood. It should also be staffed accordingly, with product managers, user experience, and engineering teams.
- Defining personas for all of the platform stakeholders helped ensure that the platform was designed and built to be as useful, and as usable, as possible.
- Platform teams must realise they are not building a platform for themselves: their “customers” are other application developers and engineers within the organisation.
- After running a survey within the development teams, the number one issue raised related to the local development experience: "the way you could make my life more easier is to give me 64 cores on my laptop." The support teams were also unsure how they could support continuously delivered applications.
- The creation of the Kubernetes-based platform was divided into phases. The first phase explored was the local development experience. As components were identified, the team evaluated whether they could also be used in the next phase, which was building a continuous delivery pipeline.
- The development team is using Cloud Native Buildpacks for building their applications. Local development consists of initially building and testing services in isolation (using mocks, stubs, and contracts), and then integration testing within a remote cluster using Skaffold for local-to-remote build and deploy. The Ambassador Edge Stack Service Preview functionality is also of interest.
- Observability is vital for business and operational reasons. The GoSpotCheck team likes Charity Majors and the Honeycomb team’s model: collect everything and “slice and dice it every way you can” to find business-impacting problems.
- The GoSpotCheck CloudOps (platform) team made an early commitment to adopting open standards, particularly CNCF-backed standards such as Prometheus metrics (now OpenMetrics) and OpenTracing (now OpenTelemetry).
- Initially the CloudOps team ran internal services that supported these open standards, but migrated to commercial services as these became available and cost effective.
- When evaluating components in the cloud native ecosystem, "if you can wait six months then do". The landscape evolves rapidly.
- The engineering teams are investing more in progressive delivery, and are exploring the use of feature flags and canary releases. The ability to rapidly experiment in a safe manner can be a competitive advantage.
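To make the canary-release idea in the last point concrete, below is a minimal sketch of a weighted Ambassador Mapping. The service names and traffic split are hypothetical, and a separate Mapping with the same prefix is assumed to carry the remaining traffic to the stable version.

```yaml
# Hypothetical canary Mapping: routes ~10% of traffic for /orders/ to the v2 service,
# while an otherwise-identical Mapping continues to send the rest to the stable release.
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: orders-canary
spec:
  prefix: /orders/
  service: orders-v2
  weight: 10
```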
Subscribe: Apple Podcasts | Spotify | Google Podcasts | Overcast.fm | SoundCloud
This week's guest
David Sudia is a former educator turned developer turned DevOps Engineer. He's passionate about supporting other developers in doing their best work by making sure they have the right tools and environments. In his day-to-day he's responsible for managing Kubernetes clusters, deploying databases, writing utility apps, and generally being a Swiss Army knife. David has co-organized a Cloud Native Meetup and does DevOps workshops in the Denver area.
Full transcript
Daniel (00:03):
Hello, everyone. I'm Daniel Bryant. I'd like to welcome you to the Ambassador Livin' on the Edge Podcast, the show that focuses on all things related to cloud-native platforms, creating effective developer workflows, and building modern APIs. Today I'm joined by Dave Sudia, senior DevOps engineer at GoSpotCheck.
Daniel (00:18):
I've been following Dave's work for the past couple of years. I was lucky enough to get to see in person his presentation with colleague, Toni Rib, at KubeCon San Diego last year, where they focused on the GoSpotCheck move from Heroku onto Kubernetes. After the presentation, Dave dropped by the Datawire booth. My colleague Rafi and I had a great chat with him. He clearly understood both the theory and the practicality of building a platform for an organization that is constantly evolving. Rafi and I learned a bunch.
Daniel (00:42):
Today we're talking about how Dave and his team at GoSpotCheck have focused on enabling engineers to develop apps effectively in a local development environment, assembling a PaaS-like Kubernetes platform with open-source components, particularly CNCF components, and enabling self-service around the bigger picture of API contract testing, infrastructure provisioning and deployment. Hello, Dave, and welcome to the Livin' on the Edge Podcast. Thanks for joining us today.
Dave (01:03):
Thanks for having me.
Daniel (01:04):
Could you briefly introduce yourself for the listeners and share a recent career highlight as well, please?
Dave (01:09):
Yeah. I'm Dave Sudia. I'm a senior DevOps engineer at GoSpotCheck. We're a startup based in Denver, Colorado. I think the career highlight was being on Deserted Island DevOps conference that happened in Animal Crossing a couple months ago.
Daniel (01:24):
Awesome.
Dave (01:25):
I don't know that I'm ever going to top that. That was super fun. It was put together by Austin Parker, who's in dev relations at LightStep. We went to his island in Animal Crossing and he had this crazy open broadcast studio set up, where we were talking in Zoom and he overlaid our audio onto the capture from his Switch, so you could watch us presenting in Animal Crossing. If you haven't seen it, I'll send you a link. It was just the most fun, wacky, thing. It was great.
Daniel (01:53):
Fantastic. You and I chatted quite a bit in San Diego last year at KubeCon, where you presented with your colleague, Toni Rib, Balancing Power and Pain: Moving a Startup From a PaaS to Kubernetes. I learned a bunch from that talk. Fantastic talk. We then caught up on the InfoQ Podcast and went a bit deeper into those topics. I'm guessing today really is part three of our discussion, but I was thinking for today, it'd be good to set a context for listeners of the GoSpotCheck platform, but then dive a bit more into the evolution of the platform, because I think you famously said you were assembling components versus building a platform, they're very different things, which I agree with. I think you've got some really great insight there.
Daniel (02:28):
Just to set the context, was running the GoSpotCheck app on a PaaS a good experience in the early years?
Dave (02:36):
Yeah. To recap, we started on Heroku. We ran there for many years. We were one of their largest customers by the time we started moving off. Their product is great. There are many things that we still miss about their product that we're now trying to rebuild internally. That is where I think most people should start, whether it's App Engine or Beanstalk or Heroku or any of those other more PaaS-type offerings. I think the thing I said in the last one is when you start there, you don't need my team. That's where your money is going: you're saving the cost of having to have a bunch of people who really understand how operations work. But at a certain point, you're going to get to a point where you need someone who really understands how operations work. You need features that aren't being offered by those platforms. Then you got to move.
Dave (03:29):
We shifted to Google Kubernetes Engine as the primary place our apps landed, but what you lose is a lot of the sugar of a platform as a service. You lose the ease and the convenience. Last November, we had a hackathon. One of our lead developers, who runs a team, came up and said, "I don't really know exactly what I'm going to be doing for this hackathon, but it's going to be making this easier. This is all too hard." I heard that loud and clear. Making Kubernetes Easier was Bryan Liles' keynote at KubeCon last year. It’s not like it's going to get fixed in a hackathon.
Dave (04:13):
That's our push this year. We glued a bunch of tools together. How do we turn them into a more cohesive experience? It's never going to provide the convenience of git push Heroku master for a number of reasons, but how do we provide something that is just better, a little more seamless, a little easier to use?
Daniel (04:32):
Nice. Just take a step back. Initially the reason you looked at moving off Heroku was purely scale, I think, wasn't it?
Dave (04:39):
Yeah. It was scale and, because of scale, features. One of the big drivers was we used all their Postgres instances and we just got to a point where, the way that our queries were running, the performance characteristics of their databases just could not keep up with what we needed. We needed to shift our databases off. Then it was a cost perspective. When you're paying for my team, you shouldn't also be paying for Heroku. Then you're spending way too much.
Daniel (05:10):
Makes total sense. You mentioned assembling the parts there. The KubeCon talk, I'll link that in the show notes, because it was just a fantastic talk by you and Toni there. How did you go about choosing the various components that you mashed together on the V1 of what is becoming the platform?
Dave (05:27):
Sure. A lot of it was based on what existed. A great example is we ended up with Harness as our continuous deployment platform as opposed to Spinnaker, which is the open-source one that is part of the Continuous Delivery Foundation now. But a large part of that was we were already on Circle. At the time, Spinnaker only supported Travis. Spinnaker came out of Netflix and so it really supported the AWS stack. We were on Google. We also made an early commitment to go with open-source standards, so Prometheus metrics, initially Jaeger and OpenTracing, now OpenTelemetry. We ran a lot of those things internally at first, until more commercial support for them came out. Now we're slowly shifting to vendors that support those standards.
Dave (06:20):
But the nice thing about it is we haven't had to spend any engineering effort to suddenly shift to a different kind of metric, because we're using Prometheus metrics and now there are people who will accept Prometheus metrics and OpenTelemetry traces and that kind of thing. That was the large driver of it. I don't know. The thing I've been saying recently is infrastructure is an MVP right now. We very much hacked together a bunch of stuff, and that's what you do when you make the first round of your product.
Daniel (06:57):
Totally.
Dave (06:59):
That's the biggest thing for me. One of the best talks I went to at KubeCon last year was by Pinterest. I'll find it and send you the link so you can post it here. But a team at Pinterest basically built a wrapper around Spinnaker to make deploying easier. They had a product manager, they had a UX designer, they had back-end people, they had front-end people. It was a product team that built this internal continuous deployment platform. That's the difficulty we're facing this year. My company had 120 people until all of this happened around COVID and stuff. We had a 20% layoff. I don't have a product manager for this effort. I don't have a platform team for this effort. My team is two people and 0.1 time from a manager that used to be an individual contributor. We're trying to centralize our processes and build more standardized, opinionated ways of doing things in a decentralized way. It's an experiment. We're going to see how it goes.
Daniel (08:05):
Interesting. What would you say is the most important thing you focused on since we last spoke? Because you mentioned you were moving towards a proper platform now with the constraints you've mentioned, but what was the most important thing you tackled?
Dave (08:18):
The first approach we've had is to break it into phases of development. We started with local development. We figured anything we did from there would build on tools that we picked in the previous phase. We didn't want to start with the end in mind. We wanted to start with, "What would we use to do it locally? Okay, cool. Are those reusable for the next piece? If we pick a good security scanning tool or something for local development, can that be used in CI and then in CD?" That kind of thinking. Or a really great example is we've landed on Helm. I'll talk about the stack in a sec. We've landed on Helm as the way that we're going to package things, our deployments, and standardize our deployments, so then that informs, "Great. Now we can use Helm in the continuous deployment pipeline. We can swap it in for everything that we've had previously."
Dave (09:13):
We focused on local development first. We've come up with a stack that we think works pretty well for local development. The way we're doing local development is we're not doing it locally. We're going to be doing it in the cluster.
Daniel (09:27):
Interesting.
Dave (09:28):
When we started this, I did have a product manager resource for about two months. What he helped me do was personas. We interviewed a cross section of QA, back-end engineers, front-end engineers, mobile engineers, support, because one of the struggles we've had in the last year and a half as we've moved more into continuous delivery is I had our tier II support manager go, "How do we support continuous delivery? If you're going to be doing experiments that could break prod all the time, what does that do to us?"
Daniel (10:01):
It's a great question.
Dave (10:06):
I'd love an answer from anyone who has one. I put it out on Twitter a while ago and got a pretty good answer, but we're still figuring it out for ourselves. We've wanted to involve support more in being able to get to the same resources that engineers have. That's not quite all put together, because we're not quite at that phase yet. In doing the persona interviews, one of the number one things I heard back was, "The way you could make my life more easier is to give me 64 cores on my laptop." Because what we had these teams doing was they wrote all their Dockerfiles and then they were using Docker Compose locally to spin up a development environment. That works for the first couple services, but once you get to seven, plus their attendant databases and Kafkas and all the glue between the services, you just hit Docker Compose up and the fans come on on your laptop.
Dave (11:03):
In doing this, we've tried to reimagine what the development process looks like in a cloud native way. Let's not just take all our preconceptions around how you do local development, then I push my image and then it goes through the staging and prod deployment pipeline. I challenge people to question what if we just pushed to prod? Maybe we shouldn't, but why not? Let's question that, because there are people, there are companies right now that do do that. No one was quite ready for that sea change. But what we did get to was, "Let's not have local development. Let's give you a namespace." The Service Preview product from Datawire in Ambassador is looking really promising in this space as well. That's still something that we're considering down the line, but our stack right now, in trying to make things easier as well as a little more seamless, is we're using Cloud Native Buildpacks in place of Docker.
Daniel (12:05):
Nice.
Dave (12:06):
We're using Heroku. Irony of ironies. We're using Heroku's Cloud Native Buildpacks.
Daniel (12:11):
Nice. Which makes sense right?
Dave (12:13):
Yeah. Totally. Most of our Rails apps transitioned really easily. Then, B, and this is the key one, is we do a lot of Go and we have a lot of private modules. Theirs is the only public Buildpack I can find right now that has an immediately obvious way to pass in a credential to pull private Git repos as a Go module. That was actually a huge driver. A couple of my engineers complained, going, "These images, they're all Ubuntu. They're going to be huge." I was like, "Yeah, but they're all Ubuntu. You only have to pull that part once." Once you've got the Docker layer cached, we're all using the same base images, because we're all using Heroku. There's a reason Heroku does this. Heroku's not wasting bandwidth pulling Ubuntu over and over again. Then we get built-in security updates, because Heroku's managing the Buildpack’s stack.
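As a rough illustration of the workflow Dave describes, building an image with Heroku's Buildpacks via the pack CLI looks something like the sketch below. The image name is a placeholder, the builder tag may differ, and the credential variable is an assumption based on the Heroku Go buildpack's documentation; verify the exact name and format for the buildpack version you use.

```bash
# Build an OCI image from source with Heroku's builder instead of a Dockerfile.
# The credential env var is an assumption; check your buildpack's docs for the exact variable.
pack build registry.example.com/orders-service:dev \
  --builder heroku/buildpacks:20 \
  --env GO_GIT_CRED__HTTPS__GITHUB__COM="${GITHUB_TOKEN}"
```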
Daniel (13:08):
That's nice, isn't it?
Dave (13:09):
Yeah, exactly. I trust them, mainly because they've got Ian Coldwater, who I know now pretty decently, in charge of their security, Kubernetes security. I'll pull Heroku's Buildpacks without really thinking twice about it. We're using Buildpacks in place of Docker. We're using Helm. The big paradigm shift we had with Helm, when we started this whole journey, we looked at Helm and we're like, "Okay, well we're going to have to write this Helm chart for every single app." When we went with Harness, Harness had a built-in Kubernetes concept. They had a V1 thing that was a little easier, you checked some boxes, that we immediately had to move past into their advanced mode, because we already had requirements beyond the box-checking.
Dave (13:56):
They came out with a V2 version of their Kubernetes support, which was basically just write your own YAMLs. We did that. Then we ended up writing our own YAMLs for every single app. One of the things that was too hard and took too long about all this is developers, any time they wanted to spin up a new app, had to make a Harness service. They had to go copy the five files we use for every Go service out of the last one, paste them into the new one, change the values. Coming back to the "infrastructure right now is an MVP" concept, that's fine. What we got out of that was we were able to look at the 17 Go apps we'd made and go, "These are all pretty much the same. It seems like we've landed on a consistent way of doing things." Or we found, "If they're not exactly the same, they share 95% of their DNA and we can get the best practices of all of these, so now we're writing a single gse Go Helm chart: this is the way we do Go apps." That single Helm chart can be used for any new Go app.
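The shared Go chart itself isn't public; as an illustration of the idea, a per-service values file consumed by such a chart might look like the following. All keys are hypothetical, not GoSpotCheck's actual schema.

```yaml
# values.yaml for a single Go service, consumed by a shared "Go service" chart
# that supplies the common Deployment/Service/ingress templates.
image:
  repository: registry.example.com/orders-service
  tag: "1.4.2"
replicaCount: 2
service:
  port: 8080
resources:
  requests:
    cpu: 100m
    memory: 128Mi
env:
  LOG_LEVEL: info
probes:
  path: /healthz
```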
Dave (15:04):
That's the way that we're simplifying the process. Instead of having to write a Dockerfile every time now, you run Pack. Instead of having to go and make all your deployment files, you use our standardized Helm chart. The thing I'm doing after this is I'm going to do a review on our first draft of the Rails Helm chart. That one's much more complex, because we have cron jobs and workers and all the stuff that doesn't come from a pretty simple Go web API. Then you have a Helm chart. The way we're facilitating local development right now is via Skaffold. Skaffold just ties Pack and Helm together. Skaffold, if you haven't heard of it, is this really cool tool. You give it a little configuration file basically saying, "Here's how I want you to build my image and here's how I want you to deploy my image," and that can be via a Dockerfile and kubectl with YAML files or it can be, in our case, with Pack and Helm. Then you run skaffold dev and it watches your code. Every time you save a file, it hot-loads your code into the cluster.
Daniel (16:11):
You code in locally, but the changes are happening in the cluster?
Dave (16:14):
Yes.
Daniel (16:18):
Nice.
Dave (16:18):
It's not quite as fast as literal local development, because it has to rebuild the image and redeploy it every time, but rebuilding the image, it's Docker, it caches really well. We're using Buildpacks, but under the hood, it's just Docker. You're building a seven megabyte layer every time and then that pushes up pretty instantly. Then it redeploys very quickly, because it's just pulling that one layer that's not cached. It's been working pretty smooth. I have a public code repo example that I'll send you that you can link in the show notes, that is just on my GitLab and it's https://gitlab.com/thedevelopnik/skaffold-example. You can go see it in action. But it's looking like it's going to work pretty smoothly for us. That's been the biggest piece.
Dave (17:10):
We're still gluing tools together, but we're gluing them together in a more intelligent way. The tools that are available now are just infinitely better and more powerful. Neither Pack nor Skaffold existed when we started. That's a big piece of it.
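A minimal skaffold.yaml tying Buildpacks and Helm together in the way Dave describes might look like the sketch below. The image name, chart path, and builder are placeholders, and field names vary slightly between Skaffold schema versions.

```yaml
apiVersion: skaffold/v2beta29
kind: Config
build:
  artifacts:
    - image: orders-service            # placeholder image name
      buildpacks:
        builder: heroku/buildpacks:20  # build with a Buildpacks builder instead of a Dockerfile
deploy:
  helm:
    releases:
      - name: orders-service
        chartPath: charts/go-service   # a shared Go-service chart
        artifactOverrides:
          image: orders-service        # inject the freshly built image into the release
# `skaffold dev` then watches the source tree, rebuilds the image on save,
# and redeploys the Helm release into the developer's namespace in the cluster.
```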
Daniel (17:30):
One thing I've learned from you actually, Dave, is that several times you said to me, "If you can wait six months for a tool in the cloud space, do." Because six months' time, all new, right?
Dave (17:38):
Yeah. That was my talk at the Animal Crossing conference.
Daniel (17:41):
Oh, really?
Dave (17:41):
Yeah. I just took that and expanded it to 30 minutes. Within that, my example was service mesh. Three or four years ago, I was going like, "Envoy is amazing, but you got to write your own implementations of the API servers." Now, "Linkerd install. I don't like Linkerd, I want to play with Istio. Great, use superglue." You just superglue, delete, Linkerd and install Istio. It's insane how much better the tooling and more powerful the tooling has become.
Dave (18:22):
The availability of the tooling is a big part of it, but also, now we can stop repeating ourselves. We have enough examples of how we want to do things that we can find those abstraction layers. One of the things I've been working on the last couple of days is this repo called Local Tool Installation and it's just a script that installs kubectl and Pack and Helm and Skaffold and NVM and RVM. This is not a mind-blowing thing. There are many, many places where that is the first thing someone wrote. We had stuff similar to that for Heroku, but when you completely change your entire stack over the course of two years, you just got to rewrite those things and you can't write them until you know what you want to do. That's what we're doing this year. We don't have a platform team. We're still pretty small. We're not writing a whole bunch of internal tooling to create a platform. It's still putting together open-source projects, but I think in a much more intelligent and seamless way.
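GoSpotCheck's actual installation script isn't public, but a minimal sketch of that kind of bootstrap, assuming Homebrew is available, could look like this:

```bash
#!/usr/bin/env bash
# Sketch of a "local tool installation" bootstrap (assumes Homebrew on macOS or Linux).
set -euo pipefail

brew install kubectl helm skaffold              # core Kubernetes tooling
brew install buildpacks/tap/pack                # pack CLI for Cloud Native Buildpacks
brew install nvm                                # Node version manager
curl -sSL https://get.rvm.io | bash -s stable   # Ruby version manager (RVM's documented installer)
```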
Daniel (19:21):
I like it. That's something I really picked up from you there. It's something I've definitely talked about in the Datawire team and I've talked about it with other folks as well, this notion of understanding the personas. I think that's really important, because you've highlighted there, developers' requirements are different than Ops, than support and so forth. But you really have to treat what you're building as a product, yeah?
Dave (19:39):
Yes. 100%. We've had a platform team internally before and they built some amazing stuff that nobody used.
Daniel (19:48):
Interesting.
Dave (19:48):
Because it was built for them. The people who end up on platform teams are the people who are super into Vim and thoroughly understand their own RC files. I'm one of those people, so please don't take it as I'm making fun of that. But I use Vim bindings and VS Code on my Linux box now, running a custom distro of Ubuntu. But I'm that person. I can't write tooling for me, because I'm not the person who just wants to ship. That was a big piece of that persona investigation we did: we don't have a platform team. A lot of those people still work for us, but they're distributed. They often got pushed out to be senior engineers on product teams. We even have senior engineers who lead teams who just want to ship it. The guy who came up in the hackathon and said, "I just want this to be easier" is one of those people. Brilliant engineer. Doesn't really want to get deep into the weeds of tooling. Totally fine.
Dave (20:55):
That leads to how we're trying to build this in a distributed way, which is first we had to get the cultural agreement of yes, we want an opinionated, centralized way of doing things, because for a long time teams really just wanted to go do things their own way. Then they came back and went, "Why is everything so hard and slow?" We're like, "Because everyone wants to do it their own way." Then you can't expect my team to just immediately know how to do it. But if you have those people, you have the people who want to get deep in the weeds of config, instead of doing it their own way or in their own repos, what we got the cultural agreement of is we're going to have a centralized way of doing things, "This is the way." Then if you are a config wonk, you go contribute to those repos. If you're someone who is not, you just use them. Both things are equally valid, but then we're leveraging the developers who are deploying this stuff to help and to build the tooling.
Dave (22:00):
Because that was another thing, I think, a common disconnect if you're not doing things in a product-oriented way, or if you're doing things the way we originally were, where it's like my team, we've changed our name from DevOps to CloudOps to make it clear we do everything but feature development. We're building on this tooling and we're doing it with a lot of communication with developers, but you end up with tooling that doesn't reflect the reality of how people work, because the people who write it aren't the people doing the work. That's the whole thing of DevOps.
Daniel (22:32):
Yes. Well said.
Dave (22:34):
The people building the tooling are the people who use it, because they're the people who know what it ought to be. That's been the biggest piece, leveraging the decentralized resources we have in this organization to build centralized tools. I'd say that all sounds great and idealistic and we got six people who are all in for the first round of this. They've all fulfilled their initial commitment. Now we're looking for the next five or six people. That is a much more uphill struggle.
Daniel (23:03):
Interesting.
Dave (23:03):
But we had a buy-in from the executive vice president in charge of engineering, so it's going to happen. But we're definitely still having struggles around enthusiasm.
Daniel (23:14):
No one likes change. Change is hard.
Dave (23:17):
Yeah. It's also there's no actual time allotted for this work, so people are doing it in their spare time. I'd say that given that in six months we have put together the body of work we need to do and we've gotten a phase done with some pretty solid tooling, I'm pretty happy. That, in the middle of an economic downturn and a pandemic-
Daniel (23:39):
Hard times.
Dave (23:40):
-and the craziest overall political and socioeconomic and cultural times that have occurred in my lifetime. There are a lot of distractions going on right now. I'm pretty happy with what we've built.
Daniel (23:54):
Sounds fantastic, Dave. I'm really impressed with the way that you focused on the local developer experience, because I chatted to Gene Kim a couple weeks ago, super privileged to chat to Gene. One thing that he said is developer productivity is so, so important, but anecdotally my experience of working with companies is the developer experience is often an afterthought. We build all these fantastic systems. We're doing all the best practice stuff. We're using cloud and so forth. But where the work actually gets done, like where the rubber meets the road, is often, "Yeah, we'll think about that later." But you've clearly invested early on in making sure people can write the code easily, test it easily and get it out to the customers.
Dave (24:33):
Yeah. That's the goal with this round two, because I think round one was very much that. We got all the pipelines set up and we got all the infrastructure set up, exactly what you're saying. We've had to come back and then someone stood up and said, "This is too much. This is too hard." I think the key there is we listened.
Daniel (24:51):
Agreed. Completely.
Dave (24:53):
Now that is the priority. How do we make this easy? The next two big chunks we're tackling are security, specifically the developer side of security or what we've determined to be the developer side, which is "Are your packages up to date?" We sort of handled that with Buildpacks, but we're still going to be scanning and doing active work there. But in figuring out where that divide is, we're feeling like containers land on my team. App dependencies land on the dev team.
Daniel (25:23):
Makes sense, because they are closer. When I do Java stuff, when I brought in the dependency, that was on me to make sure that that was an up-to-date, valid dependency.
Dave (25:34):
That doesn't have like five critical CVEs in it. We have tooling for that. It's not super easy to use or convenient. The next round is how do we make this super easy to use and convenient? Or as easy to use and convenient as possible, since there's always that trade-off. The other big chunk we're tackling is observability. We have six places to go right now for metrics and traces and logs and everything. But this comes back to developer productivity and developer experience. That's literally the name for this effort: Developer Experience. That's the name of the board.
Dave (26:08):
The goal is, you run a script. You get a project generated where the base code is there, and not only the base code, but also the Git hooks for running security and linting and test checks and stuff before you commit. Your dashboards for prod are created from the day you make your project. We use Sumo Logic right now for all our observability stuff, so when you run that thing, it hits an API and makes the dashboards you're going to need for your service, at least some generic ones around rate, errors, and duration or something. That's my end goal. Then you start developing against the cluster. Then you use the same Helm chart for stage and prod, et cetera. But it's to turn it into a truly smooth pipeline.
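As one example of what that generated scaffolding could include, a pre-commit hook along these lines would run the checks Dave mentions before every commit; the specific tools here are illustrative, not necessarily what GoSpotCheck uses.

```bash
#!/usr/bin/env bash
# Hypothetical .git/hooks/pre-commit dropped in by the project generator:
# lint, run unit tests, and scan dependencies before allowing a commit.
set -euo pipefail

golangci-lint run ./...   # static analysis / linting (for a Go service)
go test ./...             # unit tests
govulncheck ./...         # known-vulnerability scan of dependencies
```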
Daniel (27:02):
I like that a lot, Dave. Something I thought about, and we were briefly talking off mic about this, is how do you handle the interactions between microservices? Because obviously we aim for highly cohesive, loosely coupled systems, but things have clearly got to talk to each other for it to be a system. When I'm doing that local coding, are you running all of the other services in the cluster or are you stubbing out some APIs, that kind of thing?
Dave (27:29):
This was a collaborative effort. I think a key thing to say here is, everything I'm relating has been the result of multiple people talking about it. Again, getting away from "My team just decides things." We sent out a survey. In fact, I didn't send out a survey. The lead engineer from one of our app teams sent out a survey, because she owned that story around discovering this stuff. She owned a story around what are we currently doing and using and where we feel like the boundaries of our testing are. The goal is maybe real local development, like you're writing unit tests. Then at a certain level, you're writing contract tests before you start interacting with other services. If you're using gRPC, you're writing against those contracts. Off mic you mentioned Pact. I really love Pact. We had some consultants who came in and used Pact on one service and it didn't really take off, but I keep bringing it up every couple months, because I really deeply believe in contract-based API development.
Dave (28:31):
You write to those contracts. You do as much testing as you can without actually making any calls out to anything else. The way we're envisioning this right now is it depends on the team how this would work, because we have teams that are truly microservice-y small enough, where you really only have one person working on a given service at a time. In that case, it makes total sense to just be swapping the thing that's in Dev with your current version. We also have teams, like our professional services team, where we have six people all working on one thing at the same time. In that case, they all have namespaces. I'm not going to say this is easy. Right now we're literally still in the middle of trying to figure out exactly how much of our entire distributed monolith do we deploy at every namespace.
Daniel (29:20):
It's something I struggle with, too.
Dave (29:24):
It's a tricky problem. You end up cutting the line between every developer needs to keep their entire namespace up to date for every app at all times or a weird hybrid "These apps are common between all the ones that would get deployed to namespaces, but those ones are getting..." I'm not going to say that's easy or that we've solved it. I'll come back in a couple months. For the truly microservice-y oriented teams, you can pretty much just swap what you're working on in Dev. At that point, hopefully you would have pretty high confidence, because you've written your unit test, you've written your contract-level integration test, where you're writing against mocks or stubs. Then you would go into a live environment. That was how people were operating, except the live environment was initially on their laptop. That just ceases to make sense at a certain point. We don't have that fully nailed down, but that's where we're at.
Dave (30:24):
The other big struggle with gRPC in particular is, how do you make that a smooth experience? Because, let me tell you, for a technology that's supposed to make it so that everyone works off of a single set of canonical contracts, we already have a V1 and a V2 repo. We have one team that went off on experimenting with V2 before we went to V2. They have their own that they have to migrate over. Then another piece of it, for me, is we use Uber's prototool container to do all the compilation. We have a repo where people commit the contracts. Then its CI process compiles the gRPC code for all the various languages and commits that over to another repo.
Daniel (31:09):
Interesting.
Dave (31:10):
Now for Go, you just point to that repo and that's your module. But for Java, for Ruby, for Python, people are having to copy and paste that code over. There's no really great way of getting that code. npm is a little easier, because with npm you can also resolve a package to a Git repository. But even then, it's not great. We'd like it to be in an Artifactory or something similar, where people can pull it as a native npm package. I've been working on a tool for the last year. It's mostly internal, although I got permission to open-source it last year, so I'm in the process of moving it over to being an open-source repository. That's at github.com/gospotcheck/protofact. I've got Ruby done. My initial run with Java was actually doing it all with Scala, because we don't have any Java engineers, but we have some data platform-type people who all write in Scala, so one of them was like, "Yeah, just use sbt and here's this whole complex sbt thing."
Daniel (32:14):
sbt, simple build tool; anything but simple!
Dave (32:20):
I'm trying to convert that over to Maven, but I've worked on Java very little in my life and when I did it was with Gradle, so I'd love help with that. I've got a half-done PR for that. This is actually coming up as a real priority now, because we've started fully consolidating our gRPC stuff, so I have to work on that more in the next couple months. Hopefully in the next couple months that will be more of a fully fleshed-out thing. Similar to prototool, I want to have pre-built Docker containers that are out there, that are public, that people can just use. Then ideally we have equally versioned packages of every contract out there that people can just pull and be confident that their contracts are all the same version and can talk to each other correctly. The hardest part of that whole stack comes back to the build pipeline and DevOps-y tooling stuff. gRPC is great. It's ironically easier to use if you're a single person or a small team of like three people. At the scale that you actually need it for, it's really hard to use.
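Their pipeline drives this through Uber's prototool container in CI, but the underlying generation step it automates is roughly equivalent to a plain protoc invocation like the one below (Go output only; paths and proto names are placeholders).

```bash
# Generate Go gRPC client/server code from a canonical contracts repo.
# Requires protoc plus the protoc-gen-go and protoc-gen-go-grpc plugins on PATH.
protoc \
  --proto_path=proto \
  --go_out=gen/go --go_opt=paths=source_relative \
  --go-grpc_out=gen/go --go-grpc_opt=paths=source_relative \
  proto/orders/v1/orders.proto
```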
Daniel (33:24):
I guess the challenge I've had with a lot of these things is the releasing part of it, the synchronizing. You've got, say, V1 and V2. There are obviously good practices you can follow in terms of backwards compatibility, but sometimes you just need to make a breaking change. How do you do that in lockstep with all your services? I made a bunch of mistakes with shared entity models, for example, in my first microservice system, where suddenly when we changed the main model, we had to redeploy all the services. I was like, "I've highly coupled that accidentally."
Dave (33:54):
Feature flags? That's another thing that's on the list for this year. We were using LaunchDarkly for a while and never really used the full capability of that platform. I think it's a great platform for what it does. Maybe we even needed feature flags, but we weren't using them to the extent that we should have been or could've been. But I think we're rapidly getting to the point where that is now becoming an issue, especially because we have services talking to each other across responsible teams. We have three products, but there are points of interface between all of them. Then there's been a lot of talk recently from the product side about rapid experimentation. All I heard was, "Feature flags."
Daniel (34:39):
Canary launching helps out a lot. With Ambassador we see a lot of folks talking about canary launching, dark launching, parallel runs, this kind of thing.
Dave (34:48):
Progressive deployment is the term I've heard now.
Daniel (34:52):
James Governor and progressive delivery, we talk about a lot. Nice.
Dave (34:55):
I got that off the podcast that I listened to with you. That's right. I walked away from that, "Read James Governor, progressive delivery."
Dave (35:07):
That's a hard problem. I don't have a great answer.
Daniel (35:10):
It's good to hear you say that, too, Dave. That's five years ago I had to do that. I'm still struggling with it now.
Dave (35:17):
Because at some point we do have a bunch of V1 gRPC contracts out there and we're going to have to move to the V2. We have talked about it in depth, about how we would do that. It does involve a lot of cross-team coordination. What we feel like it looks like right now is that basically every team would have to do an immutable double implementation of the next set of services. Then probably late on a Saturday night, as magical a world of continuous delivery and no downtime and everything as we live in, there are still some things that I only do late on a Saturday night, even if I feel like they are going to seamlessly roll over, because maybe they won't. Probably late on a Saturday night, we flip a feature flag and everything starts routing through V2. That's how we've talked about doing that. I'll get back to you when we do it.
Daniel (36:10):
That's the topic for the next podcast. Super interesting, Dave. Wrapping up, I guess. This has been fantastic. You and I could talk for hours. I've learned so much stuff. With the time constraints we've got, what's next on the agenda for you? What's the most exciting thing you and the team are working on over the next, say, three months until we chat again?
Dave (36:27):
I'm excited about the observability work we're about to do, because we have logs and metrics and traces, but they're all separate from each other. They're hard to correlate and put together. It's not a very seamless experience. Along with it, there's going to be a cultural change of ownership of alerting and stuff right now. Going back to the monolith-on-Heroku days, my team is the only team in PagerDuty. There are Slack alerts and stuff, but it's not a pure tooling challenge, it's also a cultural challenge. Not even a cultural challenge because people resist it or don't want to do it. Most of the time, it's skill, not will. It's not knowing how or why. That's the next big chunk. How do we make it easy to get your observability? But then there's a lot of education, because rate, errors, and duration get you so far, but what we really want is people thinking critically about what needs to be monitored in their application.
Dave (37:28):
We have people to lean on. That's not just going to be me. I'm not the guru of this inside the company. We have a lot of skilled people who know the answer to that, but passing that knowledge on has not been a huge priority. It's going to be now. I want to know how deep this queue is. Or taking problems that occur and turning them into actionable metrics with alerts from there on out. That's the education that has to happen. Quite honestly, to not make it sound like I have my act together completely, we have that knowledge on my team, but we have not really had time to act on it. Partially because we are responsible for all of the glue. We run Jaeger. We run Ambassador. We run the entire Sumo Logic metric and log export stack. We run Linkerd. Right now, I don't have alerts telling me if Linkerd is down. That's a process my team has to go through over the next couple months. That's a fun challenge, because I didn't write those services. I have to really think critically about what alerts I need off of them, because I don't necessarily have inherent knowledge of all those things. I know how they work, but what would I want to know about each of them if things were going wrong?
Dave (38:51):
There's an additional layer of challenge there, which is if you know that things are going wrong from your Prometheus metrics, how do you know when Prometheus is down?
Daniel (39:00):
Yes. Who watches the watchers?
Dave (39:02):
Exactly. That's led to some fun stuff, like setting alerts in Stackdriver for when Prometheus is down. Cool, I know here that my pod is bad, the pod that tells me if pods are good or bad.
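One common answer to the "who watches the watchers" question, which is essentially what the Stackdriver-for-Prometheus alert Dave mentions amounts to, is an always-firing "watchdog" alert routed to an external system; if it ever stops arriving, the monitoring pipeline itself is broken. A minimal Prometheus rule sketch:

```yaml
# Prometheus alerting rule that fires constantly by design.
# Route it to an external receiver; its absence means Prometheus or
# the alerting pipeline is down.
groups:
  - name: meta-monitoring
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing heartbeat; page externally if it stops arriving."
```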
Daniel (39:17):
It's a hard problem, isn't it? I literally just published a podcast with Sam Newman. He made an interesting observation, something I've bumped into, but he crystallized it really nicely in that often when we start, we think about what's going wrong, but as you scale, you actually have to look at what's going right. When it's not right any longer, that's the cue for the action. He talked a lot about semantic monitoring. Can you actually do the business actions? If you can't do the business actions, then it's a cue to actually investigate where in the platform is stuff broken.
Dave (39:50):
If I was starting fresh with something today: I've been sold by Charity Majors from Honeycomb on the philosophy that you literally collect everything and then you slice and dice it every way you can, so that you can find the 0.001%. The quote from her that I love is "It doesn't matter if you have four 9's if the last 9 is your biggest customer." Or the 0.0001 or whatever. The systems now are so complex that it's very difficult to predict what could go wrong. But it's dangerous to not have anything telling you.
Daniel (40:28):
I agree, yes.
Dave (40:31):
We'll be finding that balance this year. But that's what I'm most excited about coming next.
Daniel (40:36):
Thanks, Dave. I see the next topic for the podcast is going to be progressive delivery and observability, getting that feedback going there. I'm looking forward already to chatting to you about that and seeing now what your team have-
Dave (40:45):
Real quickly, the other thing about that is that's when we bring support in, too, because that's what support cares about the most. There'll be a nice cultural piece of extending our consideration beyond the engineering team.
Daniel (40:57):
The people are the hardest part. The tech is sometimes easy, not always, but the people are the hardest part.
Dave (41:03):
Support wants to be involved, we just haven't been thinking about them as a persona when we build things. We are this year.
Daniel (41:10):
Fascinating topic. I look forward to hearing your thoughts on that one. Thanks for your time today, Dave. Really appreciate it.
Dave (41:15):
Thank you for having me. Always love chatting with you.