Reporting Errors from Control Plane to Applications Using Kubernetes Events

Thursday, January 25, 2018

Reporting Errors from Control Plane to Applications Using Kubernetes Events

At Box, we manage several large scale Kubernetes clusters that serve as an internal platform as a service (PaaS) for hundreds of deployed microservices. The majority of those microservices are applications that power box.com for over 80,000 customers. The PaaS team also deploys several services affiliated with the platform infrastructure as the control plane.

One use case of Box’s control plane is public key infrastructure (PKI) processing. In our infrastructure, applications needing a new SSL certificate also need to trigger some processing in the control plane. The majority of our applications are not allowed to generate new SSL certificates due to security reasons. The control plane has a different security boundary and network access, and is therefore allowed to generate certificates.

| | | Figure1: Block Diagram of the PKI flow |

If an application needs a new certificate, the application owner explicitly adds a Custom Resource Definition (CRD) to the application’s Kubernetes config [1]. This CRD specifies parameters for the SSL certificate: name, common name, and others. A microservice in the control plane watches CRDs and triggers some processing for SSL certificate generation [2]. Once the certificate is ready, the same control plane service sends it to the API server in a Kubernetes Secret [3]. After that, the application containers access their certificates using Kubernetes Secret VolumeMounts [4]. You can see a working demo of this system in our example application on GitHub.

The rest of this post covers the error scenarios in this “triggered” processing in the control plane. In particular, we are especially concerned with user input errors. Because the SSL certificate parameters come from the application’s config file in a CRD format, what should happen if there is an error in that CRD specification? Even a typo results in a failure of the SSL certificate creation. The error information is available in the control plane even though the root cause is most probably inside the application’s config file. The application owner does not have access to the control plane’s state or logs.

Providing the right diagnosis to the application owner so she can fix the mistake becomes a serious productivity problem at scale. Box’s rapid migration to microservices results in several new deployments every week. Numerous first time users, who do not know every detail of the infrastructure, need to succeed in deploying their services and troubleshooting problems easily. As the owners of the infrastructure, we do not want to be the bottleneck while reading the errors from the control plane logs and passing them on to application owners. If something in an owner’s configuration causes an error somewhere else, owners need a fully empowering diagnosis. This error data must flow automatically without any human involvement.

After considerable thought and experimentation, we found that Kubernetes Events work great to automatically communicate these kind of errors. If the error information is placed in a pod’s event stream, it shows up in kubectl describe output. Even beginner users can execute kubectl describe pod and obtain an error diagnosis.

We experimented with a status web page for the control plane service as an alternative to Kubernetes Events. We determined that the status page could update every time after processing an SSL certificate, and that application owners could probe the status page and get the diagnosis from there. After experimenting with a status page initially, we have seen that this does not work as effectively as the Kubernetes Events solution. The status page becomes a new interface to learn for the application owner, a new web address to remember, and one more context switch to a distinct tool during troubleshooting efforts. On the other hand, Kubernetes Events show up cleanly at the kubectl describe output, which is easily recognized by the developers.

Here is a simplified example showing how we used Kubernetes Events for error reporting across distinct services. We have open sourced a sample golang application representative of the previously mentioned control plane service. It watches changes on CRDs and does input parameter checking. If an error is discovered, a Kubernetes Event is generated and the relevant pod’s event stream is updated.

The sample application executes this code to setup the Kubernetes Event generation:

// eventRecorder returns an EventRecorder type that can be  
// used to post Events to different object's lifecycles.  
func eventRecorder(  
   kubeClient \*kubernetes.Clientset) (record.EventRecorder, error) {  
   eventBroadcaster := record.NewBroadcaster()  
   eventBroadcaster.StartLogging(glog.Infof)  
   eventBroadcaster.StartRecordingToSink(  
      &typedcorev1.EventSinkImpl{  
         Interface: kubeClient.CoreV1().Events("")})  
   recorder := eventBroadcaster.NewRecorder(  
      scheme.Scheme,  
      v1.EventSource{Component: "controlplane"})  
   return recorder, nil  
}

After the one-time setup, the following code generates events affiliated with pods:

ref, err := reference.GetReference(scheme.Scheme, &pod)  
if err != nil {  
   glog.Fatalf("Could not get reference for pod %v: %v\n",  
      pod.Name, err)  
}  
recorder.Event(ref, v1.EventTypeWarning, "pki ServiceName error",  
   fmt.Sprintf("ServiceName: %s in pki: %s is not found in"+  
      " allowedNames: %s", pki.Spec.ServiceName, pki.Name,  
      allowedNames))

Further implementation details can be understood by running the sample application.

As mentioned previously, here is the relevant kubectl describe output for the application owner.

Events:  
  FirstSeen   LastSeen   Count   From         SubObjectPath   Type      Reason         Message  
  ---------   --------   -----   ----         -------------   --------   ------     
  ....  
  1d      1m      24   controlplane            Warning      pki ServiceName error   ServiceName: appp1 in pki: app1-pki is not found in allowedNames: [app1 app2]  
  ....

We have demonstrated a practical use case with Kubernetes Events. The automated feedback to programmers in the case of configuration errors has significantly improved our troubleshooting efforts. In the future, we plan to use Kubernetes Events in various other applications under similar use cases. The recently created sample-controller example also utilizes Kubernetes Events in a similar scenario. It is great to see there are more sample applications to guide the community. We are excited to continue exploring other use cases for Events and the rest of the Kubernetes API to make development easier for our engineers.

If you have a Kubernetes experience you’d like to share, submit your story. If you use Kubernetes in your organization and want to voice your experience more directly, consider joining the CNCF End User Community that Box and dozens of like-minded companies are part of.

Special thanks for Greg Lyons and Mohit Soni for their contributions.
Hakan Baba, Sr. Software Engineer, Box

Dynamic Ingress in Kubernetes Jun 7
4 Years of K8s Jun 6
Say Hello to Discuss Kubernetes May 30
Introducing kustomize; Template-free Configuration Customization for Kubernetes May 29
Kubernetes Containerd Integration Goes GA May 24
Getting to Know Kubevirt May 22
Gardener - The Kubernetes Botanist May 17
Docs are Migrating from Jekyll to Hugo May 5
Announcing Kubeflow 0.1 May 4
Current State of Policy in Kubernetes May 2
Developing on Kubernetes May 1
Zero-downtime Deployment in Kubernetes with Jenkins Apr 30
Kubernetes Community - Top of the Open Source Charts in 2017 Apr 25
Kubernetes Application Survey 2018 Results Apr 24
Local Persistent Volumes for Kubernetes Goes Beta Apr 13
Migrating the Kubernetes Blog Apr 11
Container Storage Interface (CSI) for Kubernetes Goes Beta Apr 10
Fixing the Subpath Volume Vulnerability in Kubernetes Apr 4
Kubernetes 1.10: Stabilizing Storage, Security, and Networking Mar 26
Principles of Container-based Application Design Mar 15
Expanding User Support with Office Hours Mar 14
How to Integrate RollingUpdate Strategy for TPR in Kubernetes Mar 13
Apache Spark 2.3 with Native Kubernetes Support Mar 6
Kubernetes: First Beta Version of Kubernetes 1.10 is Here Mar 2
Reporting Errors from Control Plane to Applications Using Kubernetes Events Jan 25
Core Workloads API GA Jan 15
Introducing client-go version 6 Jan 12
Extensible Admission is Beta Jan 11
Introducing Container Storage Interface (CSI) Alpha for Kubernetes Jan 10
Kubernetes v1.9 releases beta support for Windows Server Containers Jan 9
Five Days of Kubernetes 1.9 Jan 8

Creating a Raspberry Pi cluster running Kubernetes, the installation (Part 2) Dec 22
Managing Kubernetes Pods, Services and Replication Controllers with Puppet Dec 17
How Weave built a multi-deployment solution for Scope using Kubernetes Dec 12
Creating a Raspberry Pi cluster running Kubernetes, the shopping list (Part 1) Nov 25
Monitoring Kubernetes with Sysdig Nov 19
One million requests per second: Dependable and dynamic distributed systems at scale Nov 11
Kubernetes 1.1 Performance upgrades, improved tooling and a growing community Nov 9
Kubernetes as Foundation for Cloud Native PaaS Nov 3
Some things you didn’t know about kubectl Oct 28
Kubernetes Performance Measurements and Roadmap Sep 10
Using Kubernetes Namespaces to Manage Environments Aug 28
Weekly Kubernetes Community Hangout Notes - July 31 2015 Aug 4
The Growing Kubernetes Ecosystem Jul 24
Weekly Kubernetes Community Hangout Notes - July 17 2015 Jul 23
Strong, Simple SSL for Kubernetes Services Jul 14
Weekly Kubernetes Community Hangout Notes - July 10 2015 Jul 13
Announcing the First Kubernetes Enterprise Training Course Jul 8
Kubernetes 1.0 Launch Event at OSCON Jul 2
How did the Quake demo from DockerCon Work? Jul 2
The Distributed System ToolKit: Patterns for Composite Containers Jun 29
Slides: Cluster Management with Kubernetes, talk given at the University of Edinburgh Jun 26
Cluster Level Logging with Kubernetes Jun 11
Weekly Kubernetes Community Hangout Notes - May 22 2015 Jun 2
Kubernetes on OpenStack May 19
Weekly Kubernetes Community Hangout Notes - May 15 2015 May 18
Docker and Kubernetes and AppC May 18
Kubernetes Release: 0.17.0 May 15
Resource Usage Monitoring in Kubernetes May 12
Weekly Kubernetes Community Hangout Notes - May 1 2015 May 11
Kubernetes Release: 0.16.0 May 11
AppC Support for Kubernetes through RKT May 4
Weekly Kubernetes Community Hangout Notes - April 24 2015 Apr 30
Borg: The Predecessor to Kubernetes Apr 23
Kubernetes and the Mesosphere DCOS Apr 22
Weekly Kubernetes Community Hangout Notes - April 17 2015 Apr 17
Kubernetes Release: 0.15.0 Apr 16
Introducing Kubernetes API Version v1beta3 Apr 16
Weekly Kubernetes Community Hangout Notes - April 10 2015 Apr 11
Faster than a speeding Latte Apr 6
Weekly Kubernetes Community Hangout Notes - April 3 2015 Apr 4
Paricipate in a Kubernetes User Experience Study Mar 31
Weekly Kubernetes Community Hangout Notes - March 27 2015 Mar 28
Kubernetes Gathering Videos Mar 23
Welcome to the Kubernetes Blog! Mar 20

Kubernetes Blog

Thursday, January 25, 2018

Reporting Errors from Control Plane to Applications Using Kubernetes Events

« Prev

Next >>

2018

2017

2016

2015