Performance Testing for CoreDNS
By Humberto Leal, Emma Genesen
Published 31st October 2023
Introduction
In cloud native environments, infrastructure is dynamic. This is especially true for Kubernetes: pods come and go, services scale up and down, and the underlying IP addresses change frequently. For components to communicate reliably, you need a system that can quickly and accurately provide the correct IP addresses for given service names. KubeDNS was originally the default DNS server in Kubernetes, and CoreDNS has since replaced it as the default, bringing greater flexibility to your DNS infrastructure. Both of these tools resolve service names to their current IP addresses and facilitate communication between different services.
At Northflank, we rely heavily on a custom version of CoreDNS to support all our internal and external routing requirements and security demands. As a piece of mission-critical infrastructure, it’s essential that our CoreDNS servers are always available, and able to serve requests within an acceptable time frame during both normal and peak hours.
These requirements have led our team to continuously monitor our DNS performance and conduct regular performance tests. Along the way, we’ve learned an array of best practices for evaluating DNS performance for Kubernetes CoreDNS deployments. Today, we’ll take a closer look at performance testing CoreDNS. Then, we’ll walk through how you can test your own CoreDNS setup and how to interpret the results.
What is CoreDNS?
CoreDNS acts as an authoritative DNS server for queries within a Kubernetes cluster. Queries to external domains (like a REST API or database connection string) are forwarded to external resolvers, and their responses are cached by CoreDNS. The authoritative behaviour is implemented by the dnsController structure in the kubernetes plugin, which watches for changes in the Services and Endpoints Kubernetes resources. The data returned from those resources is stored in an in-memory structure, which is then consulted whenever an in-cluster address is queried.
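For context, the default Corefile shipped with the standard Kubernetes deployment follows roughly this shape (shown here purely as an illustration; your configuration will differ, especially with custom plugins): the kubernetes plugin answers cluster.local queries authoritatively from that in-memory store, while forward and cache handle external domains.
.:53 {
    errors
    health
    # Authoritative answers for in-cluster names, backed by the watch-based in-memory store
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # Everything else is forwarded upstream, and the answers are cached
    forward . /etc/resolv.conf
    cache 30
    prometheus :9153
}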
CoreDNS has seen mass adoption, due in part to its performance and pluggability. CoreDNS is written in Go, and like most Go programs, it’s lightweight and efficient. CoreDNS is also multi-threaded, whereas KubeDNS is single-threaded, a distinction that matters when your DNS server is under load. Additionally, unlike many DNS servers, CoreDNS has a plugin architecture, which enables users to extend its functionality to support their specific use cases.
At Northflank, our CoreDNS service is highly customised with both community and internal plugins, which helps us deliver an exceptional Kubernetes experience to our users. For example, we rely on the firewall plugin to ensure isolation of DNS records across namespaces. We’ve also customised plugins to enable seamless resolution of workload domains, both internally and externally, while avoiding conflicts across clusters in different regions.
But a DNS server is only as good as its ability to handle requests. That’s why we rigorously performance test our CoreDNS setup, to ensure stability and reliability.
Performance Testing for CoreDNS
At Northflank, we measure our CoreDNS service performance via Prometheus metrics monitoring: response time, throughput, latency, and bandwidth. Our performance testing helps us determine whether enough resources are allocated to handle both normal and peak loads. It also helps us determine the point at which we need to scale a service vertically or horizontally.
We also run performance tests to identify any potential regressions or performance drawbacks due to configuration changes, version upgrades, or external and internal plugins. Performance testing comes in handy any time we release a new version, as performance is a high priority for our team and our users.
Today, we’ll be assessing our DNS server performance through QPS (queries per second), response time, success and error rates, and CPU resource usage.
Some common tools for benchmarking DNS server performance include:
DNSPerf – An open-source benchmarking tool for testing authoritative DNS servers created by Nominum.
Resperf – An open-source benchmarking tool specifically built for testing recursive DNS servers and caching capabilities in a lab environment.
Dnsdiag – A set of three DNS measurement tools, dnseval, dnsping, and dnstraceroute, useful for benchmarking and diagnosing DNS servers.
Queryperf – Mainly built for benchmarking performance for authoritative DNS servers.
We’ll focus on how to employ DNSPerf to assess the performance of our DNS servers. This is the solution that we use, and it has proven useful and simple to use for our performance testing objectives.
To reproduce a production scenario, we will use a combination of internal domains (svc.cluster.local) and external domains, including non-existent domains. The aim is to simulate production traffic so that the results are relevant.
The final results will include queries per second, average latency, max and min latency, and success/failure rates.
Testing Methodology
We’ll be testing within a controlled environment, using a mix of external, internal, and non-existent domains via the ‘A’ record DNS request type. The goal is to run different configurations of compute resources and evaluate the results reported by the DNSPerf tool. Then, we’ll look at how you can use these results to help improve the reliability of your CoreDNS servers during peak times and determine scalability for future growth.
Today’s tests will be executed on a Google Cloud GKE Kubernetes cluster. It’s recommended that the DNSPerf pod runs on a separate node from the CoreDNS pod, with both nodes using the same resource specification.
We created a series of service manifests to simulate real-world workloads running in the cluster. The domains for those services, along with some external and non-existent domains, will be used as part of the DNSPerf test file. We’re also going to rely on Prometheus to monitor CoreDNS CPU usage throughout the tests.
Experiment Set-up
1. Provisioning and configuring a GKE cluster
A GKE cluster will be used to conduct the performance test for the CoreDNS deployment. The default node pool in the cluster uses one node of the n2-standard-4 machine type (4 vCPU, 16 GiB). Two extra node pools are created, coredns-pool and dnsperf-pool, each with the corresponding coredns-test: coredns and coredns-test: dnsperf node labels to ensure pods are scheduled to the relevant worker nodes. Each new node pool consists of a single node, in this case using the n2-standard-8 machine type (8 vCPU, 32 GiB RAM). All of the nodes are deployed in the same zone, us-central1-a, with a 100GB boot disk of the SSD persistent disk type. The resulting set-up in the GCP console is the following:
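If you prefer the gcloud CLI over the console, creating one of the extra node pools looks roughly like this; the cluster name is a placeholder, and dnsperf-pool is created the same way with its own name and label:
# Create the dedicated CoreDNS node pool (repeat for dnsperf-pool with its own label)
gcloud container node-pools create coredns-pool \
  --cluster=<cluster-name> --zone=us-central1-a \
  --machine-type=n2-standard-8 --num-nodes=1 \
  --disk-type=pd-ssd --disk-size=100GB \
  --node-labels=coredns-test=coredns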
The next step is to import the Kubernetes cluster credentials on the local machine and start creating the relevant workloads.
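With gcloud authenticated, fetching the cluster credentials is a one-liner (the cluster name is again a placeholder):
gcloud container clusters get-credentials <cluster-name> --zone=us-central1-a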
2. Deploying CoreDNS
CoreDNS will be deployed to the cluster with Helm, using this chart.
Assuming Helm is installed locally, run the following command to add the CoreDNS Helm repository.
helm repo add coredns https://coredns.github.io/helm
Then, install the CoreDNS release in the coredns-test namespace.
helm --namespace=coredns-test install coredns coredns/coredns --create-namespace -f dns-performance-tests/coredns-values.yaml
The CoreDNS release will be provisioned in the coredns-test namespace. The install will create a service to access the DNS server, which will need to be configured later in the DNSPerf tool.
The values.yaml used here ensures that the compute resources match the performance test constraints and that the CoreDNS pod is allocated to the correct node. The following configuration sets the compute resources, image tag, and node selector:
nodeSelector:
  coredns-test: coredns
resources:
  limits:
    cpu: 1
    memory: 512Mi
  requests:
    cpu: 1
    memory: 512Mi
image:
  tag: '1.11.1'
For this performance test run, we used version 1.11.1.
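To double-check that the nodeSelector and resource settings were picked up, the pod’s node placement can be verified with:
kubectl get pods -n coredns-test -o wide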
3. Generating DNSPerf input domains
To properly validate CoreDNS performance, a mix of external, internal, and non-existent domains is used to replicate the variety of responses that a DNS service would see in production. The domains queried are of the A type, and the expected response codes are NOERROR and NXDOMAIN.
Internal domains were created using a script that relies on the local default kubeconfig file to create the specified number of services. These follow the structure <svcName>.coredns-services.svc.cluster.local. For this experiment, we created 500 services in the coredns-services namespace. Another 500 random service domains were generated but not created on the cluster itself; these are the ones that should return NXDOMAIN from CoreDNS.
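As a rough sketch of what the script does (the actual script is linked below and may differ in its details; the service names here are placeholders), the in-cluster services could be created with a loop along these lines:
# Placeholder service names; the real script generates its own
kubectl create namespace coredns-services
for i in $(seq 1 500); do
  kubectl create service clusterip "test-svc-$i" --tcp=80:80 -n coredns-services
done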
After that, the script takes care of generating the input file, adding 100 hardcoded external domains. The resulting file contains 100 external domains, 500 internal domains, and 500 non-existent internal domains to use with the DNSPerf tool.
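DNSPerf expects one query per line in its input file, as a domain name followed by the record type, so the generated file looks roughly like this (the entries shown are illustrative):
example.com A
test-svc-1.coredns-services.svc.cluster.local A
test-svc-2.coredns-services.svc.cluster.local A
missing-svc-x7k2.coredns-services.svc.cluster.local A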
It’s worth noting that this deployment is scheduled to a dedicated node through the nodeSelector attribute, using the node label defined in the previous steps.
The script used to generate the domains is available here.
4. Deploying DNSPerf
The DNSPerf image used is guessi/dnsperf, and the applicable manifests are available here. The manifest will automatically create a deployment and config map for running DNSPerf; however, the config map contents will need to be updated with the output generated by the script used to set up the domains. The IP address environment variable should be assigned the value from the coredns-coredns service, which can be obtained through:
kubectl get svc coredns-coredns -n coredns-test -o jsonpath='{.spec.clusterIP}{"\n"}'
Other configuration settings, like MAX_QPS, should be set to a large value. In this case, 1000000 has been set to maximise the load placed on the DNS server. It’s expected that some queries will fail due to the CPU limits applied to the CoreDNS deployment.
MAX_TEST_SECONDS is set to 600. The goal is to run the performance test multiple times with a given set of compute resources assigned, letting each run last 10 minutes to gather relevant data.
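For reference, these settings correspond to standard dnsperf flags; the container effectively runs something along the lines of the command below, although the exact invocation inside the guessi/dnsperf image may differ, and the server address and file path are placeholders:
dnsperf -s <coredns-cluster-ip> -d /path/to/records.txt -l 600 -Q 1000000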
Once the above attributes have been updated, the manifest can be applied to the cluster with kubectl:
kubectl apply -f dns-performance-tests/dnsperf.yaml
This will automatically create the pod, which once running will begin the DNS performance test.
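The live output and final summary can be followed from the pod logs; assuming the deployment created by the manifest is named dnsperf, something like:
kubectl logs -f -n coredns-test deploy/dnsperf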
To verify the performance test is running, kubectl top pods can be used to quickly check the CPU usage of the CoreDNS pods:
kubectl top pods -n coredns-test | grep coredns
It’s expected that CPU usage will be near the CPU limit set during creation.
Testing Results
The aim of the experiment was to determine how CoreDNS performs under different resource configurations. We ran both the DNS performance tool and CoreDNS with 1, 2, and 4 vCPUs, repeating the test 10 times for each vCPU configuration. The table below details the throughput, average latency, and success rate of each test configuration.
Table 1: Results from DNSPerf
CPU/DNSPerf metrics | Queries sent | Queries completed | Queries lost | QPS | Avg latency (s) | Success rate (%) | Failure rate (%) |
---|---|---|---|---|---|---|---|
1 vCPU CoreDNS | 9,195,050 | 9,195,028 | 21.3 | 15,324.57 | 0.0064973 | 99.9997 | 0.0002 |
2 vCPU CoreDNS | 19,428,544 | 19,428,520 | 23.4 | 32,380.13 | 0.0030644 | 99.9998 | 0.0001 |
4 vCPU CoreDNS | 39,182,222 | 39,182,220 | 2.7 | 65,303.55 | 0.0015155 | 99.9999 | 0.0000 |
As expected, the higher the CPU limit, the more queries per second (QPS) CoreDNS is able to handle. More processing power allows requests to be processed faster, which in turn yields higher throughput (QPS).
Over all of the test runs, NOERROR responses made up 54.60% of requests, while NXDOMAIN came in at 45.40%. This is in line with the proportion of existing and non-existent domains used in the input file.
The average latency is within an acceptable range: roughly 6.5ms for 1vCPU, 3ms for 2vCPU, and 1.5ms for 4vCPU. Bear in mind that these measurements are for a system under very high load.
The failure rate of requests, which in this case refers to timeout responses detected by DNSPerf, is negatively correlated with the amount of CPU allocated, and the percentage was very low compared to the total number of requests. It’s worth noting that the default timeout used by DNSPerf is 5 seconds; after that, it reports the query as failed.
The following query was used to fetch the average CPU usage during the execution of the experiment:
sum(rate(container_cpu_usage_seconds_total{namespace="coredns-test",pod=~"coredns-coredns-.+",container="coredns"}[1m]))
The same query was run for each experiment configuration: 1, 2, and 4 vCPUs.
Figure: 1vCPU Usage trend
Usage for the 1vCPU configuration sits near the 1-core limit. There are some brief fluctuations throughout the test execution, at some points dropping close to 0.5 cores.
Figure: 2vCPU Usage trend
Similar to the 1vCPU setup, the 2vCPU configuration spends most of its time near the 2-core limit, with some downward spikes.
Figure: 4vCPU Usage trend
Finally, for 4vCPU, we can see a similar trend to the previous samples, with usage around 3.95 cores, close to the limit. Downward spikes can also be observed on occasion. These spikes can be attributed to DNSPerf finishing the test run after 10 minutes, DNS timeouts, or CPU throttling as usage approaches the limit.
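If CPU throttling is suspected, the cAdvisor counters exposed by the kubelet can confirm it; for example, a query along these lines gives the fraction of CFS periods in which the CoreDNS container was throttled:
# Fraction of scheduling periods in which the container was CPU-throttled
sum(rate(container_cpu_cfs_throttled_periods_total{namespace="coredns-test",pod=~"coredns-coredns-.+",container="coredns"}[1m]))
/
sum(rate(container_cpu_cfs_periods_total{namespace="coredns-test",pod=~"coredns-coredns-.+",container="coredns"}[1m]))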
For this experiment, we assigned an upper limit of 1,000,000 QPS via the MAX_QPS parameter. Notably, DNSPerf adapted itself to what CoreDNS was able to handle; occasionally, some queries were cancelled as a result of the high load.
However, depending on your SLOs for DNS resolution, these errors may not be acceptable. That’s why continuously monitoring QPS, latency and CPU usage metrics is essential to avoid these types of issues.
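As an illustration, with the prometheus plugin enabled (it is part of the default configuration), queries along these lines track overall query rate and tail latency; the metric names are those exported by recent CoreDNS versions:
# Queries per second served by CoreDNS
sum(rate(coredns_dns_requests_total[1m]))
# 99th percentile request duration
histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[1m])) by (le))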
Optimising DNS Server Performance Based on Test Results
Now that we have a solid understanding of how CoreDNS performs under production-like load, we can take things one step further and consider different strategies to improve the stability and reliability of our DNS servers.
There are a number of ways we can improve performance, reliability and stability, including:
Provisioning compute resources: According to the performance data we gathered, we can now determine suitable compute resources to match the predicted load in both normal and peak cases.
High availability: This is best achieved by spreading the DNS infrastructure across a pool of nodes. Kubernetes provides primitives, such as affinity settings, that allow pods to be scheduled based on the placement of existing pods and node metadata. With this configuration, planned or unplanned maintenance should have a limited impact on the DNS infrastructure.
Vertical scalability considerations: This refers to increasing the compute resources of the CoreDNS deployment. We’ve run tests with different configurations and know the impact and advantages of more CPU. However, as explained in this issue, there’s a point at which adding more CPU cores won’t yield higher throughput for the same CoreDNS server block. In our case, 4 vCPUs still came with higher throughput; tests with higher CPU resources are beyond the scope of this guide.
Horizontal scalability considerations: Kubernetes makes horizontal scaling straightforward with the Deployment and ReplicaSet model. When paired with vertical scaling, the replica count should take into account the available resources of the underlying infrastructure and the expected maximum load per DNS server. This can also be extended with horizontal pod autoscaling (HPA), either via the default Kubernetes HPA or a custom setup that monitors the various metrics CoreDNS exposes. In the end, the goal is a balance between adequate compute resources and a number of pods that can handle production traffic with acceptable latency, so that the provisioned resources are used efficiently.
DNS metrics: This covers both metrics exported by CoreDNS and compute metrics like CPU and RAM. These experiments can be extended to also consider CoreDNS’s own metrics and evaluate their behaviour under load, especially request duration and request and response types. That gives infrastructure teams a complete picture and a better understanding of how load influences DNS server behaviour.
Conclusions
In summary, there are a variety of reasons you might want to test your CoreDNS server, including:
Scalability: CoreDNS will likely be handling a significant amount of DNS queries, especially in large Kubernetes clusters with many services and pods. Performance testing helps ensure that as the demand increases, CoreDNS can handle the extra load without noticeable degradation in performance.
Configuration Optimisation: By performance testing CoreDNS, you can identify potential bottlenecks or suboptimal configurations in your setup. For instance, determining the appropriate caching settings can significantly influence performance.
Resilience Under Stress: It's not just about ensuring CoreDNS responds quickly; it's also about making sure it remains stable and doesn't crash under high load or during traffic spikes.
Benchmarking: If you're considering switching to CoreDNS from another DNS solution, or if you're evaluating different plugins or configurations, performance testing gives you empirical data to compare performance and make informed decisions.
Capacity Planning: If you anticipate growth in your infrastructure or application usage, performance testing can help you understand when you might need to scale or modify your CoreDNS setup.
Validating SLAs: If you have service level agreements (SLAs) around response times or uptime, performance testing can help validate that you are meeting these SLAs, even under high load scenarios.
Plugin Impact: CoreDNS's extensible nature means that you can add various plugins to modify its behaviour. Each plugin might have its own performance implications. By testing, you can measure the impact of specific plugins and decide whether their benefits are worth the potential performance trade-offs.
Resource Utilisation: Understanding how CoreDNS utilises underlying resources (CPU, memory, etc.) under different load conditions can guide decisions around infrastructure provisioning and optimisation.
Reliability: Performance tests, especially when combined with chaos engineering principles, can help identify points of failure in a system. By understanding these points, you can make the necessary changes to improve a system's reliability.
DNS monitoring and testing has been critical to the resilience of our platform, ensuring all our customers’ workloads maintain high availability and operate at peak performance 24/7. We hope this deep dive into CoreDNS performance testing has equipped you with the information you need to do your own testing, and to optimise your own service discovery setup.
If you’d like to discuss CoreDNS, performance of other cloud native technologies, or the Northflank platform, we’d love to connect. Send us an email at contact@northflank.com, or drop us a comment on Twitter or LinkedIn.
Thanks for reading, and we’ll see you next time!