Prerequisites for TiDB in Kubernetes

This document introduces the hardware and software prerequisites for deploying a TiDB cluster in Kubernetes.

Software versions

Software Name    Version
Docker           Docker CE 18.09.6
Kubernetes       v1.12.5+
CentOS           7.6, with kernel 3.10.0-957 or later

Kernel parameter configuration

Configuration Item                     Value
net.core.somaxconn                     32768
vm.swappiness                          0
net.ipv4.tcp_syncookies                0
net.ipv4.ip_forward                    1
fs.file-max                            1000000
fs.inotify.max_user_watches            1048576
fs.inotify.max_user_instances          1024
net.ipv4.conf.all.rp_filter            1
net.ipv4.neigh.default.gc_thresh1      80000
net.ipv4.neigh.default.gc_thresh2      90000
net.ipv4.neigh.default.gc_thresh3      100000
net.bridge.bridge-nf-call-iptables     1
net.bridge.bridge-nf-call-arptables    1
net.bridge.bridge-nf-call-ip6tables    1
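These parameters can be applied with sysctl. A minimal sketch, assuming you place them in a drop-in file such as /etc/sysctl.d/k8s.conf (the filename is an example) and then load them with `sysctl --system`:

```
# /etc/sysctl.d/k8s.conf (example filename)
net.core.somaxconn = 32768
vm.swappiness = 0
net.ipv4.tcp_syncookies = 0
net.ipv4.ip_forward = 1
fs.file-max = 1000000
fs.inotify.max_user_watches = 1048576
fs.inotify.max_user_instances = 1024
net.ipv4.conf.all.rp_filter = 1
net.ipv4.neigh.default.gc_thresh1 = 80000
net.ipv4.neigh.default.gc_thresh2 = 90000
net.ipv4.neigh.default.gc_thresh3 = 100000
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
```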

If setting the net.bridge.bridge-nf-call-* parameters reports an error, check whether the br_netfilter module is loaded by running the following command:

lsmod | grep br_netfilter

If this module is not loaded, run the following command to load it:

modprobe br_netfilter
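To make this persist across reboots, you can also list the module in a modules-load.d drop-in file (the filename below is an example):

```
# /etc/modules-load.d/br_netfilter.conf (example filename)
br_netfilter
```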

You also need to disable swap on each deployed Kubernetes node by running:

swapoff -a

To check whether swap is disabled:

free -m

If the Swap row in the output is all zeros, swap is disabled.

In addition, to disable swap permanently, remove all swap-related entries from /etc/fstab.
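Rather than deleting the entries, it is common to comment them out. A sketch in shell (disable_swap_entries is a hypothetical helper, not a standard tool; back up /etc/fstab before writing the result back):

```shell
# Print the given fstab with every uncommented swap entry commented out.
disable_swap_entries() {
  sed 's/^\([^#].*[[:space:]]swap[[:space:]].*\)$/# \1/' "$1"
}
```

Run it against /etc/fstab, inspect the output, and only then write it back in place.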

After all the above configurations are made, check whether SMP IRQ Affinity is configured on the machine. This configuration assigns the interrupts of different devices to different CPUs, so that interrupts are not all sent to the same CPU; this avoids a potential performance bottleneck and takes advantage of multiple cores to increase cluster throughput. For a TiDB cluster, the rate at which the network card processes packets has a great impact on cluster throughput.

Follow these steps to check whether you have configured SMP IRQ Affinity on the machine:

  1. Execute the following command to check the interrupt of a network card:

    cat /proc/interrupts | grep <iface-name> | awk '{print $1,$NF}'

    In the output of the above command, the first column is the interrupt number and the second column is the device name. For a multi-queue network card, the command outputs multiple rows, with each queue corresponding to one interrupt.

  2. Execute either of the following commands to check which CPU an interrupt is assigned to:

    cat /proc/irq/<irq_num>/smp_affinity

    The above command outputs a hexadecimal mask of the CPUs the interrupt is bound to, which is not very intuitive to read. For the detailed calculation method, refer to SMP IRQ Affinity.

    cat /proc/irq/<irq_num>/smp_affinity_list

    The above command outputs the CPU numbers in decimal, which is more intuitive.
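The hexadecimal mask can also be decoded with a small helper. A sketch in POSIX shell (decode_affinity is a hypothetical name, not part of any system tool; it assumes the comma-separated 32-bit groups used by smp_affinity):

```shell
# Decode a hexadecimal smp_affinity mask (e.g. "8" or "00000001,00000000")
# into the list of CPU numbers whose bits are set.
decode_affinity() {
  # Strip the comma separators between 32-bit groups, then parse as hex.
  mask=$(printf '%d' "0x$(printf '%s' "$1" | tr -d ',')")
  cpu=0
  out=""
  while [ "$mask" -ne 0 ]; do
    if [ $((mask & 1)) -eq 1 ]; then out="$out $cpu"; fi
    mask=$((mask >> 1))
    cpu=$((cpu + 1))
  done
  printf '%s\n' "${out# }"
}
```

For example, `decode_affinity 20` prints `5`, meaning the interrupt is bound to CPU 5.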

If the interrupts of a network card are assigned to different CPUs, SMP IRQ Affinity is correctly configured on the machine and no further action is needed.

If all interrupts are sent to the same CPU, configure SMP IRQ Affinity by the following steps:

  • For the scenario of a multi-queue network card with multiple cores:

    • Method 1: Enable the irqbalance service. Use the following command to enable the service on CentOS 7:

      systemctl start irqbalance
    • Method 2: Disable irqbalance and customize the binding relationship between interrupts and CPUs. Refer to the set_irq_affinity.sh script for more details.

  • For the scenario of a single-queue network card with multiple cores:

    To configure SMP IRQ Affinity in this scenario, you can use RPS/RFS to simulate the Receive Side Scaling (RSS) feature of the network card at the software level.

    Do not use the irqbalance service as described in Method 1. Instead, use the script provided in Method 2 to configure RPS. For the configuration of RFS, refer to RFS configuration.
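The mask written to the rps_cpus sysfs file uses the same bitmask format as smp_affinity. A sketch that computes a mask covering the first N CPUs (rps_mask is a hypothetical helper; the sysfs path in the comment assumes an interface named eth0 with a single receive queue):

```shell
# Compute a CPU bitmask covering CPUs 0..N-1, in the format expected by
# /sys/class/net/<iface>/queues/rx-0/rps_cpus.
rps_mask() {
  printf '%x\n' $(( (1 << $1) - 1 ))
}

# Example (run as root on the target node):
#   echo "$(rps_mask 8)" > /sys/class/net/eth0/queues/rx-0/rps_cpus
```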

Hardware and deployment requirements

  • A 64-bit generic hardware server platform on the Intel x86-64 architecture with a 10 Gigabit NIC (network interface card), the same as the server requirements for deploying a TiDB cluster using binaries. For details, refer to Hardware recommendations.

  • The server's disk, memory and CPU choices depend on the capacity planning of the cluster and the deployment topology. It is recommended to deploy three master nodes, three etcd nodes, and several worker nodes to ensure high availability of the online Kubernetes cluster.

    Meanwhile, the master node often acts as a worker node (that is, load can also be scheduled to the master node) to make full use of resources. You can set reserved resources by kubelet to ensure that the system processes on the machine and the core processes of Kubernetes have sufficient resources to run under high workloads. This ensures the stability of the entire system.
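The kubelet resource reservations mentioned above can be expressed in the kubelet configuration. A minimal sketch of a KubeletConfiguration fragment; the reservation sizes here are example values only, not recommendations from this document:

```yaml
# Example values only; size the reservations for your own nodes.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
evictionHard:
  memory.available: "500Mi"
```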

The following text analyzes the deployment plan of three Kubernetes master nodes, three etcd nodes, and several worker nodes. To achieve a highly available deployment with multiple master nodes in Kubernetes, see the Kubernetes official documentation.

Kubernetes requirements for system resources

  • Each machine is required to have a relatively large SAS disk (at least 1 TB) to store the data directories of Docker and kubelet.

    Note:

    The data from Docker mainly includes images and container logs. The data from kubelet is mainly the data used by emptyDir volumes.

  • If you need to deploy a monitoring system for the Kubernetes cluster and store the monitoring data on disk, consider preparing a large SAS disk for Prometheus, and another for the log monitoring system. This also keeps the purchased machines homogeneous. For this reason, it is recommended to prepare two large SAS disks for each machine.

    Note:

    In a production environment, it is recommended to use RAID 5 for these disks. You can decide how many disks to use for RAID 5 as needed.

  • It is recommended that the number of etcd nodes be consistent with the number of Kubernetes master nodes, and that the etcd data be stored on SSD disks.

TiDB cluster's requirements for resources

The TiDB cluster consists of three components: PD, TiKV, and TiDB. The following recommendations on capacity planning are based on a standard TiDB cluster, namely three PD, three TiKV, and two TiDB instances:

  • PD component: 2C 4GB. PD occupies relatively few resources and only a small amount of local disk space.

    Note:

    For easier management, you can put the PDs of all clusters on the master nodes. For example, to support five TiDB clusters, you can deploy five PD instances on each of the three master nodes. These PD instances share the same SSD disk (200 GB to 300 GB), on which you can create five directories as mount points by means of bind mounts. For detailed operation, refer to the documentation.

    If more machines are added to support more TiDB clusters, you can continue to add PD instances in this way on the master nodes. If the resources on the masters are exhausted, you can add PDs on other worker nodes in the same way. This method facilitates the planning and management of PD instances; the downside is that, because the PD instances are concentrated, if two of these machines go down, all TiDB clusters become unavailable.

    Therefore, it is recommended to dedicate an SSD disk on every machine in the cluster to PD instances, as on the master nodes. If you need to increase the capacity of a cluster by adding machines, you only need to create PD instances on the newly added machines.

  • TiKV component: 8C 32GB and one NVMe disk for each TiKV instance. To deploy multiple TiKV instances on one machine, reserve enough buffer when planning capacity.

  • TiDB component: 8C 32GB. Because the TiDB component does not occupy disk space, you only need to consider CPU and memory when planning. The following example assumes each TiDB instance is 8C 32GB.
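The PD bind-mount layout described in the note above can be scripted. A sketch that generates fstab bind entries mapping N directories on a shared SSD to per-instance PD data directories (pd_bind_entries and all paths are illustrative examples, not from this document):

```shell
# Print /etc/fstab bind-mount lines for N PD data directories on one SSD
# (example paths; adjust to your environment).
pd_bind_entries() {
  i=1
  while [ "$i" -le "$1" ]; do
    printf '/data/ssd/pd%d /var/lib/pd%d none bind 0 0\n' "$i" "$i"
    i=$((i + 1))
  done
}
```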

A case of planning TiDB clusters

This is an example of deploying five clusters (each with 3 PD, 3 TiKV, and 2 TiDB instances), where PD is configured as 2C 4GB, TiDB as 8C 32GB, and TiKV as 8C 32GB. There are seven Kubernetes nodes: three are both master and worker nodes, and the other four are pure worker nodes. The distribution of components on each node is as follows:

  • Each master node:

    • 1 etcd (2C 4GB) + 2 PDs (2 * 2C 2 * 4GB) + 3 TiKVs (3 * 8C 3 * 32GB) + 1 TiDB (8C 32GB), totalling 38C 140GB
    • Two SSD disks, one for etcd and one for two PD instances
    • The RAID5-applied SAS disk used for Docker and kubelet
    • Three NVMe disks for TiKV instances
  • Each worker node:

    • 3 PDs (3 * 2C 3 * 4GB) + 2 TiKVs (2 * 8C 2 * 32GB) + 2 TiDBs (2 * 8C 2 * 32GB), totalling 38C 140GB
    • One SSD disk for three PD instances
    • The RAID5-applied SAS disk used for Docker and kubelet
    • Two NVMe disks for TiKV instances
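The per-node totals above can be sanity-checked with a little arithmetic. A sketch using the figures from this plan (2C 4GB per PD, 8C 32GB per TiKV and TiDB, 2C 4GB per etcd); node_total is an illustrative helper, not an existing tool:

```shell
# Sum CPU cores and memory for a node from its instance counts.
# args: <pd_count> <tikv_count> <tidb_count> <etcd_count>
node_total() {
  printf '%dC %dGB\n' \
    $(( $1 * 2 + $2 * 8 + $3 * 8 + $4 * 2 )) \
    $(( $1 * 4 + $2 * 32 + $3 * 32 + $4 * 4 ))
}
```

`node_total 2 3 1 1` (a master node) and `node_total 3 2 2 0` (a worker node) both print `38C 140GB`, matching the plan.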

From the above analysis, a total of seven physical machines are required to support five sets of TiDB clusters. Three of the machines are master and worker nodes, and the remaining four are worker nodes. The configuration requirements for the machines are as follows:

  • master and worker node: 48C 192GB, two SSD disks, one RAID5-applied SAS disk, three NVMe disks
  • worker node: 48C 192GB, one SSD disk, one RAID5-applied SAS disk, two NVMe disks

The above recommended configuration leaves plenty of available resources beyond those taken by the components. If you want to add monitoring and log components, plan and purchase machines of the appropriate configuration in the same way.

Note:

In a production environment, avoid deploying TiDB instances on a master node because of NIC bandwidth. If the master node's NIC works at full capacity, heartbeat reporting between worker nodes and the master node is affected, which might lead to serious problems.