Kubernetes/Logging
Revision as of 11:33, 10 January 2022
This page gives a high-level overview of how logging works in our services/main production kubernetes cluster. Other kubernetes installations/clusters might re-use this approach or do different things. While this describes the default and highly encouraged mode, services can always devise other ways of handling their logs if needed.
By control plane, in kubernetes terminology, we usually mean the following components
Of the above components, etcd runs on dedicated VMs on Ganeti and uses standard systemd-journal logging practices. kube-apiserver, kube-controller-manager and kube-scheduler run on the kubernetes master and follow standard systemd-journal logging practices. kubelet and kube-proxy run on every kubernetes node and follow standard systemd-journal logging practices. Those logs are not yet sent to logstash but are sent to centrallog.
Cluster components logging
Cluster components are workloads that the kubernetes cluster relies on for normal operations, but they aren't part of the Control Plane itself. In our case those run either as DaemonSets or Deployments in specific (privileged) kubernetes namespaces. Those, at the time of this writing (2021-09-29) are:
More will be added every now and then in order to accomplish various goals.
All of these components, as far as their logging goes, are treated as usual Workloads/Pods, so please refer to that section.
There are 5 different logging schemes described in the kubernetes Logging Architecture docs. Of those, we follow, for almost all intents and purposes, the Node logging agent pattern. In the past, we followed the Direct Logging pattern, so one might find some leftovers from that era (mostly old configurations).
The Node logging agent pattern is best described with the below generic diagram from upstream.
In our infrastructure:
- We don't use logrotate, but rather have docker rotate logs (with no old versions) at 100MB size
- Our logging agent doesn't run in a pod, but rather directly on the node. It's rsyslogd with specific configuration (more below)
- The logging backend is the production Logstash architecture (see Logstash#Production Logstash Architecture) used across the entire infrastructure. The main reason we adopted this approach is to reuse that infrastructure and not reinvent the wheel.
Kubernetes services all log to stdout/stderr. Kubernetes configures docker to log these using the json-file driver to disk. The end paths are under
Since those paths depend on the container runtime engine, the kubelet sets up another set of paths
which is a symlink to the above one, maintained by the kubelet.
It also sets up paths of the form
which are symlinks to the /var/log/pods hierarchy.
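As an illustration of what ends up in those files, the following is a minimal sketch of reading one line written by Docker's json-file driver, which stores each log line as a JSON object with `log`, `stream` and `time` keys. The function name `parse_json_file_line` and the sample line are hypothetical and not taken from our configuration.

```python
import json


def parse_json_file_line(raw: str) -> dict:
    """Parse one line written by Docker's json-file logging driver.

    Each line is a JSON object holding the original message ("log"),
    the stream it came from ("stream") and a timestamp ("time").
    """
    record = json.loads(raw)
    return {
        # The driver keeps the trailing newline of the original message.
        "message": record["log"].rstrip("\n"),
        "stream": record["stream"],
        "time": record["time"],
    }


# Hypothetical sample line, for illustration only.
sample = (
    '{"log":"hello from the service\\n",'
    '"stream":"stdout",'
    '"time":"2022-01-10T11:33:00.000000000Z"}'
)

entry = parse_json_file_line(sample)
print(entry["stream"], entry["message"])
```

In production this parsing is done by rsyslog, not by a script; the sketch is only meant to show the shape of the data.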
rsyslog parses the latter form (/var/log/containers/*.log) and, with the addition of the mmkubernetes plugin, which is able to talk to the Kubernetes API, enriches the docker container logs with metadata from the Kubernetes API. Some examples of metadata added to each log entry are:
- kubernetes_namespace labels
- pod name
- pod labels
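The enrichment step can be sketched roughly as follows. This is not the mmkubernetes implementation: it assumes the kubelet symlink naming convention `<pod_name>_<namespace>_<container_name>-<container_id>.log`, and the regular expression, the `enrich` helper, the output field names and the sample values are all hypothetical. The real plugin obtains labels from the Kubernetes API, whereas here they are passed in by hand.

```python
import re

# Assumed naming convention for the kubelet-managed symlinks:
#   /var/log/containers/<pod_name>_<namespace>_<container_name>-<container_id>.log
CONTAINER_LOG_RE = re.compile(
    r"^/var/log/containers/"
    r"(?P<pod_name>[^_]+)_"
    r"(?P<namespace>[^_]+)_"
    r"(?P<container_name>.+)-"
    r"(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_path(path):
    """Return the metadata encoded in the file name, or None if it does not match."""
    match = CONTAINER_LOG_RE.match(path)
    return match.groupdict() if match else None


def enrich(entry, path, pod_labels):
    """Attach metadata to a log entry (illustrative field names).

    pod_labels stands in for what would be fetched from the Kubernetes API.
    """
    meta = parse_container_log_path(path)
    if meta is None:
        return entry  # Leave entries from unrecognised paths untouched.
    enriched = dict(entry)
    enriched["kubernetes"] = {
        "namespace_name": meta["namespace"],
        "pod_name": meta["pod_name"],
        "container_name": meta["container_name"],
        "labels": pod_labels,
    }
    return enriched


# Hypothetical path and labels, for illustration only.
sample_path = (
    "/var/log/containers/example-5d9c7b6f4-abcde_example-ns_example-app-"
    + "a" * 64
    + ".log"
)
parsed = parse_container_log_path(sample_path)
result = enrich({"message": "hello"}, sample_path, {"app": "example"})
print(result["kubernetes"]["namespace_name"], result["kubernetes"]["pod_name"])
```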
The mmkubernetes plugin has suboptimal error handling in a couple of cases (see task T289766). We have already fixed an issue arising from simple connection errors to the Kubernetes API, but the plugin might still be terminated during a Kubernetes master restart, as that seems to introduce a small time window in which the Kubernetes API reports HTTP 403:
Nov 24 15:24:19 kubestage2002 rsyslogd: mmkubernetes: Forbidden: no access - check permissions to view url [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] [v8.1901.0]
If logs from Kubernetes nodes are not showing up in Elasticsearch, check the node's syslog for mmkubernetes messages and report on task T289766. A restart of rsyslog should be sufficient to clear the situation.
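A minimal sketch of such a check is shown below. It assumes the syslog lines have already been read into a list of strings and that mmkubernetes messages follow the `rsyslogd: mmkubernetes: ...` shape seen in the example above. The function name and regular expressions are hypothetical, not an existing tool.

```python
import re

# Matches the "rsyslogd: mmkubernetes: <message>" part of a syslog line.
MMK_RE = re.compile(r"rsyslogd: mmkubernetes: (?P<message>.*)$")
# Extracts a URL enclosed in square brackets, if the message contains one.
URL_RE = re.compile(r"\[(?P<url>https?://[^\]]+)\]")


def find_mmkubernetes_messages(lines):
    """Return the mmkubernetes messages found in an iterable of syslog lines."""
    hits = []
    for line in lines:
        match = MMK_RE.search(line)
        if match is None:
            continue
        message = match.group("message")
        url = URL_RE.search(message)
        hits.append(
            {
                "message": message,
                "url": url.group("url") if url else None,
                "forbidden": message.startswith("Forbidden"),
            }
        )
    return hits


sample_lines = [
    "Nov 24 15:24:18 kubestage2002 systemd[1]: Started Session 42 of user root.",
    "Nov 24 15:24:19 kubestage2002 rsyslogd: mmkubernetes: Forbidden: no access - "
    "check permissions to view url "
    "[https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] "
    "[v8.1901.0]",
]

hits = find_mmkubernetes_messages(sample_lines)
for hit in hits:
    print(hit["forbidden"], hit["url"])
```

The first sample line is made up to show that unrelated entries are skipped; the second is the message quoted above.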