Portal:Toolforge/Admin/Kubernetes/Certificates

Tracked in Phabricator: T292238

This page contains information on certificates (PKI, X.509, etc.) for the Toolforge Kubernetes cluster.

General considerations

[Figure: Toolforge K8s PKI design in simple form]

Kubernetes includes an internal CA, which is the main one we use for cluster operations.

By default, kubernetes-issued certificates are valid for 1 year. After that period, they need to be renewed.

The internal kubernetes CA, generated at deployment time by kubeadm, expires after 10 years. The current CA is good until Nov 3 14:13:50 2029 GMT.

Note that etcd servers don't use the kubernetes CA; they use the puppetmaster CA instead.

Most certs can be checked for expiration with sudo kubeadm certs check-expiration on a control plane node.
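
To confirm the CA expiry date directly on disk (for example, when the API server is unreachable), something like this should work (a sketch, assuming the standard kubeadm PKI path):

user@tools-k8s-control-3:~$ sudo openssl x509 -noout -enddate -in /etc/kubernetes/pki/ca.crt
notAfter=Nov  3 14:13:50 2029 GMT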

External API access

We have certain entities contacting the kubernetes API from outside the cluster. Authorization/authentication for this access is managed using a kubernetes ServiceAccount and an x509 certificate. The x509 certificate encodes the ServiceAccount name in its Subject field.

Some examples of this:

  • tools-prometheus uses this external API access to scrape metrics.
  • TODO: any other example?
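
To check which ServiceAccount a given certificate maps to, you can decode its Subject field with openssl. A sketch (the file name and the output shown are illustrative):

user@tools-k8s-control-3:~$ openssl x509 -noout -subject -in server-cert.pem
subject=O = toolforge, CN = prometheus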

Operations

Certificates for this use case can be generated using a custom script we have: wmcs-k8s-get-cert.

Usually, the generated cert is copied and pasted into the private puppet repo to be used as a secret in a puppet module or profile.

Renewing the certificate is just a matter of generating a new one and replacing the old one.

Warning: disable puppet fleet-wide first, to make the whole operation more atomic and ensure no puppet client sees the private repo without content.

Example workflow for replacing the tools-prometheus k8s certificate:

root@cloud-cumin-03:~# cumin "O{project:tools} AND O{name:tools-prometheus}" 'puppet agent --disable "T12345 refreshing certificates"'

user@tools-k8s-control-3:~$ sudo -i wmcs-k8s-get-cert prometheus
/tmp/tmp.9k9N7ksn6K/server-cert.pem
/tmp/tmp.9k9N7ksn6K/server-key.pem
user@tools-k8s-control-3:~$ sudo cat /tmp/tmp.9k9N7ksn6K/server-cert.pem
-----BEGIN CERTIFICATE-----
MIIDYTCCA[...]
-----END CERTIFICATE-----
user@tools-k8s-control-3:~$ sudo cat /tmp/tmp.9k9N7ksn6K/server-key.pem
-----BEGIN RSA PRIVATE KEY-----
MIIEpQIBA[...]
-----END RSA PRIVATE KEY-----

root@tools-puppetmaster-02:/var/lib/git/labs/private# vim modules/secret/secrets/ssl/toolforge-k8s-prometheus.key
# copy paste here the private key
root@tools-puppetmaster-02:/var/lib/git/labs/private# git commit -a
# Write the task you are working on in the commit and any details you find relevant
you are done!

user@laptop:~/git/wmf/operations/puppet$ vim files/ssl/toolforge-k8s-prometheus.crt
create a patch similar to https://gerrit.wikimedia.org/r/#/c/601692/

root@cloud-cumin-03:~# cumin "O{project:tools} AND O{name:tools-prometheus}" 'puppet agent --enable'
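
Once puppet has deployed the new files, you can verify that the API server accepts the certificate with curl. A sketch only; the file paths and the API endpoint below are assumptions, and -k skips server verification for a quick smoke test:

user@tools-prometheus-03:~$ curl -k \
    --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt \
    --key /etc/ssl/private/toolforge-k8s-prometheus.key \
    https://k8s.svc.tools.eqiad1.wikimedia.cloud:6443/version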

Internal API requests

Sometimes the Kubernetes API server makes requests to other services. For example:

  • custom webhooks
  • the internal metrics server (i.e., what kubectl top uses)

In general, Kubernetes requires those requests to be encrypted and verified via TLS certificates. Historically, we generated certificates for those using the Kubernetes internal CA, because that was possible and the easiest method. However, due to changes in the Kubernetes certificates API, that is no longer possible. The modern approach is to use cert-manager to generate those certificates.

cert-manager

You will need to generate a certificate for the service. Self-signed certificates will work fine. Something like this:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: metrics-server-api-tls
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: metrics-server-api-tls
spec:
  dnsNames:
    - "metrics-server.metrics.svc"
  secretName: metrics-server-api-tls
  revisionHistoryLimit: 1
  issuerRef:
    name: metrics-server-api-tls
    kind: Issuer
    group: cert-manager.io

This will generate a certificate for that DNS name and save it to a secret called metrics-server-api-tls. To see the status of the certificate, use kubectl describe certificate:

taavi@tools-sgebastion-11:~ $ kubectl describe certificate -n metrics metrics-server-api-tls
Name:         metrics-server-api-tls
Namespace:    metrics
Labels:       app=raw
              app.kubernetes.io/managed-by=Helm
              chart=raw-0.3.0
              heritage=Helm
              release=metrics-server-api-certs
Annotations:  meta.helm.sh/release-name: metrics-server-api-certs
              meta.helm.sh/release-namespace: metrics
API Version:  cert-manager.io/v1
Kind:         Certificate
Metadata:
  Creation Timestamp:  2023-02-17T11:15:56Z
  Generation:          1
  Managed Fields:
    API Version:  cert-manager.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:revision:
    Manager:      cert-manager-certificates-issuing
    Operation:    Update
    Time:         2023-02-17T11:15:56Z
    API Version:  cert-manager.io/v1
    Fields Type:  FieldsV1
    Manager:         helm
    Operation:       Update
    Time:            2023-02-17T11:15:56Z
  Resource Version:  1059036361
  UID:               2e6e087e-06d9-4da5-b8ed-6109dbc38d6e
Spec:
  Dns Names:
    metrics-server.metrics.svc
  Issuer Ref:
    Group:                 cert-manager.io
    Kind:                  Issuer
    Name:                  metrics-server-api-tls
  Revision History Limit:  1
  Secret Name:             metrics-server-api-tls
Status:
  Conditions:
    Last Transition Time:  2023-02-17T11:15:56Z
    Message:               Certificate is up to date and has not expired
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2023-05-18T11:15:56Z
  Not Before:              2023-02-17T11:15:56Z
  Renewal Time:            2023-04-18T11:15:56Z
  Revision:                1
Events:
  Type    Reason     Age   From                                       Message
  ----    ------     ----  ----                                       -------
  Normal  Issuing    17m   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal  Generated  17m   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "metrics-server-api-tls-hv7gg"
  Normal  Requested  17m   cert-manager-certificates-request-manager  Created new CertificateRequest resource "metrics-server-api-tls-cm8gn"
  Normal  Issuing    17m   cert-manager-certificates-issuing          The certificate has been successfully issued

On the API configuration (usually either the webhook or the APIService object), use cert-manager's CAInjector feature:

  annotations:
    # syntax: namespace/secret-name
    cert-manager.io/inject-ca-from: "metrics/metrics-server-api-tls"

Cert-manager will automatically renew the certificate when it has 1/3 of its lifetime remaining. If the service does not automatically reload the changed certificate, you can use stakater/reloader to restart the deployment when the certificates change.
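
If you go the reloader route, the deployment gets an annotation so its pods are restarted whenever the named secret changes. A minimal sketch, assuming the deployment consumes the metrics-server-api-tls secret from the example above:

  annotations:
    # stakater/reloader: roll the deployment when this secret's contents change
    secret.reloader.stakater.com/reload: "metrics-server-api-tls"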

Kubernetes CA certificates (legacy method)

Certificates for this use case can be generated using a custom script we have: wmcs-k8s-secret-for-cert.

After running the script, the secret should be ready to use.

Renewing the certificate is just generating a new one (running the script again and making sure the pod uses it).

If you want to make sure the old cert is no longer present, just delete it and run the script again. Example session for the metrics-server:

root@tools-k8s-control-3:~# kubectl delete secrets -n metrics metrics-server-certs
secret "metrics-server-certs" deleted
root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server
secret/metrics-server-certs created
root@tools-k8s-control-3:~# kubectl get secrets -n metrics metrics-server-certs -o yaml | grep cert.pem | head -1 | awk -F' ' '{print $2}' | base64 -d | openssl x509 -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            2f:65:a6:cf:2c:16:2f:39:6e:29:95:ee:35:01:b9:d7:75:a1:d2:50
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kubernetes
        Validity
            Not Before: Jun  2 11:31:00 2020 GMT
            Not After : Jun  2 11:31:00 2021 GMT
        Subject: CN = metrics-server
[..]
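
If the consuming pod does not pick up the regenerated secret on its own, restarting the deployment forces a remount (a sketch, assuming the metrics-server deployment from the example above):

root@tools-k8s-control-3:~# kubectl -n metrics rollout restart deployment metrics-server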

Node/kubelet certs

Kubelet has two certs:

  1. A client cert to communicate with the API server
  2. A serving certificate for the Kubelet API

At this time the serving certificate is a self-signed one managed by kubelet, which should not need manual rotation. Proper, CA-signed rotating certs are stabilizing as a feature set in Kubernetes 1.17, and we should probably switch to that for consistency and as a general improvement. The client cert of kubelet is signed by the cluster CA and expires in 1 year.
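
Both files live under the kubelet PKI directory on each node. A sketch for inspecting the client cert (standard kubeadm path; the hostname and output shown are illustrative):

user@tools-k8s-worker-30:~$ sudo openssl x509 -noout -subject -enddate \
    -in /var/lib/kubelet/pki/kubelet-client-current.pem
subject=O = system:nodes, CN = system:node:tools-k8s-worker-30
notAfter=Feb 16 15:59:00 2024 GMT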

Operations

All such client certs are rotated when upgrading Kubernetes, but they can be manually rotated with kubeadm as well. It is possible to configure the kubelet to request renewed certs on its own when they near expiration. So far, we have not set this flag in the config, expecting our upgrade cycle to be roughly 6 months.

To renew certificates run the wmcs.toolforge.k8s.kubeadm_certs_renew cookbook:

user@laptop:~$ cookbook wmcs.toolforge.k8s.kubeadm_certs_renew -h
usage: cookbooks.wmcs.toolforge.k8s.kubeadm_certs_renew [-h] [--project PROJECT] [--task-id TASK_ID] [--no-dologmsg] --control-hostname-list
                                                        CONTROL_HOSTNAME_LIST [CONTROL_HOSTNAME_LIST ...]

WMCS Toolforge Kubernetes - renew kubeadm certificates

Usage example:
    cookbook wmcs.toolforge.k8s.kubeadm_certs_renew \
        --project tools \
        --control-hostname-list tools-k8s-control1 tools-k8s-control2

See Also:
    https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal

options:
  -h, --help            show this help message and exit
  --project PROJECT     Relevant Cloud VPS openstack project (for operations, dologmsg, etc). If this cookbook is for hardware, this only affects dologmsg
                        calls. Default is 'toolsbeta'.
  --task-id TASK_ID     Id of the task related to this operation (ex. T123456). (default: None)
  --no-dologmsg         To disable dologmsg calls (no SAL messages on IRC). (default: False)
  --control-hostname-list CONTROL_HOSTNAME_LIST [CONTROL_HOSTNAME_LIST ...]
                        List of k8s control nodes to operate on (default: None)

You would usually run the cookbook for all control nodes in a given cluster. The cookbook is idempotent and can be run at any time safely.

Example usage:

user@cloud-cumin-03:~$ sudo cumin --force -x 'project:toolsbeta name:toolsbeta-test-k8s-control-*' "kubeadm certs check-expiration"
3 hosts will be targeted:
toolsbeta-test-k8s-control-[4-6].toolsbeta.eqiad1.wikimedia.cloud
[..]
===== NODE GROUP =====                                                                                                                                        
1 toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud                                                                                             
----- OUTPUT of 'kubeadm certs check-expiration' -----                                                                                                        
[check-expiration] Reading configuration from the cluster...                                                                                                  
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'                                          

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Jan 20, 2024 17:23 UTC   338d            ca                      no      
apiserver                  Jan 20, 2024 17:23 UTC   338d            ca                      no      
apiserver-kubelet-client   Jan 20, 2024 17:23 UTC   338d            ca                      no      
controller-manager.conf    Jan 20, 2024 17:23 UTC   338d            ca                      no      
front-proxy-client         Jan 20, 2024 17:23 UTC   338d            front-proxy-ca          no      
scheduler.conf             Jan 20, 2024 17:23 UTC   338d            ca                      no      

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Oct 20, 2029 09:54 UTC   6y              no      
front-proxy-ca          Oct 20, 2029 09:54 UTC   6y              no      
[..]
user@laptop:~$ cookbook wmcs.toolforge.k8s.kubeadm_certs_renew --control-hostname-list toolsbeta-test-k8s-control-4 toolsbeta-test-k8s-control-5 toolsbeta-test-k8s-control-6
START - Cookbook wmcs.toolforge.k8s.kubeadm_certs_renew
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: step 1 -- kubeadm certs renew all
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: step 2 -- restart control plane static pods
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: figured kubelet fileCheckFrequency to be 20s
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/.kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-apiserver-toolsbeta-test-k8s-control-4
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-controller-manager.yaml /etc/kubernetes/manifests/.kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-controller-manager.yaml /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-controller-manager-toolsbeta-test-k8s-control-4
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/manifests/.kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-scheduler-toolsbeta-test-k8s-control-4
[DOLOGMSG]: renewed kubeadm certs on toolsbeta-test-k8s-control-4
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: step 1 -- kubeadm certs renew all
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: step 2 -- restart control plane static pods
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: figured kubelet fileCheckFrequency to be 20s
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/.kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-apiserver-toolsbeta-test-k8s-control-5
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-controller-manager.yaml /etc/kubernetes/manifests/.kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-controller-manager.yaml /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-controller-manager-toolsbeta-test-k8s-control-5
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/manifests/.kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-scheduler-toolsbeta-test-k8s-control-5
[DOLOGMSG]: renewed kubeadm certs on toolsbeta-test-k8s-control-5
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: step 1 -- kubeadm certs renew all
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: step 2 -- restart control plane static pods
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: figured kubelet fileCheckFrequency to be 20s
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/.kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-apiserver.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-apiserver-toolsbeta-test-k8s-control-6
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-controller-manager.yaml /etc/kubernetes/manifests/.kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-controller-manager.yaml /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-controller-manager.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-controller-manager-toolsbeta-test-k8s-control-6
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/manifests/.kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: mv /etc/kubernetes/manifests/.kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: waiting 21 secs for kubelet to do filecheck for /etc/kubernetes/manifests/kube-scheduler.yaml
INFO: toolsbeta-test-k8s-control-6.toolsbeta.eqiad1.wikimedia.cloud: reset creationTimestamp: kubectl -n kube-system delete pod kube-scheduler-toolsbeta-test-k8s-control-6
[DOLOGMSG]: renewed kubeadm certs on toolsbeta-test-k8s-control-6
END (PASS) - Cookbook wmcs.toolforge.k8s.kubeadm_certs_renew (exit_code=0)
user@cloud-cumin-03:~$ sudo cumin --force -x 'project:toolsbeta name:toolsbeta-test-k8s-control-*' "kubeadm certs check-expiration"
3 hosts will be targeted:
toolsbeta-test-k8s-control-[4-6].toolsbeta.eqiad1.wikimedia.cloud
[..]
===== NODE GROUP =====                                                                                                                                        
1 toolsbeta-test-k8s-control-5.toolsbeta.eqiad1.wikimedia.cloud                                                                                             
----- OUTPUT of 'kubeadm certs check-expiration' -----                                                                                                          
[check-expiration] Reading configuration from the cluster...                                                                                                  
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'                                          

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Feb 16, 2024 15:59 UTC   364d            ca                      no      
apiserver                  Feb 16, 2024 15:59 UTC   364d            ca                      no      
apiserver-kubelet-client   Feb 16, 2024 15:59 UTC   364d            ca                      no      
controller-manager.conf    Feb 16, 2024 15:59 UTC   364d            ca                      no      
front-proxy-client         Feb 16, 2024 15:59 UTC   364d            front-proxy-ca          no      
scheduler.conf             Feb 16, 2024 15:59 UTC   364d            ca                      no      

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Oct 20, 2029 09:54 UTC   6y              no      
front-proxy-ca          Oct 20, 2029 09:54 UTC   6y              no
[..]

See also upstream docs: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal

Tool certs

These certs are automatically generated by the maintain-kubeusers mechanism. When a new tool is created in Striker, the LDAP change is picked up by a polling loop in the maintain-kubeusers deployment, and the service will:

  • Create the NFS folder for the tool if it isn't already there (it may already exist because of maintain-dbusers)
  • Create the necessary folders to set up the KUBECONFIG for the user.
  • Create a tool namespace along with all necessary privileges, restrictions and quotas
  • Generate a private key
  • Request and approve the CSR for the cert to authenticate the new tool with the Kubernetes cluster
  • Write out the cert to the appropriate files along with the KUBECONFIG
  • Create a configmap named maintain-kubeusers in the tool namespace that gives the expiration date of the cert to use for automatically regenerating the cert before it expires
    • Deleting this configmap will cause the cert to be regenerated on the next iteration. This is the safest way to regenerate the certs manually.
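
To see when a tool's cert is due for regeneration, you can inspect that configmap from within the tool account (a sketch):

tools.toolname:~$ kubectl get cm maintain-kubeusers -o yaml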

Each cert includes a CN, which functions as the user name in Kubernetes, and can include groups as well ("O:" or organization entries). Tool certs currently have the CN of their tool name and one O of "toolforge".
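
You can confirm the CN and O entries by decoding a tool's client certificate with openssl (the .toolskube path is an assumption; the output shown is illustrative):

tools.toolname:~$ openssl x509 -noout -subject -in ~/.toolskube/client.crt
subject=O = toolforge, CN = toolname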

This service runs in Kubernetes in a specialized namespace just for it using a hand-made Docker image, as is documented in the README of the repo. The toolsbeta version runs the maintain-kubeusers:beta tag instead of the :latest tag to facilitate staging and testing live without hurting Toolforge proper. Deploying new code only requires deleting the currently-running pod after refreshing the required image tag.

Operations

If someone needs to rotate the user certs for their tool for some reason, run:

user@bastion $ sudo become <tool-that-needs-help>
tools.toolname:~$ kubectl delete cm maintain-kubeusers

This will cause maintain-kubeusers to refresh their certs.

If the certs were already deleted, you will instead need to have a cluster admin run kubectl delete cm maintain-kubeusers --namespace tool-$toolname --as-group=system:masters --as=admin, since the tool won't be able to authenticate.

In case of a corrupt .kube/config file, the same trick applies, except that maintain-kubeusers will not read invalid YAML. Therefore, you will need to delete the tool's .kube/config and then, as a cluster admin, run kubectl delete cm maintain-kubeusers --namespace tool-$toolname --as-group=system:masters --as=admin. Maintain-kubeusers will regenerate the credentials soon after.

Etcd certs

All etcd servers use puppetmaster-issued certificates (puppet node certificates). The etcd service will only allow communication from clients presenting a certificate signed by the same CA. This means kubernetes components that contact etcd must use puppet node certificates.
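
A quick way to exercise this from an etcd host is an etcdctl health check presenting the puppet node certs. A sketch; the certificate file locations are assumptions and the output shown is illustrative:

root@tools-k8s-etcd-13:~# ETCDCTL_API=3 etcdctl \
    --endpoints https://$(hostname -f):2379 \
    --cacert /etc/etcd/ssl/ca.pem \
    --cert /etc/etcd/ssl/cert.pem \
    --key /etc/etcd/ssl/key.pem \
    endpoint health
https://tools-k8s-etcd-13.tools.eqiad1.wikimedia.cloud:2379 is healthy: successfully committed proposal: took = 11.2ms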

In the puppet profile controlling this, we have a mechanism to refresh the certificate and restart the etcd daemon if the puppet node certificate changes (e.g., if it is reissued).

See also

Some other interesting docs: