Kubernetes/Clusters

{{Note|content=This page makes use of RFC2119 terminology. See https://datatracker.ietf.org/doc/html/rfc2119}}{{Kubernetes nav}}


We have multiple Kubernetes clusters deployed in what we call the "production" realm. This page does not describe Kubernetes clusters maintained in other realms, e.g. Toolforge, which is maintained in the WMCS/Labs realm.


== wikikube (aka eqiad/codfw) ==
<code>eqiad</code> and <code>codfw</code> are the historical names of the two clusters collectively known as <code>wikikube</code>. They are owned by the Service Operations SRE team. These are our oldest Kubernetes clusters and have the historical benefit of using the DC names in short form. For the same historical reasons, infrastructure in multiple places still treats them as the primary Kubernetes clusters; this is being worked on.
 
=== Goal ===
'''The goal of these clusters is to serve production MediaWiki and related microservices traffic.'''
 
This also means they serve the bulk of our total traffic (>30k requests per second as of 2021-10-22).
 
Applications that are deployed in these clusters '''SHOULD''' fall into one of the following categories:
 
* MediaWiki itself (appservers, API servers, job runners, etc.)
* Services that MediaWiki relies on internally (EventBus, session store, etc.)
* Services that MediaWiki relies on publicly/client-side (Citoid, Maps, etc.)
* Services that provide an API that solely depends on MediaWiki (mobileapps, wikifeeds, etc.)
 
For reliability reasons (mostly to avoid interference with the workloads powering end-user traffic), applications that are not related to the stated goal '''MUST NOT be deployed on these clusters.'''
 
Various miscellaneous applications that don't clearly fit in the above categories, but also don't clearly fall outside the stated goal, '''MAY''' be examined on a case-by-case basis with Service Operations, which will advise whether an application/service can or cannot be deployed in these clusters. Note that there are some legacy applications which don't fit the newly defined scope of these clusters and are expected to be moved elsewhere.
 
Examples of applications that would be a bad fit for these clusters are:
 
* Monitoring: e.g. Grafana, Kibana, LibreNMS, AlertManager, Icinga, Puppetboard. This restriction also exists because monitoring should be functional even when these clusters are in an outage.
* Critical infrastructure pieces: e.g. Netbox
* Collaboration tools: e.g. Phabricator, GitLab, Gerrit, Etherpad, etc.
* Continuous Integration applications
* Machine Learning applications
* Analytics applications: e.g. Turnilo, Superset
* Stateful applications/datastores: MySQL/MariaDB/Postgres/Cassandra/Memcached/Redis/Varnish/ApacheTrafficServer are all bad use cases for these clusters. This has to do with how these applications are designed; running them here would, without significant investment, decrease their reliability and lead to more outages.
 
=== Datacenters ===
Services/applications '''MUST''' be deployed identically (barring some datacenter-specific configuration) in both main [[Clusters|Datacenters]] in an active/active fashion. The reasons for that are:
 
* Possibility for a failover of a datacenter in case of emergencies
* Ability to perform maintenance without the need for downtime windows
* Decreased latency for some groups of end users
 
The above capabilities are routinely verified using the [[Switch Datacenter]] procedure. Services that consistently fail the procedure will be asked to be undeployed and deployed elsewhere.
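
For illustration only, the following is a minimal sketch (not part of the Switch Datacenter tooling) of how one might check, from within the production network, which addresses a service's [[DNS/Discovery|DNS Discovery]] record currently resolves to. The service name is a placeholder and the snippet assumes the <code>dnspython</code> library is available.

<syntaxhighlight lang="python">
# Minimal sketch: resolve a DNS Discovery record to see where a service is
# currently being served from. "my-service" is a placeholder, and the script
# assumes it runs inside the production network, where .discovery.wmnet resolves.
import dns.resolver  # provided by the dnspython package


def discovery_addresses(service: str) -> list[str]:
    """Return the A records behind <service>.discovery.wmnet."""
    answer = dns.resolver.resolve(f"{service}.discovery.wmnet", "A")
    return [record.address for record in answer]


if __name__ == "__main__":
    for address in discovery_addresses("my-service"):
        print(address)
</syntaxhighlight>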
 
=== Traffic Flow ===
Exposure of these services to end-users or internal applications '''MUST''' happen via [[LVS]] and be advertised in the [[DNS/Discovery|DNS Discovery]] records. The [[Global traffic routing]] layer will take care of routing end-users to the appropriate LVS endpoints. Internal applications '''SHOULD''' use the [[Envoy#Services Proxy|Services Proxy]] infrastructure to communicate with other services. For incoming HTTP(S) traffic nothing is required; for outgoing HTTP(S) traffic, the application '''SHOULD''' set the correct HTTP Host header, matching the endpoint it wants to talk to, in the requests it generates. TLS certificates '''SHOULD''' be generated for these services. Consult with Service Operations.
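
As a concrete illustration of the outgoing-traffic pattern above, here is a minimal sketch of an internal application calling another service through its local Services Proxy listener while setting the Host header of the endpoint it actually wants to reach. The local port and the upstream hostname are placeholders, not real assignments; consult the [[Envoy#Services Proxy|Services Proxy]] documentation for the actual values.

<syntaxhighlight lang="python">
# Minimal sketch of outgoing HTTP(S) traffic via the local services-proxy
# listener. The port (6022) and the upstream hostname are placeholders.
import requests

LOCAL_PROXY = "http://localhost:6022"                 # hypothetical services-proxy listener
UPSTREAM_HOST = "example-service.discovery.wmnet"     # hypothetical endpoint we want to reach

response = requests.get(
    f"{LOCAL_PROXY}/healthz",
    # The Host header SHOULD name the endpoint being addressed, as described above.
    headers={"Host": UPSTREAM_HOST},
    timeout=5,
)
response.raise_for_status()
print(response.status_code)
</syntaxhighlight>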


== staging ==
<code>staging</code>, also known as <code>staging-eqiad</code>, is a sibling cluster to the above clusters.


=== Goal ===
This cluster exists to allow developers to deploy and test new versions of their project without affecting user traffic. It complements the above clusters by providing a safety net (and nothing more) during a deployment. The idea is that if a deployment fails in <code>staging</code>, no effort should be made to proceed with deploying to the above clusters.


'''Uses of this cluster other than the one described above (e.g. as a development environment, a CI runner, a quality assurance platform or a demo environment, to name a few) MUST NOT be allowed. This restriction is in addition to the restrictions mentioned for the above clusters.'''
 
=== Datacenters ===
While staging clusters exist in both eqiad and codfw, the primary home of the staging cluster is eqiad. <code>staging-codfw</code> is intended for SREs to adjust and test the configuration of Kubernetes itself. While developers can deploy there, it's strongly discouraged; the cluster is in a constant state of change. As such, while it can perform the same functions as the <code>staging-eqiad</code> cluster, it is usually not ready to do so, nor should it be.
 
Since no real traffic is served by these clusters, there is no need for High Availability mechanisms.
 
=== Traffic Flow ===
There is no production traffic flow for these clusters, so none of the things mentioned for the above clusters apply. Typically, deployments '''SHOULD''' only have 1 replica in staging since it has fewer resources. TLS is automatically configured for all services deployed here.
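
To illustrate the safety-net role described above, here is a minimal sketch of a post-deployment check one might run against a service in <code>staging</code> before proceeding to the wikikube clusters. The URL is purely a placeholder; use whatever endpoint your service actually exposes.

<syntaxhighlight lang="python">
# Minimal sketch of a staging smoke test: if this fails, do not proceed with
# deploying to the wikikube clusters. The URL below is a placeholder.
import sys

import requests

STAGING_URL = "https://example-service.example-staging.wmnet:4443/healthz"  # hypothetical

try:
    response = requests.get(STAGING_URL, timeout=10)
    response.raise_for_status()
except requests.RequestException as error:
    print(f"staging check failed: {error}", file=sys.stderr)
    sys.exit(1)

print("staging check passed")
</syntaxhighlight>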


== ml-serve-eqiad & ml-serve-codfw ==
The ml-serve clusters run the Kubeflow + KServe (formerly KFServing) stack. The owner of these clusters is the [[mw:Machine_Learning|Machine Learning]] team. Despite the different ownership, the [[mw:Machine_Learning|ML team]] and the Service Operations team largely share the infrastructure capabilities and processes that the other clusters use.
 
=== Goal ===
Machine Learning related applications that are owned, or helped into production, by the ML team are deployed here. A first goal is to replace the [[ORES]] infrastructure that serves revision scores. Eventually, one more cluster will be created in the eqiad datacenter to allow for training Machine Learning applications using Kubeflow; those applications will then be deployed, via the Kubeflow infrastructure, in the clusters described here. The name of that project is [[phab:project/view/5020/|Lift Wing]].
 
=== Datacenters ===
The reason for having clusters in the 2 main [[Data centers|Datacenters]] is the same as for the above clusters.
 
=== Traffic Flow ===
Exposure of these services to end-users or internal applications '''MUST''' happen via [[LVS]] and be advertised in the [[DNS/Discovery|DNS Discovery]] records. The [[Global traffic routing]] layer will take care of routing end-users to the appropriate LVS endpoints. Internal applications '''SHOULD''' use the [[Envoy#Services Proxy|Services Proxy]] infrastructure to communicate with other services. For incoming HTTP(S) traffic nothing is required; for outgoing HTTP(S) traffic, the application '''SHOULD''' set the correct HTTP Host header, matching the endpoint it wants to talk to, in the requests it generates. TLS certificates '''SHOULD''' be generated for these services. Consult with the ML Team or Service Operations.
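
As with the wikikube clusters, internal callers go through LVS/Services Proxy and set the Host header of the inference service they want to reach. The sketch below shows the general KServe-style request shape with entirely hypothetical names, ports and URLs; the actual endpoints and model names are documented by the ML team.

<syntaxhighlight lang="python">
# Minimal sketch of calling a hypothetical model hosted on the ml-serve
# clusters using the KServe v1 prediction protocol. Every name, port and URL
# below is a placeholder.
import requests

GATEWAY = "https://inference.example.wmnet:30443"        # hypothetical cluster entry point
MODEL_HOST = "example-model.example-ns.wikimedia.org"    # hypothetical InferenceService host

response = requests.post(
    f"{GATEWAY}/v1/models/example-model:predict",
    headers={"Host": MODEL_HOST},      # names the inference service being addressed
    json={"rev_id": 12345},            # example input for a revision-scoring model
    timeout=10,
)
response.raise_for_status()
print(response.json())
</syntaxhighlight>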


== Creating a new cluster ==
Creating a new cluster is supported, albeit it is a substantial amount of work (multiple days even for the fastest SRE team) and investment. SREs '''MUST''' consult with the Service Operations team before proceeding further with the instantiation of a new cluster. Docs are at [[Kubernetes/Clusters/New]].
