Wikimedia Cloud Services team/EnhancementProposals/GridEngine plans and timeline


This page contains information about WMCS plans and timeline for Toolforge GridEngine.

Toolforge currently uses Son of Grid Engine (a fork of the original GridEngine) to offer job scheduling functionality for our technical community. This particular grid software is, however, considered deprecated: more modern technologies now handle the same functions.

The ultimate goal of the WMCS team is to stop using this grid software, and leverage Kubernetes instead.

Timeline

This timeline is a rough guess; the reader should treat the dates as little more than placeholders. We hope that future edits to this section will introduce more precision.

  • FY21/22 Q2 (Oct-Dec 2021): finish work & release the Toolforge Jobs Framework. Continue working on Toolforge buildpacks. Migrate Son of Grid Engine to Debian Buster.
  • FY21/22 Q3 (Jan-Mar 2022): finish work & release Toolforge buildpacks.
  • FY21/22 Q4 (Apr-Jun 2022): introduce and run a formal deprecation process for Son of Grid Engine (community communications, support, etc.)
  • FY22/23 Q1 (Jul-Sep 2022): finish deprecation process and shutdown Son of Grid Engine.

Use case continuity

We are aware our technical community relies on the grid for many of the most relevant Toolforge use cases.

In particular, there are a couple of use cases that may need some adaptation work in order to be fully supported on Kubernetes. For some of the current grid workflows, there may be no 1:1 functionality match on Kubernetes.

The following table tracks use case continuity.

Toolforge grid-like features
Feature | In our Son of Grid Engine | In our Kubernetes | Comment
tools job scheduling | Native | Toolforge Jobs Framework (customization*) | Basically a 1:1 match
mixing tool runtime environments | Native | Toolforge buildpacks (customization*) | Potentially equivalent solution
tools web services | Native + customization* | Native | Already in place
tool management via ssh | Native + customization* | Native + customization* | Already in place
tool management via web interface | Not implemented | No plans so far | Easier with Kubernetes APIs anyway
multitenancy | Native, based on POSIX semantics | Native, based on k8s namespaces | Already in place
quotas and other cluster-level controls | Native | Native | Already in place
tool development environment local replication | None, up to the user | Native, docker containers | Improvement!
access to data services (toolsdb, wikireplicas, dumps, etc) | Yes | Yes | Basically a 1:1 match
observability for individual tools | None | Native, based on prometheus | Already in place
observability service-wide | https://sge-status.toolforge.org/ | https://k8s-status.toolforge.org/ | Basically a 1:1 match

\* In the context of the table, customization means that a significant development effort is required for making it possible.
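To make the job-scheduling row above concrete, the sketch below contrasts a classic grid submission with its rough Toolforge Jobs Framework equivalent. The job name, script path, and image label are illustrative assumptions, and CLI flags may change as the framework evolves; consult the Jobs Framework documentation for current syntax.

```shell
# Old grid workflow: submit a one-off job with jsub (Son of Grid Engine).
# "myjob" and ./my-script.sh are hypothetical names for illustration.
jsub -N myjob -mem 512m ./my-script.sh

# Rough Kubernetes equivalent using the Toolforge Jobs Framework CLI.
# The image name is illustrative; available images can be listed with:
#   toolforge-jobs images
toolforge-jobs run myjob --command ./my-script.sh --image tf-bullseye-std

# Check the status of submitted jobs.
toolforge-jobs list
```

Both commands run from a tool account on a Toolforge bastion; the main conceptual shift is that the Kubernetes path requires picking a container image, whereas the grid inherited the bastion's environment.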

Reasoning

These are some of the reasons why we want to stop using our current grid implementation:

  • there has not been a new release (bugfixes, security patches, or otherwise) since 2016
  • the grid has poor controls and support for important aspects such as high availability, fault tolerance and self-recovery.
  • maintaining a healthy grid requires plenty of manual operations, like manual queue cleanups in case of failures, hand-crafted scripts for pooling/depooling nodes, etc.
  • there is no good/modern monitoring support for the grid, so we need to craft and maintain several monitoring pieces to be able to do proper maintenance.
  • the grid is also strongly tied to the underlying operating system release version. Migrating from one Debian version to the next is often painful.
  • the grid imposes a strong dependency on NFS, another old technology. We would like to reduce dependency on NFS overall, and in the future we will explore NFS-free approaches for Toolforge.
  • in general, the grid is old software, old technology, which can be replaced by more modern approaches for doing the same thing.

We want to cover all our grid-like technology needs with Kubernetes, which has several benefits:

  • good high availability, fault tolerance and self-recovery constructs and facilities.
  • maintaining a running Kubernetes cluster requires few manual operations.
  • there are good monitoring options for Kubernetes deployments.
  • our current approach to deploying and upgrading kubernetes is independent of the underlying operating system.
  • while our current kubernetes deployment uses NFS as a central component, there is support for using other, more modern, approaches for the kind of shared storage we need in Toolforge.
  • in general, Kubernetes is a modern technology with a vibrant and healthy community, which both enables new use cases and has enough flexibility to adapt legacy ones.

See also