Streamlined Service Delivery Design


Introduction

We will streamline and integrate the delivery of services by building a new production platform for integrated development, testing, deployment, and hosting of applications.

The Technology department's Program 6: Streamlined Service Delivery pertains to the effort by the department to create tools and processes that will allow developers to easily create services that can run on production infrastructure with minimal (if any) modifications.

Goal

We will build a new production platform for integrated development, testing, deployment, and hosting of applications. This will greatly reduce the complexity of delivering a service and maintaining it throughout its lifecycle, and increase the speed with which this can be done, with fewer dependencies between teams and greater automation and integration. The platform will offer more flexibility through support for automatic high-availability and scaling, abstraction from hardware, and a streamlined path from development through testing to deployment. Services will be isolated from each other for increased reliability and security.

Wikimedia developers, as well as third-party users, benefit from the ability to easily replicate the stack for development or their own use cases.

This work also represents an investment in the future; although this will not yet significantly materialize within FY17-18, this project will eventually result in significant cost savings on both capital expenditure (through consolidation of hardware capacity) and staff time (by streamlining development, testing, deployment and maintenance).

Design

At a very high level the program is about creating a number of systems that will allow developers to create, modify and test old and new applications in a streamlined and unified way, eliminating many of the current roadblocks to application development. The resulting new applications would even be good candidates for running on the main production infrastructure maintained by WMF.

The key systems, listed here and explained in more detail one by one below, are:

  • A development environment
  • Testing/building pipeline
  • Artifact Registry
  • Deployment tooling
  • Configuration registry
  • Staging environment
  • Production environment

Development environment

Testing/building pipeline

Artifact Registry

The artifact (or artefact) registry will be a container registry that will be publicly available to everyone in the world. The ability to upload artifacts to it will be restricted to the pipeline and to Technical Operations members who have to perform security upgrades and need to upload new container images. Every image that has been successfully built by the pipeline will be uploaded to the artifact registry and be ready for deployment.

It is already implemented using the docker-provided registry with a swift backend. This allows the storage powering the registry to scale horizontally.

The ability for anyone in the world to download our images means they will be reusable by the development environment, which in turn means every developer will be able to reuse, as is, the code we run in production.
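
As an illustration of the swift-backed setup (a minimal sketch; the endpoint, credentials and container name below are placeholders, not the actual production configuration), the upstream docker registry is configured with a swift storage driver roughly like this:

  # config.yml for the docker registry (illustrative values only)
  version: 0.1
  storage:
    swift:
      authurl: https://swift.example.org/auth/v1.0   # placeholder swift auth endpoint
      username: registry
      password: CHANGEME
      container: docker_registry_container           # swift container holding the image layers
  http:
    addr: :5000

Since reads are unauthenticated, anyone can then pull the published images with a plain docker pull against the registry endpoint.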

Deployment tooling

Configuration registry

Services running in production WILL have different configuration from services running in other environments, like the dev environment. This is expected and welcome, since it allows for greater flexibility. That configuration should be kept in a (probably public) place allowing interested parties to review and propose changes. Let's call that place the configuration registry. The deployment tooling will be using that configuration registry to bootstrap and configure services in production, whereas other environments (e.g. development environments) should either not use it or use a different version of it. The staging environment, since it is hosted in WMF infrastructure, will probably be using the same configuration registry, perhaps with some differences. For now, the logical choice for this configuration registry is git. Configuration is text, something git excels at tracking. A git repository is flexible enough to support more than one environment via branches, or different git repositories can be used per environment. This section will probably need much more work to be defined in the future.
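
As a purely hypothetical illustration (the file name, keys and values below are invented, not an agreed-upon format), a service's per-environment configuration tracked in such a git repository could be a small file per environment, with a sibling file or branch holding the production variant:

  # values-development.yaml — hypothetical per-environment configuration for an imaginary service
  service:
    log_level: debug                        # verbose logging in development
    upstream_api: http://localhost:6927     # local mock instead of the production endpoint
  monitoring:
    enabled: false                          # no alerting for a developer's environment

Reviewing a change to production configuration then becomes an ordinary code review on the corresponding file or branch.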

Staging environment

What it will be

The staging environment is meant to be a functional copy of the production environment, albeit with smaller capacity and availability guarantees. It will not be available in both data centers and the nodes powering it will always be fewer in number. In every other way, it is expected to be an exact copy of production at every point in time.
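
For example (a hypothetical override, not actual staging configuration), the staging flavour of a service could simply pin down the replica count and resource requests while keeping everything else identical to production:

  # values-staging.yaml — hypothetical capacity override for staging
  replicaCount: 1           # production might run several replicas per DC
  resources:
    requests:
      cpu: 100m
      memory: 128Mi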

Mode of operation

This is where artifacts from the artifact registry will be deployed after being successfully built by the testing/building pipeline, preferably directly from the pipeline instrumentation (automatically or after human vetting). This gives developers a canary deployment environment, allowing errors to be caught early, before they make it to production.

What it will NOT be

The staging environment SHOULD NOT be considered a testing environment. It SHOULD be considered the last step and chance to catch bugs before they impact end-users.

Production environment

What it will be

The production environment will be the one where live traffic from end-users is directed.

Highly Available in all manners possible

The above goal will be implemented by running the containers produced by the preceding stages in a number of kubernetes clusters (currently 1 per primary DC). Kubernetes implements a lot of this itself: by scheduling different workloads and workload types across a fleet of kubernetes nodes, monitoring those workloads, and stopping and starting them as necessary according to its algorithms, it provides high availability. In case of a sudden failure of an application, it will automatically depool the application, quickly restart it and repool it. Given the nature of these workloads, it is possible to run multiple incarnations of an application in parallel (called pods), providing load balancing and high availability.
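
A minimal sketch of what this looks like in practice, for an imaginary service called exampled (the name, image, ports and numbers are placeholders): a Deployment asks kubernetes to keep a number of identical pods running and to restart any pod whose health check fails.

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: exampled
    namespace: exampled
  spec:
    replicas: 3                     # run three pods in parallel for HA and load balancing
    selector:
      matchLabels:
        app: exampled
    template:
      metadata:
        labels:
          app: exampled
      spec:
        containers:
          - name: exampled
            image: registry.example.org/exampled:1.0.0   # an image built by the pipeline
            ports:
              - containerPort: 8080
            livenessProbe:          # pods failing this check are restarted automatically
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 10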

Present on all primary DCs (2 currently)

Currently we already have 2 clusters, 1 per primary DC, and hardware for the future has already been planned for.

Adequately filtering traffic between applications as well as the rest of the world

Using the Calico framework and the network policies it enables, it is possible (and has already been implemented) to define very specifically which outbound network connections an application may make and which other applications may reach it inbound. This is expected to minimize the exposure of services to connections from unwarranted sources, while also protecting the rest of the fleet from those applications in case of a compromise.
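
A hedged sketch of such a policy, in the standard kubernetes NetworkPolicy form that Calico enforces (the service names, namespace and CIDR below are invented for illustration):

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: exampled-policy
    namespace: exampled
  spec:
    podSelector:
      matchLabels:
        app: exampled
    policyTypes:
      - Ingress
      - Egress
    ingress:
      - from:
          - podSelector:
              matchLabels:
                app: frontend        # only the hypothetical "frontend" app may connect in
        ports:
          - protocol: TCP
            port: 8080
    egress:
      - to:
          - ipBlock:
              cidr: 10.0.0.0/8       # placeholder range for internal services it may reach
        ports:
          - protocol: TCP
            port: 443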

Equipped with Access Control mechanisms for configuration and deployment of applications

TBD

Capable of high levels of inbound/outbound traffic

Inbound traffic will reach the kubernetes clusters via LVS-DR alongside pybal, both well tested in WMF infrastructure, allowing high levels of traffic to reach the clusters at specific node ports (TCP and UDP). Following standard LVS practice in our data centers, the nodes will return traffic directly to the caller, allowing for high volumes of asymmetrical traffic; this suits the way the Web currently works, with small requests and large responses.
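
A sketch of the cluster side of this arrangement (service name and port numbers are placeholders): a NodePort Service exposes the same port on every node, and LVS/pybal balances external traffic across those node ports.

  apiVersion: v1
  kind: Service
  metadata:
    name: exampled
    namespace: exampled
  spec:
    type: NodePort
    selector:
      app: exampled
    ports:
      - protocol: TCP
        port: 8080          # service port inside the cluster
        targetPort: 8080    # container port on the pods
        nodePort: 30080     # placeholder port exposed on every node for LVS to target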

Capable of load balancing traffic between various application endpoints

Load balancing is handled natively by kubernetes using a stochastic system: traffic is probabilistically routed to an application instance (called a pod), regardless of the point of origin. There is not much more to add to that.

Capable of responding both automatically and manually to increases/decreases of traffic

Kubernetes also allows employing autoscalers, making it possible to respond to sudden increases in inbound traffic automatically. It is of course also possible to scale manually, increasing or decreasing the number of replicas available for each application.
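
A minimal sketch of such an autoscaler for the hypothetical exampled deployment used above (the thresholds and replica bounds are invented for illustration):

  apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: exampled
    namespace: exampled
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: exampled
    minReplicas: 2                          # never drop below two pods
    maxReplicas: 8                          # cap the automatic scale-out
    targetCPUUtilizationPercentage: 70      # add pods when average CPU exceeds 70%

Manual scaling remains available as well, e.g. kubectl scale deployment exampled --replicas=5.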

Supporting monitoring, health checks, telemetry and transparent encryption for application communication

TBD

Rolling deploys

Kubernetes natively supports rolling deploys. This can be done behind the scenes by kubectl, and it is also implemented by helm. It happens by creating a new replication controller in kubernetes whose number of replicas is gradually increased while the number of replicas of the old one is gradually decreased until it reaches 0, at which point the old replication controller is deleted and the entire application has been upgraded. The rolling update can also be rolled back quite quickly. Since kubernetes performs health checks of applications, a rolling update can be monitored, paused, or rolled back easily.
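
In Deployment terms, the pace of such an update is controlled by the strategy block; the sketch below reuses the hypothetical exampled service from earlier (all values are illustrative):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: exampled
    namespace: exampled
  spec:
    replicas: 3
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1   # take down at most one old pod at a time
        maxSurge: 1         # create at most one extra new pod at a time
    selector:
      matchLabels:
        app: exampled
    template:
      metadata:
        labels:
          app: exampled
      spec:
        containers:
          - name: exampled
            image: registry.example.org/exampled:1.0.1   # the new image being rolled out

A rollout that misbehaves can be reverted with kubectl rollout undo deployment/exampled.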

Relation to the pipeline

The pipeline instrumentation WILL NOT be automatically updating application code in production. This WILL have to be vetted and done by the developers, so that surprises are kept to a minimum and the current status quo is maintained, possibly easing adoption of the new infrastructure.

Conclusions