Wikimedia Cloud Services team/EnhancementProposals/Toolforge Buildpack Implementation


Overview

Building on Wikimedia Cloud Services team/EnhancementProposals/Toolforge push to deploy and the initial PoC at Portal:Toolforge/Admin/Buildpacks, this is an effort to deliver a design document and plan for introducing a buildpack-based workflow.

Toolforge is a Platform-as-a-Service concept inside Cloud VPS, but it is burdened heavily by technical debt from outdated designs and assumptions. The latest implementations of Toolforge's most effectively curated services are all cloud-native structures that depend on containerization and Kubernetes, moving away from shell logins and batch compute (like Grid Engine) that is heavily tied to NFS, so the clear way forward is a flexible, easy-to-use container system that launches code with minimal effort on the part of the user while having a standardized way to contribute to the system itself. Today, the community-adopted and widely-used solution is Cloud Native Buildpacks, a CNCF project originally started by Heroku and Cloud Foundry based on their own work in order to move toward industry standardization (partly because those platforms, and similar ones like Deis, pre-dated Kubernetes, OCI and similar cloud-native standards, and therefore relied on LXC, tarballs and generally incompatible-with-the-rest-of-the-world frameworks). Since it is a standard and a specification, there are many implementations of that standard. The one used for local development is the pack command line tool, which is also orchestrated by many other integrations such as Gitlab, Waypoint and CircleCI. Separate implementations like kpack (maintained mostly by VMWare) and a couple of tasks built into Tekton CI/CD and maintained by the Cloud Native Buildpacks group directly (like pack) are more readily consumed when not using those tools. Because Tekton is designed to be integrated into other CI/CD solutions and is fairly transparent about its backing organizations, it is a natural preference for Toolforge, as it should work with Jenkins, Gitlab, Github, etc.

To get an idea of what it "looks like" to use a buildpack, it can be thought of as simply an alternative to using a Dockerfile. It is an alternative that cares about the specific OCI image (aka Docker image) layers that are created and exported, not only to prevent an explosion of the space they use but also to make sure the lower layers stay in a predictable state that can be rebased on top of when they are updated. The example used is almost always running the pack command to deploy an app locally (see https://buildpacks.io/docs/app-journey/). That looks either like magic (if you are used to creating Dockerfiles) or like just another way to dockerize things, and it really doesn't explain what buildpacks are or why we'd use them. A better example would be to create a Node.js app on Gitlab.com, enable "Auto DevOps" (aka buildpacks), add a Kubernetes cluster to your Gitlab settings and watch it magically produce a CI/CD pipeline all the way to production without any real configuration. That's roughly what it looks like when you get it right. The trickier part is all on our end.

Concepts

Buildpacks are a specification as well as a piece of code you can put in a repository. It is somewhat necessary to separate these two ideas in order to understand the ecosystem, partly because buildpacks don't do anything without an implementation of the lifecycle in a platform. A "buildpack" applies user code to a "stack" via a "builder" during the "lifecycle" to get code onto a "platform", which is also the term used to describe the full system. So when we talk about "buildpacks", we could be talking about any one of those pieces: the specification, an individual buildpack's code, the builder that bundles buildpacks with a stack, or the platform that runs the lifecycle.

This is not made easier to understand by the ongoing development of the standard, which now includes stackpacks: these can take root actions on layers (such as running package installs) before handing off to buildpacks.


Design

Requirements

  • Must run in a limited-trust environment
    • Elevated privileges should not be attainable since that breaks multi-tenancy and the ability to use Toolforge as a curated environment
      • This might seem obvious, but it needs to be stated since so many implementations of this kind of thing assume a disposable sandbox or a trusted environment.
  • Storage of artifacts must have quotas
    • It does not take a bad actor to fill a disk; honest mistakes can do that just as easily
  • Selection of builders must be reviewed or restricted
    • Since the buildpack lifecycle is effectively contained in your builder, the builder selects the stack you use, what you can do and how you do it. Without controlling builder selection, the system cannot maintain security updates, storage deduplication via layer control, or the ability to rebase images during a security response.
  • The system must at least be available in toolsbeta as a development environment, if it is not usable (or at least testable) on local workstations.
  • It must be possible for it to coexist with the current build-on-bastion system.
    • This implies that webservice is aware of it. See task T266901
  • This is really two problems: Build and Deploy
    • Once you have a build, you still need to decide how to deploy it. For now, that could just be a webservice command argument.
    • Deployment is the easier problem to solve, depending on how we want things to roll. It could even be automatic.

Nice-to-haves

  • A dashboard (or at least enough prometheus instrumentation to make one)
  • A command line to use from bastions that simplifies manual rebuilds

Components

[[File:Toolforge-build-service.png|thumb|left|alt=Diagram showing relationships between components of Toolforge buildpacks and build service]]

Build Service

The specification and lifecycle must be implemented in a way that works with our tooling and enables our contributors without breaking the multitenancy model of Toolforge entirely. Effectively, that means it has to be integrated into some form of CI in Kubernetes (as it is now, or in a slightly different form). Buildpacks are not natively supported in Jenkins without extra help, and Gitlab support may be on its way, but in order to build this out with appropriate access restrictions, a sound security model and full integration into WMCS operations, we likely need to operate our own small "CI" system to process buildpacks.

The build service is itself the centerpiece of the setup. It must be constructed so that it is the only user-accessible system that has credentials to the artifact (OCI) repository; otherwise users could bypass it and push whatever they want. This is quite simple to accomplish if builds happen in a separate namespace where users (and potentially things like Striker) can only create pipeline objects for Tekton (in this case). Tool users have no access to pods or secrets in that namespace, so as long as the namespace has the correct secret placed by admins, the access will be available. An admission webhook could insist that the creating user only references their own project in Harbor, is in the Toolforge group, uses only approved builders, etc. The namespace used is named image-build.
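
As a rough illustration of the credential wiring involved (a sketch only; the secret name, service account name and robot account are hypothetical, not actual configuration), the admin-placed secret could be a standard Tekton-annotated registry credential attached to the service account that pipelineruns execute under:

<source lang="yaml">
# Hypothetical sketch; names and the robot account are placeholders, not real config.
apiVersion: v1
kind: Secret
metadata:
  name: harbor-push-credentials
  namespace: image-build
  annotations:
    # Tekton's credential initialization uses this annotation to set up pushes
    # to the named registry.
    tekton.dev/docker-0: https://harbor.toolsbeta.wmflabs.org
type: kubernetes.io/basic-auth
stringData:
  username: robot$image-builder
  password: REDACTED
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: buildpacks-builder
  namespace: image-build
secrets:
  - name: harbor-push-credentials
</source>

Tool users would reference such a service account only indirectly through their pipelineruns and could never read the secret itself, since they have no access to secrets in image-build.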

Tekton CI/CD provides a good mechanism for this since it is built entirely for Kubernetes and is actively maintained with buildpacks in mind. It also has a dashboard available that may be useful to our users. Tekton provides tasks maintained directly by the cloud-native buildpacks organization, and it is the direction Red Hat is moving OpenShift as well, which guarantees some level of external contribution to the project (https://www.openshift.com/learn/topics/pipelines). It is also worth noting that Knative dropped its "build" system in favor of Tekton.

Because we are not able to simply consume Auto DevOps from Gitlab (some issues include private registry support, https://docs.gitlab.com/ce/topics/autodevops/#private-registry-support, and the expectation of privileged containers and Docker-in-Docker firmly embedded in the setup), we are pursuing Tekton Pipelines.

Artifact Repository

Since Docker software is being dropped from many non-desktop implementations of cloud-native containers, it is usually OCI images and registries that we are talking about working with; Docker is a specific implementation and extension of the standard with its own orchestration layers that are separate from the OCI spec. All that said, we need a registry to work with. Right now, Toolforge uses the same basic Docker registry that is used by production, but with local disk instead of Swift for storage, which is highly limiting. Since WMCS systems are not limited exclusively to Debian packaging, we could deploy the widely popular and much more business-ready Harbor, which is maintained mostly by VMWare. Harbor is deployed via docker compose or (in a better, more fault-tolerant form) helm, since it is distributed entirely as OCI images built on VMWare's peculiar Photon OS. That is why it was rejected by the main production SRE teams: repackaging it would be a huge piece of work. Despite this odd deployment style, it enjoys wide adoption and contributions and is at 2.0 maturity. It can also be used to cache and proxy to things like Quay.io. The downsides of Harbor are that it is backed by PostgreSQL and that it does not inherently solve the storage problem. While it would be ideal to deploy it with Trove PostgreSQL and Swift storage, we could deploy it initially with a simple PostgreSQL install and Cinder storage volumes. Ultimately, Harbor is the only option with no real potential licensing concerns that also supports quotas.

Edit: We are testing Harbor as the most complete open source implementation of what we need. It has quotas, multi-tenancy and a solid authentication model that allows for "robot" accounts, just like Quay and Docker Hub. We don't need to link it to LDAP since, ideally, users cannot write to it directly anyway; users should go through the build system only. Right now it is running at https://harbor.toolsbeta.wmflabs.org for initial build and testing. It should be deployed using helm on Kubernetes in the end, and it has the ability to back up to Quay.io. As long as we are consuming the containerized version directly, this is a solved problem.
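
For illustration only, a minimal values sketch for the upstream goharbor/harbor chart could look something like the following; the hostname matches the test instance above, while the admin password, database endpoint and storage details are placeholders rather than a reviewed configuration:

<source lang="yaml">
# Hypothetical sketch of Harbor chart values (goharbor/harbor); not a reviewed config.
expose:
  type: ingress
  ingress:
    hosts:
      core: harbor.toolsbeta.wmflabs.org
externalURL: https://harbor.toolsbeta.wmflabs.org
harborAdminPassword: REDACTED
database:
  type: external
  external:
    host: harbor-postgres.example.svc   # placeholder for a simple PostgreSQL install
    port: "5432"
    username: harbor
    password: REDACTED
    coreDatabase: registry
persistence:
  enabled: true                         # ultimately backed by Cinder volumes or Swift
</source>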

Dashboards and UX

This will depend somewhat on the implementation of the build service.


Deployment Service

If this sounds like "CD", you'd be right. However, initially it may be enough to just allow you to deploy your own images with webservice; once you have an image, you can deploy it however you want. Tekton is entirely capable of acting as a CD pipeline as well as CI, while kpack is a one-trick pony. We may want something entirely different as well, but while everyone in Toolforge still has a shell account anyway, this seems like the item that can be pushed down the road.

Build Service Design

Tekton Pipelines is a general-purpose Kubernetes CRD system (https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) with a controller, another namespace and a whole lot of capability. To use it as a build system, we are consuming a fair amount of what it can do.

Namespacing

The controller and mutating webhook live in the tekton-pipelines namespace. Those are part of the upstream system. The webhook has a horizontal pod autoscaler to cope with bursts. The CRDs interact with those components to coordinate pods using the ClusterRole tekton-pipelines-controller-tenant-access (see RBAC).

Actual builds happen in the image-build namespace. Tool users cannot directly interact with any core resources there; only pipelines (where parameters to the git-clone task and buildpacks task are defined), pipelineruns (which actually make pipelines do things), PVCs (to pass the git checkout to the buildpack) and pipelineresources (to define images) will be accessible to Toolforge users. Those resources need further validation (via webhook) and are probably best created through convenience interfaces (like scripts and webapps).
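
To make that concrete, here is a rough sketch (not the deployed configuration) of such a pipeline and a pipelinerun that triggers it, assuming the upstream tektoncd/catalog git-clone and buildpacks tasks are installed in the namespace; the pipeline name, builder image, repository URL and image path are placeholders:

<source lang="yaml">
# Hypothetical sketch; parameter and workspace names follow the upstream catalog
# git-clone and buildpacks tasks and may differ between task versions.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: buildpacks-build
  namespace: image-build
spec:
  params:
    - name: source-url
      type: string
    - name: image
      type: string
  workspaces:
    - name: source
  tasks:
    - name: fetch-repository
      taskRef:
        name: git-clone
      params:
        - name: url
          value: $(params.source-url)
      workspaces:
        - name: output
          workspace: source
    - name: build-image
      taskRef:
        name: buildpacks
      runAfter:
        - fetch-repository
      params:
        - name: APP_IMAGE
          value: $(params.image)
        - name: BUILDER_IMAGE
          value: harbor.toolsbeta.wmflabs.org/toolforge/builder:latest  # placeholder approved builder
      workspaces:
        - name: source
          workspace: source
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: mytool-build-1
  namespace: image-build
spec:
  serviceAccountName: buildpacks-builder   # admin-managed account holding push credentials
  pipelineRef:
    name: buildpacks-build
  params:
    - name: source-url
      value: https://git.example.org/mytool.git                   # placeholder repository
    - name: image
      value: harbor.toolsbeta.wmflabs.org/tool-mytool/web:latest  # must match the tool's own project
  workspaces:
    - name: source
      persistentVolumeClaim:
        claimName: mytool-build-workspace                          # see the Storage section
</source>

An admission webhook then only has to validate fields like the image value and builder against the requesting tool before the pipelinerun is admitted.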

The required controller for the NFS subdir storage class is deployed into the nfs-provisioner namespace via helm.

Triggers for automated builds from git repos might be defined in tool namespaces.

Deployments are then simply a matter of allowing the Harbor registry as an image source for tool pods. Automated deployment might be added as an optional final stage of pipelines when we are ready.

RBAC

There is basic RBAC required for deploying Tekton Pipelines, outlined in the upstream documentation. The specific yaml will be committed to git.
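
Beyond the upstream install, the tenant-facing piece is what tool users may touch in the image-build namespace. A minimal sketch of that idea, assuming a single Toolforge user group (the names here are illustrative, not the committed yaml):

<source lang="yaml">
# Illustrative sketch of tenant access in image-build; not the committed manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: toolforge-image-builder
  namespace: image-build
rules:
  - apiGroups: ["tekton.dev"]
    resources: ["pipelines", "pipelineruns", "pipelineresources"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "create", "delete"]
  # Deliberately no access to pods, secrets or taskruns in this namespace.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: toolforge-image-builder
  namespace: image-build
subjects:
  - kind: Group
    name: toolforge                 # placeholder for the Toolforge user group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: toolforge-image-builder
  apiGroup: rbac.authorization.k8s.io
</source>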

Pod Security Policy (later to be replaced by OPA Gatekeeper)

Mostly sorted out; details to be added here.

Storage

To pass a volume from the git-clone task to the buildpacks task, you need a Tekton workspace, which is backed by a persistent volume claim. It is much simpler to have a StorageClass manage the persistent volumes and provision them automatically than to have admins dole them out. The NFS subdir provisioner (https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner) works quite well: the basic idea is that you give it a directory on an NFS share the workers have access to, and it will happily create subdirectories and attach them to pods, deleting them when the volume claim is cleaned up. For the most part, the storage size requested seems to have little effect on unquota'd NFS volumes, so, as usual, tool users will simply need to be polite about storage use. This can later be replaced by Cinder volumes (using the OpenStack provider) or at least an NFS directory with a quota on it to prevent harm.
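
For example, the claim backing a single tool's build workspace could be as simple as the following sketch; the claim name and storage class name are assumptions (the chart's default class name), and the requested size is largely advisory on an unquota'd NFS share:

<source lang="yaml">
# Hypothetical workspace claim for one tool's builds; names are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mytool-build-workspace
  namespace: image-build
spec:
  storageClassName: nfs-client      # assumed default class created by the subdir provisioner chart
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
</source>

A pipelinerun can also request this through a volumeClaimTemplate instead, which lets Tekton create and clean up the claim for each run.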

The image can be cached in our standard image repo, but an example values file would look like this:

<source lang="yaml">
replicaCount: 1
strategyType: Recreate
image:
  repository: k8s.gcr.io/sig-storage/nfs-subdir-external-provisioner
  tag: v4.0.2
  pullPolicy: IfNotPresent
nfs:
  server: nfs-tools-project.svc.eqiad.wmnet
  path: /srv/misc/shared/toolsbeta/project/pvs
  mountOptions:
    - noatime
    - nfsvers=4
    - proto=tcp
    - sec=sys
  volumeName: nfs-subdir-external-provisioner-root
storageClass:
  defaultClass: true
  archiveOnDelete: false
podSecurityPolicy:
  enabled: true
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
labels:
  app.kubernetes.io/part-of: toolforge-build-service
</source>

It's deployed using (for example) <code>helm --kube-as-user bstorm --kube-as-group system:masters --namespace nfs-provisioner install nfs-provisioner-beta-1 nfs-subdir-external-provisioner/nfs-subdir-external-provisioner -f volume-provisioner-values.yaml</code>

Secrets

Tasks