Wikimedia Cloud Services team/EnhancementProposals/Toolforge Buildpack Implementation
Revision as of 19:19, 4 September 2021


Building on Wikimedia Cloud Services team/EnhancementProposals/Toolforge push to deploy and initial PoC at Portal:Toolforge/Admin/Buildpacks, this is an effort at delivering a design document and plan for the introduction of a buildpack-based workflow.

Toolforge is a Platform-as-a-Service concept inside Cloud VPS, but it is burdened heavily by technical debt from outdated designs and assumptions. The latest implementations of Toolforge's most effectively curated services are all cloud-native structures that depend on containerization and Kubernetes, moving away from shell logins and NFS-bound batch compute (like Grid Engine). The clear way forward is therefore a flexible, easy-to-use container system that launches code with minimal effort on the part of the user while offering a standardized way to contribute to the system itself. Today, the community-adopted and widely-used solution is Cloud Native Buildpacks, a CNCF project originally started by Heroku and Cloud Foundry based on their own work, in order to move toward industry standardization (partly because those platforms, and similar ones like Deis, pre-dated Kubernetes, OCI and similar cloud-native standards, and therefore relied on LXC, tarballs and generally incompatible-with-the-rest-of-the-world frameworks). Since it is a standard and a specification, there are many implementations of it. The one used for local development is the pack command line tool, which is also orchestrated by many other integrations such as GitLab, Waypoint and CircleCI. If not using those tools, separate implementations are more readily consumed, such as kpack (maintained mostly by VMware) and a couple of tasks built into Tekton CI/CD that are maintained by the Cloud Native Buildpacks group directly (like pack). Because Tekton is designed to be integrated into other CI/CD solutions and is fairly transparent about its backing organizations, it is a natural preference for Toolforge, as it should work with Jenkins, GitLab, GitHub, etc.

To get an idea of what it "looks like" to use a buildpack, it can be thought of as simply an alternative to using a Dockerfile. It is an alternative that cares about the specific OCI image (aka Docker image) layers that are created and exported, both to prevent an explosion of space used by them and to make sure that the lower layers stay in a predictable state that can be rebased on top of when they are updated. The example given is almost always using the pack command to deploy an app locally, which looks either like magic (if you are used to creating Dockerfiles) or like just another way to dockerize things; it really doesn't explain a thing about what buildpacks are or why we'd use them. A better example would be to create a Node.js app on GitLab, enable "Auto DevOps" (aka buildpacks), add a Kubernetes cluster to your GitLab settings and watch it magically produce a CI/CD pipeline all the way to production without any real configuration. That's what it kind of looks like if you get it right. The trickier part is all on our end.


A buildpack is a specification as well as a piece of code you can put in a repository, and it is somewhat necessary to separate these two ideas in order to understand the ecosystem. Part of the reason for this is that buildpacks don't do anything without an implementation of the lifecycle in a platform. A "buildpack" applies user code to a "stack" via a "builder" during the "lifecycle" to get code onto a "platform", which is also the term used to describe the full system. So when we talk about "buildpacks" we could be talking about any one of those pieces, or about the system as a whole.

This is not made easier to understand by the ongoing development of the standard (which now includes stackpacks, which can take root actions on layers before handing off to buildpacks, such as running package installs).



  • Must run in a limited-trust environment
    • Elevated privileges should not be attainable since that breaks multi-tenancy and the ability to use Toolforge as a curated environment
      • This might seem obvious, but it needs to be stated since so many implementations of this kind of thing assume a disposable sandbox or a trusted environment.
  • Storage of artifacts must have quotas
    • It does not take a bad actor to fill a disk; it can easily happen by mistake
  • Selection of builders must be reviewed or restricted
    • Since the buildpack lifecycle is effectively contained in your builder, the builder selects the stack you use, what you can do and how you do it. Without controlling builder selection, the system cannot guarantee security maintenance, deduplication of storage via layer control, or rebasability during security response.
  • The system must be available in toolsbeta as a development environment at minimum, and ideally usable, or at least testable, on local workstations.
  • It must be possible for it to coexist with the current build-on-bastion system.
    • This implies that webservice is aware of it. See task T266901
  • This is really two problems: Build and Deploy
    • Once you have a build, we might need to decide how to deploy it. For now, that could just be a webservice command argument.
    • Deployment is much more easily solved depending on how we want things to roll. It could be automatic.


  • A dashboard (or at least enough prometheus instrumentation to make one)
  • A command line to use from bastions that simplifies manual rebuilds


Build Service

The specification and lifecycle must be implemented in a way that works with our tooling and enables our contributors without breaking the multitenancy model of Toolforge. Effectively, that means it has to be integrated into some form of CI in Kubernetes (as it is, or in a slightly different form). Buildpacks are not natively supported in Jenkins, and GitLab may be on its way, but in order to build out with appropriate access restrictions, a sound security model and full integration into WMCS operations, we likely need to operate our own small "CI" system to process buildpacks.

The build service is itself the centerpiece of the setup. It must be constructed so that it is the only user-accessible system that has credentials to the artifact (OCI) repository; otherwise, users don't need to go through it at all and can push whatever they want. This is quite simple to accomplish if builds happen in a separate namespace where users (and potentially things like Striker) create pipelines for Tekton (in this case). Tool users have no access to pods or secrets in that namespace. At that point, as long as the namespace has the correct secret placed by admins, the access will be available. An admission webhook could insist that the creating user only references their own project in Harbor, is in the Toolforge group, only uses approved builders, etc. The namespace used is named image-build.
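As a sketch of what that separation could look like, a tool user might submit something like the following PipelineRun in the image-build namespace. Every name here (the pipeline, the repository URL, the registry host and project) is illustrative and not a final interface:

```yaml
# Illustrative only: a tool user's PipelineRun in the image-build namespace.
# The tool user never holds registry credentials; the service account
# referenced here is the one admins provisioned with the Harbor secret.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: mytool-build-run        # hypothetical name
  namespace: image-build
spec:
  serviceAccountName: buildpacks-service-account
  pipelineRef:
    name: buildpacks-pipeline   # hypothetical shared pipeline
  params:
    - name: SOURCE_URL
      value: https://gitlab.example.org/mytool/mytool.git  # placeholder repo
    - name: APP_IMAGE
      value: harbor.example.org/mytool/mytool:latest       # placeholder Harbor project
```

An admission webhook would then only need to inspect this one object to check that APP_IMAGE points at the submitting tool's own Harbor project and that the referenced pipeline uses an approved builder.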

Tekton CI/CD provides a good mechanism for this since it is built entirely for Kubernetes and is actively maintained with buildpacks in mind. It has a dashboard available that may be useful to our users. Tekton provides tasks maintained by the Cloud Native Buildpacks organization directly, and it is the direction Red Hat is moving OpenShift as well, which guarantees some level of external contribution to the project. It is also worth noting that Knative dropped their "build" system in favor of Tekton.

Because we are not able to simply consume Auto DevOps from GitLab (issues include the expectation of privileged containers and Docker-in-Docker firmly embedded in the setup), we are pursuing Tekton Pipelines.

Artifact Repository

Since Docker software is being dropped from many non-desktop implementations of cloud-native containers, what we are working with is usually described as OCI images and registries; Docker is a specific implementation and extension of the standard, with its own orchestration layers that are separate from the OCI spec. All that said, we need a registry to work with. Right now, Toolforge uses the same basic Docker registry that is used by production, but with local disk instead of Swift for storage, which is highly limiting. Since WMCS systems are not limited exclusively to Debian packaging, we could deploy the vastly popular and much more business-ready Harbor, which is maintained mostly by VMware. Harbor is deployed via docker-compose or (in a better, more fault-tolerant form) Helm, since it is distributed entirely as OCI images built on VMware's peculiar Photon OS. It was rejected by the main production SRE teams for that reason, because repackaging it would be a huge piece of work. Despite this odd deployment style, it enjoys wide adoption and contributions and is at 2.0 maturity. It can also be used to cache and proxy upstream registries. The downside of Harbor is that it is backed by PostgreSQL, and it doesn't inherently solve the storage problem. While it would be ideal to deploy it with Trove PostgreSQL and Swift storage, we could deploy it initially with a simple Postgres install and Cinder storage volumes. Ultimately, only Harbor has no real potential licensing concerns while also providing quotas.

Edit: We are testing Harbor as the most complete open source implementation of what we need. It has quotas, multi-tenancy and a solid authentication model that allows for "robot" accounts just like Quay and Docker Hub. We don't need to link it to LDAP since, ideally, users cannot write to it directly anyway; users should go through the build system only. Right now it is running for initial build and testing. It should be deployed using Helm on Kubernetes in the end, and it has the ability to back up its storage. As long as we are consuming the containerized version directly, this is a solved problem.

Dashboards and UX

This is going to depend on the implementation of build service a bit.

Deployment Service

If this sounds like "CD", you'd be right. However, initially it may be enough to just allow you to deploy your own images with webservice; once you have an image, you can deploy it however you want. Tekton is entirely capable of acting as a CD pipeline as well as CI, while kpack is a one-trick pony. We may want something entirely different as well, but while everyone still has shell accounts in Toolforge anyway, this seems like the item that can be pushed down the road.

Build Service Design

Tekton Pipelines is a general-purpose Kubernetes CRD system with a controller, another namespace and a whole lot of capability. To use it as a build system, we consume a fair amount of what it can do.

The currently functional (but not ready for production) manifests that are live in Toolsbeta are reviewable at


The controller and mutating webhook live in the tekton-pipelines namespace. Those are part of the upstream system. The webhook has a horizontal pod autoscaler to cope with bursts. The CRDs interact with those components to coordinate pods using the ClusterRole tekton-pipelines-controller-tenant-access (see RBAC).

Actual builds happen in the image-build namespace. Tool users cannot directly interact with any core resources there; only pipelines (where parameters to the git-clone task and buildpacks task are defined), pipelineruns (which actually make pipelines do things), PVCs (to pass the git checkout to the buildpack) and pipelineresources (to define images) will be accessible to Toolforge users. Those resources need further validation (via webhook) and are probably best defined by convenience interfaces (like scripts and webapps).
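The shape of such a pipeline, chaining the git-clone and buildpacks tasks through a shared workspace, might look roughly like this. The task and parameter names follow the Tekton catalog tasks this proposal uses, but the pipeline name, builder image and everything else here are assumptions for illustration:

```yaml
# Illustrative sketch: git-clone feeds a checkout to the buildpacks task
# via a shared workspace; the workspace is bound to a PVC at run time.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: buildpacks-pipeline     # hypothetical name
  namespace: image-build
spec:
  params:
    - name: SOURCE_URL
      type: string
    - name: APP_IMAGE
      type: string
  workspaces:
    - name: source-ws           # backed by a PVC supplied by the PipelineRun
  tasks:
    - name: fetch-repo
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: source-ws
      params:
        - name: url
          value: $(params.SOURCE_URL)
    - name: build-image
      taskRef:
        name: buildpacks-phases
      runAfter:
        - fetch-repo
      workspaces:
        - name: source
          workspace: source-ws
      params:
        - name: APP_IMAGE
          value: $(params.APP_IMAGE)
        - name: BUILDER_IMAGE
          value: paketobuildpacks/builder:base   # would have to be an approved builder
```

Because the builder image is a pipeline parameter rather than user-supplied free text, restricting it to an approved list is a natural job for the validating webhook.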

The required controller for the NFS subdir storage class is deployed into the nfs-provisioner namespace via helm.

Triggers for automated builds from git repos might be defined in tool namespaces.

Deployments are simply a matter of allowing the Harbor registry for tool pods. Automated deployment might be added as an optional final stage of pipelines when we are ready.

RBAC by subject

  • tekton-pipelines-controller service account
    • A ClusterRole for cluster access applied to all namespaces
    • A ClusterRole for tenant access (namespaced access) -- this provides access to run Tasks and Pipelines in the image-build namespace
  • tekton-pipelines-webhook service account
    • Uses a ClusterRole for mutating and validating webhook review
  • default service account in image-build namespace
    • git-clone task will run as this service account
  • tool accounts
    • require read access to tasks, taskruns, pipelines, pipelineresources and conditions in the image-build namespace
    • require read-write access to pipelineruns and persistent volume claims in image-build
  • buildpacks-service-account service account (for buildpack pods)
    • Will need root for an init container so it needs a special PSP as well as other appropriate rights
    • Has access to the basic-user-pass secret in order to push to Harbor.
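The tool-account bullets above translate fairly directly into a namespaced Role. This is only a sketch of what that Role could look like (the Role name is made up, and the exact verb list is an assumption):

```yaml
# Sketch of a namespaced Role matching the tool-account access described above.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tool-image-builder      # hypothetical name
  namespace: image-build
rules:
  # read-only on the shared build definitions
  - apiGroups: ["tekton.dev"]
    resources: ["tasks", "taskruns", "pipelines", "pipelineresources", "conditions"]
    verbs: ["get", "list", "watch"]
  # read-write on the resources tool users create themselves
  - apiGroups: ["tekton.dev"]
    resources: ["pipelineruns"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
```

A RoleBinding per tool (or one binding for the Toolforge group) would then attach this Role to tool accounts without granting anything outside image-build.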

Pod Security Policy (later to be replaced by OPA Gatekeeper)

Sorted out mostly, but to be added here.


To pass a volume from the git-clone task to the buildpacks task, you need a Tekton workspace. The workspace is backed by a persistent volume claim. It is much simpler to have a StorageClass manage the persistent volumes and automatically provision things than to have admins dole them out. The NFS subdir provisioner works quite well. The basic idea is giving it a directory from an NFS share the workers have access to and it will happily create subdirectories and attach them to pods, deleting them when the volume claim is cleaned up. For the most part, storage size provisioning seems to have little effect on unquota'd NFS volumes. As usual, tool users will need to simply be polite about storage use. This can later be replaced by cinder volumes (using the openstack provider) or at least an NFS directory with a quota on it to prevent harm.
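With the provisioner in place, a PipelineRun does not need a pre-created PVC; it can ask for one through a volumeClaimTemplate in its workspace binding. A fragment of what that could look like, assuming a workspace named source-ws and the provisioner's usual "nfs-client" StorageClass name (both assumptions here):

```yaml
# Fragment of a PipelineRun spec: binding a workspace to an
# auto-provisioned PVC. "nfs-client" is the nfs-subdir provisioner's
# conventional default StorageClass name, assumed rather than confirmed.
workspaces:
  - name: source-ws             # hypothetical workspace name
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: nfs-client
        resources:
          requests:
            storage: 1Gi        # size is largely advisory on unquota'd NFS
```

The claim (and its subdirectory on the share) is cleaned up when the PipelineRun's PVC is deleted, which matches the provisioner behavior described above.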

The image can be cached in our standard image repo, but an example values file would look like this:
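The full values file is collapsed in this revision; only two of its lines (replicaCount and strategyType) are visible. A partial reconstruction, with the chart's NFS settings shown as placeholders rather than the real share, might look like:

```yaml
# Partial reconstruction of the collapsed values.yaml. Only replicaCount
# and strategyType are visible in this revision; the nfs block is the
# chart's required setting, shown with placeholder values.
replicaCount: 1
strategyType: Recreate
nfs:
  server: nfs.example.wmcloud.org   # placeholder, not the real server
  path: /srv/tekton-workspaces      # placeholder path on the share
```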

It's deployed using (for example) helm --kube-as-user bstorm --kube-as-group system:masters --namespace nfs-provisioner install nfs-provisioner-beta-1 nfs-subdir-external-provisioner/nfs-subdir-external-provisioner -f volume-provisioner-values.yaml


  • webhook-certs
    • In the tekton-pipelines namespace. This may need renewal from time to time, and it is populated during installation of Tekton Pipelines. If this doesn't have a self-renewal setup, having a renewal workflow is a requirement for deployment in tools.
  • basic-user-pass
    • This is the main reason the buildpacks-service-account is needed. That's what has access to this secret. It should be the robot account for Harbor.
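For reference, a basic-auth secret that Tekton can hand to the export phase generally takes this shape. The registry URL and robot-account name below are placeholders, not the real values:

```yaml
# Sketch of the basic-user-pass secret consumed by buildpacks-service-account.
# The tekton.dev/docker-0 annotation tells Tekton which registry these
# credentials are for; the values shown are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: basic-user-pass
  namespace: image-build
  annotations:
    tekton.dev/docker-0: https://harbor.example.org  # placeholder registry
type: kubernetes.io/basic-auth
stringData:
  username: robot$toolforge-builds   # placeholder Harbor robot account
  password: "<robot-account-token>"  # placeholder
```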


The tasks needed are both taken straight from the Tekton catalog, with slight modification in the case of the buildpacks task.

  • git-clone 0.4
  • buildpacks phases 0.2
    • Using the buildpacks-phases task splits each phase into a distinct container. That means the docker credentials are only exposed by Tekton to the export phase, not detect and build (where user code is being operated on).