Wikimedia Cloud Services team/EnhancementProposals/Toolforge Buildpack Implementation

Overview

Building on Wikimedia Cloud Services team/EnhancementProposals/Toolforge push to deploy and the initial PoC at Portal:Toolforge/Admin/Buildpacks, this is an effort to deliver a design document and plan for the introduction of a buildpack-based workflow.

Toolforge is a Platform-as-a-Service concept inside Cloud VPS, but it is heavily burdened by technical debt from outdated designs and assumptions. The latest implementations of Toolforge's most effectively curated services are all cloud native structures that depend on containerization and Kubernetes, moving away from shell logins and batch compute processing (like Grid Engine) that is heavily tied to NFS, so the clear way forward is a flexible, easy-to-use container system that launches code with minimal effort on the part of the user while providing a standardized way to contribute to the system itself. Today, the community-adopted and widely-used solution is Cloud Native Buildpacks, a CNCF project that was originally started by Heroku and Cloud Foundry based on their own work in order to move toward industry standardization (partly because those platforms, and similar ones like Deis, pre-dated Kubernetes, OCI and similar cloud native standards and therefore relied on LXC, tarballs and generally incompatible-with-the-rest-of-the-world frameworks). Since it is a standard and a specification, there are many implementations of that standard. The one used for local development is the pack command line tool, which is also orchestrated by many other integrations such as Gitlab, Waypoint and CircleCI. If those tools are not used, separate implementations can be consumed more directly, such as kpack (which is maintained mostly by VMWare) and a couple of tasks built into Tekton CI/CD that are maintained by the Cloud Native Buildpacks group directly (as pack is). Because Tekton is designed to be integrated into other CI/CD solutions and is fairly transparent about its backing organizations, it is a natural preference for Toolforge, as it should work with Jenkins, Gitlab, Github, etc.

To get an idea of what it "looks like" to use a buildpack, it can be thought of simply as an alternative to using a Dockerfile. It's an alternative that cares about the specific OCI image (aka Docker image) layers that are created and exported, not only to prevent an explosion of the space they use but also to make sure that the lower layers stay in a predictable state that can be rebased on top of when they are updated. The example given is almost always the pack command being used to deploy an app locally (see https://buildpacks.io/docs/app-journey/). This either looks like magic (if you are used to creating Dockerfiles) or like just another way to dockerize things, and it really doesn't explain a thing about what buildpacks are or why we'd use them. A better example would be to create a Node.js app on Gitlab.com, enable "Auto DevOps" (aka buildpacks), add a Kubernetes cluster to your Gitlab settings and watch it magically produce a CI/CD pipeline all the way to production without any real configuration. That's roughly what it looks like if you get it right. The trickier part is all on our end.

Concepts

Buildpacks are a specification as well as a piece of code that you can put in a repository. It is necessary to separate these two ideas in order to understand the ecosystem, partly because buildpacks don't do anything without an implementation of the lifecycle in a platform. A "buildpack" applies user code to a "stack" via a "builder" during the "lifecycle" to get code onto a "platform", which is also the term used to describe the full system. So when we talk about "buildpacks" we could be talking about:

  • the buildpacks standard and specification
  • a specific buildpack, like this one for compiling golang: https://github.com/paketo-buildpacks/go-build
  • a platform that implements the specification (like pack, Tekton or kpack)

This is not made easier to understand by the ongoing development of the standard, which now includes stackpacks: packs that can take root actions on layers (such as running package installs) before handing off to buildpacks.


Design

Requirements

  • Must run in a limited-trust environment
    • Elevated privileges should not be attainable since that breaks multi-tenancy and the ability to use Toolforge as a curated environment
      • This might seem obvious, but it needs to be stated since so many implementations of this kind of thing assume a disposable sandbox or a trusted environment.
  • Storage of artifacts must have quotas
    • It does not take a bad actor to fill a disk; it can easily happen by mistake
  • Selection of builders must be reviewed or restricted
    • Since the buildpack lifecycle is effectively contained in your builder, the builder selects the stack you use, what you can do and how you do it. Without controlling builder selection, the system cannot guarantee security maintenance, deduplication of storage via layer control, or rebasability during security response.
  • The system must at least be available in toolsbeta as a development environment, if not usable (or at least testable) on local workstations.
  • It must be possible for it to coexist with the current build-on-bastion system.
    • This implies that webservice is aware of it. See task T266901
  • This is really two problems: Build and Deploy
    • Once you have a build, we might need to decide how to deploy it. For now, that could just be a webservice command argument.
    • Deployment is the easier problem to solve, depending on how we want things to roll; it could even be automatic.

Nice-to-haves

  • A dashboard (or at least enough prometheus instrumentation to make one)
  • A command line to use from bastions that simplifies manual rebuilds

Components

Diagram of Build Service

[Diagram: Toolforge-build-service.png, showing the relationships between the components of Toolforge buildpacks and the build service]

Build Service

The specification and lifecycle must be implemented in a way that works with our tooling and enables our contributors without breaking the multitenancy model of Toolforge entirely. Effectively, that means it has to be integrated in a form of CI in Kubernetes (as it is or in a slightly different form). Buildpacks are not supported natively in Jenkins without help, and Gitlab may be on its way, but in order to build this out with appropriate access restrictions, a sound security model and full integration into WMCS operations, we likely need to operate our own small "CI" system to process buildpacks.

The build service is itself the centerpiece of the setup. It must be constructed so that it is the only user-accessible system that has credentials to the artifact (OCI) repository; otherwise, a user could bypass it entirely and push whatever they want. This is quite simple to accomplish if builds happen in a separate namespace where users (and potentially things like Striker) can only create kpack Image objects or Tekton builds, and tool users have no access to pods or secrets in that namespace. At that point, as long as admins have placed the correct secret in that namespace, builds will have the access they need. An admission webhook could insist that the creating user only references their own project in Harbor, is in the Toolforge group, only uses approved builders, etc.
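
To make that concrete, a validating admission webhook for the build objects could run checks roughly like the sketch below. This is only an illustration: the registry host, the group name, the approved builder names and the assumption that tool accounts look like "tool-<name>" are placeholders, not a settled design.

    # Sketch of a validating admission webhook for build objects (e.g. kpack Images).
    # Registry host, group name and approved builder names are placeholder assumptions.

    APPROVED_BUILDERS = {"toolforge-buster-builder"}   # hypothetical builder names
    REGISTRY = "harbor.toolsbeta.wmflabs.org"          # registry from the testing deployment


    def review(admission_review: dict) -> dict:
        """Take an AdmissionReview request and return an AdmissionReview response."""
        request = admission_review["request"]
        user = request["userInfo"]
        spec = request["object"].get("spec", {})   # the build object being created

        username = user.get("username", "")
        # Assumed convention: tool accounts are named "tool-<toolname>".
        tool = username[len("tool-"):] if username.startswith("tool-") else ""
        errors = []

        # The requesting user must be a Toolforge tool account.
        if "toolforge" not in user.get("groups", []):
            errors.append("only Toolforge tool accounts may create builds")

        # The output tag must live in the tool's own Harbor project.
        if not tool or not spec.get("tag", "").startswith(f"{REGISTRY}/{tool}/"):
            errors.append("image tag must point at the tool's own Harbor project")

        # Only builders curated by admins are allowed.
        if spec.get("builder", {}).get("name") not in APPROVED_BUILDERS:
            errors.append("builder is not on the approved list")

        return {
            "apiVersion": "admission.k8s.io/v1",
            "kind": "AdmissionReview",
            "response": {
                "uid": request["uid"],
                "allowed": not errors,
                "status": {"message": "; ".join(errors)},
            },
        }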

Tekton CI/CD would provide a good mechanism for this since it is built entirely for Kubernetes and is actively maintained with buildpacks in mind. It also has a dashboard available that may be useful to our users. Tekton provides tasks maintained by the Cloud Native Buildpacks organization directly, and it is the direction Red Hat is moving OpenShift as well, which guarantees some level of external contribution to that project (see https://www.openshift.com/learn/topics/pipelines). It is also worth noting that Knative dropped its "build" system in favor of Tekton as well.
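
For a sense of what driving a build through Tekton could look like, the sketch below submits a single TaskRun against the upstream catalog's "buildpacks" task using the Kubernetes Python client. The namespace, service account and image names are hypothetical, and the task and parameter names come from the upstream catalog, so they should be verified against whatever version we actually deploy.

    # Minimal sketch: submitting a buildpacks build as a Tekton TaskRun via the
    # Kubernetes API. Task/param names follow the upstream Tekton catalog
    # "buildpacks" task and are assumptions until pinned to a reviewed version.
    from kubernetes import client, config

    config.load_kube_config()

    taskrun = {
        "apiVersion": "tekton.dev/v1beta1",
        "kind": "TaskRun",
        "metadata": {"generateName": "build-mytool-", "namespace": "image-build"},
        "spec": {
            "serviceAccountName": "buildpacks-builder",   # holds the registry push secret
            "taskRef": {"name": "buildpacks"},
            "params": [
                {"name": "APP_IMAGE", "value": "harbor.toolsbeta.wmflabs.org/mytool/web:latest"},
                {"name": "BUILDER_IMAGE", "value": "harbor.toolsbeta.wmflabs.org/builders/toolforge-builder:latest"},
            ],
            # The "source" workspace must contain the tool's checked-out code.
            "workspaces": [
                {"name": "source", "persistentVolumeClaim": {"claimName": "mytool-source"}}
            ],
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="tekton.dev", version="v1beta1",
        namespace="image-build", plural="taskruns", body=taskrun,
    )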

Another option is kpack, which is a single-purpose build service for buildpacks that runs in non-privileged containers. It has the wonderful "Image" Kubernetes CRD, which can be set up to poll your repo for changes and simply start builds. If we used that, we'd basically make a dashboard in Grafana, and it would likely be just fine.
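
For comparison, the kpack route is even more declarative: one Image object describes the source to poll, the builder to use and the tag to push, and kpack triggers rebuilds on its own. The sketch below follows kpack's v1alpha2 API; the namespace, builder, Harbor project and repository URL are placeholders.

    # Minimal sketch of a kpack "Image" object that polls a git repo and rebuilds
    # automatically. Namespace, builder and Harbor project names are placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    image = {
        "apiVersion": "kpack.io/v1alpha2",
        "kind": "Image",
        "metadata": {"name": "mytool-web", "namespace": "image-build"},
        "spec": {
            "tag": "harbor.toolsbeta.wmflabs.org/mytool/web",   # where built images land
            "serviceAccountName": "buildpacks-builder",         # holds the push secret
            "builder": {"kind": "ClusterBuilder", "name": "toolforge-builder"},
            "source": {
                "git": {
                    "url": "https://gitlab.wikimedia.org/toolforge-repos/mytool.git",
                    "revision": "main",
                }
            },
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kpack.io", version="v1alpha2",
        namespace="image-build", plural="images", body=image,
    )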

Because we will not be able to simply consume Auto DevOps from Gitlab (some issues include https://docs.gitlab.com/ce/topics/autodevops/#private-registry-support and the expectation of privileged containers firmly embedded in the setup), we will probably pursue one of the two above-mentioned flexible routes unless we end up with a good reason not to. Gitlab's implementation has since pushed more deeply into buildpacks, but it still requires privileged containers running pack inside Docker. That breaks the most fundamental requirements for us. Since the two services above run inside Kubernetes, we can use webhook controllers to validate all inputs and Kubernetes RBAC to manage permissions.
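
As a sketch of what "Kubernetes RBAC to manage permissions" might mean in practice, a Role in the build namespace could grant tool accounts nothing beyond creating and reading their own build objects, with no access at all to pods or secrets there. The example below assumes the kpack route and uses placeholder names throughout.

    # Illustrative RBAC: tool accounts may create and inspect kpack Images in the
    # build namespace, but get no access to pods or secrets there. Names are
    # placeholders for whatever the final design settles on.
    from kubernetes import client, config

    config.load_kube_config()
    rbac = client.RbacAuthorizationV1Api()

    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "tool-image-builder", "namespace": "image-build"},
        "rules": [
            {
                "apiGroups": ["kpack.io"],
                "resources": ["images", "builds"],
                "verbs": ["create", "get", "list", "watch"],
            }
        ],
    }

    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "tool-image-builders", "namespace": "image-build"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": "tool-image-builder",
        },
        "subjects": [
            {"apiGroup": "rbac.authorization.k8s.io", "kind": "Group", "name": "toolforge"}
        ],
    }

    rbac.create_namespaced_role(namespace="image-build", body=role)
    rbac.create_namespaced_role_binding(namespace="image-build", body=binding)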

Artifact Repository

Since Docker software is being dropped from many non-desktop implementations of cloud native containers, what we are working with is usually talked about in terms of OCI images and registries. Docker is a specific implementation and extension of the standard, with its own orchestration layers that are separate from the OCI spec. All that said, we need a registry to work with, and right now Toolforge uses the same basic Docker registry that is used by production, except that we use local disk instead of Swift for storage. This is highly limiting. Since WMCS systems are not limited exclusively to Debian packaging, it is possible we could deploy the vastly popular and much more business-ready Harbor, which is maintained mostly by VMWare. Harbor is deployed via docker compose or (in better, more fault tolerant form) helm, since it is distributed entirely as OCI images built on VMWare's peculiar Photon OS. That is why it was rejected by the main production SRE teams: repackaging it would be a huge piece of work. Despite this odd deployment style, it enjoys wide adoption and contributions now and is at 2.0 maturity. It can also be used to cache and proxy to things like Quay.io. The downside of Harbor is that it is backed by PostgreSQL, and it doesn't inherently solve the storage problem. While it would be ideal to deploy it with Trove PostgreSQL and Swift storage, we could deploy it initially with a simple Postgres install and Cinder storage volumes. Ultimately, Harbor is the only option with no real potential licensing concerns that also has quotas.

Edit: We are testing Harbor as the most complete open source implementation of what we need. It has quotas, multi-tenancy and a solid authentication model that allows for "robot" accounts just like Quay and Docker Hub. We don't need to link it to LDAP since, ideally, users cannot write to it directly anyway; users should go through the build system only. Right now it is running at https://harbor.toolsbeta.wmflabs.org for initial build and testing. It should be deployed using helm on Kubernetes in the end, and it has the ability to back up to Quay.io. As long as we are consuming the containerized version directly, this is a solved problem.
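
To make the quota point concrete, Harbor's v2 REST API would let an admin process (or something like Striker) pre-create one project per tool with a hard storage limit. The sketch below follows the Harbor 2.x API as we understand it; the host, credentials, quota size and payload fields are assumptions to verify against the deployed version.

    # Sketch: pre-creating a per-tool Harbor project with a storage quota through
    # Harbor's v2 REST API. Host, credentials and quota size are placeholders, and
    # the payload fields should be checked against the deployed Harbor version.
    import requests

    HARBOR = "https://harbor.toolsbeta.wmflabs.org"
    ADMIN_AUTH = ("admin", "not-a-real-password")   # placeholder credentials


    def create_tool_project(tool: str, quota_gib: int = 2) -> None:
        """Create a private Harbor project for a tool with a hard storage limit."""
        resp = requests.post(
            f"{HARBOR}/api/v2.0/projects",
            auth=ADMIN_AUTH,
            json={
                "project_name": tool,
                "metadata": {"public": "false"},
                # storage_limit is in bytes; -1 would mean unlimited.
                "storage_limit": quota_gib * 1024**3,
            },
            timeout=30,
        )
        resp.raise_for_status()


    create_tool_project("mytool")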

Dashboards and UX

This is going to depend a bit on the implementation of the build service.


Deployment Service

If this sounds like "CD", you'd be right. However, initially it may be enough to just allow you to deploy your own images with webservice. Once you have an image, you can deploy however you want. Tekton is entirely capable of acting as a CD pipeline as well as CI, while kpack is a one-trick pony. We may want something entirely different as well, but while everyone in Toolforge still has shell accounts anyway, this seems like the item that can be pushed down the road.
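
To illustrate "deploy however you want", the eventual deployment step could be as simple as pointing an ordinary Kubernetes Deployment in the tool's namespace at the image the build service pushed. The sketch below is a bare-bones example with placeholder names, not a proposal for how webservice itself would do it.

    # Bare-bones sketch: deploying a buildpack-built image from Harbor as a normal
    # Kubernetes Deployment in the tool's namespace. All names are placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "mytool-web", "namespace": "tool-mytool"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": "mytool-web"}},
            "template": {
                "metadata": {"labels": {"app": "mytool-web"}},
                "spec": {
                    "containers": [
                        {
                            "name": "web",
                            "image": "harbor.toolsbeta.wmflabs.org/mytool/web:latest",
                            "ports": [{"containerPort": 8000}],
                        }
                    ]
                },
            },
        },
    }

    client.AppsV1Api().create_namespaced_deployment(namespace="tool-mytool", body=deployment)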