{{Kubernetes nav}}
Dragonfly is a P2P-based file distribution system that we use to distribute docker image layers to Kubernetes worker nodes. It was added to our infrastructure to overcome the issue of overloaded [[Docker-registry]] nodes when big deployments (in terms of number of replicas) that also use big docker images (in terms of layer size) are rolled out (read: MediaWiki).
 
Dragonfly consists of multiple components:
 
* '''supernode''': The supernode is a service running on dedicated hosts ([[Ganeti|Ganeti VMs]]) that acts as a tracker and scheduler for the P2P network.
* '''dfget''': The download client (wget-like) that at the same time acts as a peer in the P2P network (see the sketch below this list).
* '''dfdaemon''': A local HTTP(S) proxy between the docker container engine and the docker registry. It filters out requests for (specific) layers and uses dfget to download them via the P2P network instead.
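
As a rough illustration, a dfget invocation similar to what dfdaemon performs internally could look like this (the blob URL, output path and supernode address are made-up placeholders; the flags come from the upstream dfget CLI):

<syntaxhighlight lang="bash">
# Download a single docker image layer blob via the P2P network.
# --node tells dfget which supernode(s) to register with (hypothetical hostname).
dfget --url "https://docker-registry.discovery.wmnet/v2/example/image/blobs/sha256:abc123" \
      --output /tmp/layer.blob \
      --node dragonfly-supernode.example.wmnet:8002
</syntaxhighlight>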
 
For more complete documentation of Dragonfly's design and implementation, please refer to: https://github.com/dragonflyoss/Dragonfly/blob/master/docs/design/design.md
 
You may also want to watch the introduction to Dragonfly from KubeCon 2019: https://www.youtube.com/watch?v=LcxBgmmeA80
 
== Operations ==
We currently run one supernode in each datacenter (listening on tcp/8002); all Kubernetes nodes (P2P peers) in a datacenter use that datacenter's supernode to form the P2P network. This means Dragonfly P2P networks do not (and should not) span datacenters.
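
To quickly check from a peer that its datacenter-local supernode is reachable, something like the following should work (the hostname is a placeholder; the real one is configured via the dragonfly::dfdaemon puppet class, and netcat is assumed to be installed):

<syntaxhighlight lang="bash">
# tcp/8002 is the supernode port the peers register with
nc -zv dragonfly-supernode.example.wmnet 8002
</syntaxhighlight>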
 
On each Kubernetes node we run dfdaemon (listening on tcp/65001) as an HTTPS proxy between dockerd and the docker registries. dfdaemon is configured with a TLS certificate that contains the alt name docker-registry.discovery.wmnet, so that connections from dockerd can be transparently hijacked and potentially re-routed through the P2P network. dfdaemon does this by spawning multiple instances of dfget to download from the P2P network, plus one instance to serve parts (4MB chunks of docker image layers) to it. The latter listens on tcp/15001 for connections from other peers (for around 5 minutes; after that period of inactivity the peer unregisters itself and removes the cached chunks from disk).
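
A minimal sketch for verifying this on a node (the openssl invocation assumes OpenSSL 1.1.1+ for the -ext flag, and that the proxy port itself terminates TLS as described above):

<syntaxhighlight lang="bash">
# dfdaemon should listen on tcp/65001; tcp/15001 only shows up while a peer is active
sudo ss -tlnp | grep -E ':(65001|15001)'

# the certificate served on the proxy port should carry the
# docker-registry.discovery.wmnet alt name
echo | openssl s_client -connect localhost:65001 \
  -servername docker-registry.discovery.wmnet 2>/dev/null |
  openssl x509 -noout -ext subjectAltName
</syntaxhighlight>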
 
If a supernode fails, dfdaemon on each P2P peer will direct traffic straight to the "source" of the requested data (e.g. the docker-registry) instead of failing. This means that in case of an issue with the P2P network, all docker daemons will pull (more or less) directly from the docker-registry again - potentially exhausting its network links.
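
Because the fallback is transparent to clients, a plain image pull on the node is a reasonable end-to-end check in both the healthy and the degraded case (the image name below is only an example):

<syntaxhighlight lang="bash">
# pulls from this registry are routed through dfdaemon on Kubernetes nodes
sudo docker pull docker-registry.discovery.wmnet/wikimedia-buster:latest
</syntaxhighlight>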
 
=== Monitoring / Logging ===
Monitoring currently relies on Icinga to watch the state of the systemd services on the supernodes as well as on the P2P peers (dfdaemon).
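
To inspect those units by hand (the exact unit names may differ; listing them first avoids guessing):

<syntaxhighlight lang="bash">
# list the dragonfly units and their state (works on supernodes and on peers)
systemctl list-units 'dragonfly*'

# then, for example:
sudo systemctl status 'dragonfly*'
</syntaxhighlight>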
 
There is a Grafana dashboard with some metrics at https://grafana-rw.wikimedia.org/d/CmbiPADWz/dragonfly
 
==== Where to look for logs ====
* supernode: <code>/var/lib/dragonfly-supernode/logs/app.log</code>
* peer
** dfdaemon: <code>/var/lib/dragonfly-dfdaemon/logs/dfdaemon.log</code>
** dfget (downloading chunks): <code>/var/lib/dragonfly-dfdaemon/dfget/logs/dfclient.log</code>
** dfget (serving chunks): <code>/var/lib/dragonfly-dfdaemon/dfget/logs/dfserver.log</code>
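
For example, to follow all peer-side logs at once while debugging an image pull:

<syntaxhighlight lang="bash">
sudo tail -f /var/lib/dragonfly-dfdaemon/logs/dfdaemon.log \
  /var/lib/dragonfly-dfdaemon/dfget/logs/dfclient.log \
  /var/lib/dragonfly-dfdaemon/dfget/logs/dfserver.log
</syntaxhighlight>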
 
=== Disable the use of Dragonfly on a Kubernetes node ===
<syntaxhighlight lang="bash">
# stop puppet from restoring the dfdaemon proxy configuration
sudo disable-puppet 'disable dragonfly'
# remove the local override that points dockerd at the dfdaemon proxy
sudo systemctl revert docker.service
sudo systemctl restart docker.service
</syntaxhighlight>
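
To re-enable Dragonfly afterwards, revert those steps (a sketch: enable-puppet must be given the same message used for disabling, and the puppet run restores the docker.service override):

<syntaxhighlight lang="bash">
sudo enable-puppet 'disable dragonfly'
sudo run-puppet-agent
sudo systemctl restart docker.service
</syntaxhighlight>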


== Packaging ==
The code is hosted in operations/debs/dragonfly and uses the git-buildpackage workflow.

=== Importing a new version ===
The imported upstream tarballs should include the complete vendor directory.

* Check out the version (git tag) to import:
<syntaxhighlight lang="bash">
./debian/repack vX.Y.Z
</syntaxhighlight>
* This drops you into a shell with the git tag checked out. Make the necessary changes here and commit:
<syntaxhighlight lang="bash">
go mod vendor
git add -f vendor
# make sure you only changed vendor/
git diff --name-status --cached | grep -v 'vendor/'
git commit -m "added vendor"
</syntaxhighlight>
* Exiting the shell will build a tarball to import:
<syntaxhighlight lang="bash">
gbp import-orig /path/to/tarball.tar.xz
</syntaxhighlight>
* Push the changes (including the tag created by gbp) to gerrit:
<syntaxhighlight lang="bash">
git push gerrit --all
git push gerrit --tags
</syntaxhighlight>
* Add a debian/changelog entry (as CR):
<syntaxhighlight lang="bash">
gbp dch
# edit debian/changelog
git commit
git review
</syntaxhighlight>

=== Building a new version ===
* Check out the git repo on the build host:
<syntaxhighlight lang="bash">
git clone "https://gerrit.wikimedia.org/r/operations/debs/dragonfly" && cd dragonfly
</syntaxhighlight>
* Build the package:
<syntaxhighlight lang="bash">
BACKPORTS=yes gbp buildpackage --git-pbuilder --git-no-pbuilder-autoconf --git-dist=buster -sa -uc -us
</syntaxhighlight>

=== Publish a new version ===
<syntaxhighlight lang="bash">
# on apt1001
rsync -vaz deneb.codfw.wmnet::pbuilder-result/buster-amd64/dragonfly* .
sudo -i reprepro -C main include buster-wikimedia /path/to/<PACKAGE>.changes

# Copy the package over to other distros (this is possible because the packages only contain static binaries)
sudo -i reprepro copysrc stretch-wikimedia buster-wikimedia dragonfly
</syntaxhighlight>

=== Patches ===
If you need to add or update patches, please see: https://honk.sigxcpu.org/projects/git-buildpackage/manual-html/gbp.patches.html

== Resources ==
* https://github.com/dragonflyoss/Dragonfly
* [https://github.com/dragonflyoss/Dragonfly/tree/master/docs Documentation]
* [https://grafana-rw.wikimedia.org/d/CmbiPADWz Grafana Dashboard]
* https://doc.wikimedia.org/puppet/puppet_classes/dragonfly_3A_3Adfdaemon.html
* https://doc.wikimedia.org/puppet/puppet_classes/dragonfly_3A_3Asupernode.html

Open issues:
* [https://github.com/dragonflyoss/Dragonfly/issues/1557 dfget may leak credentials (via --header flag) when called by dfdaemon]
* [https://github.com/dragonflyoss/Dragonfly/issues/1559 supernode writes debug logs even if configured with debug: false]
* [https://github.com/dragonflyoss/Dragonfly/issues/1558 Even with cdnPattern: source, dfget connects to the downloadPort]

[[Category:Kubernetes]]
[[Category:Services]]
[[Category:Docker]]