Ganeti is a cluster virtual server management tool built on top of existing virtualization technologies such as Xen or KVM and other open source software. Although it supports both hypervisors, at WMF only KVM is enabled. The primary Ganeti web page is http://www.ganeti.org/.
At WMF, Ganeti is used as a cluster management tool for production-network VPSes with services that help us run our clusters. After an evaluation of OpenStack vs Ganeti, Ganeti was chosen as the better fit for the job at hand.
The architecture of ganeti is described below
Ganeti is architected as a shared-nothing cluster with job management. There is one master node that receives all jobs to be executed (create a VM, delete a VM, stop/start VMs, etc.). The master role can be moved between a preconfigured number of master candidates in case of a hardware failure, so there is no single point of failure for cluster operations. For VM operations, provided the DRBD backend is used (as it is at WMF), VMs can be restarted with minimal disruption on their secondary (backup) node even in the case of catastrophic failure of a hardware node. There is also the notion of a nodegroup, which is simply a group of nodes: think of it as nothing more than a subcluster division. Operations are usually constrained within a nodegroup and do not cross its boundaries unless specifically instructed to.
A high level overview of the architecture is here http://docs.ganeti.org/ganeti/2.12/html/_images/graphviz-246e5775f608681df9f62dbbe0a5d4120dc75f1c.png and more discussion about it is in http://docs.ganeti.org/ganeti/2.12/html/design-2.0.html
A cluster is identified by:
- The nodes
- An FQDN (e.g. ganeti01.svc.eqiad.wmnet), which corresponds to an IPv4 address. That IPv4 address is "floating", meaning that it is owned by the current master.
Administration always happens via the master. It is the only node where all commands can be run and hosts the API. Failover of a master is easy but manual. See below for more information on how to do it.
Connect to a cluster
Just ssh to its FQDN
Init the cluster
An example of initializing a new cluster:
sudo gnt-cluster init \
  --no-ssh-init \
  --enabled-hypervisors=kvm \
  --vg-name=ganeti \
  --master-netdev=private \
  --hypervisor-parameters kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,serial_speed=115200,migration_bandwidth=64,migration_downtime=500,kernel_path= \
  --nic-parameters=link=private \
  ganeti01.svc.codfw.wmnet
The above is the way we currently have our clusters configured
Modify the cluster
Modifying the cluster to change defaults, parameters of hypervisors, limits, security model etc is possible. An example of modifying the cluster is given below.
sudo gnt-cluster modify -H kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,kernel_path=
To get an idea of what is actually modifiable do a:
sudo gnt-cluster info
and then look up the various options in the Ganeti documentation.
Destroy the cluster
Destroying the cluster is a one way street. Do not do it lightly. An example of destroying a cluster:
sudo gnt-cluster destroy --yes-do-it
Do note that various things will be left behind. For example, /var/lib/ganeti/queue/ will not be deleted. Whether to delete it is up to you, depending on the case.
Add a node
Adding a new hardware node to the cluster to increase capacity:
sudo gnt-node add ganeti1002.eqiad.wmnet
Listing cluster nodes
Listing all hardware nodes in a cluster:
sudo gnt-node list
That should return something like:
Node                   DTotal DFree  MTotal MNode MFree Pinst Sinst
ganeti1001.eqiad.wmnet 427.9G 427.9G 63.0G  391M  62.4G 0     0
ganeti1002.eqiad.wmnet 427.9G 427.9G 63.0G  289M  62.5G 0     0
ganeti1003.eqiad.wmnet 427.9G 427.9G 63.0G  288M  62.5G 0     0
ganeti1004.eqiad.wmnet 427.9G 427.9G 63.0G  288M  62.5G 0     0
The columns are respectively: Disk Total, Disk Free, Memory Total, Memory used by node itself, Memory Free, Instances for which this node is primary, instances for which this node is secondary
The nodegroup information can be obtained via a command like
sudo gnt-node list -o name,group
which returns the name and the group of each node:
Node                   Group
ganeti1001.eqiad.wmnet row_C
ganeti1002.eqiad.wmnet row_C
ganeti1003.eqiad.wmnet row_C
ganeti1004.eqiad.wmnet row_C
ganeti1005.eqiad.wmnet row_A
ganeti1006.eqiad.wmnet row_A
ganeti1007.eqiad.wmnet row_A
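To get a quick per-group summary from that listing, something like the following sketch could be used. The function name nodes_per_group is hypothetical, and it assumes the listing is produced with --no-headers so that every input line is just "name group".

```shell
# Count nodes per nodegroup from "name group" lines on stdin.
# Intended input (assumption): sudo gnt-node list --no-headers -o name,group
nodes_per_group() {
    awk 'NF >= 2 { count[$2]++ } END { for (g in count) print g, count[g] }' | sort
}
```

It would be fed like this: sudo gnt-node list --no-headers -o name,group | nodes_per_group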
Detecting the master node
The master node can be queried by running
sudo gnt-node list -o name,master_candidate,master
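If only the name of the current master is wanted, the output can be filtered. This is a minimal sketch: current_master is a hypothetical helper, and it assumes the listing is produced with --no-headers and that the master column is rendered as Y or N.

```shell
# Print the name of the node whose "master" column is Y.
# Intended input (assumption):
#   sudo gnt-node list --no-headers -o name,master_candidate,master
current_master() {
    awk '$3 == "Y" { print $1 }'
}
```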
View the job queue
Ganeti has a built-in job queue. Most of the time it works fine, but if something is taking too long it can help to check what is going on in the job queue by listing the jobs
gnt-job list
and getting a job ID from the result
gnt-job info <job_id>
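When the queue is long, it may help to narrow it down to jobs that have not finished yet. The sketch below assumes the queue is listed with gnt-job list --no-headers -o id,status and that unfinished jobs carry one of the statuses queued, waiting, running or canceling; unfinished_jobs is a hypothetical name.

```shell
# Print the IDs of jobs that have not finished yet.
# Intended input (assumption): sudo gnt-job list --no-headers -o id,status
unfinished_jobs() {
    awk '$2 == "queued" || $2 == "waiting" || $2 == "running" || $2 == "canceling" { print $1 }'
}
```

Each ID it prints can then be inspected with gnt-job info.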
Hardware/software upgrades on a Ganeti cluster can happen with zero downtime to VM operations. The procedure is outlined below, including how to empty a node in case a shutdown/reboot is needed.
Do the software upgrade (if needed)
Upgrade the software throughout the cluster. It should have no repercussions for any VM. Barring a Ganeti bug in the upgraded version, the cluster itself should also have no problems. Between minor versions (e.g. 2.12 -> 2.15) it may be required to run an upgrade script; read the changelog/upgrade notes. The Debian maintainer builds the package in a way that both versions will be installed, and until you run said script the old version will still be used.
Doing a rolling reboot of the cluster is easy. Empty every node, reboot it, check that it is online, then proceed to the next. The one thing to take care of is not to reboot the master without failing it over first.
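The rolling reboot can be sketched as a plan generator that only prints the steps, using the migrate and verify-disks commands described later on this page. The function name rolling_reboot_plan is hypothetical, and the reboot itself is left as a comment because this page does not prescribe how to trigger it.

```shell
# Print (do not run) the per-node steps of a rolling reboot.
# $1 is the current master, which is skipped; the remaining args are nodes.
rolling_reboot_plan() {
    master=$1
    shift
    for node in "$@"; do
        if [ "$node" = "$master" ]; then
            echo "# skip $node: fail over the master first"
            continue
        fi
        echo "sudo gnt-node migrate -f $node"
        echo "# reboot $node and wait until it is back online"
        echo "sudo gnt-cluster verify-disks"
    done
}
```

Printing the plan rather than executing it keeps the operator in control of each step, which seems prudent for an operation that touches every node.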
Failover the master
Choose a master candidate that suits you. You can list the master candidates with
sudo gnt-node list -o name,master_candidate
Then, on the chosen candidate (the command moves the master role to the node it is run on), run
sudo gnt-cluster master-failover
The cluster IP will now be served by the new node and the old one is no longer the master.
There may be times when the cluster looks, or actually is, unbalanced. That will be true after a rolling reboot of the nodes. Rebalancing is easy and baked into Ganeti; all it takes is running a command:
sudo hbal -L -X -G <node_group>
Please run it in a screen session. It might take quite a while to finish. The jobs have been submitted so it's fine losing that session but it's still prudent.
The cluster will calculate a current score, run some heuristic algorithms to try to minimize that score, and then execute the commands required to reach that state.
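Since the command takes one nodegroup at a time, a small helper can print the command for every group. This is only a sketch: hbal_commands is a hypothetical name, and it assumes the group names come from gnt-group list --no-headers -o name.

```shell
# Print one rebalancing command per nodegroup name read from stdin.
# Intended input (assumption): sudo gnt-group list --no-headers -o name
hbal_commands() {
    while read -r group; do
        if [ -n "$group" ]; then
            echo "sudo hbal -L -X -G $group"
        fi
    done
}
```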
Reboot/shutdown a node for maintenance
Select a node that needs a reboot/shutdown for brief hardware maintenance and empty it of primary instances:
sudo gnt-node migrate -f ganeti1004.eqiad.wmnet
sudo gnt-node list
should return 0 primary instances for the node. It is then safe to reboot it or shut it down for a brief amount of time for hardware maintenance.
After the reboot, before you migrate the next node, run
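The "0 primary instances" check can also be scripted instead of read off the table. The sketch below assumes the listing is restricted to the name and pinst_cnt fields with --no-headers; node_has_no_primaries is a hypothetical helper name.

```shell
# Succeed only if the node given as $1 is listed with zero primary instances.
# Intended input on stdin (assumption):
#   sudo gnt-node list --no-headers -o name,pinst_cnt
node_has_no_primaries() {
    awk -v node="$1" '
        $1 == node { found = 1; if ($2 != 0) bad = 1 }
        END { exit ((found && !bad) ? 0 : 1) }
    '
}
```

A node that does not appear in the input at all is treated as a failure, so a typo in the node name is not mistaken for an empty node.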
sudo gnt-cluster verify-disks
It should display "No disks need to be activated." (possibly multiple times, one per ganeti nodegroup) before the next node can be rebooted (this ensures that the DRBD is synced fully)
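A scripted version of this check might count the "all clear" lines and compare the count with the number of nodegroups. This is a sketch under the assumption that the message appears exactly once per nodegroup, as described above; disks_settled is a hypothetical name.

```shell
# Succeed if stdin contains the "all clear" line exactly $1 times
# (one per nodegroup). Intended input: sudo gnt-cluster verify-disks
disks_settled() {
    count=$(grep -c 'No disks need to be activated' || true)
    [ "$count" -eq "$1" ]
}
```

For a cluster with two nodegroups: sudo gnt-cluster verify-disks | disks_settled 2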
Shutdown a node for a prolonged period of time
Should the node be going down for an undetermined amount of time, also move the secondary instances
sudo gnt-node migrate -f <node_fqdn>
sudo gnt-node evacuate -s <node_fqdn>
The second command means moving around DRBD pairs and syncing disk data. It is bound to take a long time, so find something else to do in the meantime.
sudo gnt-node list
should return 0 for both primary instances as well as secondary instances. Before powering off the node we need to remove it from the cluster as well
sudo gnt-node evacuate -s <node_fqdn> # Removes it as a secondary as well
sudo gnt-node remove <node_fqdn>
NOTE: Do not forget to re-add it after it is fixed (if it ever is)
sudo gnt-node add <node_fqdn>
Failed hardware node
When a host is having problems (hardware/kernel/otherwise) and is unreachable, there are a number of possible avenues to solve the issue, but an empirically good way out is:
- Just powercycle the host. If that works, it's probably the fastest way out. Most services should be set up as highly available anyway; if one is not, we should either set it up that way or not care too much when it fails. If this works, you are done; if not, keep reading.
- If the above doesn't work (the node never comes back up), start the VMs on another host. This can be done with
sudo gnt-node failover -f <node_fqdn>
In some cases you may have to ignore the consistency checks by passing --ignore-consistency (this has never happened in our setup). Again, all important services are set up to be highly available (and could easily be reimaged), so this will only severely bite VMs that are not set up that way.
- Remove the host from the cluster
sudo gnt-node remove <node_fqdn>
- Debug/fix/RMA the node.
Create a VM
Creating a VM is easy. Most of the steps are the same as for production so keep in mind the regular process as well.
Assign a hostname/IP
Same process as for hardware. Assign the IP/hostname and make sure DNS changes are live before going forward. This means you will also get to decide which row this VM will live in. We don't have Ganeti on all rows, so make sure you allocate the IP on a row that has Ganeti. You can get the rows via
sudo gnt-group list
which should return something like the output below
Group Nodes Instances AllocPolicy NDParams
row_A 3     20        preferred   ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1,ovs_name=switch1, oob_program=
row_C 4     19        preferred   ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1,ovs_name=switch1, oob_program=
This means we can place VMs in rows A and C for this DC. Note the name of the group (e.g. row_A, row_C); you will need it when creating the VM.
WARNING: If you skip this, you risk assigning the VM to the wrong row and having to redo it
Create the VM (using makevm)
There is an interactive script called "makevm" which asks you the questions you need to answer before creating a VM and then creates it for you.
You can run it as "makevm" on a ganeti master (currently ganeti1001) and follow the prompt. In the end it will run the needed gnt-instance command and get the MAC address you need for the next step (adding the created VM to DHCP).
Here's an example of how this looks:
[ganeti1001:~] $ makevm
This is an interactive script to make it easier to create a Ganeti VM.
Please see https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM for more details.
Are you going to need a public IP? (y/n) n
Please enter the correct row. (A, B or C - gnt-group list to show) C
How many vCPUs do you need? 4
How much RAM do you need? (Gigabytes) 4
What disk size do you need? (Gigabytes) 20
How do you want to call the instance? (FQDN) analytics-tool1001.eqiad.wmnet
Based on your answers this is the full command to create the VM:
sudo gnt-instance add -t drbd -I hail --net 0:link=private --hypervisor-parameters=kvm:boot_order=network -o debootstrap+default --no-install -g row_C -B vcpus=4,memory=4g --disk 0:size=20g analytics-tool1001.eqiad.wmnet
Do you want to run it now? (y/n) y
Ok, running.
Mon Aug 20 21:46:43 2018 - INFO: No-installation mode selected, disabling startup
Mon Aug 20 21:46:47 2018 - INFO: Selected nodes for instance analytics-tool1001.eqiad.wmnet via iallocator hail: ganeti1001.eqiad.wmnet, ganeti1003.eqiad.wmnet
Mon Aug 20 21:46:48 2018 * creating instance disks...
Mon Aug 20 21:46:52 2018 adding instance analytics-tool1001.eqiad.wmnet to cluster config
Mon Aug 20 21:46:52 2018 adding disks to cluster config
Mon Aug 20 21:46:52 2018 - INFO: Waiting for instance analytics-tool1001.eqiad.wmnet to sync disks
..
Mon Aug 20 21:56:12 2018 - INFO: Waiting for instance analytics-tool1001.eqiad.wmnet to sync disks
Mon Aug 20 21:56:12 2018 - INFO: Instance analytics-tool1001.eqiad.wmnet's disks are in sync
Time to add the new instance to DHCP. Here's the MAC address:
NicMAC/0 aa:00:00:ed:b1:fd
At the end you will get a new MAC address which you can then add to DHCP to proceed with the OS install.
Create the VM (private IP)
gnt-instance add \
  -t drbd \
  -I hail \
  --net 0:link=private \
  --hypervisor-parameters=kvm:boot_order=network \
  -o debootstrap+default \
  --no-install \
  -g <nodegroup> \
  -B vcpus=<x>,memory=<y>g \
  --disk 0:size=<z>g \
  <fqdn>
Note that the VM will NOT be started. That's on purpose for now. <x>, <y>, <z> in the above are variables. The unit suffixes t, g, m denote terabytes, gigabytes and megabytes respectively. <nodegroup> is also a variable: it is the rack row you got from the gnt-group list command above. Valid values are therefore row_A, row_B, row_C, row_D.
Create the VM (public IP)
gnt-instance add \
  -t drbd \
  -I hail \
  --net 0:link=public \
  --hypervisor-parameters=kvm:boot_order=network \
  -o debootstrap+default \
  --no-install \
  -g <nodegroup> \
  -B vcpus=<x>,memory=<y>g \
  --disk 0:size=<z>g \
  <fqdn>
Note that the only difference between public/private IP is that the link is specified differently (public vs private). Everything else is exactly the same as above
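Because the two variants differ only in the link, the command line can be assembled from a handful of parameters, much like the makevm transcript above shows. The following is a minimal sketch, not the actual makevm implementation; vm_create_command is a hypothetical name, and it only prints the command rather than running it.

```shell
# Print the gnt-instance add command for the given parameters.
# Usage: vm_create_command <private|public> <nodegroup> <vcpus> <memory_gb> <disk_gb> <fqdn>
vm_create_command() {
    case $1 in
        private|public) ;;
        *) echo "link must be private or public" >&2; return 1 ;;
    esac
    # echo joins its arguments with single spaces, giving one command line
    echo "sudo gnt-instance add -t drbd -I hail --net 0:link=$1" \
        "--hypervisor-parameters=kvm:boot_order=network" \
        "-o debootstrap+default --no-install -g $2" \
        "-B vcpus=$3,memory=${4}g --disk 0:size=${5}g $6"
}
```

For example, vm_create_command private row_C 4 4 20 analytics-tool1001.eqiad.wmnet prints the same command as the makevm example.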
Get the MAC address of the NIC
gnt-instance info <fqdn> | grep MAC
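To get only the address itself rather than the whole matching line, a pattern match on the MAC format can be used. This sketch relies on grep -o, which GNU and BSD grep provide; extract_macs is a hypothetical name, and the exact layout of the info output is deliberately not assumed.

```shell
# Print every MAC address found on stdin, one per line.
# Intended input: gnt-instance info <fqdn>
extract_macs() {
    grep -oiE '([0-9a-f]{2}:){5}[0-9a-f]{2}'
}
```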
Get the MAC address
Adding the MAC address to DHCP works the same as usual. Use linux-host-entries.ttyS0-115200 for Ganeti VMs, otherwise you will not get a console.
Update autoinstall files
Same as usual. Do however add virtual.cfg to the configuration for a specific VM. Example: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/install-server/files/autoinstall/netboot.cfg;601720be51228f7eae3de17988b1afa8881a5bdb$71
Start the VM
gnt-instance start <fqdn>
and connect to the console
gnt-instance console <fqdn>
Ctrl+] to leave the console
Set boot order to disk
WARNING: Fail to do this and the VM will be stuck in an endless reboot, install, reboot loop
Assuming the installation is going well, you need to set the boot order back to disk before it finishes. This is a limitation of the current version of the Ganeti software and is expected to be solved (upstream is aware).
gnt-instance modify \
  --hypervisor-parameters=boot_order=disk \
  <fqdn>
Note: when the VM has finished installing, it will shutdown automatically. The Ganeti software includes HA checks and will promptly restart it. We rely on this behaviour to have the VM successfully installed. However, if you list the VMs during this phase you will see the VM in ERROR_down. Don't worry, this is expected.
Note 2: If you need to set a boot_order back to PXE to reinstall, it's "boot_order=network" (For KVM the boot order is either "floppy", "cdrom", "disk" or "network".)
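Since a mistyped boot order is easy to miss, a small wrapper could validate the value before building the command. This is a sketch with a hypothetical helper name, boot_order_command; it prints the command instead of running it.

```shell
# Print the command that sets the boot device, after validating the value.
# Usage: boot_order_command <floppy|cdrom|disk|network> <fqdn>
boot_order_command() {
    case $1 in
        floppy|cdrom|disk|network)
            echo "gnt-instance modify --hypervisor-parameters=boot_order=$1 $2"
            ;;
        *)
            echo "invalid boot order: $1" >&2
            return 1
            ;;
    esac
}
```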
Assign role to the VM in puppet
Delete a VM
Irrevocably deleting a VM is done via:
gnt-instance remove <fqdn>
Please remember to clean up DHCP/DNS entries afterwards
Shutdown/startup a VM
gnt-instance startup <fqdn>
gnt-instance shutdown <fqdn>
Note: The shutdown command uses ACPI to achieve a graceful shutdown of the VM. There is a 2 minute timeout, however, after which the VM will be forcefully shut down. If you prefer not to wait those 2 minutes, --timeout can be used like so:
gnt-instance shutdown --timeout 0 <fqdn>
Get a console for a VM
You can log into the "console" of a Ganeti instance via
gnt-instance console <fqdn>
The console can be left with "ctrl + ]"
Resize a VM
First make sure that the cluster has adequate space for whatever resource you want to increase (if you do want to increase and not decrease a resource). This is done manually by a combination of Grafana statistics for CPU/memory utilization and the output of gnt-node list for disk space utilization. After that you can issue the following command to increase/decrease the memory size and number of virtual CPUs assigned to a VM:
gnt-instance modify -B mem=<X>[gm],vcpus=<N> <fqdn>
where X and N are natural numbers. X can be suffixed by g or m for Gigabytes or Megabytes (please don't do Terabytes ;))
Adding space to an existing disk is possible. But do note that the resizing of partitions and filesystems is up to you, as Ganeti can't do it for you. The command would be:
gnt-instance modify --disk #:size=X[gmt] <fqdn>
where # is the disk number, starting from 0. You can get the disks allocated to a VM using gnt-instance info <fqdn>. Again X is a natural number suffixed for Gigabytes/Megabytes/Terabytes.
Adding a disk is also easy if you want to avoid the mess with having to resize partitions/filesystems. The command would be:
gnt-instance modify --disk add:size=X[gmt] <fqdn>
Again X is a natural number suffixed for Gigabytes/Megabytes/Terabytes.
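When comparing a requested size with the free space reported by the nodes, it can help to normalise the suffixed values to one unit. The sketch below assumes the suffixes are binary multiples (m, g and t mapping to mebibytes, gibibytes and tebibytes) and that a bare number means mebibytes; size_to_mib is a hypothetical name and only whole numbers are handled.

```shell
# Convert a size such as 512m, 4g or 1t to mebibytes (whole numbers only).
size_to_mib() {
    num=${1%[mgtMGT]}   # strip a single trailing unit suffix, if any
    case $1 in
        *[mM]) echo "$num" ;;
        *[gG]) echo $((num * 1024)) ;;
        *[tT]) echo $((num * 1024 * 1024)) ;;
        *) echo "$1" ;;   # assumption: no suffix means mebibytes
    esac
}
```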
Reinstall / Reimage a VM
Just like a physical server OS reinstall this will destroy the contents of the machine and requires appropriate netboot configs to be in place. Proceed with caution!
Shutdown the VM
gnt-instance shutdown <fqdn>
Set boot device to network
gnt-instance modify --hypervisor-parameters=boot_order=network <fqdn>
Start instance and attach to console (ctrl-] to detach)
gnt-instance start <fqdn> && gnt-instance console <fqdn>
After the OS install has finished (or while it is successfully under way from a separate terminal) set the boot device back to disk
gnt-instance modify --hypervisor-parameters=boot_order=disk <fqdn>
Finally, after the install finishes, boot the system into the fresh OS install and attach to the console (ctrl-] to detach)
gnt-instance start <fqdn> && gnt-instance console <fqdn>
All of the commands that have a Y/N prompt can be forced with a -f. For example the following will spare you the prompt
gnt-instance remove -f <fqdn>
All commands are actually jobs. If you would rather not wait at the prompt for the job to finish, --submit will do the trick:
gnt-instance shutdown --submit <fqdn>