You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Ganeti: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Alexandros Kosiaris
imported>Volans
m (Updated group naming from row_FOO to FOO)
(80 intermediate revisions by 23 users not shown)
Line 1: Line 1:
= Introduction =
{{ptag|Ganeti}}
Ganeti is a cluster virtual server management software tool built on top of existing virtualization technologies such as Xen or KVM and other open source software. It supports both KVM and Xen. In WMF we only have KVM as an enabled hypervisor. Primary Ganeti web page is [[http://www.ganeti.org/]].
'''Ganeti''' is a clustered virtual machine management software tool built on top of existing virtualization technologies such as Xen or KVM and other open source software. It supports both KVM and Xen. At WMF we only have KVM as an enabled hypervisor. Primary Ganeti web page is http://www.ganeti.org/.


In WMF ganeti is used as a cluster management tool for a private VPS cloud installation. After an evaluation process of Openstack vs Ganeti, Ganeti was chosen as a more fitting software for the job at hand.
At WMF, ganeti is used as a cluster management tool for production-network VPSes with services that help us run our clusters. After an evaluation process of Openstack vs Ganeti, Ganeti was chosen as a more fitting software for the job at hand.


= Architecture =
= Architecture Overview =


== Overview ==
Ganeti is architected as a shared nothing virtualization cluster with job management. There is one master node (per-site) that receives all jobs to be executed (create a VM, delete a VM, stop/start VMs, etc).  The master node can be swapped between a preconfigured number of master candidates in case of a hardware failure. This avoids a single point of failure for cluster management operations.
Ganeti is architected as a shared nothing cluster with job management. There is one master node that receives all jobs to be executed (create a VM, delete a VM, stop/start VMs, etc) that can be swapped between a preconfigured number of master candidates in case of a hardware failure. That allows for no single point of failure for cluster operations. For VMs operations, provided the DRBD backend is used, which we do in WMF, even in the case of catastrophic failure for a hardware node, VMs can be restarted with minimal disruption on their secondary (backup) node.


A high level overview of the architecture is here [[http://docs.ganeti.org/ganeti/2.12/html/_images/graphviz-246e5775f608681df9f62dbbe0a5d4120dc75f1c.png]] and more discussion about it is in [[http://docs.ganeti.org/ganeti/2.12/html/design-2.0.html]]
For VM storage we use the DRBD backend. This provides "RAID-1 over the network" block devices which ensure that in the event of failure of a single hardware node, VMs may be restarted with minimal disruption on their secondary (backup) node. This also provides the capability to migrate VMs between hardware nodes (e.g. for hardware maintenance) without requiring a centralized storage backend.


== Specifics ==
Within a cluster, there's the notion of a node group. That's practically a group of hardware nodes. Think of it as nothing more than a subcluster division. Operations are usually constrained within a node group and do not cross the boundaries unless specifically instructed to.  In our deployment, node groups are aligned with the physical row location of the hardware inside the datacenter (where possible).  This allows us to diversify VMs across physical row locations within a single site, taking advantage of the additional of fault tolerance this can provide.  In caching pop locations there is a single, "default" node group.


A cluster is identified by:
A high level overview of the architecture is here https://docs.ganeti.org/docs/ganeti/2.14/html/_images/graphviz-246e5775f608681df9f62dbbe0a5d4120dc75f1c.png and more discussion about it is in https://docs.ganeti.org/docs/ganeti/2.14/html/design-2.0.html


* The nodes
= Ganeti Clusters =
* An FQDN (e.g ganeti01.svc.eqiad.wmnet), which obviously corresponds to an IPv4 address. That IPv4 address is "floating", meaning that it is owned by the current master.
 
A cluster is identified by an FQDN (e.g ganeti01.svc.eqiad.wmnet) which corresponds to an IPv4 address. That IPv4 address is "floating", meaning that it is owned by the current master.


Administration always happens via the master. It is the only node where all commands can be run and hosts the API. Failover of a master is easy but manual. See below for more information on how to do it.
Administration always happens via the master. It is the only node where all commands can be run and hosts the API. Failover of a master is easy but manual. See below for more information on how to do it.


=== Connect to a cluster ===
== Hardware Conventions ==
 
As of Dec 2019 we have standardized on raid-5 host configuration in eqiad/codfw and an SSD backed raid-1 configuration in caching pop sites.
 
== Connect to an existing cluster ==
 
Most operations are run from the Ganeti master node, but you can log into any of the virtualisation nodes via standard SSH. The MOTD of each Ganeti server will print the Ganeti master for the cluster.
 
== Building a new cluster ==
 
=== Host Preparation ===
 
==== Node should be in the insetup role ====
 
New nodes should be initially installed with the puppet role ''insetup''.  At a later point in this procedure, after several manual actions, they'll be switched to the real ''ganeti'' role.
 
==== Ensure hardware virtualization is enabled in BIOS ====
 
Hardware virtualization must be enabled for KVM.  If you encounter errors relating to KVM, double check that hardware virtualization is enabled inside BIOS > System Settings > Processor Settings
 
==== Ensure 'ganeti' Volume Group is present ====
 
Our current ganeti partman layout creates a large raid device and formats it as swap to work around a shortcoming in the debian installer (fixme)
 
When a new ganeti host is built, you must manually remove the large swap device on /dev/md2, create a PV on /dev/md2, and create a VG called ‘ganeti’.  Be sure to remove the stale swap entry from /etc/fstab.
 
==== Ensure network bridge(s) are present ====
 
Bridges are used to connect virtual machines to the network interface of the hardware node.  Currently these are manually configured.  Each bridge corresponds to a network VLAN, on either a physical or tagged interface.
 
Make sure to install ''bridge-utils''. It's a dependency of Ganeti, but not installed by default, so if you're setting up the bridge before the role gets applied, ''bridge-utils'' isn't installed and networking.service will fail with a cryptic/misleading error message.
 
Looking at the simpler case of the caching pop deployment, there is a single bridge called ‘private’ that corresponds with private network interface.  This bridge interface assumes the IP of the node.
 
Example configuration:
 
  #/etc/network/interfaces
 
  # The loopback network interface
  auto lo private
  iface lo inet loopback
 
  iface ens3f0np0 inet manual
 
  iface private inet static
      bridge_ports ens3f0np0
      bridge_stp off
      bridge_maxwait 0
      bridge_fd 0
          address 10.20.0.31/24
          gateway 10.20.0.1
          dns-nameservers 10.3.0.1
          dns-search ulsfo.wmnet
          up /sbin/ip token set ::10:20:0:31 dev private
          up ip addr add 2620:0:862:102:10:20:0:31/64 dev private
 
  ## The original network interface configuration
  #allow-hotplug ens3f0np0
  #iface ens3f0np0 inet static
  # address 10.20.0.31/24
  # gateway 10.20.0.1
  # dns-nameservers 10.3.0.1
  # dns-search esams.wmnet
  #    pre-up /sbin/ip token set ::10:20:0:31 dev ens3f0np0
  #    up ip addr add 2620:0:862:102:10:20:0:31/64 dev ens3f0np0
 
Before editing the interfaces file, it's important to disable the puppet agent and leave it disabled until the role switch in a later step.  If the agent re-runs the ''insetup'' role after the changes above, it will modify the file in a way that breaks networking.
 
After editing the interfaces file, a ''systemctl restart networking.services'' is required in order to take effect.
 
=== Generate an RAPI certificate for the cluster ===
 
Ganeti RAPI TLS certificates are generated and manged with [[Cergen]].  You'll need to...
 
* Add an entry to the ganeti yaml certificate manifest. Don't use a password as ganeti-rapi did not support this (at time of writing)
* Generate the certificate
* Add the key file to the puppet private/secret ssl certificate store
* Add a dummy key file to the labs/private ssl certificate store. !!! do not use the real key here, use a dummy key!!!  See other existing ganeti keys in labs/private for an example
* Add the public certificate to the puppet files/ssl certificate store.
 
=== Create a SVC record for new cluster ===
 
The SVC record is used for the floating IP owned by a Ganeti master of a cluster. You need to create a new record following https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox
 
After that the sre.dns.netbox cookbook needs to be run.
 
 
Note that in eqiad/codfw the Netbox-generated zone is only used for reverse looksup, for forward resolution you need to add an entry to the dns.git repo (In the edges, all zones are managed in Netbox).
 
=== Assign the Ganeti role ===


Just ssh to its FQDN
Assign the "ganeti" role to the servers. The initial puppet run will not complete entirely, but that will be rectified when gnt-cluster is run later.


  ssh ganeti01.svc.eqiad.wmnet
==== Ensure intra-cluster SSH is permitted by the iptables/ferm firewall ====


= Cluster Operations =
Cluster nodes must be able to ssh to one another in order for Ganeti to function.  Double-check that puppet has configured this by attempting to SSH between the Ganeti cluster hosts being deployed.
== Init the cluster ==
 
=== Initialize the new cluster ===
An example of a initializing a new cluster:  
An example of a initializing a new cluster:  


Line 34: Line 124:
   --enabled-hypervisors=kvm \
   --enabled-hypervisors=kvm \
   --vg-name=ganeti \
   --vg-name=ganeti \
   --master-netdev=br0 \
   --master-netdev=private \
   --hypervisor-parameters kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,serial_speed=115200,kernel_path= \
   --hypervisor-parameters kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,serial_speed=115200,migration_bandwidth=64,migration_downtime=500,kernel_path= \
   --nic-parameters=link=br0 \  
   --nic-parameters=link=private \  
   ganeti01.svc.codfw.wmnet
   ganeti01.svc.codfw.wmnet


The above is the way we currently have our clusters configured
The above is the way we currently have our clusters configured


== Modify the cluster ==
=== Synchronize ssh_host_rsa_key across cluster nodes ===
 
<del>
Since the ganeti service address floats between master candidate nodes, the host key for the cluster service address remain the same across master changes.  To accomplish this we must synchronize /etc/ssh/ssh_host_rsa_key and /etc/ssh/ssh_host_rsa_key.pub across the Ganeti cluster in that site. </del>
 
<del>
Copy the contents of /etc/ssh/ssh_host_rsa_key and /etc/ssh/ssh_host_rsa_key.pub from the master node to the other nodes within the cluster, add a known_hosts entry for this shared host key to /var/lib/ganeti/known_hosts.  Then bounce the sshd on each node.
</del>
 
=== Add nodes to new cluster (or extend an existing one ===
 
Nodes can be added to the cluster using the '''sre.ganeti.addnode''' cookbook. It runs various consistency checks (whether virtualization is enabled in the BIOS, if the ganeti VG is present and if all the bridges (public/private and analytics in eqiad) are correctly set up) before making the change.
 
Don't forget to also register the new node in Hiera (profile::ganeti::nodes), this variable gets read to setup the Ferm rules.
 
===Verify cluster health===
 
Run `gnt-cluster verify` to perform a health check on the cluster.  Address any errors that may occur before moving on.
 
===Rename the Ganeti group ===
 
By default the Ganeti group is created as "default". You can run the following to use a more consistent name: (this name also needs to be added to CLUSTERS_AND_ROWS in spicerack.git/spicerack/ganeti.py)
 
sudo gnt-group rename default D
 
===Ensure microcode flags are passed down to VM instances ===
 
In addition we need to adapt the QEMU settings so that new microcode flags are passed down to the instance:
 
    sudo gnt-cluster modify -H kvm:cpu_type=IvyBridge\\,+pcid\\,+invpcid\\,+spec-ctrl\\,+ssbd\\,+md-clear
 
=== Configure a static KVM machine type ===
 
This makes future OS updates simpler:
 
    sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-5.2
 
===Create a test VM===
 
The cluster should now be ready for VM creation, follow the 'create a VM' steps below and address any errors that may occur.
 
==Modify an existing cluster==


Modifying the cluster to change defaults, parameters of hypervisors, limits, security model etc is possible. An example of modifying the cluster is given below.
Modifying the cluster to change defaults, parameters of hypervisors, limits, security model etc is possible. An example of modifying the cluster is given below.
Line 51: Line 182:
   sudo gnt-cluster info
   sudo gnt-cluster info


and then lookup in ganeti documentation the various options [[http://docs.ganeti.org/ganeti/2.7/html/man-gnt-cluster.html]]
and then lookup in ganeti documentation the various options [http://docs.ganeti.org/ganeti/2.7/html/man-gnt-cluster.html]


== Destroy the cluster ==
==Destroy the cluster==


Destroying the cluster is a one way street. Do not do it lightly. An example of destroying a cluster:
Destroying the cluster is a one way street. Do not do it lightly. An example of destroying a cluster:
Line 61: Line 192:
do note that various things will be left behind. For example /var/lib/ganeti/queue/ will not be deleted. It's up to you if you want to delete it or not, depending on the case.
do note that various things will be left behind. For example /var/lib/ganeti/queue/ will not be deleted. It's up to you if you want to delete it or not, depending on the case.


== Add a node ==
==Listing cluster nodes ==
 
Adding a new hardware node to the cluster to increase capacity
 
  sudo gnt-node add ganeti1002.eqiad.wmnet
 
== Listing cluster nodes ==


Listing all hardware nodes in a cluster:
Listing all hardware nodes in a cluster:
Line 81: Line 206:
   ganeti1004.eqiad.wmnet 427.9G 427.9G  63.0G  288M 62.5G    0    0
   ganeti1004.eqiad.wmnet 427.9G 427.9G  63.0G  288M 62.5G    0    0


The columns are respectively: Disk Total, Disk Free, Memory Toal, Memory used by node itself, Memory Free, Instances for which this node is primary, instances for which this node is secondary
The columns are respectively: Disk Total, Disk Free, Memory Total, Memory used by node itself, Memory Free, Instances for which this node is primary, instances for which this node is secondary
 
The nodegroup information you be obtained via a command like
    sudo gnt-node list -o name,group
 
which would return the name and the group of node
 
    Node                  Group
    ganeti1001.eqiad.wmnet C
    ganeti1002.eqiad.wmnet C
    ganeti1003.eqiad.wmnet C
    ganeti1004.eqiad.wmnet C
    ganeti1005.eqiad.wmnet A
    ganeti1006.eqiad.wmnet A
    ganeti1007.eqiad.wmnet A
To list all the hosts in a given group:


<code>sudo gnt-instance list -o name -F "pnode.group == 'B'"</code>


== View the job queue ==
To check which row is underused and where you should put a new VM:
 
<code>[ganeti1009:~] $ for row in A B C D; do echo "row ${row}: $(sudo gnt-instance list -o name -F "pnode.group == '${row}'" | wc -l) VMs"; done</code>
 
==Detecting the master node==
 
The current master node of a cluster is printed in the MOTD when logging into a Ganeti server. It can also be detected with
 
  sudo gnt-node list -o name,master_candidate,master
 
== View the job queue==


Ganeti has a job queue built-in. Most of the times it's working fine but if something is taking too long it might be helpful to check what's going on in the job queue
Ganeti has a job queue built-in. Most of the times it's working fine but if something is taking too long it might be helpful to check what's going on in the job queue
Line 94: Line 245:
   gnt-job info #ID
   gnt-job info #ID


== Cluster upgrades ==
==Cluster upgrades==


Hardware/software upgrades on a ganeti cluster can happen with 0 downtime to the VMs operations. The procedure to do so is outlined below. In case a shutdown/reboot is needed the procedure to empty to node is described. The rolling  
Hardware/software upgrades on a ganeti cluster can happen with 0 downtime to the VMs operations. The procedure to do so is outlined below. In case a shutdown/reboot is needed the procedure to empty to node is described in the rolling reboot section


=== Do the software upgrade (if needed) ===
===Do the software upgrade (if needed)===


   apt-get safe-upgrade
   apt upgrade, apt install <component>, debdeploy, whatever is the correct method


throughout the cluster. It should have 0 repercussions to any VM anyway. Barring a Ganeti bug in the upgraded version, the cluster itself should also have 0 problems
throughout the cluster. It should have 0 repercussions to any VM anyway. Barring a Ganeti bug in the upgraded version, the cluster itself should also have 0 problems. Between minor versions (e.g. 2.12 -> 2.15) it may be required to run some upgrade script. Read the changelog/upgrades notes. The Debian maintainer builds the package in a way that both versions will be installed and until you run said script the old version will still be used.


=== Rolling reboot ===
===Rolling reboot===


Doing a rolling reboot of the cluster is easy. Empty every node, reboot it, check that it is online, proceed to the next. The one thing to take care is to not reboot the master without failing it over first.
Doing a rolling reboot of the cluster is easy. Empty every node, reboot it, check that it is online, proceed to the next. The one thing to take care is to not reboot the master without failing it over first.


=== Failover the master ===
===Failover the master===


Choose a master candidate that suits you. You can get master candidates by  
Choose a master candidate that suits you. You can get master candidates by  
Line 119: Line 270:


The cluster IP will now be served by the new node and the old one is no longer the master.
The cluster IP will now be served by the new node and the old one is no longer the master.
===Cluster rebalancing===
There might be a time where the cluster will look/actually be unbalanced. That will be true after a rolling reboot of the nodes. Doing a rebalancing is easy and baked into ganeti, all it takes is running a command
  sudo hbal -L -X -G <node_group>
Please run it in a screen session. It might take quite a while to finish. The jobs have been submitted so it's fine losing that session but it's still prudent.
A list of node groups can be displayed with "sudo gnt-group list".
The cluster will calculate a current score, run some heuristic algorithms to try and minimize that score and then execute the commands require to reach that state.
==Cluster certificates==
Ganeti creates and manages an internal Certificate Authority (CA). This CA is used for internal component communication and it's overall rather well hidden in day to day operations. There is a CA and node (aka client) certs that are distributed to the nodes. All nodes have the CA and the key to the CA, it's part of the ability to assume the master functionality. We 've decided to not mess with this CA as it is internal to the software and cluster state and we don't want to expose the key of the Puppet CA to the ganeti nodes.
The CA issues certs for the client nodes, the spice functionality (we don't use that) as well as the RAPI (remote API). Since we treat that CA as a blackbox, we don't want to use those. So in order to allow the rest of the infrastructure we run ganeti-rapid with the -K parameter and instead use a certificate we ship using Puppet. This can be confusing so keep an eye out.
===Renew cluster certificates===
The cluster certificates can be renewed at any point in time. While time consuming, it is a safe command to run (keep in mind that other operations will not be able to be run however during that timeframe). The command to do so is:
  sudo gnt-cluster renew-crypto --new-cluster-certificate --new-node-certificates --new-rapi-certificate --new-spice-certificate --reason="<insert name> renewing crypto"
Note that we renew the cluster CA, the client certs, the spice certificate and the RAPI certificate. Don't be alarmed about renewing the RAPI certificate. As pointed out above, while the CA does issues one, we override the command to start  ganeti-rapid and don't use the file that is created by the CA. So we can safely renew that too.
It will ask you to run gnt-cluster verify. Do so. If you see any errors, try to correct them and repeat.
=== Expired cluster certificates===
So, you 've just met something like
  Error checking node ganeti2004.codfw.wmnet: Error 60: server certificate verification failed. CAfile: /var/lib/ganeti/server.pem
What has happened is that the CA (and client certs) have expired. First, don't panic. While it must be fixed, it's not going to cause issues immediately. Relax and read through this section to understand what needs to be done.
Normally, we would just use the command in the section above to just renew everything but if we are this position we can't right away and need to perform a couple of more actions before that.
Prep work:
There's a couple of things that might interfere with you working. Those are puppet and ganeti cronjobs. Disable both before starting otherwise, you might end up scratching your head as race conditions might interfere with your work. To do so, run disable-puppet "<insert comment>" on all hosts via cumin and move /etc/cron.d/ganeti to some other place temporarily.
#Make sure you are on the master (if you used the ganeti01.svc.<site>.wmnet service url it should be already so).
#Write down the nodes participating in the cluster. Running the following should do it:
sudo gnt-node list
#<li value="3">Stop ganeti across the cluster. A cumin command 'systemctl stop ganeti' should do it. Don't worry, VMs will keep on running.</li>
#Create a new CA. The easiest way to do so is:
chmod 0600 /var/lib/ganeti/server.pem && openssl req -new -newkey rsa:2048 -days 1825 -nodes -x509 -keyout /var/lib/ganeti/server.pem -out /var/lib/ganeti/server.pem -batch && chmod 0400 /var/lib/ganeti/server.pem
#<li value="5">Copy that cert to all hosts. Since the nodes have SSH allowed between them, you can just do that in a for loop</li>
for i in ganetiX001.<site>.wmnet ganetiX002.<site>.wmnet ... ; do scp /var/lib/ganeti/server.pem $i:/var/lib/ganeti/server.pem ; ssh $i chmod 0400 /var/lib/ganeti/server.pem ; done
#<li value="6">Start ganeti across the cluster.</li>
# Run a gnt-node list and note that this time the capacity of the hosts and usages are displayed normally and not with a question mark.
#Now run the gnt-cluster renew-crypto command from above.
# Follow the point about running gnt-cluster verify
#Correct errors if any and repeat.
#You are out of the woods.
Now re-enable puppet and move back /etc/cron.d/ganeti from wherever you placed it.


= Node operations =
= Node operations =


== Reboot/Shutdown for maintenance a node ==
==Reboot/Shutdown for maintenance a node==


Select a node that needs rebooting/shutdown for brief hardware maintenance and empty of primary instances
Select a node that needs rebooting/shutdown for brief hardware maintenance and empty of primary instances
Line 132: Line 341:
     sudo gnt-node list
     sudo gnt-node list


should return 0 primary instances for the node. It is safe to reboot it or shut it down for a brief amount of time for hardware maintenance
should return 0 primary instances for the node. It is safe to reboot it or shut it down for a brief amount of time for hardware maintenance.
 
===VMs without DRBD disk template===
 
Note that there is a chance that this isn't true. We have a few instances that can't be migrated to other nodes. Those are things for which replicated storage (aka DRBD) is either an overkill (e.g. d-i-test) or a liability (e.g. etcd hosts where the extra IO latency by DRBD causes issues). For those VMs, which have the plain disk template, it's safe to just reboot the host, ganeti will DTRT.


== Shutdown a node for a prolonged period of time ==
After reboot, before you migrate the next node run
 
    sudo gnt-cluster verify-disks
 
It should display "No disks need to be activated." (possibly multiple times, one per ganeti nodegroup) before the next node can be rebooted (this ensures that the DRBD is synced fully)
 
==Shutdown a node for a prolonged period of time==


Should the node be going down for an undetermined amount of time, also move the secondary instances
Should the node be going down for an undetermined amount of time, also move the secondary instances


     sudo gnt-node migrate -f <node_fqdn>
     sudo gnt-node migrate -f <node_fqdn>
     sudo gnt-node evacuate -s  <node_fqdn>
     sudo gnt-node evacuate -s  <node_fqdn> # Removes it as a secondary as well


The second command means moving around DRBD pairs and syncing disk data. It is bound to take a long time, so find something else to do in the meanwhile
The second command means moving around DRBD pairs and syncing disk data. It is bound to take a long time, so find something else to do in the meanwhile
Line 147: Line 366:
     sudo gnt-node list  
     sudo gnt-node list  


should return 0 for both primary instances as well as secondary instances. Before poweroff the node we need to remove it from the cluster as well
should return 0 for both primary instances as well as secondary instances. Before powering off the node we need to remove it from the cluster as well


     sudo gnt-node remove <node_fqdn>
     sudo gnt-node remove <node_fqdn>
Line 155: Line 374:
     sudo gnt-node add <node_fqdn>
     sudo gnt-node add <node_fqdn>


= VM operations =
==Failed hardware node==


== Listing VMs ==
When a host is having problem (hardware/kernel/otherwise) and it's unreachable, the are a number of possible avenues to solve the issue, but an empirical good way out is:


    gnt-instance list
*Just powercycle the host. If that works, it's probably the faster way out. Most services should anyway be set up highly available and if we got one that is not we either should set it that way or not care too much when it fails. If this works, you are done, if not keep on reading.
*If the above doesn't work (the node never comes back up), start the VMs on another host. This can be done with


== Create a VM ==
    sudo gnt-node failover -f <node_fqdn>


Creating a VM is easy. Most of the steps are the same as for production so keep in mind the regular process as well.  
in some cases you may have to ignore the consistency checks (this has never happened in our setup), pass --ignore-consistency. Again all important services are set in high available setups (and could easily reimaged) be  so this will only severely bite VMs that are not setup that way.


=== Assign a hostname/IP ===
* Remove the host from the cluster


Same process as for hardware. Assign the IP/hostname and make sure DNS changes are live before going forward.
    sudo gnt-node remove <node_fqdn>


=== Create the VM (private IP) ===
*Debug/fix/RMA the node.
    gnt-instance add \
      -t drbd \
      -I hail \
      --net 0:link=br0 \
      --hypervisor-parameters=kvm:boot_order=network \
      -o debootstrap+default \
      --no-install \
      -B vcpus=<x>,memory=<y>g \
      --disk 0:size=<z>g \
      <fqdn>


Note the the VM will NOT be started. That's on purpose for now. x, y ,z on the above are variables. t,g,m denote tera,giga,mega bytes respectively.
=VM operations=


=== Create the VM (public IP) ===
==Listing VMs==
    gnt-instance add \
      -t drbd \
      -I hail \
      --net 0:link=vlan<VLAN_ID> \
      --hypervisor-parameters=kvm:boot_order=network \
      -o debootstrap+default \
      --no-install \
      -B vcpus=<x>,memory=<y>g \
      --disk 0:size=<z>g \
      <fqdn>


Note that the only difference between public/private IP is the <VLAN_ID> required. This is the VLAN ID used by the network. Here's multiple different ways to find it out:
    sudo gnt-instance list


* The easiest (USE THIS FOR NOW) Use the ganeti nodes themselves: /etc/network/interfaces. All support VLANS apart from the very basic one (the one the ganeti hosts are installed into anyway are there in the form of br0.<VLAN_ID>. For now it is just one as we don't yet support cross row.
==Create a VM==
* Use the switches: This requires knowning just a bit more about the network. Figure out the row the ganeti hosts are racked into, log in into the corresponding switch and get the vlan ID with show vlans <id_name>.


=== Get the MAC address of the NIC ===
Creating a VM is easy. Most of the steps are the [[Server_Lifecycle#Spare_->_Staged|same as for production]] so keep in mind the regular process as well.


    gnt-instance info <fqdn> | grep MAC
{{note|Creating a new VM should generally follow filing a Virtual machine request ticket as discussed in the page [[SRE_Team_requests#Virtual_machine_requests_(Production)]] even if you intend to perform the steps to create the VM itself so that the creation is documented, tracked and others may be aware. See also [[Phabricator#Misc._Production_Virtual_Machine_Requests_Workflow]]}}


Get the MAC address
However, the reimage cookbook works only on physical host with an IPMI interface, so '''it cannot be used for VMs'''. So while the network boot installation will be unattended, VM creation is explained below, and Puppet signing and other steps is done with the [[Server_Lifecycle#Manual_installation|manual procedure]].


=== Update DHCP ===
=== Verify cluster resource availability===


Same as usual. Use linux-host-entries.ttyS0-115200 for Ganeti VMs. Otherwise you will not be getting a console
First check how the Ganeti rows are currently used, to balance out resource allocation. For this run the following on the Ganeti master:


=== Update autoinstall files ===
    sudo gnt-group list


Same as usual. Do however add virtual.cfg to the configuration for a specific VM. Example: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/install-server/files/autoinstall/netboot.cfg;601720be51228f7eae3de17988b1afa8881a5bdb$71
which should return something like the output below


=== Start the VM ===
    Group Nodes Instances AllocPolicy NDParams
    A    3        20 preferred  ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1,ovs_name=switch1, oob_program=
    C    4        19 preferred  ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1,ovs_name=switch1, oob_program=


   gnt-instance start <fqdn>
You can also have a more detailed breakdown of resource usage with e.g.
 
    elukey@ganeti1003:~$ sudo gnt-node list -o +group
    Node                  DTotal  DFree MTotal MNode MFree Pinst Sinst Group
    ganeti1001.eqiad.wmnet 707.4G  9.2G  62.9G 37.6G 24.9G   11    11 C
    ganeti1002.eqiad.wmnet 707.4G  69.2G  62.9G 40.4G 21.5G    10    11 C
    ganeti1003.eqiad.wmnet 707.4G 244.6G  62.9G 39.4G 21.2G    11    11 C
    ganeti1004.eqiad.wmnet 707.4G 285.1G  62.9G 55.8G  6.8G    10    8 C
    ganeti1005.eqiad.wmnet  2.1T 225.8G  62.8G 16.8G 46.0G    6    18 A
    ganeti1006.eqiad.wmnet  2.1T 806.7G  62.8G 55.0G  7.5G    13    5 A
    ganeti1007.eqiad.wmnet  2.1T  1.1T  62.8G 44.2G 16.6G    12    12 A
    ganeti1008.eqiad.wmnet  2.1T 834.4G  62.8G 47.8G 11.3G    13    7 A
 
First thing, check [https://netbox.wikimedia.org/search/?q=ganeti10&obj_type= netbox] to get an idea about what Ganeti nodes are in the various rows. Then check the '''DFree''' (disk free) and '''MFree''' (memory free) columns of the Ganeti nodes in the above table. In theory there should be sufficient disk/memory space on all nodes in the row that you are planning to use, otherwise you might get failures when creating the VM.
 
If you want to confirm the reason why creating a VM fails is really due to lack of resources you can look into the ganeti command.log.  For example in eqiad you would check the current master node ganeti1003:
 
<pre>
[ganeti1003:~] $ sudo grep failure /var/log/ganeti/commands.log
OpPrereqError: ("Can't compute nodes using iallocator 'hail': Request failed: Group C (preferred): No valid allocation solutions, failure reasons: FailMem: 9, FailDisk: 3", 'insufficient_resources')
</pre>
 
===Assign a hostname===
 
You need to select a hostname based on our [[Infrastructure_naming_conventions]]. You should also add Cumin aliases to cover selecting the new hostnames.  See [[Cumin#Global_grammar_host_selection]] and edit modules/profile/templates/cumin/aliases.yaml.erb in operations/puppet.  If you are making a new 'cluster', you'll also need to add a <tt>$cluster_$dc</tt> definitions to hieradata/common/monitoring.yaml.
 
The IP allocation and the generation of DNS records are handled automatically by Netbox.
 
===Create the VM===
Ganeti VMs are created with the sre.ganeti.makevm cookbook. You can use it from any Cumin host like the following:
 
<syntaxhighlight lang="shell-session">
$ sudo cookbook sre.ganeti.makevm -h
usage: cookbook [-h] [--skip-v6] [--vcpus VCPUS] [--memory MEMORY] [--disk DISK] [--network {public,private,analytics}] [--cluster CLUSTER]
                [--group GROUP]
                hostname
 
Create a new Virtual Machine in Ganeti
 
    * Pre-allocate the primary IPs and set their DNS name
    * Update the DNS records
    * Create the VM on Ganeti
    * Force a sync of Ganeti VMs to Netbox in the same DC
    * Update Netbox attaching the pre-allocated IPs to the host's primary interface
 
    Examples:
        Create a Ganeti VM vmname.codfw.wmnet in the codfw Ganeti cluster
        on group B with 1 vCPUs, 3GB of RAM, 100GB of disk in the private network:
 
            makevm --vcpus 1 --memory 3 --disk 100 --cluster codfw --group B vmhostname
 
 
 
positional arguments:
  hostname              The hostname for the VM (not the FQDN).
 
optional arguments:
  -h, --help            show this help message and exit
  --skip-v6            To skip the generation of the IPv6 DNS record. (default: False)
  --vcpus VCPUS        The number of virtual CPUs to assign to the VM. (default: 1)
  --memory MEMORY      The amount of RAM to allocate to the VM in GB. (default: 1)
  --disk DISK          The amount of disk to allocate to the VM in GB. (default: 10)
  --network {public,private,analytics}
                        Specify the type of network to assign to the VM. (default: private)
  --cluster CLUSTER    The Ganeti cluster short name, as reported in Netbox as a Cluster Group:
                        https://netbox.wikimedia.org/virtualization/cluster-groups/ (default: None)
  --group GROUP        The Ganeti group name, as reported in Netbox as a Cluster for the given group:
                        https://netbox.wikimedia.org/virtualization/clusters/ (default: None)
</syntaxhighlight>
 
Please note that creating a new instance takes several minutes, most of the times due to the provisioning of the disk space. The makevm cookbook returns the output of the Ganeti command periodically, in order to get a regular feedback about the status of the VM creation.
 
===Update the DHCP config with the MAC of the new VM ===
 
The makevm cookbook will print the MAC address of the virtual NIC of the freshly created instance at the end. Next you need to add a DHCP entry with that MAC by editing the <code>modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200</code> file.
 
After merging the patch, you can force a puppet run on the installation servers to speed up the installation process;
 
  sudo cumin A:installserver 'run-puppet-agent'
 
==== How to query the MAC address (if you missed the cookbook output) ====
If you missed the MAC address when running the cookbook, you can query it after the machine has been created like so:
<pre>
razzi@ganeti1009:~$ sudo gnt-instance show datahubsearch1001.eqiad.wmnet | grep -A 2 NIC
  NICs:
    - nic/0:
      MAC: aa:00:00:35:3d:f1
</pre>
===Update partman config (if needed)===
 
The netboot.cfg files configures the partition setup of a server. For all Ganeti VMs that's consistently flat.cfg virtual.cfg. When adding a new instance you need to verify whether a partman config is assigned. Most server types use a globbing scheme like idp* and if you're adding a further IDP host you can simply skip this step. If you however add a server with a new standalone name, then you need to merge a change puppet for <code>modules/install_server/files/autoinstall/netboot.cfg</code>.
 
After that you need to run Puppet on the apt* servers:
 
sudo cumin apt* 'run-puppet-agent'
 
===Start the VM===
 
sudo gnt-instance start <fqdn>


and connect to the console
and connect to the console


  gnt-instance console <fqdn>
sudo gnt-instance console <fqdn>


Ctrl+] to leave the console
Ctrl+] to leave the console


=== Set boot order to disk ===
===Set boot order to disk===
{{note| WARNING: Fail to do this and the VM will be stuck in an endless reboot, install, reboot loop}}
{{note|WARNING: Fail to do this and the VM will be stuck in an endless reboot, install, reboot loop}}
Assuming the installation goes on well but before it finishes, you need to set the boot order back to disk. This is a limitation of the current version of the Ganeti software and is expected to be solved (upstream is aware).  
Assuming the installation goes on well but before it finishes, you need to set the boot order back to disk. This is a limitation of the current version of the Ganeti software and is expected to be solved (upstream is aware<sup>[link?]</sup>}).


   gnt-instance modify \
   gnt-instance modify --hypervisor-parameters=boot_order=disk <fqdn>
  --hypervisor-parameters=boot_order=disk \
  <fqdn>


Note: when the VM has finished installing, it will shutdown automatically. The Ganeti software includes HA checks and will promptly restart it. We rely on this behaviour to have the VM successfully installed. However, if you list the VMs during this phase you will see the VM in ERROR_down. Don't worry, this is expected.
Note: when the VM has finished installing, it will shutdown automatically. The Ganeti software includes HA checks and will promptly restart it. We rely on this behaviour to have the VM successfully installed. However, if you list the VMs during this phase you will see the VM in ERROR_down. Don't worry, this is expected.


=== Assign role to the VM in puppet ===
Note 2: If you need to set a boot_order back to PXE to reinstall, it's "boot_order=network" (For KVM the boot order is either "floppy", "cdrom", "disk" or "network".)
 
===Sign Puppet certs===
 
For baremetal hosts, this part is handled by the reimage script (wmf-auto-reimage). As of 2020, VMs are not supported, so we need to do it as in the [[Server_Lifecycle#Manual_installation|manual installation instructions]] for Ganeti VMs.
 
The process is as follows:
 
On a puppetmaster, run the following command (which will enable you obtain/access root shell of the targeted VM, identified by FQDN provided)
 
<pre>
install_console FQDN
</pre>
 
You will be dropped into a root shell belonging to your newly created  VM instance. On the VM root shell, initiate a puppet run by executing the following:
 
<pre>
puppet agent -tv
</pre>


As usual.
The first Puppet run triggers a request for the certificate, it should show up when running "sudo puppet cert -l" on puppetmaster1001. Next approve the certificate with


== Delete a VM ==
<pre>
sudo puppet cert -s FQDN
</pre>Once the first puppet run has completed, the [https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/ interface_automation/ImportPuppetDB] Netbox script will be automatically run by the <code>netbox_ganeti_$DC_sync</code> timers that run on the Netbox hosts to to synchronize the VM interfaces to Netbox.


Irrevocably deleting a VM is done via:
===Assign role to the VM in puppet===
 
As usual. If the VM will not be installed immediately turn it off (see shutdown below). Running VMs that are not in Puppet will alarm.
 
==Delete a VM==
 
The preferred method is to run the decommissioning cookbook [[Decom_script]] that will take care of everything, including deleting the VM itself. See also the actions performed by the [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py|sre.hosts.decommission]] cookbook.
 
To instead irrevocably delete manually a VM run:


   gnt-instance remove <fqdn>
   gnt-instance remove <fqdn>


Please remember to clean up DHCP/DNS entries afterwards
In all cases please remember to clean up Puppet/DHCP/DNS entries afterwards.


== Shutdown/startup a VM ==
==Shutdown/startup a VM ==


   gnt-instance startup <fqdn>
   gnt-instance startup <fqdn>
Line 255: Line 580:
   gnt-instance shutdown --timeout 0 <fqdn>
   gnt-instance shutdown --timeout 0 <fqdn>


= Notes =
==Get a console for a VM==
 
You can get log into the "console" for a Ganeti instance via
 
  gnt-instance console <fqdn>
 
The console can be left with "ctrl + ]"
 
If the above fails then it is possible to connect to the vnc port directly.  first you need to get the vnc port to connect to
 
<syntaxhighlight lang="bash">
$ sudo gnt-instance info netbox2002.codfw.wmnet | grep 'console connection'           
  console connection: vnc to ganeti2030.codfw.wmnet:11136 (display 5236)
</syntaxhighlight>
 
We then need to set up port forwarding to the vnc port over ssh
<syntaxhighlight lang="bash">
$ ssh -11136:localhost:11136 ganeti2030.codfw.wmnet
</syntaxhighlight>
 
finally start your vnc client
<syntaxhighlight lang="bash">
$ remote-viewer vnc://localhost:11136
</syntaxhighlight>
 
==Resize a VM==
 
===Increase/Decrease CPU/RAM===
Make sure first that the cluster has adequate space for whatever resource you want to increase (if you do want to increase and not decrease a resource). This is done manually by a combination of [https://grafana.wikimedia.org/d/000000545/ganeti grafana statistics] for CPU/Memory utilization and the output of gnt-node list for disk space utilization. After that you can issue the following command to increase/decrease the memory size and number of Virtual CPUs assigned to a VM
 
  gnt-instance modify -B memory=<X>[gm],vcpus=<N> <fqdn>
 
where X, N are physical numbers. X can be suffixed by g or m for Gigabytes or Megabytes (please don't do Terabytes ;))
 
===Adding disk space===
Adding space to an existing disk is possible. But it is not advisable. Only do so sparingly and when you know it's the best course of action. A good example is if you know that the VM will be reimaged after the process. The reason is that growing a disk with a partition table in it means that the partition table will need to be updated as well (and after that the filesystem as well). This is a non automated process, error prone and requires at least 2 reboots (one for the VM to see the new disk size, and one for the VM to safely use the new partition table - that latter one can be avoided manually, but it's usually cheaper to just reboot the VM).
 
But do note that the resizing of partitions and filesystems is up to you, as ganeti can't do it for you. The command would be:
 
  gnt-instance grow-disk <fqdn> <#> X[gmt]
 
where # is the index of the disk (it's 0 indexed). You can get the disks allocated to a VM using gnt-instance info <fqdn>. Again X is a physical number suffixed for Gigabytes/Megabytes/Terabytes.
 
===Adding a disk===
 
Adding a disk is also easy if you want to avoid the mess that will result from having to resize partitions/filesystems. Thus it is recommended over resizing a disk. The command would be:
 
  gnt-instance modify --disk add:size=X[gmt] <fqdn>
 
Again X is a physical number suffixed for Gigabytes/Megabytes/Terabytes.
 
 
'''WARNING'''!! After adding a device you have to reboot the VM (from ganeti level, not just reboot within the VM) and it might happen that your device names change and you get no networking.
 
If the VM does not come back from reboot, login via console and root password from pwstore and replace the device name with the "next" one in /etc/network/interfaces. For example replace all occurrences of "ens5" with "ens6" and reboot the machine one more time. also see [[phab:T272555]]. If you want to be sure about the "next" device name run <code>ifconfig -a</code> to double check.
 
After the disk has been created and the VM rebooted you need to create a partition table on it (fdisk), create a file system (mkfs.ext4) and mount it. Finally don't forget to add that to /etc/fstab to make
sure it survives a reboot.
 
Hint: if you know beforehand that a VM will need an extra disk, add it before the install phase. In that case you only need to add the disk, and no other step.
 
===Removing a disk===
 
In case a VM no longer requires a disk, it can be removed. The command would be:
 
  gnt-instance modify --disk <#>:remove <fqdn>
 
where # is the index of the disk (it's 0 indexed). You can get the disks allocated to a VM using gnt-instance info <fqdn>
 
==Reinstall / Reimage a VM==
 
Just like a physical server '''OS reinstall this will destroy the contents of the machine''' and requires appropriate netboot configs to be in place.  Proceed with caution!
 
Also note the '''wmf-auto-reimage script does not work for virtual machines''', so a manual procedure has to be done, as explained below.
<syntaxhighlight lang="bash">
$ FQDN=<fqdn>
</syntaxhighlight>
Shutdown the VM
<syntaxhighlight lang="bash">
$ sudo gnt-instance shutdown $FQDN
</syntaxhighlight>
Set boot device to network
<syntaxhighlight lang="bash">
$ sudo gnt-instance modify --hypervisor-parameters=boot_order=network $FQDN
</syntaxhighlight>
Start instance and attach to console (ctrl-] to detach)
<syntaxhighlight lang="bash">
$ sudo gnt-instance start $FQDN && sudo gnt-instance console $FQDN
</syntaxhighlight>
After the OS install has finished (or while it is successfully under way from a separate terminal) set the boot device back to disk
<syntaxhighlight lang="bash">
$ sudo gnt-instance modify --hypervisor-parameters=boot_order=disk $FQDN
</syntaxhighlight>
Finally, after the install finishes, boot the system into the fresh OS install and attach to the console (ctrl-] to detach)
<syntaxhighlight lang="bash">
$ sudo gnt-instance start $FQDN && sudo gnt-instance console $FQDN
</syntaxhighlight>
After you get the server login you need to proceed with the [[Server Lifecycle#Manual installation|manual installation]] procedure.
 
==Renumber (aka change network) a VM==
{{note|You generally should avoid the process described here}}
 
Say you want to change the networking configuration of a VM. It's usually better to create a new VM in the correct row instead. However there are some benefits to renumbering a VM like avoiding to have to re instantiate some database or some other lengthy to setup process. Some other indicative reasons for that would be:
 
* Spreading a cluster of VMs offering a service across more network rows.
*Rebalancing VMs across rows to free up capacity in a problematic row.
 
Steps are:
 
#Make sure the VM isn't powering some service (e.g. depool from pybal, if the VM is in a cluster, remove from the rest of the cluster's configuration)
#Choose a new row in the DC. Same rules are for creating a VM apply.
# Schedule downtime.
#Update DNS. Make sure you don't forget updating IPv6 :-)
#Update the networking configuration in /etc/network/interfaces. This is an error prone process with no review. Make sure you get it right otherwise you are in for some painful debugging
# Update the /etc/hosts referencing the host with the new IPv4 address
#If possible, power off the VM.
#Run the following command
 
sudo gnt-instance change-group --to=<row> <instance>
 
Wait it out.
 
Power on or reboot the VM
 
{{warn|Double/triple check that networking is working fine. Make sure /etc/networking/interfaces is still correct. Make sure to run puppet once. Especially IPv6 configuration, due to our puppetization of it end up being wrong and changing on the next puppet run causing confusion and difficult to debug issues}}
 
 
==Change disk template for a VM (aka drop DBRD)==
{{note|Advanced customization, use sparingly}}
 
In some cases (very few), it makes sense that the disk of a VM is not replicated to another node. That will only be true for VMs where replication is treated internally by the application hosted in the VM. That could be things like a replicated Redis or a replicated MySQL or a distributed datastore like cassandra, etcd, zookeeper.
 
In some of the above cases, DRBD will not only be a waste but also cause issues. The one we 've seen is etcd, where the IO latency added by DRBD causes etcd to periodically complain about lag and syncing issues. This is usually seen via icinga alerts. The fix for this is to switch from DRBD to the plain template. The following commands will do that
 
Stop first the VM as the disk template change can't happen with the VM up
 
  $ sudo gnt-instance shutdown foobar.eqiad.wmnet
 
With the VM stopped, change the disk template
 
  $ sudo gnt-instance modify -t plain foobar.eqiad.wmnet
  Tue Feb  9 13:53:05 2021 Converting disk template from 'drbd' to 'plain'
  Tue Feb  9 13:53:05 2021 Removing volumes on the secondary node...
  Tue Feb  9 13:53:06 2021 Removing unneeded volumes on the primary node...
  Modified instance foobar.eqiad.wmnet
  - disk_template -> plain
  Please don't forget that most parameters take effect only at the next (re)start of the instance initiated by ganeti;
  restarting from within the instance will not be enough.
 
Start now the VM again
 
  $ sudo gnt-instance startup foobar.eqiad.wmnet
 
==Memory pressure ==
We have memory pressure icinga checks. Ganeti nodes are heavily memory bound and rather inflexible (that is, memory does not change a lot over time, once a VM is started it obtains that memory and uses it. We don't do ballooning, there is little gain in that currently). When a memory pressure icinga check happens, the best bet is to just rebalance the cluster or at least the nodegroup (representing a rack row currently). See [[Ganeti#Cluster rebalancing]]
 
=Notes=


All of the commands that have a Y/N prompt can be forced with a -f. For example the following will spare you the prompt
All of the commands that have a Y/N prompt can be forced with a -f. For example the following will spare you the prompt
Line 265: Line 746:
   gnt-instance shutdown --submit <fqdn>
   gnt-instance shutdown --submit <fqdn>


= API =
=API=
 
[[Category:SRE Infrastructure Foundations]]

Revision as of 07:57, 6 July 2022

Ganeti is a clustered virtual machine management software tool built on top of existing virtualization technologies such as Xen or KVM and other open source software. It supports both KVM and Xen. At WMF we only have KVM as an enabled hypervisor. Primary Ganeti web page is http://www.ganeti.org/.

At WMF, ganeti is used as a cluster management tool for production-network VPSes with services that help us run our clusters. After an evaluation process of Openstack vs Ganeti, Ganeti was chosen as a more fitting software for the job at hand.

Architecture Overview

Ganeti is architected as a shared nothing virtualization cluster with job management. There is one master node (per-site) that receives all jobs to be executed (create a VM, delete a VM, stop/start VMs, etc). The master node can be swapped between a preconfigured number of master candidates in case of a hardware failure. This avoids a single point of failure for cluster management operations.

For VM storage we use the DRBD backend. This provides "RAID-1 over the network" block devices which ensure that in the event of failure of a single hardware node, VMs may be restarted with minimal disruption on their secondary (backup) node. This also provides the capability to migrate VMs between hardware nodes (e.g. for hardware maintenance) without requiring a centralized storage backend.

Within a cluster, there's the notion of a node group. That's practically a group of hardware nodes. Think of it as nothing more than a subcluster division. Operations are usually constrained within a node group and do not cross the boundaries unless specifically instructed to. In our deployment, node groups are aligned with the physical row location of the hardware inside the datacenter (where possible). This allows us to diversify VMs across physical row locations within a single site, taking advantage of the additional of fault tolerance this can provide. In caching pop locations there is a single, "default" node group.

A high level overview of the architecture is here https://docs.ganeti.org/docs/ganeti/2.14/html/_images/graphviz-246e5775f608681df9f62dbbe0a5d4120dc75f1c.png and more discussion about it is in https://docs.ganeti.org/docs/ganeti/2.14/html/design-2.0.html

Ganeti Clusters

A cluster is identified by an FQDN (e.g ganeti01.svc.eqiad.wmnet) which corresponds to an IPv4 address. That IPv4 address is "floating", meaning that it is owned by the current master.

Administration always happens via the master. It is the only node where all commands can be run and hosts the API. Failover of a master is easy but manual. See below for more information on how to do it.

Hardware Conventions

As of Dec 2019 we have standardized on raid-5 host configuration in eqiad/codfw and an SSD backed raid-1 configuration in caching pop sites.

Connect to an existing cluster

Most operations are run from the Ganeti master node, but you can log into any of the virtualisation nodes via standard SSH. The MOTD of each Ganeti server will print the Ganeti master for the cluster.

Building a new cluster

Host Preparation

Node should be in the insetup role

New nodes should be initially installed with the puppet role insetup. At a later point in this procedure, after several manual actions, they'll be switched to the real ganeti role.

Ensure hardware virtualization is enabled in BIOS

Hardware virtualization must be enabled for KVM. If you encounter errors relating to KVM, double check that hardware virtualization is enabled inside BIOS > System Settings > Processor Settings

Ensure 'ganeti' Volume Group is present

Our current ganeti partman layout creates a large raid device and formats it as swap to work around a shortcoming in the debian installer (fixme)

When a new ganeti host is built, you must manually remove the large swap device on /dev/md2, create a PV on /dev/md2, and create a VG called ‘ganeti’. Be sure to remove the stale swap entry from /etc/fstab.

Ensure network bridge(s) are present

Bridges are used to connect virtual machines to the network interface of the hardware node. Currently these are manually configured. Each bridge corresponds to a network VLAN, on either a physical or tagged interface.

Make sure to install bridge-utils. It's a dependency of Ganeti, but not installed by default, so if you're setting up the bridge before the role gets applied, bridge-utils isn't installed and networking.service will fail with a cryptic/misleading error message.

Looking at the simpler case of the caching pop deployment, there is a single bridge called ‘private’ that corresponds with private network interface. This bridge interface assumes the IP of the node.

Example configuration:

 #/etc/network/interfaces
 
 # The loopback network interface
 auto lo private
 iface lo inet loopback
 
 iface ens3f0np0 inet manual
 
 iface private inet static
     bridge_ports 	ens3f0np0
     bridge_stp	off
     bridge_maxwait	0
     bridge_fd	0
         address 10.20.0.31/24
         gateway 10.20.0.1
         dns-nameservers 10.3.0.1
         dns-search ulsfo.wmnet
         up /sbin/ip token set ::10:20:0:31 dev private
         up ip addr add 2620:0:862:102:10:20:0:31/64 dev private
 
 ## The original network interface configuration
 #allow-hotplug ens3f0np0
 #iface ens3f0np0 inet static
 #	address 10.20.0.31/24
 #	gateway 10.20.0.1
 #	dns-nameservers 10.3.0.1
 #	dns-search esams.wmnet
 #     pre-up /sbin/ip token set ::10:20:0:31 dev ens3f0np0
 #     up ip addr add 2620:0:862:102:10:20:0:31/64 dev ens3f0np0

Before editing the interfaces file, it's important to disable the puppet agent and leave it disabled until the role switch in a later step. If the agent re-runs the insetup role after the changes above, it will modify the file in a way that breaks networking.

After editing the interfaces file, a systemctl restart networking.services is required in order to take effect.

Generate an RAPI certificate for the cluster

Ganeti RAPI TLS certificates are generated and manged with Cergen. You'll need to...

  • Add an entry to the ganeti yaml certificate manifest. Don't use a password as ganeti-rapi did not support this (at time of writing)
  • Generate the certificate
  • Add the key file to the puppet private/secret ssl certificate store
  • Add a dummy key file to the labs/private ssl certificate store. !!! do not use the real key here, use a dummy key!!! See other existing ganeti keys in labs/private for an example
  • Add the public certificate to the puppet files/ssl certificate store.

Create a SVC record for new cluster

The SVC record is used for the floating IP owned by a Ganeti master of a cluster. You need to create a new record following https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox

After that the sre.dns.netbox cookbook needs to be run.


Note that in eqiad/codfw the Netbox-generated zone is only used for reverse looksup, for forward resolution you need to add an entry to the dns.git repo (In the edges, all zones are managed in Netbox).

Assign the Ganeti role

Assign the "ganeti" role to the servers. The initial puppet run will not complete entirely, but that will be rectified when gnt-cluster is run later.

Ensure intra-cluster SSH is permitted by the iptables/ferm firewall

Cluster nodes must be able to ssh to one another in order for Ganeti to function. Double-check that puppet has configured this by attempting to SSH between the Ganeti cluster hosts being deployed.

Initialize the new cluster

An example of a initializing a new cluster:

  sudo gnt-cluster init \
  --no-ssh-init \ 
  --enabled-hypervisors=kvm \
  --vg-name=ganeti \
  --master-netdev=private \
  --hypervisor-parameters kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,serial_speed=115200,migration_bandwidth=64,migration_downtime=500,kernel_path= \
  --nic-parameters=link=private \ 
  ganeti01.svc.codfw.wmnet

The above is the way we currently have our clusters configured

Synchronize ssh_host_rsa_key across cluster nodes

Since the ganeti service address floats between master candidate nodes, the host key for the cluster service address remain the same across master changes. To accomplish this we must synchronize /etc/ssh/ssh_host_rsa_key and /etc/ssh/ssh_host_rsa_key.pub across the Ganeti cluster in that site.

Copy the contents of /etc/ssh/ssh_host_rsa_key and /etc/ssh/ssh_host_rsa_key.pub from the master node to the other nodes within the cluster, add a known_hosts entry for this shared host key to /var/lib/ganeti/known_hosts. Then bounce the sshd on each node.

Add nodes to new cluster (or extend an existing one

Nodes can be added to the cluster using the sre.ganeti.addnode cookbook. It runs various consistency checks (whether virtualization is enabled in the BIOS, if the ganeti VG is present and if all the bridges (public/private and analytics in eqiad) are correctly set up) before making the change.

Don't forget to also register the new node in Hiera (profile::ganeti::nodes), this variable gets read to setup the Ferm rules.

Verify cluster health

Run `gnt-cluster verify` to perform a health check on the cluster. Address any errors that may occur before moving on.

Rename the Ganeti group

By default the Ganeti group is created as "default". You can run the following to use a more consistent name: (this name also needs to be added to CLUSTERS_AND_ROWS in spicerack.git/spicerack/ganeti.py)

sudo gnt-group rename default D

Ensure microcode flags are passed down to VM instances

In addition we need to adapt the QEMU settings so that new microcode flags are passed down to the instance:

   sudo gnt-cluster modify -H kvm:cpu_type=IvyBridge\\,+pcid\\,+invpcid\\,+spec-ctrl\\,+ssbd\\,+md-clear

Configure a static KVM machine type

This makes future OS updates simpler:

   sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-5.2

Create a test VM

The cluster should now be ready for VM creation, follow the 'create a VM' steps below and address any errors that may occur.

Modify an existing cluster

Modifying the cluster to change defaults, parameters of hypervisors, limits, security model etc is possible. An example of modifying the cluster is given below.

  sudo gnt-cluster modify -H kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,kernel_path=

To get an idea of what is actually modifiable do a:

  sudo gnt-cluster info

and then lookup in ganeti documentation the various options [1]

Destroy the cluster

Destroying the cluster is a one way street. Do not do it lightly. An example of destroying a cluster:

 sudo gnt-cluster destroy  --yes-do-it

do note that various things will be left behind. For example /var/lib/ganeti/queue/ will not be deleted. It's up to you if you want to delete it or not, depending on the case.

Listing cluster nodes

Listing all hardware nodes in a cluster:

  sudo gnt-node list

That should return something like:

  Node                   DTotal  DFree MTotal MNode MFree Pinst Sinst
  ganeti1001.eqiad.wmnet 427.9G 427.9G  63.0G  391M 62.4G     0     0
  ganeti1002.eqiad.wmnet 427.9G 427.9G  63.0G  289M 62.5G     0     0
  ganeti1003.eqiad.wmnet 427.9G 427.9G  63.0G  288M 62.5G     0     0
  ganeti1004.eqiad.wmnet 427.9G 427.9G  63.0G  288M 62.5G     0     0

The columns are respectively: Disk Total, Disk Free, Memory Total, Memory used by node itself, Memory Free, Instances for which this node is primary, instances for which this node is secondary

The nodegroup information you be obtained via a command like

   sudo gnt-node list -o name,group

which would return the name and the group of node

   Node                   Group
   ganeti1001.eqiad.wmnet C
   ganeti1002.eqiad.wmnet C
   ganeti1003.eqiad.wmnet C
   ganeti1004.eqiad.wmnet C
   ganeti1005.eqiad.wmnet A
   ganeti1006.eqiad.wmnet A
   ganeti1007.eqiad.wmnet A

To list all the hosts in a given group:

sudo gnt-instance list -o name -F "pnode.group == 'B'"

To check which row is underused and where you should put a new VM:

[ganeti1009:~] $ for row in A B C D; do echo "row ${row}: $(sudo gnt-instance list -o name -F "pnode.group == '${row}'" | wc -l) VMs"; done

Detecting the master node

The current master node of a cluster is printed in the MOTD when logging into a Ganeti server. It can also be detected with

 sudo gnt-node list -o name,master_candidate,master

View the job queue

Ganeti has a job queue built-in. Most of the times it's working fine but if something is taking too long it might be helpful to check what's going on in the job queue

  gnt-job list

and getting a job id from the result

  gnt-job info #ID

Cluster upgrades

Hardware/software upgrades on a ganeti cluster can happen with 0 downtime to the VMs operations. The procedure to do so is outlined below. In case a shutdown/reboot is needed the procedure to empty to node is described in the rolling reboot section

Do the software upgrade (if needed)

  apt upgrade, apt install <component>, debdeploy, whatever is the correct method

throughout the cluster. It should have 0 repercussions to any VM anyway. Barring a Ganeti bug in the upgraded version, the cluster itself should also have 0 problems. Between minor versions (e.g. 2.12 -> 2.15) it may be required to run some upgrade script. Read the changelog/upgrades notes. The Debian maintainer builds the package in a way that both versions will be installed and until you run said script the old version will still be used.

Rolling reboot

Doing a rolling reboot of the cluster is easy. Empty every node, reboot it, check that it is online, proceed to the next. The one thing to take care is to not reboot the master without failing it over first.

Failover the master

Choose a master candidate that suits you. You can get master candidates by

  sudo gnt-node list -o name,master_candidate

Login and

  sudo gnt-cluster master-failover

The cluster IP will now be served by the new node and the old one is no longer the master.

Cluster rebalancing

There might be a time where the cluster will look/actually be unbalanced. That will be true after a rolling reboot of the nodes. Doing a rebalancing is easy and baked into ganeti, all it takes is running a command

  sudo hbal -L -X -G <node_group>

Please run it in a screen session. It might take quite a while to finish. The jobs have been submitted so it's fine losing that session but it's still prudent. A list of node groups can be displayed with "sudo gnt-group list".

The cluster will calculate a current score, run some heuristic algorithms to try and minimize that score and then execute the commands require to reach that state.

Cluster certificates

Ganeti creates and manages an internal Certificate Authority (CA). This CA is used for internal component communication and it's overall rather well hidden in day to day operations. There is a CA and node (aka client) certs that are distributed to the nodes. All nodes have the CA and the key to the CA, it's part of the ability to assume the master functionality. We 've decided to not mess with this CA as it is internal to the software and cluster state and we don't want to expose the key of the Puppet CA to the ganeti nodes.

The CA issues certs for the client nodes, the spice functionality (we don't use that) as well as the RAPI (remote API). Since we treat that CA as a blackbox, we don't want to use those. So in order to allow the rest of the infrastructure we run ganeti-rapid with the -K parameter and instead use a certificate we ship using Puppet. This can be confusing so keep an eye out.

Renew cluster certificates

The cluster certificates can be renewed at any point in time. While time consuming, it is a safe command to run (keep in mind that other operations will not be able to be run however during that timeframe). The command to do so is:

  sudo gnt-cluster renew-crypto --new-cluster-certificate --new-node-certificates --new-rapi-certificate --new-spice-certificate --reason="<insert name> renewing crypto"

Note that we renew the cluster CA, the client certs, the spice certificate and the RAPI certificate. Don't be alarmed about renewing the RAPI certificate. As pointed out above, while the CA does issues one, we override the command to start ganeti-rapid and don't use the file that is created by the CA. So we can safely renew that too.

It will ask you to run gnt-cluster verify. Do so. If you see any errors, try to correct them and repeat.

Expired cluster certificates

So, you 've just met something like

 Error checking node ganeti2004.codfw.wmnet: Error 60: server certificate verification failed. CAfile: /var/lib/ganeti/server.pem

What has happened is that the CA (and client certs) have expired. First, don't panic. While it must be fixed, it's not going to cause issues immediately. Relax and read through this section to understand what needs to be done.

Normally, we would just use the command in the section above to just renew everything but if we are this position we can't right away and need to perform a couple of more actions before that.

Prep work:

There's a couple of things that might interfere with you working. Those are puppet and ganeti cronjobs. Disable both before starting otherwise, you might end up scratching your head as race conditions might interfere with your work. To do so, run disable-puppet "<insert comment>" on all hosts via cumin and move /etc/cron.d/ganeti to some other place temporarily.

  1. Make sure you are on the master (if you used the ganeti01.svc.<site>.wmnet service url it should be already so).
  2. Write down the nodes participating in the cluster. Running the following should do it:
sudo gnt-node list 
  1. Stop ganeti across the cluster. A cumin command 'systemctl stop ganeti' should do it. Don't worry, VMs will keep on running.
  2. Create a new CA. The easiest way to do so is:
chmod 0600 /var/lib/ganeti/server.pem && openssl req -new -newkey rsa:2048 -days 1825 -nodes -x509 -keyout /var/lib/ganeti/server.pem -out /var/lib/ganeti/server.pem -batch && chmod 0400 /var/lib/ganeti/server.pem
  1. Copy that cert to all hosts. Since the nodes have SSH allowed between them, you can just do that in a for loop
for i in ganetiX001.<site>.wmnet ganetiX002.<site>.wmnet ... ; do scp /var/lib/ganeti/server.pem $i:/var/lib/ganeti/server.pem ; ssh $i chmod 0400 /var/lib/ganeti/server.pem ; done
  1. Start ganeti across the cluster.
  2. Run a gnt-node list and note that this time the capacity of the hosts and usages are displayed normally and not with a question mark.
  3. Now run the gnt-cluster renew-crypto command from above.
  4. Follow the point about running gnt-cluster verify
  5. Correct errors if any and repeat.
  6. You are out of the woods.

Now re-enable puppet and move back /etc/cron.d/ganeti from wherever you placed it.

Node operations

Reboot/Shutdown for maintenance a node

Select a node that needs rebooting/shutdown for brief hardware maintenance and empty of primary instances

   sudo gnt-node migrate -f ganeti1004.eqiad.wmnet

Now a

   sudo gnt-node list

should return 0 primary instances for the node. It is safe to reboot it or shut it down for a brief amount of time for hardware maintenance.

VMs without DRBD disk template

Note that there is a chance that this isn't true. We have a few instances that can't be migrated to other nodes. Those are things for which replicated storage (aka DRBD) is either an overkill (e.g. d-i-test) or a liability (e.g. etcd hosts where the extra IO latency by DRBD causes issues). For those VMs, which have the plain disk template, it's safe to just reboot the host, ganeti will DTRT.

After reboot, before you migrate the next node run

    sudo gnt-cluster verify-disks

It should display "No disks need to be activated." (possibly multiple times, one per ganeti nodegroup) before the next node can be rebooted (this ensures that the DRBD is synced fully)

Shutdown a node for a prolonged period of time

Should the node be going down for an undetermined amount of time, also move the secondary instances

   sudo gnt-node migrate -f <node_fqdn>
   sudo gnt-node evacuate -s  <node_fqdn> # Removes it as a secondary as well

The second command means moving around DRBD pairs and syncing disk data. It is bound to take a long time, so find something else to do in the meanwhile

Now a

   sudo gnt-node list 

should return 0 for both primary instances as well as secondary instances. Before powering off the node we need to remove it from the cluster as well

   sudo gnt-node remove <node_fqdn>

NOTE: Do not forget to readd it after it is fixed (if it ever is)

   sudo gnt-node add <node_fqdn>

Failed hardware node

When a host is having problem (hardware/kernel/otherwise) and it's unreachable, the are a number of possible avenues to solve the issue, but an empirical good way out is:

  • Just powercycle the host. If that works, it's probably the faster way out. Most services should anyway be set up highly available and if we got one that is not we either should set it that way or not care too much when it fails. If this works, you are done, if not keep on reading.
  • If the above doesn't work (the node never comes back up), start the VMs on another host. This can be done with
   sudo gnt-node failover -f <node_fqdn>

in some cases you may have to ignore the consistency checks (this has never happened in our setup), pass --ignore-consistency. Again all important services are set in high available setups (and could easily reimaged) be so this will only severely bite VMs that are not setup that way.

  • Remove the host from the cluster
   sudo gnt-node remove <node_fqdn>
  • Debug/fix/RMA the node.

VM operations

Listing VMs

   sudo gnt-instance list 

Create a VM

Creating a VM is easy. Most of the steps are the same as for production so keep in mind the regular process as well.

However, the reimage cookbook works only on physical host with an IPMI interface, so it cannot be used for VMs. So while the network boot installation will be unattended, VM creation is explained below, and Puppet signing and other steps is done with the manual procedure.

Verify cluster resource availability

First check how the Ganeti rows are currently used, to balance out resource allocation. For this run the following on the Ganeti master:

   sudo gnt-group list

which should return something like the output below

   Group Nodes Instances AllocPolicy NDParams
   A     3        20 preferred   ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1,ovs_name=switch1, oob_program=
   C     4        19 preferred   ovs=False, ssh_port=22, ovs_link=, spindle_count=1, exclusive_storage=False, cpu_speed=1,ovs_name=switch1, oob_program=

You can also have a more detailed breakdown of resource usage with e.g.

   elukey@ganeti1003:~$ sudo gnt-node list -o +group
   Node                   DTotal  DFree MTotal MNode MFree Pinst Sinst Group
   ganeti1001.eqiad.wmnet 707.4G   9.2G  62.9G 37.6G 24.9G    11    11 C
   ganeti1002.eqiad.wmnet 707.4G  69.2G  62.9G 40.4G 21.5G    10    11 C
   ganeti1003.eqiad.wmnet 707.4G 244.6G  62.9G 39.4G 21.2G    11    11 C
   ganeti1004.eqiad.wmnet 707.4G 285.1G  62.9G 55.8G  6.8G    10     8 C
   ganeti1005.eqiad.wmnet   2.1T 225.8G  62.8G 16.8G 46.0G     6    18 A
   ganeti1006.eqiad.wmnet   2.1T 806.7G  62.8G 55.0G  7.5G    13     5 A
   ganeti1007.eqiad.wmnet   2.1T   1.1T  62.8G 44.2G 16.6G    12    12 A
   ganeti1008.eqiad.wmnet   2.1T 834.4G  62.8G 47.8G 11.3G    13     7 A

First thing, check netbox to get an idea about what Ganeti nodes are in the various rows. Then check the DFree (disk free) and MFree (memory free) columns of the Ganeti nodes in the above table. In theory there should be sufficient disk/memory space on all nodes in the row that you are planning to use, otherwise you might get failures when creating the VM.

If you want to confirm the reason why creating a VM fails is really due to lack of resources you can look into the ganeti command.log. For example in eqiad you would check the current master node ganeti1003:

[ganeti1003:~] $ sudo grep failure /var/log/ganeti/commands.log
OpPrereqError: ("Can't compute nodes using iallocator 'hail': Request failed: Group C (preferred): No valid allocation solutions, failure reasons: FailMem: 9, FailDisk: 3", 'insufficient_resources')

Assign a hostname

You need to select a hostname based on our Infrastructure_naming_conventions. You should also add Cumin aliases to cover selecting the new hostnames. See Cumin#Global_grammar_host_selection and edit modules/profile/templates/cumin/aliases.yaml.erb in operations/puppet. If you are making a new 'cluster', you'll also need to add a $cluster_$dc definitions to hieradata/common/monitoring.yaml.

The IP allocation and the generation of DNS records are handled automatically by Netbox.

Create the VM

Ganeti VMs are created with the sre.ganeti.makevm cookbook. You can use it from any Cumin host like the following:

$ sudo cookbook sre.ganeti.makevm -h
usage: cookbook [-h] [--skip-v6] [--vcpus VCPUS] [--memory MEMORY] [--disk DISK] [--network {public,private,analytics}] [--cluster CLUSTER]
                [--group GROUP]
                hostname

Create a new Virtual Machine in Ganeti

    * Pre-allocate the primary IPs and set their DNS name
    * Update the DNS records
    * Create the VM on Ganeti
    * Force a sync of Ganeti VMs to Netbox in the same DC
    * Update Netbox attaching the pre-allocated IPs to the host's primary interface

    Examples:
        Create a Ganeti VM vmname.codfw.wmnet in the codfw Ganeti cluster
        on group B with 1 vCPUs, 3GB of RAM, 100GB of disk in the private network:

            makevm --vcpus 1 --memory 3 --disk 100 --cluster codfw --group B vmhostname



positional arguments:
  hostname              The hostname for the VM (not the FQDN).

optional arguments:
  -h, --help            show this help message and exit
  --skip-v6             To skip the generation of the IPv6 DNS record. (default: False)
  --vcpus VCPUS         The number of virtual CPUs to assign to the VM. (default: 1)
  --memory MEMORY       The amount of RAM to allocate to the VM in GB. (default: 1)
  --disk DISK           The amount of disk to allocate to the VM in GB. (default: 10)
  --network {public,private,analytics}
                        Specify the type of network to assign to the VM. (default: private)
  --cluster CLUSTER     The Ganeti cluster short name, as reported in Netbox as a Cluster Group:
                        https://netbox.wikimedia.org/virtualization/cluster-groups/ (default: None)
  --group GROUP         The Ganeti group name, as reported in Netbox as a Cluster for the given group:
                        https://netbox.wikimedia.org/virtualization/clusters/ (default: None)

Please note that creating a new instance takes several minutes, most of the times due to the provisioning of the disk space. The makevm cookbook returns the output of the Ganeti command periodically, in order to get a regular feedback about the status of the VM creation.

Update the DHCP config with the MAC of the new VM

The makevm cookbook will print the MAC address of the virtual NIC of the freshly created instance at the end. Next you need to add a DHCP entry with that MAC by editing the modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 file.

After merging the patch, you can force a puppet run on the installation servers to speed up the installation process;

 sudo cumin A:installserver 'run-puppet-agent'

How to query the MAC address (if you missed the cookbook output)

If you missed the MAC address when running the cookbook, you can query it after the machine has been created like so:

razzi@ganeti1009:~$ sudo gnt-instance show datahubsearch1001.eqiad.wmnet | grep -A 2 NIC
  NICs:
    - nic/0:
      MAC: aa:00:00:35:3d:f1

Update partman config (if needed)

The netboot.cfg files configures the partition setup of a server. For all Ganeti VMs that's consistently flat.cfg virtual.cfg. When adding a new instance you need to verify whether a partman config is assigned. Most server types use a globbing scheme like idp* and if you're adding a further IDP host you can simply skip this step. If you however add a server with a new standalone name, then you need to merge a change puppet for modules/install_server/files/autoinstall/netboot.cfg.

After that you need to run Puppet on the apt* servers:

sudo cumin apt* 'run-puppet-agent'

Start the VM

sudo gnt-instance start <fqdn>

and connect to the console

sudo gnt-instance console <fqdn>

Ctrl+] to leave the console

Set boot order to disk

Assuming the installation goes on well but before it finishes, you need to set the boot order back to disk. This is a limitation of the current version of the Ganeti software and is expected to be solved (upstream is aware[link?]}).

  gnt-instance modify --hypervisor-parameters=boot_order=disk <fqdn>

Note: when the VM has finished installing, it will shutdown automatically. The Ganeti software includes HA checks and will promptly restart it. We rely on this behaviour to have the VM successfully installed. However, if you list the VMs during this phase you will see the VM in ERROR_down. Don't worry, this is expected.

Note 2: If you need to set a boot_order back to PXE to reinstall, it's "boot_order=network" (For KVM the boot order is either "floppy", "cdrom", "disk" or "network".)

Sign Puppet certs

For baremetal hosts, this part is handled by the reimage script (wmf-auto-reimage). As of 2020, VMs are not supported, so we need to do it as in the manual installation instructions for Ganeti VMs.

The process is as follows:

On a puppetmaster, run the following command (which will enable you obtain/access root shell of the targeted VM, identified by FQDN provided)

install_console FQDN

You will be dropped into a root shell belonging to your newly created VM instance. On the VM root shell, initiate a puppet run by executing the following:

puppet agent -tv

The first Puppet run triggers a request for the certificate, it should show up when running "sudo puppet cert -l" on puppetmaster1001. Next approve the certificate with

sudo puppet cert -s FQDN

Once the first puppet run has completed, the interface_automation/ImportPuppetDB Netbox script will be automatically run by the netbox_ganeti_$DC_sync timers that run on the Netbox hosts to to synchronize the VM interfaces to Netbox.

Assign role to the VM in puppet

As usual. If the VM will not be installed immediately turn it off (see shutdown below). Running VMs that are not in Puppet will alarm.

Delete a VM

The preferred method is to run the decommissioning cookbook Decom_script that will take care of everything, including deleting the VM itself. See also the actions performed by the sre.hosts.decommission cookbook.

To instead irrevocably delete manually a VM run:

  gnt-instance remove <fqdn>

In all cases please remember to clean up Puppet/DHCP/DNS entries afterwards.

Shutdown/startup a VM

  gnt-instance startup <fqdn>
  gnt-instance shutdown <fqdn>

Note: In the shutdown command, ACPI will be used to achieve a graceful shutdown of the VM. A 2 minute timeout exists however, after which the VM will be forcefully shutdown. In case you prefer to not wait those 2 minutes, --timeout exists and can be used like so

  gnt-instance shutdown --timeout 0 <fqdn>

Get a console for a VM

You can get log into the "console" for a Ganeti instance via

  gnt-instance console <fqdn>

The console can be left with "ctrl + ]"

If the above fails then it is possible to connect to the vnc port directly. first you need to get the vnc port to connect to

$ sudo gnt-instance info netbox2002.codfw.wmnet | grep 'console connection'            
  console connection: vnc to ganeti2030.codfw.wmnet:11136 (display 5236)

We then need to set up port forwarding to the vnc port over ssh

$ ssh -11136:localhost:11136 ganeti2030.codfw.wmnet

finally start your vnc client

$ remote-viewer vnc://localhost:11136

Resize a VM

Increase/Decrease CPU/RAM

Make sure first that the cluster has adequate space for whatever resource you want to increase (if you do want to increase and not decrease a resource). This is done manually by a combination of grafana statistics for CPU/Memory utilization and the output of gnt-node list for disk space utilization. After that you can issue the following command to increase/decrease the memory size and number of Virtual CPUs assigned to a VM

 gnt-instance modify -B memory=<X>[gm],vcpus=<N> <fqdn>

where X, N are physical numbers. X can be suffixed by g or m for Gigabytes or Megabytes (please don't do Terabytes ;))

Adding disk space

Adding space to an existing disk is possible. But it is not advisable. Only do so sparingly and when you know it's the best course of action. A good example is if you know that the VM will be reimaged after the process. The reason is that growing a disk with a partition table in it means that the partition table will need to be updated as well (and after that the filesystem as well). This is a non automated process, error prone and requires at least 2 reboots (one for the VM to see the new disk size, and one for the VM to safely use the new partition table - that latter one can be avoided manually, but it's usually cheaper to just reboot the VM).

But do note that the resizing of partitions and filesystems is up to you, as ganeti can't do it for you. The command would be:

 gnt-instance grow-disk <fqdn> <#> X[gmt]

where # is the index of the disk (it's 0 indexed). You can get the disks allocated to a VM using gnt-instance info <fqdn>. Again X is a physical number suffixed for Gigabytes/Megabytes/Terabytes.

Adding a disk

Adding a disk is also easy if you want to avoid the mess that will result from having to resize partitions/filesystems. Thus it is recommended over resizing a disk. The command would be:

 gnt-instance modify --disk add:size=X[gmt] <fqdn>

Again X is a physical number suffixed for Gigabytes/Megabytes/Terabytes.


WARNING!! After adding a device you have to reboot the VM (from ganeti level, not just reboot within the VM) and it might happen that your device names change and you get no networking.

If the VM does not come back from reboot, login via console and root password from pwstore and replace the device name with the "next" one in /etc/network/interfaces. For example replace all occurrences of "ens5" with "ens6" and reboot the machine one more time. also see phab:T272555. If you want to be sure about the "next" device name run ifconfig -a to double check.

After the disk has been created and the VM rebooted you need to create a partition table on it (fdisk), create a file system (mkfs.ext4) and mount it. Finally don't forget to add that to /etc/fstab to make sure it survives a reboot.

Hint: if you know beforehand that a VM will need an extra disk, add it before the install phase. In that case you only need to add the disk, and no other step.

Removing a disk

In case a VM no longer requires a disk, it can be removed. The command would be:

 gnt-instance modify --disk <#>:remove <fqdn>

where # is the index of the disk (it's 0 indexed). You can get the disks allocated to a VM using gnt-instance info <fqdn>

Reinstall / Reimage a VM

Just like a physical server OS reinstall this will destroy the contents of the machine and requires appropriate netboot configs to be in place. Proceed with caution!

Also note the wmf-auto-reimage script does not work for virtual machines, so a manual procedure has to be done, as explained below.

$ FQDN=<fqdn>

Shutdown the VM

$ sudo gnt-instance shutdown $FQDN

Set boot device to network

$ sudo gnt-instance modify --hypervisor-parameters=boot_order=network $FQDN

Start instance and attach to console (ctrl-] to detach)

$ sudo gnt-instance start $FQDN && sudo gnt-instance console $FQDN

After the OS install has finished (or while it is successfully under way from a separate terminal) set the boot device back to disk

$ sudo gnt-instance modify --hypervisor-parameters=boot_order=disk $FQDN

Finally, after the install finishes, boot the system into the fresh OS install and attach to the console (ctrl-] to detach)

$ sudo gnt-instance start $FQDN && sudo gnt-instance console $FQDN

After you get the server login you need to proceed with the manual installation procedure.

Renumber (aka change network) a VM

Say you want to change the networking configuration of a VM. It's usually better to create a new VM in the correct row instead. However there are some benefits to renumbering a VM like avoiding to have to re instantiate some database or some other lengthy to setup process. Some other indicative reasons for that would be:

  • Spreading a cluster of VMs offering a service across more network rows.
  • Rebalancing VMs across rows to free up capacity in a problematic row.

Steps are:

  1. Make sure the VM isn't powering some service (e.g. depool from pybal, if the VM is in a cluster, remove from the rest of the cluster's configuration)
  2. Choose a new row in the DC. Same rules are for creating a VM apply.
  3. Schedule downtime.
  4. Update DNS. Make sure you don't forget updating IPv6 :-)
  5. Update the networking configuration in /etc/network/interfaces. This is an error prone process with no review. Make sure you get it right otherwise you are in for some painful debugging
  6. Update the /etc/hosts referencing the host with the new IPv4 address
  7. If possible, power off the VM.
  8. Run the following command
sudo gnt-instance change-group --to=<row> <instance>

Wait it out.

Power on or reboot the VM


Change disk template for a VM (aka drop DBRD)

In some cases (very few), it makes sense that the disk of a VM is not replicated to another node. That will only be true for VMs where replication is treated internally by the application hosted in the VM. That could be things like a replicated Redis or a replicated MySQL or a distributed datastore like cassandra, etcd, zookeeper.

In some of the above cases, DRBD will not only be a waste but also cause issues. The one we 've seen is etcd, where the IO latency added by DRBD causes etcd to periodically complain about lag and syncing issues. This is usually seen via icinga alerts. The fix for this is to switch from DRBD to the plain template. The following commands will do that

Stop first the VM as the disk template change can't happen with the VM up

 $ sudo gnt-instance shutdown foobar.eqiad.wmnet

With the VM stopped, change the disk template

 $ sudo gnt-instance modify -t plain foobar.eqiad.wmnet
 Tue Feb  9 13:53:05 2021 Converting disk template from 'drbd' to 'plain'
 Tue Feb  9 13:53:05 2021 Removing volumes on the secondary node...
 Tue Feb  9 13:53:06 2021 Removing unneeded volumes on the primary node...
 Modified instance foobar.eqiad.wmnet
  - disk_template -> plain
 Please don't forget that most parameters take effect only at the next (re)start of the instance initiated by ganeti; 
 restarting from within the instance will not be enough.

Start now the VM again

 $ sudo gnt-instance startup foobar.eqiad.wmnet

Memory pressure

We have memory pressure icinga checks. Ganeti nodes are heavily memory bound and rather inflexible (that is, memory does not change a lot over time, once a VM is started it obtains that memory and uses it. We don't do ballooning, there is little gain in that currently). When a memory pressure icinga check happens, the best bet is to just rebalance the cluster or at least the nodegroup (representing a rack row currently). See Ganeti#Cluster rebalancing

Notes

All of the commands that have a Y/N prompt can be forced with a -f. For example the following will spare you the prompt

  gnt-instance remove -f <fqdn>

All commands are actually jobs. If you would rather not wait on the prompt --submit will do the trick

  gnt-instance shutdown --submit <fqdn>

API