User:Jhedden/CephTesting

Revision as of 22:52, 14 January 2020

CloudVPS use cases

Phase 1

Cloudvps-ceph-phase1-2.png

Block Storage

CloudVPS hypervisors using libvirtd and QEMU can attach to Ceph block devices using librbd (a user-space implementation of the Ceph block device).

Utilizing Ceph block devices will allow for fast virtual machine live migrations, persistent volume attachments through Cinder and copy-on-write snapshot capabilities.

Important note: Ceph does not support QCOW2 for hosting a virtual machine disk. If you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), the Glance image format must be RAW.

Future

NFS replacement

A potential solution could be CephFS mounted directly on the clients, or CephFS with Ganesha providing NFS services to clients.

Bare-metal host configuration

BIOS

Base settings

  • Ensure "Boot mode" is set to "BIOS" and not "UEFI". (This is required for the netboot process)

PXE boot settings

All Ceph hosts are equipped with a 10Gb Broadcom BCM57412 NIC and are not using the embedded onboard NIC.

  1. During the system boot, when prompted to configure the second Broadcom BCM57412 NIC device press "ctrl + s"
  2. On the main menu select "MBA Configuration" and toggle the "Boot Protocol" setting to "Preboot Execution Environment (PXE)"
  3. Press escape, then select "Exit and Save Configurations"
  4. After the system reboots, press "F2" to enter "System Setup"
  5. Navigate to "System BIOS > Boot Settings > BIOS Boot Settings"
  6. Select "Boot Sequence" and change the boot order to: "Hard drive C:", "NIC in Slot 2 Port 1..", "Embedded NIC 1 Port 1..."
  7. Exit System Setup, saving your changes and rebooting the system

Alternatively, steps 4 through 7 can be replaced with racadm commands, but you will still need to enable the PXE boot protocol in the option ROM.

/admin1-> racadm set BIOS.BiosBootSettings.bootseq HardDisk.List.1-1,NIC.Slot.2-1-1,NIC.Embedded.1-1-1
/admin1-> racadm jobqueue create BIOS.Setup.1-1
/admin1-> racadm serveraction hardreset

Hardware RAID configuration (OSD hosts only)

The OSD hosts are configured with 10 SSD drives. Using the two 240GB SSD drives, create a single virtual disk protected by RAID1. This virtual disk will be used for the operating system mount points: "/", "/boot" and "/srv". The remaining 1.9TB SSD drives will be standalone physical devices configured by Rook and used for Ceph data.

Local storage

CloudVPS-ceph-disk-layout-2.png

Operating System

All Ceph hosts will be using Debian 10 (Buster). There are some packaging concerns as the upstream deb packages are a few versions behind.

Test environment installation notes

Rook

Hieradata

profile::ceph::docker::settings:
 log-driver: json-file
profile::ceph::docker::version: '5:19.03.0~3-0~debian-stretch'
profile::ceph::etcd::bootstrap: true
profile::ceph::k8s::apiserver: 'jeh-cephmon01.testlabs.eqiad.wmflabs'
profile::ceph::k8s::node_token: '<MASKED>.<MASKED>'
profile::ceph::k8s::pause_image: 'docker-registry.tools.wmflabs.org/pause:3.1'
profile::ceph::k8s::pod_subnet: '192.168.0.0/16'
profile::ceph::k8s::version: '1.15.5'
profile::ceph::k8s::pkg_release: '00'
profile::ceph::mon_hosts:
- jeh-cephmon01.testlabs.eqiad.wmflabs
- jeh-cephmon02.testlabs.eqiad.wmflabs
- jeh-cephmon03.testlabs.eqiad.wmflabs
puppetmaster: jeh-puppetmaster.testlabs.eqiad.wmflabs

Puppet Roles

role::wmcs::ceph::mon
role::wmcs::ceph::osd

ETCD

After applying the roles with Puppet, check etcd health:

etcdctl --endpoints https://$(hostname -f):2379 --key-file /var/lib/puppet/ssl/private_keys/$(hostname -f).pem --cert-file /var/lib/puppet/ssl/certs/$(hostname -f).pem cluster-health
member 559a8dc863a539a is healthy: got healthy result from https://jeh-cephmon02.testlabs.eqiad.wmflabs:2379
member 60c132b2786c6c2 is healthy: got healthy result from https://jeh-cephmon01.testlabs.eqiad.wmflabs:2379
member d01fd114808eb37b is healthy: got healthy result from https://jeh-cephmon03.testlabs.eqiad.wmflabs:2379
cluster is healthy

Remove the `profile::ceph::etcd::bootstrap: true` hiera key and re-run Puppet.

Reboot to clear stale iptables rules.

Kubeadm

Initialize Kubernetes and configure kubectl

kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Apply the base pod security profiles and calico manifests

kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml
kubectl apply -f /etc/kubernetes/calico.yaml

Add the other cephmon0[2-3] nodes to the control plane

kubeadm join jeh-cephmon01.testlabs.eqiad.wmflabs:6443 --token <TOKEN> \
   --discovery-token-ca-cert-hash <CERT HASH> \
   --control-plane --certificate-key <CERT KEY>

Add all the cephosd0[1-3] nodes as workers

cephosd01: ~# kubeadm join jeh-cephmon01.testlabs.eqiad.wmflabs:6443 --token <TOKEN> \
   --discovery-token-ca-cert-hash <CERT HASH>

Untaint the cephmon0[1-3] nodes to allow pod workloads

kubectl taint nodes jeh-cephmon01 node-role.kubernetes.io/master-
kubectl taint nodes jeh-cephmon02 node-role.kubernetes.io/master-
kubectl taint nodes jeh-cephmon03 node-role.kubernetes.io/master-

Rook operator

Label nodes with rook roles

kubectl label nodes jeh-cephmon01 role=storage-mon
kubectl label nodes jeh-cephosd01 role=storage-osd

Apply the rook manifests

kubectl create -f /etc/rook/common.yaml
kubectl create -f /etc/rook/operator.yaml
kubectl create -f /etc/rook/cluster.yaml
kubectl create -f /etc/rook/toolbox.yaml

View the operator logs

kubectl -n rook-ceph logs -l "app=rook-ceph-operator"

Connect to the toolbox container and run ceph commands

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

Ceph-Deploy

ceph-deploy requires password-less SSH authentication between the storage cluster nodes. Because of this requirement, the ceph-deploy utility was not evaluated. https://docs.ceph.com/docs/master/start/quick-start-preflight/#ceph-deploy-setup

Debian Packages

Prebuilt debs

Debian does not provide Ceph packages for Buster. More details at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907123

The full list of pre-built official upstream packages is available at https://download.ceph.com

Unofficial packages are available at https://croit.io/2019/07/07/2019-07-07-debian-mirror and https://github.com/croit/ceph. These packages may include patches and/or enhancements made by Croit.

Building debs

Ceph includes support for building Debian Buster packages. If we plan to go this route, these packages will be uploaded to our apt repository. Packages can be built using the following process:

$ git clone https://github.com/ceph/ceph.git /srv/ceph
$ cd /srv/ceph
$ git checkout tags/v14.2.4 -b nautilus_latest
$ ./install-deps.sh
$ ./make-deb.sh

By default the built Ceph packages will be located in `/tmp/release/Debian/WORKDIR`. NOTE: the build process requires ~150GB of free space in /tmp.

List of packages used for testing

Dependencies

apt install \
  binutils \
  cryptsetup \
  cryptsetup-bin \
  libgoogle-perftools4 \
  libibverbs1 \
  libleveldb1d \
  liblttng-ust0 \
  liboath0 \
  librabbitmq4 \
  librdmacm1 \
  python-bcrypt \
  python-cherrypy3 \
  python-pecan \
  python-werkzeug

Ceph libraries

dpkg -i \
  libcephfs2_14.2.4-1_amd64.deb \
  librbd1_14.2.4-1_amd64.deb \
  librgw2_14.2.4-1_amd64.deb \
  librados2_14.2.4-1_amd64.deb \
  libradosstriper1_14.2.4-1_amd64.deb

Ceph common and python modules

dpkg -i \
  ceph-common_14.2.4-1_amd64.deb \
  python-ceph-argparse_14.2.4-1_all.deb \
  python-cephfs_14.2.4-1_amd64.deb \
  python-rbd_14.2.4-1_amd64.deb \
  python-rgw_14.2.4-1_amd64.deb \
  python-rados_14.2.4-1_amd64.deb

Ceph Base and Services

dpkg -i \
  ceph-base_14.2.4-1_amd64.deb \
  ceph-mgr_14.2.4-1_amd64.deb \
  ceph-mon_14.2.4-1_amd64.deb \
  ceph-osd_14.2.4-1_amd64.deb


Puppet installation

Puppet modules have been built based on the manual installation procedures defined at https://docs.ceph.com/docs/master/install/

Puppet

Roles

  • wmcs::ceph::mon Deploys the Ceph monitor and manager daemon to support CloudVPS hypervisors
  • wmcs::ceph::osd Deploys the Ceph object storage daemon to support CloudVPS hypervisors
  • role::wmcs::openstack::eqiad1::virt_ceph Deploys nova-compute configured with RBD based virtual machines

Profiles

  • profile::ceph::client::rbd Install and configure a Ceph RBD client
  • profile::ceph::osd Install and configure Ceph object storage daemon
  • profile::ceph::mon Install and configure Ceph monitor and manager daemon

Modules

  • ceph Install and configure the base Ceph installation used by all services and clients
  • ceph::admin Configures the Ceph administrator keyring
  • ceph::mgr Install and configure the Ceph manager daemon
  • ceph::mon Install and configure the Ceph monitor daemon
  • ceph::keyring Defined resource that manages access control and keyrings

Hieradata

Initial Ceph configuration

# Ceph configuration for testing RADOS block devices in CloudVPS
# using filestore backend on virtual machines
[global]
   auth cluster required = cephx
   auth service required = cephx
   auth client required = cephx
   fsid = 078d40c4-2f87-4f9b-9e61-1da3053bc925
   mon initial members = cloudcephmon1001,cloudcephmon1002,cloudcephmon1003
[mon.cloudcephmon1001]
   host = cloudcephmon1001
   mon addr = 208.80.154.148
[mon.cloudcephmon1002]
   host = cloudcephmon1002
   mon addr = 208.80.154.149
[mon.cloudcephmon1003]
   host = cloudcephmon1003
   mon addr = 208.80.154.150
[client]
   rbd cache = true
   rbd cache writethrough until flush = true
   admin socket = /var/run/ceph/guests/$cluster-$type.$id.$pid.$cctid.asok
   log file = /var/log/ceph/qemu/qemu-guest-$pid.log
   rbd concurrent management ops = 20

Post puppet procedures

Adding OSDs

Locate available disks with lsblk

cloudcephosd1001:~# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                                                                     8:0    0 223.6G  0 disk
├─sda1                                                                                                  8:1    0  46.6G  0 part
│ └─md0                                                                                                 9:0    0  46.5G  0 raid1 /
├─sda2                                                                                                  8:2    0   954M  0 part
│ └─md1                                                                                                 9:1    0   953M  0 raid1 [SWAP]
└─sda3                                                                                                  8:3    0 176.1G  0 part
  └─md2                                                                                                 9:2    0   176G  0 raid1
    └─cloudcephosd1001--vg-data                                                                       253:2    0 140.8G  0 lvm   /srv
sdb                                                                                                     8:16   0 223.6G  0 disk
├─sdb1                                                                                                  8:17   0  46.6G  0 part
│ └─md0                                                                                                 9:0    0  46.5G  0 raid1 /
├─sdb2                                                                                                  8:18   0   954M  0 part
│ └─md1                                                                                                 9:1    0   953M  0 raid1 [SWAP]
└─sdb3                                                                                                  8:19   0 176.1G  0 part
  └─md2                                                                                                 9:2    0   176G  0 raid1
    └─cloudcephosd1001--vg-data                                                                       253:2    0 140.8G  0 lvm   /srv
sdc                                                                                                     8:80   0   1.8T  0 disk
sdd                                                                                                     8:96   0   1.8T  0 disk
sde                                                                                                     8:80   0   1.8T  0 disk
sdf                                                                                                     8:80   0   1.8T  0 disk
sdg                                                                                                     8:96   0   1.8T  0 disk
sdh                                                                                                     8:112  0   1.8T  0 disk
sdi                                                                                                     8:128  0   1.8T  0 disk
sdj                                                                                                     8:144  0   1.8T  0 disk


To prepare a disk for Ceph, first zap the disk

cloudcephosd1001:~# ceph-volume lvm zap /dev/sdc
--> Zapping: /dev/sdc
--> --destroy was not specified, but zapping a whole device will remove the partition table
Running command: /bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10
 stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.00357845 s, 2.9 GB/s
--> Zapping successful for: <Raw Device: /dev/sdc>

Then prepare, activate and start the new OSD

cloudcephosd1001:~# ceph-volume lvm create --bluestore --data /dev/sde

Creating Pools

To create a new storage pool you will first need to determine the number of placement groups that will be assigned to the new pool. You can use the calculator at https://ceph.io/pgcalc/ to help identify a starting point (note: this value can easily be increased later, but not decreased):

sudo ceph osd pool create eqiad1-compute 512
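As a rough cross-check of the pgcalc starting point, the commonly cited formula is (OSDs × 100) / replica count, rounded up to a power of two. A minimal sketch assuming this cluster's 24 OSDs and 3-way replication (pgcalc also weighs per-pool data percentages, which is why the value actually chosen here can be lower):

```shell
# Sketch of the common PG starting-point formula; assumes 24 OSDs and
# 3 replicas (pgcalc's defaults), NOT the exact tool output.
osds=24
replicas=3
target=$(( osds * 100 / replicas ))   # 800 PGs across all pools
pgs=1
while [ "$pgs" -lt "$target" ]; do
    pgs=$(( pgs * 2 ))                # round up to the next power of two
done
echo "$pgs"                           # prints 1024
```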

Enable the RBD application for the new pool

sudo ceph osd pool application enable eqiad1-compute rbd

Rate limiting

Native

Native RBD rate limiting is supported in the Ceph Nautilus release. Due to upstream availability and multiple Debian releases, we will likely have a mixture of older Ceph client versions during phase 1.

Available rate limiting options and their defaults in the Nautilus release:

$ rbd config pool ls <pool> | grep qos
rbd_qos_bps_burst                                   0         config
rbd_qos_bps_limit                                   0         config
rbd_qos_iops_burst                                  0         config
rbd_qos_iops_limit                                  0         config
rbd_qos_read_bps_burst                              0         config
rbd_qos_read_bps_limit                              0         config
rbd_qos_read_iops_burst                             0         config
rbd_qos_read_iops_limit                             0         config
rbd_qos_schedule_tick_min                           50        config
rbd_qos_write_bps_burst                             0         config
rbd_qos_write_bps_limit                             0         config
rbd_qos_write_iops_burst                            0         config
rbd_qos_write_iops_limit                            0         config

OpenStack

IO rate limiting can also be managed using a flavor's metadata. This will trigger libvirt to apply `iotune` limits on the ephemeral disk.

Available disk tuning options
  • disk_read_bytes_sec
  • disk_read_iops_sec
  • disk_write_bytes_sec
  • disk_write_iops_sec
  • disk_total_bytes_sec
  • disk_total_iops_sec

NOTE: Updating a flavor's metadata does not have any effect on existing virtual machines.

Example commands to create or modify a flavor's metadata with rate limiting options roughly equal to a 7200 RPM SATA disk:

openstack flavor create \
  --ram 2048 \
  --disk 20 \
  --vcpus 1 \
  --private \
  --project testlabs \
  --id 857921a5-f0af-4069-8ad1-8f5ea86c8ba2 \
  --property quota:disk_total_iops_sec=100 m1.small-ceph
openstack flavor set --property quota:disk_total_bytes_sec=$((100<<20)) 857921a5-f0af-4069-8ad1-8f5ea86c8ba2
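The byte value passed to `flavor set` is plain shell arithmetic: shifting 100 left by 20 bits multiplies by 2^20, converting 100 MiB to bytes, the same number that later appears in the libvirt iotune XML:

```shell
# 100 << 20 multiplies by 2^20, i.e. converts 100 MiB to bytes
bytes=$(( 100 << 20 ))
echo "$bytes"   # prints 104857600
```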

Example rate limit configuration as seen by libvirt (virsh dumpxml <instance name>):

<target dev='vda' bus='virtio'/>
 <iotune>
  <total_bytes_sec>104857600</total_bytes_sec>
  <total_iops_sec>100</total_iops_sec>
 </iotune>

Monitoring

Dashboards

The Grafana dashboards provided by the Ceph community have been installed and updated for our environment.

Upstream source: https://github.com/ceph/ceph/tree/master/monitoring/grafana/dashboards

Icinga alerts

Ceph Cluster Health

  • Description 
    Ceph storage cluster health check
  • Status Codes
    • 0 - healthy, all services are healthy
    • 1 - warn, cluster is running in a degraded state, data is still accessible
    • 2 - critical, cluster is failed, some or all data is inaccessible
  • Next steps
    • On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command sudo ceph --status. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph --status
 cluster:
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_OK

 services:
   mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 3w)
   mgr: cloudcephmon1002(active, since 10d), standbys: cloudcephmon1003, cloudcephmon1001
   osd: 24 osds: 24 up (since 3w), 24 in (since 3w)

 data:
   pools:   1 pools, 256 pgs
   objects: 3 objects, 19 B
   usage:   25 GiB used, 42 TiB / 42 TiB avail
   pgs:     256 active+clean
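The status codes above map directly onto Ceph's health states. A minimal sketch of that mapping (a hypothetical helper, not the deployed Icinga check script):

```shell
# Hypothetical mapping of `ceph health` states to the Icinga status
# codes listed above; not the actual deployed check.
health_to_status() {
    case "$1" in
        HEALTH_OK)   echo 0 ;;  # healthy, all services are healthy
        HEALTH_WARN) echo 1 ;;  # degraded, data still accessible
        *)           echo 2 ;;  # HEALTH_ERR or unknown: critical
    esac
}
health_to_status HEALTH_OK    # prints 0
```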

Ceph Monitor Quorum

  • Description 
    Verify there are enough Ceph monitor daemons running for proper quorum
  • Status Codes
    • 0 - healthy, 3 or more Ceph monitors are running
    • 2 - critical, less than 3 Ceph monitors are running
  • Next steps
    • On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command sudo ceph mon stat. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph mon stat
e1: 3 mons at {cloudcephmon1001=[v2:208.80.154.148:3300/0,v1:208.80.154.148:6789/0],cloudcephmon1002=[v2:208.80.154.149:3300/0,v1:208.80.154.149:6789/0],cloudcephmon1003=[v2:208.80.154.150:3300/0,v1:208.80.154.150:6789/0]}, election epoch 24, leader 0 cloudcephmon1001, quorum 0,1,2 cloudcephmon1001,cloudcephmon1002,cloudcephmon1003
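For context, monitor quorum requires a strict majority of the configured monitors, so a 3-monitor cluster can technically survive one monitor failure; the alert nonetheless goes critical below 3 running monitors to flag the lost redundancy. The majority arithmetic:

```shell
# Paxos quorum needs a strict majority: floor(n/2) + 1
mons=3
quorum_min=$(( mons / 2 + 1 ))
echo "$quorum_min"   # prints 2
```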

Performance Testing

Network

Baseline (default tuning options)

Iperf options used to simulate Ceph storage IO.

-N disable Nagle's Algorithm
-l 4M set read/write buffer size to 4 megabytes
-P number of parallel client threads to run (one per OSD)

Server:

iperf -s -N -l 4M

Client:

iperf -c <server> -N -l 4M -P 8
cloudcephosd <-> cloudcephosd
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[ 10]  0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[  9]  0.0-10.0 sec   664 MBytes   557 Mbits/sec
[  6]  0.0-10.0 sec   720 MBytes   603 Mbits/sec
[  5]  0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[ 13]  0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[  7]  0.0-10.0 sec   720 MBytes   602 Mbits/sec
[  8]  0.0-10.0 sec   720 MBytes   603 Mbits/sec
[SUM]  0.0-10.0 sec  11.0 GBytes  9.42 Gbits/sec
cloudvirt1022 -> cloudcephosd
cloudvirt1022 <-> cloudcephosd: 8.55 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  1.11 GBytes   949 Mbits/sec
[  6]  0.0-10.0 sec  1.25 GBytes  1.07 Gbits/sec
[  4]  0.0-10.0 sec  1.39 GBytes  1.19 Gbits/sec
[  9]  0.0-10.0 sec  1.24 GBytes  1.06 Gbits/sec
[ 10]  0.0-10.0 sec  1.07 GBytes   920 Mbits/sec
[  5]  0.0-10.0 sec  1.36 GBytes  1.16 Gbits/sec
[  3]  0.0-10.0 sec  1.41 GBytes  1.21 Gbits/sec
[  8]  0.0-10.0 sec  1.17 GBytes  1.00 Gbits/sec
[SUM]  0.0-10.0 sec  10.0 GBytes  8.55 Gbits/sec

Ceph RBD

Test cases

FIO random read/write

$ fio --name fio-randrw \
      --bs=4k \
      --direct=1 \
      --filename=/srv/fio.randrw \
      --fsync=256 \
      --gtod_reduce=1 \
      --iodepth=64 \
      --ioengine=libaio \
      --randrepeat=1 \
      --readwrite=randrw \
      --rwmixread=50 \
      --size=5G \
      --group_reporting

FIO sequential read/write

$ fio --name=fio-seqrw \
      --bs=4k \
      --direct=1 \
      --filename=/srv/fio.seqrw \
      --fsync=256 \
      --gtod_reduce=1 \
      --iodepth=64 \
      --ioengine=libaio \
      --rw=rw \
      --size=5G \
      --group_reporting

Baseline (default tuning options)

single virtual machine
$ dd if=/dev/zero of=/srv/test.dd bs=4k count=125000 conv=sync
512000000 bytes (512 MB, 488 MiB) copied, 0.875202 s, 585 MB/s
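As a sanity check on the dd figure, dividing bytes copied by elapsed microseconds gives MB/s directly (10^6 bytes per 10^6 µs):

```shell
# 512000000 bytes in 0.875202 s (= 875202 µs); bytes/µs equals MB/s
echo $(( 512000000 / 875202 ))   # prints 585
```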

FIO sequential read/write

fio-seqrw: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
fio-seqrw: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=12.1MiB/s,w=11.9MiB/s][r=3092,w=3045 IOPS][eta 00m:00s]
fio-seqrw: (groupid=0, jobs=1): err= 0: pid=31970: Fri Jan 10 15:24:29 2020
 read: IOPS=3849, BW=15.0MiB/s (15.8MB/s)(2561MiB/170310msec)
  bw (  KiB/s): min= 7048, max=41668, per=100.00%, avg=15403.11, stdev=7014.38, samples=340
  iops        : min= 1762, max=10417, avg=3850.78, stdev=1753.59, samples=340
 write: IOPS=3846, BW=15.0MiB/s (15.8MB/s)(2559MiB/170310msec); 0 zone resets
  bw (  KiB/s): min= 6464, max=41365, per=100.00%, avg=15389.03, stdev=7006.27, samples=340
  iops        : min= 1616, max=10341, avg=3847.25, stdev=1751.56, samples=340
 cpu          : usr=3.43%, sys=11.35%, ctx=623109, majf=0, minf=9
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.4%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued rwts: total=655676,655044,0,5021 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  READ: bw=15.0MiB/s (15.8MB/s), 15.0MiB/s-15.0MiB/s (15.8MB/s-15.8MB/s), io=2561MiB (2686MB), run=170310-170310msec
 WRITE: bw=15.0MiB/s (15.8MB/s), 15.0MiB/s-15.0MiB/s (15.8MB/s-15.8MB/s), io=2559MiB (2683MB), run=170310-170310msec

Disk stats (read/write):
 vda: ios=656106/663558, merge=28/3888, ticks=3895800/3550224, in_queue=7399608, util=74.52%

Rate limiting enabled

single virtual machine
$ dd if=/dev/zero of=/srv/1test.dd bs=4k count=125000 conv=sync
512000000 bytes (512 MB, 488 MiB) copied, 4.57852 s, 112 MB/s

FIO sequential read/write

fio-seqrw: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [M(1)][100.0%][r=228KiB/s,w=172KiB/s][r=57,w=43 IOPS][eta 00m:00s]
fio-seqrw: (groupid=0, jobs=1): err= 0: pid=30958: Fri Jan 10 19:10:12 2020
 read: IOPS=49, BW=198KiB/s (203kB/s)(2561MiB/13237587msec)
  bw (  KiB/s): min=    7, max=  584, per=100.00%, avg=201.54, stdev=48.09, samples=26014
  iops        : min=    1, max=  146, avg=50.33, stdev=12.02, samples=26014
 write: IOPS=49, BW=198KiB/s (203kB/s)(2559MiB/13237587msec); 0 zone resets
  bw (  KiB/s): min=    7, max=  696, per=100.00%, avg=201.30, stdev=56.83, samples=26023
  iops        : min=    1, max=  174, avg=50.27, stdev=14.21, samples=26023
 cpu          : usr=0.16%, sys=0.62%, ctx=1208453, majf=0, minf=10
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.4%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued rwts: total=655676,655044,0,5021 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  READ: bw=198KiB/s (203kB/s), 198KiB/s-198KiB/s (203kB/s-203kB/s), io=2561MiB (2686MB), run=13237587-13237587msec
 WRITE: bw=198KiB/s (203kB/s), 198KiB/s-198KiB/s (203kB/s-203kB/s), io=2559MiB (2683MB), run=13237587-13237587msec

Disk stats (read/write):
 vda: ios=659758/669762, merge=0/9391, ticks=517938687/296952049, in_queue=800011092, util=98.11%
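The throughput collapse is expected: with quota:disk_total_iops_sec=100 and 4 KiB blocks, the combined read+write ceiling is about 400 KiB/s, and fio indeed settles at roughly 198 KiB/s read plus 198 KiB/s write (~50 IOPS each):

```shell
# 100 total IOPS * 4 KiB per IO = aggregate throughput ceiling in KiB/s
iops_limit=100
block_kib=4
echo $(( iops_limit * block_kib ))   # prints 400
```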

CloudVPS Configuration Changes

Glance

TODO: add more detail; OS images need to be stored in Glance.

Nova compute CPU mode

A virtual machine can only be live migrated to a hypervisor matching the same CPU. CloudVPS currently has multiple CPU models and is using the default "host-model" nova configuration.

To enable live migration between any production hypervisor, the cpu_mode parameter should match the oldest hypervisor CPU model.

Hypervisor range                  CPU model                Launch date
cloudvirt[1023-1030].eqiad.wmnet  Gold 6140 (Skylake)      2017
cloudvirt[1016-1022].eqiad.wmnet  E5-2697 v4 (Broadwell)   2016
cloudvirt[1012-1014].eqiad.wmnet  E5-2697 v3 (Haswell)     2014
cloudvirt[1001-1009].eqiad.wmnet  E5-2697 v2 (Ivy Bridge)  2013
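A hypothetical nova.conf sketch of this change (cpu_mode and cpu_model are nova's libvirt-driver options; the exact model string chosen for production is not specified on this page):

```ini
# Hypothetical sketch: pin guests to the oldest CPU model in the fleet
# (E5-2697 v2, Ivy Bridge) so live migration works on any hypervisor.
[libvirt]
cpu_mode = custom
cpu_model = IvyBridge
```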

Virtual Machine Images

Important: Using QCOW2 for hosting a virtual machine disk is NOT recommended. If you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), please use the raw image format within Glance.

Once all CloudVPS virtual machines have been migrated to Ceph we can convert the existing virtual machine images in Glance from QCOW2 to raw. This will avoid having nova-compute convert the image each time a new virtual machine is created.

VirtIO SCSI devices

Currently CloudVPS virtual machines are configured with the virtio-blk driver. This driver does not support discard/trim operations to free up deleted blocks.

Discard support can be enabled by using the virtio-scsi driver, but it's important to note that the device labels will change from /dev/vda to /dev/sda.

Migrating VMs from local storage to Ceph

NOTE: This process is working but requires more testing and verification

Switch puppet roles to the Ceph enabled wmcs::openstack::eqiad1::virt_ceph role. In operations/puppet/manifest/site.pp:

node 'cloudvirt1022.eqiad.wmnet' {
   role(wmcs::openstack::eqiad1::virt_ceph)
}

Run the puppet agent on the hypervisor

hypervisor $ sudo puppet agent -tv

Shutdown the VM

cloudcontrol $ openstack server stop <UUID>

Convert the local QCOW2 image to raw and upload to Ceph

hypervisor $ qemu-img convert -f qcow2 -O raw /var/lib/nova/instances/<UUID>/disk rbd:compute/<UUID>_disk:id=eqiad1-compute

Undefine the virtual machine. This command removes the existing libvirt definition from the hypervisor; once nova attempts to start the VM, it will be redefined with the RBD configuration.

hypervisor $ virsh undefine <OS-EXT-SRV-ATTR:instance_name>

Cleanup local storage files

hypervisor $ rm /var/lib/nova/instances/<UUID>/disk
hypervisor $ rm /var/lib/nova/instances/<UUID>/disk.info

Power on the VM

cloudcontrol $ openstack server start <UUID>

CLI examples

Create, format and mount an RBD image (useful for testing / debugging)

$ rbd create datatest --size 250 --pool compute --image-feature layering
$ rbd map datatest --pool compute --name client.admin
$ mkfs.ext4 -m0 /dev/rbd0
$ mount /dev/rbd0 /mnt/
$ umount /mnt
$ rbd unmap /dev/rbd0
$ rbd rm compute/datatest

List RBD nova images

$ rbd ls -p compute
9e2522ca-fd5e-4d42-b403-57afda7584c0_disk

Show RBD image information

$ rbd info -p compute 9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk
rbd image '9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk':
       size 20 GiB in 5120 objects
       order 22 (4 MiB objects)
       snapshot_count: 0
       id: aec56b8b4567
       block_name_prefix: rbd_data.aec56b8b4567
       format: 2
       features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
       op_features:
       flags:
       create_timestamp: Mon Jan  6 21:36:11 2020
       access_timestamp: Mon Jan  6 21:36:11 2020
       modify_timestamp: Mon Jan  6 21:36:11 2020
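The object count in this output follows from the order field: order 22 means 2^22-byte (4 MiB) objects, so a 20 GiB image divides into 5120 of them:

```shell
# order 22 => 2^22-byte (4 MiB) objects; 20 GiB / 4 MiB = 5120 objects
obj_bytes=$(( 1 << 22 ))
img_bytes=$(( 20 * 1024 * 1024 * 1024 ))
echo $(( img_bytes / obj_bytes ))   # prints 5120
```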

View RBD image with qemu tools on a hypervisor

$ qemu-img info rbd:<pool>/<vm uuid>_disk:id=<ceph user>

Community best practice notes

Resources