Netbox
Netbox is a "IP address management (IPAM) and data center infrastructure management (DCIM) tool".
At Wikimedia it is used as the DCIM and IPAM system, as well as being used as an integration point for switch and port management, DNS management, and similar operations.
History
- At Wikimedia it was evaluated in Phab:T170144 as a replacement for Racktables.
- In Phab:T199083 the actual migration between the systems took place
- task T266487 - Netbox 2.9 upgrade
- task T288515 - Netbox vs. Nautobot
- task T296452 - documents the large upgrade from 2.10 to 3.2 and the subsequent improvements it brought
Web UI
- https://netbox.wikimedia.org/
- Login using your LDAP/Wikitech credentials
- Currently you need to be in either the "ops" or "wmf" LDAP group to be able to log in (Hiera profile::netbox::cas_group_attribute_mapping)
API
- Endpoint
- From an internal host (e.g. everything BUT your laptop or WMCS), you MUST use netbox.discovery.wmnet
- REST API
- Create a token on https://netbox.wikimedia.org/user/api-tokens/, ideally read-only, ideally with an expiration date
- Doc: https://netbox.wikimedia.org/api/docs/
- Python library: https://github.com/netbox-community/pynetbox/
- Note that the REST API is quite slow, so make sure to optimize your queries (a minimal query example follows this list)
- GraphQL
- Soon
- Spicerack
- See https://doc.wikimedia.org/spicerack/master/api/spicerack.netbox.html for the Netbox support
- It's preferred to use the built-in wrapper functions rather than the pynetbox interface directly, as your cookbook might break if it isn't updated when Netbox introduces breaking changes
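A minimal sketch of a read-only query with curl, assuming a personal API token exported as NETBOX_TOKEN and an example device name (depending on the host's CA setup you may need to point curl at the internal CA bundle):
# Query the Netbox REST API from an internal host (read-only)
# NETBOX_TOKEN holds a personal API token created in the web UI
curl -s \
  -H "Authorization: Token ${NETBOX_TOKEN}" \
  -H "Accept: application/json" \
  "https://netbox.discovery.wmnet/api/dcim/devices/?name=example1001"
Keep queries narrow (filters such as ?name=, ?site= or ?limit=) since the API is slow.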
Staging
- It consists of a single bullseye VM (netbox-devXXXX) combining frontend and database
- Reachable on netbox-next.wikimedia.org and netbox-next.discovery.wmnet
- Behind caches, similarly to the prod infrastructure
- Its data comes from a manual dump of the production database
- Reach out to Infrastructure Foundations if you need a fresher database copy
- Be careful not to leak any of its data
- It is used to test Netbox upgrades, scripts, reports, etc
- This host is active in monitoring (with notifications disabled)
- As such, make sure that all alerts have cleared after your tests
Production infrastructure
The production Netbox infrastructure consists of 4 bullseye VMs (see all Netbox VMs):
- 2 active/passive frontends (netboxXXXX)
- Running a local Redis instance (it may soon move to a central Redis, see task T311385)
- 2 primary/replica PostgreSQL databases (netboxdbXXXX)
By default the active/primary servers are the eqiad ones.
The public endpoint is behind our CDN so the request flow is:
- CDN - (using the wildcard *.wikimedia.org as its TLS certificate)
- active frontends
- Apache (using cfssl for its TLS certificate)
- Django app (through uwsgi)
- Active database
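To check which datacenter is currently pooled for the endpoint, a quick sketch using confctl from a host where it is installed (for example a cluster management host); the exact output format may vary:
# Show the pooled state of the netbox discovery record for both datacenters
sudo confctl --object-type discovery select 'dnsdisc=netbox' get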
Monitoring
Icinga
See all Netbox related Icinga checks: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=netbox
In addition to the regular set of VM checks that are run on all servers, there are Icinga checks that only run on the active servers.
Controlled by the profile::netbox::db::primary and profile::netbox::active_server Hiera keys.
Frontends
Controlled by the profile::netbox::active_server Hiera key:
- Alerting for the Ganeti sync systemd timers (are they running correctly?) - see also Netbox#Ganeti sync
- Alerting for the Netbox reports (is there invalid data in Netbox?) - see also Netbox#Reports
- Alerting for the DNS export automation (are there Uncommitted DNS changes in Netbox?) - See also Netbox#DNS
Databases
The replica has a check for replication delay.
Prometheus
Setup task: https://phabricator.wikimedia.org/T243928
Global health overview (beta): https://grafana.wikimedia.org/d/DvXT6LCnk/
Failover
Frontends
Using confctl, pool the passive server and depool the previous active one.
confctl --object-type discovery select 'dnsdisc=netbox,name=codfw' set/pooled=true
confctl --object-type discovery select 'dnsdisc=netbox,name=eqiad' set/pooled=false
If the failover is going to last (e.g. longer than a server reboot), change the profile::netbox::active_server Hiera key to the backup server. This ensures that the cron/systemd timers as well as the Icinga checks run there.
Note that having the active frontend in a different datacenter than the primary database will result in Netbox being slower.
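To verify the result of a frontend failover, a small sketch run from an internal host; the discovery record only changes once its DNS TTL has expired:
# Resolve the discovery record and compare the returned IP with the frontends' addresses in Netbox
dig +short netbox.discovery.wmnet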
Databases
If the primary database server only needs a short downtime, it's recommended not to attempt a failover and instead accept having Netbox offline for a short amount of time.
There is currently no documented procedure for failing the database over, and even less for failing back to the former primary.
See also Postgres
Database
Restore
First of all, analyze the Netbox changelog to choose the best way to perform a restore.
The general options are:
- Manually (or via the API) replay the actions listed in the changelog in reverse order. The changelog entries don't have the full raw data, and some of them might show names instead of the IDs required by the API.
- Use the CSV dumps to recover data. Restoring them is not trivial either, because some of the Netbox exports are not immediately re-importable due to reference resolution.
- Restore a database dump. This ensures consistency at a given point in time, and could even be used to perform a partial restore using pg_restore.
To restore files from Bacula back to the client, use bconsole on helium and refer to Bacula#Restore_(aka_Panic_mode) for detailed steps.
PostgreSQL
Dumps backups
On the database servers, a puppetized cron job (class postgresql::backup) automatically creates a daily dump file of all local Postgres databases (pg_dumpall) and stores it in /srv/postgres-backup.
This path is then backed up by Bacula; see Bacula#Adding a new client.
For more details, the related subtask to setup backups was Phab:T190184.
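To see which dumps are available, a trivial sketch to run on each database host:
# List the most recent PostgreSQL dumps kept by the puppetized job
ls -lht /srv/postgres-backup/ | head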
Restore the DB dump
- Check the dump list on both DB hosts in /srv/postgres-backup; a more recent one might be on the other host.
- If needed, copy the dump to the master host (as of June 2022, netboxdb1002).
- Unzip the chosen dump file.
- Take a one-off backup right before starting the restore with (the .bak suffix is important to not be auto-evicted):
sudo -u postgres /usr/bin/pg_dumpall | /bin/gzip > /srv/postgres-backup/${USER}-DESCRIPTION.psql-all-dbs-$(date +\%Y\%m\%d).sql.gz
- Connect to the DB, list and drop the Netbox database:
psql
postgres=# \l
...
postgres=# DROP DATABASE netbox;
DROP DATABASE
postgres=#
- Restore the DB with:
$ gunzip < ${DUMP_FILE} | sudo -u postgres /usr/bin/psql
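A quick post-restore sanity check (a sketch; the table name is taken from the sanitization SQL further below, and the count should roughly match the number of devices shown in the web UI):
# Confirm the restored database is populated
sudo -u postgres psql -d netbox -c "SELECT COUNT(*) FROM dcim_device;"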
Flush caches after a restore
After a restore, the Netbox caches must be flushed to ensure consistency and to make the changes visible.
To perform the flush, SSH into the active Netbox host (as of June 2022, netbox1002) and execute:
cd /srv/deployment/netbox
. venv/bin/activate # Activate the Netbox Python virtualenv
cd deploy/src/netbox
python manage.py invalidate all # Perform the flush
Sanitizing a database dump
The Netbox database contains a few bits of sensitive information, and if it is going to be used for testing purposes in WMCS it should be sanitized first.
- Create a copy of the main database: createdb netbox-sanitize && pg_dump netbox | psql netbox-sanitize
- Run the below SQL code on the netbox-sanitize database.
- Dump and drop the database: pg_dump netbox-sanitize > netbox-sanitized.sql; dropdb netbox-sanitize (a wrapper sketch follows the SQL below)
THE BELOW COMMANDS ARE OUTDATED AND MIGHT NOT COVER EVERYTHING THAT NEEDS TO BE SANITIZED
-- truncate secrets
TRUNCATE secrets_secret CASCADE;
TRUNCATE secrets_sessionkey CASCADE;
TRUNCATE secrets_userkey CASCADE;
-- sanitize dcim_serial
UPDATE dcim_device SET serial = concat('SERIAL', id::TEXT);
-- truncate user table
TRUNCATE auth_user CASCADE;
-- sanitize dcim_interface.mac_address
UPDATE dcim_interface SET mac_address = CONCAT(
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0')) :: macaddr;
-- sanitize circuits_circuit.cid
UPDATE circuits_circuit SET cid = concat('CIRCUIT', id::TEXT);
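Tying the steps above together, a rough wrapper sketch; it assumes the SQL above has been saved to a hypothetical file named sanitize.sql and may need to be run as the postgres user (and remember that the SQL itself is outdated):
# 1. Copy the main database into a scratch database
createdb netbox-sanitize && pg_dump netbox | psql netbox-sanitize
# 2. Apply the (outdated) sanitization SQL saved locally as sanitize.sql
psql netbox-sanitize < sanitize.sql
# 3. Dump the sanitized copy and drop the scratch database
pg_dump netbox-sanitize > netbox-sanitized.sql
dropdb netbox-sanitize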
CSV
This might be decommissioned in the short to medium term, see https://phabricator.wikimedia.org/T310615
CSV backups
On the active frontend, each hour at :37, a script dumps the most pertinent tables to a timestamped target directory under /srv/netbox-dumps.
Sixteen of these dumps are retained for backup purposes; the rotation is performed by the script /srv/deployment/netbox-extras/tools/rotatedump.
This script only rotates directories matching the pattern 20*, so if a manually retained dump is desired, one can simply run the script (su netbox -c /srv/deployment/netbox-extras/tools/rotatedump) and rename the resulting dump outside of this pattern, perhaps with a descriptive prefix (see the sketch below).
/srv/netbox-dumps is backed up with Bacula (see Bacula#Adding a new client) for historical copies.
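For example, a manually retained dump might be taken like this (a sketch to run as root on the active frontend; the destination name is only an illustration):
# Produce a dump, then rename the newest one outside the 20* pattern so the rotation keeps it
su netbox -c /srv/deployment/netbox-extras/tools/rotatedump
latest="$(ls -d /srv/netbox-dumps/20* | sort | tail -n 1)"
mv "${latest}" "/srv/netbox-dumps/pre-upgrade-$(date +%Y%m%d)"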
CSV Restores
The normal CSV import feature of Netbox could be used to manually re-import data (after selecting what needs to be re-imported), but it has not been tested with the CSV backups.
Netbox Extras
CustomScripts, Reports and other associated tools for Netbox are collected in the netbox-extras repository at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/. This repository is deployed to the Netbox frontends under /srv/deployment/netbox-extras. It is not automatically deployed on merge, and must be manually updated with `git pull` on both frontends after merging. This can be most comfortably accomplished with Cumin on a Cumin host:
sudo cumin 'A:netbox-all' 'cd /srv/deployment/netbox-extras; git pull --ff-only'
This will have the dual purpose of resetting any local changes and updating the deployment to the latest version.
Netbox features
Custom Links
WebUI (defined there): https://netbox.wikimedia.org/extras/custom-links/
Doc: https://docs.netbox.dev/en/stable/models/extras/customlink/
Netbox allows setting up custom links to other websites, using Jinja2 templating for both the displayed name and the actual link, which allows for quite some flexibility. The current setup (as of June 2022) has the following links:
- Grafana (for all physical devices and VMs)
- Icinga (for all physical devices and VMs)
- Debmonitor (for all physical devices and VMs)
- Procurement Ticket (only for physical devices that have a ticket that matches either Phabricator or RT)
- Hardware config (for Dell and HP physical devices, pointing to the manufacturer page for warranty information based on their serial number)
- LibreNMS (for Juniper, Opengear and Sentry devices)
- Puppetboard (for all physical devices and VMs)
Reports
WebUI (reports results): https://netbox.wikimedia.org/extras/reports/
Doc: https://docs.netbox.dev/en/stable/customization/reports/
Netbox reports are a way of validating data within Netbox.
In summary, reports produce a series of log lines that indicate some status connected to a machine, and may be either error, warning, or success. Informational log lines with no particular disposition may also be emitted.
Note: it is better to prevent invalid data entry in the first place when possible (e.g. regex, custom validation).
Report Conventions
Because of limitations to the UI for Netbox reports, certain conventions have emerged:
- Reports should emit one log_error line for each failed item. If the item doesn't exist as a Netbox object, None may be passed in place of the first argument.
- If any log_warning lines are produced, they should be grouped after the loop which produces the log_error lines.
- Reports should emit one log_success which contains a summary of successes, as the last log in the report.
which contains a summary of successes, as the last log in the report. - Log messages referring to a single object should be formatted like <verb/condition> <noun/subobject>[: <explanatory extra information>]. Examples:
- malformed asset tag: WNF1212
- missing purchase date
- Summary log messages should be formatted like <count> <verb/condition> <noun/subobject>
- If possible, follow with a suggestion on how to fix it (for example, what the proper values are).
Report Alert
Most reports that alert do so because of data integrity mismatches caused by changes in the infrastructure; they act as a secondary check and are the responsibility of DC-ops.
Some (e.g. the network report) can have unforeseen consequences on the infrastructure (e.g. misconfigurations).
Report | Typical Responsibility | Alerts | Typical Error(s) | Note |
---|---|---|---|---|
Accounting | Faidon or DC-ops | ✅ | ||
Cables | DC-ops | ✅ | ||
Coherence | DC-Ops | ✅ | ||
LibreNMS | DC-ops or Netops | ✅ | You can ignore a LibreNMS device by setting its "ignore alert" flag in LibreNMS | |
Management | DC-ops | ✅ | ||
PuppetDB | Whoever changed / reimaged host | ✅ | <device> missing from PuppetDB or <device> missing from Netbox. These occur because the data in PuppetDB does not match the data in Netbox, typically related to missing devices or unexpected devices. Generally these errors fix themselves once the reimage is complete, but the Netbox record for the host may need to be updated for decommissioning and similar operations. | |
Network | DC-ops or Netops | ✅ |
Custom Scripts
WebUI: https://netbox.wikimedia.org/extras/scripts/
Doc: https://docs.netbox.dev/en/stable/customization/custom-scripts/
While Netbox reports are read-only and have a fixed output format, CustomScripts can both write to the database and provide custom output.
In our infrastructure they're used for these two purposes:
- Abstract and automate data entry,
- Interface_automation
- Offline_device
- Replace_device
- Format and expose data in a way that can be consumed by external tools,
- Capirca
- Getstats
- Hiera_export
The above scripts should probably be moved to the plugin feature.
Warning: when running a script that writes to the database, run it a first time with "Commit changes" unchecked.
Review the changes that would happen. Then run it a second time with "Commit changes" checked to make the changes permanent.
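The same dry-run-first habit applies when running CustomScripts from the command line (see also the Troubleshooting section): without --commit the run should behave as a dry run and database changes are rolled back. A sketch, reusing the example device name from Troubleshooting:
cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox
# Dry run first: without --commit the changes are reported but not kept
python3 manage.py runscript interface_automation.ImportPuppetDB --data '{"device": "ml-cache1003"}'
# Once the output looks right, run again with --commit to make the changes permanent
python3 manage.py runscript interface_automation.ImportPuppetDB --data '{"device": "ml-cache1003"}' --commit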
Extra Errors, Notes and Procedures
Would like to remove interface
This error is produced in the Interface Automation script when cleaning up old interfaces during an import.
Interfaces are considered for removal if they don't appear in the list provided by the data source (generally speaking, PuppetDB); they are then checked if there is an IP address or a cable associated with the interface. If there is one of these the interface is left in place so as to not lose data. It is considered a bug if this happens, so if you see this error in an output feel free to open a ticket against #netbox in Phabricator.
Error removing interface after speed change
This error is produced in the Interface Automation script when cleaning up old interfaces when provisioning a server's network attributes.
Specifically for modular interfaces on Juniper devices, the interface name is determined by the speed of the interface, and the port number. If an old interface exists, say xe-1/0/8, on a modular port and we replace the 10G SFP+ with a 25G SFP28, the name of the interface will change to et-1/0/8. JunOS cannot have both defined so the import script will remove the old (xe-1/0/8) interface in Netbox before adding the new one.
This error will get thrown if the old interface still has a cable connected, or an IP address assigned. This shouldn't normally happen, but if it does the old interface should be manually removed, and cables/IPs cleaned up as necessary. Feel free to ping netops members on IRC if there is any confusion, or open a Phabricator task.
Custom Fields
WebUI (defined there): https://netbox.wikimedia.org/extras/custom-fields/
Doc: https://docs.netbox.dev/en/stable/customization/custom-fields/
Please open a task if you need a new Custom Field.
nbshell
Not a user-facing feature, but an admin feature, useful for troubleshooting.
Doc: https://docs.netbox.dev/en/stable/administration/netbox-shell/
Warning: this has the power to break things very quickly if not used carefully.
The below command will drop you into a Python shell with access to all the Netbox models, similar to what the CustomScripts use.
sudo -i
cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox && python manage.py nbshell
Tags/ConfigContext
Tags are a slippery slope as they are global and don't have a built-in mechanism to prevent typos. ConfigContexts are much more difficult to audit than fields. We've so far managed to not need them. Therefore,
Warning: they MUST NOT be used in our environment.
Exports
A set of resources that export Netbox data in various formats.
DNS
A git repository of DNS zonefile snippets generated from Netbox data and exported via HTTPS in read-only mode, to be consumed by the DNS#Authoritative_nameservers and the Continuous Integration tests run for the operations/dns Gerrit repository.
The repository is available via:
$ git clone https://netbox-exports.wikimedia.org/dns.git
To update the repository, see DNS/Netbox#Update_generated_records.
The repository is also mirrored in Phabricator: https://phabricator.wikimedia.org/source/netbox-exported-dns/ though it may not be immediately up-to-date.
Puppet
https://phabricator.wikimedia.org/T229397
Prometheus
The "GetStats" CustomScripts exports Prometheus metrics about devices statistics.
Which is used to generate https://grafana.wikimedia.org/d/ppq_8SRMk/netbox-device-statistic-breakdown?orgId=1
This might get replaced with a plugin, see https://phabricator.wikimedia.org/T311052
Imports
Ganeti sync
Refactoring and improvements (e.g. cluster_group support) are tracked in T262446.
For each entry under the profile::netbox::ganeti_sync_profiles Hiera key, Puppet creates a systemd timer on the active server to run ganeti-netbox-sync.py with the matching parameters.
External scripts
Scripts and tools not previously listed that interact with Netbox (such as Homer, Spicerack and the cookbooks below), and thus need to be checked for compatibility after significant Netbox changes (e.g. upgrades).
Cookbooks with direct pynetbox calls:
- cookbooks/sre/pdus/uptime.py
- cookbooks/sre/pdus/rotate-snmp.py
- cookbooks/sre/network/configure-switch-interfaces.py
- cookbooks/sre/hosts/dhcp.py
- cookbooks/sre/pdus/rotate-password.py
- cookbooks/sre/hosts/provision.py
- cookbooks/sre/hosts/reimage.py
- cookbooks/sre/pdus/reboot-and-wait.py
Upgrading Netbox
Note: this page may be outdated or contain incorrect details; it will be cleaned up/updated with the next minor Netbox upgrade.
Upgrading Netbox is usually an extremely simple procedure as within patch-level releases they maintain a reasonable level of compatibility in the APIs that we use. When it comes to upgrading across minor versions, breaking changes may have occurred and careful reading of changelogs is needed, as well as testing of the scripts which consume and manipulate data in Netbox.
Overview
- Update WMF version of netbox
- Review changelog and note any changes that may interact with our integrations or deployment
- Update deploy repository
- Deploy to netbox-dev2001
- Simple tests
- Review UI and note any differences to call out during announcement
- (if API changes or minor version bump) Complex tests
- (if breaking changes) Port scripts
- (if breaking changes) Test scripts
- Deploy to production
Update the WMF Version of Netbox
We maintain a minimal fork of Netbox to change a few small things that are required for our configuration. This takes the form of a few commits in the WMF version of the Netbox repository:
Commit hash | Description
---|---
03e3b07f5dcf5538fb0b90641a4e3d043684bb37 | Switches swagger into Non-Public mode
824fe21597c251ce6e0667b97b258a23ff210949 | Adds a way to pass settings directly into the configuration.py that we use for configuring the Swift storage backend
98f32d988d7e1bc146298025900ba54969e8d3c1 | Add CAS authentication support
Generally these are cherry-picked into a copy of the upstream branch, and then pushed to main on https://gerrit.wikimedia.org/g/operations/software/netbox/, with the following procedure:
- In a working copy of operations/software/netbox, add an upstream remote
- Pull the upstream remote's commits into the working copy
- Cherry-pick the above hashes into the working copy's HEAD
$ git cherry-pick -x 03e3b07f5dcf5538fb0b90641a4e3d043684bb37
Auto-merging netbox/netbox/urls.py
[wmf-dev 6467d1651] Switch swagger to non-public mode
Author: Cas Rusnov <crusnov@wikimedia.org>
Date: Thu Aug 8 13:48:34 2019 -0700
1 file changed, 1 insertion(+), 1 deletion(-)
$ git cherry-pick -x 824fe21597c251ce6e0667b97b258a23ff210949
Auto-merging netbox/netbox/settings.py
[wmf-dev a653243bc] Add a passthrough configuration system
Author: Cas Rusnov <crusnov@wikimedia.org>
Date: Tue Jul 30 15:22:28 2019 -0700
1 file changed, 8 insertions(+)
$ git cherry-pick -x 98f32d988d7e1bc146298025900ba54969e8d3c1
Auto-merging requirements.txt
CONFLICT (content): Merge conflict in requirements.txt
Auto-merging netbox/users/views.py
Auto-merging netbox/netbox/settings.py
CONFLICT (content): Merge conflict in netbox/netbox/settings.py
Auto-merging .gitignore
error: could not apply 98f32d988... Add CAS authentication support
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
- Remove the master branch: git branch -D master
- Create a new master branch: git branch master
- Create a tag for the version number; it is our standard to append -wmf to the version number, so for Netbox version 2.10.4 we would run git tag v2.10.4-wmf
- Force push master to gerrit:
- Go to the administrative panel of the netbox repository https://gerrit.wikimedia.org/r/admin/repos/operations/software/netbox, click Access, Edit, select Reference: refs/heads/master, set it to allow pushing with or without force, and Save
- git push --set-upstream origin master --force --tags
- Return to the administrative panel above, switch access back to Allow pushing (but not force pushing), and save.
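Put together, updating our fork might look roughly like this; a condensed sketch of the steps above in which the upstream remote name and the version numbers are examples:
# In a working copy of operations/software/netbox
git remote add upstream https://github.com/netbox-community/netbox   # one-time setup
git fetch upstream --tags
git checkout -B master v2.10.4   # recreate master from the upstream release tag
git cherry-pick -x 03e3b07f5dcf5538fb0b90641a4e3d043684bb37
git cherry-pick -x 824fe21597c251ce6e0667b97b258a23ff210949
git cherry-pick -x 98f32d988d7e1bc146298025900ba54969e8d3c1   # may need manual conflict resolution, as shown above
git tag v2.10.4-wmf
git push --set-upstream origin master --force --tags   # only after temporarily allowing force pushes in Gerrit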
Build deploy repository
Netbox is deployed using Scap, and thus has a deployment repository in which the collected artifacts (the virtual environment and associated libraries) are used to deploy it. This is updated separately from our branch of Netbox with the following procedure, which uses the operations/software/netbox-deploy repository https://gerrit.wikimedia.org/g/operations/software/netbox-deploy/
- In a working copy of operations/software/netbox-deploy, update the src/ subdirectory, which is a submodule of this repository pointing at the above operations/software/netbox; to do this, git pull in that directory and then check out the tag of the version that is being updated to, for example git checkout v2.10.4-wmf.
- Update the .gitmodules file with the correct version as checked out above.
- Build the artifacts by doing a make clean and then make. This uses Docker to collect all of the required libraries as specified in the various requirements.txt files. It creates the artifacts as artifacts/artifacts.buster.tar.gz and frozen-requirements.txt.
- Commit the changes to the repository and submit for review; be sure the following files have changes: frozen-requirements.txt, artifacts/artifacts.buster.tar.gz, .gitmodules, and src (a condensed sketch follows below).
Once the repository is reviewed and merged via gerrit, it is ready to deploy!
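Condensed, the build might look like this (a sketch; the working-copy path and the version tag are examples):
# In a working copy of operations/software/netbox-deploy
cd src && git pull && git checkout v2.10.4-wmf && cd ..
# Update .gitmodules to reference the new tag, then rebuild the artifacts with Docker
make clean && make
git add .gitmodules src frozen-requirements.txt artifacts/artifacts.buster.tar.gz
git commit -m "Update Netbox to v2.10.4-wmf"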
Deploy to Testing Server
The next phase, even for simple upgrades is to deploy to netbox-dev2001.wikimedia.org for basic testing prior to deploying to production. This is done via Scap on a deploy server.
- Login to a deploy server such as deploy1001.eqiad.wmnet
- Go to /srv/deployment/netbox/deploy; this is a check out of the -deploy repository from above.
- Pull to the latest version, and update the submodule in src by pulling and checking out the tag that is going to be deployed.
- Deploy with scap to netbox-dev2001, with bug reference in hand: scap deploy -l netbox-dev2001.wikimedia.org 'Deploying Netbox v2.10.4-wmf to netbox-dev Tbug'
- This process should go smoothly and leave the target machine ready to test.
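A condensed sketch of those steps (the version tag and task number are placeholders from the examples above):
# On a deploy server, e.g. deploy1001.eqiad.wmnet
cd /srv/deployment/netbox/deploy
git pull
cd src && git fetch && git checkout v2.10.4-wmf && cd ..
scap deploy -l netbox-dev2001.wikimedia.org 'Deploying Netbox v2.10.4-wmf to netbox-dev Tbug'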
It may be necessary to deploy a new production database dump to netbox-dev2001's database to ensure parity with production.
It may be necessary to make changes to the Puppet side of the deployment due to changes in Netbox's requirements, which will generally be called out in the upstream changelog. In addition, Puppet is used to ship the configuration files, so any new or different requirements in those may need to be accounted for. Additionally, occasionally new or different tools are invoked in upstream's upgrade.sh; we use netbox-deploy/scap/checks/netbox_setup.sh to perform these tasks and it may need to be updated.
Simple Testing
On https://netbox-next.wikimedia.org we can perform some basic tests:
- Test login
- Test each Report (Other/Reports menu) and compare to production
- Test each CustomScript (Other/Scripts menu) and compare to production
- Look at some samples of Devices (Devices/Devices menu) and compare to production
- Look at some samples of IPAM (IPAM/IP Addresses menu) and compare to production
- Look at a cable trace and compare to production (go to an active Device, Interfaces tab, click the 'Trace' button next to a connected interface)
Complex Testing
In the event that a more breaking update is being made, more extensive testing, and potentially porting to account for API changes, might need to be done. Note in the above Simple Testing any of the reports or CustomScripts that produce errors due to API changes (an errored state indicates some Python error occurred, which most often is the result of API changes). If the outputs of the reports or scripts vary substantially from the production versions, for example unexpected failures or warnings, this may also indicate that porting is required.
In addition to the above, the following things need to be tested:
- DNS generation. This should produce no diff if the database and DNS repository on netbox-dev2001 are updated to production contents.
- CSV dumps. Should produce results similar to production. Note that there may be changes to the dumped tables required if models have changed.
- Script proxy on getstats.GetDeviceStatistics. This should produce results similar to production.
- The ganeti sync script. This should be a no-op on recent production data. If additional tests are desired, removing Virtual devices and rerunning should recreate them. Note that an existing bug in the sync script produces ignorable errors when trying to remove ganeti nodes that are no longer in the network.
- Homer. This should produce no diff if the database is updated to production contents.
Porting and Testing
This process doesn't generalize, but over time there are drifts in the internal and external APIs used by our integrations, and some porting work may be required to operate against them as they change. Generally these changes are minor such as changing method names or adding arguments, and other times they are rather more complicated such as splitting Virtual device interfaces from non-virtual device interfaces. In general once porting is thought to be complete, the changes should be deployed to netbox-dev2001 and a full run of testing should be done to verify that the changes made fix the problems that turned up in initial testing, including attempting to go down avenues of execution that may not normally be hit.
Any or all of the items tested in Simple and Complex testing may need porting depending on which internal or external APIs have changed.
Deploy to Production
After a final run-through of any problem areas exposed in the above testing, and once fixes are deployed to netbox-dev2001, it is finally time to deploy the new version to production, with the following procedure:
- Announce that the release will be occurring on #wikimedia-dcops and #mediawiki_security and, if necessary, coordinate a time when integration tools or DC-ops work will not be interrupted.
- Merge any outstanding changes to
netbox-extras
orhomer
repositories (if necessary). - On
netbox-db2001
, perform a manual dump of the database. - Deploy netbox-extras to production using cumin, as in #Netbox_Extras
- Announce on IRC that a deploy is happening, on #wikimedia-operations
!log Deploying Netbox v2.10.4-wmf to production Tbug
- Deploy to production:
- Login to a deploy server such as deploy1001.eqiad.wmnet
- Go to /srv/deployment/netbox/deploy; this is a check out of the -deploy repository from above.
- Pull to the latest version, and update the submodule in src by pulling and checking out the tag that is going to be deployed.
- Deploy with scap with bug reference in hand:
scap deploy 'Deploying Netbox v2.10.4-wmf to production Tbug'
- This process should go smoothly for netbox1001, and prompt for continuation. Allow it to continue to deploy to 2001 and -dev2001.
- Announce on IRC that the deploy is complete, on #wikimedia-operations
!log Finished deploying Netbox v2.10.4-wmf to production Tbug
- Perform simple and Complex testing as above, and in general make sure everything is as expected.
Checklists
Here are cut-and-pastable checklists for tickets for doing this upgrade process. They should be used in a ticket titled "Update Netbox to vN.M.X-wmf" tagged with sre-tools and netbox.
Simple upgrade
Use when patch level updates or a review of the changelog shows nothing that should break things.
[] Update netbox repository + deploy repository
[] Upgrade -dev2001
[] Rerun reports
[] Try a PuppetDB import for an existing host
[] Check diffs in DNS generation
[] Coordinate time with DCops and SRE for release
[] Dump a pre-upgrade copy of database
[] Release to production
[] Perform simple tests
Complex upgrade
Use with any update that may cause breaking changes to the API, or that the simple testing indicates may have extra work involved.
[] Review upgrade.sh
[] Examine change log for any major changes
[] Update netbox repository and deploy repository
[] Look around the UI for any changes, rearrangements or process changes
[] Upgrade -dev2001
[] Check and make any necessary changes to reports:
[] accounting.py
[] cables.py
[] coherence.py
[] librenms.py
[] management.py
[] puppetdb.py
[] Check and make any necessary changes to scripts:
[] getstats.py
[] interface_automation.py
[] offline_device.py
[] Check DNS generation, and review diffs
[] Make any necessary changes to generate_dns_snippets.py
[] Check custom_script_proxy.py
[] Execute CSV dumps and examine dumps for any anomalies
[] Update dumpbackup.py for any model changes, and any issues
[] Execute Ganeti sync against all sites
[] Make any necessary changes to ganeti-netbox-sync.py
[] Check and make any necessary changes to Homer
[] Coordinate time with DCops and SRE for release
[] Dump a pre-upgrade copy of database
[] Release to production
[] Perform simple tests
[] Recheck DNS generation and examine diffs
Troubleshooting
It's possible to run scripts through the command line, for example:
python3 manage.py runscript interface_automation.ImportPuppetDB --data '{"device": "ml-cache1003"}' --commit
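Reports can likewise be run from the command line with Netbox's runreport management command (assuming it is available in the deployed version), using the same virtualenv activation as for nbshell:
cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox
# Run a single report module from the CLI, e.g. the PuppetDB report
python3 manage.py runreport puppetdb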
Future improvements
Phabricator project - https://phabricator.wikimedia.org/tag/netbox/
Improve our infrastructure modeling
- Add license keys as inventory items - task T311008
- Make more extensive use of Netbox custom fields - task T305126
- Represent sub-interface and bridge device associations in Netbox - task T296832
- Netbox: use Custom Model Validation - task T310590
- Netbox: use Provider Networks - task T310591
- Netbox: investigate custom status - task T310594
- Netbox: use FHRP Groups feature - task T311218
- Move AS allocations to Netbox - task T310744
- netbox network report improvement - task T310299
- Import row information into Netbox for Ganeti instances - task T262446
Improve automation and reduce tech debt
- Netbox: investigate GraphQL API - task T310577
- Netbox: use Journaling feature - task T310583
- Netbox: basic change rollback - task T310589
- Netbox: drop profile::netbox::active_server parameter - task T309034
- Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - task T311052
- Upgrade pynetbox - task T310745
- Netbox: get rid of WMF Production Patches - task T310717
- Netbox: replace CSV dump with more frequent DB dumps task T310615