Juniper router upgrade

Revision as of 16:00, 6 September 2022

Known issues

  • Junos 21.4R2-Sx doesn't work with (at least) the MX204 (the FPC doesn't come online)
  • Junos 21.2R2-Sx has an incompatibility with older Junos (at least 17.x) that prevents VRRP adjacency from establishing when an MD5 key is used
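
The two known issues above can be turned into a quick pre-flight check before picking an image. The following is only a minimal sketch: the `KNOWN_ISSUES` table, the regexes, and the `known_issues` helper are hypothetical names made up for illustration, and the MX204 entry is narrower than reality since the issue affects "at least" that model.

```python
import re

# Known-bad releases taken from the list above. The table layout and the
# regexes are illustrative assumptions, not an official format.
KNOWN_ISSUES = [
    # (version pattern, affected model or None for any model, note)
    (re.compile(r"^21\.4R2-S\d+"), "MX204", "FPC doesn't come online"),
    (re.compile(r"^21\.2R2-S\d+"), None,
     "VRRP with MD5 key fails against older (17.x) Junos peers"),
]


def known_issues(version, model):
    """Return the known-issue notes that apply to this version/model pair."""
    notes = []
    for pattern, bad_model, note in KNOWN_ISSUES:
        if pattern.match(version) and bad_model in (None, model):
            notes.append(note)
    return notes
```

For example, `known_issues("21.4R2-S1", "MX204")` returns the FPC note, while a release that is not listed returns an empty list.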

Preparation

  1. Download the proper image to apt1001:/srv/junos/

All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>

  1. Make room for the image
    • request system storage cleanup
  2. Save rescue config (just in case)
    • request system configuration rescue save
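  3. Copy image
    • file copy "https://apt.wikimedia.org/junos/$filename.tgz" /var/tmp/ routing-instance mgmt_junos
    • As a data point, this takes ~1h15 from eqiad to ulsfo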
  4. Check checksum
    • file checksum md5 /var/tmp/$filename.tgz
    • Compare with checksum on Juniper's website
  5. Validate new image against existing config
    • request vmhost software validate /var/tmp/$filename.tgz
    • Only on Junos >=18.4 (i.e. after the upgrade to 21+)
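
The checksum comparison in step 4 can also be done on the image before it leaves the download host. This is a minimal sketch assuming Python is available there; `md5_of_file` and `checksum_matches` are hypothetical helper names, and the published checksum still has to be copied by hand from Juniper's website.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, published_md5):
    """Compare the local digest with the published one, ignoring case and whitespace."""
    return md5_of_file(path) == published_md5.strip().lower()
```

The value printed by `file checksum md5` on the router should be the same digest, so a mismatch on either side points at a corrupted download or copy.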

Upgrade

  1. Check that the console port(s) are working
  2. Depool site (optional)
  3. Drain traffic away from router
    1. Apply GRACEFUL_SHUTDOWN - https://phabricator.wikimedia.org/T211728
    2. Disable the peers
      • deactivate protocols bgp group Transit4
      • deactivate protocols bgp group Transit6
      • deactivate protocols bgp group IX4
      • deactivate protocols bgp group IX6
      • Adjust OSPF metrics
      • If eqiad/codfw, drain the pfw3 link by increasing the MED value on both sides
  4. Ensure router is not VRRP master
    • show vrrp summary
    • set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70    
    • set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
  5. Downtime host in Icinga and Alertmanager
    • sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
    • The host list needs to match the Icinga "hosts"; cr3-ulsfo will match in Alertmanager as well.
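  6. Double check that the site has been fully drained of traffic before proceeding:
    • Check no traffic to LVS at site: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs
    • Check Cloudflare DDoS tunnels are disabled for site: sudo cookbook sre.network.cf status all
    • Check LibreNMS graphs for router in question: https://librenms.wikimedia.org/devices/type=network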
  7. Disable BGP sessions to LVS/PyBal load-balancers
    • deactivate protocols bgp group PyBal
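
The BGP groups deactivated in the steps above have to be re-activated during cleanup, so it can help to generate both command lists from a single source. This is a rough sketch only: the group names come from this page, while `PEER_GROUPS`, `LVS_GROUP`, `drain_commands`, and `rollback_commands` are hypothetical names, and the output is meant to be reviewed and pasted in configuration mode by hand.

```python
# BGP groups named in the drain steps above.
PEER_GROUPS = ["Transit4", "Transit6", "IX4", "IX6"]
LVS_GROUP = "PyBal"


def drain_commands(groups):
    """Configuration-mode commands that deactivate each BGP group."""
    return [f"deactivate protocols bgp group {group}" for group in groups]


def rollback_commands(groups):
    """Configuration-mode commands that re-activate the same groups."""
    return [f"activate protocols bgp group {group}" for group in groups]


if __name__ == "__main__":
    # Peers are drained first; the PyBal group only after the site is confirmed drained.
    for line in drain_commands(PEER_GROUPS) + drain_commands([LVS_GROUP]):
        print(line)
```

Keeping the rollback list derived from the same group names avoids forgetting a group when undoing the drain.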

If multi RE:

  1. Remove graceful-switchover
    • deactivate chassis redundancy graceful-switchover
    • request system configuration rescue save (so that the rescue config does not contain the above statement)
  2. Install image on backup RE
    • request vmhost software add /var/tmp/$filename.tgz re1
    • This needs no-validate when upgrading to 21+
  3. Reboot RE1
    • request vmhost reboot re1
  4. Once back up (show chassis routing-engine), perform RE switchover (impactful)
    • request chassis routing-engine master switch
  5. Once done, repeat previous 3 steps for re0
  6. Rollback "Remove graceful-switchover"

If single RE:

  1. Install image on RE
    • request vmhost software add /var/tmp/$filename.tgz
    • This needs no-validate when upgrading to 21+
  2. Reboot router
    • request vmhost reboot
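
The install and reboot commands for the multi RE and single RE cases differ only by the routing-engine suffix and the optional no-validate keyword. The sketch below builds them from those two inputs; `install_commands` is a hypothetical helper, and the position of `no-validate` on the command line is an assumption that should be checked with `?` completion on the device.

```python
def install_commands(image_path, routing_engine=None, no_validate=False):
    """Build the 'software add' and 'reboot' commands for one routing engine.

    routing_engine: None for a single RE router, or "re0"/"re1" on dual RE.
    no_validate: set to True when upgrading to 21+, as noted above.
    """
    add = f"request vmhost software add {image_path}"
    if no_validate:
        # Assumed keyword placement; not confirmed by this page.
        add += " no-validate"
    reboot = "request vmhost reboot"
    if routing_engine:
        add += f" {routing_engine}"
        reboot += f" {routing_engine}"
    return [add, reboot]
```

For a single RE router without validation changes, `install_commands("/var/tmp/$filename.tgz")` yields exactly the two commands listed above.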

Both single and dual RE:

  1. Check if router is healthy
    • show log messages | last
    • show system alarms
    • show ospf(3) interface
    • show bgp summary
    • All green in Icinga and LibreNMS

Cleanup

  1. Clean up storage
    • request system storage cleanup
  2. Remove Icinga and LibreNMS downtimes
  3. Rollback "Drain traffic away from router"
  4. Rollback VRRP change if any
  5. Save rescue config (just in case)
    • request system configuration rescue save
  6. On vmhost devices, save the disk snapshot to the backup partition
    • request vmhost snapshot (for single RE devices)
    • request vmhost snapshot routing-engine both (for dual RE devices)