You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Homer"

imported>CDanis
imported>Ayounsi
(25 intermediate revisions by 4 users not shown)
Line 50: Line 50:
* Pull the latest changes: <code>git pull</code>
* Pull the latest changes: <code>git pull</code>
* Verify that in the <code>src/</code> directory we're at the correct commit (check also with <code>git status</code>)
* Verify that in the <code>src/</code> directory we're at the correct commit (check also with <code>git status</code>)
* Deploy the new release: <code>scap deploy --verbose "Homer release v... - T..."</code>
* Deploy the new release, it currently needs two steps:
**For the buster host (as of May 2021 <code>cumin1001</code>): <code>scap deploy --verbose "Homer release v... - T..."</code>
**For the bullseye hosts (as of May 2021 <code>cumin2002</code>): move to one of the cumin hosts and run: <code>sudo cookbook sre.deploy.python-code -r 'Release vX.Y.Z' homer 'cumin2002*'</code>


=== Daily diffs (not ready yet) ===
=== Daily diffs ===
A cron job will run Homer every day to compare the live network configuration with our intended state. Any discrepancies will be emailed to someone (ideally a list) to be fixed.
A cron job runs Homer every 12h (24h per cumin host) to compare the live network configuration with our intended state. Any discrepancy is emailed to the rancid-core alias.


== Usage 🚀 ==
== Usage 🚀 ==
Line 63: Line 65:
Manually edit then '''commit''' the files on ssh://cumin1001.eqiad.wmnet:/srv/homer/private .
Manually edit then '''commit''' the files on ssh://cumin1001.eqiad.wmnet:/srv/homer/private .


git will sync them with the other cumin host. And will email a summary of the changes to Riccardo (TODO: change it to SREs).
git will sync them with the other cumin host. And will email a summary of the changes to SREs.


Make sure to mirror all your changes on the mock-private repo: https://gerrit.wikimedia.org/g/operations/homer/mock-private
Make sure to mirror all your changes on the mock-private repo: https://gerrit.wikimedia.org/g/operations/homer/mock-private
Line 70: Line 72:


==== Editing the public repository ====
==== Editing the public repository ====
Similar to our other public repositories, send CRs to https://gerrit.wikimedia.org/g/operations/homer/public , try not to +2 your changes.
Similar to our other public repositories, send CRs to https://gerrit.wikimedia.org/g/operations/homer/public , try not to self-+2 your changes without other review.
 
Its documentation is published at https://doc.wikimedia.org/homer-public/master/.


==== Editing Netbox ====
==== Editing Netbox ====
Data is also pulled from Netbox, always make sure that Netbox accurate before using Homer.
Data is also pulled from Netbox, always make sure that Netbox is accurate before using Homer.


=== Running Homer from cumin hosts (recommended) ===
=== Running Homer from cumin hosts (recommended) ===
Line 91: Line 95:
* Clone the public repo: https://gerrit.wikimedia.org/g/operations/homer/public
* Clone the public repo: https://gerrit.wikimedia.org/g/operations/homer/public
* Clone private repo: [Ssh://cumin1001.eqiad.wmnet:/srv/homer/private ssh://cumin1001.eqiad.wmnet:/srv/homer/private]
* Clone private repo: [Ssh://cumin1001.eqiad.wmnet:/srv/homer/private ssh://cumin1001.eqiad.wmnet:/srv/homer/private]
* Clone deploy repo: https://gerrit.wikimedia.org/g/operations/software/homer/deploy
* Install Homer with either:
* Install Homer with either:
** <code>pip install homer</code>
** <code>pip install homer</code>
** <code>https://gerrit.wikimedia.org/g/operations/software/homer</code> + <code>python3 setup.py install</code> (if you live on the edge)
** <code>https://gerrit.wikimedia.org/g/operations/software/homer</code> + <code>python3 setup.py install</code> (if you live on the edge)
* Make the plugins included in the deploy repo available in the Python path:
** If homer's code is checked out, just create a symlink in the root of homer's checkout to the <code>homer_plugins/</code> directory in the deploy repo. If they are all checked out in the same root directory, from within homer's checkout run: <code>ln -s ../homer-deploy/plugins/ homer_plugins</code>
** If homer is installed via pip, find the <code>site_packages</code> directory where homer is installed, usually something like <code>venv/lib/python3.X/site-packages/</code> and add there a symlink to the plugins like <code>ln -s /PATH_TO_DEPLOY_REPO/plugins/ homer_plugins</code>.
* Create your configuration file to match https://doc.wikimedia.org/homer/master/configuration.html
* Create your configuration file to match https://doc.wikimedia.org/homer/master/configuration.html
** Including the plugin setup: <code>homer_plugins.wmf-netbox</code>
* Get familiar with the command line: https://doc.wikimedia.org/homer/master/homer.html
* Get familiar with the command line: https://doc.wikimedia.org/homer/master/homer.html


Line 104: Line 113:


=== YAML files ===
=== YAML files ===
TBD
We use json-schema to both prevent mistakes in the configuration, as well as document it.
 
https://doc.wikimedia.org/homer-public/master/


=== Templates ===
=== Templates ===
It's ok to give up on indentation.
It's ok to give up on indentation.


== Network configuration coverage ==
== Capirca (ACL generation) ==
Task: https://phabricator.wikimedia.org/T273865
 
[https://github.com/google/capirca Capirca] is an actively maintained open source tool made by Google to generate multi-platform ACLs based on generic policy and definitions files.
 
=== How it works? ===
[[File:Capirca.png]]
 
# User edits relevant files (see below)
#*As well as runs https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/
# User runs Homer
# Homer pulls the hosts definitions from Netbox
# Homer executes Capirca for each relevant policy file (defined in <code>homer-{public|private}/config/{devices|roles}.yaml</code> )
# Capirca takes all the (hosts/services) definition files as input, as well as the policy files (while following the includes) and generates the firewall rules in the proper format
# Homer adds the previously generated file to the other parts of the generated config and pushes it to the device
 
=== Advantages ===
 
* IPv4 and IPv6 filters will be updated automatically as long as the hosts have both v4 and v6 records
* Limited blast radius if a mistake is made in a given .inc policy file
* Centralized services (ports) definitions in a text file
* Hosts definitions synced up from Netbox
* Same syntax for all platforms (Juniper and JuniperSRX in our case)
* Shading detection (eg. useless rules hidden behind a more generic one)
* Reduced operational complexity
* Easier to audit
 
=== Limitations ===
 
* Dependency on Netbox
* Increased setup complexity
*Doesn't distinguish between v4 and v6 prefix-lists, that means to leverage the auto-generation of both v4 and v6 we have to specify both prefix-lists<syntaxhighlight lang="diff">
[edit firewall family inet filter loopback4 term return-tcp from source-prefix-list]
        wikimedia4 { ... }
+        wikimedia6;
 
</syntaxhighlight>
**Use prefix-lists when the prefixes will be used in routing (eg. BGP) rules as well. So they're only defined once
*Can't have jinja2 applied to it, so filters like ping-offload need to stay out of Capirca
*Netbox definitions need to be manually updated by running https://netbox-next.wikimedia.org/extras/scripts/capirca.GetHosts/
*Upstream issues:
**https://github.com/google/capirca/issues/257
**https://github.com/google/capirca/issues/246
**https://github.com/google/capirca/issues/245
*
 
=== How to use it? ===
 
==== Update an existing ACL ====


=== CR ===
# Browse <code>homer-{public|private}/policies</code>
# Find the relevant <code>.pol</code> or <code>.inc</code> file (eg. <code>cr-analytics.pol</code>)
# Update it (use existing rules and guidelines as models)


==== Done ====
===== Guidelines =====
<syntaxhighlight lang=text>
<syntaxhighlight>
groups {}
term my-term {
apply-groups [ re0 re1 ];
  comment:: "T123456"  # Don't forget the quotes
system {}
  destination-address:: foo # from either static.net or Netbox
logical-systems {}
  destination-port:: bar  # From the services.svc file
services {}
  action:: deny
snmp {}
forwarding-options {}
protocols {
    ospf {}
    ospf3 {}
    lldp {}
}
}
policy-options {}
firewall {}
routing-instances {}
</syntaxhighlight>


==== TODO ====
term allow_rest {
<syntaxhighlight lang=text>
  action:: accept # All our platforms have a default deny
interfaces {}  # (Partial) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/547584
routing-options {}  # (Partial) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/547587
chassis {}  # (Partial) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/550389
protocols {
    router-advertisement {}
    bgp {}  # Out of scope (except for group Netflow, which is done)
    pim {}  # https://gerrit.wikimedia.org/r/c/operations/homer/public/+/549689
}
}
</syntaxhighlight>
</syntaxhighlight>


=== ASW ===
* <code>{source|destination}-port::</code> are defined in the <code>homer-public/definitions/services.svc</code> file, add yours if it's not already there. Ordered by port numbers.
* Most <code>{source|destination}-address::</code> are pulled from Netbox and grouped by their hostname prefix (eg. all <code>aqs*</code> hosts are under <code>aqs_group</code>)
** To update a group (eg. provisioning/deprogramming a host), run https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/ then run Homer.
**For network prefixes and special IPs (eg. VIPs), add them to <code>homer-public/definitions/static.net</code>.


==== Done ====
==== Add a new ACL (firewall filter) ====
<syntaxhighlight lang="text">
Most likely for a Netops.
system {}
 
snmp {}
# Create a <code>.pol</code> file in <code>homer-{public|private}/policies</code> (eg. <code>my-filter.pol</code>), see guidelines below
protocols {}
# Reference the above policy file in either:
routing-options {}
#* <code>homer-{public|private}/config/{devices|roles}.yaml</code>  (recommended)<syntaxhighlight lang="yaml">
virtual-chassis {}
capirca:
vlans {}
    - my-filter  # The policy file name without the extension
</syntaxhighlight>
</syntaxhighlight>
#* Another <code>.pol</code> file with <code>#include 'my-filter.pol'</code>
===== Guidelines =====
* Example juniper headers<syntaxhighlight>
header {
  comment:: "foobar"
  target:: juniper my-filter4 inet
  target:: juniper my-filter6 inet6
}
# juniper: platform (which final syntax to use)
# my-filter4: the juniper filter name that will be generated
# inet: IP family to target (IPv4 vs. IPv6)


==== TODO ====
<syntaxhighlight lang="text">
chassis {}  # https://gerrit.wikimedia.org/r/c/operations/homer/public/+/550389
interfaces {}
</syntaxhighlight>
</syntaxhighlight>
* Example SRX security policies headers<syntaxhighlight>
header {
    comment:: "Generated by Capirca"
    target:: srx from-zone production to-zone production address-book-global
}
# srx: platform
# from/to security-zones (they need to already exist)
# Use global address-book (default everywhere in our infra)
</syntaxhighlight>
* If you have to have different terms for v4 and v6 policies, put all the common policies in a <code>.inc</code> file, then include it before/after the specific term. For example:<syntaxhighlight>
header {
  target:: juniper border-in4 inet
}
term offload-ping {
    verbatim:: juniper "term offload-ping4 {"
    verbatim:: juniper "    filter offload-ping4;"
    verbatim:: juniper "}"
}
#include 'cr-border-in.inc'


=== MR ===
header {
 
  target:: juniper border-in6 inet6
==== Done ====
<syntaxhighlight lang="text">
groups {}
system {}
snmp {}
protocols {}
routing-options {}
policy-options {}
security {
    zones {}
    alg {}
    forwarding-options {}
    screen {}
}
}
#include 'cr-border-in.inc'
</syntaxhighlight>
</syntaxhighlight>
* If a specific Juniper syntax is not supported by Capirca, use the <code>verbatim::</code> keyword, that will be copied as-is.
=== Common errors ===
* '''<code>Error parsing cr: No such service, foo</code>'''
** There is a <code>{source|destination}-port:: foo</code> in <code>cr.pol</code> (or one of its child includes) not defined in <code>services.svc</code>.
* '''<code>Error parsing cr-analytics: UNDEFINED: puppetmaster</code>'''
** There is a <code>{source|destination}-address:: puppetmaster</code> in <code>cr-analytics.pol</code> (or one of its child includes) not defined in <code>static.net</code> or Netbox.
**See https://netbox.wikimedia.org/extras/scripts/#script.GetHosts "Last run" then "output" tab to see the Netbox generated definitions.
* <code>'''Error parsing cr:  ERROR on "udp" (type STRING, line 38, Next 'destination-port').'''  '''Error parsing cr-analytics:  ERROR on "T274951" (type STRING, line 312, Next 'destination-address')'''</code>
** Most common cause is forgetting the double colon <code>::</code> or forgetting the quotes around a comment.
** Note that it shows the next line, and the lines don't always match if there are includes.
*'''<code>Multiple definitions found for service:  git-ssh.eqiad</code>'''
**The service is defined twice, either in <code>services.svc</code> or in network definitions (<code>static.net</code> or Netbox)
=== References ===
[https://github.com/google/capirca/wiki/Policy-format Capirca policy format] (which keywords are accepted?)
== Network configuration coverage ==
=== CR ===


==== TODO ====
==== TODO ====
<syntaxhighlight lang="text">
<syntaxhighlight lang="text">
interfaces {}
chassis {} (partial)
security {
routing-options {}  # TODO: statics
    address-book {}  # Capirca?
protocols {
     nat {}
     router-advertisement {}
     policies {}  # Capirca?
     bgp {}  # TODO: confed. IXPs are out of scope (dedicated tool like peering-manager)
}
}
routing-instances {}
applications {}  # Capirca?
</syntaxhighlight>
</syntaxhighlight>


=== MSW ===
=== MR ===


==== Done ====
==== TODO ====
<syntaxhighlight lang="text">
<syntaxhighlight lang="text">
system {}
routing-instances {}
snmp {}
 
protocols {}
routing-options {}
vlans {}
</syntaxhighlight>
</syntaxhighlight>
=== CLOUDSW ===


==== TODO ====
==== TODO ====
<syntaxhighlight lang="text">
<syntaxhighlight lang="text">
interfaces {}
bgp {}
routing-options {}
</syntaxhighlight>
</syntaxhighlight>


Line 211: Line 293:
(Almost) None.
(Almost) None.


* The "commit" action doesn't work on the SRXs and the MX104, it will do the Juniper's "commit confirmed 2", but not the "commit check" to make the change permanent.
* The "commit" action will not work on the first try with the mr1* devices, but homer will retry.
* Ignore the "Unable to determine FQDN for device" errors.
[[Category:SRE Infrastructure Foundations]]

Revision as of 14:58, 22 June 2021

Homer (previously jnt) is our homemade network configuration manager.

It takes variables from Netbox and YAML files and runs them through Jinja templates to generate Juniper-compatible configuration.

Homer can then send those configurations to selected network devices, for a diff or a safe commit.

The tool is written to not be Wikimedia specific. It only supports Junos but can easily be extended to other platforms.

Its documentation is available at https://doc.wikimedia.org/homer/master/

Its code is on Gerrit: https://gerrit.wikimedia.org/g/operations/software/homer

Its bugs and feature requests are tracked on Phabricator: https://phabricator.wikimedia.org/tag/homer/

This page focuses on Wikimedia's deployment.

Deployment

Homer is deployed via Puppet and Scap to the cumin (fleet management) hosts.

You can find its deploy repository here https://gerrit.wikimedia.org/g/operations/software/homer/deploy

And its Puppet module there https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/homer

In addition, it's available on PyPI: https://pypi.org/project/homer/

Releasing a new version

  • Make a release patch updating the changelog (see this example patch).
  • Once it's merged, update the local checkout and make a git tag. Ideally a signed, annotated one (requires a GPG key and git configured to use it, see signingkey):
$ RELEASE=v0.1.0
$ git tag -s -a "${RELEASE}" -m "${RELEASE}" -m "[Release Notes](CHANGELOG.rst)"
  • Push the generated tag: git push origin "${RELEASE}"
  • Move to the homer-deploy checkout:
$ cd src/
$ git pull
$ git log -1  # check that we are at the right commit
$ cd ..
# At this point git status would show that there is a diff for the 'src' path, indicating the different SHA1 of the git submodule
# Ensure that docker is running
$ make -f Makefile.build all
# Verify that the generated wheels are correct
# At this point the frozen-requirements.txt file will most likely have some changes and the artifacts/artifacts.stretch.tar.gz will be different
$ git add .
$ git commit -m "Release ${RELEASE}"
$ git review
  • Once the above patch has been merged (C+2, V+2 + submit), move to the deployment server in /srv/deployment/homer/deploy
  • Pull the latest changes: git pull
  • Verify that in the src/ directory we're at the correct commit (check also with git status)
  • Deploy the new release; this currently takes two steps:
    • For the buster host (as of May 2021 cumin1001): scap deploy --verbose "Homer release v... - T..."
    • For the bullseye hosts (as of May 2021 cumin2002): move to one of the cumin hosts and run: sudo cookbook sre.deploy.python-code -r 'Release vX.Y.Z' homer 'cumin2002*'

Daily diffs

A cron job runs Homer every 12h (24h per cumin host) to compare the live network configuration with our intended state. Any discrepancy is emailed to the rancid-core alias.

Usage 🚀

Making changes

Note that Homer explicitly asks you when it's about to modify the live network configuration (Type "yes" to commit, "no" to abort.) and will prompt you with a diff of the changes beforehand.

Editing the private repository

Manually edit then commit the files on ssh://cumin1001.eqiad.wmnet:/srv/homer/private .

Git syncs them to the other cumin host and emails a summary of the changes to SREs.

Make sure to mirror all your changes on the mock-private repo: https://gerrit.wikimedia.org/g/operations/homer/mock-private

This repository doesn't have CI, please be extra careful.

Editing the public repository

Similar to our other public repositories, send CRs to https://gerrit.wikimedia.org/g/operations/homer/public and try not to self-+2 your changes without another review.

Its documentation is published at https://doc.wikimedia.org/homer-public/master/.

Editing Netbox

Data is also pulled from Netbox; always make sure that Netbox is accurate before using Homer.

Running Homer from cumin hosts (recommended)

Get familiar with the command line: https://doc.wikimedia.org/homer/master/homer.html . Everything else is taken care of.

The public repository is regularly updated by Puppet.

When pushing configurations, homer will ssh to the network devices using the Homer user. You need to be in the ops group to be able to use its private key.

Some examples:

  • homer "*" diff  # All devices
  • homer "cr*ams*" diff  # esams and knams core routers
  • homer "mr*" commit "My commit message"  # All management routers
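
The queries above read like shell-style globs. As a rough illustration of which devices such a pattern would select, here is a minimal Python sketch using the standard library's fnmatch. The device names are hypothetical, and Homer's real matching logic may differ from this.

```python
from fnmatch import fnmatch

# Hypothetical device names, for illustration only.
DEVICES = [
    "cr1-eqiad.wikimedia.org",
    "cr2-esams.wikimedia.org",
    "cr3-knams.wikimedia.org",
    "mr1-eqiad.wikimedia.org",
]


def select_devices(query, devices=DEVICES):
    """Return the device names matching a shell-style glob query."""
    return [name for name in devices if fnmatch(name, query)]


# "cr*ams*" keeps only the core routers whose name contains "ams".
print(select_devices("cr*ams*"))
```

Trying a pattern against a list like this before running a "commit" action is a cheap way to double-check its scope; running the "diff" action first serves the same purpose on the real devices.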

Running Homer from your local machine (less recommended)

When pushing configurations, your machine will ssh directly to the network devices, which means that you need an account there with the proper permissions.

It's common to test a change locally with the "diff" action. Once satisfied with the result, please merge your change on Gerrit before pushing it with the "commit" action.

Style guides

YAML files

We use json-schema both to prevent mistakes in the configuration and to document it.

https://doc.wikimedia.org/homer-public/master/

Templates

It's ok to give up on indentation.

Capirca (ACL generation)

Task: https://phabricator.wikimedia.org/T273865

Capirca is an actively maintained open source tool made by Google to generate multi-platform ACLs based on generic policy and definitions files.

How it works?

File:Capirca.png

  1. User edits the relevant files (see below) and runs https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/
  2. User runs Homer
  3. Homer pulls the hosts definitions from Netbox
  4. Homer executes Capirca for each relevant policy file (defined in homer-{public|private}/config/{devices|roles}.yaml )
  5. Capirca takes all the (hosts/services) definition files as input, as well as the policy files (while following the includes) and generates the firewall rules in the proper format
  6. Homer adds the previously generated file to the other parts of the generated config and pushes it to the device
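
To make "following the includes" in step 5 concrete, here is a minimal Python sketch that inlines #include directives from an in-memory set of files. It is not Capirca's implementation: the function name, the cr-border-in4.pol file name and the file contents are assumptions made up for the illustration.

```python
import re

# Matches a line such as: #include 'cr-border-in.inc'
INCLUDE_RE = re.compile(r"^\s*#include\s+'([^']+)'\s*$")


def expand_includes(name, files, _seen=()):
    """Return the lines of policy `name` with #include directives inlined."""
    if name in _seen:
        raise ValueError(f"include loop detected at {name}")
    expanded = []
    for line in files[name].splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            # Recurse into the included file, remembering the path taken.
            expanded.extend(expand_includes(match.group(1), files, _seen + (name,)))
        else:
            expanded.append(line)
    return expanded


# Hypothetical file contents, for illustration only.
FILES = {
    "cr-border-in4.pol": (
        "header {\n"
        "  target:: juniper border-in4 inet\n"
        "}\n"
        "#include 'cr-border-in.inc'"
    ),
    "cr-border-in.inc": "term allow_rest {\n  action:: accept\n}",
}

print("\n".join(expand_includes("cr-border-in4.pol", FILES)))
```

Because the included terms end up inline, line numbers reported for the expanded policy no longer match the line numbers of the individual files.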

Advantages

  • IPv4 and IPv6 filters will be updated automatically as long as the hosts have both v4 and v6 records
  • Limited blast radius if a mistake is made in a given .inc policy file
  • Centralized services (ports) definitions in a text file
  • Hosts definitions synced up from Netbox
  • Same syntax for all platforms (Juniper and JuniperSRX in our case)
  • Shading detection (eg. useless rules hidden behind a more generic one)
  • Reduced operational complexity
  • Easier to audit

Limitations

  • Dependency on Netbox
  • Increased setup complexity
  • Doesn't distinguish between v4 and v6 prefix-lists, so to leverage the auto-generation of both v4 and v6 we have to specify both prefix-lists:
    [edit firewall family inet filter loopback4 term return-tcp from source-prefix-list]
            wikimedia4 { ... }
    +        wikimedia6;
    • Use prefix-lists when the prefixes will be used in routing (eg. BGP) rules as well, so they're only defined once
  • Can't have jinja2 applied to it, so filters like ping-offload need to stay out of Capirca
  • Netbox definitions need to be manually updated by running https://netbox-next.wikimedia.org/extras/scripts/capirca.GetHosts/
  • Upstream issues:
    • https://github.com/google/capirca/issues/257
    • https://github.com/google/capirca/issues/246
    • https://github.com/google/capirca/issues/245

How to use it?

Update an existing ACL

  1. Browse homer-{public|private}/policies
  2. Find the relevant .pol or .inc file (eg. cr-analytics.pol)
  3. Update it (use existing rules and guidelines as models)
Guidelines
term my-term {
  comment:: "T123456"  # Don't forget the quotes
  destination-address:: foo # from either static.net or Netbox
  destination-port:: bar  # From the services.svc file
  action:: deny
}

term allow_rest {
  action:: accept # All our platforms have a default deny
}
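  • {source|destination}-port:: are defined in the homer-public/definitions/services.svc file, add yours if it's not already there. Ordered by port numbers.
  • Most {source|destination}-address:: are pulled from Netbox and grouped by their hostname prefix (eg. all aqs* hosts are under aqs_group)
    • To update a group (eg. provisioning/deprogramming a host), run https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/ then run Homer.
    • For network prefixes and special IPs (eg. VIPs), add them to homer-public/definitions/static.net.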
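
The grouping by hostname prefix can be pictured with the following minimal Python sketch. It assumes the prefix is simply the hostname with its trailing digits removed; the real Netbox script may use a different rule, and the hostnames are hypothetical.

```python
import re
from collections import defaultdict


def group_hosts(hostnames):
    """Group hostnames under '<prefix>_group', where prefix = name minus trailing digits."""
    groups = defaultdict(list)
    for hostname in sorted(hostnames):
        prefix = re.sub(r"\d+$", "", hostname)  # assumption: strip the numeric suffix
        groups[f"{prefix}_group"].append(hostname)
    return dict(groups)


# Hypothetical hostnames, for illustration only.
print(group_hosts(["aqs1005", "db1001", "aqs1004"]))
```

With a rule like this, a policy term only needs to reference aqs_group, and a newly provisioned aqs host joins the group the next time the definitions are regenerated.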

Add a new ACL (firewall filter)

Most likely a Netops task.

  1. Create a .pol file in homer-{public|private}/policies (eg. my-filter.pol), see guidelines below
  2. Reference the above policy file in either:
    • homer-{public|private}/config/{devices|roles}.yaml (recommended)
      capirca:
          - my-filter  # The policy file name without the extension
      
    • Another .pol file with #include 'my-filter.pol'
Guidelines
  • Example juniper headers
    header {
      comment:: "foobar"
      target:: juniper my-filter4 inet
      target:: juniper my-filter6 inet6
    }
    # juniper: platform (which final syntax to use)
    # my-filter4: the juniper filter name that will be generated
    # inet: IP family to target (IPv4 vs. IPv6)
  • Example SRX security policies headers
    header {
        comment:: "Generated by Capirca"
        target:: srx from-zone production to-zone production address-book-global
    }
    # srx: platform
    # from/to security-zones (they need to already exist)
    # Use global address-book (default everywhere in our infra)
  • If you have to have different terms for v4 and v6 policies, put all the common policies in a .inc file, then include it before/after the specific term. For example:
    header {
      target:: juniper border-in4 inet
    }
    term offload-ping {
        verbatim:: juniper "term offload-ping4 {"
        verbatim:: juniper "    filter offload-ping4;"
        verbatim:: juniper "}"
    }
    #include 'cr-border-in.inc'
    
    header {
      target:: juniper border-in6 inet6
    }
    #include 'cr-border-in.inc'
  • If a specific Juniper syntax is not supported by Capirca, use the verbatim:: keyword; its content is copied as-is.

Common errors

  • Error parsing cr: No such service, foo
    • There is a {source|destination}-port:: foo in cr.pol (or one of its child includes) not defined in services.svc.
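  • Error parsing cr-analytics: UNDEFINED: puppetmaster
    • There is a {source|destination}-address:: puppetmaster in cr-analytics.pol (or one of its child includes) not defined in static.net or Netbox.
    • See https://netbox.wikimedia.org/extras/scripts/#script.GetHosts "Last run" then "output" tab to see the Netbox generated definitions.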
  • Error parsing cr:  ERROR on "udp" (type STRING, line 38, Next 'destination-port'). Error parsing cr-analytics:  ERROR on "T274951" (type STRING, line 312, Next 'destination-address')
    • Most common cause is forgetting the double colon :: or forgetting the quotes around a comment.
    • Note that it shows the next line, and the lines don't always match if there are includes.
  • Multiple definitions found for service:  git-ssh.eqiad
    • The service is defined twice, either in services.svc or in network definitions (static.net or Netbox)
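
Two of the errors above (a missing :: or unquoted comment, and a name defined twice) can be caught before running Homer with a rough pre-check like the Python sketch below. It is not part of Homer or Capirca. The helper names and sample data are hypothetical, and it assumes definitions are written as NAME = value, one keyword per line in policies.

```python
import re
from collections import Counter

# "term x {", "header {" and "}" are structure, not keyword lines.
STRUCTURAL = re.compile(r"^(?:(?:term|header)\b.*\{|\})$")
# A keyword line looks like "keyword:: value".
KEYWORD = re.compile(r"^([\w-]+)::\s*(\S.*)$")


def find_suspicious_lines(policy_text):
    """Return (line_number, reason) pairs for lines that look malformed.

    Limitation: values continued on a following line are reported as well.
    """
    problems = []
    for number, raw in enumerate(policy_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and #include lines
        if not line or STRUCTURAL.match(line):
            continue
        match = KEYWORD.match(line)
        if match is None:
            problems.append((number, "missing '::' separator"))
        elif match.group(1) == "comment" and not match.group(2).startswith('"'):
            problems.append((number, "comment value is not quoted"))
    return problems


def find_duplicate_names(definitions_text):
    """Return names defined more than once in 'NAME = value' definition lines."""
    names = Counter(
        line.split("=", 1)[0].strip()
        for line in definitions_text.splitlines()
        if "=" in line.split("#", 1)[0]
    )
    return sorted(name for name, count in names.items() if count > 1)


# Hypothetical inputs, for illustration only.
POLICY = """\
term my-term {
  comment:: T123456
  destination-address:: foo
  destination-port bar
  action:: deny
}
"""
DEFINITIONS = "ssh = 22/tcp\ngit-ssh.eqiad = 22/tcp\ngit-ssh.eqiad = 208.80.154.250/32\n"

print(find_suspicious_lines(POLICY))
print(find_duplicate_names(DEFINITIONS))
```

Passing the services and network definitions together to find_duplicate_names mirrors the case above where one name exists both as a service and as a network definition.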

References

Capirca policy format (which keywords are accepted?)

Network configuration coverage

CR

TODO

chassis {}  # partial
routing-options {}  # TODO: statics
protocols {
    router-advertisement {}
    bgp {}  # TODO: confed. IXPs are out of scope (dedicated tool like peering-manager)
}

MR

TODO

routing-instances {}

CLOUDSW

TODO

bgp {}
routing-options {}

Common/known issues

(Almost) None.

  • The "commit" action will not work on the first try with the mr1* devices, but homer will retry.