Portal:Toolforge/Admin/Son Of Grid Engine Notes

This page is for organizing links and documentation in the process of moving our Grid Engine installation to Stretch and SonOfGridEngine.

Puppet Issues

Probably the most baffling part of changing things in GridEngine/Toolforge is the Puppet setup and the accompanying Hiera chaos. To work around this, here is an outline of how it should end up:

  1. Each function of a grid server should have a single, usable class within the gridengine module. A base module should exist, but the linter won't accept all the includes and inheritance currently relied on. This should be little more than ensuring that the required directories, commands and packages are installed and available, with minimal customization.
  2. From there, a set of profiles that configure the classes in the module should handle all of the injection of configuration and commands needed to get the grid running. This is where we should look whenever anything needs adjustment. Hiera data needs to be confined to git, the private repo and a minimal set in Horizon (perhaps just for parameters). The settings here in wikitech tend to cause trouble in any kind of testing because a surprising number of things depend on them.
  3. Roles can just collect the profiles and even get applied via Horizon as they are now. However, one goal for making things testable and fixable is that when a Toolforge role is applied, it should be easy to figure out which Hiera dependencies are missing (using params that show up, for instance).
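
The "which Hiera dependencies are missing" goal in point 3 could be checked with something like the sketch below. It is only an illustration, not existing tooling: the function name and the example keys (`profile::example_grid::...`) are hypothetical, and it assumes the merged Hiera data for a host is already available as a plain dictionary.

```python
from typing import Iterable, List, Mapping


def missing_hiera_keys(required: Iterable[str],
                       hiera_data: Mapping[str, object]) -> List[str]:
    """Return the required keys that are absent from the merged Hiera data."""
    # Sorted so the report is stable between runs.
    return sorted(key for key in set(required) if key not in hiera_data)


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    required = [
        "profile::example_grid::master_host",
        "profile::example_grid::spool_dir",
    ]
    hiera_data = {"profile::example_grid::master_host": "grid-master.example"}
    for key in missing_hiera_keys(required, hiera_data):
        print(f"missing Hiera key: {key}")
```

Run against a role's expected parameters, a report like this would list the gaps up front instead of leaving them to surface as catalog compilation failures.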

Open questions

  • NFS - Can we live without it?
    • Three sticking points:
  1. Job spooling - without shared spool directories, very large datasets will require a prolog and epilog handler to prepare and dismantle the job. This seems solvable.
  2. Shadow master failover - This is a serious one. Right now, a file on NFS (act_qmaster) is how everything knows who the master is. If we can live with either a floating IP or an immediately pushed file change of some sort, NFS is not needed for this.
  3. Host-Based Authentication - We collect ssh host keys in the NFS config in order to enable this. To get around it, we would probably need PuppetDB or Exported Resources. At this time, Production has that, but Cloud Services does not.
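
For the job spooling point above, a prolog/epilog pair could look roughly like the following sketch. It is a minimal illustration under assumptions, not the actual handler: the function names, the `job-<id>` directory layout and the scratch location are all hypothetical, and a real handler would be wired into the queue's prolog and epilog settings.

```python
import shutil
from pathlib import Path


def stage_in(dataset: Path, scratch_root: Path, job_id: str) -> Path:
    """Prolog step: copy the dataset into a per-job scratch directory."""
    # Hypothetical layout: one directory per job under the local scratch root.
    job_dir = scratch_root / f"job-{job_id}"
    job_dir.mkdir(parents=True, exist_ok=False)
    target = job_dir / dataset.name
    if dataset.is_dir():
        shutil.copytree(dataset, target)
    else:
        shutil.copy2(dataset, target)
    return target


def stage_out(scratch_root: Path, job_id: str) -> None:
    """Epilog step: remove the per-job scratch directory and its contents."""
    shutil.rmtree(scratch_root / f"job-{job_id}", ignore_errors=True)
```

The prolog would call `stage_in` before the job starts and the epilog would call `stage_out` afterwards, so nothing about the job's data has to live on a shared spool directory.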