Installing Grid Engine 6.x on modern versions of Mac OS X (client and server)
As of early 2010, Grid Engine is effectively not installable out of the box on modern versions of Mac OS X. We have seen this recently with server and client versions of OS X 10.5.* as well as the new Snow Leopard (10.6.*) releases.
This used to not be a big deal and the workarounds were trivial. Something seems to have changed, however, in recent updates to OS X that renders the Grid Engine binaries unable to function when they are:
- Started during the installation process
- Started by the SystemStarter() scripts that SGE tries to install on Apple systems
- Started manually by user root via the command line
This is obviously a show-stopper now for new SGE users. We have no idea why this is the case but have witnessed the behavior on many 10.5 and 10.6 systems (many of them running OS X Server).
The problem with Grid Engine on modern versions of OS X can be simply stated:
SGE binaries will not launch or will be unreliable when started by ANY METHOD other than the system level OS X “launchd” service framework.
… and since neither the SGE installation scripts nor the “traditional” SystemStarter() scripts that SGE installs on Mac OS X systems use the launchd framework, this basically means that SGE 6.2 is unusable out of the box without manual intervention and custom starter scripts.
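To make the launchd requirement a bit more concrete, here is a minimal Python sketch of what a launchd job definition for sge_qmaster could look like. It only illustrates the general shape of a .plist file and is not the output of any BioTeam tool. The label net.example.sge.qmaster, the /opt/sge install path, the darwin-x86 architecture directory and the "default" cell name are all assumptions made for the example. A job that really works on your system may need more keys than this, so treat it as a starting point to adapt.

```python
import plistlib


def make_launchd_job(label, program, sge_root, sge_cell="default"):
    """Return a dict describing a launchd job for one SGE daemon.

    Only standard launchd.plist keys are used. All argument values
    passed in below are hypothetical and must be adapted to your site.
    """
    return {
        "Label": label,                      # unique job name
        "ProgramArguments": [program],       # command line to execute
        "EnvironmentVariables": {            # environment SGE expects
            "SGE_ROOT": sge_root,
            "SGE_CELL": sge_cell,
        },
        "RunAtLoad": True,                   # start when the job is loaded
    }


# Hypothetical values: adjust label, paths and cell name for your install.
job = make_launchd_job(
    label="net.example.sge.qmaster",
    program="/opt/sge/bin/darwin-x86/sge_qmaster",
    sge_root="/opt/sge",
)

# Serialize to the XML property-list format and show the result.
plist_bytes = plistlib.dumps(job)
print(plist_bytes.decode("utf-8"))
```

The printed XML is the kind of file that would be saved with a .plist extension and handed to launchctl.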
How to manually get SGE 6.2x working on Mac OS X systems
This process will be very simple to someone who already understands Grid Engine administration and is comfortable with SGE admin commands. Unfortunately it may be confusing to novices or beginners, because we have to interrupt the automatic SGE installation process and complete things by hand using SGE admin commands.
- Run the “install_qmaster” script normally. The fact that it will fail is not a big deal – before it fails it will construct the SGE_CELL directory, configure spooling and otherwise do all of the behind the scenes steps necessary to support a functional SGE qmaster process.
- When the install_qmaster script fails, exit out and manually kill any zombie sge_qmaster daemons that may be hanging around on the system
- Download and unpack the sge-launchd-scriptmaker tarball from this BioTeam Blog post.
- Run the sge-launchd-scriptmaker utility. This simple Perl script will query your SGE environment and construct several .plist files suitable for copying into the /Library/LaunchDaemons/ OS X folder
- Using OS X “launchctl” commands, restart sge_qmaster
- At this point the SGE qmaster is running and functional and we can manually complete a few additional configuration steps…
- Create and populate a hostgroup named “@allhosts”
- Create and configure the default cluster queue object (“all.q”)
- At this point the SGE install_execd script should work, but you can also skip that step and just launch sge_execd via launchctl (a step that will be necessary even if the exechost installer script works fine)
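The manual steps above can be sketched roughly as follows. This Python snippet does not run anything. It only builds a hostgroup definition in the format that qconf reads from a file and prints the commands you would then type as root. The host names, the /tmp/allhosts.hgrp file and the .plist file name are hypothetical placeholders, since the files produced by sge-launchd-scriptmaker may be named differently on your system.

```python
import shlex


def hostgroup_definition(name, hosts):
    """Return hostgroup file contents in the format 'qconf -Ahgrp' reads."""
    return f"group_name {name}\nhostlist {' '.join(hosts)}\n"


def manual_steps(plist_path, hostgroup_file, queue_name="all.q"):
    """Return the admin commands for the manual steps, in order."""
    return [
        # Load (and enable) the qmaster job through launchd.
        ["launchctl", "load", "-w", plist_path],
        # Add the hostgroup from the prepared definition file.
        ["qconf", "-Ahgrp", hostgroup_file],
        # Add the cluster queue; qconf opens an editor with a template.
        ["qconf", "-aq", queue_name],
    ]


# Hypothetical host names and paths, for illustration only.
definition = hostgroup_definition("@allhosts", ["macmini1", "macmini2"])
steps = manual_steps(
    plist_path="/Library/LaunchDaemons/net.example.sge.qmaster.plist",
    hostgroup_file="/tmp/allhosts.hgrp",
)

print(definition)
for cmd in steps:
    print(shlex.join(cmd))
```

When the editor opened by qconf -aq appears, the queue's hostlist attribute can be set to @allhosts so that the new queue spans every host in the group.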
The screencast below shows a recording of me stepping through this process on a small Apple Mac Mini running Mac OS X 10.5.8 Server (sorry, I thought it was a Snow Leopard box when I began the work …).
If you don’t want to watch the embedded video below, you can navigate directly to the screencast site and download the full movie file. The video is hosted here: http://www.screencast.com/t/NjMyNGJiNWM