Grid Engine 6.2 on Mac OS X

07 Feb 2010

Installing Grid Engine 6.x on modern versions of Mac OS X (client and server)

As of early 2010, Grid Engine is effectively not installable out of the box on modern versions of Mac OS X. We have seen this recently with server and client versions of OS X 10.5.* as well as the new Snow Leopard (10.6.*) releases.

This used to not be a big deal and the workarounds were trivial. Something seems to have changed, however, in recent updates to OS X that renders the Grid Engine binaries unable to function when started by any of the following:

  • The SGE installation process
  • The SystemStarter() scripts that SGE tries to install on Apple systems
  • User root, manually, via the command line

This is obviously a show-stopper now for new SGE users. We have no idea why this is the case but have witnessed the behavior on many 10.5 and 10.6 systems (many of them running OS X Server).

The problem with Grid Engine on modern versions of OS X can be simply stated:

SGE binaries will not launch or will be unreliable when started by ANY METHOD other than the system level OS X “launchd” service framework.

… and since neither the SGE installation scripts nor the “traditional” SystemStarter() scripts that SGE installs on Mac OS X systems use the launchd framework, this basically means that SGE 6.2 is unusable out of the box without manual intervention and custom starter scripts.

How to manually get SGE 6.2x working on Mac OS X systems

This process will be very simple for someone who already understands Grid Engine administration and is comfortable with SGE admin commands. Unfortunately, it may be confusing to novices because we have to interrupt the automatic SGE installation process and complete things by hand using SGE admin commands.

The method:

  1. Run the “install_qmaster” script normally. The fact that it will fail is not a big deal – before it fails it will construct the SGE_CELL directory, configure spooling and otherwise do all of the behind the scenes steps necessary to support a functional SGE qmaster process.
  2. When the install_qmaster script fails, exit out and manually kill any zombie sge_qmaster daemons that may be hanging around on the system.
  3. Download and unpack the sge-launchd-scriptmaker tarball from this BioTeam Blog post.
  4. Run the sge-launchd-scriptmaker utility. This simple Perl script will query your SGE environment and construct several .plist files suitable for copying into the /Library/LaunchDaemons/ OS X folder.
  5. Using OS X “launchctl” commands, restart sge_qmaster.
  6. At this point the SGE qmaster is running and functional and we can manually complete a few additional configuration steps…
  7. Create and populate a hostgroup named “@allhosts”
  8. Create and configure the default cluster queue object (“all.q”)
  9. At this point the SGE install_execd script should work, but you can also skip that step and just launch sge_execd via launchd using launchctl (a step that will be necessary even if the exechost installer script functions fine).
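
The launchctl and qconf portion of the method above (steps 4 through 8) might look roughly like the following. This is a dry-run sketch that only prints the commands rather than running them. The plist file name is taken from the reader comments further down, not from the steps themselves, so treat it as an assumption about what sge-launchd-scriptmaker generates on your system. Note that “qconf -ahgrp” and “qconf -aq” open an editor for you to fill in the object.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for steps 4-8 instead of running them.
# Assumption: the plist name below matches what sge-launchd-scriptmaker
# generated on your system (it is the name quoted in the reader comments).

PLIST_DIR="/Library/LaunchDaemons"
QMASTER_PLIST="net.sunsource.gridengine.sgeqmaster.plist"

print_post_install_steps() {
    # Steps 4-5: copy the generated plist into place and hand it to launchd
    echo "cp $QMASTER_PLIST $PLIST_DIR/"
    echo "launchctl load $PLIST_DIR/$QMASTER_PLIST"
    # Step 7: create the @allhosts host group (opens an editor)
    echo "qconf -ahgrp @allhosts"
    # Step 8: create the default cluster queue (opens an editor)
    echo "qconf -aq all.q"
}

print_post_install_steps
```

Run the printed commands by hand as root once the paths look right for your installation.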

The screencast below shows a recording of me stepping through this process on a small Apple Mac Mini running Mac OS X 10.5.8 Server (sorry, I thought it was a Snow Leopard box when I began the work …).

If you don’t want to watch the embedded video below, you can navigate directly to the screencast site and download the full movie file. The video is hosted here:

Feedback welcome.


  • Trey
    Posted at 18:39h, 08 February

    Hey Chris,

    Awesome guide. It was extremely helpful and made quick work of setting up the qmaster correctly and efficiently.

    Have you tried to start up an execution host on a separate machine? I can’t for the life of me get this to work. The dead end that I keep running into looks like this:

    remotehost:SGE_ROOT root# bin/darwin-x86/sge_execd
    daemonize error: timeout while waiting for daemonize state

    Everything else is configured correctly as far as I can tell. The secondary host is configured as an administrative host, a submit host, and an execution host, and it has been added to the @allhosts group.

    qmaster:~ sgeadmin$ qstat -f
    queuename            qtype resv/used/tot. load_avg arch       states
    all.q@qmaster.cs     BIP   0/0/16         0.09     darwin-x86
    all.q@remotehost.cs  BIP   0/0/16         -NA-     -NA-       au

    Thanks again for all the help.
    – Trey Wessler

    • blogadmin
      Posted at 07:17h, 10 February

      Hi Trey, the error about being “unable to daemonize” is pretty much what I see when trying to run SGE binaries as the root user outside the control of the Launchd framework.

      To get your execution host going I’d first issue some commands on the qmaster to make sure that the compute node is known as an administrative host (“qconf -ah <hostname>”), a submit host (“qconf -as <hostname>”), etc.

      You can even use commands on the master to make your nodes part of queues without having the exechosts running yet, although this is not required. I’ve done many clusters in the past where we pre-staged all of the configuration settings for the compute nodes so that all we had to do was install the startup scripts on the nodes and it all worked.

      I think you need to skip the install script, get the sge_execd daemon running via launchd and just configure queue related stuff manually. You may have to make a spool directory for the compute nodes manually as that is one step that the install script usually does. Even that may not be necessary – it could all just get sorted out automatically.

      Once sge_execd is running under launchd it will accept commands and will be able to communicate. Getting things functional from there should be quite easy.
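
As a rough illustration of the pre-staging described in this reply, the sketch below prints the qmaster-side commands for registering a compute node before sge_execd is running on it. It is a dry run (nothing is executed), and the host name node01.example.com is a made-up placeholder rather than anything from the thread.

```shell
#!/bin/sh
# Dry-run sketch: prints the qmaster-side commands for registering a compute
# node. Nothing is executed. The host name used at the bottom is a
# hypothetical placeholder.

print_prestage_commands() {
    host="$1"
    echo "qconf -ah $host"    # register as an administrative host
    echo "qconf -as $host"    # register as a submit host
    # append the host to the hostlist of the @allhosts host group
    echo "qconf -aattr hostgroup hostlist $host @allhosts"
}

print_prestage_commands node01.example.com
```

With the node registered this way, the remaining work on the node itself is getting sge_execd started under launchd.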

  • Trey
    Posted at 15:05h, 10 February

    Hey guys,

    Just a quick update and heads up.

    I finally got it working! The problem is that I was using AFP as my network sharing protocol. I don’t know the exact reason, but I decided to change my automount options to NFS on a hunch. Nothing else had to be modified. So, just as a tip:

    SGE does not work with AFP.

    My guess is that the sge_execd checks for nfs mounted directories. That, or, the service is waiting for NFS to start before it will start. I haven’t done much testing, but I came here to share before I forgot.

    Thanks for all of the help. You guys are great! Keep up the good work.


  • Eric Brown
    Posted at 12:04h, 22 February

    Very nice screencast. I remember a problem where SGE would not cooperate with users who were installed via OpenDirectory. Does this problem still exist?

    Do the users have to be added as local users on each machine?

    • blogadmin
      Posted at 12:12h, 22 February

      Hi Eric,

      From what I recall, OpenDirectory was not at fault, it was actually the incredibly long group membership that users created in OpenDirectory. Once I truncated group membership down in OD I had no issues with OD users and grid engine.

      Try running the command “id <username>” or “groups <username>” and if you see a massive group membership list, that may be the culprit. Reduce the size and number of groups that the user belongs to and you might be ok.
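
To put a number on the group membership check suggested in this reply, a small sketch like the following counts a user's groups. It uses numeric IDs from “id -G” rather than group names, because directory group names can contain spaces and would throw off a word count. The helper name group_count is made up for illustration, and the thread does not say what count or list length is too much for Grid Engine, so this only reports the numbers.

```shell
#!/bin/sh
# Sketch: report how many groups a user belongs to and how long the list is.
# group_count is a hypothetical helper name, not an SGE or OS X tool.

group_count() {
    # $1 is a space-separated list of numeric group IDs, as printed by `id -G`
    printf '%s\n' "$1" | wc -w | tr -d ' '
}

user="$(id -un)"                 # current user; substitute another name if needed
gids="$(id -G "$user")"
echo "$user belongs to $(group_count "$gids") groups (${#gids} characters of IDs)"
```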


  • Tom
    Posted at 14:32h, 11 March

    Not clear what to download. I downloaded ge62u5_darwin-x86.tar. Untarring this gives: ge-6.2u5-bin-darwin-x86.tar.gz and ge-6.2u5-common.tar.gz. ge-6.2u5-common.tar.gz has install_qmaster, but does not have utilbin (unlike your example machine). When I run ./install_qmaster, I get:

    ./util/install_modules/ line 69: ./utilbin/darwin-unsupported/uidgid: No such file or directory
    ./util/install_modules/ line 70: ./utilbin/darwin-unsupported/uidgid: No such file or directory
    Can’t find binaries for architecture: darwin-unsupported!
    Please check your binaries. Installation failed!
    Exiting installation.

    Maybe I downloaded the wrong files? Where does one find the right ones?

    I’m trying to do this on a Fall 2009 MacPro running 10.6.2

    Thanks for any suggestions!


    • blogadmin
      Posted at 06:32h, 12 March

      Tom – this is very interesting. The root cause seems to be that SGE can’t figure out your machine architecture. On the system you describe the output of the utilbin/arch command should be “darwin-x86” which is why you have the darwin-x86 architecture specific binary tarball that came with the SGE distribution.

      You should be able to work around this by making a symbolic link named darwin-unsupported that points to darwin-x86. That will mask the problem, and all of the tools and scripts looking for that “darwin-unsupported” path will then (hopefully) find binaries that work on your system.

      We really should tease apart the arch script and find the case statement that deals with Darwin. Almost certainly it’s running some sort of ‘uname’ call and getting back a response that it does not yet recognize as a valid OS X system. This should be a quick fix and patch with the SGE team.


  • Cristobal
    Posted at 15:08h, 31 March

    i am almost complete, i just got stuck on the

    qconf -ahgrp @allhosts command
    reresolve -> cannot resolve host name

    my /etc/hosts file is localhost broadcasthost
    ::1 localhost
    fe80::1%lo0 localhost

    but if i type /bin/hostname i get:

    im a little confused, i put common dir in “/common” and changed owner:
    chown ijorge /common
    but i was on a root session, so i had it all mixed up and i guess i messed it up.

    i have some questions:

    a) on my mac, the Admin User is “ijorge” but the root user is “root”. which user should i use to install? i see you used root because of the command prompt but the name was different from “root”, so i got confused.

    b) does the “common” directory have to be placed on “/common” anyway, or does it depend on which user i used to install? im not clear here.

    thanks in advance!

    • blogadmin
      Posted at 15:13h, 31 March

      (1) You should install as the “root” user but the directory holding the SGE files can be owned by “ijorge” or whomever you want to be the SGE Administrator.

      (2) SGE can be installed anywhere, my use of “/common” was just a personal preference. The main thing is that SGE should be installed on a shared NFS file system that is visible to all the computers in your cluster for the easiest method of operation. It is possible to install without a common filesystem underneath but it can be harder to set up and troubleshoot. If you are just installing SGE on a single machine then the location does not matter at all.

  • Cristobal
    Posted at 15:50h, 31 March


    i reinstalled with your advices,

    im getting this error when doing “qconf -sconf”
    reresolve hostname failed: can’t resolve host name

    my host name from /bin/hostname is:

    and my file /etc/hosts is the one i mentioned on the last post

    • blogadmin
      Posted at 15:53h, 31 March

      SGE is sensitive to DNS and it looks like it is not set up for your system. One workaround would be to edit /etc/hosts on your system and make an entry for “ijorge.local” that uses the IP that the SGE qmaster is listening on. On most Apple OS X systems I will go out of my way to make sure that /etc/hosts is correct and fully populated in addition to having forward and reverse DNS set up.


  • Cristobal
    Posted at 17:05h, 31 March

    i tried adding the line ijorge.local

    but no luck,

    when you said that my DNS was probably not set up, do you mean that i have to go to System Preferences->Network->Ethernet1(my case)–> and set up DNS ip ??

    at the moment the values are
    DNS server: (the same as router ip)
    DNS search domain: nothing

    and my ip is
    is this set up ok?

    • blogadmin
      Posted at 17:17h, 31 March

      You can’t use that address for Grid Engine or any other program that talks on a network. It is a local loopback “special” IP address.

      Try putting an entry for ijorge.local that uses your machine’s actual network IP address into your /etc/hosts file instead.

  • Cristobal
    Posted at 17:29h, 31 March

    it still doesnt work 🙁

    im very thankful for your help already,
    i will try installing again everything and post back!

  • Cristobal
    Posted at 23:12h, 31 March

    Hello again Chris,

    i just cant make this work, i reinstalled all again.
    added to /etc/hosts ijorge.local

    and still qconf -sconf cannot resolve.
    however, if i run from the utilbin gethostbyname it does resolve it.

    sh-3.2# /common/utilbin/darwin-x86/gethostbyname ijorge.local
    Hostname: ijorge.local
    Host Address(es):

    and reverse too
    sh-3.2# /common/utilbin/darwin-x86/gethostbyaddr
    Hostname: ijorge.local
    Host Address(es):

    this must be a really small error hidden somewhere, i just cannot find it. has this ever happened to you??

  • Cristobal
    Posted at 23:58h, 31 March


    it worked now, you wont believe how small was the problem.
    i had to add the line to /etc/hosts this way. ijorge.local ijorge

    and it worked everything till the end of your tutorial!!
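
The fix described here amounts to listing both the fully qualified name and the short name on the same /etc/hosts line. A small sketch of a check for that is shown below. The helper name has_host_entry is made up for illustration, and the address 192.0.2.10 is a documentation placeholder because the real IP address did not survive in the comment; substitute your machine's actual network address.

```shell
#!/bin/sh
# Sketch: check that a hosts file has one non-comment line listing BOTH the
# fully qualified name and the short name (field 1 is the IP address).
# has_host_entry is a hypothetical helper; 192.0.2.10 is a placeholder IP.

has_host_entry() {
    # usage: has_host_entry FILE FQDN SHORTNAME
    awk -v fq="$2" -v sn="$3" '
        /^[[:space:]]*#/ { next }            # skip comment lines
        {
            f = 0; s = 0
            for (i = 2; i <= NF; i++) {      # names start at field 2
                if ($i == fq) f = 1
                if ($i == sn) s = 1
            }
            if (f && s) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Demonstrate on a temporary file instead of touching the real /etc/hosts
tmp="$(mktemp)"
printf '127.0.0.1 localhost\n192.0.2.10 ijorge.local ijorge\n' > "$tmp"
if has_host_entry "$tmp" ijorge.local ijorge; then
    echo "entry with both names found"
else
    echo "no line lists both names"
fi
rm -f "$tmp"
```

To check a live system, call has_host_entry /etc/hosts with your own host names.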
    i have to say really thanks for your help, your tutorial (the best i’ve found on internet), and your will to help cluster-noobs like me.

    well now i have to do this, to the lab. because i was only testing at home with 1 iMac.

    on the lab we have 4 Mac Pros with Leopard 10.5.6, they are 64 bit archs if im not mistaken, with Intel Xeon, they were bought just on November 2009.

    my first attempt was using xgrid, it was easy to install but a headache to make it work with openMPI since the compatibility is somehow broken.

    my question is the following,

    i) where can i find the 64 bit version of the SGE installer for Mac XEON ??

    ii) if there wasnt any other choice than using this 32-bit version, will the C programs be able to use all 8GB of memory from each Mac pro??

    iii) and my last question since i was testing in my same machine i only added “myself” to the grid… is the process of adding more machines as simple as adding them in “@allhosts”. i mean do i have to do something on each machine apart from that?

    thanks in advance, your tutorial is excellent, i hope the video never goes down.

  • Cristobal
    Posted at 17:17h, 07 April


    i tried the ./install_qmaster script on a mac-pro with Leopard 10.5.6 and the daemon did start,
    however, after reboot it does not restart, so i anyways i had to include your fixes.

    im facing a weird problem when configuring an execution host, its on the same network

    but this is the error of the script after the “port step”.

    Checking hostname resolving

    Cannot contact qmaster. The command failed:

    ./bin/darwin-x86/qconf -sh

    The error message was:

    ERROR: unable to send message to qmaster using port 6444 on host “mac-pro-3”: can’t resolve host name

    You can fix the problem now or abort the installation procedure.
    The problem can be:

    – the qmaster is not running
    – the qmaster host is down
    – an active firewall blocks your request

    Contact qmaster again (y/n) (‘n’ will abort) [y] >>


    im sure this is a problem of hostnames, because the exec-host tries to resolve “mac-pro-3”, but if i type hostname on the master node, i get “mac-pro-3.local”. i dont know how to “really” change the hostname, and get rid of that “local” suffix after each hostname.
    modifying the /etc/hosts file did not work, even the aliases didnt work. the “scutil” command did change the hostname to “mac-pro-3”, however when i tried to ping from another PC, it cannot resolve that new name and it still responds to pings as “mac-pro-3.local”.

    have you faced this problem before?

    thanks in advance

  • Jake
    Posted at 09:29h, 11 May

    Did the bioteam ever figure out the issue with the Arch script on install as noted by Tom above? I’m getting the same response, “darwin-unsupported”, during install after the upgrade to Mac OS 10.6 and the install fails and exits.

    Specs: Darwin Kernel Version 10.3.0: Fri Feb 26 11:57:13 PST 2010; root:xnu-1504.3.12~1/RELEASE_X86_64 x86_64

    By the way, I noticed that when I run
    $ arch
    from within the util directory, “i386” is returned. When I run
    $ ./arch
    from within the util directory, “darwin-unsupported” is returned.

  • Jake
    Posted at 10:31h, 11 May

    nevermind–so far it’s working by changing unsupported in the arch script to “x86”

  • Ryan Evans
    Posted at 09:12h, 28 May

    I found a solution to the “darwin-unsupported” issue.
    Under 10.6 (desktop and server) and earlier, running on an Intel chip, /usr/bin/arch returns “i386”. The SGE arch script uses “uname -m”.
    Running “uname -m” on 10.6 desktop returns “i386” while “uname -m” on 10.6 server returns “x86_64”.
    I added this to the SGE arch script inside the “Darwin” section.


    This solved the darwin-unsupported problem.
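
The snippet from this comment did not survive, so the following is only a guess at the general idea, not the commenter's code and not the text of the real SGE arch script: accept both values that “uname -m” is reported to return on Intel Macs and map them to the darwin-x86 binaries. The function name darwin_arch_for is made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the idea only; this is NOT the actual SGE arch script.
# Maps the machine string reported by `uname -m` to an SGE architecture name.

darwin_arch_for() {
    case "$1" in
        i386|x86_64) echo "darwin-x86" ;;          # both values seen on Intel Macs
        *)           echo "darwin-unsupported" ;;  # anything unrecognized
    esac
}

darwin_arch_for i386
darwin_arch_for x86_64
```

In the real script the equivalent change would go inside the existing “Darwin” branch, which is where the comment says the addition was made.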

  • Barry McInnes
    Posted at 14:54h, 19 July

    We have had sge62u3 working fine in a 10.5 cluster using cron to startup the sge_execd process on clients. Trying a 10.6 client the sge_execd process hangs on the qping process. Next I have tried the launchctl files you generated but get a startup error –

    bash-3.2# launchctl start /Library//LaunchDaemons/net.sunsource.gridengine.sgeexecd.plist
    launchctl start error: No such process

    The filenames in the plist are all defined –
    bash-3.2# ls -l /usr/local/sge/bin/darwin-x86/sge_execd
    -rwxr-xr-x 1 root wheel 1650040 Jun 4 2009 /usr/local/sge/bin/darwin-x86/sge_execd

    Is there a problem with the syntax in the plist file ?
    cat net.sunsource.gridengine.sgeexecd.plist







    • blogadmin
      Posted at 14:57h, 19 July

      Barry, did you try “launchctl load” instead of “start”? That may be a requirement for initially telling launchd about the new .plist files. I’ll have to check your syntax against one of ours; the blog comment may have messed with your formatting.
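
One detail worth spelling out: “launchctl load” takes the path to a .plist file, while “launchctl start” takes a job label, so passing a file path to “start” produces the “No such process” error. The sketch below prints the two forms side by side. It assumes the Label key inside the plist matches the file name, which is a common convention but should be checked in the file itself; plist_label is a made-up helper name.

```shell
#!/bin/sh
# Sketch: derive the job label from a plist file name and print the two
# launchctl invocations. Assumption: the Label key in the plist equals the
# file name without its .plist suffix. plist_label is a hypothetical helper.

plist_label() {
    basename "$1" .plist
}

plist="/Library/LaunchDaemons/net.sunsource.gridengine.sgeexecd.plist"
echo "launchctl load $plist"                     # load expects a file path
echo "launchctl start $(plist_label "$plist")"   # start expects a job label
```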

  • Barry McInnes
    Posted at 13:40h, 15 March

    Thanks for the help. I got 62u5 going on the PPC server and Intel cluster.
    In 10.6.5 I had problems with the group membership problem, but solved it turning off
    the “Map GID to attribute” in AD setup using Directory Utility.
    Now after setting up on a 10.6.6 Intel Server, I am getting stopped with the following qmaster messages

    03/14/2011 18:12:00|worker|g5s2|W|job 15009.1 failed on host general before job because: 03/14/2011 18:11:59 [0:74906]: can’t set additional group id (uid=0, euid=0): the user already has too many group ids
    03/14/2011 18:12:00|worker|g5s2|W|rescheduling job 15009.1
    03/14/2011 18:12:00|worker|g5s2|E|queue quad marked QERROR as result of job 15009’s failure at host
    03/14/2011 18:31:15|worker|g5s2|W|job 15009.1 failed on host general before job because: 03/14/2011 18:31:14 [0:9083]: can’t set additional group id (uid=0, euid=0): the user already has too many group ids
    03/14/2011 18:31:15|worker|g5s2|W|rescheduling job 15009.1
    03/14/2011 18:31:15|worker|g5s2|E|queue quad marked QERROR as result of job 15009’s failure at host

    The group membership has ballooned again; whether from AD or the 10.6 update I don’t know. It currently looks like this for every user:

    [mac27:~/scripts] bmcinnes% id
    uid=2101(bmcinnes) gid=200(climate) groups=200(climate),1953027852(PSDsysadmins),829578209(PSDdomain admins),801476512(PSDlog1),204(_developer),100(_lpoperator),98(_lpadmin),81(_appserveradm),80(admin),79(_appserverusr),62(netaccounts),12(everyone),1207(rain),1100(systems),998(lmadmin),900(sawrtrs),400(cuac),2109053379(PSDdomain users),1858905114(PSDdenied rodc password replication group),1358185131(PSDit_wikis),404(,928177777(PSDcoopcall),401(,403(,402(

    Is there an easy global way of stopping all these extra groups eg. the PSD ones and the _ones ?

    thanks for the help

  • Barry McInnes
    Posted at 14:08h, 15 March

    I tried from a local user account with just 6 group memberships and got similar error messages

    03/14/2011 18:12:00|worker|g5s2|W|rescheduling job 15009.1
    03/14/2011 18:12:00|worker|g5s2|E|queue quad marked QERROR as result of job 15009’s failure at host
    03/14/2011 18:31:15|worker|g5s2|W|job 15009.1 failed on host general before job because: 03/14/2011 18:31:14 [0:9083]: can’t set additional group id (uid=0, euid=0): the user already has too many group ids
    03/14/2011 18:31:15|worker|g5s2|W|rescheduling job 15009.1
    03/14/2011 18:31:15|worker|g5s2|E|queue quad marked QERROR as result of job 15009’s failure at host

    So it looks like reducing membership is not a solution.
    Is there any patch to the scheduler to just skip this “check” ?

    thanks again.

    • blogadmin
      Posted at 21:44h, 15 March

      Hi Barry,

      I think the main problem is not the NUMBER of groups despite what the error message says. I think the root cause is the LENGTH of the string listing the group memberships. I did see this problem in the past and “solved” it only by manually pruning group memberships in the OD server. Nothing automated as it was for a cluster used by 2-3 people at the time.

      Not sure if there is a patch for this but you might want to file a bug report or check in with the new mailing list that has recently been reconstituted.

  • Barry McInnes
    Posted at 14:38h, 16 March

    Thanks for the info. I did try the Oracle GE 62u7 install, and still get the same error.
    It is in qmaster/messages even before there are jobs submitted. It then puts the nodes in error mode.
    I am trying to work out the Oracle Univa GE cluster mess. I will try that, but the last time I logged an error there was no reply beyond the “cut down the number of groups” advice in previous replies.

    thanks barry

  • gmac
    Posted at 08:57h, 13 May

    How did you prune group memberships?
    ‘dscl’ shows only 7 memberships and ‘id’ shows a lot… :


  • sandy
    Posted at 15:03h, 27 July

    In all fairness, I’m using 10.7, Lion, but I can’t get the launchctl step to work. I loaded both and then tried to start sgeqmaster, but it failed with

    $ sudo launchctl start /Library/LaunchDaemons/net.sunsource.gridengine.sgeqmaster.plist
    launchctl start error: No such process

    it is already loaded…

    $ sudo launchctl load /Library/LaunchDaemons/net.sunsource.gridengine.sgeqmaster.plist
    net.sunsource.gridengine.sgeqmaster: Already loaded

    If it could run in the shell, I’d try that… but it can’t. Anyway, the file the .plist created points to what looks like a valid spot.

    I also tried unloading them and loading them without sudo and I get the same error. In both cases launchctl reports it in the list and says it exited with error code 1

    $ launchctl list | fgrep net
    – 1 net.sunsource.gridengine.sgeqmaster
    – 1 net.sunsource.gridengine.sgeschedd

    I changed where standard out and standard error go to. My SGE_ROOT is /opt/sge/

    Standard error reports:

    error: directory doesn’t exist: /opt/sge//common

    So I created that directory, then I got

    error: fopen(“/opt/sge//common/bootstrap”) failed: No such file or directory

    Obviously, I’ve got a bigger problem. I located a file called bootstrap in the folder I created with my host name in it in SGE_ROOT (mac1) and soft linked a folder in SGE_ROOT/common to SGE_ROOT/mac1/common. Then I got somewhat further. Now the output log is still empty, but the error log reads
    read job database with 0 entries in 0 seconds
    qmaster hard descriptor limit is set to 8192
    qmaster soft descriptor limit is set to 8192
    qmaster will use max. 8172 file descriptors for communication
    qmaster will accept max. 99 dynamic event clients
    starting up SGE 6.2u5 (darwin-x86)
    FD_SETSIZE is limited to 8192 file descriptors on this system.
    If you want to support more than 8172 qmaster clients you have to
    recompile the source code with a higher FD_SETSIZE setting.
    Bug Link:
    can’t open job sequence number file “jobseqnum”: for reading: No such file or directory — guessing next number
    can’t open ar sequence number file “arseqnum”: for reading: No such file or directory — guessing next number

    What should I do now?

    • blogadmin
      Posted at 15:15h, 27 July


      We don’t have a quick answer for you right away. There is a very good chance that fundamental things have changed with OS X 10.7 “Lion” and internally here at BioTeam I don’t think any of us are running Lion on our laptops yet — we rely on a third party whole-disk-encryption (WDE) provider to secure our Mac laptops and this WDE dependency is slowing down our Lion adoption. You might want to ask the mailing list for help. As soon as we get an internal machine running Lion we’ll try ASAP to get all of the various open source SGE forks running on it and will document the process involved.


  • sandy
    Posted at 19:47h, 27 July

    I think it is actually working now doing what I described above (making the soft link). I noticed that its status is that it has not returned in a call to “sudo launchctl list” and it has a pid. I think what I did worked.

    When I did the “install_qmaster” I did pure defaults except in the certificate information where there are not defaults. Did I mess something up that I had to make the soft link?

    Anyway, I’ll report back if I manage to submit a job.

  • sandy
    Posted at 20:26h, 27 July

    I take it back. I didn’t get what you did when I ran “qconf -sconf”. My setup was also much longer. Presumably I’m looking at another version?

  • Zorzal Zilba
    Posted at 08:44h, 06 August

    Fantastic instructions and screencast. Saved me a lot of time, many thanks.

  • naring
    Posted at 07:10h, 09 April

    Hi everyone, it’s a great tutorial for setting up Grid Engine on Mac OS.
    However, I’ve got some errors. I followed your tutorial step by step, but when I typed qconf -sconf, there was an error message like this:
    “ERROR: unable to send message to qmaster using port 6444 on host “master”: got send timeout”

    what should I do?

    sh-3.2# ps ax | grep sge
    1329 s000 S+ 0:00.00 grep sge
    sh-3.2# launchctl load /Library/LaunchDaemons/net.sunsource.gridengine.sgeqmaster.plist
    sh-3.2# ps aux | grep sge
    JShong 1331 0.0 0.0 2487564 3132 ?? Ss 3:58PM 0:00.03 /Users/JShong/GE2011.11p1/source/bin/darwin-x64/sge_qmaster
    root 1336 0.0 0.0 2442000 648 s000 S+ 3:58PM 0:00.00 grep sge
    sh-3.2# qconf -sconf
    ERROR: unable to send message to qmaster using port 6444 on host “master”: got send timeout
