Posts tagged ‘SGE’

If anyone is interested, there’s a new update for Sun Grid Engine 6.2: update 4. It is almost entirely bug fixes and man page changes; the list of changes is here. The CVS source tag is V62u4_TAG (it applies to Grid Engine, ARCo and SGE Inspect). By the way, as far as I know Hedeby is still at 1.0u3.

Sun Grid Engine’s top engineer Richard Hierlmeier wrote an article (and some bash scripts that implement it; by the way, why not put them into your CVS?) about using SDM in a compute cloud: Using SDM Cloud Adapter to Manage Solaris Zones. It uses EC2 as the example; I suppose GoGrid could be used as well without too many changes.

Sun released a new version of Sun Grid Engine: 6.2 Update 3. What’s new:

Upd. Sun Studio 12 Update 1 is now available too.

As the base AMI I used ami-7db75014, an OpenSolaris image supported by Sun. General information about installing and using OpenSolaris in EC2 is available in Sun’s Amazon EC2 Getting Started Guide; in this post I will focus mostly on using SGE in Amazon EC2. As the SGE distribution I use the all-in-one tar package: I chose “All supported platforms” on the Grid Engine download page. It is about 350 MB, but I don’t have to worry about platform architecture: if Sun supports it, it is in this package. This ge62u2_1.tar.gz contains a bunch of other tar.gz files (and even Hedeby’s core package) and can be unpacked with:

root@ec2-server:~/tools/archive# gzip -dc ge62u2_1.tar.gz | tar xvpf -

So I just go inside ge6.2u2_1 and unpack them all using something like this:

for myfile in *.tar.gz
do
    gzip -dc "$myfile" | tar xvpf -
done

One important thing: hedeby-1.0u2-core.tar.gz contains old versions of some files from ge-6.2u2_1-common.tar.gz. The conflicting files are common/util/arch and common/util/arch_variables (here’s the diff for them). Maybe the old versions are useful sometimes, but with my configuration they caused very strange errors when I tried to install an execution host:

value == NULL for attribute “mailer” in configuration list of “ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com”
./inst_sge[261]: Translate: not found [No such file or directory]
./inst_sge[263]: Translate: not found [No such file or directory]
./inst_sge[264]: Translate: not found [No such file or directory]

When I replaced these files with the ones from ge-6.2u2_1-common.tar.gz, the installation worked as expected. The next point is DNS configuration: SGE is very picky about DNS, and this causes problems when running SGE on Amazon EC2 instances. It can be fixed with the host_aliases file in SGE, or alternatively with the /etc/hosts file; a similar technique is used in the Hedeby-SGE on Amazon EC2 demo. For example, with one master and 2 execution hosts I put these lines into /etc/hosts:

#internal_ip external_full_name external_short_name internal_full_name internal_short_name
10.yyy.xyz.zzz ec2-RRR-TTT-ZZZ-YYY.compute-1.amazonaws.com ec2-RRR-TTT-ZZZ-YYY domU-mm-ww-PPP-WWW-FFF-GGG.compute-1.internal domU-mm-ww-PPP-WWW-FFF-GGG
10.yyy.qwe.ttt ec2-aaa-bbb-ccc-ddd.compute-1.amazonaws.com ec2-aaa-bbb-ccc-ddd domU-mm-ww-JJJ-HHH-DDD-SSS.compute-1.internal domU-mm-ww-JJJ-HHH-DDD-SSS
10.yyy.pre.ppp ec2-yyy-rrr-eee-qqq.compute-1.amazonaws.com ec2-yyy-rrr-eee-qqq domU-mm-ww-UUU-III-OOO-PPP.compute-1.internal domU-mm-ww-UUU-III-OOO-PPP

I also run hostname ec2-RRR-TTT-ZZZ-YYY (the external_short_name) to set the instance hostname; these are the names I use as hostnames when I configure SGE.
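The column layout of those /etc/hosts lines can be sketched as a small bash helper. This is only an illustration: the function name hosts_line is made up, and it just prints a line, it does not edit /etc/hosts for you.

```shell
#!/bin/bash
# hosts_line is a hypothetical helper, not part of SGE or the EC2 tools.
# It prints one /etc/hosts line in the column order used above:
#   internal_ip external_full external_short internal_full internal_short
#   $1: internal IP, $2: external full name, $3: internal full name
hosts_line() {
    local ip="$1" ext_full="$2" int_full="$3"
    # The short name is everything before the first dot of the full name.
    printf '%s %s %s %s %s\n' \
        "$ip" "$ext_full" "${ext_full%%.*}" "$int_full" "${int_full%%.*}"
}
```

Called with an internal IP, the external full name and the internal full name of an instance, it prints one line that can be appended to /etc/hosts on every host of the grid.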

Below I try to summarize my experience with SGE and its use on various platforms (Solaris 10, Ubuntu, OpenSolaris, etc.). If you use Solaris, check out my Solaris – common questions and its differences from Linux post: maybe your problems are with Solaris rather than SGE.
So let’s go:

  • When installing SGE, after export SGE_ROOT=<my_sge_path> I tried to run util/setfileperm.sh and got a ‘can’t find script /util/arch’ error, as shown below:
    root@domU-12-31-39-03-CC-95:/opt/ge6.2u2_1# util/setfileperm.sh $SGE_ROOT
    can’t find script /util/arch
    This error can be fixed by setting the SDM_DIST environment variable:
    export SDM_DIST=$SGE_ROOT
  • I got a commlib error:
    error: commlib error: access denied (client IP resolved to host name “”. This is not identical to clients host name “”)
    ERROR: unable to contact qmaster using port 10500 on host “solaris-master.devnet.int.corp”
    Rebooting the SGE master host helps; see Sun Grid Engine : execution host can’t connect to master host with “commlib error: access denied”
  • to be continued..
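The SDM_DIST workaround from the first item can be wrapped into a small bash function. This is a sketch only: prepare_sge_env is a made-up name, and checking for util/arch first is my assumption about what is worth verifying before running util/setfileperm.sh.

```shell
#!/bin/bash
# prepare_sge_env is a hypothetical helper, not shipped with SGE.
# It stops when SGE_ROOT is unset or has no util/arch script, then
# applies the workaround above (SDM_DIST pointing at SGE_ROOT).
prepare_sge_env() {
    if [ -z "$SGE_ROOT" ]; then
        echo "SGE_ROOT is not set" >&2
        return 1
    fi
    if [ ! -f "$SGE_ROOT/util/arch" ]; then
        echo "no util/arch under $SGE_ROOT" >&2
        return 1
    fi
    # Keep an SDM_DIST that is already set; otherwise default to SGE_ROOT.
    export SDM_DIST="${SDM_DIST:-$SGE_ROOT}"
}
```

After export SGE_ROOT=<my_sge_path>, running prepare_sge_env && util/setfileperm.sh $SGE_ROOT from inside $SGE_ROOT would do both steps at once.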

I had a problem with my SGE cluster. I have a number of Solaris 10 machines running under virtualization; all servers are configured identically and have the same environment. On one machine I installed the SGE master, on the others SGE execution hosts. Some execution hosts work well, but on others I got a strange error from “install_execd”:

Checking hostname resolving
---------------------------
Cannot contact qmaster. The command failed:
./bin/sol-x86/qconf -sh
The error message was:
error: commlib error: access denied (client IP resolved to host name “”. This is not identical to clients host name “”)
ERROR: unable to contact qmaster using port 10500 on host “solaris-master.devnet.int.corp”

When I run “qconf -sh” I got :

bash-3.00# qconf -sh
error: commlib error: access denied (client IP resolved to host name “”. This is not identical to clients host name “”)
ERROR: unable to contact qmaster using port 10500 on host “solaris-master.devnet.int.corp”

I checked the connection: ping works, the hostname resolves, and a telnet connection to port 10500 works. Then I checked the connection from the master host: no problems there either. I compared the environment on the execution hosts that work with the ones that fail: it is the same, and the master host configuration has nothing suspicious-looking either. I tried to find something useful on the web, with no results: some people have the same problem, but nobody knows why it happens or how to fix it. Then I tried rebooting the execution hosts: no effect.

But when I ran “reboot” on the master host, wow, it helped! So, guys, if you get the same errors with SGE, try a “reboot” of your master host; it may help.
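The error text complains that the name the client IP resolves to differs from the client’s host name, so one more check worth doing by hand is a reverse lookup. Below is a minimal bash sketch of that check. Both function names are made up, it assumes getent is available (it is on Solaris and Linux), and it does not explain why the reboot helped in my case.

```shell
#!/bin/bash
# Both functions are hypothetical helpers for illustration only.

# same_host_name A B: succeed when two host names are equal, ignoring
# case (DNS names are case-insensitive).
same_host_name() {
    local a b
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    [ "$a" = "$b" ]
}

# check_reverse_name IP EXPECTED_NAME: ask the system resolver
# (getent hosts) for the address and compare the first name it returns.
check_reverse_name() {
    local ip="$1" expected="$2" found
    found=$(getent hosts "$ip" | awk 'NR == 1 { print $2 }')
    if same_host_name "$found" "$expected"; then
        echo "OK: $ip resolves to $found"
    else
        echo "MISMATCH: $ip resolves to '$found', expected '$expected'"
        return 1
    fi
}
```

Run check_reverse_name with the execution host’s IP and its host name on the master host, and again on the execution host itself; a MISMATCH on either side is the kind of disagreement the commlib message describes.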


One way to run a job on a specific host is to use queues: you can create a unique queue for each host in your SGE grid (using qconf -aq) and specify this queue name when submitting:

qsub -q <queue_name> $SGE_ROOT/examples/jobs/simple.sh

If you would like to deploy jobs onto the grid from an application (C or Java), SGE supports a special API: the Distributed Resource Management Application API (DRMAA). Here are some examples in C++ and Java which may help to figure this out. There are the SGE DRMAA Javadocs, the drmaa package Javadocs and the general help; the C library functions are listed in section 3. To specify the queue name, the drmaa_set_attribute function should be used as shown below:

drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION, "-q queue_name", error, DRMAA_ERROR_STRING_BUFFER - 1);

Another way to route a job onto a specific host is to specify request attributes in qsub: qsub -l <request_attr_name>; for a Java example please see below. You may also add the “soft” or “hard” resource requirements modifier (for more, see the SGE glossary: hard/soft resource requirements).

drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION, "-hard -q queue_name", error, DRMAA_ERROR_STRING_BUFFER - 1);

Here’s a listing of a DRMAA C++ example which runs a job on a specified queue. To build it you may use the simple bash script listed below; it works on Solaris 10, and on Linux I suppose it’s better to use the g++ compiler:

INC=-I$SGE_ROOT/include
LIB=-L$SGE_ROOT/lib/sol-x86/
LIB_NAME=-ldrmaa
# Libraries go after the source file so the linker can resolve its symbols.
cc $INC sge_drmaa_test_example.c -o sge_drmaa_test_example.out $LIB $LIB_NAME

If you get the error below when you run this example:

ld.so.1: sge_drmaa_test_example.out: fatal: libdrmaa.so.1.0: open failed: No such file or directory
Killed

please check the LD_LIBRARY_PATH environment variable; it should be set like this (Solaris 10 x86):

export LD_LIBRARY_PATH=$SGE_ROOT/lib/sol-x86/
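If you don’t want to hardcode sol-x86 or lose what is already in LD_LIBRARY_PATH, something like the following bash sketch may help. The function name add_sge_lib_path is made up; it relies on $SGE_ROOT/util/arch printing the architecture string, which is what that script does in my installation.

```shell
#!/bin/bash
# add_sge_lib_path is a hypothetical helper, not part of SGE.
# It asks $SGE_ROOT/util/arch for the architecture string (for example
# sol-x86) and prepends $SGE_ROOT/lib/<arch> to LD_LIBRARY_PATH,
# keeping whatever was already there.
add_sge_lib_path() {
    local arch lib_dir
    arch=$("$SGE_ROOT/util/arch") || return 1
    lib_dir="$SGE_ROOT/lib/$arch"
    case ":$LD_LIBRARY_PATH:" in
        *":$lib_dir:"*)
            ;;  # already present, nothing to do
        *)
            LD_LIBRARY_PATH="$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}
```

Calling add_sge_lib_path twice is harmless: the second call finds the directory already in the path and leaves it alone.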

The Java implementation also uses DRMAA, but it looks a little different from C++: instead of drmaa_set_attribute you call JobTemplate.setNativeSpecification:

job_template.setNativeSpecification("-hard -q " + queue_name);

Another way to run a job on a particular host is to specify the hostname as a request attribute. It looks like this:

jt.setNativeSpecification("-l hostname=dev-host1");
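The option strings used so far (a queue with -q, a host with -l hostname=, each with a hard or soft modifier) can be put together by one small bash function. This is a sketch: native_spec is a made-up name, and it only builds the string, it does not submit anything.

```shell
#!/bin/bash
# native_spec is a hypothetical helper; it only builds the option string
# that is passed to qsub or used as the DRMAA native specification.
#   $1: hard | soft    (resource requirement modifier)
#   $2: queue | host   (route by queue name or by hostname attribute)
#   $3: the queue name or the host name
native_spec() {
    local mode="$1" kind="$2" name="$3"
    case "$mode" in
        hard|soft) ;;
        *) echo "mode must be hard or soft" >&2; return 1 ;;
    esac
    case "$kind" in
        queue) echo "-$mode -q $name" ;;
        host)  echo "-$mode -l hostname=$name" ;;
        *)     echo "kind must be queue or host" >&2; return 1 ;;
    esac
}
```

For example, qsub $(native_spec soft host dev-host1) $SGE_ROOT/examples/jobs/simple.sh asks for dev-host1 as a soft request, so the job can still run elsewhere if that host is not available.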

Here’s the Java source for the SGE DRMAA example, or the Java DRMAA example archive: the zip contains the source file, an Eclipse project and compiled binaries. To create the jar you may use Eclipse export, or run this inside the bin folder:

jar cf SgeDrmaaJobRunner.jar net/bokov/sge/*.class

To run this jar (and have it run /tools/job.sh, which is already deployed on all execution hosts) on Solaris 10 I use this command:

java -cp $SGE_ROOT/lib/drmaa.jar:SgeDrmaaJobRunner.jar -Djava.library.path=$LD_LIBRARY_PATH net.bokov.sge.SgeDrmaaJobRunner soft host  not_wait  /tools/job.sh host2-dev-net

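You can also specify not just one queue name but a comma-separated list of queue names as the parameter (no space after the comma, otherwise qsub takes the second name as the script to run):

qsub -q queue_1,queue_2 $SGE_ROOT/examples/jobs/simple.sh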

At least qsub allows this syntax 🙂
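If the queue names come from a script, a tiny bash helper can build that list. A sketch only: join_queues is a made-up name, and it simply joins its arguments with commas and no spaces.

```shell
#!/bin/bash
# join_queues is a hypothetical helper: it joins its arguments with
# commas (no spaces), the form used for a list of queues after qsub -q.
join_queues() {
    # "$*" joins the arguments with the first character of IFS.
    local IFS=,
    printf '%s\n' "$*"
}
```

For example, qsub -q "$(join_queues queue_1 queue_2)" $SGE_ROOT/examples/jobs/simple.sh submits the job to whichever of the two queues is able to run it.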