Posts tagged ‘SGE’
Richard Hierlmeier, Sun Grid Engine’s top engineer, wrote an article about using SDM in a compute cloud: Using SDM Cloud Adapter to Manage Solaris Zones. It uses EC2 as the example (I suppose GoGrid could be used too without many changes) and comes with some bash scripts that implement it (by the way, why not put them into your CVS?).
Sun released a new version of Sun Grid Engine: 6.2 Update 3. What’s new:
- Amazon EC2 adapter: the SDM Cloud Service Adapter is now available to scale an SGE cluster in EC2. Instances only need OpenVPN installed and a special configuration for IP addresses (SGE is still very picky about DNS); more about the restrictions is here, and here are the bash scripts used for EC2 deployment. By the way, I see that OpenSolaris is still the only OS available for SGE on EC2. If you’re looking for other SGE/EC2 solutions, check out the Convergence project, which deals with SGE/GemFire cluster installation on EC2: see Installing SGE on EC2, or Hedeby installation on EC2.
- Only one JVM now runs on a master or managed host. In the previous version SGE ran 3 JVMs on every host: cs_vm (configuration service), executor_vm (executor component) and rp_vm (resource provider). In SGE terminology this is called SDM simple install.
- SGE now has Exclusive Scheduling, which helps to guarantee predictable performance and to avoid interference when a job is not using all of the slots available on a host.
- Sun declared that SGE now has Microsoft Vista support (I don’t think there are many SGE installations on Vista), plus the usual marketing speech about Power Saving features 🙂
upd. Sun Studio 12 Update 1 is now available too.
As the base AMI I used ami-7db75014, an OpenSolaris image supported by Sun. General information about installing and using OpenSolaris in EC2 is available in Sun’s Amazon EC2 Getting Started guide; in this post I focus on using SGE in Amazon EC2. As the SGE distribution I use the all-in-one tar package: I chose “All supported platforms” on the Grid Engine download page. It takes about 350 MB, but I don’t have to worry about platform architecture: if Sun supports it, it is in this package. This ge62u2_1.tar.gz contains a bunch of other tar.gz files (and even Hedeby’s core package) and can be unpacked with:
root@ec2-server:~/tools/archive# gzip -dc ge62u2_1.tar.gz | tar xvpf -
Then I go inside ge6.2u2_1 and unpack them all using something like this:
for myfile in *.tar.gz
do
  gzip -dc "$myfile" | tar xvpf -
done
One important thing: hedeby-1.0u2-core.tar.gz contains old versions of some files from ge-6.2u2_1-common.tar.gz. The conflicting files are common/util/arch and common/util/arch_variables (here’s a diff for them). Maybe the old versions are useful sometimes, but in my configuration they cause very strange errors when I try to install an executor host:
value == NULL for attribute "mailer" in configuration list of "ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com"
./inst_sge: Translate: not found [No such file or directory]
./inst_sge: Translate: not found [No such file or directory]
./inst_sge: Translate: not found [No such file or directory]
When I replace these files with the ones from ge-6.2u2_1-common.tar.gz, installation works as expected. The next point is DNS configuration: SGE is very picky about DNS, which causes problems when running SGE on Amazon EC2 instances. This can be fixed using the host_aliases file in SGE, or alternatively through /etc/hosts; a similar technique is used in the Hedeby-SGE on Amazon EC2 demo. For example, with one master and 2 executor hosts I put these lines into /etc/hosts:
#internal_ip external_full_name external_short_name internal_full_name internal_short_name
10.yyy.xyz.zzz ec2-RRR-TTT-ZZZ-YYY.compute-1.amazonaws.com ec2-RRR-TTT-ZZZ-YYY domU-mm-ww-PPP-WWW-FFF-GGG.compute-1.internal domU-mm-ww-PPP-WWW-FFF-GGG
10.yyy.qwe.ttt ec2-aaa-bbb-ccc-ddd.compute-1.amazonaws.com ec2-aaa-bbb-ccc-ddd domU-mm-ww-JJJ-HHH-DDD-SSS.compute-1.internal domU-mm-ww-JJJ-HHH-DDD-SSS
10.yyy.pre.ppp ec2-yyy-rrr-eee-qqq.compute-1.amazonaws.com ec2-yyy-rrr-eee-qqq domU-mm-ww-UUU-III-OOO-PPP.compute-1.internal domU-mm-ww-UUU-III-OOO-PPP
I also run hostname ec2-RRR-TTT-ZZZ-YYY (the external_short_name) to set the instance hostname; these short names are the ones I use as hostnames when I configure SGE.
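The five-column layout above is easy to get wrong by hand, so here is a minimal sketch of generating one such line. hosts_line is a hypothetical helper, not part of SGE, and the addresses in the example are made up; the short names are simply the full names cut at the first dot.

```shell
# Sketch: print one /etc/hosts line in the five-column layout
# internal_ip external_full external_short internal_full internal_short.
# hosts_line is a hypothetical helper name, not an SGE tool.
hosts_line() {
  internal_ip=$1
  external_full=$2
  internal_full=$3
  # ${name%%.*} strips everything from the first dot onward.
  printf '%s %s %s %s %s\n' \
    "$internal_ip" \
    "$external_full" "${external_full%%.*}" \
    "$internal_full" "${internal_full%%.*}"
}

# Example with made-up values; real values differ per instance.
hosts_line 10.0.0.5 \
  ec2-1-2-3-4.compute-1.amazonaws.com \
  domU-12-31-39-00-00-01.compute-1.internal
```

The output of one call per host can be appended to /etc/hosts on every machine in the cluster, so all of them resolve each other the same way.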
Below I try to summarize my experience with SGE and its use on various platforms (Solaris 10, Ubuntu, OpenSolaris, etc.). If you use Solaris, check out my Solaris – common questions and its differences from Linux; maybe your problems come from Solaris, not from SGE.
So let’s go :
- When installing SGE, after export SGE_ROOT=<my_sge_path> I try to run util/setfileperm.sh and get the ‘can’t find script /util/arch’ error, as shown below:
root@domU-12-31-39-03-CC-95:/opt/ge6.2u2_1# util/setfileperm.sh $SGE_ROOT
can’t find script /util/arch
This error can be fixed by setting the SDM_DIST environment variable.
- I got commlib error :
error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
ERROR: unable to contact qmaster using port 10500 on host "solaris-master.devnet.int.corp"
Rebooting the SGE master host helps; see Sun Grid Engine : execution host can’t connect to master host with “commlib error: access denied”
- to be continued..
I ran into a problem with my SGE cluster. I have a number of Solaris 10 machines running under virtualization; all servers are configured identically and have the same environment. On one machine I installed the SGE master, on the others SGE execution hosts. Some execution hosts work well, but on others I get a strange error from “install_execd”:
Checking hostname resolving
---------------------------
Cannot contact qmaster. The command failed:
./bin/sol-x86/qconf -sh
The error message was:
error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
ERROR: unable to contact qmaster using port 10500 on host "solaris-master.devnet.int.corp"
When I run “qconf -sh” I get:
bash-3.00# qconf -sh
error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
ERROR: unable to contact qmaster using port 10500 on host "solaris-master.devnet.int.corp"
I checked the connection: ping works, the hostname resolves, and a telnet connection to port 10500 works. Then I checked the connection from the master host: no problems there either. I compared the environment on the execution hosts that work with the ones that fail: they are the same, and the master host configuration has nothing suspicious-looking either. I tried to find something useful on the web with no results; some people have the same problem, but no one knows what happens or how to fix it. Then I tried rebooting the execution hosts: no effect.
But when I ran “reboot” on the master host, wow, it helped! So, guys, if you get the same errors with SGE, try a “reboot” on your master host; it may help.
One way to do this is to use queues: you can create a unique queue for each host in your SGE grid (using qconf -aq) and specify that queue name when submitting:
qsub -q <queue_name> $SGE_ROOT/examples/jobs/simple.sh
If you would like to submit jobs to the grid from an application (C or Java), SGE supports a special API, the Distributed Resource Management Application API (DRMAA); here are some examples in C++ and Java which may help to figure it out. There are the SGE DRMAA Javadocs, the drmaa package JavaDocs and the common help, where the C library functions are listed in section 3. To specify the queue name, use the drmaa_set_attribute function as shown below:
drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION, "-q queue_name", error, DRMAA_ERROR_STRING_BUFFER - 1);
Another way to route a job onto a specific host is to specify request attributes in qsub: qsub -l <request_attr_name>; for a Java example please see below. You can also add a “soft” or “hard” resource requirements modifier (for more see SGE glossary: hard/soft resource requirements):
drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION, "-hard -q queue_name", error, DRMAA_ERROR_STRING_BUFFER - 1);
Here’s a listing of the DRMAA C++ example which runs a job on a specified queue. To build it you can use the simple bash script listed below; it works on Solaris 10, for Linux I suppose it’s better to use the g++ compiler:
INC=-I$SGE_ROOT/include
LIB=-L$SGE_ROOT/lib/sol-x86/
LIB_NAME=-ldrmaa
cc $INC $LIB $LIB_NAME sge_drmaa_test_example.c -o sge_drmaa_test_example.out
If you get the error below when you run this example
ld.so.1: sge_drmaa_test_example.out: fatal: libdrmaa.so.1.0: open failed: No such file or directory
please check the LD_LIBRARY_PATH environment variable; it should include the directory that contains libdrmaa.so (the build script above links against $SGE_ROOT/lib/sol-x86/ on Solaris 10 x86).
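A minimal sketch of setting the variable, assuming the library directory is the same $SGE_ROOT/lib/sol-x86 that the build script passes to -L. prepend_lib_path is a hypothetical helper, and /opt/sge is only a placeholder used when SGE_ROOT is not set.

```shell
# Sketch: put a directory at the front of LD_LIBRARY_PATH, once.
# prepend_lib_path is a hypothetical helper name.
prepend_lib_path() {
  _dir=$1
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$_dir:"*)
      ;;  # already listed, leave the variable unchanged
    *)
      LD_LIBRARY_PATH="$_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
      ;;
  esac
  export LD_LIBRARY_PATH
}

# Assumption: sol-x86 is the arch directory (Solaris 10 x86);
# /opt/sge is a placeholder for an unset SGE_ROOT.
prepend_lib_path "${SGE_ROOT:-/opt/sge}/lib/sol-x86"
echo "$LD_LIBRARY_PATH"
```

On other platforms the arch directory name differs, so check what actually exists under $SGE_ROOT/lib before copying the path.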
In Java, the same native specification is set on the job template:
job_template.setNativeSpecification("-hard -q " + queue_name);
Another way to run a job on the needed host is to specify the hostname as a request attribute.
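A minimal sketch of such a request from the command line, assuming the cluster defines the standard hostname resource attribute. host_request_args is a hypothetical helper; host2-dev-net and /tools/job.sh are the names used in the Java example of this post. The command is only printed, so the sketch runs without an SGE installation.

```shell
# Sketch: build the qsub arguments that request a specific host.
# Assumption: the "hostname" resource attribute exists on the cluster.
# host_request_args is a hypothetical helper name.
host_request_args() {
  _mode=$1   # "hard" = must run there, "soft" = preference only
  _host=$2
  printf '%s\n' "-$_mode -l hostname=$_host"
}

# Print the command instead of running it.
echo "qsub $(host_request_args hard host2-dev-net) /tools/job.sh"
```

With a hard request the job waits until that host is free; with a soft request the scheduler may place it elsewhere.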
Here’s the Java source for the SGE DRMAA example, or the Java DRMAA example archive: the zip contains the source file, the Eclipse project and compiled binaries. To create the jar you can use Eclipse export or run this inside the bin folder:
jar cf SgeDrmaaJobRunner.jar net/bokov/sge/*.class
To run this jar (which runs /tools/job.sh, already deployed on all executors) on Solaris 10 I use this command:
java -cp $SGE_ROOT/lib/drmaa.jar:SgeDrmaaJobRunner.jar -Djava.library.path=$LD_LIBRARY_PATH net.bokov.sge.SgeDrmaaJobRunner soft host not_wait /tools/job.sh host2-dev-net
You can also specify not just one queue name but a comma-separated list of queue names as the parameter:
qsub -q queue_1,queue_2 $SGE_ROOT/examples/jobs/simple.sh
At least qsub allows this syntax 🙂
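When the queue names come from a script, the list has to be joined with commas and no spaces, otherwise the shell splits it into separate arguments. A minimal sketch follows; join_queues is a hypothetical helper, and the command is only printed rather than executed.

```shell
# Sketch: join queue names into one comma-separated argument for qsub -q.
# join_queues is a hypothetical helper name. The function body runs in a
# subshell, so the IFS change does not leak into the caller.
join_queues() (
  IFS=,
  printf '%s\n' "$*"
)

# Print the resulting command instead of running it.
echo "qsub -q $(join_queues queue_1 queue_2) \$SGE_ROOT/examples/jobs/simple.sh"
```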