[[TracNav(NuwaNav)]]
[[PageOutline]]

= NuWa Slave : automated build/test setup =

Running a slave provides :
  * an automatically updated and tested dybinst-allation
  * web interface to the status of the installation including history of build/test status


= Slave Status (9 Sep 2010) =


  || '''location''' || '''responsible'''   ||     '''host'''                ||     '''status'''                                    || 
  ||  NUU           || Simon               ||  belle7.nuu.edu.tw            ||  nearly continuous operation for several years      || 
  ||  NTU           || Simon               ||  cms01.phys.ntu.edu.tw        ||  nearly continuous operation for several years      ||
  ||  BNL           || Jiajie              ||  daya0001.rcf.bnl.gov         ||  stable running for ~2 weeks                        ||   
  ||  IHEP          || Miao/Qiumei         ||  lxslc\d\d.ihep.ac.cn         ||  debugging runs by Miao, some issues: svn version?  ||
  ||  LBNL          || ?Cheng-Ju           ||  pdsf\d.nersc.gov             ||  initial setup, master config done                  || 
  ||  Wisconsin     || ?Wei                ||   ?                           ||  ?                                                  || 
  ||  Shandong      || ?                   ||   ?                           ||  ?                                                  ||  
  ||  Dayabay       || Miao/Qiumei         ||   ?                           ||  ?                                                  ||
  ||  Caltech       ||  ?Dan               ||   ?                           ||  ?                                                  || 


The general build status, and that of the '''dybinst''' configuration, are available at :
  * http://dayabay.ihep.ac.cn/tracs/dybsvn/build
  * http://dayabay.ihep.ac.cn/tracs/dybsvn/build/dybinst


= How to setup a slave =

== Pre-requisites : python 2.5?, setuptools, bitten ( 0.6dev-r561 ) ==

Although bitten is installed by dybinst into nuwa python as part of the nosebit external, 
it is more logical to install this into your system python as the slave 
can then perform ''green-field'' dybinst builds without recourse to existing dybinst-allations. 
{{{
svn checkout http://svn.edgewall.org/repos/bitten/branches/experimental/trac-0.11@561 bitn
cd bitn
python setup.py develop       ## probably with sudo
}}}
  * more recent revisions of bitten have incompatibilities with the trac 0.11 master 


== Interactive Test Running of the slave ==

  * Verify that '''bitten-slave''' is installed and in your PATH and is the expected ''standard'' version 
{{{
[blyth@belle7 ~]$ which bitten-slave
/usr/bin/bitten-slave
[blyth@belle7 ~]$ bitten-slave --version
bitten-slave 0.6dev-r561
}}}

  * export dybinst into the directory to be used for slave builds (you could also use an existing dybinst-allation) 
  * interactive test run of the slave
{{{
./dybinst trunk slave 
}}}
     * this should fail complaining of lack of config in your {{{$HOME/.dybinstrc}}}

  * add or create {{{$HOME/.dybinstrc}}} containing connection credentials
{{{
slv_buildsurl=http://dayabay.ihep.ac.cn/tracs/dybsvn/builds
slv_username=slave
slv_password=***
slv_loghost=http://your.address       ## if you are able to publish logfiles 
}}}
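
Before starting the slave it can help to confirm that the file defines the expected keys. The following is a minimal sketch in Python 3 and is not part of dybinst : the names {{{parse_dybinstrc}}} and {{{missing_slave_keys}}} are hypothetical, the file is assumed to hold plain {{{key=value}}} lines with {{{##}}} comments as in the example above, and {{{slv_loghost}}} is assumed to be optional.
{{{
#!python
import os
import re

# Assumption: these three keys are mandatory, slv_loghost is optional.
REQUIRED_KEYS = ("slv_buildsurl", "slv_username", "slv_password")


def parse_dybinstrc(text):
    """Return a dict of key=value settings, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # Drop a trailing comment introduced by whitespace followed by '#'.
        value = re.split(r"\s+#", value, maxsplit=1)[0].strip()
        settings[key.strip()] = value
    return settings


def missing_slave_keys(settings):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


if __name__ == "__main__":
    path = os.path.expanduser("~/.dybinstrc")
    if os.path.exists(path):
        with open(path) as rc:
            missing = missing_slave_keys(parse_dybinstrc(rc.read()))
        print("missing keys: %s" % (", ".join(missing) or "none"))
    else:
        print("%s not found" % path)
}}}
The sketch only reports which keys are missing, it never prints the values, so the password stays out of any logs.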


If your credentials are correct the expected startup messages are :
{{{
[blyth@cms01 trunk]$ ./dybinst trunk slave
Updating existing installation directory installation/trunk/dybinst.
Updating existing installation directory installation/trunk/dybtest.


Mon Aug  9 16:12:04 CST 2010
Start Logging to /data/env/local/dyb/trunk/dybinst-20100809-161204.log (or dybinst-recent.log)


Starting dybinst commands: slave

Stage: "slave"... 


dybinst-slave invoking : /data/env/local/dyb/trunk/installation/trunk/dybinst/scripts/slave.sh trunk

Contacting the master instance, this will take a while.  Go get muffins...

=== slv-main : derive config /home/blyth/.bitten-slave/dybslv.cfg from source /home/blyth/.dybinstrc
[INFO    ] Setting socket.defaulttimeout : 15.0 
[INFO    ] Setting socket.defaulttimeout : 15.0 
[DEBUG   ] Sending POST request to 'http://dayabay.ihep.ac.cn/tracs/dybsvn/builds'
[INFO    ] No pending builds
}}} 


Note that the slave asked the master whether there are any builds to do and got the reply '''No pending builds'''. With the default config the slave 
asks the master every 5 mins if there is anything to do.


In order for the master to instruct the slave to perform builds you must send the '''hostname''' to Simon :
{{{
[blyth@belle7 ~]$ hostname
belle7.nuu.edu.tw
}}}
who will then add the slave to the master through the Trac Admin web interface.



== Running the slave continuously ==

Supervisord is recommended to keep the slave running :
  * http://supervisord.org/

Install supervisord into your system python with easy_install or pip :
{{{
easy_install supervisor
}}}

For tips on using supervisord, see :
  * http://dayabay.phys.ntu.edu.tw/tracs/env/browser/trunk/base/sv.bash
     * ( includes functions to setup redhat init.d scripts that restart supervisord and all its children when your machine is rebooted ) 

An example of the supervisord config used to keep the dybslv running :
{{{
[program:dybslv]
environment=HOME=/home/blyth,BITTEN_SLAVE=/usr/bin/bitten-slave,SLAVE_OPTS=--verbose
directory=/data1/env/local/dyb
command=/data1/env/local/dyb/dybinst -l dybinst-slave.log trunk slave
redirect_stderr=true
redirect_stdout=true
autostart=true
autorestart=true
priority=999
user=blyth
}}}


== Refreshing the slave build ==

For reasons of efficiency the slave build (which can be performed multiple times each day) is done as an update build. 
Certain types of commits are known to be likely to cause issues with update builds, including :
  * changes to DataModel classes
 
In order to freshen up the build you can try rebuilding after removing various directories, in progressively increasing levels of cleanliness :
  * {{{rm -rf NuWa-trunk/dybgaudi/DybRelease/$CMTCONFIG}}}
  * {{{rm -rf NuWa-trunk/dybgaudi/InstallArea}}}
  * {{{rm -rf NuWa-trunk/dybgaudi/*  ; svn up NuWa-trunk/dybgaudi}}}
  
To trigger a slave build after the removal, invalidate the last build on the node in question using the web interface (BUILD_ADMIN privilege required) 

== Monitoring the slave node ==

After many failures on a slave it is wise to check the running processes with {{{ps aux}}} : many tens of stuck nuwa.py processes can accumulate and kill your node. 
List them and then clean up with 
{{{pgrep -f nuwa.py ; pkill -f nuwa.py}}}
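
The same check can be scripted so that a pile-up is noticed before the node suffers. This is a minimal sketch in Python 3 under stated assumptions : the function names and the limit of 10 processes are illustrative choices, and the input is assumed to be the text printed by {{{ps -eo pid,args}}}.
{{{
#!python
import subprocess


def find_pids(ps_text, needle="nuwa.py"):
    """Return the PIDs whose command line contains needle."""
    pids = []
    for line in ps_text.splitlines():
        fields = line.split(None, 1)
        # The header line is skipped because its first field is not a number.
        if len(fields) == 2 and fields[0].isdigit() and needle in fields[1]:
            pids.append(int(fields[0]))
    return pids


def too_many(pids, limit=10):
    """True when more than limit matching processes exist (limit is arbitrary)."""
    return len(pids) > limit


if __name__ == "__main__":
    try:
        result = subprocess.run(["ps", "-eo", "pid,args"], stdout=subprocess.PIPE,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError) as err:
        print("could not run ps : %s" % err)
    else:
        pids = find_pids(result.stdout)
        state = "too many, consider pkill" if too_many(pids) else "ok"
        print("%d nuwa.py processes : %s" % (len(pids), state))
}}}
The sketch only reports, leaving the decision to kill processes to the operator.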





= Getting the slave to do periodic builds =

To zeroth order only a few steps are needed to convert a
standard update-build bitten slave into a periodic (daily/weekly) builder.

== Develop/Debug the cron commandline ==
  
Starting point ...  interactive trials with :
{{{
SLAVE_OPTS="--single --dry-run" ./dybinst -b singleshot_\\\${revision} -l /dev/stdout  trunk slave 
}}}

   ||  '''dybinst'''   options             ||                                                                  ||    
   ||  '''-l /dev/stdout'''                || send logging to stdout, for debugging                            ||
   ||  '''-b  singleshot_\\\${revision}''' || option propagated to bitten-slave '''--build-dir'''              || 
   ||                                      ||   (variables evaluated in build context supplied by the master)  ||  

The '''SLAVE_OPTS''' are incorporated into the bitten-slave commandline, 
  * '''--dry-run''' is for debugging only : builds are performed but not reported to the master.
  * '''--single'''  perform a single build before exiting


While debugging, increase verbosity by adding this line to {{{~/.dybinstrc}}} :
{{{
slv_verbose=yes
}}}


=== Issues Foreseen / Things TODO ===

  * may need more escaping '''\\\${revision}''' of the '''build-dir'''
  * the cron command might not get a build to perform within the period (if no qualifying commits), 
    * process pile-up will occur ...   
       * maybe avoid by exiting if existing slave process ? 
       * perhaps add a first '''step''' that checks 

  * will need some purging to avoid filling the disk with builds
    * could add a build step to do this cleanup  

  * failed builds need to be marked as such in the file system as well as in the web interface  
    * add a final build step that checks status and takes action for failures ...
        * renaming of build directories   
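
Two of the items above, avoiding process pile-up and purging old builds, could be sketched as follows. This is an illustration in Python 3 and not an existing dybinst or bitten feature : the function names, the lock file, and the choice to keep the 3 highest revisions are assumptions, as is the directory naming {{{singleshot_<revision>}}} taken from the '''-b''' option shown earlier.
{{{
#!python
import fcntl
import os
import re
import shutil

# Assumption: build dirs are named as by "-b singleshot_${revision}".
BUILD_DIR_RE = re.compile(r"^singleshot_(\d+)$")


def acquire_lock(path):
    """Return an open, locked file, or None if the lock is already held."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle  # keep this object alive for as long as the slave runs


def purge_old_builds(work_dir, keep=3):
    """Remove all but the keep highest-revision build dirs; return removed names."""
    found = []
    for name in os.listdir(work_dir):
        match = BUILD_DIR_RE.match(name)
        if match and os.path.isdir(os.path.join(work_dir, name)):
            found.append((int(match.group(1)), name))
    found.sort()
    doomed = [name for _, name in found[:max(len(found) - keep, 0)]]
    for name in doomed:
        shutil.rmtree(os.path.join(work_dir, name))
    return doomed
}}}
A cron wrapper would call {{{acquire_lock}}} first and exit at once when it returns {{{None}}}, so that an earlier slave still waiting for a build is not duplicated. The lock is released automatically when the process exits.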


=== Understanding how {{{./dybinst trunk slave}}} works ===

'''dybinst''' invokes the scripts below, which construct and evaluate the bitten-slave commandline that talks to the master and performs builds :
  * source:installation/trunk/dybinst/scripts/dybinst-slave
  * source:installation/trunk/dybinst/scripts/slave.sh

=== bitten-slave options ===

{{{
[blyth@belle7 dyb]$ bitten-slave --help
Usage: bitten-slave [options] url

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --name=NAME           name of this slave (defaults to host name)
  -f FILE, --config=FILE
                        path to configuration file
  -u USERNAME, --user=USERNAME
                        the username to use for authentication
  -p PASSWORD, --password=PASSWORD
                        the password to use when authenticating

  building:
    -d DIR, --work-dir=DIR
                        working directory for builds
    --build-dir=BUILD_DIR
                        name pattern for the build dir to use inside the
                        working dir ["build_${config}_${build}"]
    -k, --keep-files    don't delete files after builds
    -s, --single        exit after completing a single build
    -n, --dry-run       don't report results back to master
    -i SECONDS, --interval=SECONDS
                        time to wait between requesting builds

  logging:
    -l FILENAME, --log=FILENAME
                        write log messages to FILENAME
    -v, --verbose       print as much as possible
    -q, --quiet         print as little as possible
    --dump-reports      whether report data should be printed

}}}




= What happens when builds/tests fail ? =


Failures result in notification emails and an entry on the timeline. Following the
link in the email gets you to the build status page, such as :
   * http://dayabay.ihep.ac.cn/tracs/dybsvn/build/dybinst/3800 

Examining the error reporting there and on the summary page 
   * http://dayabay.ihep.ac.cn/tracs/dybsvn/build/dybinst

will tell you which '''step''' of the build/tests failed.  

You can confirm the error by running pkg tests via dybinst, eg for '''rootiotest'''
{{{
./dybinst trunk tests rootiotest  
}}}
and investigate further by getting into the environment and directory of the pkg running the tests
{{{
nosetests -v
}}}


== Causes of test failure ==

Non-''Run'' tests can fail by 
  * an assertion/exception in the test being triggered 
 
''Run-style'' tests have many additional ways to fail...  
  * stdout + stderr from command matches a pattern with integer code > 0
  * time taken by the command exceeds the limit
  * command returns with non-zero exit code 
  * memory(maxrss) taken by the command exceeds limit 
  * for '''{{{reference=True}}}''' tests, the output does not match the reference
  * for '''{{{histref=path/to/hists.root}}}''' tests, any of the created histograms does not match the reference '''{{{path/to/histref_hists.root}}}'''
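
To make the first three ''Run-style'' criteria concrete, here is a minimal sketch in Python 3 of how such checks can be applied to a command. It is not the dybtest implementation : the function {{{run_check}}}, its arguments and the form of the pattern table (regular expression mapped to an integer code) are assumptions, and the memory, reference and histogram checks are left out.
{{{
#!python
import re
import subprocess
import sys


def run_check(argv, timeout=60.0, patterns=None):
    """Run argv and return a list of failure reasons (empty list means pass).

    patterns maps a regular expression to an integer code : a match with
    code > 0 counts as a failure, a code of 0 is ignored.
    """
    patterns = patterns or {}
    try:
        # stderr is merged into stdout so patterns see both streams
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return ["time limit of %s s exceeded" % timeout]
    failures = []
    if proc.returncode != 0:
        failures.append("exit code %d" % proc.returncode)
    for pattern, code in patterns.items():
        if code > 0 and re.search(pattern, proc.stdout):
            failures.append("output matched %r (code %d)" % (pattern, code))
    return failures


if __name__ == "__main__":
    demo = [sys.executable, "-c", "print('all fine')"]
    print(run_check(demo, timeout=10.0, patterns={"FATAL": 2}))
}}}
A single run can fail for several reasons at once, which is why the sketch collects a list rather than stopping at the first problem.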
 
== Updating reference output/histograms ==
 
To update reference outputs or histograms :
  * simply delete the old one : a new reference will be created at the next run, and subsequent runs will compare against the new reference
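
The create-or-compare behaviour described above can be pictured with the following minimal sketch in Python 3. It illustrates the logic only and is not the dybtest code : the function name {{{check_reference}}} and its return values are hypothetical.
{{{
#!python
import os


def check_reference(output, ref_path):
    """Create the reference if absent, otherwise compare against it.

    Returns (status, matches) where status is "created" or "compared".
    """
    if not os.path.exists(ref_path):
        with open(ref_path, "w") as ref:
            ref.write(output)
        return "created", True
    with open(ref_path) as ref:
        return "compared", ref.read() == output
}}}
Deleting the {{{.ref}}} file therefore turns the next run into a run that records a new reference, so the output of that run should be inspected before it is trusted.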

Find '''test_name.ref''' and '''histref_*.root''' by :
{{{
[blyth@cms01 ~]$ cd $DYB/NuWa-trunk/dybgaudi
[blyth@cms01 dybgaudi]$ find . -name '*.ref'

./Simulation/GenTools/test_diffuser.ref
./Simulation/GenTools/test_gun.ref
./Simulation/DetSim/test_historian.ref
./Simulation/DetSim/test_basic_physics.ref
./DataModel/Conventions/test_Conventions.ref
./Production/MDC10b/test_dby0.ref
./RootIO/RootIOTest/test_dybo.ref
./RawData/RawDataTest/share/rawpython.log.ref
./DybAlg/test_dmp.ref
./Tutorial/Quickstart/test_printrawdata_output.ref
./Database/DbiTest/scripts/TestDbiIhep.log.ref
./Database/DbiValidate/tests/test_Conventions.ref

[blyth@cms01 dybgaudi]$ find . -name 'histref_*.root'
./Production/MDC10b/histref_dby1test.root
./Tutorial/Quickstart/histref_rawDataResult.root
[blyth@cms01 dybgaudi]$

}}}
 
 
== Investigating Issues ==

The primary duty is to isolate the cause and report the problem to the author/responsible in the form of a Trac ticket that 
enables the investigator to rapidly reproduce the issue.

While investigating remember to stop the slave to avoid interference and resource competition from additional builds starting ...
eg if using supervisord : 
{{{
[blyth@cms01 dybgaudi]$ supervisorctl
dybslv                           RUNNING    pid 28651, uptime 1 day, 22:27:01
C> stop dybslv
dybslv: stopped
}}}

=== attach to python nuwa.py process with gdb ===

Start the failing test :
{{{
[blyth@cms01 MDC10b]$ nosetests tests/test_mdc10b.py:test_dby0
Warning in <TEnvRec::ChangeValue>: duplicate entry <Library.vector<short>=vector.dll> for level 0; ignored
Run MDC10b.runLED_Muon.FullChain with double-pulsing of LEDs and no muons to produces 50 readouts ...
}}}
 
Attach gdb to the process and continue with '''c''' :
{{{
[blyth@cms01 dybgaudi]$ gdb `which python` $(pgrep -f $(which nuwa.py))
...
Loaded symbols for /data/env/local/dyb/trunk/NuWa-trunk/dybgaudi/InstallArea/i686-slc4-gcc34-dbg/lib/libG4DataHelpers.so
0xb6687b23 in ParticlePropertySvc::anti (this=0xaa28798, pp=0xaa66a98) at ../src/ParticlePropertySvc/ParticlePropertySvc.cpp:445
445         const ParticleProperty* ap = *it ;
(gdb)
}}}

Unfortunately with this approach gdb sometimes gets '''Killed''' for being '''Out of Memory'''.  

=== running the command under gdb ===

Grab the command from the source of the test (if simple) or from the process table :
{{{
ps --no-headers -o command  -p $(pgrep -f $(which nuwa.py)) > cmd
}}}

Edit the cmd file : fix up any missing quotes and prefix the line with the gdb command '''set args'''

This allows :
{{{
[blyth@cms01 dybgaudi]$ gdb `which python` -x cmd
GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh)
Copyright 2004 Free Software Foundation, Inc.
...
}}}


Capture the backtrace with '''bt''' when you meet problems :
{{{
ElecSimProc                           INFO Processing hit collections
ToolSvc.EsIdealFeeTool                INFO Processing 73 pmt pulses.
ToolSvc.TsMultTriggerTool             INFO Max multiplicity for DayaBayAD1 is 44
*** glibc detected *** malloc(): memory corruption: 0x0fe95d10 ***

Program received signal SIGABRT, Aborted.
[Switching to Thread -1208318272 (LWP 17858)]
0x00a1e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb)

(gdb) bt
#0  0x00a1e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00a5f915 in raise () from /lib/tls/libc.so.6
#2  0x00a61379 in abort () from /lib/tls/libc.so.6
#3  0x00a93e1a in __libc_message () from /lib/tls/libc.so.6
#4  0x00a9b473 in _int_malloc () from /lib/tls/libc.so.6
#5  0x00a9d0f1 in malloc () from /lib/tls/libc.so.6
#6  0x04fa911e in operator new () from /usr/lib/libstdc++.so.6
#7  0x032762ca in __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<char const* const, DybDaq::FeeTraits*> > >::allocate (this=0x32798c4, __n=1) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/ext/new_allocator.h:81
#8  0x03276232 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::_M_get_node (this=0x32798c4) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:356
#9  0x03276159 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::_M_create_node (this=0x32798c4, __x=@0xbfe81c88) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:365
#10 0x03275ce5 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::_M_insert (this=0x32798c4, __x=0x0, __p=0xfe95b88, __v=@0xbfe81c88) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:809
#11 0x03275ac9 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::insert_unique (this=0x32798c4, __v=@0xbfe81c88) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:929
#12 0x0327583f in std::map<char const*, DybDaq::FeeTraits*, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::insert (this=0x32798c4, __x=@0xbfe81c88)
    at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_map.h:360
#13 0x032755cf in DybDaq::FeeTraits::defaultTraits () at ../src/FeeTraits.cc:52
#14 0xb5880e3c in DayaBay::DaqReadoutPmtCrate::channel (this=0xfe97a80, channelId=@0xbfe81dc0) at ../src/DaqReadoutPmtCrate.cc:170
#15 0xb5884bd5 in DayaBay::ReadoutPmtCrate::daqReadout (this=0xfe97780, run=0, event=0) at ../src/ReadoutPmtCrate.cc:77
#16 0xaeb14500 in SingleLoader::execute (this=0xab6ec28) at ../src/SingleLoader.cc:112
#17 0x03f95d2c in Algorithm::sysExecute (this=0xab6ec28) at ../src/Lib/Algorithm.cpp:558
#18 0xaeb1f6fc in DybAlgorithm<DayaBay::ReadoutHeader>::sysExecute (this=0xab6ec28) at /data/env/local/dyb/trunk/NuWa-trunk/dybgaudi/InstallArea/include/DybAlg/DybAlgorithmImp.h:59
#19 0x01825d45 in GaudiSequencer::execute (this=0xab6bc00) at ../src/lib/GaudiSequencer.cpp:100
#20 0xb58d3823 in Stage::nextElement (this=0xab6ae78, pIStgData=@0xbfe8248c, erase=true) at ../src/Stage.cc:48
#21 0xb58c0a4e in Sim15::execute (this=0xaae7608) at ../src/Sim15.cc:121
Killed
}}}


Report findings in Trac tickets such as #565




=== why are my added tests not running ? ===
 
As a precaution nosetests does not run tests from executable modules unless you do : {{{nosetests  --exe}}}
OR explicitly specify the path {{{nosetests tests/test_mdc10bfadc.py}}}. Thus you can use {{{chmod ugo-x}}} or {{{chmod ugo+x}}}
as a simple way to swap in/out modules of tests from the standard package tests.
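
To see at a glance which modules are currently swapped out in this way, a directory can be scanned for test modules that carry an execute bit. This is a minimal sketch in Python 3 : the function name is hypothetical, the {{{test*.py}}} name filter is a simplification of the matching that nosetests performs, and treating any of the user/group/other execute bits as ''executable'' is an assumption about how nosetests decides.
{{{
#!python
import os
import stat

ANY_EXEC = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def executable_test_modules(directory):
    """Return sorted names of test*.py files that have an execute bit set."""
    names = []
    for name in os.listdir(directory):
        if name.startswith("test") and name.endswith(".py"):
            mode = os.stat(os.path.join(directory, name)).st_mode
            if mode & ANY_EXEC:
                names.append(name)
    return sorted(names)


if __name__ == "__main__":
    for name in executable_test_modules("."):
        print("skipped by default (executable) : %s" % name)
}}}
Run it from the {{{tests}}} directory of a package; any module it lists is not picked up by a plain {{{nosetests}}}.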



 

= ''dybbin'' Approach Under Discussion =


  * '''dybbin''' allows a NuWa installation to be copied to another location

== email to David, Jiajie ==

{{{
 I like the elimination of a separate build in this dybbin
 approach however I see several problems with usage of update builds
 and externally imposed cron dybbin,  my suggested solutions :

  * Update build
       Are promoting usage of an update build,
       which is prone to breakage on data model changes and is generally
       the subject of negative prejudice.

       Can eliminate this issue by making the dybinst command used by the slave
          "./dybinst -c trunk project dybgaudi"
       do a refresh when data model changes are detected.

 Enforcing external cron cut-off times introduces issues (and leads to
 complicated workarounds) :

   * A build can be in-progress at cron dybbin time or the last build
     can be broken ... meaning there is no valid build to copy.

   * Uncoordinated between sites, the prevalent build at dybbin time
     may differ between sites according to speed/breakage differences between slaves.

 To avoid these issues (and complicated workarounds),
 I suggest not to impose an external cron job but rather to add a
 final step to the dybinst slave build that (if the tests succeeded)
 performs a dybbin copy "installation" into a directory named after the revision.
 And performs purging to remove excess revision directories.

One possible problem with this scheme is that our 'daily' might not match the
'daily' elsewhere, but I think that could be handled, perhaps by having the master
define the revision number corresponding to the current 'daily'.


 Concerning coordination between slaves (although I have implemented
 such a system in the ad-hoc daily.bash) I think supporting and explaining
 complex procedures for this is not worth the effort/confusion.

 Simply referring to builds by their real name : '''the revision number'''
 and not the date eliminates the issue and benefits from direct
 correspondence between the state of builds reported in the web interface
 (including the timeline).

 Users will want to use a build after a particular
 feature has been added (or bug removed) ... thus they should
 know the minimum revision they want.
 They can then look for a corresponding build that the slave on
 their cluster succeeded to build at :
    http://dayabay.ihep.ac.cn/tracs/dybsvn/build/dybinst

Yours, Simon
}}}