| Version 21 (modified by blyth, 15 years ago) | 
|---|
NuWa Slave : automated build/test setup
Running a slave provides :
- automatically updated and tested dybinst'allation
- web interface to the status of the installation including history of build/test status
Slave Status (9 Sep 2010)
| location | responsible | host | supervisord | status |
|---|---|---|---|---|
| NUU | Simon | belle7.nuu.edu.tw | Y | nearly continuous operation for several years |
| NTU | Simon | cms01.phys.ntu.edu.tw | Y | nearly continuous operation for several years |
| BNL | Jiajie | daya0001.rcf.bnl.gov | Y | stable running for ~2 weeks |
| IHEP | Miao/Qiumei | lxslc\d\d.ihep.ac.cn | ? | debugging runs by Miao, some issues: svn version?, lustre flocking? |
| LBNL | ?Cheng-Ju | pdsf\d.nersc.gov | ? | initial setup, master config done |
| Wisconsin | ?Wei | ? | ? | |
| Shandong | ? | ? | ? | |
| Dayabay | Miao/Qiumei | ? | ? | |
| Caltech | ?Dan | ? | ? | |
General Build status and that of dybinst configurations are available at
CMTCONFIG for operational slaves (for the opt ones just swap dbg for opt) :
| host | CMTCONFIG |
|---|---|
| belle7 | i686-slc5-gcc41-dbg |
| cms01 | i686-slc4-gcc34-dbg |
How to setup a slave
Pre-requisites : python 2.5?, setuptools, bitten ( 0.6dev-r561 )
Although bitten is installed by dybinst into nuwa python as part of the nosebit external, it is more logical to install this into your system python as the slave can then perform green-field dybinst builds without recourse to existing dybinst-allations.
svn checkout http://svn.edgewall.org/repos/bitten/branches/experimental/trac-0.11@561 bitn
cd bitn
python setup.py develop    ## probably with sudo
- more recent revisions of bitten have incompatibilities with the trac 0.11 master
alternative install with patching of the slave for secure running
See #580 for background.
svn checkout http://svn.edgewall.org/repos/bitten/branches/experimental/trac-0.11@561 bitn    ## you may need to accept a certificate
cd bitn
svn export http://dayabay.phys.ntu.edu.tw/repos/env/trunk/trac/patch/bitten/bitten-trac-0.11-561.patch
patch -p0 < bitten-trac-0.11-561.patch
python setup.py develop    ## probably with sudo
To configure secure running, set the following in your ~/.dybinstrc, then stop and restart the slave to test :
slv_secure=yes
Interactive Test Running of the slave
- Verify that bitten-slave is installed and in your PATH and is the expected standard version 
[blyth@belle7 ~]$ which bitten-slave
/usr/bin/bitten-slave
[blyth@belle7 ~]$ bitten-slave --version
bitten-slave 0.6dev-r561
- export dybinst into directory to be used for slave builds (you could use an existing dybinst-allation also)
- interactive test run of the slave
./dybinst trunk slave
- this should fail, complaining of the lack of config in your $HOME/.dybinstrc
- add or create $HOME/.dybinstrc containing connection credentials
slv_buildsurl=http://dayabay.ihep.ac.cn/tracs/dybsvn/builds
slv_username=slave
slv_password=***
slv_loghost=http://your.address    ## if you are able to publish logfiles
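Before starting the slave it can help to confirm that the connection settings are all present. The following is a minimal sketch only, not part of dybinst: it assumes the file holds plain key=value lines with optional # comments, and the helper names parse_envfile and missing_keys are hypothetical.

```python
import re

# Keys taken from the example above; slv_loghost is optional there.
REQUIRED_KEYS = ("slv_buildsurl", "slv_username", "slv_password")


def parse_envfile(text):
    """Parse simple key=value lines, dropping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        # Strip a comment that starts the line or follows whitespace.
        line = re.sub(r"(^|\s)#.*$", "", raw).strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]
```

To use it, read ~/.dybinstrc into a string, pass it to parse_envfile, and print the result of missing_keys. An empty list means the three connection settings are present; it does not say whether the credentials are accepted by the master.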
If your credentials are correct the expected startup messages are :
[blyth@cms01 trunk]$ ./dybinst trunk slave
Updating existing installation directory installation/trunk/dybinst.
Updating existing installation directory installation/trunk/dybtest.
Mon Aug 9 16:12:04 CST 2010
Start Logging to /data/env/local/dyb/trunk/dybinst-20100809-161204.log (or dybinst-recent.log)
Starting dybinst commands: slave
Stage: "slave"...
dybinst-slave invoking : /data/env/local/dyb/trunk/installation/trunk/dybinst/scripts/slave.sh trunk
Contacting the master instance, this will take a while. Go get muffins...
=== slv-main : derive config /home/blyth/.bitten-slave/dybslv.cfg from source /home/blyth/.dybinstrc
[INFO ] Setting socket.defaulttimeout : 15.0
[INFO ] Setting socket.defaulttimeout : 15.0
[DEBUG ] Sending POST request to 'http://dayabay.ihep.ac.cn/tracs/dybsvn/builds'
[INFO ] No pending builds
Note that the slave asked the master whether there are any builds to do and got the reply No pending builds. The default config is to ask the master every 5 mins if there is anything to do.
In order for the master to instruct the slave to perform builds you must send the hostname to Simon :
[blyth@belle7 ~]$ hostname
belle7.nuu.edu.tw
who will add the slave to the master through the Trac Admin web interface.
Running the slave continuously
Supervisord is recommended to keep the slave running.
Install supervisord into your system python with easy_install or pip :
easy_install supervisor
For tips on using supervisord, see :
- http://dayabay.phys.ntu.edu.tw/tracs/env/browser/trunk/base/sv.bash
- ( includes functions to setup redhat init.d scripts that restart supervisord and all its children when your machine is rebooted )
 
An example of the supervisord config used to keep the dybslv running :
[program:dybslv]
environment=HOME=/home/blyth,BITTEN_SLAVE=/usr/bin/bitten-slave,SLAVE_OPTS=--verbose
directory=/data1/env/local/dyb
command=/data1/env/local/dyb/dybinst -l dybinst-slave.log trunk slave
redirect_stderr=true
redirect_stdout=true
autostart=true
autorestart=true
priority=999
user=blyth
Refreshing the slave build
For reasons of efficiency the slave build (which can be performed multiple times each day) is done as an update build. Certain types of commits are known to be likely to cause issues with update builds, including :
- changes to DataModel classes
In order to freshen up the build you can try rebuilding after removing various directories, in progressively increasing levels of cleanliness :
- rm -rf NuWa-trunk/dybgaudi/DybRelease/$CMTCONFIG
- rm -rf NuWa-trunk/dybgaudi/InstallArea
- rm -rf NuWa-trunk/dybgaudi/* ; svn up NuWa-trunk/dybgaudi
To trigger a slave build after the removal, invalidate the last build on the node in question using the web interface (BUILD_ADMIN privilege required)
Monitoring the slave node
After many failures on a slave, it is wise to check the running processes with ps aux : many tens of stuck nuwa.py processes can accumulate and kill your node. List them with pgrep -f nuwa.py and clean up with pkill -f nuwa.py
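A check like this could be scripted so that it can run unattended. The sketch below is an illustration under stated assumptions, not an existing dybinst tool: it assumes a Linux ps that accepts -eo pid,args, and the function name find_matching is hypothetical. It only reports matching processes and leaves the decision to kill them to the operator.

```python
import subprocess


def find_matching(ps_text, needle="nuwa.py"):
    """Return (pid, command) pairs from `ps -eo pid,args` output that contain needle."""
    matches = []
    for line in ps_text.splitlines():
        parts = line.strip().split(None, 1)
        # The header line and malformed lines fail the isdigit() check.
        if len(parts) == 2 and parts[0].isdigit() and needle in parts[1]:
            matches.append((int(parts[0]), parts[1]))
    return matches


def report():
    """Print the currently running processes whose command line mentions nuwa.py."""
    out = subprocess.check_output(["ps", "-eo", "pid,args"]).decode(errors="replace")
    procs = find_matching(out)
    print("%d nuwa.py processes" % len(procs))
    for pid, command in procs:
        print(pid, command)
```

Calling report() from a cron job and mailing the output when the count is large would give early warning before the node becomes unusable.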
Getting the slave to do periodic builds
To zeroth order only a few steps are needed to convert a standard update-build bitten slave into a periodic (daily/weekly) builder.
Develop/Debug the cron commandline
Starting point ... interactive trials with :
SLAVE_OPTS="--single --dry-run" ./dybinst -b singleshot_\\\${revision} -l /dev/stdout  trunk slave 
dybinst options :
- -l /dev/stdout : send logging to stdout, for debugging
- -b singleshot_\\\${revision} : option propagated to bitten-slave --build-dir (variables evaluated in build context supplied by the master)
The SLAVE_OPTS are incorporated into the bitten-slave commandline :
- --dry-run is for debugging only : builds are performed but not reported to the master.
- --single : perform a single build before exiting
While debugging, increase verbosity by adding this line to ~/.dybinstrc :
slv_verbose=yes
Issues Foreseen / Things TODO
- may need more escaping \\\${revision} of the build-dir
- the cron command might not get a build to perform within the period (if no qualifying commits)
  - process pile-up will occur ...
    - maybe avoid by exiting if existing slave process ?
    - perhaps add a first step that checks
- will need some purging to avoid filling the disk with builds
  - could add a build step to do this cleanup
- failed builds need to be marked as such in the file system as well as in the web interface
  - add a final build step that checks status and takes action for failures ...
    - renaming of build directories
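One possible way to avoid the process pile-up mentioned above is to have the cron job take an exclusive lock and exit immediately when a previous slave still holds it. This is a sketch of that idea only, assuming Python 3 on a Unix host; the function name try_lock and the lock file location are hypothetical.

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if another holder exists."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail at once instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    # Keep this object alive for as long as the lock is needed.
    return handle
```

A wrapper would call try_lock on a file such as /tmp/dybslv.lock (an illustrative path), exit quietly when it returns None, and otherwise launch the single-shot slave while keeping the returned file open. The lock is released when the file is closed or the process exits. File locking can behave differently on network filesystems, so a lock file on local disk is the safer assumption.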
Understanding how ./dybinst trunk slave works
dybinst invokes the scripts below, which construct and evaluate the bitten-slave commandline to talk to the master and perform builds
- source:installation/trunk/dybinst/scripts/dybinst-slave
- source:installation/trunk/dybinst/scripts/slave.sh
bitten-slave options
[blyth@belle7 dyb]$ bitten-slave --help
Usage: bitten-slave [options] url
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --name=NAME           name of this slave (defaults to host name)
  -f FILE, --config=FILE
                        path to configuration file
  -u USERNAME, --user=USERNAME
                        the username to use for authentication
  -p PASSWORD, --password=PASSWORD
                        the password to use when authenticating
  building:
    -d DIR, --work-dir=DIR
                        working directory for builds
    --build-dir=BUILD_DIR
                        name pattern for the build dir to use inside the
                        working dir ["build_${config}_${build}"]
    -k, --keep-files    don't delete files after builds
    -s, --single        exit after completing a single build
    -n, --dry-run       don't report results back to master
    -i SECONDS, --interval=SECONDS
                        time to wait between requesting builds
  logging:
    -l FILENAME, --log=FILENAME
                        write log messages to FILENAME
    -v, --verbose       print as much as possible
    -q, --quiet         print as little as possible
    --dump-reports      whether report data should be printed
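To make the relationship between the ~/.dybinstrc settings, SLAVE_OPTS and the options listed above concrete, here is a rough sketch of assembling such a commandline. It is an illustration, not a transcription of slave.sh: the mapping of settings onto options is an assumption and the function name slave_argv is hypothetical. Only options that appear in the help output are used.

```python
import shlex


def slave_argv(settings, slave_opts="", build_dir=None, exe="bitten-slave"):
    """Assemble a bitten-slave argument list from dybinstrc-style settings."""
    argv = [
        exe,
        "--user=%s" % settings["slv_username"],
        "--password=%s" % settings["slv_password"],
    ]
    if build_dir:
        # Passed through literally; the master supplies the variable values.
        argv.append("--build-dir=%s" % build_dir)
    # Extra options such as "--single --dry-run" arrive as one string.
    argv.extend(shlex.split(slave_opts))
    # The builds URL is the single positional argument.
    argv.append(settings["slv_buildsurl"])
    return argv
```

Handing a list like this to subprocess without a shell means a pattern such as singleshot_${revision} reaches bitten-slave unexpanded, which sidesteps the layers of backslash escaping needed when the same pattern passes through shell scripts.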
What happens when builds/tests fail ?
Failures result in notification emails and an entry on the timeline. Following the link in the email gets you to the build status page, such as :
Examining the error reporting there and on the summary page
will tell you which step of the build/tests failed.
You can confirm the error by running pkg tests via dybinst, eg for rootiotest
./dybinst trunk tests rootiotest
and investigate further by getting into the environment and directory of the pkg running the tests
nosetests -v
Causes of test failure
Non-Run tests can fail by
- an assertion/exception in the test being triggered
Run-style tests have many additional ways to fail...
- stdout + stderr from command matches a pattern with integer code > 0
- time taken by the command exceeds the limit
- command returns with non-zero exit code
- memory(maxrss) taken by the command exceeds limit
- for reference=True tests, the output does not match the reference
- for histref=path/to/hists.root tests, any of created histograms do not match the reference path/to/histref_hists.root
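The first three Run-style checks in the list above can be illustrated with a small sketch. This is not the actual test harness: the function name check_command and the shape of the patterns argument are hypothetical, Python 3 is assumed, and the memory, reference and histogram checks are left out.

```python
import re
import subprocess
import sys


def check_command(argv, timeout_s, patterns):
    """Run argv and return a list of failure reasons (an empty list means pass).

    patterns maps a regular expression to an integer code; a match on the
    combined stdout + stderr with code > 0 counts as a failure.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge stderr into stdout
            timeout=timeout_s,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired:
        return ["time limit of %s s exceeded" % timeout_s]
    failures = []
    if proc.returncode != 0:
        failures.append("exit code %d" % proc.returncode)
    for pattern, code in patterns.items():
        if code > 0 and re.search(pattern, proc.stdout):
            failures.append("output matches %r (code %d)" % (pattern, code))
    return failures


if __name__ == "__main__":
    print(check_command([sys.executable, "-c", "print('hello')"], 30, {"FATAL": 2}))
```

A single command can fail for several reasons at once, which is why the sketch collects reasons rather than stopping at the first one.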
Updating reference output/histograms
To update reference outputs or histograms :
- simply delete the old one, a new reference will be created at next run, subsequent runs will compare against the new reference
Find test_name.ref and histref_*.root by :
[blyth@cms01 ~]$ cd $DYB/NuWa-trunk/dybgaudi
[blyth@cms01 dybgaudi]$ find . -name '*.ref'
./Simulation/GenTools/test_diffuser.ref
./Simulation/GenTools/test_gun.ref
./Simulation/DetSim/test_historian.ref
./Simulation/DetSim/test_basic_physics.ref
./DataModel/Conventions/test_Conventions.ref
./Production/MDC10b/test_dby0.ref
./RootIO/RootIOTest/test_dybo.ref
./RawData/RawDataTest/share/rawpython.log.ref
./DybAlg/test_dmp.ref
./Tutorial/Quickstart/test_printrawdata_output.ref
./Database/DbiTest/scripts/TestDbiIhep.log.ref
./Database/DbiValidate/tests/test_Conventions.ref
[blyth@cms01 dybgaudi]$ find . -name 'histref_*.root'
./Production/MDC10b/histref_dby1test.root
./Tutorial/Quickstart/histref_rawDataResult.root
[blyth@cms01 dybgaudi]$
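The create-if-missing behaviour described above, where deleting a reference causes the next run to write a new one, can be sketched as follows. This illustrates the idea for text output only; the function name compare_with_reference is hypothetical and the real comparison may normalise or filter the output first.

```python
import os


def compare_with_reference(output, ref_path):
    """Return 'created', 'match' or 'differs' for output against the file at ref_path."""
    if not os.path.exists(ref_path):
        # No reference yet (or it was deleted): this run's output becomes the reference.
        with open(ref_path, "w") as ref:
            ref.write(output)
        return "created"
    with open(ref_path) as ref:
        return "match" if ref.read() == output else "differs"
```

With this scheme a run immediately after deleting the reference can never fail the comparison, so it is worth inspecting the newly created file before relying on it.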
Investigating Issues
The primary duty is to isolate the cause and report the problem to the author/responsible in the form of a Trac ticket that enables the investigator to rapidly reproduce the issue.
While investigating remember to stop the slave to avoid interference and resource competition from additional builds starting ... eg if using supervisord :
[blyth@cms01 dybgaudi]$ supervisorctl
dybslv    RUNNING    pid 28651, uptime 1 day, 22:27:01
C> stop dybslv
dybslv: stopped
attach to python nuwa.py process with gdb
Start the failing test :
[blyth@cms01 MDC10b]$ nosetests tests/test_mdc10b.py:test_dby0
Warning in <TEnvRec::ChangeValue>: duplicate entry <Library.vector<short>=vector.dll> for level 0; ignored
Run MDC10b.runLED_Muon.FullChain with double-pulsing of LEDs and no muons to produces 50 readouts ...
Attach gdb to the process and continue with c :
[blyth@cms01 dybgaudi]$ gdb `which python` $(pgrep -f $(which nuwa.py))
...
Loaded symbols for /data/env/local/dyb/trunk/NuWa-trunk/dybgaudi/InstallArea/i686-slc4-gcc34-dbg/lib/libG4DataHelpers.so
0xb6687b23 in ParticlePropertySvc::anti (this=0xaa28798, pp=0xaa66a98) at ../src/ParticlePropertySvc/ParticlePropertySvc.cpp:445
445 const ParticleProperty* ap = *it ;
(gdb)
Unfortunately this approach sometimes ends with gdb being Killed when it runs Out of Memory.
running the command under gdb
Grab the command from the source of the test(if simple) or process table :
ps --no-headers -o command -p $(pgrep -f $(which nuwa.py)) > cmd
Edit the cmd file, fixing up any missing quotes and prefixing the command with the gdb command : set args
Allowing :
[blyth@cms01 dybgaudi]$ gdb `which python` -x cmd
GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh)
Copyright 2004 Free Software Foundation, Inc.
...
Capture the backtrace with bt when you meet problems :
ElecSimProc                           INFO Processing hit collections
ToolSvc.EsIdealFeeTool                INFO Processing 73 pmt pulses.
ToolSvc.TsMultTriggerTool             INFO Max multiplicity for DayaBayAD1 is 44
*** glibc detected *** malloc(): memory corruption: 0x0fe95d10 ***
Program received signal SIGABRT, Aborted.
[Switching to Thread -1208318272 (LWP 17858)]
0x00a1e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb)
(gdb) bt
#0  0x00a1e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00a5f915 in raise () from /lib/tls/libc.so.6
#2  0x00a61379 in abort () from /lib/tls/libc.so.6
#3  0x00a93e1a in __libc_message () from /lib/tls/libc.so.6
#4  0x00a9b473 in _int_malloc () from /lib/tls/libc.so.6
#5  0x00a9d0f1 in malloc () from /lib/tls/libc.so.6
#6  0x04fa911e in operator new () from /usr/lib/libstdc++.so.6
#7  0x032762ca in __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<char const* const, DybDaq::FeeTraits*> > >::allocate (this=0x32798c4, __n=1) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/ext/new_allocator.h:81
#8  0x03276232 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::_M_get_node (this=0x32798c4) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:356
#9  0x03276159 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::_M_create_node (this=0x32798c4, __x=@0xbfe81c88) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:365
#10 0x03275ce5 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::_M_insert (this=0x32798c4, __x=0x0, __p=0xfe95b88, __v=@0xbfe81c88) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:809
#11 0x03275ac9 in std::_Rb_tree<char const*, std::pair<char const* const, DybDaq::FeeTraits*>, std::_Select1st<std::pair<char const* const, DybDaq::FeeTraits*> >, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::insert_unique (this=0x32798c4, __v=@0xbfe81c88) at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:929
#12 0x0327583f in std::map<char const*, DybDaq::FeeTraits*, std::less<char const*>, std::allocator<std::pair<char const* const, DybDaq::FeeTraits*> > >::insert (this=0x32798c4, __x=@0xbfe81c88)
    at /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_map.h:360
#13 0x032755cf in DybDaq::FeeTraits::defaultTraits () at ../src/FeeTraits.cc:52
#14 0xb5880e3c in DayaBay::DaqReadoutPmtCrate::channel (this=0xfe97a80, channelId=@0xbfe81dc0) at ../src/DaqReadoutPmtCrate.cc:170
#15 0xb5884bd5 in DayaBay::ReadoutPmtCrate::daqReadout (this=0xfe97780, run=0, event=0) at ../src/ReadoutPmtCrate.cc:77
#16 0xaeb14500 in SingleLoader::execute (this=0xab6ec28) at ../src/SingleLoader.cc:112
#17 0x03f95d2c in Algorithm::sysExecute (this=0xab6ec28) at ../src/Lib/Algorithm.cpp:558
#18 0xaeb1f6fc in DybAlgorithm<DayaBay::ReadoutHeader>::sysExecute (this=0xab6ec28) at /data/env/local/dyb/trunk/NuWa-trunk/dybgaudi/InstallArea/include/DybAlg/DybAlgorithmImp.h:59
#19 0x01825d45 in GaudiSequencer::execute (this=0xab6bc00) at ../src/lib/GaudiSequencer.cpp:100
#20 0xb58d3823 in Stage::nextElement (this=0xab6ae78, pIStgData=@0xbfe8248c, erase=true) at ../src/Stage.cc:48
#21 0xb58c0a4e in Sim15::execute (this=0xaae7608) at ../src/Sim15.cc:121
Killed
Report findings in Trac tickets such as #565
why are my added tests not running ?
As a precaution nosetests does not run tests from executable modules unless you do : nosetests --exe OR explicitly specify the path nosetests tests/test_mdc10bfadc.py. Thus you can use chmod ugo-x or chmod ugo+x as a simple way to swap in/out modules of tests from the standard package tests.
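The chmod toggle can also be scripted. The sketch below mirrors chmod ugo+x and chmod ugo-x using the standard library; the function names set_skipped and is_skipped are hypothetical, and treating any set executable bit as "skipped" is an approximation of how nosetests decides.

```python
import os
import stat

# The three executable bits that chmod ugo+x / ugo-x set or clear.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def set_skipped(path, skipped):
    """Set the executable bits (skipped=True) or clear them (skipped=False)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    mode = (mode | EXEC_BITS) if skipped else (mode & ~EXEC_BITS)
    os.chmod(path, mode)


def is_skipped(path):
    """Return True when any executable bit is set on path."""
    return bool(os.stat(path).st_mode & EXEC_BITS)
```

Running is_skipped over the modules in a package's tests directory gives a quick list of those that a plain nosetests run would leave out.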
Optimized Builds
A new bitten config, opt.dybinst, is available for doing optimized builds.
Optimized builds are done in an "opt" directory within the normal dybinst directory :
      dybinst
      external
      NuWa-trunk
      opt/
         dybinst
         external
         NuWa-trunk
The master can be configured to distribute "dybinst" and/or "opt.dybinst" configs to your slave.
It is not necessary to setup two slaves to perform the "opt" builds, although if you have another node available it has the advantage that "dbg" and "opt" builds can then proceed in parallel. Otherwise with a single slave you will have to wait for the "dbg" build to complete before the "opt" build starts (or vv).
If you want to setup parallel "dbg" and "opt" builds then send me 2 lists of hostnames for "opt.dybinst" and "dybinst" builds.
opt-by-default setup
For the slave test steps to work a manual step is required, to setup your opt installation to be opt-by-default, one line needs to be added to opt/NuWa-trunk/setup/default/cmt/requirements as described at
dybinst copy step
The final copy step of builds allows the update build directory to be copied ( using dybbin pack/unpack/setup ) into a revision named directory.
When enabled this prevents breakage of trunk from hindering progress by allowing users to trivially shift to a recent prior revision.
when builds/tests fail
If a build fails (eg dybgaudi fails to compile) then the copy step is not reached and no copy is made. However if the build completes but some of the tests fail then the copy is still done, but the name of the copied directory is changed to indicate the number of failed tests.
debugging slvmon results
The return code from installation/trunk/dybinst/scripts/slvmon.py records the number of test failures discerned from the xml logfiles written by the slave.
If you are surprised by this return code and resulting renamed directory then debug the issue by turning up the debug ...
cd /dybinst/export/dir
python installation/trunk/dybinst/scripts/slvmon.py dybinst/4059_9542 -l DEBUG
cd /dybinst/export/dir/opt
python installation/trunk/dybinst/scripts/slvmon.py opt.dybinst/4059_9542 -l DEBUG
The single required argument is the BUILD_SLUG which identifies the configuration, build number and revision.
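For scripts that need the individual parts of a BUILD_SLUG, a small parser along these lines may be convenient. It is a sketch based only on the examples above: it assumes the form config/build_revision, with the build number before the underscore and the revision after it, and the function name parse_build_slug is hypothetical.

```python
def parse_build_slug(slug):
    """Split e.g. 'opt.dybinst/4059_9542' into (config, build number, revision)."""
    config, _, numbers = slug.rpartition("/")
    # Assumed order: build number first, then revision.
    build, revision = numbers.split("_")
    return config, int(build), int(revision)
```

For the slugs used in the commands above this yields ('dybinst', 4059, 9542) and ('opt.dybinst', 4059, 9542).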
configuration of the copy step
The copy is configured by means of variables dyb_copy.. in envfiles such as ~/.dybinstrc. To allow separate configuration for debug and opt builds, variants of the config vars ending with _opt or _dbg are accepted and take precedence over the generic vars.
- dyb_copybase : directory to which revision directories are copied, not configuring this or the _dbg/_opt variant prevents the copying from being done
- dyb_copykeep : number of revision directories to be retained (defaults to 10, can have different opt/dbg settings using _dbg/_opt), others are purged
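The precedence of the _opt and _dbg variants over the generic variables can be expressed as a short lookup. This is a sketch of the rule as described, not the dybinst implementation; the function name copy_setting is hypothetical and the settings are assumed to be already parsed into a dictionary.

```python
def copy_setting(settings, name, variant, default=None):
    """Look up name_variant first (e.g. dyb_copybase_opt), then name, then default."""
    for key in ("%s_%s" % (name, variant), name):
        if settings.get(key):
            return settings[key]
    return default
```

For example, with dyb_copybase set generically and dyb_copykeep_opt set to 3, an opt build would copy to the generic base and keep 3 directories, while a dbg build would fall back to the default of 10. A result of None for dyb_copybase corresponds to the copy step being disabled.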
Currently the purge algorithm decides what to purge/retain based on
- modification time of the revision directory
- number of symbolic links within dyb_copybase that point to the revision directory
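How the two criteria are combined is not spelled out here, so the following sketch makes an explicit assumption: the newest directories by modification time are retained up to the keep count, and any directory that is the target of a symbolic link inside dyb_copybase is always spared. The function name select_purge is hypothetical, and the function only selects candidates without deleting anything.

```python
import os


def select_purge(copybase, keep=10):
    """Return the revision directories that would be purged from copybase."""
    entries = [os.path.join(copybase, name) for name in os.listdir(copybase)]
    # Resolved targets of symbolic links living directly in copybase.
    linked = set(os.path.realpath(p) for p in entries if os.path.islink(p))
    dirs = [p for p in entries if os.path.isdir(p) and not os.path.islink(p)]
    # Newest first, so everything after the first `keep` is a purge candidate.
    dirs.sort(key=os.path.getmtime, reverse=True)
    return [d for d in dirs[keep:] if os.path.realpath(d) not in linked]
```

Under this assumption, pointing a symbolic link at a revision directory is a simple way to protect a known-good build from being purged.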

