Subject: World-wide weekly ADCoS operation status report (Feb.14-20,2012)
From: Yuri Smirnov <iouri@mail.cern.ch>
Date: 02/20/2012 01:55 PM


                   Dear Colleagues,

     Please find enclosed the new world-wide weekly Panda
 production/ADCoS operation status report of the ADCoS team.

 With best wishes,
                              Yuri
 ------------------------------------------------------------------------

    World-wide weekly ATLAS ADCoS summary (Feb.14-20,2012)
________________________________________________________________________

I. General summary:
    ---------------

A)  During the past week (Feb.14-20,2012) Panda production service
    - completed 1,224,816 managed group, MC production, validation
      and reprocessing jobs
    - average ~174,973 jobs per day
    - failed 81,926 jobs
    - average efficiency:
       -- jobs  ~93.7%
    - active tasks: 2,728
       -- distribution by cloud:
 CA:175 CERN:15 DE:284 ES:183 FR:124 IT:171 ND:175 NL:466 TW:73 UK:299 US:763

B)  - NL/SARA DT (SE,SRM warning:CPU replacement in one f/s), Feb.14.
    - FR/IN2P3-CC DT (CE OUTAGE:maintenance on GE batch system),Feb.15-16.
    - FR/IN2P3-CC DT (CE,SE,SRM,LFC warning:network connectivity), Feb.16.
    - ND/NDGF-T1 DT (CE,SE,SRM warning:reboot pools,f/e), Feb.15.
    - IT/INFN-T1 DT (CE OUTAGE:ce07-lcg down for maintenance), Feb.15-17.
    - TW/Taiwan-LCG2 DT (SE,SRM warning:replace switch for tape s), Feb.16.
    - CERN/CERN-PROD DT (CE warning|OUTAGE:ce207|6 s/w upgrade),Feb.14,16.
    - CERN/CERN-PROD DT (CE OUTAGE:ce205|203,204 s/w upgrade), Feb.17|20.
    - UK/RAL-LCG2 DT (CE+LFC|SE,SRM OUTAGE(LFC migr.)|warning),Feb.14+|14.
    - New DB Release 18.0.1 subscribed to sites, Feb.16.
    - New pilot version 50d released, Feb.14.
    - CERN: INTR DB migration and 11g upgrade completed (<3h.), Feb.15.


II. Site and FT/DDM related interruptions/issues/news (+fixed, -not yet)
     -------------------------------------------------

1) Tue Feb.14
    -> Cloud/site PROD efficiency ever<50% : FR/IN2P3-CC
    -> Cloud/site DDM  efficiency <50% : -
      ........................................
+ FR/IN2P3-CC 1K job f./[Errno 4]Interrupted system call/Get er.:Failed to
   get LFC replicas.GGUS 79195 solved:timeout while sourcing the setup
   from CVMFS and not getting the LFC replica.Many "lost heartbeat" due to
   a crash of the Master batch scheduler at around 12:10,Feb.13.DT,Feb.15.
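The "[Errno 4] Interrupted system call" failures seen here (and at DESY-HH
and CSCS below) are the classic EINTR pattern: a blocking system call cut
short by a signal, typically while a slow s/w setup is being sourced. A
minimal sketch of the standard retry loop (illustrative only, not the
actual pilot code):

```python
import errno
import os

def read_retry_eintr(fd, nbytes):
    """Retry os.read() when it is interrupted by a signal (EINTR).

    A blocking call that fails with [Errno 4] was merely interrupted,
    not broken; re-issuing the same call is the standard remedy.
    """
    while True:
        try:
            return os.read(fd, nbytes)
        except OSError as e:
            if e.errno != errno.EINTR:
                raise  # a real error: propagate it
            # interrupted by a signal: simply retry the read
```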
- IT/INFN-TRIESTE-LOCALGROUPDISK many FT f./Space Token expired/Space
   Management step in srmPrepareToPut failed/SRM_FAILURE:No valid space
   tokens/'userSpaceTokenDescription' does not refer to an existing space.
   GGUS 79215 assigned.
+ DE/Wuppertal many job f./lost heartbeat.GGUS 79216 solved:jobs needed
  too much RAM and were killed by the system.Limit was increased slightly
  to give these jobs a chance,although they violate ATLAS computing model.

2) Wed Feb.15
    -> Cloud/site PROD efficiency ever<50%: -
    -> Cloud/site DDM  efficiency <50% : -
      ...................................................................
+ CERN/Panda-DDM cron config issue causing job f.+analysis autoexclusion/
   Setupper._setupDestination()could not register.Fixed.Elog 33818.
+ FR/IN2P3-CC batch system outage.Elog 33823.
- DE/HEPHY-UIBK t.704611,704063 job f./with open(cmtversion) as cmtversion
   file:SyntaxError:invalid syntax.Savannah 126363,offlined.
+ DE/DESY-HH 30% job f./[Errno 4]Interrupted system call.GGUS 79243
  solved:this is the signature of a timeout during s/w setup.HH is known
  to have NFS s/w area scaling issues,and cvmfs is limping to the rescue.
  Reduced number of jobs.
- DE/CSCS-LCG2 333 job f./[Errno 4]Interrupted system call.GGUS 79261
   waiting for reply:also 534 job f./PoolCat_oflcond.xml",Message:invalid
   document structure,Feb.17.
+ US/NET2_PHYS-TOP FT f./DEST.:CONN.:SRM.GGUS 79252 solved:probably caused
  by an occasional and temporary slowness of atlas.bu.edu.About to replace
  the h/w.
- US/NET2_PHYS-EXOTICS DATRI requests pending since Feb.9.Savannah 126386,
   91584:looks like the problem of data transferring (subscription exists,
   but data is not transferring).
+ US/WISC 260 FT f./failed to contact on remote SRM atlas07.cs.wisc.edu.
   GGUS 79265 verified:failure to install the new cert.package fixed.
+ IT/INFN-TRIESTE_LOCALGROUP 25 DATRI subscriptions canceled due to missing
   files at source on the 13-14th,but the system still keeps on retrying.
   Savannah 91565:DaTRI wasn't able to delete subscriptions.The stop
   procedure was re-tested,results are positive.DaTRI code was updated for
   more precise catching of such situations.
+ NL/csTCDie 50 job f./Too little space left on local disk to run job.GGUS
   79264 verified:still 133 job f./12h on Feb.16.Deletion of old files
   cleared up some space,no errors on Feb.17.
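The "Too little space left on local disk" failures are a pre-flight space
check firing before the job runs. A sketch of such a check (function names
and the threshold are illustrative, not the real pilot code):

```python
import os

def free_disk_mb(path):
    """Free space in MB on the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024.0 * 1024.0)

def enough_space_for_job(workdir, required_mb=5000):
    # Refuse the job up front rather than fail mid-run; deleting old
    # files (as done at csTCDie) raises f_bavail and unblocks jobs.
    return free_disk_mb(workdir) >= required_mb
```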

3) Thu Feb.16
    -> Cloud/site PROD efficiency ever<50% : FR
    -> Cloud/site DDM efficiency <50% : -
      ............................................
+ FR/ATLAS-IN2P3-CC-Frontier 1 server unavailable.Elog 33811,33816.Back.
+ FR/no DDM activity,no conn.to CERN.GGUS ALARM 79278 solved:problem on
   LHCONE network corrected by the RENATER-GEANT teams.
- CA/TRIUMF-LCG2_MCTAPE ~1K FT f./SRMV2STAGER:StatusOfBringOnlineRequest:
   SRM_FAILURE.GGUS 79266 assigned:1 tape got stuck in drive,rewind it,
   seems working now.
- IT/INFN-T1 ~2K job f./Get er.:lcg_gt:Invalid argument.GGUS 79269 solved:
  seems to be related to a down of bdii(h/w failure) on Feb.16 from 5 to
  9am CET.Efficiency is good on Feb.20.
+ ES/IFIC-LCG2 FT f./DEST.:CONN.:SRM.GGUS 79272 solved:machine restarted.
+ UK/UKI-LT2-UCL-HEP many FT f./DEST.:CONN.:SRM.GGUS 79276 solved:DT
  for DPM upgrade/migration was updated to 'OUTAGE' but still showed
  'warning';after new f.,the end time was changed back to 'OUTAGE',Feb.18.
+ US/UTD-HEP 1.7K+ job f./IOVDbSvc ERROR**COOL exception caught.Savannah
   126432,126433,126437,91627:corrupted cvmfs on many nodes:database disk
   image is malformed.Problematic WNs removed from service.
+ DE/FZK-LCG2->FR many FT f./SOURCE:GRIDFTP_ER.:conn.reset by peer.GGUS
   79287 verified:it was due to an asymmetric routing between CCIN2P3/
   RENATER/LHCONE/FZK.
- NL/SARA-MATRIX still FT f./GRIDFTP_ER.:server err.426 FTP proxy did not
   shut down.GGUS 76920 in progress and GGUS 77660 on hold updated.

4) Fri Feb.17
    -> Cloud/site PROD efficiency ever<50% : ES
    -> Cloud/site DDM efficiency  <50% : FR(<-FZK),ND,NL
      ............................................
- FR/LPC many job f./missing DBRel:FSRelease-0.7.1.2.tar.gz.Savannah 91637:
   also some DESD DSs missing.Seems a DPM issue,site admins informed.Site
   set offline,blacklisted.Savannah 126477.GGUS 79344 waiting for reply.
+ FR/IN2P3-CC<-FZK FT f.,but can be copied FZK->CERN.GGUS 79291 verified:
   It was due to an asymmetric routing between CCIN2P3/RENATER/LHCONE/FZK.
- ND/NDGF-T1 FT f.(6%eff.)/SRMV2STAGER:StatusOfBringOnlineRequest:SRM_
   INVALID_PATH:Failed.GGUS 79308 in progress:still 190 f. on Feb.20.
+ CA/CA-VICTORIA-WESTGRID-T2 84% job f./Put er.:Copy command self timed
   out after 4357s.GGUS 79315 solved:declared a DT to investigate the SE
   issues;DT is over and SE is OK,Feb.18.
+ US/BU_ATLAS_TIER2 many DDM deletion errors.GGUS 79326 solved:see GGUS:
  77729(invalid character in the DS name string for user files).
- US/WT2 258/4h deletion errors.GGUS 79335 in progress(see above:wrong
  character in user DS|file name).
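The BU and WT2 deletion errors both trace back to characters in
user-chosen dataset/file names that the deletion service cannot handle. A
hypothetical validator (the exact legal character set is an assumption
here; the tickets only say "invalid/wrong character"):

```python
import re

# Assumed-safe alphabet for dataset/file names: letters, digits,
# dot, underscore, hyphen. Anything else is flagged for review.
_SAFE_NAME = re.compile(r'^[A-Za-z0-9._-]+$')

def is_safe_ds_name(name):
    """True if the name uses only the assumed-safe characters."""
    return bool(_SAFE_NAME.match(name))
```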
+ IT/INFN-Genova 64 FT f./DEST.:Request timeout due to internal err.GGUS
  79336 solved:Storm b/e crashed,service restarted.Savannah 126479.
+ ES/IFIC-LCG2 2900 FT f./DEST.:CONN.:SRM.GGUS 79339 solved:working
  correctly after srm autorestart.

5) Sat Feb.18
    -> Cloud/site PROD efficiency ever<50% : -
    -> Cloud/site DDM  efficiency <50% : ND,NL
      ..............................
- CERN/CERN-PROD 1K job f./Put err.:Unable to create parent directory.GGUS
  79341 waiting for reply:Jobs are trying to write a CASTOR path to EOS
  (eosatlas => /castor/cern.ch ). This is some misconfiguration on the
  job or ATLAS side and this must fail by definition.Savannah 126500:The
  xrdcp mover for stage-out is not writing correctly to TAPE endpoints on
  castor.
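The failing jobs were handing CASTOR-namespace paths
(/castor/cern.ch/...) to the EOS endpoint, which must fail by definition.
A minimal sanity check of the kind that would catch the mismatch before
stage-out (hypothetical, not the actual mover code):

```python
def dest_path_consistent(endpoint, path):
    """Reject an obvious namespace mismatch before attempting stage-out.

    An EOS endpoint (e.g. 'eosatlas') serves /eos/... paths; a
    /castor/... path sent there can only ever fail.
    """
    if endpoint.startswith("eos"):
        return path.startswith("/eos/")
    return True  # other endpoints: no opinion in this sketch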
+ IL/IL-TAU-HEP_DATADISK 5K FT f./DEST.:CONN.:SRM.GGUS 79342 solved:storm
  services were down,fixed.Savannah 126476.
+ IT/INFN-MILANO All FT f./DEST.:Request timeout due to internal err.GGUS
  79343 solved:StoRM backend restarted after upgrade.
+ FR/IN2P3-CC+other sites many job f./Put err.:registration failed/No
  space left on device LFC.GGUS ALARM 79347 verified:FT is affected too.
  Increased the allocated space in the concerned tablespace,Feb.19.

6) Sun Feb.19
    -> Cloud/site PROD efficiency ever<50% : CERN,FR
    -> Cloud/site DDM  efficiency < 50%: -
      ........................................
- ND/ARC job f./transformation not installed in CE.GGUS 79308 in progress
  updated.
- US/MWT2 1.3K job f./lost heartbeat.GGUS 78999 reopened in progress:from
  logs:file lookup failed,no such file or dir.
+ DE/PSNC_PRODDISK FT f./SOURCE:Unable to connect to se.reef.man.poznan.pl.
  GGUS 79352 solved:restart dpm-gsiftp service.

7) Mon Feb.20
    -> Cloud/site PROD efficiency ever<50% : CERN-PROD
    -> Cloud/site DDM  efficiency <50% : -
      .......................................................
+ NL/SARA-MATRIX job f./g77: installation problem, cannot exec `f771': No
  such file or dir.GGUS 79353 solved:compat g77 was not correctly installed.
  The g77 binary was there but no package was installed.Installed this
  package: compat-gcc-34-g77-3.4.6-4.1.
+ US/SWT2-CPB 15K+ FT f./DEST.:has trouble with canonical path.GGUS 79355
  solved:restart of the xrootdfs process on the SRM gateway host,transfers
  succeed.
- US/SLACXRD_USERDISK still 252 deletion err.GGUS 79335 in progress
  updated:perhaps due to wrong characters users apply to DS/file name.
- CERN/CERN-PROD_LOCALGROUPDISK 159 FT f./SOURCE file doesn't exist.GGUS
  79373 in progress:full parent path doesn't exist,deletion issue?
- IT/INFN-MILANO-ATLASC_PHYS-SM DaTRI 289405 is also AWAITING_SUBSCRIPTION
  (QUOTA_EXCEEDED).Savannah 126102 updated:10TB is available,the problem
  lies elsewhere.

III. ATLAS Validation,Repro,DDM,ADC Operation Savannah bug-reports
     (++ bug understood/fixed, -+ fixed in part, -- not fixed yet)
----------------------------------------------------------------------

++ 17.0.6.2.2 valid1 Reco task 709311,709262,709318,709305,709304,709318
  709556,709271,709561,709295 failures:Unable to build inputFileSummary
  from any of the specified input files.Savannah 91508:group production
  Tau.All jobs failed multiple attempts at CERN-PROD.Tasks ABORTED.
  Understood:the request was submitted by mistake with p851,which is
  outdated for this purpose.

++ 17.1.2.1 data11_7TeV Merging task 709238 failures:TRF_UNKNOWN|"Unable
  to commit output ESD.709238._000340.pool.root.7"|"commitOutput failed".
  Savannah 126362:group production TRIGGER HLT.Many jobs failed repeatedly
  at CERN-RELEASE due to a too large output ESD exceeding the limit of 10GB.
  Task ABORTED.

++ 16.6.7.18 mc11_7 simu task 710575-710581 failures: Wrong ATLAS layout:
  ATLAS-GEO-18-02-00.Either the ATLAS geometry tag has been misspelled,or
  the DB Release does not contain the geometry specified.Savannah 91598,
  126407:MC11 production - EXOTICS.All jobs failed multiple attempts in
  NL,US,TW,FR,CA,ND clouds.Tasks ABORTED:configuration problem.

-+ 17.0.5.6 valid1 DigiMReco task 710915 failures: TRF_SEGFAULT | "Caught
  signal 11(Segmentation fault)".Savannah 91632:T1_McAtNlo_Jimmy.All jobs
  repeatedly crashed at different sites in US cloud on the RAWtoESD step
  due to the error which seems coming from MuonBoy algorithm.Task ABORTED.

-+ 17.1.3.1 valid1 simu task 710653 failures: Payload stdout file too big.
  Savannah 91634:MC11 VALIDATION SampleA 64bit s1421.All jobs failed up to
  9 attempts at various sites in US cloud.Task ABORTED.

-+ 16.6.8.3 mc11_7TeV evgen task 711294 failures:"Output file EVNT.711294.
  _000006.pool.root.5 failed validation.File not created".Savannah 91662:
  MC11 production - HIGGS.All jobs failed 5 attempts each at various sites
  in US cloud/region due to the following PYTHIA error (interestingly,
  ATHENA somehow doesn't catch it and exits with exit code 0).Task
  ABORTED.

-+ 17.0.6.4 valid1 DigiMReco task 710918 failures:TRF_UNKNOWN | "Unable to
  commit output tmpRDO.pool.root".Savannah 91663:PythiaH120gamgam.All jobs
  failed up to 3 attempts at various sites in US cloud/region.The same
  problem with another 17.1.2.1 valid1 DigiMReco task 710919,712301
  and 17.0.6.4 task 712300 100% failing in US.Tasks ABORTED.

-+ 16.6.8.4 mc11_7TeV evgen task 712322-712581 failures:TRF_UNKNOWN |"You
  have requested an unknown run number! 152850 could not be decoded".
  Savannah 126474:MC11 production - SUSY.All jobs failed multiple attempts
  in various clouds.Tasks ABORTED,requesters notified,waiting for feedback
  /fix before resubmission.

-- 16.6.8.3 mc11_7TeV evgen task 712582 failures:TRF_EXC|"EnvironmentError:
 Error downloading tarball MC11JobOpts-00-02-01_v0.tar.gz".Savannah
 126492:MC11 production:SM.All jobs but the first 10 scouts failed multiple
 attempts at different sites in US cloud.Task is still "running".

-- 16.6.9.2 mc11_7TeV evgen task 714766 failures:Py:Athena INFO leaving
 with code 65: "failure in an algorithm execute".Savannah 126502:MC11
 production - SUSY.44 other tasks affected, all jobs failed repeatedly
 in various clouds.Contacted the requester (also now cc'd), who knows
 what the problem is.Tasks are still in "submitted" state.

Yuri