Dear Colleagues,
Please find enclosed the new world-wide weekly Panda
production/ADCoS operation status report from the ADCoS team.
With best wishes,
Yuri
------------------------------------------------------------------------
World-wide weekly ATLAS ADCoS summary (Feb.14-20,2012)
________________________________________________________________________
I. General summary:
---------------
A) During the past week (Feb.14-20,2012) the Panda production service
- completed 1,224,816 managed group, MC production, validation
and reprocessing jobs
- average ~174,973 jobs per day
- failed 81,926 jobs
- average efficiency (cross-checked below):
-- jobs ~93.7%
- active tasks: 2,728
-- distribution by cloud:
CA:175 CERN:15 DE:284 ES:183 FR:124 IT:171 ND:175 NL:466 TW:73 UK:299 US:763
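Note: the ~93.7% job efficiency quoted above is simply
completed/(completed+failed) over the week; a minimal cross-check sketch
(variable names are illustrative, not PanDA monitor fields):

    # Cross-check of the weekly numbers quoted above.
    completed = 1224816          # group, MC, validation and repro jobs
    failed    = 81926
    eff = 100.0 * completed / (completed + failed)
    print("efficiency = %.1f%%" % eff)          # -> 93.7%
    print("jobs/day   = %d" % (completed // 7)) # -> ~174,973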
B) - NL/SARA DT (SE,SRM warning:CPU replacement in one f/s), Feb.14.
- FR/IN2P3-CC DT (CE OUTAGE:maintenance on GE batch system),Feb.15-16.
- FR/IN2P3-CC DT (CE,SE,SRM,LFC warning:network connectivity), Feb.16.
- ND/NDGF-T1 DT (CE,SE,SRM warning:reboot pools,f/e), Feb.15.
- IT/INFN-T1 DT (CE OUTAGE:ce07-lcg down for maintenance), Feb.15-17.
- TW/Taiwan-LCG2 DT (SE,SRM warning:replace switch for tape s), Feb.16.
- CERN/CERN-PROD DT (CE warning|OUTAGE:ce207|6 s/w upgrade),Feb.14,16.
- CERN/CERN-PROD DT (CE OUTAGE:ce205|203,204 s/w upgrade), Feb.17|20.
- UK/RAL-LCG2 DT (CE+LFC|SE,SRM OUTAGE(LFC migr.)|warning),Feb.14+|14.
- New DB Release 18.0.1 subscribed to sites, Feb.16.
- New pilot version 50d released, Feb.14.
- CERN: INTR DB migration and 11g upgrade completed (<3h.), Feb.15.
II. Site and FT/DDM related interruptions/issues/news (+fixed, -not yet)
-------------------------------------------------
1) Tue Feb.14
-> Cloud/site PROD efficiency ever<50% : FR/IN2P3-CC
-> Cloud/site DDM efficiency <50% : -
........................................
+ FR/IN2P3-CC 1K job f./[Errno 4]Interrupted system call/Get er.:Failed to
get LFC replicas.GGUS 79195 solved:timeout while sourcing the setup from
CVMFS,so no LFC replica was obtained(see the EINTR note below).Many "lost
heartbeat" due to a crash of the Master batch scheduler system at around
12:10,Feb.13.DT,Feb.15.
- IT/INFN-TRIESTE_LOCALGROUPDISK many FT f./Space Token expired/Space
Management step in srmPrepareToPut failed/SRM_FAILURE:No valid space
tokens/'userSpaceTokenDescription' does not refer to an existing space.
GGUS 79215 assigned.
+ DE/Wuppertal many job f./lost heartbeat.GGUS 79216 solved:jobs needed
too much RAM and were killed by the system.Limit was increased slightly
to give these jobs a chance,although they violate ATLAS computing model.
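Note on the "[Errno 4] Interrupted system call" signature above (it shows
up again at DESY-HH and CSCS-LCG2 on Feb.15): a timeout enforced with an
alarm signal interrupts whatever blocking call is in progress, which
Python 2 reports as "[Errno 4] Interrupted system call" (EINTR). A
minimal, hypothetical wrapper showing how a benign EINTR is usually told
apart from a real I/O error (not the actual pilot code):

    import errno

    def retry_on_eintr(call, *args, **kwargs):
        # Retry only calls interrupted by a signal (EINTR, "[Errno 4]");
        # let every real error propagate unchanged.
        while True:
            try:
                return call(*args, **kwargs)
            except (IOError, OSError) as exc:
                if exc.errno != errno.EINTR:
                    raise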
2) Wed Feb.15
-> Cloud/site PROD efficiency ever<50%: -
-> Cloud/site DDM efficiency <50% : -
...................................................................
+ CERN/Panda-DDM cron config issue causing job f.+analysis autoexclusion/
Setupper._setupDestination()could not register.Fixed.Elog 33818.
+ FR/IN2P3-CC batch system outage.Elog 33823.
- DE/HEPHY-UIBK t.704611,704063 job f./with open(cmtversion) as cmtversion
file:SyntaxError:invalid syntax.Savannah 126363,offlined.
+ DE/DESY-HH 30% job f./[Errno 4]Interrupted system call.GGUS 79243
solved:the signature of a timeout on s/w setup.HH is known to have NFS s/w
area scaling issues,and cvmfs is limping to the rescue.Reduced the number
of jobs.
- DE/CSCS-LCG2 333 job f./[Errno 4]Interrupted system call.GGUS 79261
waiting for reply:also 534 job f./PoolCat_oflcond.xml",Message:invalid
document structure,Feb.17.
+ US/NET2_PHYS-TOP FT f./DEST.:CONN.:SRM.GGUS 79252 solved:probably caused
by an occasional and temporary slowness of atlas.bu.edu.About to replace
the h/w.
- US/NET2_PHYS-EXOTICS DATRI requests pending since Feb.9.Savannah 126386,
91584:looks like a data-transfer problem (the subscription exists,
but data is not being transferred).
+ US/WISC 260 FT f./failed to contact on remote SRM atlas07.cs.wisc.edu.
GGUS 79265 verified:a failure to install the new cert. package was fixed.
+ IT/INFN-TRIESTE_LOCALGROUP 25 DATRI subscriptions canceled due to missing
files at source on the 13-14th,but the system still keeps on retrying.
Savannah 91565:DaTRI wasn't able to delete subscriptions.The stop
procedure was re-tested,results are positive.DaTRI code was updated for
more precise catching of such situations.
+ NL/csTCDie 50 job f./Too little space left on local disk to run job.GGUS
79264 verified:still 133 job f./12h on Feb.16.Deletion of old files
cleared up some space,no errors on Feb.17.
3) Thu Feb.16
-> Cloud/site PROD efficiency ever<50% : FR
-> Cloud/site DDM efficiency <50% : -
............................................
+ FR/ATLAS-IN2P3-CC-Frontier 1 server unavailable.Elog 33811,33816.Back.
+ FR/no DDM activity,no conn.to CERN.GGUS ALARM 79278 solved:problem on
LHCONE network corrected by the RENATER-GEANT teams.
- CA/TRIUMF-LCG2_MCTAPE ~1K FT f./SRMV2STAGER:StatusOfBringOnlineRequest:
SRM_FAILURE.GGUS 79266 assigned:one tape got stuck in a drive;after a
rewind it seems to be working now.
- IT/INFN-T1 ~2K job f./Get er.:lcg_gt:Invalid argument.GGUS 79269 solved:
seems to be related to a BDII outage(h/w failure) on Feb.16 from 5 to
9am CET.Efficiency is good on Feb.20.
+ ES/IFIC-LCG2 FT f./DEST.:CONN.:SRM.GGUS 79272 solved:machine restarted.
+ UK/UKI-LT2-UCL-HEP many FT f./DEST.:CONN.:SRM.GGUS 79276 solved:the DT
for the DPM upgrade/migration was updated to 'OUTAGE' but still showed
'warning';new f.;end time changed,back to 'OUTAGE',Feb.18.
+ US/UTD-HEP 1.7K+ job f./IOVDbSvc ERROR**COOL exception caught.Savannah
126432,126433,126437,91627:corrupted cvmfs on many nodes:database disk
image is malformed.Problematic WNs removed from service.
+ DE/FZK-LCG2->FR many FT f./SOURCE:GRIDFTP_ER.:conn.reset by peer.GGUS
79287 verified:it was due to an asymmetric routing between CCIN2P3/
RENATER/LHCONE/FZK.
- NL/SARA-MATRIX still FT f./GRIDFTP_ER.:server err.426 FTP proxy did not
shut down.GGUS 76920 in progress and GGUS 77660 on hold updated.
4) Fri Feb.17
-> Cloud/site PROD efficiency ever<50% : ES
-> Cloud/site DDM efficiency <50% : FR(<-FZK),ND,NL
............................................
- FR/LPC many job f./missing DBRel:FSRelease-0.7.1.2.tar.gz.Savannah 91637;
also some DESD DSs missing.Seems a DPM issue,site admins informed.Site set
offline,blacklisted.Savannah 126477.GGUS 79344 waiting for reply.
+ FR/IN2P3-CC<-FZK FT f.,but data can be copied FZK->CERN.GGUS 79291
verified:it was due to an asymmetric routing between CCIN2P3/RENATER/
LHCONE/FZK.
- ND/NDGF-T1 FT f.(6%eff.)/SRMV2STAGER:StatusOfBringOnlineRequest:SRM_
INVALID_PATH:Failed.GGUS 79308 in progress:still 190 f. on Feb.20.
+ CA/CA-VICTORIA-WESTGRID-T2 84% job f./Put er.:Copy command self timed
out after 4357s.GGUS 79315 solved:declared a DT to investigate the SE
issues;DT is over and the SE is OK,Feb.18.
+ US/BU_ATLAS_TIER2 many DDM deletion errors.GGUS 79326 solved:see GGUS:
77729(invalid character in the DS name string for user files).
- US/WT2 258/4h deletion errors.GGUS 79335 in progress(see above:wrong
character in user DS|file name).
+ IT/INFN-Genova 64 FT f./DEST.:Request timeout due to internal err.GGUS
79336 solved:StoRM b/e crashed,service restarted.Savannah 126479.
+ ES/IFIC-LCG2 2900 FT f./DEST.:CONN.:SRM.GGUS 79339 solved:working
correctly after srm autorestart.
5) Sat Feb.18
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : ND,NL
..............................
- CERN/CERN-PROD 1K job f./Put err.:Unable to create parent directory.GGUS
79341 waiting for reply:Jobs are trying to write a CASTOR path to EOS
(eosatlas => /castor/cern.ch ). This is some misconfiguration on the
job or ATLAS side and this must fail by definition.Savannah 126500:The
xrdcp mover for stage-out is not writing correctly to TAPE endpoints on
CASTOR(see the path-check sketch at the end of this day's list).
+ IL/IL-TAU-HEP_DATADISK 5K FT f./DEST.:CONN.:SRM.GGUS 79342 solved:StoRM
services were down,fixed.Savannah 126476.
+ IT/INFN-MILANO All FT f./DEST.:Request timeout due to internal err.GGUS
79343 solved:StoRM backend restarted after upgrade.
+ FR/IN2P3-CC+other sites many job f./Put err.:registration failed/No
space left on device LFC.GGUS ALARM 79347 verified:FT is affected too.
Increased the allocated space in the concerned tablespace,Feb.19.
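Note on the CERN-PROD "Unable to create parent directory" failures above
(GGUS 79341): stage-out was handing a CASTOR namespace path to the EOS
endpoint, which must fail by definition. A minimal, hypothetical
illustration of that endpoint/path mismatch (the endpoint string and
namespace prefixes are examples, not the actual mover configuration):

    def endpoint_matches_path(endpoint, path):
        # A CASTOR namespace path sent to an EOS endpoint cannot work.
        if "eos" in endpoint and path.startswith("/castor/"):
            return False
        return True

    print(endpoint_matches_path("root://eosatlas.cern.ch",
                                "/castor/cern.ch/grid/atlas/f1"))  # False
    print(endpoint_matches_path("root://eosatlas.cern.ch",
                                "/eos/atlas/atlasdatadisk/f1"))    # True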
6) Sun Feb.19
-> Cloud/site PROD efficiency ever<50% : CERN,FR
-> Cloud/site DDM efficiency < 50%: -
........................................
- ND/ARC job f./transformation not installed in CE.GGUS 79308 in progress
updated.
- US/MWT2 1.3K job f./lost heartbeat.GGUS 78999 reopened in progress:from
logs:file lookup failed,no such file or dir.
+ DE/PSNC_PRODDISK FT f./SOURCE:Unable to connect to se.reef.man.poznan.pl.
GGUS 79352 solved:restart dpm-gsiftp service.
7) Mon Feb.20
-> Cloud/site PROD efficiency ever<50% : CERN-PROD
-> Cloud/site DDM efficiency <50% : -
.......................................................
+ NL/SARA-MATRIX job f./g77: installation problem, cannot exec `f771': No
such file or dir.GGUS 79353 solved:compat g77 was not correctly installed.
The g77 binary was there but no package was installed.Installed this
package: compat-gcc-34-g77-3.4.6-4.1.
+ US/SWT2-CPB 15K+ FT f./DEST.:has trouble with canonical path.GGUS 79355
solved:restart of the xrootdfs process on the SRM gateway host,transfers
succeed.
- US/SLACXRD_USERDISK still 252 deletion err.GGUS 79335 in progress
updated:perhaps due to invalid characters in user DS/file names(see the
name-check sketch at the end of this section).
- CERN/CERN-PROD_LOCALGROUPDISK 159 FT f./SOURCE file doesn't exist.GGUS
79373 in progress:full parent path doesn't exist,deletion issue?
- IT/INFN-MILANO-ATLASC_PHYS-SM DaTRI 289405 is also AWAITING_SUBSCRIPTION
(QUOTA_EXCEEDED).Savannah 126102 updated:10TB is available,so the problem
lies elsewhere.
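Note on the recurring deletion errors (GGUS 77729/79326/79335 at BU, WT2
and SLACXRD above), attributed to unexpected characters in user dataset/
file names. A minimal, hypothetical pre-check a user could run before
naming a dataset, assuming the safe set is letters, digits, dot, dash
and underscore (the exact DDM naming rule may differ):

    import re

    SAFE_NAME = re.compile(r'^[A-Za-z0-9._-]+$')   # assumed safe set

    def looks_safe(name):
        # True if the dataset/file name uses only assumed-safe characters.
        return bool(SAFE_NAME.match(name))

    print(looks_safe("user.jdoe.test.123_v1"))   # True
    print(looks_safe("user.jdoe.my file,v1"))    # False (space, comma)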
III. ATLAS Validation,Repro,DDM,ADC Operation Savannah bug-reports
(++ bug understood/fixed, -+ fixed in part, -- not fixed yet)
----------------------------------------------------------------------
++ 17.0.6.2.2 valid1 Reco task 709311,709262,709318,709305,709304,709318
709556,709271,709561,709295 failures:Unable to build inputFileSummary
from any of the specified input files.Savannah 91508:group production
Tau.All jobs failed multiple attempts at CERN-PROD.Tasks ABORTED.
Understood:the request was submitted by mistake with p851,which is
outdated for this purpose.
++ 17.1.2.1 data11_7TeV Merging task 709238 failures:TRF_UNKNOWN|"Unable
to commit output ESD.709238._000340.pool.root.7"|"commitOutput failed".
Savannah 126362:group production TRIGGER HLT.Many jobs failed repeatedly
at CERN-RELEASE because the output ESD was too large,exceeding the 10GB
limit.Task ABORTED.
++ 16.6.7.18 mc11_7 simu task 710575-710581 failures: Wrong ATLAS layout:
ATLAS-GEO-18-02-00 Either ATLAS geometry tag has been misspelled, or the
DB Release does not contain the geometry specified.Savannah 91598,
126407:MC11 production - EXOTICS.All jobs failed multiple attempts in
NL,US,TW,FR,CA,ND clouds.Tasks ABORTED:configuration problem.
-+ 17.0.5.6 valid1 DigiMReco task 710915 failures: TRF_SEGFAULT | "Caught
signal 11(Segmentation fault)".Savannah 91632:T1_McAtNlo_Jimmy.All jobs
repeatedly crashed at different sites in US cloud on the RAWtoESD step
due to an error which seems to come from the MuonBoy algorithm.Task
ABORTED.
-+ 17.1.3.1 valid1 simu task 710653 failures: Payload stdout file too big.
Savannah 91634:MC11 VALIDATION SampleA 64bit s1421.All jobs failed up to
9 attempts at various sites in US cloud.Task ABORTED.
-+ 16.6.8.3 mc11_7TeV evgen task 711294 failures:"Output file EVNT.711294.
_000006.pool.root.5 failed validation.File not created".Savannah 91662:
MC11 production - HIGGS.All jobs failed 5 attempts each at various sites
in US cloud/region due to a PYTHIA error (interesting that somehow
ATHENA doesn't catch it and exits with exit code 0?).Task ABORTED.
-+ 17.0.6.4 valid1 DigiMReco task 710918 failures:TRF_UNKNOWN | "Unable to
commit output tmpRDO.pool.root".Savannah 91663:PythiaH120gamgam.All jobs
failed up to 3 attempts at various sites in US cloud/region.The same
problem with another 17.1.2.1 valid1 DigiMReco task 710919,712301
and 17.0.6.4 task 712300 100% failing in US.Tasks ABORTED.
-+ 16.6.8.4 mc11_7TeV evgen task 712322-712581 failures:TRF_UNKNOWN |"You
have requested an unknown run number! 152850 could not be decoded".
Savannah 126474:MC11 production - SUSY.All jobs failed multiple attempts
in various clouds.Tasks ABORTED,requesters notified,waiting for feedback
/fix before resubmission.
-- 16.6.8.3 mc11_7TeV evgen task 712582 failures:TRF_EXC|"EnvironmentError:
( | Error downloading tarball MC11JobOpts-00-02-01_v0.tar.gz.Savannah
126492:MC11 production:SM.All jobs but the first 10 scouts failed multiple
attempts at different sites in US cloud.Task is still "running".
-- 16.6.9.2 mc11_7TeV evgen task 714766 failures:Py:Athena INFO leaving
with code 65: "failure in an algorithm execute".Savannah 126502:MC11
production - SUSY.44 other tasks affected, all jobs failed repeatedly
in various clouds.Contacted the requester (also now cc'd), who knows
what the problem is.Tasks are still in "submitted" state.
Yuri