World-wide weekly ATLAS ADCoS summary (10.September-16.September,2013)
________________________________________________________________________

I. General summary:
---------------
A) During the past week (10.September-16.September,2013) Panda production service
 - completed successfully 1,625,389 managed group, MC production, validation and reprocessing jobs
 - average ~232,198 jobs per day
 - failed 88,739 jobs
 - average efficiency:
   -- jobs ~95%
 - active tasks: 694
   -- distribution by cloud: CA:61 CERN:13 DE:51 ES:9 FR:61 IT:89 ND:48 NL:28 TW:51 UK:87 US:196

B) Major Downtimes:
 - BNL-ATLAS (testing the new OSG HTCondor-CE gatekeeper), 5.August-30.September
 - RAL-LCG2 (CE being decommissioned) 5.September-4.October
 - INFN-T1 (ce08-lcg maintenance) 5.-17.September, (maintenance) 9.-18.September
 - FZK-LCG2 (emi 3 update) 12.-23.September

C) Other news:
 - New ATLAS release TRF caches were distributed:
   -- Release -
   -- AtlasOffline -
   -- AtlasProduction 17.0.7.2, 17.0.7.3
   -- AtlasPhysics 17.3.11.1.3
   -- AtlasMCProd -
   -- AtlasIBLProd -
   -- AtlasProd1 17.2.11.9.2
   -- AtlasTrigMC -
   -- EvgenTP4MC11 -

II. Site and FT/DDM related interruptions/issues/news (+fixed, -not yet)
-------------------------------------------------------------------
1) Tue September 10
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : -
........................................
+ DE/DESY-HH FT f./SOURCE and DEST:LoginBroker is unavailable. GGUS 97177. dCache http domain crashed, after restart problem solved
+ UK/UKI-NORTHGRID-MAN-HEP_DATADISK FT f./DEST:Insufficient space left associated with token. Savannah 139707. Deletion stopped for downtime. After downtime 50TB is expected to be deleted
2) Wed September 11
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : -
................................................................
+ US/AGLT2 job failures. Elog 45776. WN removed from production
+ CA/TRIUMF-LCG2, NL/SARA-MATRIX functional tests were failing for both directions of TRIUMF-LCG2_DATADISK->SARA-MATRIX_DATADISK and SARA-MATRIX_DATADISK->TRIUMF-LCG2_DATADISK. GGUS 97233. There was an issue with the network between SARA and TRIUMF; fixed by resetting the BGP session
+ UK/RAL-LCG2 migration to SL6 has started
+ DE/wuppertalprod FT f./SOURCE: locality is UNAVAILABLE. GGUS 97241. Storage node issue fixed.
+ CERN Crash and restart of dq2deletionagents on voatlas225. Elog 45788.
+ DE/FZK-LCG2 FT f./DEST: source file doesn't exist. Savannah 102559. Datasets were deleted but subscriptions were not stopped.
+ CA/CA-SCINET-T2 DATADISK FT f./DEST:No space found. 56TB of data on CA-SCINET-T2_DATADISK changed from primary to secondary and deleted
+ IT/INFN-T1 jobs f./:Grid proxy not valid. GGUS 97255. WN misconfiguration.
- DE/CYFRONET-LCG2 FT f./SOURCE:globus_ftp_client: the operation was aborted, jobs fail with put and get errors. GGUS 97232. Crashed server was fixed but the errors persist.
- IT/INFN-BOLOGNA-T3 FT f./DEST with various errors. GGUS 97234.
3) Thu September 12
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : -
............................................
+ IT/INFN-ROMA1 FT f./SOURCE:Unable to connect to... . GGUS 97259. gsiftp daemon restarted
+ US/ FT f./error creating file for memmap. GGUS 97261. The log area was cleaned
+ IT/INFN-NAPOLI-ATLAS FT f./SOURCE:Unable to connect to... . GGUS 97263. gridftp restarted
+ NL/ru-Moscow-FIAN-LCG2 FT f./SOURCE:Unable to connect to se4.grid.lebedev.ru. GGUS 97265. dpm-gsiftp had failed
+ CA/TRIUMF-LCG2 Functional tests FT f./DEST:could not open connection to srm.triumf.ca. GGUS 97277. All network issues have been sorted out
+ CA/AUSTRALIA-ATLAS FT f./SOURCE and DEST:failed to contact on remote SRM. GGUS 97269. Related to downtime at TRIUMF-LCG2
+ US/AGLT2_SL6 jobs f./:getProperSiterootAndCmtconfig: Missing installation. GGUS 97273. WN repaired.
4) Fri September 13
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : CA/TRIUMF
............................................
+ UK/UKI-SOUTHGRID-BHAM-HEP FT f./SOURCE:The host credential has expired. GGUS 97278. Fixed
+ FR/IN2P3-CC jobs f./:stage-in errors. GGUS 97280. Errors disappeared.
+ US/BU_ATLAS_Tier2_SL6 job failures. Elog 45843. WN set offline
+ ND/NDGF-T1_DATADISK FT f./DEST:putting on a "Ready" Queue. GGUS 97284. SRM was under really heavy load
+ CERN/ Savannah tracker unavailable. Elog 45855
- CERN/ Central Catalogue monitoring service at voatlas01 unavailable. voatlas01 is having h/w issues; as the service has 2 other nodes, the issue shouldn't cause trouble for production.
- ND/NDGF-T1 Missing files in the ND cloud. Savannah 102573. No response so far
5) Sat September 14
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : -
..............................
+ US/high number of holding jobs. Elog 45866. DaTRi did not properly handle the new DN.
+ CERN/CERN-PROD jobs f./:Error accessing path/file. GGUS 97309. DN mapped
- UK/UKI-SCOTGRID-GLASGOW jobs f./:Permission denied. GGUS 97304.
- ND/NDGF-T1 FT f./SOURCE:Source file size is 0. GGUS 97306, Savannah 102586. Files are lost. They were transferring when we had downtime earlier this month.
- IT/INFN-ROMA2 jobs f./:Unable to find local user. GGUS 97308. SE gridmapfile is stale
- DE/DESY-HH FT f./SOURCE and DEST:LoginBroker is unavailable. GGUS 97310. Transfers seem to have normalised again.
- FR/RO-07-NIPNE jobs f./:Get error: Globus system error. GGUS 97311. Site is investigating, errors continue
6) Sun September 15
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : -
........................................
+ TW/Taiwan-LCG2 jobs f./:CGSI-gSOAP running on w-wn0848.grid.sinica.edu.tw reports Error reading token data header. GGUS 97323. Map file updated to new robot DN.
- ND/ARC-T2, DE/MPPMU jobs f./:Disk quota exceeded. GGUS 97313. Should be solved now.
- FR/IN2P3-LPSC FT f./:The available CRL has expired. GGUS 97314. Server issues
- UK/RAL-LCG2 FT f./SOURCE:No such file or directory. GGUS 97320, Savannah 102589. File should be declared lost
- UK/UKI-SCOTGRID-ECDF FT f./:The available CRL has expired. GGUS 97319. Problems with fetch-crl cron job
- ND/NDGF-T1 FT f./SOURCE:Source file/user checksum mismatch. GGUS 97322, Savannah 102590. The file is most likely amongst the batch which was partly transferred during the downtime.
7) Mon September 16
-> Cloud/site PROD efficiency ever<50% : -
-> Cloud/site DDM efficiency <50% : -
.......................................................
+ CERN/PilotFactory_voatlas171 down. Elog 45893.
+ FR/TR-10-ULAKBIM: Frontier squid is down. GGUS 97339. Server was down due to the climatization of the system room.
- CA/SFU-LCG2 FT f./SOURCE: locality is UNAVAILABLE. GGUS 97340. dCache misconfiguration after the upgrade
- CERN/ many sites of different clouds FT f./:Bad GSS name: No common name in subject. GGUS 97359. fts3-pilot.cern.ch ran out of disk space under /var
- FR/IN2P3-LPC jobs f./:getProperSiterootAndCmtconfig: Missing installation. GGUS 97345. Issue with a squid server.

III. ATLAS Validation, Repro, DDM, ADC Operation Savannah bug-reports (++ bug understood/fixed, -+ fixed in part, -- not fixed yet)
----------------------------------------------------------------------
++#102538: mc12_valid evgen tasks 1341315-7: all jobs failing with "Failed to load library 'libProc_P2_2_1_2_7_16_5_0.so'", tasks aborted
++#139706: task 1341434 fails with "Transform input file contains too few events", task aborted
++#102554: valid1 recon task 1341442: all jobs failing with "No module named RunDMCTriggerRunsInfo", task cancelled
++#102555: valid1 DigiMReco task 1341401: all jobs failing with "Unckecked StatusCode", task aborted
++#102565: valid task 1341496 failing: "Could not find RunDMCTriggerRunsInfo to get event/runNumber list", task cancelled
++#139725: mc12_2TeV evgen tasks 1341700 and 1341704 are failing with "Beam particles have incorrect energy", tasks cancelled
++#139726: mc12_14TeV TrigFTKMergeReco tasks 1341719 and 1341927 are failing with "Non-zero exit code from transform substep exec", tasks aborted
++#139727: mc12_8TeV reco task 1341642 requires a file lost at BNL, task finished
++#139728: mc12_8TeV AtlasG4 task 1341743 is failing with "runNumber is not defined in ConfigDic", task aborted
++#139737: task 1309194 is waiting in the UK cloud; the task was running too long but was not assigned to RAL-LCG2_HIMEM because of low memory requirements. Minimum memory limit removed from the queue, task finished.
++#139758: tasks 1342404, 1342405 failing with "FTKMergerAlgo did NOT succeed", tasks aborted
++#139764: tasks 1342471-6, 1342469, 1342461-3, and 1342493-6 are failing with Sherpa bug, tasks aborted
--#139769: task 1341458 fails with "Too many/too large input files"