World-wide weekly ATLAS ADCoS summary (10.September-16.September,2013)
________________________________________________________________________

I. General summary:
    ---------------

A)  During the past week (10.September-16.September,2013) the PanDA production service
	- successfully completed 1,625,389 managed group, MC production,
	  validation and reprocessing jobs
	- average ~232,198 jobs per day
	- failed 88,739 jobs 
	- average efficiency:
	  -- jobs ~95%
	- active tasks: 694
	  -- distribution by cloud: 
CA:61 CERN:13 DE:51 ES:9 FR:61 IT:89 ND:48 NL:28 TW:51 UK:87 US:196
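
As a cross-check, the figures in A) can be reproduced with a few lines of Python. This is only a sketch: the efficiency formula, completed / (completed + failed), is an assumption, since the report does not state how the ~95% was derived, and all variable names are illustrative.

```python
# Figures copied from section A of this report.
completed = 1_625_389
failed = 88_739
days = 7  # 10-16 September, inclusive

tasks_by_cloud = {
    "CA": 61, "CERN": 13, "DE": 51, "ES": 9, "FR": 61, "IT": 89,
    "ND": 48, "NL": 28, "TW": 51, "UK": 87, "US": 196,
}

# Average number of successfully completed jobs per day.
jobs_per_day = completed / days

# Assumption: efficiency is the share of completed jobs among
# completed plus failed jobs; the report does not define it.
efficiency = completed / (completed + failed)

# The per-cloud task counts should add up to the quoted total.
active_tasks = sum(tasks_by_cloud.values())

print(f"jobs per day:   {jobs_per_day:,.0f}")
print(f"job efficiency: {efficiency:.1%}")
print(f"active tasks:   {active_tasks}")
```

Under that assumption the results are consistent with the ~232,198 jobs per day, ~95% efficiency and 694 active tasks quoted above.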

B)  Major Downtimes: 
	- BNL-ATLAS (testing the new OSG HTCondor-CE gatekeeper), 5.August-30.September
	- RAL-LCG2 (CE being decommissioned), 5.September-4.October
	- INFN-T1 (ce08-lcg maintenance), 5.-17.September; (maintenance), 9.-18.September
	- FZK-LCG2 (EMI 3 update), 12.-23.September

C)  Other news: 
   - New ATLAS release TRF caches were distributed: 
    -- Release	       -
    -- AtlasOffline    -
    -- AtlasProduction 17.0.7.2, 17.0.7.3
    -- AtlasPhysics    17.3.11.1.3
    -- AtlasMCProd     -
    -- AtlasIBLProd    - 
    -- AtlasProd1      17.2.11.9.2
    -- AtlasTrigMC     -
    -- EvgenTP4MC11    -
     

II. Site and FT/DDM related interruptions/issues/news (+fixed, -not yet)
     -------------------------------------------------------------------

1) Tue September 10
      -> Cloud/site PROD efficiency ever<50% : -
      -> Cloud/site DDM  efficiency <50% : -
        ........................................
+ DE/DESY-HH FT f./SOURCE and DEST:LoginBroker is unavailable. GGUS 97177. dCache http domain crashed; problem solved after restart.
+ UK/UKI-NORTHGRID-MAN-HEP_DATADISK FT f./DEST:Insufficient space left associated with token. Savannah 139707. Deletion stopped for downtime. After downtime 50TB is expected to be deleted

2) Wed September 11
      -> Cloud/site PROD efficiency ever<50% : -
      -> Cloud/site DDM  efficiency <50% : -
        ................................................................
+ US/AGLT2 job failures. Elog 45776. WN removed from production
+ CA/TRIUMF-LCG2, NL/SARA-MATRIX functional tests were failing in both directions: TRIUMF-LCG2_DATADISK->SARA-MATRIX_DATADISK and SARA-MATRIX_DATADISK->TRIUMF-LCG2_DATADISK. GGUS 97233. There was an issue with the network between SARA and TRIUMF; fixed by resetting the BGP session.
+ UK/RAL-LCG2 migration to SL6 has started 
+ DE/wuppertalprod FT f./SOURCE: locality is UNAVAILABLE. GGUS 97241. Storage node issue fixed.
+ CERN/ Crash and restart of dq2deletionagents on voatlas225. Elog 45788.
+ DE/FZK-LCG2 FT f./DEST: source file doesn't exist. Savannah 102559. Datasets were deleted but subscriptions were not stopped.
+ CA/CA-SCINET-T2 DATADISK FT f./DEST:No space found. 56TB of data on CA-SCINET-T2_DATADISK changed from primary to secondary and deleted
+ IT/INFN-T1 jobs f./:Grid proxy not valid. GGUS 97255. WN misconfiguration.
- DE/CYFRONET-LCG2 FT f./SOURCE:globus_ftp_client: the operation was aborted, jobs fail with put and get errors. GGUS 97232. Crashed server was fixed but the errors persist.
- IT/INFN-BOLOGNA-T3 FT f./DEST with various errors. GGUS 97234. 

3) Thu September 12
      -> Cloud/site PROD efficiency ever<50% : -
      -> Cloud/site DDM  efficiency <50% : -
        ............................................
+ IT/INFN-ROMA1 FT f./SOURCE:Unable to connect to... . GGUS 97259. gsiftp daemon restarted
+ US/ FT f./error creating file for memmap. GGUS 97261. The log area was cleaned
+ IT/INFN-NAPOLI-ATLAS FT f./SOURCE:Unable to connect to... . GGUS 97263. gridftp restarted
+ NL/ru-Moscow-FIAN-LCG2 FT f./SOURCE:Unable to connect to se4.grid.lebedev.ru. GGUS 97265. dpm-gsiftp had failed
+ CA/TRIUMF-LCG2 FT f./DEST:could not open connection to srm.triumf.ca. GGUS 97277. All network issues have been sorted out
+ CA/AUSTRALIA-ATLAS FT f./SOURCE and DEST:failed to contact on remote SRM. GGUS 97269. Related to downtime at TRIUMF-LCG2
+ US/AGLT2_SL6 jobs f./:getProperSiterootAndCmtconfig: Missing installation. GGUS 97273. WN repaired.


4) Fri September 13
      -> Cloud/site PROD efficiency ever<50% : -
      -> Cloud/site DDM  efficiency <50% : CA/TRIUMF
        ............................................
+ UK/UKI-SOUTHGRID-BHAM-HEP FT f./SOURCE:The host credential has expired. GGUS 97278. Fixed
+ FR/IN2P3-CC jobs f./:stage-in errors. GGUS 97280. Errors disappeared.
+ US/BU_ATLAS_Tier2_SL6 job failures. Elog 45843. WN set offline
+ ND/NDGF-T1_DATADISK FT f./DEST:putting on a "Ready" Queue. GGUS 97284. SRM was under very heavy load
+ CERN/ Savannah tracker unavailable. Elog 45855
- CERN/ Central Catalogue monitoring service at voatlas01 unavailable. voatlas01 is having h/w issues, but the service has 2 other nodes, so the issue shouldn't cause trouble for production.
- ND/NDGF-T1 Missing files in the ND cloud. Savannah 102573. No response so far

5) Sat September 14
      -> Cloud/site PROD efficiency ever<50% : -
      -> Cloud/site DDM  efficiency <50% : -
        ..............................
+ US/ high number of holding jobs. Elog 45866. DaTRi did not properly handle the new DN.
+ CERN/CERN-PROD jobs f./:Error accessing path/file. GGUS 97309. DN mapped
- UK/UKI-SCOTGRID-GLASGOW jobs f./:Permission denied. GGUS 97304.
- ND/NDGF-T1 FT f./SOURCE:Source file size is 0. GGUS 97306, Savannah 102586. Files are lost. They were being transferred when we had downtime earlier this month.
- IT/INFN-ROMA2 jobs f./:Unable to find local user. GGUS 97308. SE gridmapfile is stale
- DE/DESY-HH FT f./SOURCE and DEST:LoginBroker is unavailable. GGUS 97310. Transfers seem to have normalised again.
- FR/RO-07-NIPNE jobs f./:Get error: Globus system error. GGUS 97311. Site is investigating, errors continue

6) Sun September 15
      -> Cloud/site PROD efficiency ever<50% : -
      -> Cloud/site DDM  efficiency <50% : -
        ........................................
+ TW/Taiwan-LCG2 jobs f./:CGSI-gSOAP running on w-wn0848.grid.sinica.edu.tw reports Error reading token data header. GGUS 97323. Map file updated to new robot DN.
- ND/ARC-T2, DE/MPPMU jobs f./:Disk quota exceeded. GGUS 97313. Should be solved now.
- FR/IN2P3-LPSC FT f./:The available CRL has expired. GGUS 97314. Server issues
- UK/RAL-LCG2 FT f./SOURCE:No such file or directory. GGUS 97320, Savannah 102589. File should be declared lost
- UK/UKI-SCOTGRID-ECDF FT f./:The available CRL has expired. GGUS 97319. Problems with fetch-crl cron job
- ND/NDGF-T1 FT f./SOURCE:Source file/user checksum mismatch. GGUS 97322, Savannah 102590. The file is most likely among the batch that was partly transferred during the downtime.

7) Mon September 16
      -> Cloud/site PROD efficiency ever<50% : -
      -> Cloud/site DDM  efficiency <50% : -
        .......................................................
+ CERN/PilotFactory_voatlas171 down. Elog 45893.
+ FR/TR-10-ULAKBIM: Frontier squid is down. GGUS 97339. Server was down due to an air-conditioning issue in the system room.
- CA/SFU-LCG2 FT f./SOURCE: locality is UNAVAILABLE. GGUS 97340. dCache misconfiguration after the upgrade
- CERN/ many sites of different clouds FT f./:Bad GSS name: No common name in subject. GGUS 97359. fts3-pilot.cern.ch ran out of disk space under /var
- FR/IN2P3-LPC jobs f./:getProperSiterootAndCmtconfig: Missing installation. GGUS 97345. Issue with a squid server.
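
For bookkeeping, the entry convention used in this section (+ fixed, - not yet, then CLOUD/site, then ticket references) could be parsed roughly as sketched below. The regular expressions and the function name parse_entry are hypothetical and not part of any ADCoS tool; they only reflect the layout of the entries as written in this report.

```python
import re

# Hypothetical patterns, derived only from how entries look in this report:
#   "+ DE/DESY-HH FT f./SOURCE ... GGUS 97177. ..."
ENTRY = re.compile(r"^(?P<status>[+-])\s+(?P<cloud>[A-Z]+)\b(?P<rest>.*)$")
TICKET = re.compile(r"\b(?P<tracker>GGUS|Savannah|Elog)\s+(?P<id>\d+)",
                    re.IGNORECASE)


def parse_entry(line):
    """Return a dict describing one issue entry, or None if the line
    does not look like an entry (headers, separators, blank lines)."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    # Collect every tracker reference mentioned after the cloud name.
    tickets = [
        (t.group("tracker").upper(), int(t.group("id")))
        for t in TICKET.finditer(match.group("rest"))
    ]
    return {
        "fixed": match.group("status") == "+",  # "+" fixed, "-" not yet
        "cloud": match.group("cloud"),
        "tickets": tickets,
    }


sample = "- ND/NDGF-T1 FT f./SOURCE:Source file size is 0. GGUS 97306, savannah 102586."
print(parse_entry(sample))
```

Counting the entries with "fixed" set to False per cloud would then give the list of items still open at the end of the week.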

III. ATLAS Validation,Repro,DDM,ADC Operation Savannah bug-reports
       (++ bug understood/fixed, -+ fixed in part, -- not fixed yet)
----------------------------------------------------------------------
	++#102538: mc12_valid evgen tasks 1341315-7:all jobs failing with "Failed to load library 'libProc_P2_2_1_2_7_16_5_0.so'", tasks aborted
	++#139706: Task 1341434 fails with "Transform input file contains too few events", tasks aborted
	++#102554: valid1 recon task 1341442:all jobs failing with "No module named RunDMCTriggerRunsInfo", task cancelled
	++#102555: valid1 DigiMReco task 1341401:all jobs failing with "Unchecked StatusCode", task aborted
	++#102565: valid task 1341496 failing: "Could not find RunDMCTriggerRunsInfo to get event/runNumber list", task cancelled
	++#139725: mc12_2TeV evgen tasks 1341700 and 1341704 are failing with "Beam particles have incorrect energy", tasks cancelled
	++#139726: mc12_14TeV TrigFTKMergeReco tasks 1341719 and 1341927 are failing with "Non-zero exit code from transform substep exec", tasks aborted
	++#139727: mc12_8TeV reco task 1341642 requires a file lost at BNL, task finished
	++#139728: mc12_8TeV AtlasG4 task 1341743 is failing with "runNumber is not defined in ConfigDic", task aborted
	++#139737: Task 1309194 is waiting in the UK cloud, task was running too long but was not assigned to RAL-LCG2_HIMEM because of low memory requirements. Minimum memory limit removed from the queue, task finished.
	++#139758: tasks 1342404, 1342405 failing with FTKMergerAlgo did NOT succeed, tasks aborted
	++#139764: tasks 1342471-6, 1342469, 1342461-3, and 1342493-6 are failing with Sherpa bug, tasks aborted
	--#139769: task 1341458 fails with Too many/too large input files