The High Energy Physics
Community Grid Project
Inside D-Grid
ACAT 07
Torsten Harenberg - University of Wuppertal
[email protected]
D-Grid organisational structure
2/27
[Architecture diagram: D-Grid technical infrastructure]
- Users / Communities access the Grid through a portal (GridSphere based), a user API and the GAT API.
- D-Grid services: scheduling and workflow management, monitoring, accounting and billing, security and VO management, data management, I/O.
- Grid middleware / core services: UNICORE, LCG/gLite, Globus Toolkit V4.
- D-Grid resources: distributed data services, data/software, network, distributed computing resources.
3/27
HEP Grid efforts since 2001
[Timeline 2000-2010: HEP Grid projects]
- LCG R&D, followed by the WLCG ramp-up towards the 2008 pp run (Mar-Sep) and the heavy-ion run (Oct.)
- EU projects: EDG, then EGEE, EGEE 2, EGEE 3 ?
- German infrastructure: GridKa / GGUS
- D-Grid Initiative: DGI, DGI 2, ???
- HEP CG, ???
4/27
LHC Groups in Germany
Alice: Darmstadt, Frankfurt,
Heidelberg, Münster
ATLAS: Berlin, Bonn,
Dortmund, Dresden, Freiburg,
Gießen, Heidelberg, Mainz,
Mannheim, München, Siegen,
Wuppertal
CMS: Aachen, Hamburg,
Karlsruhe
LHCb: Heidelberg, Dortmund
5/27
German HEP institutes participating in
WLCG
WLCG: Karlsruhe (GridKa &
Uni), DESY, GSI, München,
Aachen, Wuppertal, Münster,
Dortmund, Freiburg
6/27
HEP CG participants:
Participants: Uni Dortmund, TU
Dresden, LMU München, Uni
Siegen, Uni Wuppertal, DESY
(Hamburg & Zeuthen), GSI
Associated partners: Uni Mainz,
HU Berlin, MPI f. Physik
München, LRZ München, Uni
Karlsruhe, MPI Heidelberg, RZ
Garching, John von Neumann
Institut für Computing, FZ
Karlsruhe, Uni Freiburg,
Konrad-Zuse-Zentrum Berlin
7/27
HEP Community Grid
WP 1:
Data management (dCache)
WP 2:
Job Monitoring and user support
WP 3:
Distributed data analysis (Ganga)
==> Joint venture between physics and computer
science
8/27
WP 1: Data management
coordination: Patrick Fuhrmann
An extensible metadata catalogue for semantic data access:
Central service for lattice gauge theory
DESY, Humboldt Uni, NIC, ZIB
A scalable storage element:
Using dCache on multi-scale installations.
DESY, Uni Dortmund E5, FZK, Uni Freiburg
Optimized job scheduling in data-intensive applications:
Data and CPU co-scheduling
Uni Dortmund CEI & E5
9/27
WP 1: Highlights
Establishing a metadata catalogue for lattice gauge theory
Production service of a metadata catalogue with > 80,000 documents
Tools to be used in conjunction with the LCG data grid
Well established in the international collaboration
http://www-zeuthen.desy.de/latfor/ldg/
Advancements in data management with new functionality
dCache could become the quasi-standard in WLCG
Good documentation and an automatic installation procedure provide
usability from small Tier-3 installations up to Tier-1 sites
High throughput for large data streams, optimization on quality and load
of disk storage systems, and high-performance access to tape systems
10/27
dCache-based scalable storage element
- large installations: thousands of pools, >> PB disk storage,
  >> 100 file transfers/sec, < 2 FTEs to operate
- small installations: a single host, ~ 10 TB, zero maintenance
dCache project well established
New since HEP CG:
Professional product management, i.e. code versioning,
packaging, user support and test suites.
11/27
dCache: principle
[Diagram: dCache architecture]
- dCache controller with protocol engines
- Managed disk storage; HSM adapter to backend tape storage
- Storage control: SRM (plus information protocol, EIS)
- Streaming data doors: (gsi)FTP, http(g), POSIX I/O, xRoot, dCap
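As an illustration of these doors, a minimal PyROOT sketch could read a file straight from a dCache dCap door. The host, port and path below are hypothetical, and the matching protocol plugin (dCap or xRoot) must be available in the local ROOT installation.

```python
# Minimal sketch: open a file served by a dCache door with PyROOT.
# Host name, port and path are hypothetical; real sites publish their own doors.
import ROOT

# dCap door (POSIX-like access); a root:// URL via the xRoot door works the same way.
url = "dcap://dcache-door.example.org:22125/pnfs/example.org/data/user/test.root"

f = ROOT.TFile.Open(url)      # ROOT picks the protocol plugin from the URL scheme
if f and not f.IsZombie():
    f.ls()                    # list the objects stored in the file
    f.Close()
```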
12/27
dCache: connection to the Grid world
[Diagram: a dCache storage element in the Grid]
- In-site: compute element and storage element; worker nodes read data via dCap/rfio/root, local transfers use gsiFTP.
- Out-site: information system, File Transfer Service (FTS) with its channels, and remote gsiFTP transfers, all crossing the site firewall.
- Storage control from outside goes through the Storage Resource Manager protocol (SRM).
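One way to picture the SRM interface of such a storage element is the small Python sketch below, which drives a transfer with the srmcp command-line client; the SRM endpoint and paths are hypothetical, and option names may differ between client versions.

```python
# Sketch: copy a local file to a dCache storage element through its SRM door
# with the srmcp client. Endpoint and paths are hypothetical placeholders.
import subprocess

source = "file:////tmp/myresult.root"
destination = "srm://srm.example.org:8443/pnfs/example.org/data/user/myresult.root"

# srmcp negotiates the actual transfer protocol (e.g. gsiFTP) with the SRM door.
subprocess.run(["srmcp", source, destination], check=True)
```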
13/27
dCache: achieved goals
Development of xRoot protocol support for distributed analysis
Small sites: automatic installation and configuration
(dCache in 10 minutes)
Large sites (> 1 Petabyte):
Partitioning of large systems.
Transfer optimization from / to tape systems
Automatic file replication (freely configurable)
14/27
dCache: Outlook
Current usage
7 Tier-1 centres with up to 900 TB of disk per centre
plus a tape system (Karlsruhe, Lyon, RAL,
Amsterdam, FermiLab, Brookhaven, NorduGrid)
~ 30 Tier-2 centres, including all US CMS sites;
planned for US ATLAS.
Planned usage
dCache is going to be included in the Virtual Data
Toolkit (VDT) of the Open Science Grid: proposed
storage element in the USA.
The planned US Tier-1 will break the 2 PB boundary by
the end of the year.
15/27
HEP Community Grid
WP 1:
Data management (dCache)
WP 2:
Job Monitoring and user support
WP 3:
Distributed data analysis (Ganga)
==> Joint venture between physics and computer
science
16/27
WP 2: job monitoring and user support
co-ordination: Peter Mättig (Wuppertal)
Job monitoring and resource usage visualizer
TU Dresden
Expert system classifying job failures:
Uni Wuppertal, FZK, FH Köln, FH Niederrhein
Online job steering:
Uni Siegen
17/27
Job monitoring and resource usage visualizer
[Diagram: monitoring architecture]
- Worker nodes: job monitoring with monitoring sensors; stepwise job execution monitoring of the user application (physics).
- Monitoring box: R-GMA.
- Portal server: GridSphere with a monitoring portlet.
- Analysis: web service interface to the monitoring systems, e.g. an R-GMA consumer.
- User: browser with a visualisation applet; the visualisations offer interactivity, overviews, details, timelines, histograms, ...
18/27
Integration into GridSphere
19/27
Job Execution Monitor in LCG
Motivation
1000s of jobs are submitted to LCG each day
Job status is unknown while a job is running ("What is going on here?")
Manual error detection: slow and difficult
GridICE, ...: service/hardware based monitoring only
Conclusion
Monitor the job while it is running ==> JEM
Automatic error detection needed ==> expert system
[Diagram: LCG job states - submitted, waiting, ready, scheduled, running,
then done (ok), done (failed), cancelled, aborted, cleared]
20/27
JEM: Job Execution Monitor
Runs on the gLite/LCG worker node:
Pre-execution test (Bash)
Script monitoring (Python)
Information exchange: R-GMA
Visualization: e.g. GridSphere
Expert system for classification
Integration into ATLAS
Integration into GGUS
post D-Grid I: ... ?
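The stepwise script-monitoring idea can be illustrated with a short Python sketch. This is not JEM code: the job steps and the report format are invented for the example, and in JEM the status information is published via R-GMA rather than printed.

```python
# Illustrative sketch of stepwise script monitoring (not actual JEM code):
# run each command of a job script separately and report its exit status,
# so a failing step can be located while the job is still running.
import subprocess
import time

job_steps = [                       # hypothetical job script, one command per step
    "echo 'setting up environment'",
    "ls /nonexistent/input/file",   # this step fails and is reported as such
    "echo 'running analysis'",
]

for number, command in enumerate(job_steps, start=1):
    start = time.time()
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else f"failed (rc={result.returncode})"
    print(f"step {number}: {status} after {time.time() - start:.1f}s: {command}")
```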
21/27
JEM - status
- Monitoring part ready for use
- Integration into GANGA
  (ATLAS/LHCb distributed analysis tool) ongoing
- Connection to GGUS planned
- http://www.grid.uni-wuppertal.de/jem/
22/27
HEP Community Grid
WP 1:
Data management (dCache)
WP 2:
Job Monitoring and user support
WP 3:
Distributed data analysis (Ganga)
==> Joint venture between physics and computer
science
23/27
WP 3: distributed data analysis
Co-ordination: Peter Malzacher (GSI Darmstadt)
GANGA: distributed analysis @ ATLAS and
LHCb
Ganga is an easy-to-use frontend for job
definition and management
Python, IPython or GUI interface
Analysis jobs are automatically split into
subjobs which are sent to multiple sites in
the Grid
Data management for input and output;
distributed output is collected.
Allows simple switching between testing on
a local batch system and large-scale data
processing on distributed resources (Grid)
Developed in the context of ATLAS and
LHCb
Implemented in Python
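A minimal Ganga session illustrating this workflow might look like the sketch below (typed at Ganga's IPython prompt). The payload is a placeholder, and the splitter and backend classes actually available depend on the local Ganga installation.

```python
# Sketch of a Ganga session (inside Ganga's IPython prompt, where the GPI
# classes are pre-loaded); payload, splitter and backend are placeholders.
j = Job()
j.application = Executable(exe="/bin/echo", args=["hello grid"])
j.splitter = ArgSplitter(args=[["subjob %d" % i] for i in range(3)])  # 3 subjobs
j.backend = LCG()      # swap in Local() to test on the local machine first
j.submit()

jobs                    # list all jobs in the repository with their status
```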
24/27
GANGA schema
[Diagram: Ganga workflow]
The user submits a query together with the analysis macro (myAna.C); the job
manager performs data-file splitting against the file catalog, submits the
subjobs to the queues, collects their outputs from storage and merges them
for the final analysis.
25/27
PROOF schema
[Diagram: PROOF workflow]
A PROOF query (data file list plus myAna.C) goes to the master/scheduler,
which resolves the files via the catalog, distributes the work across the
workers close to storage, returns feedback while running and delivers the
final (merged) outputs.
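For comparison, an interactive PROOF analysis can be steered from PyROOT roughly as in the sketch below; the master URL, tree name, file locations and selector are hypothetical placeholders.

```python
# Sketch of an interactive PROOF session from PyROOT; master URL, tree name,
# file names and the selector macro are hypothetical placeholders.
import ROOT

proof = ROOT.TProof.Open("proof-master.example.org")   # connect to the PROOF master

chain = ROOT.TChain("events")                          # hypothetical tree name
chain.Add("root://se.example.org//data/run1.root")
chain.Add("root://se.example.org//data/run2.root")

chain.SetProof()                # route Process() calls through the PROOF master
chain.Process("myAna.C+")       # the selector is compiled and run on the workers
```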
26/27
HEPCG: summary
Physics departments: DESY, Dortmund,
Dresden, Freiburg, GSI, München,
Siegen, Wuppertal
Computer science departments: Dortmund,
Dresden, Siegen, Wuppertal, ZIB,
FH Köln, FH Niederrhein
D-GRID:
Germany's contribution to HEP computing:
dCache, monitoring, distributed analysis
The effort will continue.
2008: start of LHC data taking, a challenge for the Grid concept
==> new tools and developments needed
27/27