Computing Technical Design Report

4.8 Database and Data Management Services

4.8.1 Distributed Database Services

The distributed database services activity in the DB project provides the physical database services and supporting software (e.g. client connection software) that make up the distributed database infrastructure underpinning database-resident ATLAS applications, including the conditions, geometry, DDM and other databases.

ATLAS is addressing this area through a combination of in-house effort and close collaboration with the LCG. ATLAS was instrumental in the establishment of the Distributed Deployment of Databases (3D) project in the LCG [4-10], which addresses the facility and software needs of establishing the database-service, replication and scalable-access infrastructure required to support DB applications. To meet these needs ATLAS relies heavily on, and collaborates closely with, 3D. ATLAS database applications are based on centralized writing and distributed reading, with no multi-master requirements on the DB services. DB inputs originating far from the centralized services are written to the central services, either via remote write or by first transporting the inputs close to the centralized services. This approach avoids the complexities of multi-mastering by exploiting the fact that, for all our DB applications, writes are much less frequent and have less demanding scalability requirements than reads.
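
As a concrete illustration of this pattern, the following minimal sketch (with hypothetical host names and connection strings, not the actual ATLAS configuration) routes all writes to the central master at CERN while serving reads from the nearest replica:

    # Minimal sketch of the centralized-write / distributed-read pattern.
    # All host names and connection strings below are illustrative placeholders.

    CENTRAL_MASTER = "oracle://conddb-master.cern.ch/atlas"        # hypothetical master DSN
    LOCAL_REPLICAS = {
        "BNL":  "mysql://conddb-replica.bnl.gov/atlas",             # hypothetical replicas
        "CNAF": "mysql://conddb-replica.cnaf.infn.it/atlas",
    }

    def dsn_for(operation, site):
        """Writes always go to the central master; reads use a local replica if one exists."""
        if operation == "write":
            return CENTRAL_MASTER
        return LOCAL_REPLICAS.get(site, CENTRAL_MASTER)

    # A calibration job at BNL reads locally but publishes results centrally.
    print(dsn_for("read", "BNL"))     # local BNL replica
    print(dsn_for("write", "BNL"))    # central master at CERN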

The common LCG database replication framework under development by the 3D project will be used for data distribution to Tier-1s and Tier-2s. Tier-1s will typically offer Oracle and possibly also MySQL services. For Tier-2 support in particular, distribution/replication mechanisms must be as automated as possible because of the low support levels available. ATLAS will deploy replica servers to Tier-2s using centrally issued Grid job-submission commands that fully automate the process. Thus, no local site administration will be necessary.
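
A hypothetical sketch of the payload such a centrally issued Grid job could execute on a Tier-2 node is shown below; it attaches a locally installed MySQL replica to the CERN master using standard MySQL replication statements. Host names, accounts and log coordinates are placeholders, not the actual deployment procedure.

    # Hypothetical Grid-job payload: configure a local MySQL server as a replica
    # of the central master.  All credentials and coordinates are placeholders.
    import MySQLdb  # MySQL-python client library

    conn = MySQLdb.connect(host="localhost", user="admin", passwd="***")
    cur = conn.cursor()
    cur.execute(
        "CHANGE MASTER TO "
        "MASTER_HOST='conddb-master.cern.ch', "       # placeholder master host
        "MASTER_USER='repl', MASTER_PASSWORD='***', "
        "MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4"
    )
    cur.execute("START SLAVE")                         # begin continuous replication
    conn.close()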

Both Tier-1 and Tier-2 servers will continuously replicate data published on the master servers at CERN. We expect on the order of 100 servers to be deployed and maintained across all major Tier-2 sites. In addition, some support is likely to be required for smaller local sites that wish to deploy replicas. For all replica servers, automatic updates from a central source will be possible. For replication management and monitoring, ATLAS will use a suite of tools now under development by the 3D project, as well as tools we have developed for ATLAS-specific use cases.

Tier centres will in general not have to manage their conditions-data selection policy locally. Following the ATLAS Computing Model, Tier centres do not locally control the ATLAS-wide production processing carried out at them; ATLAS-level processing management will, through automated mechanisms, drive the population of the needed conditions data sets at the local resource based on the processing allocated there. Likewise, Tier centres will not have to manage database storage issues locally. Several mechanisms are available or under exploration for distributing conditions-data subsets of interest: selective replication out of Oracle masters into MySQL or SQLite replicas; tiered DB access with caching intermediaries (the FroNtier approach); and data block subscriptions for file-based data. We expect to employ all of these in appropriate contexts.
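
The first of these mechanisms is illustrated by the sketch below, which extracts a subset of conditions data into a self-contained SQLite file suitable for shipping to a small site. The table layout and query are hypothetical, and 'master' is assumed to be a Python DB-API connection to the source server.

    # Sketch of selective replication into a local SQLite replica.
    # The schema and the source connection are assumptions for illustration.
    import sqlite3

    def extract_subset(master, folders, since_run, dest="conditions_subset.db"):
        """Copy the conditions rows for the requested folders and run range
        from the master server into a self-contained SQLite file."""
        local = sqlite3.connect(dest)
        local.execute("CREATE TABLE IF NOT EXISTS conditions "
                      "(folder TEXT, run_since INTEGER, run_until INTEGER, payload BLOB)")
        src = master.cursor()
        for folder in folders:
            src.execute("SELECT folder, run_since, run_until, payload FROM conditions "
                        "WHERE folder = %s AND run_until >= %s",   # hypothetical schema
                        (folder, since_run))
            local.executemany("INSERT INTO conditions VALUES (?, ?, ?, ?)", src.fetchall())
        local.commit()
        local.close()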

However, we further expect that some Tier centres will want some control over which conditions data they receive. There will be calibration activities centred around particular Tier-2s; for example, an institute or group of institutes responsible for calibrating a particular sub-detector will want that sub-detector's conditions data locally for analysis and for the preparation of new calibrations. To support this we will have conditions-data 'streams', analogous to event-store data streams. These streams represent subsets of the whole conditions database, and particular Tier centres will 'subscribe' to particular conditions streams. This implies distribution both of the IOV/relational database data associated with a conditions stream and of any corresponding conditions-data payload files (e.g. POOL ROOT files). It is analogous to sites 'subscribing' to particular classes of event datasets, a concept built into the distributed data management model, and the same mechanism will support this case. A further use case for 'subscription' replication to Tier centres is the distribution of larger conditions-data payloads that are stored in files.
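
The sketch below illustrates how such subscriptions might be expressed; the stream names, folder paths and payload dataset names are invented for illustration only.

    # Illustrative conditions-stream subscriptions (all names are hypothetical).
    # Each stream bundles IOV/relational folders with the payload-file datasets
    # (e.g. POOL ROOT files) that must be shipped alongside them.
    CONDITIONS_STREAMS = {
        "LAr_calibration": {
            "folders":  ["/LAR/ElecCalib/Pedestals", "/LAR/ElecCalib/Ramps"],
            "datasets": ["cond.lar.eleccalib.pool"],
        },
        "Muon_alignment": {
            "folders":  ["/MUON/Align/Barrel"],
            "datasets": ["cond.muon.align.pool"],
        },
    }

    SITE_SUBSCRIPTIONS = {
        "Tier2_Roma":     ["Muon_alignment"],
        "Tier2_Michigan": ["LAr_calibration"],
    }

    def streams_for(site):
        """Return the folders and payload datasets a subscribed site should replicate."""
        return [CONDITIONS_STREAMS[s] for s in SITE_SUBSCRIPTIONS.get(site, [])]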

4.8.2 Caching Tiers for Distributed Database Access

Our experience to date indicates that relying on database replication alone to achieve scalability would require substantial manpower and logistics both centrally and at Tier centres. We are in the planning stages of pursuing an alternative approach that would reduce or eliminate the need for wide dissemination of database replicas, while providing fully automated local caching of needed conditions (and potentially other) data. The approach is that of the FroNtier project developed at FNAL for the Run II experiments [4-9], in which client/server communication uses conventional HTTP: the URL constitutes the upstream message and specifies the requested data, while the returned document (prepared by a web application server interacting with the database) packages the requested data in a format understood by the client application. Because the protocol is that of the web itself, standard web proxy caching tools (such as Squid) can be used to establish automatic caching of requested data close to the client, e.g. on gateway servers at a Tier site or even at a contributing site not affiliated with ATLAS.
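
The access pattern is sketched below. The server URL and parameter names are illustrative rather than the actual FroNtier protocol; the essential point is that the request is an ordinary HTTP GET, so any intermediate Squid proxy can cache the response.

    # Sketch of FroNtier-style access: the query is encoded in the URL of a
    # plain HTTP GET, so standard web proxies can cache the returned document.
    import urllib.parse
    import urllib.request

    def fetch_conditions(base_url, folder, tag):
        query = urllib.parse.urlencode({"folder": folder, "tag": tag})
        url = f"{base_url}?{query}"                    # typically reached via a local Squid
        with urllib.request.urlopen(url) as response:  # cacheable, conventional HTTP
            return response.read()                     # data packaged by the web application server

    # payload = fetch_conditions("http://frontier.example.org/Frontier", "/LAR/Pedestals", "v1")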

A number of issues need to be addressed, in particular the form and granularity of the URL-requested data packets and a mechanism to prevent retrieval of stale data from the cache. Studies are planned in the near term by the 3D and COOL teams to examine the compatibility of FroNtier itself with the COOL interface. Approaches to the cache-validity problem are being explored. For example, incorporating data-packet version information into the request ensures that the correct packet is retrieved. A two-stage retrieval could support this. In the first stage a request goes to the IOV database for the data packet valid at a specified time; this exchange is not cached, since the request is too specific to have a significant probability of reuse, but it returns only a small message. The returned message would contain the URL(s) of the specific, versioned data packets required to fulfil the request (an example of the so-called REST approach to web-based services, the architectural model used by the web itself). The URL or URLs would be requested in the second, cached, step, and any other local process requesting calibration data in a similar time frame would have a high probability of a cache hit on these data packets. CDF tests show orders-of-magnitude performance improvements from local Squid data caching with this approach.
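
The two-stage scheme might look like the following sketch, in which the service URLs and the JSON response format are assumptions made for illustration: the first, uncached exchange resolves an interval of validity into the URLs of immutable, versioned data packets, and the second, cached step fetches those packets.

    # Sketch of the two-stage (REST-style) retrieval described above.
    import json
    import urllib.parse
    import urllib.request

    def get_conditions(iov_service, folder, timestamp):
        # Stage 1: ask the IOV service which versioned packets are valid at this
        # time.  The request is too specific to be worth caching, but the reply
        # is only a small list of URLs (assumed here to be returned as JSON).
        q = urllib.parse.urlencode({"folder": folder, "time": timestamp})
        with urllib.request.urlopen(f"{iov_service}/resolve?{q}") as r:
            packet_urls = json.loads(r.read())

        # Stage 2: fetch the versioned packets through the proxy cache; other
        # jobs requesting the same validity window will hit the cached copies.
        return [urllib.request.urlopen(u).read() for u in packet_urls]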

The 3D project provides the context for the common effort on applying this approach for the LHC. We are also examining other open-source projects for potentially relevant work in managing cache validation in tiered data exchanges on the web.

4.8.3 Operational Experience

Operational experience with database services during DC2 identified a number of performance bottlenecks, including server capacity limitations and limits on the number of concurrent DB connections. Another related performance limiter was TCP/IP socket timeouts and failures on the client side. This condition (similar in effect to a denial-of-service attack) arises at high query rates and particularly affects geographically remote sites. In all these cases, deployment of additional, geographically distributed database-server replicas resolved the problem.

As expected, we found that the deployment of replica servers is an effective tool for achieving scalable database services. To ease the burden of replica-server administration and support on remote sites, the database-server replica installation was reduced to a one-line command. However, we found that only those centres with problems accessing the central databases (because of firewalls, or geographical remoteness resulting in low data throughput) deployed replica servers. In addition, numerous concerns were expressed regarding replica synchronization with the central servers. To address these, a new deployment model involving centralized distribution of replicas via Grid commands is being prototyped; it will enable centralized, coherent management of the distribution and synchronization of replicas. A proof-of-principle Grid-based replica-server deployment has been demonstrated at the SMU site of Grid3.

An alternative to Grid deployment of replicas on site head nodes is the delivery of database services directly to worker nodes, using conventional Grid transport mechanisms for service installation. During ATLAS DC1 this approach was extensively tested on NorduGrid and remains a deployment option. It is particularly attractive for sites where worker nodes have no outbound connectivity.

4.8.4 Monitoring

During ATLAS Data Challenges and other large-scale production efforts we have deployed (and steadily improved) several database monitoring tools. The most detailed information has been collected through server-side logging. This information has been crucial in debugging database service problems and errors reported by clients. To complement this server-side data, an ATLAS Client Library is being developed to provide monitoring and logging information on the client side.
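
The kind of instrumentation envisaged is sketched below; this is an illustration of client-side query logging, not the actual ATLAS Client Library interface.

    # Illustrative client-side monitoring: time every query issued through an
    # existing DB-API cursor and record its outcome, complementing server logs.
    import logging
    import time

    logging.basicConfig(filename="db_client.log", level=logging.INFO)

    def monitored_execute(cursor, sql, params=()):
        start = time.time()
        try:
            cursor.execute(sql, params)
            logging.info("OK   %.3fs  %s", time.time() - start, sql)
            return cursor.fetchall()
        except Exception as exc:
            logging.error("FAIL %.3fs  %s  (%s)", time.time() - start, sql, exc)
            raise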

In addition, higher-level monitoring information on server status has been collected both on the server side and remotely, by database monitoring servers that collect and present global information about all ATLAS database servers. Operating-system-level information was collected with the help of the LEMON monitoring services provided by the CERN/IT FIO group and the Ganglia monitoring services at BNL. Database-server-level information was collected via dedicated open-source monitoring tools deployed outside CERN.

Experience shows that experiment-specific monitoring tools are important for efficient monitoring of database services. We are pleased with the progress in ATLAS-specific monitoring capabilities provided by the central physics database services at CERN. As these monitoring needs are shared among all the LHC experiments, we anticipate that the 3D project will be the vehicle for providing effective, distributed database-services monitoring capabilities for ATLAS. The 3D project is evaluating the client-side monitoring system developed at Fermilab for monitoring Oracle query performance.

In planning database-service capacities for robust Grid operations one has to provision for the maximum load. In the controlled, centralized environment of organized data processing these capacities are traditionally easy to calculate and provide for. The much less predictable environment of distributed processing on a federation of Grids makes provisioning the required capacity a challenging task. It is important to study the collective dynamics of the "chaotic" workload exhibited in this environment in order to understand load behaviour and to optimize both the provisioning of the services and their utilization. To this end we have deployed load-monitoring services for our databases and are using them to track collective usage, adjust server deployments and usage patterns, and study workload patterns.

The collective dynamics of database access were studied in DC2, where every computational task retrieved the input parameters required by its jobs from the same database. With thousands of tasks accessing the database daily, the daily average database-server load was 122 queries per second, with a standard deviation of 0.24 of the daily average value, fourteen times larger than expected from purely statistical fluctuations. We believe these large fluctuations reflect a high degree of correlation among the computational tasks running on the three Grids used by ATLAS, which together encompass more than ten thousand processors (e.g. jobs starting simultaneously on many nodes of a processing farm). Monitoring and analyzing such fluctuations will guide us in developing database deployment and production-usage approaches that minimize them, while at the same time allowing us to track service needs and growth.
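
The comparison against purely statistical fluctuations can be made as in the sketch below: for uncorrelated (Poisson-distributed) queries the expected relative fluctuation of the count in an interval is one over the square root of the mean count, so the ratio of the observed relative spread to this baseline measures the degree of correlation in the workload.

    # Sketch of the fluctuation analysis: compare the observed relative spread of
    # per-interval query counts with the Poisson (purely statistical) expectation.
    from math import sqrt
    from statistics import mean, pstdev

    def correlation_factor(counts):
        """counts: queries observed in successive intervals of equal length."""
        mu = mean(counts)
        observed_rel = pstdev(counts) / mu    # observed relative fluctuation
        poisson_rel = 1.0 / sqrt(mu)          # expectation for uncorrelated queries
        return observed_rel / poisson_rel     # ~1 for purely statistical behaviour

    # A value well above 1 (e.g. the factor of fourteen seen in DC2) indicates
    # strongly correlated job activity rather than independent client requests.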

4.8.5 Roles

In accordance with established best practice in ATLAS, users will be granted only those database-access rights required for their work, as determined by their role. To this end we envisage a separation of user roles in database access, with access privileges granted according to the following roles: administrators, developers, editors, writers and readers. Administrators manage the installation of database servers and grant access rights to other users. To develop database applications for particular software domains (e.g. online, the calibrations database, the geometry database, and so on), developers are granted full access rights to a particular database. The editor role covers cases in which UPDATE or DELETE privileges are required. For improved operational security the main applications or scripts that write data to the database are granted only the INSERT and SELECT privileges, so that the data - a growing collection of records - cannot be deleted if a writer account is compromised. The most frequent database operations are performed in the reader role, which has only the SELECT privilege.
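
The privilege separation can be summarized by the GRANT statements an administrator might issue, as in the sketch below (account and schema names are placeholders; the syntax follows MySQL conventions):

    # Role-to-privilege mapping and the corresponding GRANT statements.
    # Account and schema names are placeholders for illustration.
    ROLE_PRIVILEGES = {
        "reader":    "SELECT",
        "writer":    "SELECT, INSERT",                    # can add records, never remove them
        "editor":    "SELECT, INSERT, UPDATE, DELETE",
        "developer": "ALL PRIVILEGES",
    }

    def grant_statement(role, account, schema="conditions"):
        return f"GRANT {ROLE_PRIVILEGES[role]} ON {schema}.* TO '{account}'@'%'"

    # Example: the production writer account can insert but cannot delete.
    print(grant_statement("writer", "atlas_writer"))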

It is expected that only a small number of users will have administrative rights on production servers. The number may be larger on development servers. We are investigating long-term solutions for password management; in particular, a database authentication mechanism based on Grid proxy certificates that eliminates the need for clear-text passwords.

4.8.6 Database Authentication

Several approaches exist for providing database services within the Open Grid Services Architecture (OGSA). A key element in the architecture is the Grid security model providing database-access control. Two implementation options have been identified. In the first, a separate server (e.g. an LDAP server) performs Grid authorization on behalf of the database server. The alternative is to perform Grid authentication and authorization in the database server itself, which avoids the significant data-transfer overheads inherent in the multilayer authentication option. The second option has been implemented in an ATLAS Grid-enabled MySQL server prototype, avoiding the use of clear-text passwords and eliminating the significant encryption and cryptographic-handshake overheads generated when a stateless protocol is used. The prototype was successfully tested with certificates issued by the DOE, CERN and NorduGrid certificate authorities, and was tested in ATLAS Data Challenge 1 production for authentication and database write-access control. This experience has been promising, and further work is under way to extend the backward compatibility of Grid proxy tools and to add server-purpose information to Grid host certificates. Work is under way in 3D to implement the first option using an LDAP server together with Oracle. Oracle itself is exploring the second option, but the timescale is unclear.

An extended Grid security model will benefit from the introduction of shared Grid certificates (similar to the privileged accounts traditionally shared in HEP computing for production, librarian, data management and database administration tasks) as well as tools for establishing a Grid proxy upon login (similar to the AFS token acquisition procedure).


