Abstract: The last decade has seen unprecedented growth in grid infrastructures. Grid characteristics such as high heterogeneity, complexity, and distribution create many new technical challenges that need to be addressed. Among these challenges, failure management is a key area, important both for applications and for grid operation activities. In this thesis I have undertaken a comprehensive analysis and assessment of several services of the gLite middleware currently in use in the EGEE Grid, the largest grid infrastructure in the world. Sites in the EGEE production grid infrastructure are required to provide their services on a continuous basis, and the same is true for central grid infrastructure services. It is therefore important to know not only the current status of the various sites and central services but also how that status develops over the long run. Service level agreement (SLA) negotiation plays a very important role in manufacturing grids. I extended the Nagios monitoring framework with high-availability features in order to implement an efficient grid monitoring system. The main goal of this system is to achieve better availability of grid hosts and services through precise problem detection and instant notification. This would also enable the use of the system's mechanisms for automatic recovery of services in order to improve the current rates, and so improve the availability and reliability of the EGEE grid production infrastructure. The aim of such an initiative is to provide a sustainable infrastructure based on National Grid Initiatives (NGIs), with the final result of delivering a large-scale production grid infrastructure able to provide reliable and predictable services.