Abstract: Also available as INRIA Research Report 5091: http://www.inria.fr/rrrt/rr-5091.html. A new kind of application has emerged: code coupling applications, which can be divided into modules and often need to run on several clusters. The large architectures they run on, which we call "cluster federations", contain a large number of nodes, so faults may occur very frequently. A fault tolerance mechanism suited to these architectures and to this kind of application is therefore needed. We propose a hierarchical checkpointing protocol that combines synchronized methods inside clusters with communication-induced methods between clusters. The protocol has been evaluated by discrete event simulation. The first results show that it works well for the targeted applications.