Faculty of Informatics

On multicast primitives in large networks and partial replication protocols

Schiper, Nicolas ; Pedone, Fernando (Advisor)

Doctoral thesis: Università della Svizzera italiana, 2009 ; 2009INFO005.


Summary

Recent years have seen the rapid growth of web-based applications such as search engines, social networks, and e-commerce platforms. As a consequence, our daily life activities rely on computers more and more each day. Designing reliable computer systems has thus become of prime importance. However, reliability alone is not sufficient: these systems must support high loads of client requests, so scalability is highly valued as well. In this thesis, we address the design of fault-tolerant computer systems. More precisely, we investigate the feasibility of designing scalable database systems that offer the illusion of accessing a single copy of a database, despite failures. This study is carried out in the context of large networks composed of several groups of machines located in the same geographical region. Groups may be data centers, each within a local-area network, connected to one another through high-latency links. In these settings, the goal is to minimize the use of inter-group links. We mask failures using data replication: if one copy of the data is not available, a replica is accessed instead. Guaranteeing data consistency in the presence of failures while offering good performance constitutes the main challenge of this thesis. To reach this goal, we first study fault-tolerant multicast communication primitives that offer various message ordering guarantees. We then rely on these multicast abstractions to propose replication protocols in which machines hold a subset of the application's data, a scheme known as partial replication. In contrast to full replication, partial replication may offer better scalability, since updates need not be applied to every machine in the system. More specifically, this thesis makes contributions in the distributed systems domain and in the database domain.
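The contrast between full and partial replication can be illustrated with a minimal sketch. All names here (the item-to-group placement map, the per-group stores) are hypothetical and chosen for illustration only; the thesis's actual protocols involve multicast primitives and certification, not a simple dictionary lookup.

```python
# A minimal sketch of partial replication, assuming a hypothetical
# static mapping from data items to the groups that replicate them.

ITEM_TO_GROUPS = {          # hypothetical placement of items on groups
    "x": {"g1"},
    "y": {"g2"},
    "z": {"g1", "g2"},
}

stores = {"g1": {}, "g2": {}, "g3": {}}  # per-group key-value stores


def apply_update(item, value):
    """Apply an update only at the groups that replicate the item.

    Under full replication every group would receive every update;
    here, groups that do not hold the item are left untouched, which
    is what makes partial replication potentially more scalable.
    """
    for group in ITEM_TO_GROUPS[item]:
        stores[group][item] = value


apply_update("x", 1)   # reaches g1 only
apply_update("z", 2)   # reaches g1 and g2; g3 stays idle
```

Note that group g3 receives no messages at all for these updates; in a full-replication scheme it would have to apply both.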
In the distributed systems domain, we first devise FIFO and causal multicast algorithms, primitives that ease the design of replicated data management protocols, as we will show. We then study atomic multicast, a basic building block for synchronous database replication. Two failure models are considered: one in which groups are correct, i.e., every group contains at least one process that is always up, and one in which groups may fail entirely. In the first failure model, we establish a tight lower bound on the minimum number of inter-group message delays required for atomic multicast. When an arbitrary number of processes may fail and process failures cannot be predicted, we demonstrate that erroneous process failure suspicions cannot be tolerated. We then present atomic multicast protocols for both correct and faulty groups and compare their performance empirically. The majority of the proposed algorithms are latency-optimal.

In the database domain, we extend the database state machine (DBSM), a previously proposed full replication technique, to partial replication. In the DBSM, transactions are executed locally at one database site according to the strict two-phase locking policy; to ensure global data consistency, a certification protocol is triggered at the end of each transaction. We present three certification protocols that differ in the communication primitives they use and in the amount of per-transaction information they store. The first two algorithms are tailored to local-area networks and ensure that sites unrelated to a transaction T permanently store only the identifier of T. The third protocol is more general, as it is not customized for any particular type of network; furthermore, with this protocol, only the sites that replicate data items read or updated by T are involved in T's certification.
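The certification step at the heart of the DBSM can be sketched as a simple backward-validation check: a transaction may commit only if no transaction that committed after it started wrote an item it read. The sketch below is a toy illustration under that assumption; the data structures and function names are hypothetical and do not reflect the thesis's actual protocols, which additionally use atomic multicast to agree on the certification order.

```python
# A toy sketch of DBSM-style certification, assuming a hypothetical
# log of committed write sets indexed by a global commit position.

committed_writes = []  # list of (commit_position, write_set)


def certify(readset, writeset, start_position):
    """Commit a transaction iff no transaction that committed after
    the transaction started wrote an item it read (backward validation)."""
    for pos, ws in committed_writes:
        if pos > start_position and ws & readset:
            return False  # read-write conflict: abort
    committed_writes.append((len(committed_writes) + 1, set(writeset)))
    return True


assert certify({"x"}, {"y"}, start_position=0) is True   # no committed conflicts
assert certify({"y"}, {"y"}, start_position=0) is False  # y written after start: abort
```

In a partial replication setting, the point of the third protocol in the thesis is that only sites replicating items in T's read or write set need to run this check for T.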