Show simple item record

dc.contributor.authorSanz-Marco, Vicent
dc.date.accessioned2016-04-15T10:13:24Z
dc.date.available2016-04-15T10:13:24Z
dc.date.issued2016-04-15
dc.identifier.urihttp://hdl.handle.net/2299/17110
dc.description.abstractA distributed system is defined as a collection of autonomous computers connected by a network, and with the appropriate distributed software for the system to be seen by users as a single entity capable of providing computing facilities. Distributed systems with centralised control have a distinguished control node, called leader node. The main role of a leader node is to distribute and manage shared resources in a resource-efficient manner. A distributed system with centralised control can use stream processing networks for communication. In a stream processing system, applications typically act as continuous queries, ingesting data continuously, analyzing and correlating the data, and generating a stream of results. Fault tolerance is the ability of a system to process the information, even if it happens any failure or anomaly in the system. Fault tolerance has become an important requirement for distributed systems, due to the possibility of failure has currently risen to the increase in number of nodes and the runtime of applications in distributed system. Therefore, to resolve this problem, it is important to add fault tolerance mechanisms order to provide the internal capacity to preserve the execution of the tasks despite the occurrence of faults. If the leader on a centralised control system fails, it is necessary to elect a new leader. While leader election has received a lot of attention in message-passing systems, very few solutions have been proposed for shared memory systems, as we propose. In addition, rollback-recovery strategies are important fault tolerance mechanisms for distributed systems, since that it is based on storing information into a stable storage in failure-free state and when a failure affects a node, the system uses the information stored to recover the state of the node before the failure appears. In this thesis, we are focused on creating two fault tolerance mechanisms for distributed systems with centralised control that uses stream processing for communication. These two mechanism created are leader election and log-based rollback-recovery, implemented using LPEL. The leader election method proposed is based on an atomic Compare-And-Swap (CAS) instruction, which is directly available on many processors. Our leader election method works with idle nodes, meaning that only the non-busy nodes compete to become the new leader while the busy nodes can continue with their tasks and later update their leader reference. Furthermore, this leader election method has short completion time and low space complexity. The log-based rollback-recovery method proposed for distributed systems with stream processing networks is a novel approach that is free from domino effect and does not generate orphan messages accomplishing the always-no-orphans consistency condition. Additionally, this approach has lower overhead impact into the system compared to other approaches, and it is a mechanism that provides scalability, because it is insensitive to the number of nodes in the system.en_US
dc.language.isoenen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectfault toleranceen_US
dc.subjectS-Neten_US
dc.subjectLeader Electionen_US
dc.subjectCheckpoint restoreen_US
dc.subjectlog-based rollback recoveryen_US
dc.subjectdistributed systemsen_US
dc.subjectcompilersen_US
dc.subjectstreaming networksen_US
dc.subjectshared memory systemsen_US
dc.titleFault Tolerance for Stream Programs on Parallel Platformsen_US
dc.typeinfo:eu-repo/semantics/doctoralThesisen_US
dc.identifier.doi10.18745/th.17110
dc.identifier.doi10.18745/th.17110
dc.type.qualificationlevelDoctoralen_US
dc.type.qualificationnamePhDen_US
herts.preservation.rarelyaccessedtrue


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record