Towards a Next-Generation Runtime Infrastructure Engine for Configuration Management Systems


University of Colorado, Boulder
CU Scholar

Computer Science Graduate Theses & Dissertations
Computer Science
Spring

Towards a Next-Generation Runtime Infrastructure Engine for Configuration Management Systems

Ali Alzabarah
University of Colorado at Boulder, ali.alzabarah@cs.colorado.edu

Follow this and additional works at:

Part of the Systems Architecture Commons

Recommended Citation
Alzabarah, Ali, "Towards a Next-Generation Runtime Infrastructure Engine for Configuration Management Systems" (2014). Computer Science Graduate Theses & Dissertations.

This Dissertation is brought to you for free and open access by Computer Science at CU Scholar. It has been accepted for inclusion in Computer Science Graduate Theses & Dissertations by an authorized administrator of CU Scholar. For more information, please contact cuscholaradmin@colorado.edu.

Towards a Next-Generation Runtime Infrastructure Engine for Configuration Management Systems

by

Ali Alzabarah

B.S., University of Northern Iowa, 2009
M.S., University of Colorado at Boulder, 2011

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Computer Science
2014

This thesis entitled:
Towards a Next-Generation Runtime Infrastructure Engine for Configuration Management Systems
written by Ali Alzabarah
has been approved for the Department of Computer Science

Kenneth M. Anderson
Prof. Richard Han
Prof. John Black
Prof. James Martin
Dr. Mazdak Hashemi

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Alzabarah, Ali (Ph.D., Computer Science)

Towards a Next-Generation Runtime Infrastructure Engine for Configuration Management Systems

Thesis directed by Prof. Kenneth M. Anderson

A common approach to configuration management is to couple a high-level declarative programming language with a runtime engine. The language is used to specify configurations and the engine is used to deliver and apply those configurations on a set of computing resources. The design and architecture of current runtime engines of configuration management systems lack (1) essential coordination and synchronization of actions between computing resources and (2) strong security mechanisms. This thesis examines a number of techniques that can be applied to the area of configuration management to address these limitations. In particular, the combination of these techniques leads to a new architecture for the runtime engines of modern configuration management systems, providing them with secure coordination and synchronization capabilities. A prototype of this new approach was developed and evaluated in an environment that simulates highly-demanding computing landscapes, and the results show that the new architecture is able to reduce the occurrence and impact of configuration errors in these environments.

Dedication

To Hamad, Swidah, and Maha.

Acknowledgements

I would like to thank my advisor, mentor, and professor, Ken Anderson, for his support. This work would not have been possible without his guidance. Ken supported me at every level during my time in school and was always there when I needed him. I would also like to thank my committee for their help, comments, and motivation throughout the process. I especially want to thank Professor John Black for his time and effort to keep my graduate life as easy as it could be and for all his valuable advice. I would like to thank my colleagues at the Twitter Site Reliability Engineering Organization for their valuable advice and guidance, especially my mentor David Barr, along with Drew Dickson, Greg Maccarone, Fernando Aguasvivas, Charlie Moore, and Toby Weingartner. I also thank Harold Gonzales from Google, Matthew Woitaszek from Walmart Labs, and Matthew Monaco, Andy Sayler, Mark Dehus, and Chris Schenk from CU Boulder for their early feedback on this research. Thanks also to Professors Bryan Dixon and Alireza Mahdian for their advice and feedback on producing a thesis, and I thank Yolande Mclean and Peter Schares for proofreading this document. Finally, I would like to thank the great people at the Project EPIC lab for their support and for letting me use their environment to run my experiments.

Contents

Chapter

1 Background
   Late Eighties Computing
   Late Nineties Computing
   Modern Development Ecosystem

2 Introduction
   CMS History
   CMS Components
   CMS Shortcomings
   Dissertation Outline

3 Problem Statement

4 Related Work
   CMS Design and Architecture
   CMS Constraints
   CMS Workflows and Access Control

5 Design and Architecture
   Overview
   Motivation

   5.3 Supporting Frameworks and Techniques
      Coordination Framework
      Messaging Framework
      Distributed Systems Techniques
   System Components
      Change Worker Coordinator
      Change Consumer Coordinator
      Client Change Agent
      Monitoring Layer
      Security Layer

6 Implementation
   Overview
   ZooKeeper
   RabbitMQ

7 Evaluation
   Overview
   Environment
   Experimental Data Set
   Deployment and Experimental Configurations
   Experimental Criteria
      Correctness
      Latency
      Scalability
   Impact of Non-Functional Design Choices

8 Contributions

9 Future Work

10 Conclusion

Bibliography

Appendix

A Additional Data

Tables

Table

5.1 Zookeeper znodes used to Track a Change Request
    Interface to the Coordination Layer
A.1 Latencies in Seconds with Different Queue & Message Types

Figures

Figure

1.1 The Development Ecosystem
1.2 Hiding the Complexities of Hypervisors and Cloud Providers
2.1 Puppet Language Dataflow
    Distribution of CMS Client-To-Master Connections
    Distribution of CMS Client-To-Master Connections by Nodes in Same Context
    Shift in Time of CMS Client-To-Master Connection by Single Node
    CWC Authentication and Authorization Workflow
    CWC Time Coordinating, Change Registration, and Change Delivery Workflow
    CCC Workflow
    Size of Change Requests Used in Evaluation
    Hour of Minute when Change Request Submitted
    Coordination of CMS Masters by Caerus
    Time Required by Master Nodes to Apply Changes
    Times at which CMS Clients Contacted CMS Masters for Changes
    Times at which CMS Clients Applied Changes to their Host Machines
    CWC Latency with Message Size
    CCC Latency with Message Size
    CWC Latency and Queue Durability
7.10 CCC Latency and Queue Durability

Chapter 1

Background

This thesis concerns itself with research issues related to the task of configuration management in large-scale computing environments and the manageability of modern configuration management systems. While significant progress has been made in configuration management, in large-scale environments much of the work in keeping a system alive and functional is still manual, labor intensive, and prone to expensive failures and downtime. This chapter details the current need for further research in configuration management and presents a brief history of how challenges faced by configuration management systems are tied to changes in modern computing environments.

1.1 Late Eighties Computing

In the late eighties and early nineties, developers were able to work more efficiently and compile programs more quickly due to a significant growth in the memory and CPU power of computing systems. During this period, new operating systems, Linux and Windows, were created alongside new object-oriented programming languages, such as C++ and Java. However, advances in hardware easily outpaced the progress being made in software development. In fact, as Brooks so famously claimed, no other technology has advanced as quickly as the improvements seen in computing hardware [4]. In contrast, the slow progress in our ability to create and develop software stems from essential difficulties intrinsic to the nature of software design and development; progress on the accidental difficulties (difficulties associated with a single tool or technique) is not enough to match the improvements in hardware predicted by Moore's Law.

Not everyone had access to computers in the eighties; wide-spread use was limited to research labs, big businesses, and government agencies. In the early nineties, the personal computer revolution began, and the computer was more commonly used as a means of communication, especially via the Internet. The revolution exposed a wider audience to computers and to programming and that, in turn, led to advances in compilers and to the birth of new programming languages and paradigms. Since the beginning of the personal computer revolution and the computer's widespread use as a means of communication, software development has become increasingly complex. As a result, it was during this time frame that a wide range of methodologies, practices, and tools were developed to try to improve the software development process, including architectural patterns, design patterns, agile methods, and more.

1.2 Late Nineties Computing

In the late nineties and early 2000s, computing experienced yet another technological revolution, driven by the pervasive adoption of the Internet. More users started using the online capabilities of their computers and the amount of data being generated by that usage began to increase rapidly. As a result, society's computing needs are now far beyond what a single computer can handle. However, during this same time frame, advances in computing hardware have slowed as the limits of Moore's Law are encountered. The clock speeds of new CPUs can no longer double every eighteen months as they once did; instead, CPUs are growing wider, with increasing numbers of cores and greater memory bandwidth. This shift has significant impacts on software development, since software developers must now grapple with the complexities of concurrent software systems in order to see significant performance improvements in their systems.
For example, in this age of big data, the amount of information that is routinely collected and analyzed by large organizations is in the petabytes and the old models of single powerful machines and single-threaded software systems are no longer sufficient. With the advent of cheaper (not necessarily faster) hardware, large companies such as Google realized that the future of large-scale software development involved designing, developing, and deploying distributed, scalable, failure-resilient, and robust software on

commodity hardware. It was no longer feasible to build a single, powerful (and expensive) server; instead they chose to scale out instead of scaling up [10]. With this transition, alongside other factors such as cloud computing, dynamic resource allocation, and big data analytics, new challenges emerged. Relevant to this thesis is the challenge of managing the configuration of the machines that host these large, distributed software systems. A related challenge is managing the deployment of updates and changes to the software system itself. If updates occur out-of-order or to just a subset of machines, the entire infrastructure can grind to a halt as incompatibilities cause systems to fail and, in these environments, such failures can lead to the loss or corruption of critical data sets. These software systems require not just a few machines, but potentially clusters of thousands of heterogeneous machines, each with different operating systems, compilers, libraries, applications, etc.

1.3 Modern Development Ecosystem

In this section we look at how the need to develop distributed, scalable, and failure-resilient software introduced challenges to the relationships between software engineers, operators, and end users. We then go over two major research directions addressing those challenges.

Software Engineers and Operators

In this new software development context, software engineers can no longer design and deploy systems without the knowledge, skills, and help of system administrators and other IT operations personnel. It is a common software engineering practice to minimize the communication cost between development teams [5], and communication with the operational side of the equation was traditionally minimal or non-existent. Now, software engineers need to design software that will operate on clusters of machines, and when it comes time to build, test, and deploy these systems, they find themselves at the mercy of system administrators and their schedule.
Of course, creating, operating, and maintaining clusters of machines is not an easy task and so delays in responding to these requests are reasonable. This situation of additional communication and coordination costs means

that once again the complexity of software development has increased.

Figure 1.1: The Development Ecosystem of Cluster Computing

Software Engineers and End Users

This new paradigm of cluster computing has also presented challenges to the end users of these systems. Research scientists and business analysts no longer have the ability to use traditional data analysis tools and techniques on their increasingly large data sets. In addition, the nature of these data sets has shifted from well-ordered, structured relational data to include more unstructured, messy data types that are difficult to process without a lot of filtering and preprocessing. This type of processing requires the use of unfamiliar (to them) programming paradigms and storage mechanisms, such as MapReduce [10, 18] and NoSQL [8, 11]. As a result, end users require the skills of software engineers to write the software that will do the analysis for them. This places additional constraints and complexity on the user-developer relationship just as the complexity of the developer-operator relationship has increased.

Research Directions

Driven by these challenges, researchers are now working to address the gaps between the three elements of the triangle (users, developers, and operators) shown in Fig. 1.1. For instance, to allow end users the ability to participate in the analysis of large data sets, abstractions have

been built on top of MapReduce that hide the complexity of writing MapReduce jobs as well as the complexity of moving data to a cluster, submitting jobs, and retrieving the results. Indeed, projects like Pig and Hive, developed respectively at Yahoo and Facebook, attempt to mimic familiar conventions of SQL-based queries so that analysts can re-use as much of their skills from the relational world as possible. The abstraction provided by these tools (SQL-like queries) and the functionality they provide behind that abstraction (to auto-generate MapReduce jobs, execute them, and assemble the results in an understandable format) help to reduce the overall complexity of the development relationships between users, developers, and operators. This thesis concerns itself with improving the relationship between software engineers (developers) and system administrators (operators). There are two active areas of research that address this relationship. The first aims to reduce the complexity of provisioning, deploying, and allocating computational resources. This type of research develops simple APIs that allow computational resources (both virtual machines and physical servers) to be allocated, managed, and terminated as needed. Fig. 1.2 presents an example of two such technologies, Libcloud [17] and Libvirt [36]. Libcloud provides an abstraction that allows the services of various cloud providers (for instance, Amazon EC2 [26], Openstack [29], and Softlayer [30]) to be treated uniformly. A developer can write code that interacts with a set of virtual machines that live in the cloud and have this same code work on multiple providers. This functionality allows developers to choose a cloud provider that meets their needs and easily switch to a different provider if those needs change.
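The uniform-interface idea behind Libcloud can be illustrated with a short sketch. This is not Libcloud's actual API; `NodeDriver`, `FakeEC2Driver`, and `provision_cluster` are hypothetical names invented here to show how provider-specific drivers can sit behind one abstract interface:

```python
from abc import ABC, abstractmethod

class NodeDriver(ABC):
    """Minimal provider-agnostic interface, in the spirit of Libcloud."""
    @abstractmethod
    def create_node(self, name): ...
    @abstractmethod
    def list_nodes(self): ...
    @abstractmethod
    def destroy_node(self, name): ...

class FakeEC2Driver(NodeDriver):
    """Illustrative stand-in for a real provider driver; keeps nodes in memory."""
    def __init__(self):
        self._nodes = {}
    def create_node(self, name):
        self._nodes[name] = "running"
        return name
    def list_nodes(self):
        return sorted(self._nodes)
    def destroy_node(self, name):
        del self._nodes[name]

def provision_cluster(driver, size):
    # The same provisioning code works for any provider behind NodeDriver.
    return [driver.create_node(f"node-{i}") for i in range(size)]

driver = FakeEC2Driver()
provision_cluster(driver, 3)
print(driver.list_nodes())  # ['node-0', 'node-1', 'node-2']
```

Switching providers then means swapping in a different `NodeDriver` subclass while the provisioning code stays unchanged, which is the portability property the text describes.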
Libvirt performs a similar service for managing a set of virtual machines on a physical server while abstracting away the details of the particular hypervisor being used to implement and execute the virtual machines. The second research direction attempts to simplify the configuration of computing resources to ensure that they are ready to execute the applications that are deployed on them. To do this, all of an application's dependencies need to be tracked, and the configuration management system must ensure that these dependencies are installed and functioning correctly before it downloads and installs the application that requires them. Furthermore, it must do this for all of the applications

that are expected to run on a particular host. Finally, to be truly useful for the new development paradigm of cluster computing, a configuration management system must do this for all of the machines present in a cluster and, more typically, for all machines present in multiple clusters and data centers. While much progress has been made in configuration management, it is at this largest scale (multiple data centers with multiple clusters of thousands of machines) where research is needed to provide better abstractions, to reduce the amount of manual labor required to manage these systems, and to reduce the possibility of expensive and hard-to-fix configuration errors.

Figure 1.2: Hiding the Complexities of Hypervisors and Cloud Providers

In this area, a common approach to configuration management is to couple a high-level declarative programming language with a runtime engine. The language is used to specify configurations and the engine is used to execute commands to take a set of machines and place them in a known configuration. Examples of such systems include [3, 6, 12, 14, 19, 28, 32]. Despite these past systems, much work remains to be done in the area of configuration management. At the language level, issues of semantics, packaging, and analysis are all areas of active research, while at the engine level, issues of scalability, synchronization, and efficiency require additional work in order to have techniques and technology that have a chance to operate in large-scale environments. This research has a lot of potential for significantly reducing the complexity of cluster computing and reducing the gaps that exist between the roles of developer and operator.
Indeed, this area of research is responsible for the recent advances in software engineering such as continuous delivery and is the primary motivation behind the DevOps movement in the computing industry.
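The dependency-tracking requirement described above (install every dependency before the application that needs it) reduces to ordering a dependency graph. The following is a minimal sketch using Python's standard `graphlib` module; the package graph is invented for illustration and does not come from the thesis:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each item maps to the packages it requires.
deps = {
    "webapp":  ["python", "nginx"],
    "nginx":   ["openssl"],
    "python":  ["openssl"],
    "openssl": [],
}

def install_order(graph):
    """Return an order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())

order = install_order(deps)
assert order.index("openssl") < order.index("nginx")
assert order.index("nginx") < order.index("webapp")
print(order)
```

A real runtime engine must additionally verify that each dependency is functioning (not merely installed) before moving down the ordering, and must detect cycles, which `TopologicalSorter` reports by raising an exception.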

Chapter 2

Introduction

In this chapter, we present an introduction to configuration management (CM) and the notion of a configuration management system (CMS). We briefly go over the history of this field and then describe the software architecture that most CMSs use. Finally, we present common CMS operational practices, their shortcomings, and the challenges of operating CMSs in large-scale computing environments. The following terms related to CMSs are used in the rest of this thesis:

Node: A computing resource managed by a configuration management system. This can be a physical machine, a cloud instance, a virtual machine, etc.

Resource Configuration: A block of code in a domain-specific language that defines a resource to be managed. A resource can be a file, a package, a service, etc. on a node.

Resource Package: A file that contains one or more resource configurations. In Puppet [32], a resource package is called a manifest, while in Chef [28] it is called a recipe.

Configuration Repository: A directory of resource packages. This directory is typically under the management of a version control system (VCS), such as Git or Subversion.

2.1 CMS History

Configuration management is an interesting and challenging problem; it has attracted the attention of researchers from a variety of disciplines. It is an important area of work, as configuration errors in large-scale environments are among the more difficult errors to track down and, thus, take the longest time to repair; are often unpredictable; and account for 50% of the errors that lead to service failures [12, 37, 38]. Changing a system from one configuration to another is highly problematic due to the large number of dependencies that exist among networked systems. Additional complexity is encountered due to the heterogeneous nature of these systems in terms of hardware, operating systems, libraries, and software, as well as the changing nature of software requirements and the dynamicity of allocating computing resources in these environments. CMSs have a long history; they were initially designed to automate the management and deployment of changes to software systems. Examples of CMSs include CFEngine [6], Bcfg2 [14], LCFG [3], Puppet [32], Chef [28], and Smartfrog [19]. While these systems differ in implementation, their level of abstraction, their language type, user interface, and runtime infrastructure, they all have the same goal: hiding the complexity of change management, deployment, and maintenance in a complex, heterogeneous computing environment behind a simple API [13].

2.2 CMS Components

We have analyzed the following CMSs in detail: Puppet, Chef, Bcfg2, and CFEngine; they each consist of two primary components. The first component is a high-level programming language that is used to create resource configurations. The programming language can be a declarative, domain-specific language (DSL), as it is in Bcfg2 and CFEngine, or it can be both declarative and imperative, as it is in Puppet and Chef. In Listing 2.1, we present a typical resource package for Puppet that contains a resource configuration for the OpenSSH server daemon, sshd. This package ensures that this service will be installed and running for any machine that includes this package.
The second component is the runtime infrastructure engine, which interprets resource packages and performs the steps to apply a specified configuration to a machine or set of machines. These runtime engines typically adopt a client-server architecture and are responsible for managing all aspects of change management including responding to requests to apply a particular configuration, deploying resource packages to machines, and applying updates and changes to individual

machines in parallel to bring a cluster of machines to a known state.

Listing 2.1: Example of a Puppet Resource Configuration

class ssh-server {
  package { 'openssh-server':
    ensure => installed,
  }

  service { 'sshd':
    ensure  => running,
    enable  => true,
    require => Package['openssh-server'],
  }

  file { '/etc/ssh/sshd_config':
    notify  => Service['sshd'],
    mode    => '600',
    owner   => 'root',
    group   => 'root',
    content => template('ssh/config.erb'),
  }
}

The focus of this thesis is on developing new techniques and technology in the area of runtime infrastructure engines. However, we will also discuss some issues and work related to a CMS's high-level language in Chapter 4. One important aspect of CMSs is client-server authentication. A CMS needs to be able to verify that it is interacting with just the clients that are supposed to be in the clusters that it is managing. The mechanism for verifying that a client is authorized to participate in the CMS is known as authoritative configuration. This mechanism can differ across CMSs. For instance, in Puppet, a Puppet client that has been launched for the first time will accept and store the certificate of the first Puppet server (also known as a master) that contacts it. It will generate a pair of cryptographic keys (known as its public/private keys) and a certificate signing request (CSR). The CSR is sent to the master for signing, and once this process completes the server and client have everything they need to communicate securely with one another via the SSL protocol. At that point, every future communication between the client and the master is validated using standard SSL verification mechanisms. It is important to note that in this software architecture,

clients do not communicate with each other; they are not even aware of each other's existence. Indeed, they do not coordinate or synchronize their communication with the master in any way, even if they belong to the same cluster and participate in the same high-level application (for instance, as a group of nodes in a Hadoop cluster).

Figure 2.1: Puppet Language Dataflow [40]

With respect to the task of deploying resource packages from a configuration repository to a set of machines, there are two primary approaches in current use. A client can pull packages from the server or a server can push packages to a client. Regardless, once the resource package has made it to the client, the client will interpret the resource configurations in the package and apply them to its local machine to reach the specified state. As an example, in Fig. 2.1, Puppet's client connects to the server to retrieve a set of resource configurations. The server transforms these resource configurations into an intermediate state, known as a catalog, and returns the catalog to the client. The client applies the changes associated with the catalog and returns a report of the results. As discussed above, resource configurations are described using a DSL and stored in text files. As a result, it is a common practice to keep these configurations in a version control system to make it easy to track changes, share code, revert changes, and encourage contribution from various

stakeholders. These files are usually edited by an operator (such as a software developer, system administrator, etc.) on their local machine, and then pushed to a remote repository that is stored on the same machine as the server of the CMS. Once pushed, these changes are immediately visible to the CMS and can be deployed to those clients that need to be updated.

2.3 CMS Shortcomings

With the size of today's computing environments and data centers, one server cannot handle the load generated by the (potentially thousands of) clients. It is also impractical to rely on one server, since it would represent a single point of failure for a very large computing environment. Multiple servers serving multiple clients are thus required in large companies that manage thousands of machines in multiple data centers. However, one of the shortcomings of current CMSs is that they do not coordinate or synchronize resource configuration versions between servers. This lack of synchronization can lead to two clients in the same context (cluster) seeing different configurations. This incongruity will likely lead to unpredictable behavior and eventually to service failure. A second shortcoming of current CMSs is the lack of coordination when applying changes between clients participating in the same context (e.g., Hadoop workers), which might lead to service interruption and decrease overall cluster availability and/or performance. For example, if we have a 100-node Cassandra cluster in which 50 nodes applied a change that resulted in Cassandra being upgraded to version 1.1, while the other 50 nodes are still using a previous version of Cassandra, the data stored in Cassandra's repository might become corrupted or lost as the cluster ceases to function properly.
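One way to sketch the coordination missing from the Cassandra scenario above is a batched rollout that halts when a post-change health check fails, so a bad change never reaches the whole cluster. The `rolling_apply` function and its `healthy` check below are hypothetical, simplified stand-ins for what a coordinated runtime engine would provide, not part of any existing CMS:

```python
def rolling_apply(nodes, apply_change, healthy, batch_size):
    """Apply a change in batches, committing each batch only if it stays healthy.

    apply_change(node) changes one node; healthy(node) is an illustrative
    post-change check (in practice: service status, replica consistency, etc.).
    """
    committed = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            apply_change(node)
        if not all(healthy(n) for n in batch):
            return committed, False      # halt: most of the cluster is untouched
        committed.extend(batch)
    return committed, True

nodes = [f"cassandra-{i}" for i in range(100)]
upgraded = set()
ok, complete = rolling_apply(
    nodes,
    apply_change=upgraded.add,
    healthy=lambda n: n != "cassandra-10",   # simulate a failure in batch two
    batch_size=10,
)
print(len(ok), complete)  # 10 False
```

With uncoordinated clients, by contrast, all 100 nodes pull and apply the change independently, and a faulty upgrade reaches every replica before anyone notices.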
The client-server model and the assumption that a CMS is running in a trusted environment can lead to an error-prone situation, and these errors can have a serious impact on the performance of a large-scale computing infrastructure. If an operator changes the definition of a resource configuration and pushes it to the server, all clients will eventually pick up the changes and apply them. This means that anyone who has access to the version control system that manages these definitions has the power to change any resource configuration that will eventually be applied to

all machines, including those to which this person normally does not have access. Current CMSs handle the authentication between a server and client very well, assuming they are operating in a trusted zone. That is, they are well defended against external intrusion, but not against unskilled operators making changes to resource configurations and pushing them such that these authentication measures are bypassed. Pushing a change to the configuration repository on a CMS's masters should not necessarily mean authorizing the change on the clients. While CMSs have proven helpful in addressing some of the challenges in managing and deploying changes in configuration, they have several shortcomings in large-scale computing environments, as described above. This thesis will describe new techniques and technology that can be used to address these shortcomings by improving the capabilities and characteristics of the runtime infrastructure engines that are core components of CMSs.

2.4 Dissertation Outline

Next, we present the problem statement of this thesis. Then, in Chapter 4, we review related work and present current CM research directions. Our approach to solving the identified shortcomings of CMSs, along with its challenges and benefits, is described in Chapter 5. We present our evaluation in Chapter 7 and identify future work in Chapter 9.
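The gap identified in Section 2.3 between pushing a change and authorizing it could be closed by requiring an explicit approval artifact that clients verify before applying anything. The following is a minimal sketch using an HMAC over the change content; the approval key, workflow, and function names are illustrative assumptions, not a mechanism of any existing CMS:

```python
import hashlib
import hmac

# Illustrative secret held by a separate approval role, not by repository committers.
APPROVAL_KEY = b"held-by-release-approvers-not-committers"

def approve(change: bytes) -> str:
    """A separate approval step signs the exact change content."""
    return hmac.new(APPROVAL_KEY, change, hashlib.sha256).hexdigest()

def client_should_apply(change: bytes, signature: str) -> bool:
    """Clients verify the approval before applying, so a repository push
    alone is never sufficient to reconfigure a machine."""
    expected = hmac.new(APPROVAL_KEY, change, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

change = b"package { 'openssh-server': ensure => latest }"
sig = approve(change)
print(client_should_apply(change, sig))       # True
print(client_should_apply(b"tampered", sig))  # False
```

In a production design the shared key would be replaced by asymmetric signatures, so clients hold only a public verification key, but the separation of duties being illustrated is the same.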

Chapter 3

Problem Statement

This thesis examines open issues in the area of configuration management. Current techniques and technology in this field are sufficient in the context of a single machine or a small cluster of machines. However, in large-scale computing environments (multiple clusters of thousands of machines across multiple data centers), several limitations become apparent: security mechanisms can be bypassed and the impact of updates cannot be guaranteed or predetermined. These limitations can (and do) lead to configuration errors that are difficult to debug, costly to fix [41], and lead to a range of problems from impeded performance to complete service failure. This thesis examines a number of techniques that can be applied to the area of configuration management to address these limitations and allow this class of techniques and technology to scale to large-scale computing environments. The combination of these techniques leads to a new design for the runtime infrastructure engines of modern configuration management systems, focused on providing scalable and secure coordination and synchronization capabilities. To evaluate the utility of these techniques, a prototype system, called Caerus, was designed and its new capabilities were evaluated in an environment that simulates the highly-demanding computing landscapes of large-scale technology companies. The new runtime infrastructure engine was evaluated along a range of metrics including overall response time, its ability to correctly coordinate and synchronize changes under sustained load, and the latency between its components. To our knowledge, this thesis advances the state of the art in configuration management, providing a new combination of techniques that greatly reduces the amount of manual

intervention required by human operators to make use of configuration management systems in large-scale computing environments, while also making the environment more secure and greatly reducing the number of configuration errors that occur.

Chapter 4

Related Work

Much research has been invested in the area of configuration management, and this work has led to new techniques and numerous improvements to CMSs at every level. Some researchers have dramatically overhauled CMS design and architecture, while others have focused on fixing problems associated with their high-level languages. Some attack problems at the runtime infrastructure level and, finally, some work on changes to management workflow. In this chapter, we briefly present research directions closely related to the topic of this thesis and describe how our approach and prototype differ from other projects in this field.

4.1 CMS Design and Architecture

Motivated by the need for a CMS that provides the right balance between abstraction and the user's ability to gain control over operating system details, Puppet was developed by Kanies et al. [32]. Puppet is a DevOps tool that aims to hide the complexity of operating system configuration from developers and end users, while also acting as an operating system abstraction layer to help system administrators automate their jobs [32]. The Puppet API introduces functionality that a developer cares about, such as package management, user information, file permissions, and service status, while providing another set of API methods meant for use by advanced operators. This latter set of API methods allows operators to manipulate the underlying operating system, execute shell commands on a host, and monitor a system's status. Puppet is currently the most widely-used CMS in industry. This status makes it the perfect CMS to use as

28 16 the basis of our prototype. While Puppet does a great job of taking a machine to a desired state as expressed in its declarative language, it also has all of the limitations we mentioned in Chapter 2. We propose a new approach for solving these problems in Chapter 5 and made modifications to Puppet to test that new approach in a simulated large-scale computing environment CMS Constraints There are always constraints that define how a resource can be applied to a machine to reach a desired state. Operational constraints define how resource configurations should be applied and in what order. User preferences can impose additional constraints, such as selecting the particular version of a compiler or library to be used for a resource configuration. Furthermore, technical constraints such as which network card driver is in operation, and business constraints, such as restrictions on resource allocations and budgeting, can limit the ways in which resources can conceivably be configured and used. The task of choosing which configuration out of all possible configurations should be applied to a machine is usually defined manually by system administrators. This makes the configuration process subject to errors regarding, for example, what role is assigned to a particular machine, or what packages are to be installed. These problems served as the motivation for a research effort that attempted to automate constraint-solving in the context of configuration management, and to detect and analyze the impact of changes based on a variety of criteria [12, 22, 24]. Hewson et al. [22] performed work on ConfSolve, a constraint-based declarative language that allows constraints to be defined to generate valid configuration solutions from all acceptable configurations. 
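To make the idea of constraint-based configuration selection concrete, the following sketch (our own illustration, not ConfSolve syntax; the domains and constraints are invented) enumerates a small search space and keeps only the configurations that satisfy preference, technical, and business constraints:

```python
from itertools import product

# Invented example domains; a real constraint solver would not brute-force.
domains = {
    "role":   ["web", "db"],
    "gcc":    ["4.7", "4.8"],   # user preference constrains the compiler
    "ram_gb": [8, 16, 32],      # business constraint caps the budget
}

def valid(cfg):
    if cfg["gcc"] != "4.8":                         # preference: pin gcc 4.8
        return False
    if cfg["role"] == "db" and cfg["ram_gb"] < 16:  # technical: db needs RAM
        return False
    if cfg["ram_gb"] > 16:                          # business: no 32 GB boxes
        return False
    return True

solutions = [dict(zip(domains, vals))
             for vals in product(*domains.values())
             if valid(dict(zip(domains, vals)))]
# three valid configurations survive out of twelve candidates
```

A declarative solver like ConfSolve performs this selection over far larger spaces without exhaustive enumeration, but the inputs and outputs have the same shape: domains, constraints, and a set of valid configurations.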
ConfSolve cannot replace current configuration management systems such as Puppet, because while the language allows for the definition of constraints, it does not specify the steps needed to reach a particular goal state. However, ConfSolve can serve as the basis for a system that detects and analyzes the impact of configuration changes before or during runtime to improve configuration management system efficiency. Delaet et al. [12] took the approach of creating a high-level rule language known as PoDIM that is used to generate valid resource configurations for multiple machines based on user-defined rules. When invoked, the PoDIM interpreter attempts to satisfy all the rules and generate a valid resource configuration that a CMS (e.g. CFEngine) can then deploy and act on. This approach allows constraints to be specified independent of a particular CMS and thus operates at a higher level of abstraction than other constraint solvers that integrate directly with CMSs themselves. With respect to limitations, PoDIM currently does not guarantee unique output for the same input, nor that a user's rules are satisfied, and has not been proven to scale to large numbers of rules and machines. Finally, Hinrichs et al. [24] describe an approach to automate the generation of configurations that satisfy constraints by modeling the configuration management problem as an object-oriented constraint satisfaction problem (OOCSP) and then solving it using well-studied techniques of resolution-based theorem proving. This approach models configuration as an OOCSP, translates configurations to first-order logic sentences, and finally invokes a resolution-based theorem prover to find the configuration that satisfies the constraints. While this approach was able to meet the requirements, it does not scale due to the use of the resolution-based theorem prover, which requires time and space exponential in the size of the submitted first-order logic sentences. While the work in this area is important and has been proven to improve the ability of CMSs to put systems into a known state, our work discussed in Chapter 5 is one layer above the specifics of a CMS's high-level language, and instead targets problems in the runtime infrastructure engine and changes in a CMS's workflow management.
As will be discussed in Chapter 5, we aim to limit the number of changes that need to occur in a CMS's high-level language and instead introduce services that exist independent of the CMS, allowing our approach to work with multiple CMSs. With respect to the area of constraint satisfaction, we intend to allow a CMS to use its existing constraint solver. With respect to the issue of managing updates to a CMS's resource configurations, typically several different operators will contribute data that alters a resource configuration in a central configuration repository. These changes are then propagated to different clients in an infrastructure.

Current CMSs assume they are operating in a trusted zone and therefore do not enforce access control by themselves, but rather rely on the access control mechanisms of a version control system or file system permissions to authorize changes to resource configuration files. Also, CMSs do not enforce a specific workflow regarding how these changes are applied to a central configuration repository, e.g., whether a user needs further approval before changes are pushed to the repository. These shortcomings motivated researchers to investigate the possibility of adding access control to CMSs, thereby more effectively enforcing a predefined workflow for changing resource configurations [2, 42, 43].

4.3 CMS Workflows and Access Control

Desai et al. [15] discuss how the use of Subversion, a version control system, and their particular CMS, Bcfg2, solved several administrative problems. They made changes to Bcfg2, enabling it to leverage the change tracking provided by Subversion. In this case, the server, clients, and reporting engine were made aware of Subversion's revision ID, allowing the content of a server configuration repository to be tracked at any time. This allowed for a historical view of the configuration representation. In addition, clients can now be reconfigured using specific versions traceable to a specific revision ID. The ability to associate changes with a revision ID, and the integration of that ability into the Bcfg2 reporting systems, established a base for implementing policies that manage the deployment of changes, e.g., managing the order in which changes are applied and taking action based on the return status of applying a change. Vanbrabant et al. present ACHEL, a prototype that adds access control and workflow enforcement to system configuration [42].
Their approach is based on comparing the abstract syntax tree (AST) representations of the original and changed files of a resource configuration and using that comparison to generate a tree edit script. Edit scripts record the addition, deletion, and update node operations that transform the original resource configuration into the new resource configuration. To evaluate the access-control rules, ACHEL translates the edit script to semantically meaningful changes that are specific to the configuration language. For example, the edit script in Listing 4.1 will be translated to the semantically meaningful changes shown in Listing 4.2.

Listing 4.1: Edit script

    @@ -1,3 +1,2 @@
     var1 = 6
    -var2 = 6
    -print(var1 - var2)
    +print(var1 - 7)

Listing 4.2: Semantically meaningful changes

    add = {node(text="7", parent=205)}
    delete = {node(id=102), node(id=106), node(id=107), node(id=110)}
    update = {node(id=103, text="print")}

ACHEL uses a set of methods to determine each node's owner based on historical information provided by a version control system (Mercurial in this case) and decides if the user making the changes has the right authorization to change the statements represented by those nodes. Changes to workflow management are based on the access-control solution, in which a user cannot commit changes directly to the centralized repository unless they receive approval from a user who owns the altered statement. While this work is helpful and presents a new approach to adding access control to a CMS, it is complex, difficult to use, and hard to adopt in modern CMSs [43]. Another approach to adding access control to CMSs was proposed by Anderson et al. [2]. They suggest making use of provenance techniques developed in the context of databases. Their work is preliminary and, unfortunately, did not provide specifics about how such techniques could be applied in modern CMSs. We believe the problem of adding access control to a CMS remains unsolved and that it needs more research and attention. Our proposed solution in Chapter 5 is to take a new approach that acts on the layer between the operator and the configuration repository. In our approach, we analyze the impact of the change to be applied on a host and determine if the change should be authorized. Our approach prevents such a change from being made without first obtaining proper approval or by requiring a different user with the proper credentials to submit that change.
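The essence of an ownership-based authorization check of the kind ACHEL performs can be illustrated in a few lines. This is a sketch with invented node IDs and owners, not a reproduction of ACHEL's implementation: every statement (AST node) a change touches must be owned by the submitter or explicitly approved by its owner.

```python
# Hypothetical ownership table, as derived from version control history:
# AST node id -> operator who last authored that statement.
node_owner = {102: "alice", 103: "alice", 106: "bob", 107: "bob", 110: "bob"}

def authorized(user, touched_nodes, approvals=()):
    """A change is allowed if every touched node belongs to the submitter
    or to someone who has explicitly approved the change."""
    owners = {node_owner[n] for n in touched_nodes}
    return owners <= ({user} | set(approvals))

# alice may edit her own statements...
assert authorized("alice", [102, 103])
# ...but touching bob's statements requires bob's approval first
assert not authorized("alice", [102, 106])
assert authorized("alice", [102, 106], approvals=["bob"])
```

Our impact-based approach in Chapter 5 generalizes this idea: instead of asking who owns a statement, it asks which hosts and contexts a change will affect, and gates approval on that impact.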

Chapter 5

Design and Architecture

5.1 Overview

To address the limitations of modern CMSs discussed above, we have designed Caerus, a next-generation CMS runtime infrastructure engine. Caerus adds a number of new layers not found in existing systems. It adds coordination, messaging, and policy layers that make use of an extensive array of distributed systems techniques to coordinate and synchronize the state and actions of CMS master nodes and CMS client nodes. It adds parsing and security layers that make it possible to both authenticate operators and ensure that they are authorized to make the changes included in their submitted change requests. Finally, it adds a monitoring layer that tracks the state of submitted change requests and can alert operators if a problem should arise. Each layer is briefly described next and then in further detail below.

Coordination Layer: The coordination layer provides functionality to coordinate when changes are applied by a CMS to the machines under its care. Our prototype implementation for this layer is based mainly on services provided by ZooKeeper [25] and is discussed in Chapter 5.3.

Messaging Layer: The messaging layer pushes messages to all servers when coordination for a change is needed and provides functionality to deliver changes to all servers participating in a CMS. In our prototype, this layer makes use of RabbitMQ, a messaging service that implements the Advanced Message Queuing Protocol standard [27], and is discussed in Chapter 5.3.

Policy Layer: The policy layer allows the behavior of CMS services to be customized based on user preferences and business constraints. This layer is used by our prototype to determine when changes should be applied. In particular, it can determine if a proposed time to apply a change violates a company policy (such as applying a change while the company is dealing with a service incident) and override the scheduled time with a new time that allows the change to be applied with no violation. This layer is discussed in more detail later in this chapter.

Monitoring Layer: The monitoring layer is used to monitor the status of changes being managed by a CMS. This layer is also responsible for reporting on the status of those changes (such as when they are initiated and when they have been applied). The monitoring layer is discussed later in this chapter.

Parsing Layer: The parsing layer is used to analyze the changes submitted to a CMS. It will parse a change request and the configuration repository to build a mapping between nodes, resource configurations, and resource packages. This layer is used to help address the CMS security-related issues discussed in Chapter 2. The parsing layer is discussed later in this chapter.

Security Layer: The security layer is used to handle the authentication and authorization of a change request and is discussed later in this chapter.

The design of our prototype system, Caerus, adopts a hybrid server and agent model that works alongside the infrastructure of an existing CMS.[1] The server of our system hosts and implements all of the layers listed above. It also implements a service known as the Change Worker Coordinator. The agents of Caerus are installed on the CMS's master and client nodes. The Change Consumer Coordinator agent is installed on a CMS's master nodes (one per master node). The Client Change Agent is installed on a CMS's client nodes (one per client node). In this chapter, we present the design and architecture of these components and discuss the design choices and trade-offs made to meet our goal of producing a new style of CMS runtime infrastructure engine that can scale to large-scale computing environments.

The overall goal of this software architecture for a next-generation CMS runtime infrastructure engine is to provide a set of services that allow configuration management services to be provided in large-scale computing environments without encountering the issues of coordination and security that lead to expensive configuration errors today. The new services will provide a layer of abstraction that ensures that only authorized changes get submitted to a CMS and that those changes get applied in a synchronized, coordinated fashion that reduces or eliminates configuration errors. It will be important for the new layers to operate efficiently so that changes can still be made in a timely fashion. There are situations when changes need to be applied as soon as possible, such as patches that address security vulnerabilities, and our new design must allow that to happen while also ensuring that no configuration errors occur as a result. In addition, there are times when no changes should be allowed, such as when a data center has gone down and traffic is failing over from it to a back-up data center. In those situations, the focus is on getting the current version of the infrastructure back to an operational state, not making changes or reconfigurations to that infrastructure. Our new services must therefore make it possible for operators to turn off the stream of updates and not allow any machines to be updated until the incident has been handled and the organization's computing environment is back to being operational.

[1] In Greek mythology, Caerus was the god of opportunity, luck, and favorable moments. Caerus would act at the right time to ensure that a goal was achieved.
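The two operational extremes just described, expedited security patches and a full freeze during an incident, suggest a simple shape for a policy check. The sketch below is an illustration with invented names and fields, not Caerus's actual interface:

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    change_id: str
    security_patch: bool = False  # vulnerability fixes jump the schedule

class PolicyLayer:
    """Toy policy check: a global freeze blocks all updates during an
    incident; security patches are otherwise scheduled as soon as possible."""

    def __init__(self):
        self.incident_freeze = False  # flipped on by operators

    def decide(self, request, proposed_slot):
        if self.incident_freeze:
            return ("hold", None)          # no machine may update for now
        if request.security_patch:
            return ("apply", "asap")       # expedite the patch
        return ("apply", proposed_slot)    # normal scheduled change

policy = PolicyLayer()
assert policy.decide(ChangeRequest("c1"), "02:00") == ("apply", "02:00")
assert policy.decide(ChangeRequest("c2", True), "02:00") == ("apply", "asap")
policy.incident_freeze = True  # data center incident: stop all updates
assert policy.decide(ChangeRequest("c3", True), "02:00") == ("hold", None)
```

Note that in this sketch the freeze deliberately overrides even security patches; during an incident, restoring the current infrastructure takes priority over any change.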
5.2 Motivation

In the large-scale computing environment of an organization, there will be thousands of nodes distributed across multiple data centers hosting software services that implement that organization's business processes. For instance, these nodes may run the organization's web servers, payroll systems, data analysis systems, customer-management systems, databases, etc. Each of these nodes is also running the client software of the organization's CMS. That client software in turn will access another set of nodes that are running the CMS's server software; those nodes are known as the master nodes of the CMS. Managing this network of nodes is an army of operators (system administrators) who work with the CMS to ensure that the environment operates as expected. If some aspect of the environment needs to change, then an operator will update the relevant resource packages and check the updates into a configuration repository that is monitored by the CMS master nodes. We will refer to this operation as submitting a change request to the CMS. At some point, the CMS client software on the client nodes will contact the master nodes, discover the changes, download the update, and apply it. Those nodes will then be configured to reflect the changes and operate in the indicated manner.

Not all nodes are independent. Many client nodes will work together to operate a single service. For instance, hundreds of nodes may all work together as a Hadoop cluster, distributing data across those nodes in HDFS and processing that data by running MapReduce jobs on all of the nodes at once. We refer to these groupings of client nodes as a context, and we take note of two types of change requests that can be applied to the nodes in a context. The first type of change, a context-independent change request, is one in which it does not matter if the nodes participating in a context apply the change in a random order. As an example, an operator may alter a resource package to allow a new user to access a Hadoop cluster. Prior to the change, the user is not able to submit jobs to the cluster, but after the change the user's jobs will be accepted and processed. This change may affect only a subset of the nodes in the context and the order in which this change is applied is not critical. Eventually all the nodes that need to know about this change will query a CMS master node, receive the change, and apply it.
The user may have to wait a few minutes before they can submit their jobs but there is otherwise no impact on the overall work and productivity of the organization. The second type of change, a context-specific change request, is one in which all nodes participating in a context must apply the change at the same time. That is, the nodes in the context cannot be allowed to enter a state where some of the nodes have applied the change and others have not. For instance, there was a point in Cassandra's version history where the format of the repository changed. If you had an existing Cassandra cluster and wanted to update to the version with the new repository format, you needed to upgrade all of the nodes at the same time. If the nodes within the cluster were allowed to update to the new format at different times, data corruption and data loss would occur.

Modern CMSs can easily handle context-independent change requests even in large-scale computing environments, but they cannot currently handle context-specific change requests and, indeed, this situation can lead to configuration errors which in turn bring down services and directly impact an organization's ability to get work done. In order to address context-specific change requests in large-scale computing environments, we assert that a CMS must ensure 1) that when such a change request has been submitted, all master nodes update themselves (and are thus "in sync") before a single client node is made aware of the update, and 2) that when the nodes in a context become aware of the update, a) they work with the CMS to designate a time when all nodes will apply the update and b) they wait until that time to apply the update. The first step, synchronization of master nodes, is necessary because large-scale computing environments never run just a single CMS master node. A single CMS master node would not be able to scale to handle all of the requests that would be generated by tens of thousands of client nodes, and it also represents a single point of failure, unacceptable in modern computing environments. In addition, current CMSs do not provide any services that ensure that the configuration repositories of master nodes are synced; instead, this process of synchronization is left to the operators and thus is laborious and error prone.
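The first requirement, that all masters be in sync before any client sees a change, amounts to a barrier. A single-process sketch of the idea (illustrative only; names are invented and this is not the actual Caerus protocol) looks like this:

```python
class MasterSyncBarrier:
    """A change becomes visible to client nodes only once every master node
    has confirmed that it holds the new revision."""

    def __init__(self, masters):
        self.masters = set(masters)
        self.confirmed = {}  # change_id -> masters that have synced it

    def master_synced(self, change_id, master):
        self.confirmed.setdefault(change_id, set()).add(master)

    def visible_to_clients(self, change_id):
        return self.confirmed.get(change_id, set()) == self.masters

barrier = MasterSyncBarrier(["m1", "m2", "m3"])
barrier.master_synced("chg-42", "m1")
barrier.master_synced("chg-42", "m2")
assert not barrier.visible_to_clients("chg-42")  # m3 still out of sync
barrier.master_synced("chg-42", "m3")
assert barrier.visible_to_clients("chg-42")      # now safe to expose
```

In Caerus the confirmations arrive from distributed agents via the coordination layer rather than local method calls, but the visibility rule is the same.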
This lack of synchronization can lead to client nodes in the same context seeing different configurations from different master nodes, and this incongruity can lead to unpredictable behavior and eventually to service failure. The second step, synchronization of client nodes, is necessary since a failure to update all client nodes in the same context in the presence of context-specific change requests can lead to data loss and service failure, as discussed above with the Cassandra example. Before presenting the details of the design of Caerus and how this design solves these problems in a scalable fashion, we must first establish that these issues occur in real-world contexts. We do this by first establishing that these concerns occur in a relatively small computing environment that supports the daily operation of a large research project: Project EPIC at the University of Colorado [39]. We argue that if these issues arise in a small environment with a handful of servers, they will definitely occur in a large-scale environment with thousands of servers.

Project EPIC is an NSF-funded project at the University of Colorado that investigates how members of the public make use of social media during times of mass emergency. To perform this research, Project EPIC must collect large amounts of social media data. Project EPIC's data collection infrastructure, EPIC Collect, is deployed on a cluster of twelve machines; four of those machines participate in a Cassandra cluster, while the other machines host apps and other services that are used to control the data collection process as well as analyze and back up the collected data. EPIC Collect's cluster is managed by the Puppet CMS. Each node of the cluster has Puppet's client software installed, and this software is configured to check in periodically with EPIC's Puppet master node to update its software in response to any submitted change requests.

[Figure 5.1: Distribution of CMS Client-To-Master Connections]

In Fig. 5.1, we graph the minute of the hour at which each client node connects to the Puppet master node, over a time period long enough to observe 400 connection attempts by each node. The data shows that there is no coordination between client nodes; connection times are distributed all across the sixty minutes of an hour and, indeed, the time shifts for an individual node over time. In this situation, a change submitted to the master node will be seen and applied by nodes in no particular order, with a maximum delta of thirty minutes before all nodes have been updated. This situation can easily lead to configuration errors that can threaten the health of the entire cluster.
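The effect shown in Fig. 5.1 is easy to reproduce numerically. The sketch below (an illustration, not Project EPIC's measured data) models clients that each poll on their own uncoordinated thirty-minute cycle and measures how long the last client takes to see a freshly submitted change:

```python
import random

def worst_case_pickup_delay(num_clients, period_min=30.0, trials=1000, seed=7):
    """Worst observed delay (minutes) between submitting a change and the
    last client's next poll, over many random submission times."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        # each client polls at its own fixed offset within the period
        offsets = [rng.uniform(0, period_min) for _ in range(num_clients)]
        change_at = rng.uniform(0, period_min)
        delays = [(off - change_at) % period_min for off in offsets]
        worst = max(worst, max(delays))
    return worst

# with uncoordinated clients, the worst case approaches the full period
assert 25.0 < worst_case_pickup_delay(4) < 30.0
```

Even with only four clients, the last node routinely lags the first by most of a polling period, which matches the thirty-minute maximum delta observed above.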
This divergence of connect times is also seen for nodes that participate in the same context. In EPIC Collect, four nodes act as a Cassandra cluster and, as seen in Fig. 5.2, each of those nodes connects to the Puppet master node at a different time. With the data shown in Fig. 5.2, we can see that for an update that has to be applied to all four nodes at once, such as the real-world example given above of Cassandra's repository format change, it would have been possible for epic-n3 to receive notice of that update shortly after minute 30. It would have applied that update and rejoined the cluster within a few minutes, and then operated for nearly half an hour before node-n0 was even aware of the update. Such a situation would have guaranteed data loss or data corruption of the social media being gathered by Project EPIC, and EPIC's analysts cannot afford to lose data being generated during mass emergency events.

[Figure 5.2: Distribution of CMS Client-To-Master Connections by Nodes in Same Context]

[Figure 5.3: Shift in Time of CMS Client-To-Master Connection by Single Node]

Finally, we note, again, that if we observe the connection behavior of a single node, as done in Fig. 5.3, the time at which a Puppet client connects to a Puppet master shifts over time. This shift is due to the small amount of time it takes to transfer changes between a Puppet client and master node and apply them. This time shift then guarantees the divergent behavior seen in Fig. 5.1 and Fig. 5.2. With these three figures, we have demonstrated that even in a small computing environment, modern CMSs do not have the capabilities to synchronize changes across a set of machines such that the changes are applied at the same time. This evidence, in turn, motivates the need for the design and capabilities of Caerus, the prototype CMS runtime infrastructure engine developed for this dissertation.

5.3 Supporting Frameworks and Techniques

The key aspect of our vision for a next-generation CMS runtime infrastructure is its ability to coordinate and synchronize the actions of master nodes and client nodes in a large-scale computing environment. To do that, we needed to carefully select the frameworks we would use to handle coordination and messaging. In addition to these frameworks, we made use of three distributed systems techniques throughout the implementation of Caerus to achieve its desired functionality: dealing with connection failures between distributed components, dealing with concurrent updates, and implementing distributed locks. We now discuss each framework and technique in more detail.

5.3.1 Coordination Framework

For the coordination of our prototype's distributed agents, we needed a framework and service that would provide high availability, scalability, and reliability.
The gold standard for distributed coordination services is currently Apache ZooKeeper; as such, we selected ZooKeeper to form the basis of our coordination layer. ZooKeeper is a distributed coordination service that provides distributed applications with a simple set of synchronization, grouping, and naming services; its implementation focuses on high performance, availability, durability, and the atomic ordering of operations. ZooKeeper has a simple data model described as a shared hierarchical name space of data registers (a.k.a. znodes), similar to a file system. With respect to the file system analogy, a znode is a named directory that can have a small amount of data associated with it and can have any number of directories (znodes) beneath it. As we will show, next-generation CMSs require these services in order to scale to modern computing environments with thousands of nodes and multiple contexts. To ensure that a CMS operating at this level can continue to function, it requires a coordination service that is also resilient to failure and always available to respond to requests.

5.3.2 Messaging Framework

To enable communication between our prototype's distributed services, we again looked for a framework that has been proven to scale to large computing environments and has been implemented with resilience and reliability in mind. We selected RabbitMQ for these reasons. It implements the emerging Advanced Message Queuing Protocol standard, which provides a wide range of services useful for sending messages in large-scale computing environments. In addition, it provides highly desirable features such as durability (ensuring that all messages in a queue are persisted to guard against data loss) and delivery confirmation (reliable acknowledgment that a message has been successfully delivered to its destination) that are critical to achieving our goals with respect to scalable configuration management services.
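The two queue properties we rely on, persist-before-confirm and ack-before-removal, can be modeled in a few lines. The following is an in-memory stand-in for the broker's behavior, written for illustration; it is not RabbitMQ's actual API:

```python
import pickle

class DurableQueue:
    """Toy broker queue: a message is journaled before the publisher is
    confirmed, and removed only after the consumer explicitly acks it."""

    def __init__(self):
        self._log = []        # stands in for the on-disk journal
        self._delivered = set()
        self._unacked = {}    # delivery tag -> log index
        self._tag = 0

    def publish(self, message):
        self._log.append(pickle.dumps(message))  # persist first...
        return True                              # ...then confirm receipt

    def deliver(self):
        for i, raw in enumerate(self._log):
            if raw is not None and i not in self._delivered:
                self._delivered.add(i)
                self._tag += 1
                self._unacked[self._tag] = i
                return self._tag, pickle.loads(raw)
        return None  # nothing outstanding

    def ack(self, tag):
        self._log[self._unacked.pop(tag)] = None  # only now is it removed

    def consumer_died(self):
        """Unacked messages become deliverable again after a consumer crash."""
        for i in self._unacked.values():
            self._delivered.discard(i)
        self._unacked.clear()

q = DurableQueue()
q.publish({"change": "chg-42"})
tag, msg = q.deliver()
q.consumer_died()            # worker crashed before acking
tag2, msg2 = q.deliver()     # the change request is not lost
assert msg2 == {"change": "chg-42"}
q.ack(tag2)
assert q.deliver() is None   # acked, so it is gone for good
```

A real broker adds routing, fsync behavior, and network failure handling, but these two invariants are exactly what make a crashed consumer or broker restart safe for change requests.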
As you will see in Chapter 7, we make heavy use of these features to provide reliable configuration management services, after confirming that the performance hit of persisting messages and message queues has no significant impact on the operation of Caerus.

5.3.3 Distributed Systems Techniques

With respect to connection failures, a distributed system like Caerus must have a consistent strategy to handle the situation where one component on one machine cannot be reached by other components on other machines. We adopt two strategies in this situation, and these strategies were implemented uniformly throughout the software architecture of Caerus. The first strategy is known as exponential backoff. If a component A fails to connect to a component B, A will retry the connection X number of times and will increase the amount of time between each attempt exponentially up to some limit Y. The second strategy is to impose a timeout on all requests. If A makes a request to B and B does not respond within Z seconds, the request is terminated and will be retried at some point in the future based on the current state of the exponential backoff strategy. All of these values (X, Y, and Z) are configurable since different system components have different workloads and behaviors.

With respect to dealing with concurrent updates, Caerus encounters situations where the information in the coordination layer may be read or written by different clients at the same time. Given the nature of ZooKeeper's eventual consistency model, some information might get lost or mistakenly overwritten by our prototype's various agents. To solve this problem, we implement the strategy of reading a znode's stat object before we change the content of that znode. The stat object contains a variety of metadata about a znode including, critically, its version number. We extract the version number of the znode from the stat object and send it as a parameter to the write operation. If ZooKeeper detects that the version has changed since we read it, then some other agent updated the znode after we read it. In this situation, our attempt to write will fail since ZooKeeper will throw an exception. Our agents are written to catch that exception and then retry the update after reading the znode's new content and version number. We use this technique when CMS client nodes in the same context are working together to determine the time at which they will all apply a context-specific update.
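Both techniques, versioned conditional writes and exponential backoff between retries, are shown below against an in-memory stand-in for a znode store. This is a sketch; real agent code would issue the same read/conditional-write pattern through a ZooKeeper client library, and the example paths are invented:

```python
import time

class VersionConflict(Exception):
    pass

class VersionedStore:
    """In-memory analogue of ZooKeeper znodes: each path holds (data,
    version), and a write must present the version it read."""

    def __init__(self):
        self._nodes = {}

    def create(self, path, data):
        self._nodes[path] = (data, 0)

    def get(self, path):
        return self._nodes[path]  # (data, version), like (data, stat)

    def set(self, path, data, version):
        if version != self._nodes[path][1]:  # someone wrote after our read
            raise VersionConflict(path)
        self._nodes[path] = (data, version + 1)

def update_with_retry(store, path, transform, attempts=5, base_delay=0.01):
    """Read-modify-write with optimistic concurrency: on conflict, re-read
    and retry, doubling the wait between attempts (exponential backoff)."""
    delay = base_delay
    for _ in range(attempts):
        data, version = store.get(path)
        try:
            store.set(path, transform(data), version)
            return
        except VersionConflict:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("update failed after %d attempts" % attempts)

# e.g. agents in one context agreeing on the latest proposed apply time
store = VersionedStore()
store.create("/contexts/cassandra/apply_time", 0)
for proposal in (1800, 3600):
    update_with_retry(store, "/contexts/cassandra/apply_time",
                      lambda t, p=proposal: max(t, p))
assert store.get("/contexts/cassandra/apply_time") == (3600, 2)
```

The version check is what prevents a lost update: a stale writer fails loudly and retries against the current state instead of silently clobbering another agent's proposal.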
With respect to distributed locks, Caerus implements the shared lock protocol defined by ZooKeeper [25]. A ZooKeeper client seeking to take out a lock will create a sequential ephemeral znode, e.g. Z (some integer), under a specific path, such as /X.[2] The ZooKeeper client will then ask for all of the child nodes located under /X. If /X/Z is the lowest sequential number of all children returned, then the ZooKeeper client has acquired the lock and may proceed with its update. Once the client has finished its work, it closes its session with ZooKeeper, which causes the Z znode to be automatically deleted and releases the lock. If /X/Z is not the lowest sequential child of /X, then the client must wait for a short time and then check again to see if Z is now the lowest sequential child. At some point, Z will be the lowest and the client will finally have acquired the lock.

[2] Ephemeral znodes are discussed in more detail in Chapter 5.4.2.

5.4 System Components

Our additions to a modern CMS include the following system components.

Change Worker Coordinator (CWC): A new service that sits between an operator and a configuration repository. Operators will no longer submit change requests directly to their configuration repository. Instead, they submit the change request to the CWC. This component will then ensure that the change request is properly synced to all master nodes and then distributed to client nodes as needed. This component is also responsible for rejecting change requests that are submitted by unauthorized operators.

Change Consumer Coordinator (CCC): A new service that lives on CMS master nodes. It facilitates the master's ability to respond to a change request and ensures that all master nodes are in sync with the most recent changes before those changes are delivered to client nodes. It operates on information supplied by the CWC and will update that information with status updates, such as when a change request was received and when it was applied.

Client Change Agent (CCA): A new service that runs on CMS client nodes and handles the communication and coordination between CMS client nodes and CMS master nodes. It coordinates, schedules, and applies all change requests for a particular client node.

We now discuss these three system components in detail and how they interact with the service layers presented in Chapter 5.1.

45 Change Worker Coordinator The Change Worker Coordinator is responsible for processing change requests from operators that normally would have been submitted directly to a configuration repository. Instead, the change request is bundled into a message and submitted to a change request queue managed by the messaging layer. The CWC is then configured to be a consumer of the change request queue so that it can process the change requests. This model was adopted since it can easily scale as the workload increases. To handle increased load, multiple instances of the CWC can be launched all working together to process change requests sitting in the queue. The advance message queuing protocol that RabbitMQ implements will handle the distribution of messages to multiple instances of the CWC. Using RabbitMQ s features, this queue is configured to be durable such that each message to the queue is persisted to disk before the client submitting the message is told that the message has been received. With this design, if the node managing the queue goes down, we know that all received messages will still be in the queue when the node is brought back on-line. In addition, a message is never removed from the queue until the CWC has sent an acknowledgment that it has received the change request. With this design, we greatly increase reliability of the CMS even in large-scale computing environments. Once a message (change request) has been received, the CWC performs the following process: Authenticate the change request with the security layer. Send the change request to the parsing layer to determine the impact of the change on the computing environment. If the change will impact a context, then the change request is updated accordingly. In addition, the worker uses this information to determine with the security layer if a change is authorized Fig Register the change request with the coordination layer. Determine the next time slot when all CMS master nodes can apply the change.

- Set the change request's delivery attribute to the determined time slot.

We now discuss these steps in detail.

Authenticating Change Requests

All changes must be authenticated before they are processed by the rest of the components. We make use of pam-python, which exposes the internal Pluggable Authentication Module (PAM) API, to determine an operator's eligibility to use the system. We selected PAM because of its simple yet mature and rich API that has been used by many systems and services in production environments. Our current PAM module determines if a user is allowed to use the system by checking that the user submitting the change has an account on the server and appears in a simple flat text file. Adopting more advanced authentication techniques requires changing only a few lines of the PAM module configuration file.

Change Request Evaluation

All change requests will eventually impact the CMS client nodes. It is the client nodes that ultimately follow the instructions of the change request and reconfigure themselves to match the indicated state. As discussed above, some change requests impact individual client nodes and can be applied without coordination among all CMS clients and without danger of a configuration error. However, other change requests do require all CMS clients that participate in a particular context to coordinate and ensure that the change is applied at exactly the same time. To address this concern, the CWC must compute which hosts will be impacted by the resource configurations that have been updated by a change request. The first step in this process is to use the lsdiff command to list all files that have been updated by a change request. The parsing layer can then parse those files and return a list of the resource configurations and resource packages that were contained in those files.
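The file-listing step just described can be approximated without shelling out to lsdiff. The sketch below is a minimal stand-in that extracts the touched paths from a unified diff; the diff content and module names are illustrative, not taken from the Caerus sources.

```python
import re

def changed_files(diff_text):
    """Extract the paths touched by a unified diff, mimicking `lsdiff`.

    Looks at '+++ b/<path>' headers; '+++ /dev/null' entries (pure
    deletions) do not match the pattern and are therefore skipped.
    """
    paths = []
    for line in diff_text.splitlines():
        m = re.match(r'\+\+\+ b/(.+)', line)
        if m:
            paths.append(m.group(1))
    return paths

# Illustrative change request touching two Puppet files.
diff = """\
--- a/modules/ssh/manifests/init.pp
+++ b/modules/ssh/manifests/init.pp
@@ -1,3 +1,4 @@
+class ssh { }
--- a/modules/ntp/files/ntp.conf
+++ b/modules/ntp/files/ntp.conf
@@ -2,2 +2,2 @@
"""

print(changed_files(diff))
# → ['modules/ssh/manifests/init.pp', 'modules/ntp/files/ntp.conf']
```

The returned list is what the parsing layer would then map to resource configurations and resource packages.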
That list is then submitted to an in-memory database that maintains a mapping between resource packages and the CMS client names (hosts) that make use of them. This mapping is precomputed because building it is an expensive operation that would hurt the performance of the CWC if it had to be repeated for each change request.

Figure 5.4: CWC Authentication and Authorization Workflow

The CWC passes the returned host names, along with the operator information, to the security layer to determine if the change should be authorized. Finally, the CWC checks the returned host names against another mapping to see if they participate in a context. If so, the CWC updates the serialized form of the change request to set its context attribute. This attribute will be used by the CCCs and CCAs to trigger the coordination logic that ensures that all CMS client nodes running in the same context apply this change request at the same time (discussed in more detail below). Once the processing of the change request is complete, the CWC publishes a message informing all CCC agents that it is available. Our messaging framework ensures that all CCCs will eventually receive that message and act on it. This will occur even if a CCC was offline when the message was published, due to a warm-up phase that occurs as a CCC is initializing. This warm-up logic is discussed in more detail below.

Change Registration

The third step in handling a submitted change request is to register the change with the coordination layer. Since our architecture makes use of ZooKeeper to implement its coordination layer, registering a change request involves creating a set of znodes that will be used to track the status of the change request as it works its way through the stages of its life cycle. A summary of the znodes that are created for a change request is shown in Table 5.1. The first znode created for a change request is a child node under the /changes znode. The name of the new child znode corresponds to the id associated with the change. For simplicity, the id is taken from the configuration repository used in the computing environment (for instance, the commit id generated by git for the change request).
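The impact computation and context check described in the Change Request Evaluation step can be sketched with plain dictionaries standing in for the precomputed in-memory database. All package, host, and context names below are illustrative.

```python
# Precomputed mapping from resource package -> hosts that use it
# (expensive to build, so computed once rather than per change request).
PACKAGE_HOSTS = {
    "openssh": ["web01", "web02", "db01"],
    "ntp": ["web01", "db01"],
}

# Hosts that participate in a coordination context.
HOST_CONTEXT = {"web01": "web-cluster", "web02": "web-cluster"}

def impacted_hosts(packages):
    """Union of the hosts that use any of the updated resource packages."""
    hosts = set()
    for pkg in packages:
        hosts.update(PACKAGE_HOSTS.get(pkg, []))
    return sorted(hosts)

def contexts_for(hosts):
    """Contexts whose members are impacted; used to decide whether to
    set the change request's context attribute."""
    return sorted({HOST_CONTEXT[h] for h in hosts if h in HOST_CONTEXT})

hosts = impacted_hosts(["openssh"])
print(hosts)                 # → ['db01', 'web01', 'web02']
print(contexts_for(hosts))   # → ['web-cluster']
```

A non-empty context list is what triggers the coordinated-application workflow discussed below.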
The data associated with this znode is a serialized version of the change request that the CMS uses to apply the changes. As discussed below, this content can be unpacked by any CCC running on a CMS master node to apply the change to the master and prepare it to deliver the change to its associated CMS client nodes.
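Registering a change can be sketched as building the skeleton's znode paths and initial data in one batch; in the real system these creations would be issued as a single ZooKeeper multi-operation so the skeleton appears atomically. The helper below only constructs the paths; it does not talk to ZooKeeper.

```python
def change_skeleton(change_id, serialized_request):
    """Return the znodes (path -> initial data) that make up the
    'change skeleton' for one change request (cf. Table 5.1)."""
    base = "/changes/%s" % change_id
    return {
        base: serialized_request,        # serialized change request
        base + "/received": b"",         # masters confirm receipt here
        base + "/applied": b"",          # masters confirm application here
        base + "/result": b"scheduled",  # status, read by monitoring layer
    }

skeleton = change_skeleton("0", b"<serialized git commit>")
print(sorted(skeleton))
# → ['/changes/0', '/changes/0/applied', '/changes/0/received', '/changes/0/result']
```

Creating all four znodes in one transaction is what guarantees no CCC ever observes a partial skeleton.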

Table 5.1: ZooKeeper znodes used to Track a Change Request

/changes — placeholder for all submitted change requests. The direct children of this node are named by the integer id assigned to each individual change request, e.g. /changes/0/. The remaining rows describe child nodes that hang off the change id node, e.g. /changes/0/received/.
/received — placeholder for all master nodes to confirm they received notification of a change request.
/applied — placeholder for all master nodes to confirm they have applied the change request.
/result — placeholder to track the status of a change request. The current possible values are scheduled and completed. This znode is used primarily by the monitoring layer.

The CWC then creates three znodes under the change id: /received, /applied, and /result, as shown in Fig. 5.5. These nodes provide a single location that all CCC agents running on CMS master nodes can use to coordinate around the change request. Indeed, we refer to this newly created change id znode, with its serialized change request and these three child znodes, as the change skeleton. The skeleton is created atomically to ensure that the entire skeleton is created and no partial skeleton is ever seen by a CCC. This skeleton will be filled in by the CCC agents as they process the change request. The CWC also sets the value of the /result znode to scheduled to indicate that this change request is now being processed.

Master Node Coordination

The next step of the process is to determine the time when all CMS master nodes can apply this change request so that it can (later) be delivered to CMS client nodes. The goal here is to keep the set of changes that are being offered to client nodes consistent across all CMS master nodes.
To do this, the CWC examines the data stored under the /masters znode that is created and maintained by the CCC agents discussed below. Each CCC registers its CMS master as a child znode under /masters. The data of these znodes can be in one of the following states:

No Data or Timestamp in the Past: This CMS master node is available to apply the change immediately.

Timestamp in the Future: This CMS master node has indicated that it is not available to accept new change requests until the indicated time.

Figure 5.5: CWC Time Coordinating, Change Registration, and Change Delivery Workflow

If all of these znodes contain either no data or expired timestamps, then the CWC sets the time of delivery for this particular change request to the next available time slot. 3 If one or more of the znodes contains an unexpired timestamp, then the CWC sets the time of delivery for this particular change request to the first available time slot after the times indicated by these znodes. The time of delivery for a change request is an attribute stored on the change request itself. Once the time has been determined, the CWC unpacks the change request and sets its delivery attribute accordingly. This design of setting delivery attributes on the change request itself was chosen to make the entire system more flexible with respect to real-time changes in delivery priorities, because the default algorithm described above can be overridden by the policy layer. After the algorithm above has been applied, the policy layer is consulted to see if any rules override the default behavior. For instance, the policy layer might contain a rule that bans the processing of any change request during a system incident. In that case, the rule would revoke the delivery time set for the current set of change requests until the system incident has been resolved. Rules can also cause change requests to be applied sooner than the default scheduling allows by, for instance, requiring security-related patches to be applied as soon as possible, causing other scheduled change requests to wait until the security patch has been applied.

Change Consumer Coordinator

The Change Consumer Coordinator (CCC) is an agent that runs on every CMS master node. When the CCC has been initialized, it contacts the coordination layer and registers its CMS master node under the /masters znode by creating an ephemeral znode.
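The default scheduling algorithm from the Master Node Coordination step can be sketched as a pure function over the timestamps published under /masters. The ten-minute slot width follows footnote 3; the timestamps and function name are illustrative.

```python
SLOT = 600  # ten-minute time slots, in seconds

def next_delivery_slot(master_timestamps, now):
    """Pick the first slot boundary after `now` and after every
    unexpired timestamp published under /masters.

    `master_timestamps` holds one value per master node: None (no
    data) or a unix timestamp; past timestamps count as available.
    """
    earliest = now
    for ts in master_timestamps:
        if ts is not None and ts > earliest:
            earliest = ts  # this master is busy until ts
    # round up to the next slot boundary
    return ((int(earliest) // SLOT) + 1) * SLOT

# All masters free (no data or expired timestamps): next boundary after now.
print(next_delivery_slot([None, 100, None], now=1000))  # → 1200
# One master busy until t=1500: first boundary after that time.
print(next_delivery_slot([None, 1500], now=1000))       # → 1800
```

The value returned here is what would be written into the change request's delivery attribute, subject to override by the policy layer.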
3 Time slots are currently set to ten-minute intervals; this value is configurable.

Ephemeral nodes have special semantics: the ZooKeeper client that creates such a node opens a session and keeps that session open for as long as the node should exist. If the client closes the session, ZooKeeper automatically deletes the node. We use ephemeral nodes because they make it easy to add and remove CMS master nodes as needed while keeping the entire set of nodes in a consistent state (discussed below). The purpose of the CCC is to listen for messages (change request notifications) from the CWC. The CCC works directly with the CMS, invoking its services when needed to ensure that all CMS master nodes are kept in sync. The configuration repository that the CMS uses on the CMS master node is completely isolated from any updates except those submitted by the CCC. This is different from the typical set-up in which operators submit changes directly to these configuration repositories. As was discussed in Chapter 2 and again in Chapter 5.2, this practice is a security risk and it can lead to system and service failures. As a result, it is explicitly disallowed in our system; instead, all change requests are submitted to the CWC and all configuration repositories used by CMS master nodes are updated by CCC agents. When the CCC receives a message from the CWC, it saves the message to a local, temporary data store and updates the znode /changes/id/received/ to include the CCC's ip address in a new znode under its path (see Fig. 5.6). The change request is added to a group of changes that have all been tagged with the same timestamp. The key here is that it is easy for the CCC to retrieve a group of changes that have all been scheduled to be applied at the same time. Once the change request has been added to the appropriate group, the message sent by the CWC is considered delivered and a confirmation is sent back to the CWC. Every two seconds, the CCC checks its local data store to see if it is time to apply a set of changes (see Fig. 5.6).
If so, it closes its session with ZooKeeper (which deletes the ephemeral znode), taking its CMS master node out of rotation. It then applies the group of changes associated with this timestamp, updates the /changes/id/applied znode, and then registers itself with ZooKeeper once again so that it can receive new messages from the CWC. By temporarily removing its child znode from the /masters znode, the CCC ensures that no CMS client node asks its master node for the latest set of changes while the CCC is updating the CMS master node itself.
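The deregister-apply-reregister cycle just described can be simulated in a few lines, with a set standing in for the children of /masters and a dictionary standing in for the CCC's local data store. All names and timestamps are illustrative.

```python
masters = {"master-a"}   # stand-in for the children of /masters
applied = []             # stand-in for /changes/<id>/applied updates

# Local data store: delivery timestamp -> group of change ids
# scheduled for that slot.
groups = {1200: ["c1", "c2"], 1800: ["c3"]}

def tick(name, now):
    """One pass of the CCC's two-second check loop."""
    due = [t for t in groups if t <= now]
    for t in sorted(due):
        masters.discard(name)        # close session: out of rotation
        for change in groups.pop(t):
            applied.append(change)   # apply the group, record each change
        masters.add(name)            # re-register under /masters

tick("master-a", now=1200)
print(applied)                  # → ['c1', 'c2']
print(sorted(groups))           # → [1800]  (future group still pending)
print("master-a" in masters)    # → True   (back in rotation)
```

While the master is absent from the set, no CCA would select it, which is exactly how clients are shielded from a half-updated master.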

In this way, Caerus ensures that no CMS client receives out-of-date configuration information from the set of active CMS master nodes. If any of the changes in a group has its context attribute set, then the workflow is modified slightly. In this situation, the CCC determines the set of CMS client nodes that are impacted by this change (i.e., that participate in the same context) and prepares (in the case of Puppet) a catalog that can be applied by all such clients. This catalog is stored in ZooKeeper in a special /catalog znode created for this purpose. As will be discussed below, the CCAs then follow a special workflow when applying change requests that have data stored for them under the /catalog znode. Finally, when a CCC is first initialized, it follows a special workflow that warms up the node with all of the current changes that need to be applied, ensuring that this particular CMS master is in sync with all of the already-running master nodes. Before the CCC registers this CMS master under the /masters znode, it retrieves any pending change requests from its local storage (changes that may not have been processed due to a crash) as well as any change requests that have been stored under the /changes znode since it was last running. All of these pending changes are then processed as described above and placed in the groups corresponding to their delivery timestamps. At this point, the CCC registers its master under the /masters znode and starts the process that checks the local data store every two seconds for changes that need to be applied.

Client Change Agent

The Client Change Agent (CCA) is an agent that runs on every CMS client node in a computing environment and handles all communications and interactions with the CMS master nodes and the layers of our system. It is allowed to contact any available CMS master node at any time, since the CWC and the CCCs have ensured that all CMS master nodes are in sync.
The only restriction on this behavior occurs when a CCA must communicate with its peers in the same context to coordinate a change that must be applied by all the nodes in that context at the same time. We initially thought that we would need to ensure that all CCAs contacted master nodes at the same time: since the masters are in sync, if all CCAs made requests in sync, then the chance for a configuration error is greatly reduced.

Figure 5.6: CCC Workflow

But we quickly realized that such an approach leads to massive loads on the master nodes in a short span of time and essentially would mimic a coordinated denial-of-service attack on that portion of our software architecture. To be clear, this is what would be required if every change request impacted a context. But since the vast majority of change requests do not require all nodes in a context to be updated at once, we allow CCAs to access the master nodes at different times, reducing the load and allowing changes to propagate through the system more quickly. Instead, CCAs are configured to detect when synchronization is required (by looking to see if a change has been placed in /catalog and assigned to them) and then take the time and effort to coordinate with their peers to apply that particular change at the same time. The workflow for a CCA is the following. It examines the /masters znode and picks one of the available master nodes. It then contacts the coordination layer to determine if any change requests have appeared since the last time this CCA updated itself. If so, it checks the /catalog znode to see if any catalogs were stored for one of the available change requests. If no catalogs are present, then all new change requests are non-critical (context independent) and can be applied immediately. In that case, the CCA pulls from the master, applies the changes, and then sleeps until it is time to check in again. However, if one or more catalogs are present, then the change is critical (context specific) and it needs to be coordinated and applied alongside all other CMS client nodes participating in the same context. In this situation, the CCA pulls all of the catalogs from the /catalog znode and persists them to a local data store.
The CCA then determines when to apply the change as follows:

(1) Take all precomputed changes under the /catalog znode that were created after the last time the CCA applied changes and sort them.

(2) Find the latest created change request that has an agreed-upon time.

(3) If one was not found, take the latest change, suggest a time, and then go to sleep until that time.

(4) If one was found, sleep until the agreed-upon time.

(5) Once the CCA wakes up, apply all the changes in the catalog and update the internal data store to reflect the last time a change was applied.

To make this process work, all CCAs in the same context must participate in a workflow that ensures they all pick the same time to apply the change. In step 2, to determine if a time has been suggested for the most recent change request in the catalog, the CCA looks for the presence of a znode with the path /catalogs/id/current. If that node exists, it contains the time at which the change should be applied. The CCA will read that value and sleep until that time, as indicated in step 4 above. If that node does NOT exist, then the CCA must perform the "suggest a time" process of step 3. To suggest a time, the CCA does the following. It first acquires a distributed write lock from ZooKeeper on /catalogs/id/ and then creates a /current znode underneath it. If other CCAs were checking for the existence of this node, they will fail to acquire the distributed write lock and will pause for a short time before checking once again if the /current znode exists. In the meantime, the CCA that holds the write lock computes the next available time slot for applying the change and writes that as the value of the /current znode. It then goes to sleep until that time, as indicated in step 3 above. 4 With the use of ZooKeeper's distributed lock mechanism, our system ensures that all CCAs within a context will discover the time that has been set to apply a context-specific change. Note that we take the step of sorting all of the catalogs that are received from ZooKeeper to ensure that all client nodes are coordinating on the most recent context-specific change request. We can do this because all previous unapplied changes will be included in the most recent catalog received from ZooKeeper.
4 Currently, the CCA will pick a time slot that is five minutes in the future; this time interval is configurable.
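The suggest-a-time handshake can be simulated in a single process, with a threading.Lock standing in for ZooKeeper's distributed write lock and a dict standing in for the /catalogs/id/current znode. This is only an illustration of the agreement logic, not the real ZooKeeper-backed implementation; the five-minute offset follows footnote 4.

```python
import threading

znodes = {}              # stand-in for the ZooKeeper tree
lock = threading.Lock()  # stand-in for the distributed write lock
SLOT = 300               # five minutes ahead, per footnote 4

def agree_on_time(catalog_id, now):
    """Return the application time every CCA in the context will use."""
    path = "/catalogs/%s/current" % catalog_id
    if path in znodes:          # a time was already suggested
        return znodes[path]
    with lock:                  # only one CCA gets to create /current
        if path not in znodes:  # re-check after acquiring the lock
            znodes[path] = now + SLOT
    return znodes[path]

# Three CCAs in the same context all arrive at the same time value.
times = [agree_on_time("7", now=1000) for _ in range(3)]
print(times)  # → [1300, 1300, 1300]
```

The re-check inside the lock mirrors what the losing CCAs do after their pause: by the time they look again, /current exists and they simply read it.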

Monitoring Layer

The monitoring layer is an agent that runs on a server and is used to verify that all change requests that enter the scheduled state eventually get applied by all CMS master nodes. It does this by periodically checking the data associated with the /changes/id/received and /changes/id/applied znodes to see if they match. If they do, the monitoring layer updates the /changes/id/result znode to completed. If not, it sends a notification to the operator to indicate that a CMS master node has gone down (that is, it received the change request but crashed before it was able to apply it). This notification allows the operator to reboot the CMS master after investigation. Note that this situation does not lead to configuration errors because, as soon as the CMS master dies, its ZooKeeper session is closed and its associated ephemeral znode under the /masters znode is deleted. Since it is deleted, no CMS client will ever be directed to it, because the CCAs only connect to master nodes that are listed under the /masters znode. In this way, CMS client nodes can never receive out-of-date information from a CMS master.

Security Layer

The security layer handles both the authentication and authorization aspects of our prototype. The authentication process determines if an operator is allowed access to the prototype at all. The security layer makes use of the Pluggable Authentication Modules for Linux to handle authentication; if authentication fails, the operator will not be allowed to submit a single change request to our prototype. The authorization process determines if an operator is allowed to update the client nodes that will be impacted by his change request. For instance, an operator may be allowed to configure the OpenSSH package on machines in cluster X but not cluster Y.
If he submits a change request written such that cluster Y machines would be impacted, Caerus will detect that and not allow the change to enter the queue of change requests being processed by the CCC. To perform authorization, the security layer asks the parsing layer to determine what CMS clients are impacted by a given change request (similar to the process discussed in Chapter 5.4.1). It then uses a simple local datastore to determine if a given operator is allowed access to the destination clients. If the answer is no, then the security layer removes the change request from its queue; the CCC will never see the change request and, hence, never process it. The security layer can easily be extended to make use of more advanced security protocols, such as the Lightweight Directory Access Protocol, to determine if operators have access to a given host.
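The authorization check described above reduces to a set-containment test against the local datastore. The sketch below uses a plain dictionary as that datastore; operators and host names are illustrative.

```python
# Simple local datastore: operator -> hosts they may reconfigure.
ACL = {
    "alice": {"web01", "web02"},   # cluster X only
    "bob": {"db01"},
}

def authorize(operator, impacted_hosts):
    """A change is authorized only if the operator may touch every
    impacted host; otherwise it is removed from the queue."""
    allowed = ACL.get(operator, set())
    return set(impacted_hosts) <= allowed

print(authorize("alice", ["web01", "web02"]))  # → True
print(authorize("alice", ["web01", "db01"]))   # → False (db01 is off-limits)
```

Swapping the dictionary for an LDAP lookup would change only the body of authorize, which is why the layer is easy to extend.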

Chapter 6

Implementation

6.1 Overview

We implemented Caerus with two goals in mind. The first was to make use of a single programming language and its standard library (Python) for simplicity and consistency. The second was to have Caerus's implementation match its design and provide a reliable and robust distributed CMS platform with high availability and failure resiliency. In this chapter, we present insight into some of the implementation details related to our prototype's use of ZooKeeper and RabbitMQ.

6.2 ZooKeeper

We make use of the Kazoo open source library to communicate with ZooKeeper. We wrapped access to ZooKeeper in a single Coordination class; that class provides the functionality shown in Table 6.1. The statlistener encapsulates the logic for connecting and reconnecting to ZooKeeper. Whenever we lose a connection to ZooKeeper, statlistener opens a new connection, establishes a session, and keeps the session open for further communication. The remaining functions are wrappers around equivalent operations in Kazoo. In particular, our Coordination class ensures that these operations handle exceptions in a way that meets our usage requirements. Kazoo provides an asynchronous API on top of ZooKeeper, but we did not use this functionality; most of our use cases required synchronous calls. Fortunately, our evaluation showed that the synchronous API performed efficiently and did not negatively impact the overall performance of our prototype.

Table 6.1: Interface to the Coordination Layer

statlistener — Monitor ZooKeeper connection status.
connect — Connect to ZooKeeper and establish a session.
createnode — Create a znode, set the data, and create the parent path.
deletenode — Delete a znode.
getnodestat — Retrieve a znode's stat object.
getnodedata — Return the data stored in a znode.
setnodedata — Set the data in a znode.

The vast majority of our prototype's layers and system components required only the functions shown above to interact with ZooKeeper. The sole exception was the monitoring layer, which made use of ZooKeeper's notification functionality. In particular, it registered as a watcher of the /changes znode. As a result, it would be notified whenever a change occurred to the /changes znode or its children.

6.3 RabbitMQ

We make use of Pika, a pure-Python implementation of the Advanced Message Queuing Protocol (AMQP). We then implemented two messaging patterns on top of RabbitMQ:

Producer-Consumer: Handles the process in which an operator submits a change request to the CWC.

Publisher-Subscriber: Handles the process in which the CWC sends a message to all CCC agents running on CMS master nodes to notify them of a new change request that must be processed.

With respect to the first pattern, the producer is responsible for submitting a change request from the operator to the queue that is monitored by the CWC. The producer first opens a connection to the RabbitMQ server and then maintains a channel over the opened connection. All messages (change request submissions) go through the channel; if the channel closes or the connection is terminated, the connection is automatically re-established and the channel is opened again. The message requests the name of a durable queue that is managed by RabbitMQ, ensuring that the message is persisted once received by the queue. The message is also flagged to require delivery notification, so that our prototype can detect those situations in which a message fails to get processed. The consumer follows similar protocols as the producer with respect to connection and failure handling. Its responsibility is to consume the messages sent by the producer. It sends each change request to the security layer for authentication and authorization as described in Chapter 5. If authentication or authorization fails, it deletes the message and the change request is never processed. If the change request is authenticated and authorized, the consumer persists it to a local data store. It then sends back the acknowledgment required by the producer. With respect to the second pattern, the publisher is part of the CWC (see Chapter 5.4.1). The publisher implements the same techniques described above for connection, reconnection, message delivery, and queue durability. In a publish-subscribe scenario, however, RabbitMQ does not allow the publisher to interact directly with the queue. Instead, the publisher communicates with a broker that ensures delivery of the message to the subscribers. To protect against data loss, we configured the broker to be durable as well; this means that it makes use of durable queues to communicate with the subscribers.
Finally, the subscriber is part of the CCC (see Chapter 5.4.2) and acts like the consumer above, with the exception that its messages are notifications that a new change request is ready to be applied to all CMS master nodes. The subscriber has no need to communicate with the security layer and instead forwards each message immediately to the CCC for processing.
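The acknowledgment-based delivery guarantee at the heart of both patterns can be illustrated without a broker. Below, a plain list stands in for the durable queue, and a message is only removed once the handler returns, so a message that was delivered but never acknowledged survives a consumer crash. This is a stand-in for the behavior RabbitMQ provides, not Pika code.

```python
queue = []  # stand-in for a durable RabbitMQ queue

def publish(msg):
    queue.append(msg)  # the broker persists the message

def consume(handler):
    """Deliver each message; remove it only after the handler
    returns (the 'ack'). A crash before the ack leaves it queued."""
    while queue:
        msg = queue[0]  # deliver without removing
        try:
            handler(msg)
        except Exception:
            return      # consumer crashed: message stays for redelivery
        queue.pop(0)    # ack received: now remove it

publish("change-1")
publish("change-2")

crashed_once = []
def handler(msg):
    # Simulate a one-time crash while processing the second message.
    if msg == "change-2" and not crashed_once:
        crashed_once.append(True)
        raise RuntimeError("consumer crash")

consume(handler)
print(queue)      # → ['change-2']  (unacked message survived the crash)
consume(handler)  # a restarted consumer picks it up
print(queue)      # → []
```

This is the same reason the CWC never loses a change request: the queue holds each message until the acknowledgment arrives.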

Chapter 7

Evaluation

7.1 Overview

In this chapter we present our evaluation methodology and the details and results of the experiments we performed to evaluate our prototype system, Caerus.

7.2 Environment

The computing environment used for evaluating Caerus consists of twelve machines maintained and operated by Project EPIC at the University of Colorado Boulder. These systems have been making use of Puppet for several years; we analyzed the logs from those machines to identify the connection behavior of Puppet client nodes to Puppet master nodes, as discussed in Chapter 5. We made use of one of these twelve servers to perform our evaluation experiments. The evaluation server has two Intel Xeon E5620 CPUs running at 2.40GHz; together these CPUs provide sixteen cores and have access to 24GB of memory. The server has 19TB of disk space formatted with the XFS file system. The operating system for this server is Ubuntu LTS running the generic version of the Linux kernel. To be able to simulate a large number of machines, we made use of the Linux Containers [20] command, lxc, to create thirty-two containers (machines) for our experiments. The use of lxc provides the isolation needed to run our experiments without the overhead introduced by a virtualization layer. Using lxc also simplified managing our experiments, since the host machine can access the file systems of the containers directly; this made it straightforward, for instance, to access the log files generated by each container during our experiments and to configure each instance before an experimental run. To ensure parallelism and not just concurrency, we used a feature of lxc that allows specific containers to be assigned to specific cores. Since our evaluation server has sixteen cores, we assigned two containers to each core, allowing us to simulate an environment consisting of thirty-two machines. Each container was granted 512MB of memory and used Ubuntu as its operating system. Finally, we installed the dfsg1-1ubuntu1 version of ZooKeeper and the ubuntu4 version of RabbitMQ on the evaluation server and increased the number of file descriptors that could be open at the same time, since the default value of 1024 was too low. This configuration was then used to run all of our evaluation experiments.

7.3 Experimental Data Set

To perform our experiments, we made use of a large set of Puppet files available from the Harvard SEAS Academic Computing website [9]. Critical to our evaluation, this data set is distributed as a git repository, allowing us access to the change requests (i.e., git commits) made to this repository by Harvard's system administrators (i.e., operators). This repository consisted of 1395 commits that we could use to generate change requests for our prototype CMS. There are two interesting things to note about this real-world repository. The first is the size of a typical commit: as can be seen in Fig. 7.1, the vast majority of these commits are under 100 KB in size. The second is the time at which a commit is entered into the repository: as can be seen in Fig. 7.2, the Harvard operators would submit commits at all times of the day. This data further underscores the issues discussed in Chapter 5.2.
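The two observations behind Figs. 7.1 and 7.2 amount to simple summaries over the commit metadata. The sketch below runs that kind of analysis over a handful of illustrative (size, timestamp) pairs; the numbers are invented for demonstration and are not the Harvard data.

```python
# Illustrative commit metadata: (size in KB, unix timestamp of commit).
commits = [(12, 1000), (48, 4600), (250, 8200), (7, 2400)]

# Fig. 7.1-style summary: what fraction of commits are under 100 KB?
under_100kb = sum(1 for size, _ in commits if size < 100)
print(under_100kb / len(commits))  # → 0.75

# Fig. 7.2-style summary: minute of the hour at which each commit landed.
minutes = sorted((ts % 3600) // 60 for _, ts in commits)
print(minutes)  # → [16, 16, 16, 40]
```

On the real repository, the same two summaries show small commits dominating and submissions scattered across the clock, which motivates the scheduling machinery of Chapter 5.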
Not only do CMS clients connect to masters in an unpredictable fashion across the hour, but at any moment a new change request can be received, further increasing the chance for configuration errors if changes are not properly managed and synchronized, as performed by our prototype.

Figure 7.1: Size of Change Requests Used in Evaluation

When performing an experiment, we used this repository to drive the change requests that our prototype had to handle. We wrote a simple Python script that would read the commits from this repository, package them up in messages, and send them to the CWC for processing. Our script would send one change request per second to our system, simulating a load that is not matched by any large-scale computing environment today. That is, with this repository, our prototype was asked to perform 1395 change requests over the course of approximately 24 minutes.

7.4 Deployment and Experimental Configurations

To deploy the software needed to run our experiments, we performed the following steps.

(1) Operating System: Ubuntu was installed to one of the containers and that configuration was then cloned to the remaining thirty-one machines.

(2) Python Dependencies: Once Ubuntu was installed on all thirty-two containers, we installed Python's pip package manager on each container and used pip to install all of the required libraries needed by Caerus's agents and layers.

(3) Caerus Software: The software for Caerus's layers and the CWC was installed on the evaluation server, outside of the thirty-two containers. The software for the CCC and the CCA was installed on each container. This included any software the CCC and the CCA needed to communicate with ZooKeeper, RabbitMQ, and Caerus's layers.

(4) Puppet: The last step of the deployment was to install and configure Puppet on each of the thirty-two containers.

We had two distinct experimental configurations. The first type of experiment tests the ability of the CCC agents to synchronize all of the CMS master nodes. In this configuration, the following steps were performed:

(1) Launch Apache ZooKeeper and RabbitMQ on the evaluation server.

(2) Launch Caerus on the evaluation server. For the evaluation server, this includes all of Caerus's layers plus the CWC agent.

Figure 7.2: Minute of Hour when Change Request Submitted


More information

Modeling of Hydraulic Hose Paths

Modeling of Hydraulic Hose Paths Mechanical Engineering Conference Presentations, Papers, and Proceedings Mechanical Engineering 9-2002 Modeling of Hydraulic Hose Paths Kurt A. Chipperfield Iowa State University Judy M. Vance Iowa State

More information

Accelerate Your Riverbed SteelHead Deployment and Time to Value

Accelerate Your Riverbed SteelHead Deployment and Time to Value SteelHead Implementation Service Accelerate Your Riverbed SteelHead Deployment and Time to Value Proven Methodologies to Increase Adoption and Utilization Your organization has selected SteelHead appliances

More information

User Help. Fabasoft Scrum

User Help. Fabasoft Scrum User Help Fabasoft Scrum Copyright Fabasoft R&D GmbH, Linz, Austria, 2018. All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective

More information

Spacecraft Simulation Tool. Debbie Clancy JHU/APL

Spacecraft Simulation Tool. Debbie Clancy JHU/APL FSW Workshop 2011 Using Flight Software in a Spacecraft Simulation Tool Debbie Clancy JHU/APL debbie.clancy@jhuapl.edu 443-778-7721 Agenda Overview of RBSP and FAST Technical Challenges Dropping FSW into

More information

Purpose. Scope. Process flow OPERATING PROCEDURE 07: HAZARD LOG MANAGEMENT

Purpose. Scope. Process flow OPERATING PROCEDURE 07: HAZARD LOG MANAGEMENT SYDNEY TRAINS SAFETY MANAGEMENT SYSTEM OPERATING PROCEDURE 07: HAZARD LOG MANAGEMENT Purpose Scope Process flow This operating procedure supports SMS-07-SP-3067 Manage Safety Change and establishes the

More information

A SEMI-PRESSURE-DRIVEN APPROACH TO RELIABILITY ASSESSMENT OF WATER DISTRIBUTION NETWORKS

A SEMI-PRESSURE-DRIVEN APPROACH TO RELIABILITY ASSESSMENT OF WATER DISTRIBUTION NETWORKS A SEMI-PRESSURE-DRIVEN APPROACH TO RELIABILITY ASSESSMENT OF WATER DISTRIBUTION NETWORKS S. S. OZGER PhD Student, Dept. of Civil and Envir. Engrg., Arizona State Univ., 85287, Tempe, AZ, US Phone: +1-480-965-3589

More information

Exemplary Conditional Automation (Level 3) Use Case Description Submitted by the Experts of OICA as input to the IWG ITS/AD

Exemplary Conditional Automation (Level 3) Use Case Description Submitted by the Experts of OICA as input to the IWG ITS/AD Submitted by OICA Document No. ITS/AD-06-05 (6th ITS/AD, 3 November 2015, agenda item 3-2) Exemplary Conditional Automation (Level 3) Use Case Description Submitted by the Experts of OICA as input to the

More information

Application of Bayesian Networks to Shopping Assistance

Application of Bayesian Networks to Shopping Assistance Application of Bayesian Networks to Shopping Assistance Yang Xiang, Chenwen Ye, and Deborah Ann Stacey University of Guelph, CANADA Abstract. We develop an on-line shopping assistant that can help a e-shopper

More information

TERMS OF REFERENCE. 1. Background

TERMS OF REFERENCE. 1. Background TERMS OF REFERENCE NATIONAL INDIVIDUAL CONSULTANT TO DEVELOP ONSERVER MANAGEMENT SYSTEM (OMS) FOR INDEPENDEDNT ELECTORAL AND BOUNDARIES COMMISSION (IEBC) 1. Background The United Nations Development Programme

More information

Italian Olympiad in Informatics: 10 Years of the Selection and Education Process

Italian Olympiad in Informatics: 10 Years of the Selection and Education Process Olympiads in Informatics, 2011, Vol. 5, 140 146 140 2011 Vilnius University Italian Olympiad in Informatics: 10 Years of the Selection and Education Process Mario ITALIANI Dipartimento di Scienze dell

More information

Critical Systems Validation

Critical Systems Validation Critical Systems Validation Objectives To explain how system reliability can be measured and how reliability growth models can be used for reliability prediction To describe safety arguments and how these

More information

Training Fees 3,400 US$ per participant for Public Training includes Materials/Handouts, tea/coffee breaks, refreshments & Buffet Lunch.

Training Fees 3,400 US$ per participant for Public Training includes Materials/Handouts, tea/coffee breaks, refreshments & Buffet Lunch. Training Title DISTRIBUTED CONTROL SYSTEMS (DCS) 5 days Training Venue and Dates DISTRIBUTED CONTROL SYSTEMS (DCS) Trainings will be conducted in any of the 5 star hotels. 5 22-26 Oct. 2017 $3400 Dubai,

More information

CHAPTER 7.0 IMPLEMENTATION

CHAPTER 7.0 IMPLEMENTATION CHAPTER 7.0 IMPLEMENTATION Achieving the vision of the Better Streets Plan will rely on the ability to effectively fund, build and maintain improvements, and to sustain improvements over time. CHAPTER

More information

IDeA Competition Report. Electronic Swimming Coach (ESC) for. Athletes who are Visually Impaired

IDeA Competition Report. Electronic Swimming Coach (ESC) for. Athletes who are Visually Impaired IDeA Competition Report Electronic Swimming Coach (ESC) for Athletes who are Visually Impaired Project Carried Out Under: The Department of Systems and Computer Engineering Carleton University Supervisor

More information

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS #19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS Final Research Report Luis E. Navarro-Serment, Ph.D. The Robotics Institute Carnegie Mellon University November 25, 2018. Disclaimer

More information

GOLOMB Compression Technique For FPGA Configuration

GOLOMB Compression Technique For FPGA Configuration GOLOMB Compression Technique For FPGA Configuration P.Hema Assistant Professor,EEE Jay Shriram Group Of Institutions ABSTRACT Bit stream compression is important in reconfigurable system design since it

More information

EX0-008 exin. Number: EX0-008 Passing Score: 800 Time Limit: 120 min.

EX0-008 exin. Number: EX0-008 Passing Score: 800 Time Limit: 120 min. EX0-008 exin Number: EX0-008 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Which statement describes Release Planning? A. After looking at all the stories in the backlog, the team estimates

More information

INNOVATIVE MOORING SYSTEMS

INNOVATIVE MOORING SYSTEMS INNOVATIVE MOORING SYSTEMS VESSEL AUTOMOORING MODULES QUAY AUTOMOORING INSTALLATIONS DOCKLOCK brings mooring to a next level PAGE 2 FOR OVER A CENTURY THE WORLD S LEADING EXPERT IN MOORING, BERTHING AND

More information

Connect with Confidence NO POWER NO PROBLEM

Connect with Confidence NO POWER NO PROBLEM Connect with Confidence NO POWER NO PROBLEM The ideal solution to implement wireless sensor monitoring in IoT applications where power is not available. At last, there s a roll-out ready way to implement

More information

Section 8: Model-View-Controller. Slides adapted from Alex Mariakakis, with material from Krysta Yousoufian and Kellen Donohue

Section 8: Model-View-Controller. Slides adapted from Alex Mariakakis, with material from Krysta Yousoufian and Kellen Donohue Section 8: Model-View-Controller Slides adapted from Alex Mariakakis, with material from Krysta Yousoufian and Kellen Donohue Agenda MVC MVC example 1: traffic light MVC example 2: registration HW8 info

More information

CASE STUDY City of Monrovia: Leveraging emerging ridesharing services to expand mobility options

CASE STUDY City of Monrovia: Leveraging emerging ridesharing services to expand mobility options Advancing Mobility Management CASE STUDY City of Monrovia: Leveraging emerging ridesharing services to expand mobility options Agency: Location: Service Area: Modes: Community Context: Key Contacts: City

More information

Queue analysis for the toll station of the Öresund fixed link. Pontus Matstoms *

Queue analysis for the toll station of the Öresund fixed link. Pontus Matstoms * Queue analysis for the toll station of the Öresund fixed link Pontus Matstoms * Abstract A new simulation model for queue and capacity analysis of a toll station is presented. The model and its software

More information

Virtual Breadboarding. John Vangelov Ford Motor Company

Virtual Breadboarding. John Vangelov Ford Motor Company Virtual Breadboarding John Vangelov Ford Motor Company What is Virtual Breadboarding? Uses Vector s CANoe product, to simulate MATLAB Simulink models in a simulated or real vehicle environment. Allows

More information

A Guide to SCRUMstudy Certifications and Courses SDC SMC SPOC AEC ESM.

A Guide to SCRUMstudy Certifications and Courses SDC SMC SPOC AEC ESM. A Guide to SCRUMstudy Certifications and Courses SDC SMC SPOC AEC ESM www.com Table of Contents Introduction 3 About SCRUMstudy 3 Why should I pursue SCRUMstudy Certification? 3 SCRUMstudy Certifications

More information

SQL LiteSpeed 3.0 Installation Guide

SQL LiteSpeed 3.0 Installation Guide SQL LiteSpeed 3.0 Installation Guide Revised January 27, 2004 Written by: Jeremy Kadlec Edgewood Solutions www.edgewoodsolutions.com 888.788.2444 2 Introduction This guide outlines the SQL LiteSpeed 3.0

More information

A Guide to SCRUMstudy Certifications and Courses SDC SMC SPOC AEC ESM.

A Guide to SCRUMstudy Certifications and Courses SDC SMC SPOC AEC ESM. A Guide to SCRUMstudy Certifications and Courses SDC SMC SPOC AEC ESM www.com Table of Contents Introduction 3 About SCRUMstudy 3 Why should I pursue SCRUMstudy Certification? 3 SCRUMstudy Certifications

More information

Open Research Online The Open University s repository of research publications and other research outputs

Open Research Online The Open University s repository of research publications and other research outputs Open Research Online The Open University s repository of research publications and other research outputs Developing an intelligent table tennis umpiring system Conference or Workshop Item How to cite:

More information

Integrating Best of Breed Outage Management Systems with Mobile Data Systems. Abstract

Integrating Best of Breed Outage Management Systems with Mobile Data Systems. Abstract Integrating Best of Breed Outage Management Systems with Mobile Data Systems Donald Shaw Partner ExtenSys Inc. 31 Plymbridge Crescent North York, ON M2P 1P4 Canada Telephone: (416) 481-1546 Fax: (416)

More information

Deep dive SSL. Created for CUSTOMER

Deep dive SSL. Created for CUSTOMER Deep dive SSL Created for Page 2 of 11 Contents Introduction...3 Preface...3 SteelHeads in Scope...4 Optimization Errors vs. No Errors...5 Transport errors...6 Top 10 SteelHead peers with errors...7 Top

More information

NEURAL NETWORKS BASED TYRE IDENTIFICATION FOR A TYRE INFLATOR OPERATIONS

NEURAL NETWORKS BASED TYRE IDENTIFICATION FOR A TYRE INFLATOR OPERATIONS Lfe/sou/n /oh NEURAL NETWORKS BASED TYRE IDENTIFICATION FOR A TYRE INFLATOR OPERATIONS A Thesis submitted to the Department of Electrical Engineering, University of Moratuwa On partial fulfilment of the

More information

Avoiding Short Term Overheat Failures of Recovery Boiler Superheater Tubes

Avoiding Short Term Overheat Failures of Recovery Boiler Superheater Tubes Avoiding Short Term Overheat Failures of Recovery Boiler Superheater Tubes Dr. Andrew K. Jones International Paper Tim Carlier Integrated Test and Measurement 2017 International Chemical Recovery Conference

More information

PROJECT and MASTER THESES 2016/2017

PROJECT and MASTER THESES 2016/2017 PROJECT and MASTER THESES 2016/2017 Below you ll find proposed topics for project and master theses. Most of the proposed topics are just sketches. The detailed topics will be made in discussion between

More information

Scrum Guide Revision

Scrum Guide Revision November 2017 Scrum Guide Revision Jeff Sutherland Ken Schwaber Introduction Agenda How we have gotten here What has changed in the Scrum Guide Addressing common misconceptions A Little About Scrum Scrum

More information

Planning and Design of Proposed ByPass Road connecting Kalawad Road to Gondal Road, Rajkot - Using Autodesk Civil 3D Software.

Planning and Design of Proposed ByPass Road connecting Kalawad Road to Gondal Road, Rajkot - Using Autodesk Civil 3D Software. Planning and Design of Proposed ByPass Road connecting Kalawad Road to Gondal Road, Rajkot - Using Autodesk Civil 3D Software. 1 Harshil S. Shah, 2 P.A.Shinkar 1 M.E. Student(Transportation Engineering),

More information

Questions & Answers About the Operate within Operate within IROLs Standard

Questions & Answers About the Operate within Operate within IROLs Standard Index: Introduction to Standard...3 Expansion on Definitions...5 Questions and Answers...9 Who needs to comply with this standard?...9 When does compliance with this standard start?...10 For a System Operator

More information

An atc-induced runway incursion

An atc-induced runway incursion An atc-induced runway incursion Editorial note: This situational example is not a real occurrence and neither is it intended to be a full description. It has been created to allow a focus on operational

More information

Diver Training Options

Diver Training Options MAIN INTERNET ON-SITE TAILORED PACKAGES INTER-COMPANY Diver Training Options DBI offers a menu of tailored courses Designed for users as well as IT Professionals to learn how to master the functionality

More information

We release Mascot Server 2.6 at the end of last year. There have been a number of changes and improvements in the search engine and reports.

We release Mascot Server 2.6 at the end of last year. There have been a number of changes and improvements in the search engine and reports. 1 We release Mascot Server 2.6 at the end of last year. There have been a number of changes and improvements in the search engine and reports. I ll also be covering some enhancements and changes in Mascot

More information

Non Functional Requirement (NFR)

Non Functional Requirement (NFR) Non Functional Requirement (NFR) Balasubramanian Swaminathan PMP, ACP, CSM, CSP, SPC4.0, AHF Director Global Programs, Digital Operations [Enterprise Agile Coach and Leader] GE Healthcare Digital Copyright

More information

GUIDE TO RUNNING A BIKE SHARE. h o w t o p l a n a n d o p e r a t e a s u c c e s s f u l b i k e s h a r e p r o g r a m

GUIDE TO RUNNING A BIKE SHARE. h o w t o p l a n a n d o p e r a t e a s u c c e s s f u l b i k e s h a r e p r o g r a m GUIDE TO RUNNING A BIKE SHARE h o w t o p l a n a n d o p e r a t e a s u c c e s s f u l b i k e s h a r e p r o g r a m 20150209 The bicycle is the most loved form of transportation. No other machine

More information

Understanding safety life cycles

Understanding safety life cycles Understanding safety life cycles IEC/EN 61508 is the basis for the specification, design, and operation of safety instrumented systems (SIS) Fast Forward: IEC/EN 61508 standards need to be implemented

More information

IMPLEMENTING SCRUM. PART 1 of 5: KEYS TO SUCCESSFUL CHANGE. Designed by Axosoft, creators of the #1 selling Scrum software.

IMPLEMENTING SCRUM. PART 1 of 5: KEYS TO SUCCESSFUL CHANGE. Designed by Axosoft, creators of the #1 selling Scrum software. IMPLEMENTING SCRUM GUIDE PART 1 of 5: KEYS TO SUCCESSFUL CHANGE Designed by Axosoft, creators of the #1 selling Scrum software. A STORY ABOUT NIC AND SKIP I don t understand why Scrum isn t sticking. We

More information

Section 10 - Hydraulic Analysis

Section 10 - Hydraulic Analysis Section 10 - Hydraulic Analysis Methodology Documentation Functionality Summary Sizing Methodology Fixed/Resize Combined Flow Storm: Sizing as per d/d Structures.dat Storm vs. Sanitary Methodology HGL/EGL

More information

Advanced Test Equipment Rentals ATEC (2832) OMS 600

Advanced Test Equipment Rentals ATEC (2832) OMS 600 Established 1981 Advanced Test Equipment Rentals www.atecorp.com 800-404-ATEC (2832) OMS 600 Continuous partial discharge monitoring system for power generators and electrical motors Condition monitoring

More information

SIL Safety Manual. ULTRAMAT 6 Gas Analyzer for the Determination of IR-Absorbing Gases. Supplement to instruction manual ULTRAMAT 6 and OXYMAT 6

SIL Safety Manual. ULTRAMAT 6 Gas Analyzer for the Determination of IR-Absorbing Gases. Supplement to instruction manual ULTRAMAT 6 and OXYMAT 6 ULTRAMAT 6 Gas Analyzer for the Determination of IR-Absorbing Gases SIL Safety Manual Supplement to instruction manual ULTRAMAT 6 and OXYMAT 6 ULTRAMAT 6F 7MB2111, 7MB2117, 7MB2112, 7MB2118 ULTRAMAT 6E

More information

Well-formed Dependency and Open-loop Safety. Based on Slides by Professor Lui Sha

Well-formed Dependency and Open-loop Safety. Based on Slides by Professor Lui Sha Well-formed Dependency and Open-loop Safety Based on Slides by Professor Lui Sha Reminders and Announcements Announcements: CS 424 is now on Piazza: piazza.com/illinois/fall2017/cs424/home We must form

More information

SIDRA INTERSECTION 6.1 UPDATE HISTORY

SIDRA INTERSECTION 6.1 UPDATE HISTORY Akcelik & Associates Pty Ltd PO Box 1075G, Greythorn, Vic 3104 AUSTRALIA ABN 79 088 889 687 For all technical support, sales support and general enquiries: support.sidrasolutions.com SIDRA INTERSECTION

More information

FIRE PROTECTION. In fact, hydraulic modeling allows for infinite what if scenarios including:

FIRE PROTECTION. In fact, hydraulic modeling allows for infinite what if scenarios including: By Phil Smith, Project Manager and Chen-Hsiang Su, PE, Senior Consultant, Lincolnshire, IL, JENSEN HUGHES A hydraulic model is a computer program configured to simulate flows for a hydraulic system. The

More information

Operating Mode Selection in Conjunction with Functional Safety Safety Integrated https://support.industry.siemens.com/cs/ww/en/view/ 89260861 Siemens Industry Online Support Siemens AG 2017 All rights

More information

Actualtests ASF 45q. Number: ASF Passing Score: 800 Time Limit: 120 min File Version: 15.5 ASF. Agile Scrum Foundation

Actualtests ASF 45q.   Number: ASF Passing Score: 800 Time Limit: 120 min File Version: 15.5 ASF. Agile Scrum Foundation Actualtests ASF 45q Number: ASF Passing Score: 800 Time Limit: 120 min File Version: 15.5 http://www.gratisexam.com/ ASF Agile Scrum Foundation Excellent Questions, I pass with 90% with these questions.

More information

An Assessment of FlowRound for Signalised Roundabout Design.

An Assessment of FlowRound for Signalised Roundabout Design. An Assessment of FlowRound for Signalised Roundabout Design. 1.0 Introduction This critique is based upon recent use by White Young Green signal engineering staff. The comments made do not cover all of

More information

TIDEWATER GOVERNMENT INDUSTRY COUNCIL

TIDEWATER GOVERNMENT INDUSTRY COUNCIL TIDEWATER COUNCIL Communications During Acquisition Planning to Better Support the Requirement and the Mission How Communication Among the Full Acquisition Team Including Industry Can Result in Clear,

More information

Distributed Control Systems

Distributed Control Systems Unit 41: Unit code Distributed Control Systems M/615/1509 Unit level 5 Credit value 15 Introduction With increased complexity and greater emphasis on cost control and environmental issues, the efficient

More information

Evaluating chaff fire pattern algorithms in a simulation environment. JP du Plessis Institute for Maritime Technology South Africa

Evaluating chaff fire pattern algorithms in a simulation environment. JP du Plessis Institute for Maritime Technology South Africa Evaluating chaff fire pattern algorithms in a simulation environment JP du Plessis (jdp@imt.co.za) Institute for Maritime Technology South Africa Overview What is seduction chaff? Chaff solution algorithm

More information

Operational Comparison of Transit Signal Priority Strategies

Operational Comparison of Transit Signal Priority Strategies Operational Comparison of Transit Signal Priority Strategies Revision Submitted on: November, 0 Author: Adriana Rodriguez, E.I Assistant Engineer Parsons Brinckerhoff 0 South Orange Avenue, Suite 00 Orlando,

More information

Iteration: while, for, do while, Reading Input with Sentinels and User-defined Functions

Iteration: while, for, do while, Reading Input with Sentinels and User-defined Functions Iteration: while, for, do while, Reading Input with Sentinels and User-defined Functions This programming assignment uses many of the ideas presented in sections 6 and 7 of the course notes. You are advised

More information

AN AUTONOMOUS DRIVER MODEL FOR THE OVERTAKING MANEUVER FOR USE IN MICROSCOPIC TRAFFIC SIMULATION

AN AUTONOMOUS DRIVER MODEL FOR THE OVERTAKING MANEUVER FOR USE IN MICROSCOPIC TRAFFIC SIMULATION AN AUTONOMOUS DRIVER MODEL FOR THE OVERTAKING MANEUVER FOR USE IN MICROSCOPIC TRAFFIC SIMULATION OMAR AHMAD oahmad@nads-sc.uiowa.edu YIANNIS E. PAPELIS yiannis@nads-sc.uiowa.edu National Advanced Driving

More information

A study on the relation between safety analysis process and system engineering process of train control system

A study on the relation between safety analysis process and system engineering process of train control system A study on the relation between safety analysis process and system engineering process of train control system Abstract - In this paper, the relationship between system engineering lifecycle and safety

More information

Module 3 Developing Timing Plans for Efficient Intersection Operations During Moderate Traffic Volume Conditions

Module 3 Developing Timing Plans for Efficient Intersection Operations During Moderate Traffic Volume Conditions Module 3 Developing Timing Plans for Efficient Intersection Operations During Moderate Traffic Volume Conditions CONTENTS (MODULE 3) Introduction...1 Purpose...1 Goals and Learning Outcomes...1 Organization

More information

Surge suppressor To perform its intended functions, an AEI site must have the components listed above and shown in Fig. 4.1.

Surge suppressor To perform its intended functions, an AEI site must have the components listed above and shown in Fig. 4.1. 4.0 COMPONENT FEATURES AND REQUIREMENTS An AEI site should include the following component subsystems: Presence detector Wheel detector Tag reader Controller Communications Power supply Surge suppressor

More information

PSM I PROFESSIONAL SCRUM MASTER

PSM I PROFESSIONAL SCRUM MASTER PSM I PROFESSIONAL SCRUM MASTER 1 Upon What kind of process control is SCRUM based? a) IDEAL b) SCRUM enterprise c) Empirical d) Agile 2 If burndown charts are used to visualize progress, what do they

More information

Exhibit 1 PLANNING COMMISSION AGENDA ITEM

Exhibit 1 PLANNING COMMISSION AGENDA ITEM Exhibit 1 PLANNING COMMISSION AGENDA ITEM Project Name: Grand Junction Circulation Plan Grand Junction Complete Streets Policy Applicant: City of Grand Junction Representative: David Thornton Address:

More information

Citation for published version (APA): Canudas Romo, V. (2003). Decomposition Methods in Demography Groningen: s.n.

Citation for published version (APA): Canudas Romo, V. (2003). Decomposition Methods in Demography Groningen: s.n. University of Groningen Decomposition Methods in Demography Canudas Romo, Vladimir IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please

More information

Bus and Transit Lane Review Update

Bus and Transit Lane Review Update Board Meeting / 25 February 2013 Agenda Item no.8(i) Bus and Transit Lane Review Update Glossary Auckland Transport New Zealand Transport Agency Transit Two Transit Three (AT) (NZTA) (T2) (T3) Executive

More information

Fail Operational Controls for an Independent Metering Valve

Fail Operational Controls for an Independent Metering Valve Group 14 - System Intergration and Safety Paper 14-3 465 Fail Operational Controls for an Independent Metering Valve Michael Rannow Eaton Corporation, 7945 Wallace Rd., Eden Prairie, MN, 55347, email:

More information

PBR MODEL GOVERNANCE CHECKLIST: Some Considerations for Practicing Actuaries

PBR MODEL GOVERNANCE CHECKLIST: Some Considerations for Practicing Actuaries PBR MODEL GOVERNANCE CHECKLIST: Some Considerations for Practicing Actuaries 2016 American Academy of Actuaries. All rights reserved. May not be reproduced without express permission. PBR Boot Camp: Basic

More information

Integrate Riverbed SteelHead. EventTracker v8.x and above

Integrate Riverbed SteelHead. EventTracker v8.x and above EventTracker v8.x and above Publication Date: March 27, 2018 Abstract This guide provides instructions to configure a Riverbed SteelHead to send its syslog to EventTracker Enterprise Scope The configurations

More information

ROUNDABOUT CAPACITY: THE UK EMPIRICAL METHODOLOGY

ROUNDABOUT CAPACITY: THE UK EMPIRICAL METHODOLOGY ROUNDABOUT CAPACITY: THE UK EMPIRICAL METHODOLOGY 1 Introduction Roundabouts have been used as an effective means of traffic control for many years. This article is intended to outline the substantial

More information

Swing Labs Training Guide

Swing Labs Training Guide Swing Labs Training Guide How to perform a fitting using FlightScope and Swing Labs Upload Manager 3 v0 20080116 ii Swing labs Table of Contents 1 Installing & Set-up of Upload Manager 3 (UM3) 1 Installation.................................

More information

Connecting Sacramento: A Trip-Making and Accessibility Study

Connecting Sacramento: A Trip-Making and Accessibility Study Connecting Sacramento: A Trip-Making and Accessibility Study Study Overview and Highlights July 2017 Purpose of this study Local governments and transportation agencies often make challenging decisions

More information

Ranger Walking Initiation Stephanie Schneider 5/15/2012 Final Report for Cornell Ranger Research

Ranger Walking Initiation Stephanie Schneider 5/15/2012 Final Report for Cornell Ranger Research 1 Ranger Walking Initiation Stephanie Schneider sns74@cornell.edu 5/15/2012 Final Report for Cornell Ranger Research Abstract I joined the Biorobotics Lab this semester to gain experience with an application

More information

Software Reliability 1

Software Reliability 1 Software Reliability 1 Software Reliability What is software reliability? the probability of failure-free software operation for a specified period of time in a specified environment input sw output We

More information

IST-203 Online DCS Migration Tool. Product presentation

IST-203 Online DCS Migration Tool. Product presentation IST-203 Online DCS Migration Tool Product presentation DCS Migration Defining the problem Table of contents Online DCS Migration Tool (IST-203) Technical background Advantages How to save money and reduce

More information

UNIVERSITY OF TENNESSEE HEALTH SCIENCE CENTER INSTITUTIONAL REVIEW BOARD CONTINUING REVIEW OF RESEARCH

UNIVERSITY OF TENNESSEE HEALTH SCIENCE CENTER INSTITUTIONAL REVIEW BOARD CONTINUING REVIEW OF RESEARCH UNIVERSITY OF TENNESSEE HEALTH SCIENCE CENTER INSTITUTIONAL REVIEW BOARD CONTINUING REVIEW OF RESEARCH I. PURPOSE This document outlines the University of Tennessee Health Science Center Institutional

More information

USA Volleyball (revised August 2016) Indoor Officiating Training Modules & Exams

USA Volleyball (revised August 2016) Indoor Officiating Training Modules & Exams 2016-17 USA Volleyball (revised August 2016) Indoor Officiating Training Modules & Exams Officiating Training Modules & Exams All are new or revised as the current season course work for officiating teams,

More information

- 2 - Companion Web Site. Back Cover. Synopsis

- 2 - Companion Web Site. Back Cover. Synopsis Companion Web Site A Programmer's Introduction to C# by Eric Gunnerson ISBN: 1893115860 Apress 2000, 358 pages This book takes the C programmer through the all the details from basic to advanced-- of the

More information

The Dynamics of Change in a Complex Sporting Environment: Australian Cricket. Peter Tanswell. Thesis submitted for Doctor of Project Management degree
