Seattle Conference on Scalability: Building a Scalable Resource Management System for Grid Computing
Google Tech Talks
June 23, 2007
Speaker: Khalid Ahmed, Platform Computing Corp.
This talk will describe the architecture and implementation details of a highly scalable resource management layer that can support a variety of applications and workloads. The technology has evolved from large-scale computing grids deployed in
production at customers such as Texas Instruments, AMD, JP Morgan, and various government labs. We will show how to build a centralized dynamic load-information collection service that can handle up to 5,000 nodes/20,000 CPUs in a single cluster. The service gathers a variety of system-level metrics, is extensible to collect up to 256 dynamic or static attributes of a node, and actively feeds them to a centralized master. A built-in election algorithm provides timely failover of the master service, delivering high availability without the need for specialized interconnects. This building block extends to multiple clusters that can be organized hierarchically to support a single resource management domain spanning multiple data centers.
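The abstract does not spell out the election rule, so as a minimal illustrative sketch (not Platform's actual protocol), assume each node has a fixed rank and the lowest-ranked live node becomes master; the `Node` class and its placeholder metrics are likewise hypothetical:

```python
import time

class Node:
    """One cluster member: samples local metrics and knows the member list."""

    def __init__(self, node_id, members):
        self.node_id = node_id   # position in `members` doubles as election rank
        self.members = members
        self.alive = True

    def collect_metrics(self):
        # Stand-in for the up-to-256 static/dynamic attributes the talk
        # mentions; a real agent would sample CPU load, free memory, etc.
        return {"node": self.node_id, "load1m": 0.42,
                "mem_free_mb": 2048, "ts": time.time()}

def elect_master(nodes):
    """Rank-based election sketch: the lowest-ranked node still alive wins.
    Every member applies the same deterministic rule to the same member
    list, so survivors agree on the new master without a special
    interconnect."""
    for node in sorted(nodes, key=lambda n: n.node_id):
        if node.alive:
            return node.node_id
    raise RuntimeError("no live nodes in cluster")
```

When the current master dies, the survivors rerun the same rule and promote the next-ranked node, which is one way to get the timely failover the abstract describes.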
We believe the current architecture could scale to 100,000 nodes/400,000 CPUs. Additional services that leverage this core scale-out clustering service, such as a distributed process execution service and a policy-based resource allocation engine, are also described. The protocols, communication overheads, and design tradeoffs made during the development of these services will be presented, along with experimental results from tests, simulations, and production environments.
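The abstract does not detail the allocation policy itself. As an illustrative sketch under assumed proportional-share semantics (not Platform's actual engine), an allocator could grant CPU slots one at a time to whichever consumer is furthest below its share-weighted entitlement:

```python
def allocate(total_slots, demands, shares):
    """Proportional-share allocation sketch: repeatedly grant one CPU slot
    to the eligible consumer with the lowest allocation-to-share ratio, so
    allocations converge toward the policy shares while never exceeding a
    consumer's actual demand."""
    alloc = {c: 0 for c in demands}
    for _ in range(total_slots):
        # Only consumers with unmet demand compete for the next slot.
        eligible = [c for c in demands if alloc[c] < demands[c]]
        if not eligible:
            break  # all demand satisfied; leave the rest of the slots free
        neediest = min(eligible, key=lambda c: alloc[c] / shares[c])
        alloc[neediest] += 1
    return alloc

# Two consumers with a 2:1 share policy competing for 10 slots:
print(allocate(10, {"eda": 8, "risk": 5}, {"eda": 2, "risk": 1}))
# → {'eda': 7, 'risk': 3}
```

Granting slot by slot keeps the policy work-conserving: if one consumer's demand is exhausted, its unused entitlement flows to the others instead of sitting idle.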
Khalid Ahmed is the Chief Architect and Director of Technology Research at Platform Computing. In over 12 years at Platform he has worked in a number of roles, including development, product management, and architecture. His work spans distributed scheduling, wide-area resource sharing, workload management, system automation, virtualization management, and high availability.