Seattle Conference on Scalability: CARMEN: A Scalable Science Cloud
CARMEN is a $9M project building a scalable science cloud. Its focus is on supporting neuroscientists, who will use it to store, share and analyze hundreds of terabytes of data. Understanding how the brain works is a major scientific challenge, and progress on it will benefit medicine, biology and computer science. Globally, over 100,000 neuroscientists are working on this problem. However, the data that forms the basis for their work is rarely shared, even though it is difficult and expensive to produce.
The CARMEN project (www.carmen.org.uk) is addressing these challenges by developing a scalable cloud architecture to enable data sharing, integration, and analysis supported by metadata. An expandable range of services is provided in the cloud to extract value from raw and transformed data. This promotes the sharing of analysis services as well as data, and allows services to execute close to the data on which they operate. This is essential to avoid having to ship vast quantities (TBs) of data out of the cloud to the user's machine for analysis. Internally, the CARMEN cloud is built as a set of Web Services. Through experience of a wide variety of e-science projects over the past 8 years, we have identified a core set of generic services that we believe are needed to support science. These services, with their scalability issues and novel features, are:
- Data repository. Most of the primary data is time series signal data. Searching for patterns (such as neuronal spikes) is a key requirement. CARMEN uses a novel parallel search infrastructure to find patterns quickly, even in vast quantities of data.
- Metadata repository. Users need to be able to quickly search metadata describing tens of thousands of datasets in order to locate data that is of interest. Ontologies are used to structure experimental metadata, and techniques are needed to quickly search this type of data.
- Service repository and dynamic deployment. A novel feature of the architecture is that the analysis services are stored in a repository in the cloud. Users can write services in a variety of languages, package them as web services and then upload them into the cloud. These are then dynamically deployed on compute nodes as required to meet user requests.
- Workflow Enactment Engine. Users can build workflows from the available services in order to orchestrate the entire process of analysis. These are then executed in the cloud.
- Security. Scientists wish to control precisely who has access to their data and services. The security service enforces these access controls.
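The parallel pattern search mentioned under the data repository can be illustrated with a minimal Python sketch. It is not CARMEN's actual search infrastructure, which the abstract does not detail: it assumes a spike is simply an upward crossing of a fixed threshold, and the names `find_spikes`, `parallel_find_spikes`, `chunk_size` and `workers` are hypothetical. The signal is split into chunks that are scanned independently, with each chunk overlapping its predecessor by one sample so that crossings on a chunk boundary are not missed.

```python
from concurrent.futures import ThreadPoolExecutor


def find_spikes(samples, threshold, offset=0):
    """Return absolute indices where the signal rises across the threshold.

    Assumption: a "spike" is an upward threshold crossing. Real spike
    detection is considerably more involved.
    """
    hits = []
    for i in range(1, len(samples)):
        if samples[i - 1] < threshold <= samples[i]:
            hits.append(offset + i)
    return hits


def parallel_find_spikes(signal, threshold, chunk_size=4, workers=4):
    """Scan fixed-size chunks of the signal concurrently and merge the hits."""
    tasks = []
    for start in range(0, len(signal), chunk_size):
        # Include one sample before the chunk so a crossing that lands
        # exactly on the chunk boundary is still detected.
        lo = max(start - 1, 0)
        tasks.append((signal[lo:start + chunk_size], lo))

    # Threads keep the sketch self-contained; a deployment at scale would
    # more plausibly distribute chunks across processes or compute nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda t: find_spikes(t[0], threshold, t[1]), tasks)
    return sorted(i for part in parts for i in part)


if __name__ == "__main__":
    signal = [0, 0, 5, 0, 0, 0, 0, 0, 6, 7, 0, 5]
    print(parallel_find_spikes(signal, threshold=3))
```

For the sample signal this prints `[2, 8, 11]`, the same indices a single sequential scan would find. Because each chunk is searched independently, the work can be placed next to the stored data, which matches the abstract's point about executing services close to the data rather than shipping it to the user.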
The talk will describe the design of the CARMEN system and show how it addresses the key scalability issues. It will cover the cloud services, explaining how each is designed to scale up to support thousands of users analysing TBs of data. We will present results from the CARMEN prototype to illustrate solutions and issues.
Speaker: Paul Watson
Paul Watson is Professor of Computer Science and Director of the North East Regional e-Science Centre. He graduated in 1983 with a BSc (I) in Computer Engineering from Manchester University, followed by a PhD in 1986. In the 1980s, as a Lecturer at Manchester University, he was a designer of the Alvey Flagship and Esprit EDS systems. From 1990 to 1995 he worked for ICL as a system designer of the Goldrush MegaServer parallel database server, which was released as a product in 1994. In August 1995 he moved to Newcastle University, where he has been an investigator on research projects worth over $20M. His research interests are in scalable information management, in particular parallel database systems and data-intensive e-science.
Slides for this talk are available at http://groups.google.com/group/seattle-scalability-conference
Google Tech Talks
June 14, 2008