Proceedings Paper

Proteus: agile ML elasticity through tiered reliability in dynamic resource markets

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3064176.3064182

Funding

  1. Intel as part of the Intel Science and Technology Center for Visual Cloud Systems (ISTC-VCS)
  2. National Science Foundation [CNS-1042537, CCF-1533858, CNS-1042543]
  3. DARPA Grant [FA87501220324]

Abstract

Many shared computing clusters let users run on excess idle resources at lower cost or priority, with the proviso that some or all of those resources may be taken away at any time. Exploiting such dynamic resource availability, and the often fluctuating markets for it, requires agile elasticity and effective acquisition strategies. Proteus aggressively exploits such transient, revocable resources to do machine learning (ML) cheaper and/or faster. Its parameter server framework, AgileML, efficiently adapts to bulk additions and revocations of transient machines through a novel 3-stage active-backup approach, with minimal use of more costly non-transient resources. Its BidBrain component adaptively allocates resources from multiple EC2 spot markets to minimize average cost per unit of work as transient resource availability and cost change over time. Our evaluations show that Proteus reduces cost by 85% relative to non-transient pricing and by 43% relative to previous approaches, while simultaneously reducing runtimes by up to 37%.
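
To make the idea of minimizing average cost per unit of work concrete, here is a minimal sketch in Python. It is not BidBrain's algorithm, which the abstract does not specify. The `SpotOffer` type, its fields, the `revoke_penalty` parameter, and the assumption that a revoked hour is billed in full but yields only part of its work are all hypothetical simplifications for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotOffer:
    """One candidate allocation (hypothetical model, not from the paper)."""
    market: str            # illustrative market label
    price_per_hour: float  # current price in dollars per hour
    work_per_hour: float   # estimated useful work units per hour
    revoke_prob: float     # assumed chance of revocation within the hour


def expected_cost_per_work(offer: SpotOffer, revoke_penalty: float = 0.5) -> float:
    """Expected dollars per unit of work for one hour on this offer.

    Assumption: a revoked hour is still billed in full but loses a
    `revoke_penalty` fraction of its work.
    """
    expected_work = offer.work_per_hour * (1.0 - offer.revoke_prob * revoke_penalty)
    if expected_work <= 0:
        return float("inf")  # an offer that produces no work is never chosen
    return offer.price_per_hour / expected_work


def pick_offer(offers):
    """Return the offer with the lowest expected cost per unit of work."""
    return min(offers, key=expected_cost_per_work)


if __name__ == "__main__":
    candidates = [
        SpotOffer("on-demand-like", 0.10, 10.0, 0.0),
        SpotOffer("cheap-but-volatile", 0.05, 10.0, 0.8),
        SpotOffer("large-instance", 0.20, 30.0, 0.1),
    ]
    best = pick_offer(candidates)
    print(best.market, round(expected_cost_per_work(best), 5))
```

Under this toy model the lowest hourly price does not automatically win: an offer's price is weighed against the work it is expected to complete, which is the trade-off the abstract attributes to BidBrain.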
