Article

Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations

Journal

IEEE/ACM Transactions on Networking
Volume 20, Issue 5, Pages 1466-1478

Publisher

IEEE (Institute of Electrical and Electronics Engineers Inc.)
DOI: 10.1109/TNET.2011.2181864

Keywords

Combinatorial network optimization; multi-armed bandits (MABs); online learning

Funding

  1. U.S. Army Research Laboratory, Network Science Collaborative Technology Alliance [W911NF-09-2-0053]
  2. U.S. National Science Foundation [CNS-1049541]
  3. AFOSR [FA9550-10-1-0307]
  4. NSF, Directorate for Computer & Information Science & Engineering, Division of Computer and Network Systems [CNS-0954116]
  5. NSF, Directorate for Computer & Information Science & Engineering, Division of Information & Intelligent Systems [0917410]

Abstract

We formulate the following combinatorial multi-armed bandit (MAB) problem: there are N random variables with unknown means, each instantiated in an i.i.d. fashion over time. At each time, multiple random variables can be selected, subject to an arbitrary constraint on the weights associated with the selected variables. All of the selected individual random variables are observed at that time, and the reward is a linearly weighted combination of the selected variables. The goal is to find a policy that minimizes regret, defined as the difference between the reward obtained by a genie that knows the mean of each random variable and the reward obtained by the given policy. This formulation is broadly applicable to stochastic online versions of many network tasks that can be posed as tractable combinatorial optimization problems with linear objective functions, such as maximum weighted matching, shortest path, and minimum spanning tree computations. Prior work on multi-armed bandits with multiple plays cannot be applied to this formulation because of the general nature of the constraint. On the other hand, mapping every feasible combination to its own arm allows prior work on single-play MABs to be used, but results in regret, storage, and computation that grow exponentially in the number of unknown variables. We present new efficient policies for this problem that achieve regret growing logarithmically with time and polynomially in the number of unknown variables. Furthermore, these policies require only storage that grows linearly in the number of unknown parameters. When the underlying deterministic problem is tractable, these policies further require only polynomial computation. For computationally intractable problems, we also present results on a different notion of regret that is suitable when a polynomial-time approximation algorithm is used.
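A minimal Python sketch of such an index policy may help fix ideas: each unknown variable keeps an empirical mean and a UCB-style confidence width, and each round the deterministic combinatorial problem is solved on those optimistic per-variable indices. The function name, the unit weights, the (L+1) ln t confidence width, and the toy "choose 2 of 4" constraint solved by enumeration are illustrative assumptions for this sketch, not the paper's exact policy or guarantees.

    import math
    import random
    from itertools import combinations

    def linear_reward_ucb(n_vars, feasible_actions, sample, horizon):
        """Index-policy sketch: per-variable UCB indices, combinatorial argmax.

        feasible_actions: list of index tuples encoding the constraint.
        sample(i): draws one observation of variable i (means unknown to us).
        """
        L = max(len(a) for a in feasible_actions)  # max variables per action
        counts = [0] * n_vars                      # m_i: observations of variable i
        means = [0.0] * n_vars                     # empirical mean of variable i
        total = 0.0

        for t in range(1, horizon + 1):
            # Optimistic index: empirical mean plus a confidence width;
            # unobserved variables get an infinite index so they are tried first.
            index = [means[i] + math.sqrt((L + 1) * math.log(t) / counts[i])
                     if counts[i] else float("inf") for i in range(n_vars)]
            # Solve the deterministic problem on the indices. Enumeration here;
            # a polynomial-time solver (matching, shortest path, MST) would be
            # plugged in when the underlying problem is tractable.
            action = max(feasible_actions, key=lambda a: sum(index[i] for i in a))

            # Individual observations: every selected variable is seen and updated.
            for i in action:
                x = sample(i)
                counts[i] += 1
                means[i] += (x - means[i]) / counts[i]
                total += x
        return total

    # Toy instance: select any 2 of 4 Bernoulli variables (unit weights).
    if __name__ == "__main__":
        mu = [0.2, 0.5, 0.7, 0.9]
        actions = list(combinations(range(4), 2))
        horizon = 5000
        reward = linear_reward_ucb(4, actions,
                                   lambda i: float(random.random() < mu[i]), horizon)
        print("average per-round reward:", reward / horizon)

Because every selected variable is observed individually, a single round updates up to L of the N estimates at once, which is what allows regret and storage to scale polynomially in N rather than in the (possibly exponential) number of feasible actions.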

Authors

Yi Gai, Bhaskar Krishnamachari, and Rahul Jain
