Stochastic Bandits for Egalitarian Assignment
We study a stochastic multi-armed bandit problem in which an agent assigns users to arms so as to maximize the minimum expected cumulative reward across users. We present a UCB-based policy, establish upper bounds on its cumulative regret, and complement them with a policy-independent impossibility result.
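To make the setting concrete, the following is a minimal Python sketch of a UCB-style assignment loop. It is not the policy analyzed in this work. It assumes each user receives a distinct arm in every round, rewards are Bernoulli, and the user with the lowest cumulative reward so far is given the arm with the highest index. The function names (ucb_indices, assign, simulate) and the UCB1-style confidence bonus are hypothetical choices made only for illustration.

```python
import math
import random


def ucb_indices(counts, means, t):
    """UCB1-style index per arm; unplayed arms get +inf so they are tried first."""
    return [
        float("inf") if n == 0 else m + math.sqrt(2.0 * math.log(t) / n)
        for n, m in zip(counts, means)
    ]


def assign(indices, user_totals):
    """Map each user to a distinct arm (illustrative rule, an assumption).

    The user with the lowest cumulative reward gets the highest-index arm.
    """
    num_users = len(user_totals)
    top_arms = sorted(range(len(indices)), key=lambda a: indices[a], reverse=True)
    top_arms = top_arms[:num_users]
    poorest_first = sorted(range(num_users), key=lambda u: user_totals[u])
    return dict(zip(poorest_first, top_arms))


def simulate(arm_means, num_users, horizon, seed=0):
    """Run the sketch on Bernoulli arms and return each user's cumulative reward."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts, means = [0] * k, [0.0] * k
    totals = [0.0] * num_users
    for t in range(1, horizon + 1):
        assignment = assign(ucb_indices(counts, means, t), totals)
        for user, arm in assignment.items():
            reward = 1.0 if rng.random() < arm_means[arm] else 0.0
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]  # running average
            totals[user] += reward
    return totals


if __name__ == "__main__":
    totals = simulate([0.9, 0.8, 0.5, 0.2], num_users=2, horizon=2000)
    print("per-user totals:", totals, "minimum:", min(totals))
```

The quantity of interest in this sketch is min(totals), the cumulative reward of the worst-off user, rather than the sum over users.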