This thesis studies the online decision problem, which is set up as follows. A player repeatedly decides which action to take in an environment and then receives information about the effect of that action. The player would like an online algorithm that learns from the past history and makes better decisions later on. We use the regret to measure the quality of an algorithm. Recent research has focused on online optimization (a variant of the classical optimization problem) and the expert problem (in which a finite number of experts are available to choose from), because these models capture many real-world challenges. This thesis reports our contributions in this direction.

First, we study gradually changing environments. We define a new measure, called the deviation, to quantify how much the environment changes. In this new model, we design our algorithm by modifying the widely studied FTRL algorithm and show that its performance can be expressed as a function of the deviation; consequently, in slowly changing environments our algorithm outperforms previous ones.

Next, we study gradually changing environments under the assumption that the player receives only partial information. We want to know what kind of feedback still allows the player to perform as well as one who receives full information. We design a novel sampling scheme and use the obtained information to estimate the unknown gradient vectors. Our results show that knowing the values of the loss function at only two points per round suffices for our algorithm to perform as well as a player with full information.

In the third part, we allow the player to actively query for the partial information she wants. We propose a new model in which the player may query the information she needs, up to a certain budget. This model generalizes previous ones in which no active query is allowed. Under this new model, we design nearly optimal algorithms and express their performance as a function of the player's query budget.

Finally, in the fourth part, we study the contextual bandit problem. It differs from the traditional bandit problem in that the player can obtain additional information before choosing an action. Since having full information allows an algorithm to perform better, we define a new kind of information, called the pseudo-reward, and use it to estimate the unknown information and help the algorithm learn. We design and analyze a new algorithm called LINPRUCB, which is an extension of the state-of-the-art LINUCB algorithm.
We study the online decision problem, in which a player repeatedly chooses an action and then receives some loss information from the environment over a number of rounds. The player would like an online algorithm that learns from past experience, makes better decisions as time goes by, and achieves small regret, defined as the difference between the total loss of the algorithm and that of the best fixed action. Recently, a line of research has considered variants of the Online Convex Optimization problem (OCO, a variant of the classical convex optimization problem) and the Prediction with Expert Advice problem (PEA, an online decision problem in which the set of actions is finite) that can be used to model more realistic challenges. In this thesis, we present our contributions in this direction.

In the first part, we study gradually changing environments, which arise in many everyday scenarios. To describe such an environment, we define a new notion, called the deviation, which measures the total difference between consecutive loss functions. We show that a modification of the well-understood Follow-the-Regularized-Leader (FTRL) algorithm achieves regret bounded in terms of the deviation, thereby implying small regret in environments with small deviation.

In the second part, we ask the same question in the partial-information setting. Our goal is to determine how much feedback is needed, when the deviation constraint is in effect, to obtain regret bounds close to those of our previous study [28]. We design a novel sampling scheme to estimate the unknown gradient information required by the algorithms of [28], and construct two-point bandit algorithms whose regret bounds are close to those obtained in [28] under the full-information setting.

In the third part, we consider active players, since in some scenarios the player can spend effort or resources to collect intentionally selected information about the loss functions. We describe a new model in which, before making a decision in each round, the player may actively issue a limited number of queries to obtain the loss information she needs. This model generalizes previous models in which no query may be issued before a decision is made. We design an algorithm whose regret is a function of the number of bits that can be queried in one round, and we provide lower bounds showing that this upper bound in general cannot be improved.

In the last part, we study the contextual bandit problem, which differs from the traditional bandit problem in that the learner can access additional information about the environment (i.e., the context) before making a selection. Motivated by the better regret bounds available in the full-information setting, we introduce a new notion called the pseudo-reward, which provides guesses about unseen rewards, and develop a forgetting mechanism to handle fallacious pseudo-rewards computed in the early rounds. Combining the pseudo-rewards with the forgetting mechanism, we propose and analyze a new algorithm called LINPRUCB, an extension of the state-of-the-art LINUCB algorithm.
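For concreteness, the two quantities at the heart of the first two parts can be written as follows. The notation ($f_t$ for the loss function of round $t$, $x_t$ for the action played, $\mathcal{K}$ for the decision set, $T$ for the number of rounds) is ours, and the thesis's exact definition of the deviation may differ, for example by measuring the change of function values rather than of gradients:
\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x),
\qquad
D_T \;=\; \sum_{t=2}^{T} \max_{x \in \mathcal{K}} \bigl\lVert \nabla f_t(x) - \nabla f_{t-1}(x) \bigr\rVert^{2}.
\]
A small $D_T$ formalizes a gradually changing environment, and the modified FTRL algorithm's regret is bounded by a function of $D_T$ rather than of $T$ alone.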
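The two-point feedback of the second part supports a standard gradient estimate of the following form. This is a sketch in our own notation (a uniformly random unit vector $u_t$, an exploration radius $\delta > 0$, dimension $n$); the exact sampling scheme analyzed in the thesis may differ:
\[
\hat{g}_t \;=\; \frac{n}{2\delta}\,\bigl( f_t(x_t + \delta u_t) - f_t(x_t - \delta u_t) \bigr)\, u_t ,
\]
which in expectation equals the gradient of a smoothed version of $f_t$ and can therefore stand in for the true gradient $\nabla f_t(x_t)$ required by the full-information algorithms.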
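For the last part, the baseline LINUCB selects, given a context vector $x_{t,a}$ for each arm $a$, the arm that maximizes an optimistic estimate of the linear reward. A standard form of its selection rule is shown below (the notation is ours, and the pseudo-rewards and forgetting mechanism of LINPRUCB are modifications beyond what is shown here):
\[
a_t \;=\; \arg\max_{a}\; x_{t,a}^{\top}\hat{\theta}_{a} \;+\; \alpha \sqrt{ x_{t,a}^{\top} A_{a}^{-1} x_{t,a} },
\qquad
\hat{\theta}_{a} = A_{a}^{-1} b_{a},
\]
where $A_a$ and $b_a$ accumulate the observed contexts and rewards of arm $a$, and $\alpha$ controls the width of the confidence bound.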