DSpace at EWHA: An Efficient Simulation-Based Policy Improvement with Optimal Computing Budget Allocation Based on Accumulated Samples

Title: An Efficient Simulation-Based Policy Improvement with Optimal Computing Budget Allocation Based on Accumulated Samples

Authors: Huang X.; Choi S.H.

Ewha Authors: 최선한

SCOPUS Author ID: 최선한

Issue Date: 2022

Journal Title: Electronics (Switzerland)

ISSN: 2079-9292

Citation: Electronics (Switzerland) vol. 11, no. 7

Keywords: Markov decision process; optimal computing budget allocation; simulation-based policy improvement; stochastic system optimization

Publisher: MDPI

Indexed: SCIE; SCOPUS

Document Type: Article

Abstract: Markov decision processes (MDPs) are widely used to model stochastic systems and to derive optimal decision-making policies. Because the state transition probabilities of an MDP are usually unknown, simulation-based policy improvement (SBPI), which uses a base policy to derive optimal policies without knowing those probabilities, has been suggested. However, estimating the Q-value of each action to determine the best action in each state requires many simulations, which makes SBPI inefficient. In this study, we propose a method that improves the overall efficiency of SBPI using optimal computing budget allocation (OCBA) based on accumulated samples. Previous works have mainly focused on improving SBPI efficiency for a single state, without reusing previous simulation samples. In contrast, the proposed method takes the state traversal property of SBPI into account and improves the overall efficiency up to the point at which an optimal policy is found. The proposed method accumulates simulation samples across states to estimate the unknown transition probabilities. These probabilities are then used to estimate the mean and variance of the Q-value for each action, which allows OCBA to allocate the simulation budget efficiently to find the best action in each state. As SBPI traverses the states, the accumulated samples allow OCBA to allocate the budget appropriately; thus, the optimal policy can be obtained with a smaller budget. The experimental results demonstrate the improved efficiency of the proposed method compared to previous works. © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
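
Illustrative sketch (not from the paper): the two ingredients named in the abstract can be outlined in Python as follows. The first function estimates transition probabilities by counting accumulated (state, action, next state) samples. The second applies the classical OCBA allocation rule to given Q-value means and standard deviations. The function names `empirical_transitions` and `ocba_fractions` are hypothetical, and the abstract does not say how the authors derive the Q-value mean and variance from the estimated probabilities, so that step is left out. This should be read as a minimal sketch under those assumptions, not as the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def empirical_transitions(samples):
    """Estimate P(s' | s, a) by counting accumulated (s, a, s') samples."""
    counts = defaultdict(Counter)
    for s, a, s_next in samples:
        counts[(s, a)][s_next] += 1
    return {
        key: {s_next: n / sum(c.values()) for s_next, n in c.items()}
        for key, c in counts.items()
    }


def ocba_fractions(means, stds):
    """Classical OCBA budget fractions for selecting the largest mean.

    Assumes every standard deviation is positive and the current best
    action is unique (otherwise a gap of zero causes a division by zero).
    """
    k = len(means)
    best = max(range(k), key=lambda i: means[i])
    weights = [0.0] * k
    # Non-best actions: weight proportional to (sigma_i / delta_i)^2,
    # where delta_i is the gap between the best mean and mean i.
    for i in range(k):
        if i != best:
            delta = means[best] - means[i]
            weights[i] = (stds[i] / delta) ** 2
    # Best action: sigma_b * sqrt(sum over non-best of (N_i / sigma_i)^2).
    weights[best] = stds[best] * math.sqrt(
        sum((weights[i] / stds[i]) ** 2 for i in range(k) if i != best)
    )
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical Q-value estimates for three actions in one state.
    print(ocba_fractions([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

In this rule, actions whose estimated Q-values are close to the current best, or whose estimates are noisy, receive a larger share of the next batch of simulations, which is the behavior the abstract relies on to reduce the total budget.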