View : 357 Download: 0

An Efficient Simulation-Based Policy Improvement with Optimal Computing Budget Allocation Based on Accumulated Samples

Title
An Efficient Simulation-Based Policy Improvement with Optimal Computing Budget Allocation Based on Accumulated Samples
Authors
Huang X.Choi S.H.
Ewha Authors
최선한
SCOPUS Author ID
최선한scopus
Issue Date
2022
Journal Title
Electronics (Switzerland)
ISSN
2079-9292JCR Link
Citation
Electronics (Switzerland) vol. 11, no. 7
Keywords
Markov decision processoptimal computing budget allocationsimulation-based policy improvementstochastic system optimization
Publisher
MDPI
Indexed
SCIE; SCOPUS WOS scopus
Document Type
Article
Abstract
Markov decision processes (MDPs) are widely used to model stochastic systems to deduce optimal decision-making policies. As the transition probabilities are usually unknown in MDPs, simulation-based policy improvement (SBPI) using a base policy to derive optimal policies when the state transition probabilities are unknown is suggested. However, estimating the Q-value of each action to determine the best action in each state requires many simulations, which results in efficiency problems for SBPI. In this study, we propose a method to improve the overall efficiency of SBPI using optimal computing budget allocation (OCBA) based on accumulated samples. Previous works have mainly focused on improving SBPI efficiency for a single state and without using the previous simulation samples. In contrast, the proposed method improves the overall efficiency until an optimal policy can be found in consideration of the state traversal property of the SBPI. The proposed method accumulates simulation samples across states to estimate the unknown transition probabilities. These probabilities are then used to estimate the mean and variance of the Q-value for each action, which allows the OCBA to allocate the simulation budget efficiently to find the best action in each state. As the SBPI traverses the state, the accumulated samples allow appropriate allocation of OCBA; thus, the optimal policy can be obtained with a lower budget. The experimental results demonstrate the improved efficiency of the proposed method compared to previous works. © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
DOI
10.3390/electronics11071141
Appears in Collections:
공과대학 > 전자전기공학전공 > Journal papers
Files in This Item:
There are no files associated with this item.
Export
RIS (EndNote)
XLS (Excel)
XML


qrcode

BROWSE