DSpace at EWHA: A study on the adjustment method of reinforcement learning agents in inventory management problems with action constraints

Title: 행동 제약이 존재하는 재고관리문제에서 강화학습 에이전트의 조정방안 연구

Other Titles: A study on the adjustment method of reinforcement learning agents in inventory management problems with action constraints

Authors: 김지헌

Issue Date: 2022

Department/Major: Graduate School, Interdisciplinary Program of Big Data Analytics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 민대기

Abstract: Although reinforcement learning has been applied to inventory management problems, the only study that considers budget constraints in multi-item inventory management is Park Na-hee, LAU & Min Dae-ki (2021). They introduced the OptLayer method, which solves a quadratic program for the budget constraint during training of the reinforcement learning model and uses the result to update the Q-table. Because optimization is performed during training, however, the computational complexity and computation time of that model are expected to grow as the number of agents and products increases, since Q-Learning computes values over the entire state and action space. In addition, the model must be trained separately for each budget, which lengthens the overall experiment time. This thesis extends the concept of the previous study and proposes a reinforcement learning model based on DLCD (decentralized learning & centralized decision making), in which learning is carried out separately and an integrated adjustment for the budget constraint is applied afterwards. In the proposed model, each retailer learns its own inventory policy through Q-Learning. The supplier, acting as the central decision-maker, uses quadratic programming to optimize with respect to the budget constraint and adjusts the supply quantities. Because the constraint calculations are performed sequentially, computation time is expected to be lower than in the previous model. A total of 12 experiments were conducted by applying 4 budget sizes to each of 3 cases that differ in the unit inventory holding cost and unit backlog cost. Compared with the comparison model, the proposed model ordered less, which reduced the order cost, inventory holding cost, and penalty cost but increased the backlog cost. The increase in backlog cost was large enough that the total cost increased. The total cost incurred during product distribution and the penalty cost incurred by exceeding the budget are in a trade-off, so they cannot be compared directly. Each experimental result is therefore placed on a coordinate plane whose axes are the total cost and the order amount, and its distance from the origin is defined as the cost distance. In all 12 experiments, the proposed model had a total cost greater than its penalty cost but the smallest cost distance. The cost distance was also smaller for smaller budget sizes. Consistent with the research hypothesis, the proposed model saved training time compared with the previous model and, once trained, could be evaluated in real time for multiple budgets. This thesis differs from earlier studies in that it considers budget constraints in a multi-item inventory management problem solved with reinforcement learning. Its contribution is that, compared with the previous model, which required training for every budget, it effectively reduces the overall experiment time while showing stronger performance.
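
The centralized adjustment step and the cost distance described in the abstract can be illustrated with a minimal sketch. It assumes the supplier's quadratic program minimizes the squared deviation from the retailers' proposed order quantities subject to a single budget constraint and non-negativity, and that the cost distance is the unnormalized Euclidean distance from the origin; the abstract confirms neither detail. The names `adjust_orders`, `cost_distance`, `proposed`, `prices`, and `budget` are hypothetical, and bisection on the Lagrange multiplier stands in here for a general QP solver.

```python
import math


def adjust_orders(proposed, prices, budget, iters=100):
    """Adjust proposed order quantities so that total spend fits the budget.

    Assumed formulation (not taken from the thesis):
        minimize   sum((x_i - proposed_i) ** 2)
        subject to sum(prices_i * x_i) <= budget,  x_i >= 0
    Inputs are assumed to satisfy proposed_i >= 0, prices_i > 0, budget >= 0.
    """
    spend = sum(p * q for p, q in zip(prices, proposed))
    if spend <= budget:
        # Budget constraint is inactive: keep the learned orders unchanged.
        return list(proposed)

    def clipped(lam):
        # KKT form of the solution: x_i = max(0, proposed_i - lam * prices_i).
        return [max(0.0, q - lam * p) for p, q in zip(prices, proposed)]

    # Spend is continuous and non-increasing in lam; it is zero at the upper
    # bound, so a multiplier that meets the budget lies in [lo, hi].
    lo, hi = 0.0, max(q / p for p, q in zip(prices, proposed))
    for _ in range(iters):
        mid = (lo + hi) / 2
        cost = sum(p * x for p, x in zip(prices, clipped(mid)))
        if cost > budget:
            lo = mid
        else:
            hi = mid
    # hi always satisfies the budget, so the returned orders are feasible.
    return clipped(hi)


def cost_distance(total_cost, order_amount):
    """Distance from the origin on the (total cost, order amount) plane.

    Assumes raw, unnormalized axes; any scaling used in the thesis is unknown.
    """
    return math.hypot(total_cost, order_amount)


if __name__ == "__main__":
    # Two retailers propose 10 and 20 units at unit price 1; the budget is 20.
    adjusted = adjust_orders([10.0, 20.0], [1.0, 1.0], 20.0)
    print(adjusted)                  # approximately [5.0, 15.0]
    print(cost_distance(3.0, 4.0))   # 5.0
```

Because the adjustment takes the budget as an argument and does not touch the learned policies, the same trained agents can be re-evaluated under a different budget by calling `adjust_orders` again, which matches the abstract's point that multiple budgets can be tested after a single round of training.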