Journal of Advanced Computational Intelligence and Intelligent Informatics 28(2) 403-412 March 20, 2024
In recent years, inverse reinforcement learning has attracted attention as a method for estimating the intention of actions using the trajectories of various action-taking agents, including human flow data. In the context of reinforcement learning, “intention” refers to a reward function. Conventional inverse reinforcement learning assumes that all trajectories are generated from policies learned under a single reward function. However, it is natural to assume that people in a human flow act according to multiple policies. In this study, we introduce an expectation-maximization algorithm to inverse reinforcement learning, and propose a method to estimate different reward functions from the trajectories of human flow. The effectiveness of the proposed method was evaluated through a computer experiment based on human flow data collected from subjects around airport gates.
Journal of Advanced Computational Intelligence and Intelligent Informatics 28(2) 393-402 March 20, 2024
Sequential decision-making under multiple objective functions involves two problems: exhaustively searching for Pareto-optimal policies, and selecting a policy from the resulting set of Pareto-optimal policies based on the decision maker’s preferences. This paper focuses on the latter problem. To select a policy that reflects the decision maker’s preferences, the policies must be ordered, which is problematic because those preferences are generally tacit knowledge and are difficult to order quantitatively. For this reason, conventional methods have mainly elicited preferences through dialogue with decision makers and through pairwise comparisons. In contrast, this paper proposes a method based on inverse reinforcement learning that estimates the weight of each objective from the decision-making sequence. The estimated weights can be used to quantitatively evaluate the Pareto-optimal policies from the viewpoint of the decision maker’s preferences. We applied the proposed method to a multi-objective reinforcement learning benchmark problem and verified its effectiveness as a method for eliciting the weight of each objective function.
Electrical Engineering in Japan 214(2) January 2021 Peer-reviewed, Corresponding author
The effective utilization of regenerative power generated by trains has attracted the attention of engineers because of its potential for energy conservation in electrified railways. Charge control using wayside batteries is an effective method of utilizing this regenerative power. Wayside batteries are required to save energy using the minimum storage capacity of the energy storage devices. However, because current control policies are rule‐based and rely on human empirical knowledge, it is difficult to define rules that appropriately account for the battery's state of charge. Therefore, in this paper, we introduce reinforcement learning with an actor‐critic algorithm to acquire an effective control policy, which has previously been difficult to derive as rules from experts’ knowledge. The proposed algorithm, which can autonomously learn the control policy, stabilizes the balance of power supply and demand. Through several computational simulations, we demonstrate that the proposed method outperforms existing ones.
Proceedings of the Tenth International Workshop on Agents in Traffic and Transportation (ATT 2018) co-located with the Federated Artificial Intelligence Meeting, including ECAI/IJCAI, AAMAS and ICML 2018 conferences (FAIM 2018), Stockholm, Sweden, Ju 63-69 2018 Peer-reviewed
Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART 2018, Volume 2, Funchal, Madeira, Portugal, January 16-18, 2018. 276-283 2018 Peer-reviewed, Corresponding author
In this paper, we introduce an intelligent vehicle into traffic flow in which phantom traffic jams occur, in order to ensure traffic-flow stability. The intelligent vehicle shares information on the speed and gap of the leading vehicle. Furthermore, the intelligent vehicle can foresee changes in the leading vehicles through the shared information and can start accelerating sooner than human-driven vehicles can. We propose an intelligent-vehicle model, a generalized Nagel-Schreckenberg model in which both the number of leading vehicles to share information with and the maximum distance of inter-vehicle communication can be set arbitrarily. We found that phantom traffic jams are suppressed by an intelligent vehicle that can share information with two or more vehicles in front and receive information from at least 30 meters away.
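For orientation, the standard Nagel-Schreckenberg update that this model generalizes can be sketched as below. This is a minimal sketch of the textbook single-lane rule on a ring road only, not the authors' intelligent-vehicle model: it contains no information sharing, and the function name `nasch_step` and the parameters `v_max` and `p_slow` are illustrative assumptions.

```python
import random

def nasch_step(positions, speeds, road_length, v_max=5, p_slow=0.3, rng=random):
    """One parallel update of the standard Nagel-Schreckenberg rule on a ring road.

    positions: cell index of each car, listed in driving order around the ring.
    speeds: current speed of each car, in cells per time step.
    All names and default values here are illustrative assumptions.
    """
    n = len(positions)
    new_speeds = []
    for i in range(n):
        # Number of empty cells between this car and the car ahead.
        leader = positions[(i + 1) % n]
        gap = (leader - positions[i] - 1) % road_length
        v = min(speeds[i] + 1, v_max)  # 1. accelerate toward the speed limit
        v = min(v, gap)                # 2. brake so as not to hit the leader
        if v > 0 and rng.random() < p_slow:
            v -= 1                     # 3. random slowdown (human imperfection)
        new_speeds.append(v)
    # 4. move all cars simultaneously
    new_positions = [(x + v) % road_length for x, v in zip(positions, new_speeds)]
    return new_positions, new_speeds
```

The random slowdown in step 3 is what allows jams to appear without any external cause, which is why a vehicle that reacts to information from several leaders could plausibly damp them.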
2016 IEEE INTERNATIONAL CONFERENCE ON AGENTS (IEEE ICA 2016) 110-111 2016 Peer-reviewed
Multi-Objective Reinforcement Learning (MORL) can be divided into two approaches according to the number of acquired policies. One approach learns a single policy that leads the agent to a single arbitrary Pareto-optimal solution, and the other approach learns multiple policies, one for each Pareto-optimal solution. The latter approach finds the multiple policies simultaneously; however, it incurs significant computational cost. In many real-world cases, learning a single solution is sufficient in the multi-objective context. In this paper, we focus on the former approach, in which a suitable weight for each objective must be defined. To estimate the weight of each objective as a parameter, we utilize Q-values on the expert's trajectory, which indicates the optimal sequence of actions. This approach is analogous to apprenticeship learning via inverse reinforcement learning. We evaluate the proposed method using a well-known MORL benchmark problem, i.e., the Deep Sea Treasure environment.
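The role of the weights in the single-policy approach can be illustrated with linear scalarization of vector-valued Q-values. The sketch below is an assumption-laden illustration, not the estimation procedure of the paper: the helper names (`scalarize`, `greedy_action`, `agreement`) and the Q-values are hypothetical. It only shows how a candidate weight vector can be checked against an expert's state-action pairs.

```python
def scalarize(q_vector, weights):
    """Weighted sum of a vector-valued Q-value (linear scalarization)."""
    return sum(w * q for w, q in zip(weights, q_vector))

def greedy_action(q_by_action, weights):
    """Action with the highest scalarized Q-value in one state."""
    return max(q_by_action, key=lambda a: scalarize(q_by_action[a], weights))

def agreement(q_table, expert_trajectory, weights):
    """Fraction of expert (state, action) pairs that are greedy under `weights`."""
    hits = sum(greedy_action(q_table[s], weights) == a for s, a in expert_trajectory)
    return hits / len(expert_trajectory)

# Hypothetical two-objective Q-values: (treasure value, time penalty).
q_table = {"s0": {"left": (1.0, -1.0), "right": (5.0, -10.0)}}
expert = [("s0", "left")]

print(agreement(q_table, expert, (0.5, 0.5)))  # expert action is greedy here
print(agreement(q_table, expert, (1.0, 0.0)))  # expert action is not greedy here
```

Weights that score high on such an agreement check are candidates for explaining the expert's behavior; how the paper actually searches for them is not stated in the abstract.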
In this paper, we introduce an intelligent vehicle into traffic flow in which phantom traffic jams occur, in order to ensure traffic-flow stability. The intelligent vehicle shares information on the speed and gap of the leading vehicle. Furthermore, the intelligent vehicle can foresee changes in the leading vehicles through the shared information and can start accelerating sooner than human-driven vehicles can. We propose an intelligent-vehicle model, a generalized Nagel-Schreckenberg model that allows information to be shared with leading vehicles. In this model, the number of leading vehicles to share information with can be set arbitrarily, and we found that phantom traffic jams are resolved by an intelligent vehicle that shares information with two or more vehicles in front.
IEEJ Transactions on Electronics, Information and Systems 134(9) 1310-1317 2014 Peer-reviewed
In this paper, we propose a method to mitigate the state-space explosion problem in multiagent reinforcement learning, where each agent needs to observe the other agents' states and previous actions at each step of its learning process. In this setting, the numbers of states and actions grow exponentially with the number of agents, leading to an enormous amount of computation and very slow learning. In our method, an agent considers other agents' states only when the agents interfere with one another in reaching their goals. Our idea is that each agent starts with a state space that does not include information about the others. The agents then automatically expand and refine their state spaces when they detect interference. We adopt the information-theoretic measure of entropy to detect the interference situations in which agents should take the other agents into account. We demonstrate the advantage of our method in terms of global convergence and time efficiency.
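The entropy measure mentioned here can be illustrated with the Shannon entropy of an empirical distribution. The abstract does not say which quantity the entropy is computed over, so the sketch below makes an assumption: it treats the outcomes an agent observes after taking the same action in the same local state as the samples, with high entropy hinting that something outside the local state (such as another agent) is influencing the result. The function name `shannon_entropy` is illustrative.

```python
import math
from collections import Counter

def shannon_entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical outcomes of one (state, action) pair observed by an agent.
undisturbed = ["moved", "moved", "moved", "moved"]      # always the same outcome
disturbed = ["moved", "blocked", "moved", "blocked"]    # outcome varies

print(shannon_entropy(undisturbed))  # zero: outcome is fully predictable
print(shannon_entropy(disturbed))    # one bit: two equally likely outcomes
```

Under this assumption, an agent would expand its state space only for those local states whose outcome entropy exceeds some threshold, which is a design choice of this illustration rather than a detail given in the abstract.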
Although a large number of reinforcement learning algorithms have been proposed for generating cooperative behaviors, the question of how to evaluate mutual benefit or loss among agents is still open. As far as we know, an emergent behavior is regarded as cooperative when the embedded agents have finally achieved their global goal, regardless of whether mutual interference had any effect during each agent's learning process. Thus, harmful interactions on the way to a fully converged policy cannot be detected. In this paper, we propose a measure based on information theory for evaluating the degree of interaction during the learning process from the viewpoint of information sharing. To discuss the adverse effects of concurrent learning, we apply the proposed measure to a situation in which conflicts exist among the agents, and we demonstrate the usefulness of the measure.
INTELLIGENT AGENTS AND MULTI-AGENT SYSTEMS, PROCEEDINGS 5357 34-41 2008 Peer-reviewed
Although a large number of algorithms have been proposed for generating cooperative behaviors, the question of how to evaluate mutual benefit among agents is still open. This study provides a measure of the degree of cooperation among reinforcement learning agents. With the proposed measure, which is based on information theory, the degree of interaction among agents can be evaluated from the viewpoint of information sharing. We demonstrate the usefulness of this measure through experiments on the "pursuit game" and evaluate the degree of cooperation among hunters and prey.