Journal of Advanced Computational Intelligence and Intelligent Informatics 28(2) 403-412, March 20, 2024
In recent years, inverse reinforcement learning has attracted attention as a method for estimating the intention behind actions from the trajectories of various acting agents, including human flow data. In the context of reinforcement learning, “intention” refers to a reward function. Conventional inverse reinforcement learning assumes that all trajectories are generated from policies learned under a single reward function. However, it is natural to assume that people in a human flow act according to multiple policies. In this study, we introduce an expectation-maximization algorithm into inverse reinforcement learning and propose a method for estimating multiple distinct reward functions from human flow trajectories. The effectiveness of the proposed method was evaluated through a computer experiment based on human flow data collected from subjects around airport gates.
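As an illustration of the approach described in this abstract, the following is a minimal sketch of an EM loop that clusters trajectories under K reward functions. The helpers irl_fit (fits reward parameters to responsibility-weighted trajectories) and trajectory_loglik (log-likelihood of a trajectory under the policy induced by a reward) are hypothetical, caller-supplied placeholders, not the authors' implementation.

    import numpy as np

    def em_irl(trajectories, K, irl_fit, trajectory_loglik,
               n_iters=50, seed=0):
        """Estimate K reward functions from a mixed set of trajectories.

        irl_fit(trajs, weights)        -> reward parameters theta
        trajectory_loglik(traj, theta) -> log p(traj | theta)
        Both are caller-supplied, hypothetical components."""
        rng = np.random.default_rng(seed)
        N = len(trajectories)
        resp = rng.dirichlet(np.ones(K), size=N)   # soft cluster assignments
        mix = np.full(K, 1.0 / K)                  # mixing proportions
        thetas = [irl_fit(trajectories, resp[:, k]) for k in range(K)]
        for _ in range(n_iters):
            # E-step: responsibility of each reward function per trajectory
            loglik = np.array([[trajectory_loglik(t, th) for th in thetas]
                               for t in trajectories])       # shape (N, K)
            log_r = np.log(mix) + loglik
            log_r -= log_r.max(axis=1, keepdims=True)        # stabilize exp
            resp = np.exp(log_r)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: responsibility-weighted IRL per cluster
            mix = resp.mean(axis=0)
            thetas = [irl_fit(trajectories, resp[:, k]) for k in range(K)]
        return thetas, resp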
Journal of Advanced Computational Intelligence and Intelligent Informatics 28(2) 393-402, March 20, 2024
Sequential decision-making under multiple objective functions involves two problems: exhaustively searching for Pareto-optimal policies, and selecting a policy from the resulting set of Pareto-optimal policies according to the decision maker's preferences. This paper focuses on the latter problem. To select a policy that reflects the decision maker's preferences, these policies must be ordered, which is problematic because the decision maker's preferences are generally tacit knowledge and are difficult to order quantitatively. For this reason, conventional methods have mainly elicited preferences through dialogue with decision makers and through pairwise comparisons. In contrast, this paper proposes a method based on inverse reinforcement learning to estimate the weight of each objective from decision-making sequences. The estimated weights can be used to evaluate the Pareto-optimal policies quantitatively from the viewpoint of the decision maker's preferences. We applied the proposed method to a multi-objective reinforcement learning benchmark problem and verified its effectiveness as a method for eliciting the weights of the objective functions.
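For intuition, here is a minimal sketch of estimating scalarization weights by matching per-objective feature expectations, in the spirit of maximum-entropy IRL with a linear reward w · φ. The planner expected_features_fn (mapping weights to the expected per-objective returns of the induced optimal policy) is an assumed, caller-supplied component, and the update rule is illustrative, not the paper's exact procedure.

    import numpy as np

    def estimate_weights(demo_features, expected_features_fn,
                         n_objectives, lr=0.1, n_iters=100):
        """Estimate scalarization weights over the objectives.

        demo_features        : (n_objectives,) mean per-objective return of
                               the decision maker's demonstrated episodes
        expected_features_fn : w -> expected per-objective return of the
                               optimal policy under reward w . phi
                               (assumed, caller-supplied planner)"""
        w = np.full(n_objectives, 1.0 / n_objectives)
        for _ in range(n_iters):
            # Gradient for linear rewards: demonstrated minus model
            # per-objective feature expectations.
            w += lr * (np.asarray(demo_features) - expected_features_fn(w))
            w = np.clip(w, 0.0, None)
            w /= max(w.sum(), 1e-12)   # keep w a convex combination
        return w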
Many current traffic flow simulations adopt models that evaluate steady-state traffic flow and therefore have difficulty reproducing transient phenomena. However, such transient phenomena cannot be ignored when evaluating traffic policies that involve changes to the road network. In this study, we propose a new route-choice model that can handle both kinds of phenomena under a changing road network and implement it in a multi-agent traffic flow simulator. We first describe the Q-routing algorithm, based on the reinforcement learning framework, that serves as our basis, and then explain several improvements made to apply it to traffic flow simulation. We verify the basic behavior of the route-choice model with respect to congestion and signal control on an irregular grid network, and finally simulate an actual LRT extension plan in Okayama City, showing that small-scale congestion may occur after the extension. We also show that this congestion is temporary, arising from the bias of driving experience that drivers accumulated before the extension, and that the system reaches a steady state after sufficient time. This paper describes the impact of changes in a road network on drivers' behavior. In agent-based traffic simulations, agents typically choose the shortest route, whereas drivers in the real world choose their routes based on their own past experience. Since traffic simulations are used extensively in the evaluation and verification of traffic policy, accurate simulations reproducing real-world routing behavior are strongly demanded. We therefore develop a new reinforcement-learning-based routing algorithm and implement it in a traffic simulator, explaining several improvements made to Q-routing for traffic simulation. First, we perform a preliminary experiment using an irregular grid network with various loads and signal control, and obtain good performance, i.e., robustness of the improved Q-routing under heavy traffic with signal control. Second, the simulator with and without the improved Q-routing is applied to simulate the LRT extension project in Okayama City, and we evaluate transient traffic behaviors after the implementation of the plan. We observe a transient congestion phenomenon only in the simulation with the improved Q-routing. Through detailed analysis of the results, we find that such transient traffic congestion is caused by the bias of the drivers' past experience.
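As a sketch of the routing scheme the model builds on, the following shows a generic Q-routing table in which Q[(node, dest, nxt)] estimates the remaining travel time to dest when leaving node via nxt, updated from experienced link travel times. The class and parameter names are illustrative, and the paper's specific improvements for traffic simulation are not reproduced here.

    from collections import defaultdict

    class QRouter:
        """Generic Q-routing for route choice. Q[(node, dest, nxt)] estimates
        the remaining travel time to dest when leaving node via nxt."""

        def __init__(self, neighbors, alpha=0.1, q0=0.0):
            self.neighbors = neighbors        # node -> list of next nodes
            self.alpha = alpha                # learning rate
            self.q = defaultdict(lambda: q0)  # neutral initial estimates

        def choose(self, node, dest):
            # Greedily pick the neighbor with the smallest time estimate.
            return min(self.neighbors[node],
                       key=lambda y: self.q[(node, dest, y)])

        def update(self, node, dest, nxt, link_time):
            # Target = experienced link travel time + best downstream estimate.
            remaining = 0.0
            if nxt != dest:
                remaining = min((self.q[(nxt, dest, z)]
                                 for z in self.neighbors.get(nxt, [])),
                                default=0.0)
            key = (node, dest, nxt)
            self.q[key] += self.alpha * (link_time + remaining - self.q[key])

Because each driver updates only the estimates along routes actually driven, estimates learned on the old network persist after a network change, which is the experience bias the abstract identifies as the source of transient congestion.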
We built and evaluated a prototype user-based lecture recommendation system using collaborative filtering, and found that it needed improvements in comprehensiveness and precision, an improved user interface, and support for timetable constraints. To address these issues, we added item-based recommendation and improved the similarity measure. We also added a senior recommendation function, which recommends lectures taken by senior students similar to the user, and a collaborative recommendation function that takes the timetable into account. Furthermore, item-based correlation recommendation was used in combination to help maintain students' initiative, and the recommendation screen was changed to a timetable-based layout. Evaluation confirmed that the improved prototype achieved comprehensiveness and precision and maintained a high level of student initiative. A prototype lecture recommendation system based on user-based collaborative filtering had been completed, and its evaluation showed that improvements in comprehensiveness, precision, and the user interface were needed. To improve the system, item-based collaborative filtering and cosine similarity were introduced. The Senior Recommendation recommends lectures registered by senior students who are similar to the user. The Collaboration Recommendation, based on user-based collaborative filtering, and the Item Recommendation, based on item-based item-to-item collaborative filtering, were developed to improve the initiative of students. In an evaluation of the improved prototype with a timetable-based interface, it was confirmed that the system's performance had increased.
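As a sketch of the item-based part with cosine similarity, the following computes item-item similarities from a binary student-by-lecture registration matrix and scores unregistered lectures by similarity to those already registered. It is a generic illustration, not the system's actual code.

    import numpy as np

    def item_similarity(R):
        """Item-item cosine similarity from a binary student-by-lecture
        registration matrix R (rows: students, columns: lectures)."""
        norms = np.linalg.norm(R, axis=0, keepdims=True)
        norms[norms == 0] = 1.0           # avoid division by zero
        Rn = R / norms
        return Rn.T @ Rn

    def recommend(R, student, k=5):
        """Score each unregistered lecture by its summed similarity to the
        lectures the student has already registered."""
        scores = item_similarity(R) @ R[student]
        scores[R[student] > 0] = -np.inf  # exclude registered lectures
        return np.argsort(scores)[::-1][:k]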
In university organizations, although faculty and staff share the common goal of providing high-quality education, opportunities to share the information each can provide are limited, and it is difficult to accurately grasp the information students potentially need. This paper proposes a system in which the faculty, university librarians, and students who make up a university mutually complement the information they need and the information they can provide. The proposed system consists of two parts, a pathfinder and a folksonomy: the former is an information-providing part created by librarians and faculty and presented to students, and the latter is an information-transmitting part created by students based on lecture syllabi and directed at faculty and staff. The system is characterized by a mechanism in which, as faculty, staff, and students use it, the database becomes structured by purpose of use and enriched with additional data. Regarding the folksonomy as a needs-extraction function on the students' side that will be important for future university organizations, this paper focuses on its implementation. In a verification experiment, students added social tags and social links, and the extracted latent needs of students proved to be a valuable source of information for selecting the contents of pathfinders. We also show that integrating the folksonomy with the pathfinders already in practical use at Chiba University enables information to circulate between faculty, staff, and students and to be used effectively. On a university campus, faculty and staff share only a limited amount of the information they could mutually provide, even though they share the common goal of quality education for students; nor can they accurately perceive the kind of information students potentially need. This paper proposes and verifies a system in which all constituents of a university, including faculty, librarians, and students, can share and mutually supplement the information they need and the information they can provide. The proposed system consists of a “Pathfinder” component and a “Folksonomy” component, with the former targeting the student body and designed collaboratively by faculty and librarians, and the latter addressing faculty and staff on behalf of students. The system is characterized by a mechanism in which use by faculty and students structures the database according to their needs and eventually enriches it through sorting by purpose and the addition of data. The paper concentrates on the implementation of the Folksonomy component as a function for extracting needs from students, in the belief that it will become more important in the future. In the verification experiment, students were asked to tag data according to their preferences, and the resulting analysis shows that the revealed data structure can be an important input for selecting items for the pathfinders. We also demonstrate that integrating the creation of pathfinders with the folksonomy can potentially lead to more effective communication between faculty, staff, and students.
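A minimal sketch of the needs-extraction step described above: aggregating students' social tags per syllabus item and surfacing the most frequent ones as candidate latent needs. The record format and function name are assumptions for illustration, not the deployed system.

    from collections import Counter, defaultdict

    def extract_needs(tag_log, top_n=10):
        """Aggregate students' social tags per syllabus item and return the
        most frequent tags, a simple proxy for needs extraction.
        tag_log: iterable of (student_id, syllabus_item, tag) records."""
        per_item = defaultdict(Counter)
        for _student, item, tag in tag_log:
            per_item[item][tag] += 1
        return {item: counts.most_common(top_n)
                for item, counts in per_item.items()}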
In this paper, we discuss guidelines for the reward design problem, which defines when and what amount of reward should be given to the agent(s), within the context of the reinforcement learning approach. We take keepaway soccer as a standard task of the multiagent domain that requires skilled teamwork. The difficulties of designing a reward for this task stem from the following features: i) since it is a continuing task with no explicit goal to achieve, it is hard to tell when a reward should be given to the agent(s); ii) since it is a multiagent cooperative task, it is hard to decide a fair share of reward for each agent's contribution toward the goal. Through experiments, we show that the reward design has a major effect on the agents' behavior, and we introduce a reward function that makes the agents perform keepaway better and in a more interesting manner than the conventional one does. Finally, we explore the relationship between “reward design” and “acquired behaviors” from the viewpoint of teamwork.
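To make the two design difficulties concrete, here are two toy reward functions for keepaway in the spirit of the discussion above: the conventional survival-based reward, and an illustrative variant that splits credit among the keepers and rewards successful passes. The split rule, bonus value, and function names are assumptions, not the paper's tuned reward.

    def reward_conventional(elapsed):
        """Conventional keepaway reward: the time elapsed while the keepers
        kept possession since the agent's last action choice, so the
        return equals the episode duration."""
        return elapsed

    def reward_shared_pass(elapsed, pass_succeeded, bonus=0.5, n_keepers=3):
        """Illustrative alternative addressing the credit-assignment issue:
        split the survival reward equally among the keepers and add a bonus
        to the passer when a pass succeeds."""
        share = elapsed / n_keepers
        return share + (bonus if pass_succeeded else 0.0)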
International Joint Conference on Autonomous Agents and Multiagent Systems, Workshop on Adaptation and Learning in Autonomous Agents and Multiagent Systems (to be presented in May 2006), 2006