본문 바로가기
RL with python/Python example code

[RL-Python] Multi-armed bandit - softmax

by achrxme 2023. 8. 21.
import numpy as np
from scipy import stats
import random
import matplotlib.pyplot as plt


def get_reward(prob, n=10):
    """Sample a reward for one arm pull.

    Runs ``n`` independent Bernoulli trials, each succeeding with
    probability ``prob``, and returns the number of successes (0..n).
    """
    return sum(1 for _ in range(n) if random.random() < prob)


def softmax(av, tau=1.12):
    """Turn action-value estimates into a selection probability distribution.

    Args:
        av: 1-D array of action-value (mean reward) estimates.
        tau: temperature; smaller tau makes selection greedier,
            larger tau makes it more uniform.

    Returns:
        Array of the same shape as ``av`` with entries summing to 1.
    """
    scaled = av / tau
    # Subtract the max before exponentiating for numerical stability:
    # np.exp overflows for large inputs, and the shift cancels out in
    # the normalization, leaving the result mathematically unchanged.
    exp_vals = np.exp(scaled - np.max(scaled))
    return exp_vals / np.sum(exp_vals)

def update_record(record, action, r):
    """Incrementally update the running statistics for one bandit arm.

    ``record[:, 0]`` holds the play count per arm and ``record[:, 1]``
    the mean reward per arm.  The new mean is obtained by recovering the
    reward sum so far (mean * count), adding the new reward ``r``, and
    dividing by the new count.  Mutates ``record`` in place and returns it.
    """
    count = record[action, 0]
    mean = record[action, 1]
    record[action, 1] = (count * mean + r) / (count + 1)
    record[action, 0] = count + 1
    return record


n = 10                     # number of bandit arms
record = np.zeros((n, 2))  # per-arm [play count, mean reward]
probs = np.random.rand(n)  # hidden success probability of each arm

# One figure for the learning curve.  (The original created two figures
# back to back, leaving the first one empty and leaked; the unused
# epsilon-greedy leftover `eps = 0.2` is also removed.)
fig, ax = plt.subplots(1, 1)
ax.set_xlabel("Plays")
ax.set_ylabel("Avg Reward")
fig.set_size_inches(9, 5)

rewards = [0]
for i in range(500):
    # Select an arm by sampling from the softmax distribution over the
    # current mean-reward estimates (instead of epsilon-greedy).
    p = softmax(record[:, 1], tau=0.7)
    choice = np.random.choice(np.arange(n), p=p)
    r = get_reward(probs[choice])
    record = update_record(record, choice, r)
    # Running average of the reward over all plays so far.
    mean_reward = ((i + 1) * rewards[-1] + r) / (i + 2)
    rewards.append(mean_reward)

ax.scatter(np.arange(len(rewards)), rewards)
plt.show()