Which of the following is a difference between exploration and exploitation quizlet?

China - China's unchanged culture fostered a sense of patriotism and unified the country. China wanted to be superior to other countries, and to achieve this it remained self-sufficient. It also focused on its own security, since it was being attacked by the Manchus and was very worried about invasion. To protect itself, China built a wall.

Japan - Jesuits were trying to convert others to Christianity and create a Catholic empire. In 1615 the Japanese drove the Jesuits from Japan and began a closed, or locked, country policy to stay away from foreign influence and culture. Japan remained closed for over 200 years, and the policy gave the Japanese 250 years of peace. To achieve this, they closed all of their ports except one and traded only with the Chinese and Dutch. Japan is an island nation.

In Japan, they remained isolated to save their culture and keep out foreign influence. In China, they remained isolated to feel superior and show off their wealth.

Although isolation brings no war, no disease, and a self-sufficient economy, the risk of having no allies is too great. An isolated country would also be cut off from new inventions, ideas, medicine, crops, and discoveries.


Terms in this set (20)

Exploration

The thing that separates RL from other machine learning topics

The more data we have the more confident we are to believe that data.

True. This is confidence-based exploration.

To obtain a lot of reward, an RL agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before.

The agent has to exploit what it has already experienced, but it also has to EXPLORE in order to make better action selections in the future

True

On a stochastic task, each action must be tried many times in order to gain a reliable estimate of its expected reward

True

Transitions: bandits vs. deterministic MDPs vs. stochastic MDPs

bandits - have no state transitions at all but do have stochasticity
deterministic MDPs - have state transitions but no stochasticity at all
stochastic MDPs - have both; combining the two ideas above gives us a way of solving general MDPs

K-Armed Bandits

- k different arms (slot machines)
- we don't know the payout of each arm

Choosing the arm with minimum confidence is going to get you better estimates, while choosing the arm with maximum likelihood is going to get you more reward

True
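The trade-off in the card above can be tried out with a small simulation. This is only a sketch under stated assumptions: the Bernoulli payouts, the function names `choose_arm` and `run_bandit`, and the rule of trying every arm once before applying either strategy are illustrative choices, not part of the original set.

```python
import random

def choose_arm(counts, sums, rule):
    """Pick an arm index under one of two illustrative rules."""
    arms = range(len(counts))
    untried = [a for a in arms if counts[a] == 0]
    if untried:
        return untried[0]  # assumption: try every arm once first
    if rule == "min_confidence":
        # The least-sampled arm is the one we are least confident about.
        return min(arms, key=lambda a: counts[a])
    # "max_likelihood": the arm with the best average payout so far.
    return max(arms, key=lambda a: sums[a] / counts[a])

def run_bandit(true_means, steps, rule, seed=0):
    """Play `steps` pulls; return (total reward, pull count per arm)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts, sums, total = [0] * k, [0.0] * k, 0.0
    for _ in range(steps):
        arm = choose_arm(counts, sums, rule)
        # Hypothetical Bernoulli payout: 1 with probability true_means[arm].
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts

if __name__ == "__main__":
    for rule in ("min_confidence", "max_likelihood"):
        print(rule, run_bandit([0.2, 0.5, 0.8], 300, rule))
```

The minimum-confidence rule spreads its pulls evenly, so every estimate improves, while the maximum-likelihood rule concentrates pulls on whichever arm currently looks best and can lock onto a poor arm after an unlucky start.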

Metrics for Bandits

1. Maximize expected reward over finite horizon
2. Identify near optimal arm with high probability
3. Nearly maximize reward with high probability

The three metrics can be combined:
Find best arm --> Few mistakes
Few mistakes --> Do well
Do well --> Find best arm

If we have an algorithm that gets within epsilon of optimal per time step, then it can be used to find the best arm.

Hoeffding

- tells us how many samples we need to accurately learn the value of an arm
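As a rough illustration of the sample count, here is a sketch that assumes rewards bounded in [0, 1] and uses the two-sided Hoeffding inequality, P(|estimate - true mean| >= epsilon) <= 2 exp(-2 n epsilon^2). The function names are hypothetical, and the multi-arm helper simply splits the failure probability across arms with a union bound.

```python
import math

def hoeffding_samples(epsilon, delta):
    """Samples needed so the sample mean of a [0, 1] reward is within
    epsilon of the true mean with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def hoeffding_radius(n, delta):
    """Half-width of the confidence interval after n samples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def samples_per_arm(k, epsilon, delta):
    """Union bound: give each of the k arms a failure budget of delta / k."""
    return hoeffding_samples(epsilon, delta / k)

if __name__ == "__main__":
    print(hoeffding_samples(0.1, 0.05))   # 185
    print(samples_per_arm(10, 0.1, 0.05))
```

Halving epsilon roughly quadruples the number of samples, while shrinking delta costs only a logarithmic factor.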

Exploring deterministic MDPs

- explore randomly
- trap states
- mistake bound: the number of epsilon-suboptimal actions needs to be bounded

RMAX algorithm
1. Keep track of MDP
2. Any unknown state-action pair is assumed to have reward Rmax
3. Solve MDP
4. Take action from optimal policy
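The four steps above can be sketched for a small deterministic MDP as follows. This is a minimal sketch, not the reference algorithm: the discounted formulation with `gamma`, the value-iteration planner, and the three-state chain environment `chain_step` are all assumptions made for illustration.

```python
def q_value(model, values, s, a, r_max, gamma):
    """Optimistic Q-value: an unknown pair is worth r_max forever."""
    if (s, a) not in model:
        return r_max / (1.0 - gamma)
    next_state, reward = model[(s, a)]
    return reward + gamma * values[next_state]

def solve(model, n_states, n_actions, r_max, gamma, sweeps=200):
    """Step 3: solve the optimistic MDP by value iteration."""
    values = [0.0] * n_states
    for _ in range(sweeps):
        values = [max(q_value(model, values, s, a, r_max, gamma)
                      for a in range(n_actions))
                  for s in range(n_states)]
    return values

def rmax_deterministic(env_step, n_states, n_actions, r_max,
                       start=0, steps=20, gamma=0.9):
    model, state, total = {}, start, 0.0      # step 1: keep track of the MDP
    for _ in range(steps):
        values = solve(model, n_states, n_actions, r_max, gamma)
        # Step 4: act greedily with respect to the optimistic model.
        action = max(range(n_actions),
                     key=lambda a: q_value(model, values, state, a, r_max, gamma))
        next_state, reward = env_step(state, action)
        model[(state, action)] = (next_state, reward)  # one visit is enough
        total += reward
        state = next_state
    return model, total

def chain_step(state, action):
    """Hypothetical 3-state chain: action 0 stays, action 1 moves right;
    only action 1 in the last state pays reward 1."""
    if action == 0:
        return state, 0.0
    if state < 2:
        return state + 1, 0.0
    return 2, 1.0

if __name__ == "__main__":
    model, total = rmax_deterministic(chain_step, 3, 2, r_max=1.0)
    print(len(model), total)
```

In this example the agent discovers all n*k = 6 state-action pairs in its first six steps and acts optimally afterwards.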

RMAX analysis

1. Once all edges are known, there are no more mistakes, because the agent solves the true MDP and every action is optimal

2. Stop visiting unknown states
- if we loop without learning anything, then there are no mistakes

3. The number of transitions to an unknown state-action pair is bounded by the number of states
- the number of times we might discover state-action pairs is n*k

Lower bound for algorithm is O(n^2 k)

True

General Stochastic MDPs

- we want an efficient algorithm:
- stochastic: use the Hoeffding bound until estimates are accurate, AND
- sequential: unknown state-action pairs are assumed optimistic

Explore or exploit lemma

- if all transitions are either accurately estimated or unknown, optimal policy is either near optimal or an unknown state is reached quickly

Summary

Bandits

Bandits are all about stochasticity and randomness
- decision making with randomness
- we can estimate what we know using the Hoeffding bound and the union bound to convince ourselves that we have a sufficiently accurate estimate of near-optimal reward in stochastic decision problems
- we have a stochastic world and we want to learn how it works so that we can optimize and get near-optimal reward

- lets us deal with stochastic decision making

The Hoeffding bound tells us how certain we really are, so that we know when we are certain enough

True

RMax works with Deterministic MDPs

- optimism in the face of uncertainty
- causes us to explore at a distance by planning ahead to reach new states where new information could be gained
- lets us deal with sequential decision making

Combining the two: use the bandit idea to estimate the noisy parameters (transition probabilities), and use the RMax idea to make sure things are visited often enough to get accurate estimates

True
bandits - helped us estimate the transition probabilities and know when to believe them

- stochastic + sequential

KWIK learning

- the way we distinguish known from unknown in bandits can be generalized to KWIK ("knows what it knows") learning
- transition probabilities are learned with methods that know what they know

- if a learner can distinguish between known and unknown, it can associate optimism with the unknown, which gives guarantees on how efficiently near-optimal behavior can be learned

- KWIK is a learning framework that tries to go beyond tabular MDPs by generalizing transition probabilities across different parts of the MDP
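A toy version of the "knows what it knows" idea is a learner that estimates a single coin's bias and refuses to predict until it has enough data. This is a sketch under assumptions: the class name `KwikCoinLearner`, the use of `None` as the "I don't know" answer, and the Hoeffding-based sample threshold are illustrative choices rather than anything specified in the set.

```python
import math

class KwikCoinLearner:
    """Predicts the probability of heads, or None for "I don't know"."""

    def __init__(self, epsilon, delta):
        # Hoeffding bound for a [0, 1] outcome: after this many samples the
        # estimate is within epsilon with probability at least 1 - delta.
        self.needed = math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))
        self.n = 0
        self.heads = 0

    def predict(self):
        if self.n < self.needed:
            return None            # unknown: admit it instead of guessing
        return self.heads / self.n

    def observe(self, outcome):
        """Record one flip (1 for heads, 0 for tails)."""
        self.n += 1
        self.heads += outcome

if __name__ == "__main__":
    learner = KwikCoinLearner(epsilon=0.1, delta=0.05)
    print(learner.predict())       # None, since nothing has been observed
    for _ in range(learner.needed):
        learner.observe(1)
    print(learner.predict())
```

An agent built on such a learner can treat every `None` as an unknown state-action pair and apply optimism to it, which is how the known/unknown distinction connects back to RMax.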

Setting the Rmax parameter c

In practice Rmax is a very effective algorithm: it makes good use of the data
- c too small: might not learn near-optimal behavior
- c too big: the learner has to visit many state-action pairs many times over
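One way to picture the role of c is as a visit threshold: a state-action pair counts as known only after it has been tried c times. The sketch below assumes that reading of c. The class `CountModel` and its methods are hypothetical names, and the planner that would use the estimates is left out.

```python
from collections import defaultdict

class CountModel:
    """Empirical model in which a pair is 'known' after c visits."""

    def __init__(self, c):
        self.c = c
        self.counts = defaultdict(int)
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sums = defaultdict(float)

    def update(self, s, a, reward, next_state):
        """Record one observed transition."""
        self.counts[(s, a)] += 1
        self.next_counts[(s, a)][next_state] += 1
        self.reward_sums[(s, a)] += reward

    def is_known(self, s, a):
        return self.counts.get((s, a), 0) >= self.c

    def estimate(self, s, a):
        """Return (transition probabilities, mean reward), or None while the
        pair is unknown and should be treated optimistically."""
        if not self.is_known(s, a):
            return None
        n = self.counts[(s, a)]
        probs = {s2: m / n for s2, m in self.next_counts[(s, a)].items()}
        return probs, self.reward_sums[(s, a)] / n

if __name__ == "__main__":
    model = CountModel(c=3)
    for reward, next_state in [(1.0, 1), (1.0, 1), (0.0, 0)]:
        model.update(0, 1, reward, next_state)
    print(model.estimate(0, 1))
```

With a small c the estimates become trusted after only a few noisy samples, and with a large c every pair has to be revisited many times before the agent stops exploring it, which matches the trade-off in the card above.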


Which of the following is a difference between exploration and exploitation?

Which of the following is a difference between exploration and exploitation? a. Exploration extends the search for commercialization of new products beyond the boundaries of an organization, while exploitation limits itself to the boundary of an organization.

What are changes in an organization's production process that enable distinctive competence?

Technology changes: changes in an organization's production process, including its knowledge and skill base, that enable distinctive competence.

Which of the following is a characteristic of companies that successfully innovate?

Innovative companies share common characteristics that include empowerment and trust, team collaboration, leading by example, and listening to customers.

What is the adoption of an idea or behavior that is new to an organization's industry, market, or general environment?

The adoption of an idea or behavior that is new to the organization's industry, market, or general environment is referred to as: organizational innovation.