Objects and methods
MDPSolver runs in Python but contains a solver developed in C++. This makes the package ideal for large MDPs.
The MDPSolver workflow:
- Create a model object.
- Select your MDP model.
- Derive an $\epsilon$-optimal policy.
- Get the results.
Use `model()` to create a model object. This object is used for defining your MDP model and deriving its policy.
mdl = mdpsolver.model()
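The four steps above translate into only a few lines of Python. The following is a minimal sketch, assuming the custom `mdp` model described further down this page; the 2-state/2-action rewards and transition probabilities are invented purely for illustration.

```python
import mdpsolver

# 1) Create a model object.
mdl = mdpsolver.model()

# 2) Select your MDP model (here: the custom model with invented numbers).
mdl.mdp(discount=0.95,
        rewards=[[5.0, 10.0],
                 [-1.0, 2.0]],
        tranMatWithZeros=[[[0.9, 0.1], [0.2, 0.8]],
                          [[0.5, 0.5], [0.0, 1.0]]])

# 3) Derive an epsilon-optimal policy.
mdl.solve(algorithm="mpi", tolerance=1e-3)

# 4) Get the results.
print(mdl.getPolicy())
print(mdl.getValueVector())
```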
MDPSolver provides several ways to set up your MDP model. We currently support the following:
- A custom model where users can import any reward structure and transition probability matrix.
- A time-based maintenance problem (soon available).
- A condition-based maintenance problem (soon available).
Select the custom MDP model with user-specified parameters.
void mdp(discount=0.99,
rewards=list(),
rewardsElementwise=list(),
rewardsFromFile="rewards.csv",
tranMatWithZeros=list(),
tranMatElementwise=list(),
tranMatProbs=list(),
tranMatColumns=list(),
tranMatFromFile="transitions.csv")
- `float discount`: The discount (often denoted $\lambda$). Must be chosen such that $0 < \lambda < 1$. Note: Only applicable to the discounted reward optimality criterion (see the `solve` method below).
- `list rewards`: A 2D-list containing the reward (`float`) of a particular action in a particular state. The first index (rows) defines the state, and the second index (columns) defines the action, e.g. `r = rewards[sidx][aidx]`. Note that states do not need to contain the same number of actions. (A sketch comparing the different reward and transition formats follows this parameter list.)
- `list rewardsElementwise`: An alternative reward format. A 2D-list where each row corresponds to a combination of a state and an action. The 2D-list always contains three columns. Column 1: State indices (`int`), Column 2: Action indices (`int`), Column 3: Rewards (`float`). For instance, `r = rewardsElementwise[i][2]`, where, if all states contain the same number of actions, `i = sidx*nActions + aidx`.
- `string rewardsFromFile`: Load the rewards from a comma-separated (`,`) file, e.g. `rewardsFromFile = "rewards.csv"`. This can be useful for several reasons: (1) if you want to generate the reward structure with different software, (2) if you want to store your reward structure for later use, and (3) if your MDP model is large. We assume the file follows the elementwise format described above and contains a header on the first line, e.g. `state,action,reward`.
- `list tranMatWithZeros`: A 3D-list containing the transition probabilities. The first index defines the current state, the second index defines the selected action, and the third index defines the next state, e.g. `p = tranMatWithZeros[sidx][aidx][jidx]`. Note that this format includes elements that are 0.
- `list tranMatElementwise`: Sparse transition probabilities (option 1). A 2D-list where each row corresponds to a combination of a current state, an action, and a next state. The 2D-list always contains four columns. Column 1: Current state indices (`int`), Column 2: Action indices (`int`), Column 3: Next state indices (`int`), Column 4: The (non-zero) transition probabilities (`float`). See the example in our Quick start guide.
- `list tranMatProbs`: Sparse transition probabilities (option 2, part 1). A 3D-list containing the non-zero transition probabilities. Corresponds to the option `tranMatWithZeros`, but excludes all elements that are zero. The first index defines the current state, the second index defines the selected action, and the third index is a sequence index, e.g. `p = tranMatProbs[sidx][aidx][i]`. See the example in our Quick start guide.
- `list tranMatColumns`: Sparse transition probabilities (option 2, part 2). A 3D-list containing the columns (next-state indices) of the non-zero transition probabilities defined in the list `tranMatProbs` (see above). The first index defines the current state, the second index defines the selected action, and the third index is a sequence index, e.g. `c = tranMatColumns[sidx][aidx][i]`. See the example in our Quick start guide.
- `string tranMatFromFile`: Load the transition probabilities from a comma-separated (`,`) file, e.g. `tranMatFromFile = "transitions.csv"`. This can be useful for several reasons: (1) if you want to generate the transition probabilities with different software, (2) if you want to store your transition probabilities for later use, and (3) if your MDP model is large. We assume the file follows the elementwise format (see `tranMatElementwise`) and contains a header on the first line, e.g. `from_state,action,to_state,probability`.
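To make the relationship between the input formats concrete, the sketch below specifies the same invented 2-state/2-action model in each of the reward and transition formats. Only one reward argument and one transition argument should be passed to `mdp()` at a time.

```python
# Dense rewards: rewards[sidx][aidx]
rewards = [[5.0, 10.0],
           [-1.0, 2.0]]

# Elementwise rewards: one row per (state, action) pair -> [sidx, aidx, reward]
rewardsElementwise = [[0, 0, 5.0],
                      [0, 1, 10.0],
                      [1, 0, -1.0],
                      [1, 1, 2.0]]

# Dense transitions (including zeros): tranMatWithZeros[sidx][aidx][jidx]
tranMatWithZeros = [[[0.9, 0.1], [0.2, 0.8]],
                    [[0.5, 0.5], [0.0, 1.0]]]

# Sparse option 1: one row per non-zero element -> [sidx, aidx, jidx, probability]
tranMatElementwise = [[0, 0, 0, 0.9],
                      [0, 0, 1, 0.1],
                      [0, 1, 0, 0.2],
                      [0, 1, 1, 0.8],
                      [1, 0, 0, 0.5],
                      [1, 0, 1, 0.5],
                      [1, 1, 1, 1.0]]

# Sparse option 2: the non-zero probabilities and their next-state indices (columns).
tranMatProbs = [[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0]]]
tranMatColumns = [[[0, 1], [0, 1]],
                  [[0, 1], [1]]]

# Pass exactly one reward format and one transition format, e.g.
mdl.mdp(discount=0.95,
        rewardsElementwise=rewardsElementwise,
        tranMatElementwise=tranMatElementwise)
```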
Time-based maintenance model: Soon available.
Condition-based maintenance model: Soon available.
Derive an $\epsilon$-optimal policy.
void solve(algorithm="mpi",
tolerance=1e-3,
update="standard",
criterion="discounted",
parIterLim=100,
SORrelaxation=1.0,
initPolicy=list(),
initValueVector=list(),
verbose=False,
postProcessing=True,
makeFinalCheck=True,
parallel=True)
- `string algorithm`: The optimization algorithm. Choose between `"mpi"` (modified policy iteration), `"pi"` (policy iteration), and `"vi"` (value iteration).
- `float tolerance`: The tolerance (often denoted $\epsilon$) employed in the optimization of the policy. Must be chosen such that $0 < \epsilon < 1$.
- `string update`: The value-update method. Choose between `"standard"`, `"gs"` (Gauss-Seidel), and `"sor"` (successive over-relaxation). Choosing `"standard"` employs the span norm optimality criterion, whereas `"gs"` and `"sor"` employ the supremum norm optimality criterion.
- `string criterion`: The optimality criterion. Choose between `"discounted"` and `"average"` reward.
- `int parIterLim`: The partial evaluation limit employed in the modified policy iteration algorithm.
- `float SORrelaxation`: The relaxation parameter (often denoted $\omega$) employed in the successive over-relaxation value-update method. Should be chosen such that $0 < \omega < 2$.
- `list initPolicy`: A 1D-list defining the initial policy. Each element contains the action index (`int`) of the state, i.e. `aidx = initPolicy[sidx]`. This option can be used for "warm starting" the optimization procedure (see the sketch after this parameter list).
- `list initValueVector`: A 1D-list defining the initial value vector. Each element contains the value (`float`) of the state, i.e. `val = initValueVector[sidx]`. This option can be used for "warm starting" the optimization procedure.
- `bool verbose`: Turn verbose output on/off.
- `bool postProcessing`: Turn post-processing of the value vector on/off. If the span norm was employed, a post-processing of the value vector is required to obtain correct value estimates. This process does not affect the policy and can therefore be turned off if the user is only interested in the resulting policy.
- `bool makeFinalCheck`: Turn a final check of the value vector on/off. This check verifies that the resulting values are reasonable.
- `bool parallel`: Turn parallel computing on/off. This option is only available for the custom MDP model with `"standard"` updates.
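To illustrate how these options combine, here is a minimal sketch; the parameter values and the warm-start pattern are chosen purely for illustration and assume `mdl` was configured with the custom `mdp` model above.

```python
# Value iteration with Gauss-Seidel updates under the discounted reward criterion.
mdl.solve(algorithm="vi", update="gs", tolerance=1e-4, criterion="discounted")

# Modified policy iteration under the average reward criterion, with verbose output.
mdl.solve(algorithm="mpi", criterion="average", parIterLim=50, verbose=True)

# "Warm start": reuse a previously derived policy and value vector as the
# starting point of a new optimization.
mdl.solve(algorithm="mpi",
          initPolicy=mdl.getPolicy(),
          initValueVector=mdl.getValueVector())
```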
Return an action from the optimized policy.
int getAction(stateIndex=0)
- `int stateIndex`: State index of the requested action.
Return a value from the optimized policy.
float getValue(stateIndex=0)
- `int stateIndex`: State index of the requested value.
Return the entire optimized policy.
list getPolicy()
Return the entire value vector.
list getValueVector()
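A minimal sketch of how these query methods might be combined, assuming `solve()` has already been called; the state index 0 is arbitrary.

```python
# Query a single state.
best_action = mdl.getAction(stateIndex=0)   # int: optimal action index for state 0
state_value = mdl.getValue(stateIndex=0)    # float: value of state 0

# Query the full solution.
policy = mdl.getPolicy()        # list: policy[sidx] -> action index
values = mdl.getValueVector()   # list: values[sidx] -> value of state sidx

print(best_action, state_value)
```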
Print the entire policy in the terminal.
void printPolicy()
Print the entire value vector in the terminal.
void printValueVector()
Save the optimized policy or value vector in a comma-separated file.
void saveToFile(fileName="result.csv",type="policy")
- `string fileName`: The name of the file.
- `string type`: Use `"policy"` or `"p"` to save the policy and `"value"` or `"v"` to save the value vector.
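A minimal sketch, assuming the model has been solved; the file names are arbitrary.

```python
# Save the optimized policy and the value vector to separate CSV files.
mdl.saveToFile(fileName="policy.csv", type="policy")
mdl.saveToFile(fileName="values.csv", type="v")
```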
Return the runtime in milliseconds.
float getRuntime()
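For example (the formatting of the printed message is just an illustration):

```python
runtime_ms = mdl.getRuntime()   # float: runtime in milliseconds
print(f"Solved in {runtime_ms:.1f} ms")
```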