### Table 2. Learnt policies with Method 4.2.1 using optimal and suboptimal demonstrations.

2007

"... In PAGE 8: ... Section 4.2. To evaluate the performance of the method, we explicitly observed the learnt policy when the demonstrated policy is optimal and when it is not. The results are summarized in Table 2. We denoted by 0 the empty slot, by B the large ball, by c the cube and by b the small ball. ..."

Cited by 2

### Table 3. Learnt policies with Method 4.2.2 using optimal and suboptimal demonstrations.

2007

"... In PAGE 59: ... Each episode lasted five seconds, after which the applicability levels were reset to zero. Table 3 shows the final applicability levels for each inverse model at the end of each episode. Figure 9. ..."

Cited by 2

### Table 4. Average trial lengths and standard deviations for final learnt policies from Experiment 2. (Columns: P-HSMQ, TRQ.)

2002

Cited by 8

### Table 1: Considering learning processes of varying length (1000 to 40000 scheduling episodes), this table contrasts the average and best/worst-case tardiness, as well as fluctuations in the learning results, of the learnt policy for Q learning and MQI and for different environments, i.e. different heuristically deciding agents.

2006

"... In PAGE 6: ... Moreover, in both scenarios sketched, the policies learned by MQI are better than the ones found by Q learning, regardless of the size of the four-tuples set F. On Sb, standard Q learning achieves better scheduling performance only in the EDD+MS+[Q|MQI] scenario (Table 1), but only during the first few hundred training episodes. Yet, this good result would only have been detected by exhaustive policy screening. ..."

Cited by 1

### Table 3: One part of the learnt policy. The six middle entries of the rows are the feature values corresponding to those features listed and numbered in Table 2. (The values of Feature 7 are not shown as these are always zero here.) The column marked by the label 'BId' denotes the number of the module that was chosen by a pure exploitation policy under the conditions described by the respective feature values. For example, in State 1 the robot used the behaviour "examine object" (Controller 3) with the pure exploitation strategy.

"... In PAGE 25: ...respectively, with nearly equal std-s of 34.78 and 34.82, respectively. One part of the learned policy is shown in Table 3, where 10 states were selected from the 25 explored ones together with their learnt associated behaviours. Theoretically, the total number of states is 2^7 = 128, but as learning concentrates on feature-configurations that really occur this num- ..."

### Table 1: Interception Using a Grid- and Memory-Based Function Approximator: The table summarizes the average number of steps to intercept the ball for a set of 1000 random start situations (noise-free environment). Columns reflect the quality of learnt policies after different numbers of learning episodes experienced. So, the rightmost column corresponds to approximately ten million state value backups.

2006

"... In PAGE 4: ... Choosing different levels of discretization, we reduced S to 5k, 100k and 600k abstract states distributed nearly equidistantly over S and applied our TD(1) learning algorithm directly. Compared to the reference method MB, the results obtained were far from optimal (Table 1, lines 1-3). This is not surprising when considering the complexity of the underlying problem (cf. ..."

Cited by 6

### Table 1: Success rate (%) for different policy types.

2004

"... In PAGE 11: ... Table 1 shows the performance of the baseline policy and three types of learnt policy as a percentage (likelihood of detecting the evader), for different numbers of pursuers. The error bounds given are at one standard deviation: the standard errors (n = 500) are about 20 times smaller, so a difference of 2% is significant, using Gaussian statistics. ..."

Cited by 1