SKINNER OPERANT CONDITIONING MODEL AND ROBOT BIONIC SELF-LEARNING CONTROL

Original scientific paper

A Fuzzy Skinner Operant Conditioning Automaton (FSOCA) is constructed by combining the Operant Conditioning mechanism with fuzzy set theory. The main characteristics of the FSOCA automaton are that the results of fuzzifying the state with a Gaussian function are used as fuzzy state sets, and that fuzzy "conditioning-operation" mapping rules replace the stochastic "conditioning-operant" mapping sets. The FSOCA automaton can therefore be used to describe, simulate and design various self-organizing actions of a fuzzy uncertain system. The FSOCA automaton first adopts an online clustering algorithm to divide the input space and uses the excitation intensity of the mapping rules to decide whether a new mapping rule needs to be generated, so that the number of mapping rules stays economical. The designed FSOCA automaton is applied to the motion balance control of a two-wheeled robot. As learning proceeds, the selection probability of the optimal consequent fuzzy operant gradually increases, the fuzzy operant action entropy gradually decreases and fuzzy mapping rules are automatically generated and deleted. After about seventeen rounds of training, the selection probabilities of the optimal consequent fuzzy operants tend to one, the fuzzy operant action entropy tends to its minimum and the number of fuzzy mapping rules is optimal. The robot thus gradually learns the motion balance skill.


Introduction
The combination of the disciplines of Psychology of Learning, Biology and Machine Learning has led to the development of Bionic Self-learning theory and practice. A main research objective of intelligent control and artificial intelligence has been to enable robots to obtain a bionic self-learning ability, gradually acquiring new knowledge during operation and gaining motion control skills similar to those possessed by animals and human beings. A great number of reports and publications have been dedicated to robot control, an area of bionic self-learning that is of great interest in the study of neural networks [1÷7]. Although the study of robots based on artificial neural networks has connected robot motion control with neural physiology and cognitive science, the connection is still weak, and the motion control skills of robots still rely on descriptive control rules, which involve excessive elements of design and little of the bionic self-learning and self-organizing skills of a biological system. This has impeded the development of bionic self-learning. The theory of Operant Conditioning, as an important guide to the study of the learning mechanism in human and animal neural networks, has brought the research on bionic self-learning to a new stage [8].
Since the mid-1990s, Carnegie Mellon University in the US has been focusing on the computing theory and model of Skinner Operant Conditioning and has applied this model to autonomous robots [9]. Under the influence of this work, several research areas related to Skinner Operant Conditioning, including ALC (Autonomous Learning Control), have received wide interest.
Professor Zalama of the Department of Automatic Control and Systems Engineering at the University of Valladolid has conducted in-depth research on the learning and control behaviours of robots and, with his team, developed a computing method for obstacle avoidance based on the theory of Operant Conditioning. By enabling the robot to move at different angular velocities in an environment of disordered obstacles, and by activating nodes in the angular velocity mapping, a system of weights was developed in this model. The robot gradually obtained the skill of obstacle avoidance in an environment without supervision by reinforcing negative signals generated by collision. In 2002, a neural network responding to stimulation was developed for the navigation problem, to correct navigation errors and realize reinforcement learning in Operant Conditioning [10]. In 2006, they designed a learning and computing model of first and second order conditioning for a robot named Arisco with audio-visual sensation. This model was created using competitive artificial neural networks and gave Arisco some self-organizing functions [11]. In 1997, Gaudiano and Chang of the laboratory of Neurobotics at Boston University in the US conducted similar research to build a neural computing model that combines Pavlov's theory and Skinner Operant Conditioning theory for a navigation problem in a wheeled robot named Khepera. Khepera can learn obstacle avoidance through navigation without any empirical knowledge or instructing signals [12]. In 2005, a robotics research team in Mechanical Engineering at Waseda University reported their study results. Itoh et al. believe that robots in the future should be more humanized, expressive, emotional and individualistic. Therefore, they designed a new behaviour model for humanized robots, based on Skinner Operant Conditioning theory, and realized the model in WE-4RII robots. The experiment showed that, based on the OC model, WE-4RII could autonomously select a proper behaviour model [13] from a given behaviour list in accordance with a particular setting, and learned an interaction skill: shaking hands with a human.
A problem underlying the above research is the following: how should Skinner OC be realized on machines and robots? A majority of the solutions rely on descriptive language, while some adopt conventional artificial neural networks. However, purely descriptive language is not formalized and therefore lacks the ability of generalization, and conventional artificial neural networks cannot reflect the real structure and function of a biological neural system. In response to this question, Professor Ruan Xiao-gang has conducted in-depth research since 2009, working on building an OC computing model [14] with a probabilistic automaton, and put forward the concept of Skinner Operant Conditioning Automata (SOCA) through simulating Skinner's pigeon experiment and applying it to the self-learning of two-wheeled self-balancing robots. This method showed good self-learning abilities [15÷17] by enabling the robots to master self-balancing through learning. In 2010, the research team used a cerebellar model to build an OC computing method based on the OC automaton, and conducted a bionic experiment on two-wheeled self-balancing robots [18÷20].
The OC automaton that has been built converges quickly, but its learning accuracy is relatively low, which limits the application of OC automata. There are two major reasons for the poor learning performance of OC automata. 1) The output of OC automata is a limited and discrete behaviour set. The operant behaviour of the automata is not continuous, so the control output is not smooth and oscillates; in addition, in terms of the OC self-learning model, the self-learning and adaptation abilities of the learning model are constrained by the limited number of available operant behaviours and are subject to failure when the control effects are poor or when a change of outside conditions calls for new behaviour models whose optimal operant behaviours are not in the behaviour set. Therefore, due to the discrete output and the limited number of operant behaviours, the OC automata cannot ensure that the control quantity they learn is optimal in a nonlinear, time-varying and continuous system, and the accuracy of learning and self-adaptation cannot be guaranteed. 2) The number of inward mappings in OC automata is fixed. Among them there are redundant mapping rules, which reduce the speed of learning. In fact, human control behaviour is generated by revising a small number of rules to create complex control behaviour rather than by a large set of rules. Therefore, it is necessary to economize the number of mapping rules to improve the learning performance and self-adaptation abilities of the OC learning model.
Increasing the number of operant behaviours can alleviate the lack of smoothness in the output control but will reduce the learning speed. In other bionic mechanisms such as reinforcement learning, the usual solution to this problem is the adoption of neural-fuzzy networks [21,22], which have low convergence speed and poor real-time performance.
Literature [23] proposed a Q-learning method that realizes the automatic increase and decrease of the number of fuzzy rules and so solves the problem of fixed mapping rules. However, this method selects consequent behaviours in the fuzzy inference system from a fixed set of behaviours, which results in a lack of smoothness in the output control. Literature [24] designed a fuzzy logic system based on reinforcement learning, using genetic algorithms based on the Q value together with an online clustering method. The fuzzy logic system can learn rules online and automatically generate fuzzy rules from zero. However, as it adopts genetic algorithms, it is complicated and requires a large amount of computation.
Although the above studies on reinforcement learning cannot fundamentally solve the problems in OC automata, they indicate that fuzzy logic is an effective solution. Fuzzy logic has strong self-learning and adaptation abilities and has received wide interest among researchers. Fuzzy logic systems offer high accuracy, wide applicability, strong generalization ability and ease of construction. They can use fuzzy sets of limited size to describe the state and operant behaviour spaces, and they adapt to fuzzy descriptions and uncertain knowledge, in line with the human way of thinking. Therefore, fuzzy logic systems combine readily with bionic learning, which stresses initiative, and they are now widely used in bionic self-learning models to tackle problems in complicated continuous systems. The advantages of adopting a fuzzy logic system are summarized as follows: 1) It is capable of smooth output of continuous control; 2) Several successful solutions exist for the automatic increase and decrease of fuzzy rules; 3) The fuzzification process of a fuzzy inference system is equivalent to the discretization process of the OC automata, so suitable membership functions can be selected to avoid the discretization errors of the OC automata; 4) The learning process of OC automata is similar to the fuzzy inference process under fuzzy control. Therefore, using fuzzy language and fuzzy inference to describe OC automata not only makes the structure clear but also presents the learning results of the OC automata as a list of rules, showing the accumulated learning experience of the self-learning model in a direct way.

Design of Fuzzy Skinner Operant Conditioning Automaton (FSOCA)

2.1 Mechanism of operant conditioning
The core content of Skinner operant conditioning theory is that, by way of learning or training, animals will find their nervous tissue changed. The change results in a connection between a certain percept sequence and action sequence, namely the continuous recursive process from "percept" to "action" and again to "percept". The operant conditioning mechanism is shown in Fig. 1.

Figure 1 Sketch map of operant conditioning mechanism
The learning control on the basis of operant conditioning principles mainly consists of three elements: a behaviour selection mechanism (choice of behaviour based on probability), an evaluation mechanism and an orientation mechanism. As the core part of learning, the orientation mechanism is used to update the behaviour selection strategies. Fig. 2 is the sketch map of the learning control mechanism on the basis of operant conditioning principles.

Structure of FSOCA
The most salient feature of fuzzy control is that it expresses experts' control experience and knowledge as linguistic control rules and then controls the system through these rules. Thus, fuzzy control theory has become a significant branch of intelligent control theory. Because both the antecedent and the consequent of a fuzzy logic system are described by natural language variables, it is unnecessary to establish precise mathematical models and easy to transform expert knowledge directly into control signals, so fuzzy logic has become a significant method in robot control [25]. Since a fuzzy conditional statement is made up of several linguistic variables, fuzzy inference reflects a certain human way of thinking. If fuzzy inference is viewed as the mapping relationship between state space and action space, we can establish the Fuzzy Skinner Operant Conditioning Automaton based on fuzzy set theory. FSOCA uses a limited number of fuzzy sets to describe the condition and operation behaviour spaces.
The designed structure of FSOCA is shown in Fig. 3. In the learning model displayed in Fig. 3, the antecedent of each mapping rule corresponds to a fuzzy subset $F_{ij}$ of the input space and the consequent is a certain operation behavior $a^{*}(t)$. Therefore, in essence, the learning problem of FSOCA is to seek the optimal decision vector for each mapping rule.
The definition of FSOCA can be formalized as follows. Definition 1: FSOCA is a nine-tuple computational model. Each part is illustrated as follows: (1) Internal continuous state of FSOCA: $x = \{x_i \mid i = 1, 2, \dots, n\}$, the actual state value of the detected control system; $n$ represents the number of internal continuous states in the learning model. By employing the online clustering algorithm on $x(t)$, we can construct the antecedent of FSOCA automatically.
(2) Internal fuzzy state set of FSOCA: $F = \{F_{ij} \mid i = 1, \dots, n;\ j = 1, \dots, L\}$. As the state antecedent of FSOCA, $F$ emerges as the fuzzy subset after the fuzzification of $x$. For the fuzzification, or discretization, of $x(t)$, the Gaussian function is adopted, in which $c_{ij}$ and $b_{ij}$ stand for the center and width of the Gaussian function respectively, and $j = 1, \dots, L$, where $L$ is the number of clusters, namely the number of mapping rules.
Thus, we obtain the excitation intensity of the $j$th mapping rule. $a_k$ stands for the $k$th available operation behavior and $r$ for the number of available consequent operation behaviors. The goal of learning is to search for the optimal consequent within the consequent operation behavior set $A$.
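The fuzzification step and the excitation intensity can be illustrated with a minimal sketch. The exact expressions are not recoverable from the text here, so the form $\exp(-((x_i - c_{ij})/b_{ij})^2)$ and the use of a product over the input dimensions are assumptions, and the function names are hypothetical.

```python
import math


def gaussian_membership(x, c, b):
    """Membership of a scalar state x in a Gaussian fuzzy set.

    c is the center and b the width. The exponent form used here is an
    assumption; the paper's own expression is not shown in the text.
    """
    return math.exp(-((x - c) / b) ** 2)


def excitation_intensity(x, centers, widths):
    """Excitation intensity of one mapping rule for the state vector x.

    Assumed here to be the product of the per-dimension memberships.
    """
    phi = 1.0
    for xi, ci, bi in zip(x, centers, widths):
        phi *= gaussian_membership(xi, ci, bi)
    return phi


# Example: inclination and angular velocity against a rule centered at (0, 0).
phi = excitation_intensity([0.2, 0.0], centers=[0.0, 0.0], widths=[5.0, 5.0])
```

A state that coincides with the rule center gives an intensity of 1, and the intensity decays smoothly as the state moves away from the center.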
As the control signal of the system, the final output of FSOCA is obtained from the selected consequent operation behaviors. (4) The fuzzy "condition-operation" mapping rule set of FSOCA, which replaces the random "condition-operation" set of SOCA. Here $N$ is the total number of mapping rules and $R_j(P_j)$ is the $j$th mapping rule. It can be seen that the fuzzy "condition-operation" mapping rule set of FSOCA resembles the definition of the fuzzy rule table in a fuzzy inference system. The main difference between the two is that the mapping rules of the former are random, each mapping rule being connected with a certain probability, while the fuzzy rules of the latter are definite.
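The experimental section describes the control signal as a weighted sum of the fuzzy operating behaviors. A minimal sketch of such an output stage follows; normalizing by the sum of the excitation intensities is an assumption, and `fsoca_output` is a hypothetical name.

```python
def fsoca_output(intensities, selected_behaviors):
    """Blend the behavior selected by each mapping rule into one control value.

    intensities[j] is the excitation intensity of rule j and
    selected_behaviors[j] is the behavior that rule j selected.
    The normalized weighted sum is an assumed defuzzification form.
    """
    total = sum(intensities)
    if total == 0.0:
        return 0.0  # no rule is excited, so output a neutral control value
    weighted = sum(phi * a for phi, a in zip(intensities, selected_behaviors))
    return weighted / total


# A weakly excited rule that picked a "bad" behavior barely moves the output.
u = fsoca_output([0.9, 0.1], [5.0, -24.0])
```

This is why a single poor choice in a weakly excited rule has only a limited effect on the voltage applied to the wheels.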
... it means that the consequent of the mapping rules tends to select the operation behavior that makes the orientation function minimal. In other words, the learning model has already understood and adapted to the environment and acquired "learning of random mapping rules". Thus, the learning mechanism of FSOCA is as follows: if the implementation of operation behavior $a$ ... When the consequents $a_k(t)$ of all mapping rules are equally probable, the operation behavior entropy becomes the largest. The operation behavior entropy is used to measure the degree of uncertainty of the mapping rules, which in turn measures the amount of information acquired by the learning model. In other words, the learning goal of FSOCA is to transform the uncertain consequent of each mapping rule into a certain one, enabling the mapping rule set to evolve from the unorganized to the organized instinctively or spontaneously under the domination of FSOCA.
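The behavior of the operation behavior entropy can be sketched with the Shannon entropy of one rule's probability vector. The base-2 logarithm and the function name are assumptions, since the paper's own entropy expression is not shown in the text.

```python
import math


def behavior_entropy(probabilities):
    """Shannon entropy (in bits) of one rule's behavior probability vector.

    Zero-probability behaviors contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


uniform = [1.0 / 7] * 7             # nothing learned yet: maximal uncertainty
peaked = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # one behavior always selected

h_max = behavior_entropy(uniform)   # equals log2(7)
h_min = behavior_entropy(peaked)    # equals 0
```

The entropy is largest for the uniform vector and falls to zero once a single consequent is selected with probability one, which matches the learning goal stated above.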
(9) Internal parameter vector of FSOCA: the similarity threshold of cluster width and cluster center.These parameters are collectively referred to as internal parameters of FSOCA.The selection of these values has not only significant impacts on the learning speed and accuracy of learning models but also direct influence on the success of learning.
The basic learning process of FSOCA can be summarized as follows. Suppose that the state detected by the control system at time $t$ is $x(t)$. First, the Gaussian function is employed to fuzzify $x(t)$, and the online clustering algorithm is used to automatically construct the mapping rule antecedents of FSOCA. Next, the fuzzy subset $F_{ij}$ activates the mapping relationship $\Gamma_j$ as an activation signal, and the learning mechanism of operant conditioning is employed to obtain a certain operation behavior. Ultimately, according to the trend of the variation of the orientation value, the learning mechanism of operant conditioning $L$ is employed to adjust and update the probability vector $P_j$ and the reward probability of the consequent operation behavior. When a new state is activated, this process is repeated until the optimal consequent behavior set $A^{*}$ is learned. Therefore, the essence of FSOCA is to achieve the optimal mapping from the fuzzy antecedent state $F_j$ to the fuzzy consequent behavior $a_k(t)$. FSOCA is the result of the fuzzification of the Skinner Operant Conditioning Automaton. Comparing the two, we find that they differ mainly in three ways. Firstly, for the discretization of the continuous input state, FSOCA uses fuzzification, which is mature and suitable for actual systems. Secondly, FSOCA outputs a continuous, smooth control variable. Thirdly, FSOCA mapping rules can be deleted automatically.

The orientation value amounts to a transformation of the original error measurement value $e(t)$, which is made to range from 0 to 1. According to the orientation function expression designed in formula (5), the relationship between the orientation value and the orientation quality is as follows: when the orientation value approaches 0, the performance of the learning model is the best and the orientation reaches its maximum; when the value approaches 1, the performance of the learning model is the worst and the corresponding orientation reaches its minimum; when the value lies between 0 and 1, the smaller the value, the better the performance of the corresponding model. Therefore, the goal of learning is to make the performance index function approach its minimum.
Note: The orientation function designed here is mainly prepared for the control system.As far as control systems are concerned, error serves as the most direct indicator of reflecting the quality of system performance.So we design the orientation function based on the system error.Also, given that the closer the system error approaches 0, the better the system performance, we define the relationship between the orientation value and the orientation quality as the one mentioned above.

Design of learning mechanism
The learning mechanism serves to achieve the random mapping according to the mapping set $\Gamma$ of random "condition-operation" rules, after which we observe that $s_j(t+1)$ is the state at time $t+1$, together with its orientation value. Since the variation of the orientation function value can be used to judge the performance of the operation, the learning mechanism is designed as follows according to the Skinner operant conditioning theory: the increase part is designed as formula (6) and the decrease part as formula (7). In the OC learning mechanism formulas, the learning coefficients $\eta_1$ and $\eta_2$ not only play a role in influencing the learning speed but also enable learning models to reflect orientation characteristics more similar to those of animals.
From formulas (6) and (7), we can see that the excitation probability of the random mapping is mainly determined by the variation of the orientation value: the larger the variation, the faster the probability of the corresponding "good" operation behavior increases and that of a "bad" operation behavior decreases; on the contrary, the smaller the variation of the orientation value, the slower the probability is updated.
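The qualitative behavior described above can be sketched with a simple reward-penalty style update. This is not the paper's formulas (6) and (7), which are not recoverable from the text; the update form, the assignment of $\eta_1$ to the increase part and $\eta_2$ to the decrease part, and the function name are all assumptions made for illustration.

```python
def update_probabilities(p, k, delta_orientation, eta1=0.01, eta2=0.001):
    """Update one rule's probability vector after behavior k was executed.

    delta_orientation is the orientation value at t+1 minus the value at t.
    A negative change means the behavior was "good" (smaller is better).
    The step scales with the size of the change; this form is an assumption.
    """
    r = len(p)
    old = p[k]
    magnitude = min(abs(delta_orientation), 1.0)
    if delta_orientation < 0:
        new = old + eta1 * magnitude * (1.0 - old)   # reinforce a "good" behavior
    else:
        new = old - eta2 * magnitude * old           # weaken a "bad" behavior
    rest_old = 1.0 - old
    updated = []
    for i, pi in enumerate(p):
        if i == k:
            updated.append(new)
        elif rest_old > 0.0:
            # Rescale the other behaviors so the vector still sums to 1.
            updated.append(pi * (1.0 - new) / rest_old)
        else:
            updated.append((1.0 - new) / (r - 1))
    return updated


p0 = [1.0 / 7] * 7
p_good = update_probabilities(p0, k=3, delta_orientation=-0.5)
p_bad = update_probabilities(p0, k=3, delta_orientation=0.5)
```

A larger change in the orientation value produces a larger step, and the vector remains a valid probability distribution after every update.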

Design of clustering algorithm
The fuzzy antecedents of FSOCA are built by an online clustering algorithm. Because data is generated during the online bionic learning process, a clustering algorithm that automatically generates a certain number of mapping rules is needed. One mapping rule corresponds to one cluster in the state space, and the excitation intensity can be used to examine the extent to which a state belongs to the corresponding cluster; namely, a state x(t) with high excitation intensity is close to the cluster center in geometric space. Therefore, the excitation intensity of the mapping rules is used in this article as the criterion for whether new mapping rules are generated.
Suppose that at $t = 0$ the excitation state is $x(0)$; a new mapping rule is then generated, and the center and width of its corresponding Gaussian function are initialized. When $t = 1$, the maximum excitation intensity over the existing mapping rules is computed; the corresponding formula involves the degree of overlap between two clusters.
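The rule-generation test can be sketched as follows. The threshold and width defaults reuse the values given in the simulation parameters (0,0006 and 5), but the product-of-Gaussians intensity, the fixed initial width (which ignores the degree of overlap) and the function names are simplifying assumptions.

```python
import math


def rule_intensity(x, centers, widths):
    """Assumed excitation intensity: product of Gaussian memberships."""
    return math.prod(
        math.exp(-((xi - ci) / bi) ** 2) for xi, ci, bi in zip(x, centers, widths)
    )


def maybe_add_rule(x, rules, phi_threshold=0.0006, b_init=5.0):
    """Append a rule centered on x when no existing rule is excited enough.

    rules is a list of (centers, widths) pairs and is modified in place.
    Returns True when a new rule was generated.
    """
    best = max((rule_intensity(x, c, b) for c, b in rules), default=0.0)
    if best < phi_threshold:
        rules.append((list(x), [b_init] * len(x)))
        return True
    return False


rules = []
first = maybe_add_rule([0.2, 0.0], rules)    # no rules yet: one is generated
second = maybe_add_rule([0.1, 0.0], rules)   # well covered by the first rule
third = maybe_add_rule([30.0, 0.0], rules)   # far from every center: new rule
```

The first observed state always creates a rule, and later states create one only when they fall outside the region covered by the existing clusters.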
As learning goes on, the number of clusters increases, and so does the number of mapping rules. In order to reduce the number of mapping rules and save system resources, clusters that are highly similar to one another should be merged. Whether clusters are merged is decided by comparing the old and new membership functions of the input variables, using $\Delta c_{\min}$ and $\Delta b_{\min}$ as the similarity thresholds for the cluster centers and widths, with $j$ and $j'$ referring to the ordinal numbers of the two clusters. If the centers and widths of two clusters are close, the inequality is met, so the two clusters are similar and can be merged into a single cluster with a new center and width; otherwise the clusters cannot be merged.
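A minimal sketch of the merging test follows. The thresholds reuse the values 0,02 given in the simulation parameters; the per-dimension absolute-difference test and the averaging of centers and widths for the merged cluster are assumptions, since the paper's inequality and merge formulas are not shown in the text.

```python
def try_merge(rule_a, rule_b, dc_min=0.02, db_min=0.02):
    """Merge two clusters when their centers and widths are close enough.

    Each rule is a (centers, widths) pair. Returns the merged rule, or None
    when the clusters are not similar. Averaging is an assumed merge formula.
    """
    centers_a, widths_a = rule_a
    centers_b, widths_b = rule_b
    centers_close = all(abs(a - b) <= dc_min for a, b in zip(centers_a, centers_b))
    widths_close = all(abs(a - b) <= db_min for a, b in zip(widths_a, widths_b))
    if not (centers_close and widths_close):
        return None
    merged_centers = [(a + b) / 2.0 for a, b in zip(centers_a, centers_b)]
    merged_widths = [(a + b) / 2.0 for a, b in zip(widths_a, widths_b)]
    return merged_centers, merged_widths


similar = try_merge(([0.0, 0.0], [5.0, 5.0]), ([0.01, 0.0], [5.0, 5.01]))
different = try_merge(([0.0, 0.0], [5.0, 5.0]), ([0.5, 0.0], [5.0, 5.0]))
```

Merging replaces two nearly identical rules with one, which keeps the rule base small without changing the region of state space that it covers.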

The learning process
Step.1. Initialization: iterative learning step $t = 0$; sampling time $t_s$ = 0,01 s. The orientation information of the operating behaviors is unknown at the beginning, so each behavior is initially selected with equal probability $1/r$, where $N$ is the number of "condition-operation" mapping rules and $r$ is the number of behaviors within the collection.
Step.2. Perceive the state of the two-wheeled robot and fuzzify it with the Gaussian function; the mapping rule antecedents of FSOCA are then generated automatically by the online clustering algorithm.
Step.3. According to the probability vector $P_j$ of the random mapping $\Gamma_j$, output one operating behavior $a_k(t)$, chosen randomly from the alternative operating behavior collection $A$.
Step.4. Receive and analyze the response of the two-wheeled robot system to $a_k(t)$ and obtain the increment of the orientation value at $t+1$. The probability vector is then updated, the next operating behavior $a_k(t+1)$ is chosen according to the updated probability vector $P_j(t+1)$, and "Step.2÷ Step.5" are repeated until the optimal fuzzy consequent collection $A^{*}$ is achieved.
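The initialization in Step.1 and the probabilistic choice in Step.3 can be sketched as follows, using the behavior collection listed in the simulation parameters. The function names are hypothetical, and the seeded generator is used only to make the example repeatable.

```python
import random


def initial_probabilities(r):
    """Step.1: every one of the r behaviors starts with probability 1/r."""
    return [1.0 / r] * r


def select_behavior(behaviors, probabilities, rng=random):
    """Step.3: draw one behavior according to the rule's probability vector."""
    return rng.choices(behaviors, weights=probabilities, k=1)[0]


A = [-24, -5, -1, 0, 1, 5, 24]   # operating behavior collection (motor voltages)
P_j = initial_probabilities(len(A))

rng = random.Random(0)           # seeded generator for a repeatable draw
a_k = select_behavior(A, P_j, rng)
```

Early in learning every behavior is equally likely to be drawn; as the probability vector sharpens, the same call returns the optimal behavior more and more often.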

Result of simulation experiment and its analysis
The "exact model" for simulation is built in the Simulink environment. As Fig. 4 shows, uL and uR are the motor voltages of the left and right wheels of the robot; Ljs, Lj, Rjs, Rj, Pjs and Pj are the angular velocity of the left wheel, the rotation angle of the left wheel, the angular velocity of the right wheel, the rotation angle of the right wheel, the angular velocity of the robot and the inclination angle of the robot. Multiplying the first four variables by the wheel radius R gives the forward speed of the left wheel, the displacement of the left wheel, the forward speed of the right wheel and the displacement of the right wheel. The nonlinear model of the robot is implemented as an S-Function. Switches are connected to the attitude balance sub-controller u1, the sentinel balance sub-controller u2, the walking motion sub-controllers u3L and u3R, and the compensation controller u4, through which the different exercise modes of the two-wheeled robot are selected; the free self-balance control module is controlled by FSOCA, while the others are controlled by PID.

Free self-balance exercise experiment
Fig. 5 shows how the two-wheeled robot achieves free self-balance control. The state set of inclination and angular velocity after fuzzy processing is used as the condition activation signal of FSOCA; under the free self-balance control mode, U, which is obtained after defuzzification of the optimal fuzzy operating behavior $a^{*}$, is used as the voltage control signal of the two wheels. Fig. 6 shows the curve of the operating behavior entropy corresponding to the stable state (0,0) over 30 rounds of training. It shows that, as learning goes on, the operating behavior entropy begins to decrease, and after 17 rounds of training it stays stable at its minimum. On the one hand, the change of the operating behavior entropy confirms the convergence of the clustering in this article; on the other hand, at the initial state the random mapping control rules of FSOCA are in disorder, but after self-learning they are in order, having self-organized into a positive and orderly rule collection. Compared with SOCA, the change of FSOCA is smoother, and the operating behavior entropy continues to decrease. The reason is that in the learning process of FSOCA, even though the robot chooses a "bad" operation in later learning, the output can be relatively stable, because the control signal of the robot is a weighted sum of fuzzy operating behaviors and the weight of a "bad" behavior is low. That is also the reason why the failure rate of FSOCA is low. In order to examine the anti-jamming capability of FSOCA, the robot is given an impulse interference of 10 in the tenth second. Comparing FSOCA and SOCA shows that the output of the former is smoother. That is because FSOCA is fuzzily processed, so its output voltage varies continuously within [−24, 24], and a continuous voltage reduces strong jitter of the system. Further comparison shows that in the initial learning stage of FSOCA, the balance state is recovered within one second and the overshoot is smaller; in the later stage, the learning error is close to 0. Therefore, FSOCA converges faster and has higher learning accuracy. After the interference, FSOCA recovers to the balance state within 0.5 second after a short jitter. Therefore, compared with SOCA, the anti-jamming capability of FSOCA is stronger.

Point balance exercise experiment
On the basis of the above learning result, the point balance exercise can be achieved by superposing the point balance control modules. Fig. 9 shows the simulation curves of inclination, angular velocity, displacement, forward velocity and motor control voltage of the robot.
Fig. 9 shows that the simulation result of point balance control is similar to that of free self-balance control. The robot with FSOCA recovers balance in 1,2 s and stops at the target location x = 0 m; compared with SOCA, FSOCA has smoother curves, faster convergence and higher control accuracy.

Straight move balance exercise experiment
On the basis of the above learning results, the straight move balance exercise can be achieved by superposing the move balance control modules. Suppose the desired speeds of the left and right wheels are v l = v r = 0,15 m/s. Fig. 10 shows the simulation curves of inclination, angular velocity, displacement, forward velocity and motor control voltage of the robot. Fig. 10 shows that the robot with FSOCA begins to move uniformly after 1 second at the speed v l = v r = 0,15 m/s; the inclination does not recover to 0 but stays within a small angle range θ; the motor voltage also stays at a constant value so that the robot moves uniformly. Compared with the simulation result of SOCA, the output of FSOCA is smoother and the learning speed and accuracy are much higher. Compared with the previous exercise modes, the improvement in learning accuracy is more obvious.

Steering move balance exercise experiment
On the basis of the straight move balance exercise control, the steering move balance exercise can be achieved by setting the desired speeds of the two wheels to v dl = 0,3 m/s and v dr = 0,15 m/s respectively. Fig. 11 shows the simulation curves of inclination, angular velocity, displacement, forward velocity and motor control voltage of the robot. Fig. 11 shows that the steering move is similar to the straight move, except that the track is a circle. Compared with SOCA, the curves are smoother and the learning speed and accuracy are higher; in less than 1 s the track becomes the desired circle with a radius of 0,3 m.
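The relation between the two wheel speeds and the radius of the circular track can be checked with standard differential-drive kinematics. The wheel separation $d$ is not stated in the text and is introduced here as an assumption, as is measuring the radius at the midpoint of the axle with no wheel slip.

```latex
% Turning radius of a differential-drive robot, measured at the axle midpoint
% (d is the wheel separation, an assumed symbol not defined in the paper):
R = \frac{d}{2}\cdot\frac{v_{dl} + v_{dr}}{v_{dl} - v_{dr}}
  = \frac{d}{2}\cdot\frac{0{,}3 + 0{,}15}{0{,}3 - 0{,}15}
  = 1{,}5\,d
% A track radius of R = 0,3 m would then correspond to d = 0,2 m.
```

Under these assumptions the reported radius of 0,3 m is consistent with a wheel separation of 0,2 m, and the left wheel, being faster, runs on the outside of the circle.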

Conclusion
Combining fuzzy set theory with the operant conditioning mechanism, this paper establishes the FSOCA, whose main characteristic is that it can be used to depict, simulate and design various self-organizing behaviors of fuzzy and uncertain systems. By integrating fuzzy inference, FSOCA enables learning models to output continuous operation behavior and achieves smooth control. The online clustering method realizes the automatic deletion of fuzzy mapping rules and keeps the number of fuzzy mapping rules economical. The simulation results for the balance control of two-wheeled robots indicate that, as learning proceeds, the selection probability of the optimal fuzzy consequent operation behavior gradually approaches 1, the entropy of the fuzzy operation behavior tends to its minimum, the number of mapping rules is close to the optimum and, relative to SOCA, learning performance is significantly improved. By imposing pulse interference on the robot, FSOCA's anti-interference and fast recovery abilities are verified.

Figure 2 Learning mechanism on the basis of operant conditioning principles

in controlling the random degree in the process of competition and selection by consequent behavior.

The mapping rules of FSOCA with competing consequents are as follows:

width value and center value of the Gaussian function, $\varphi^{*}$ the excitation intensity threshold of mapping rules, the degree of overlap between clusters, and $\Delta b_{\min}$, $\Delta c_{\min}$

Figure 4 Simulation model of the two-wheeled self-balance robot

Figure 5 Structure of Free Self-balance Control Based On FSOCA

(1) Setting of simulation parameters
Iterative learning step t = 0; sampling time $t_s$ = 0,01 s. When the robot is in off-line learning, the parameters in the orientation function are ζ = 0,6, γ = 0,03 and the learning coefficients in the probability update formula are η 1 = 0,01, η 2 = 0,001; when the robot is in online learning, ζ = 0,5, γ = 0,01, η 1 = 0,05, η 2 = 0,005. While the robot is learning, the mapping field can contract to correspond to the lower bound ε = 0,0005 of the learning error and the excitation intensity threshold φ* = 0,0006 of the mapping rules. The parameter settings involved in the clustering algorithm are as follows: the width of the Gaussian function is b* = 5; the excitation intensity threshold is φ* = 0,0006; the degree of overlap between clusters is 0,4; the similarity thresholds for the width and the center of the clusters are Δb min = 0,02 and Δc min = 0,02. The initial state of the robot is θ = 0,2 rad, with all other values 0; the initial operating behavior collection is A = {−24, −5, −1, 0, 1, 5, 24}, in which the initial probability of every behavior is $p_{ik}(0) = 1/7$, with its corresponding initial operating

Figure 6 Curve of Information Entropy

Fig. 7 shows the number of fuzzy mapping rules over 30 rounds of training.

Figure 7 Number of mapping rules generated in every training

The result shows that, over all the training, the average number of fuzzy mapping rules in the earlier rounds is 20, while the number becomes 13 after 18 rounds of training;

(a) Simulation Curve of Inclination (b) Simulation Curve of Angular Velocity (c) Simulation Curve of Displacement (d) Simulation Curve of Forward Velocity (e) Simulation Curve of Motor Control Voltage
Figure 8 Simulation Result of Free Self-balance Control

(a) Simulation Curve of Inclination (b) Simulation Curve of Angular Velocity (c) Simulation Curve of Displacement (d) Simulation Curve of Forward Velocity (e) Simulation Curve of Motor Control Voltage
Figure 9 Simulation Result of Point Balance Control

(a) Simulation Curve of Inclination (b) Curve of Angular Velocity (c) Simulation Curve of Displacement (d) Simulation Curve of Forward Velocity (e) Simulation Curve of Motor Control Voltage
Figure 10 Simulation Result of Straight Moving Balance Control

23, 1(2016), 65-75

(a) Simulation Curve of Inclination (b) Simulation Curve of Angular Velocity (c) Simulation Curve of Displacement (d) Simulation Curve of Forward Velocity (e) Simulation Curve of Motor Control Voltage (f) Track in The x-y Plane
Figure 11 Simulation Result of Steering Move Balance Control

3 Design of learning algorithm

3.1 Design of orientation function
$e = x^{*} - x$. At time $t$ and under the premise of the discrete state $s_i(t)$, if the selection of operation behavior $a_k$ causes the state transfer to $s_j(t+1)$ and reduces the error, namely e(t+1)-e(t)<0, it shows that the