Autonomous Blimp Control with Reinforcement Learning

A thesis submitted in fulfilment of the requirements for the award of the degree

Master of Engineering by Research

from

University of Wollongong

by

Yiwei Liu

School of Electrical, Computer and Telecommunications Engineering

August 2009
Dedicated to my parents...
Acknowledgements

It is a great pleasure to express my sincere thanks to the many people to whom I am indebted for their support during the course of this thesis. First and foremost, I wish to express my utmost gratitude to my principal supervisor, Dr. Zengxi Pan of the University of Wollongong (UoW), for guiding me to pursue postgraduate studies at the University of Wollongong and for the support given in many ways throughout the study period. Your dedication, patience, knowledge and experience could not have been surpassed, and I admire the guidance you have given me, both academically and personally, over the last few years. I would like to thank my co-supervisor, Senior Lecturer David Stirling of the UoW, for his insightful technical contributions and helpful attitude. I would also like to express my appreciation to my co-supervisor Professor Fazel Naghdy, Head of School at the UoW, for his academic guidance of my research work and his support with university affairs. Very special thanks go to my friend Matthew, who also worked and studied in the Centre for Intelligent Mechatronic Research, for being generously supportive, especially during hard times along the way. My heartiest gratitude goes to my parents, Yulin Liu and Xiulan Wei, for all the encouragement, guidance and sacrifices made on my behalf to bring me this far. Finally, my thanks go to the rest of my colleagues and friends, Kai, Prabodha and Nishad, for being supportive in many ways.
Certification

I, Yiwei Liu, declare that this thesis, submitted in fulfilment of the requirements for the award of Master of Engineering, in the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, is entirely my own work unless otherwise referenced or acknowledged. This manuscript has not been submitted for qualifications at any other academic institution.

Yiwei Liu
Date: 27 August 2009
Abstract

Blimps are a special type of airship without a rigid structure in the body. Most existing blimps are operated manually by a pilot, either directly or through radio control; one of the most famous examples is the Goodyear Blimp used for commercial advertising. With the rapid development of microcontroller and electronics technologies, autonomous blimps have recently attracted great research interest as a platform for accessing dangerous or difficult-to-reach environments in applications such as disaster exploration and rescue, security surveillance at public events, and climate monitoring. This thesis investigates the problem of learning an optimal control policy for autonomous blimp navigation in a rescue task, and presents a new approach to the navigation control of an autonomous blimp using a reinforcement learning algorithm. Compared to traditional model-based control methods, this control strategy does not require a dynamic model of the blimp, which is a significant advantage in many practical situations where the blimp system model is either hard to acquire or too complicated to apply. The blimp in this research is used as a prototype for the UAV Outback Challenge organised by the Australian Research Centre for Aerospace Automation (ARCAA). The Challenge requires the UAV to fly autonomously to a designated area and rescue a dummy named Jack. The objective of this research is to develop a control system that can autonomously adjust the blimp heading direction towards the rescue target. Because the blimp is required to acquire a range of piloting skills through the learning and reinforcement mechanism during actual navigation trials, it can automatically account for environmental changes during the navigation tasks.
The basic hardware structure and devices of the blimp control system were developed as a preliminary stage. The developed controller does not require a dynamic model of the blimp, yet is adaptive to changes in the surrounding environment. Simulation data generated with the Webots Robotics Simulator (WRS) demonstrate satisfactory results for planar steering motion control, and Matlab was used to analyse the simulation data produced by WRS. Within the simulation environment, the blimp using the Q-learning method was successfully tested in single-target and continuous-target tasks subject to various environmental disturbances. Different learning parameters and initial conditions were also tested to obtain better solutions for autonomous blimp steering motion. Reinforcement learning for blimp control is shown in this research to be a promising and effective solution for autonomous navigation tasks.
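The navigation controller summarised above rests on tabular Q-learning over discretised heading states and steering actions. The fragment below is only an illustrative sketch of that idea, written in Python rather than the Webots and Matlab code used in the thesis; the number of states, the action set, the reward based on the angular difference to the target, and the parameter values are assumptions made for the example, not the settings used in the reported simulations.

```python
import random

# Illustrative tabular Q-learning for heading control (assumed discretisation).
N_STATES = 8          # discretised angular difference to the target
ACTIONS = [-1, 0, 1]  # steer left, hold heading, steer right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# Q table: one row per state, one column per action.
Q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]

def choose_action(state):
    """Epsilon-greedy selection over the Q table."""
    if random.random() < EPSILON:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[state][a])

def update(state, action, reward, next_state):
    """One-step Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[next_state])
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

# Sketch of the learning loop; observe_angular_difference() and step() are
# hypothetical stand-ins for the simulator interface:
# state = observe_angular_difference()
# action = choose_action(state)
# next_state, reward = step(ACTIONS[action])   # reward larger when the
# update(state, action, reward, next_state)    # angular difference shrinks
```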
Table of Contents

1 Introduction 1
1.1 Background 2
1.2 Thesis objective 4
1.2.1 Developing an intelligent navigation control system for autonomous blimp 4
1.2.2 Examination of intelligent control algorithm with machine learning 5
1.2.3 Discussion of parameters for reinforcement learning 5
1.2.4 Summary of the contributions 6
1.3 Thesis outline 6
2 Literature Review 8
2.1 Blimp 8
2.1.1 History of blimp 10
2.1.2 Research and applications of autonomous blimps 11
2.2 Autonomous control of a blimp 17
2.3 Reinforcement learning 19
2.4 Remaining research difficulties in blimp control 22
2.5 Research direction 23
2.6 Conclusion 25
3 Hardware Design and Simulation Environment 26
3.1 System structure 26
3.2 Hardware design 29
3.2.1 Ground unit 31
3.2.2 Onboard unit 32
3.3 Functions of microcontroller and host computer in learning system 39
3.4 Simulation software 43
3.5 Chapter summary 45
4 Methodology 47
4.1 Reinforcement learning 47
4.1.1 Elements of reinforcement learning 49
4.1.2 Dynamic programming methods 51
4.1.3 Monte Carlo methods 53
4.1.4 Temporal-difference methods 54
4.2 Q-learning: off-policy TD control algorithm 56
4.3 Autonomous blimp control using reinforcement learning 57
4.3.1 Task analysis 57
4.3.2 Navigation coordinate conversion 58
4.3.3 Structure and definition of reinforcement learning algorithm 59
4.3.4 Basic elements and equations of the Q-learning 61
4.3.5 Programming flow chart of Q-learning in Webots 64
4.4 Chapter summary 65
5 Simulation Results and Preliminary Discussion 72
5.1 Webots simulation results 72
5.1.1 Webots simulation environment 72
5.1.2 Webots simulation setup 75
5.1.3 Simulation of navigation tasks 76
5.2 Analysis of simulation results in Matlab 79
5.2.1 Basic results 79
5.2.2 Effect of initial target position in Q-learning 90
5.2.3 Exploration of the Q table 96
5.2.4 Turning performance with disturbance 99
5.2.5 Continuous learning tasks 105
5.3 Chapter summary 108
6 Further Discussion 112
6.1 Exploration vs. exploitation in the Q-learning 112
6.2 Effect of different parameters in Q-learning 117
6.3 Control policy of PID control and Q-learning 118
6.4 Effect of different target directions on the Q-learning performance 123
6.5 The learning contribution of previous experience and the immediate feedback 128
6.6 Chapter summary 130
7 Summary and Future Work 132
7.1 Summary 132
7.2 Achievements 135
7.3 Future work 136
List of Figures

2.1 Goodyear blimp [1] 9
2.2 The navigable balloon created by Giffard in 1852 [2] 11
2.3 First Zeppelin flight at Lake Constance [2] 12
2.4 Blimp control system developed by Silveira [3] 13
2.5 Unmanned blimp control system built by Kawai [4] 14
2.6 Structure of blimp flying display system [5] 14
2.7 Role of the airship in USAR system [6] 15
2.8 Role of the airship in USAR system [7] 16
2.9 Overview of the learning system for blimp control 24
3.1 Machine learning system overview of autonomous blimp 28
3.2 Blimp body for UOW evaluation prototype 29
3.3 Hardware components of blimp control system 30
3.4 System structure of onboard unit 33
3.5 Blimp measurement sensors 36
3.6 Thruster positions of the prototype blimp, and servo and DC motors mounted on the blimp gondola 38
3.7 Onboard control electronic circuit 39
3.8 Standard system structure of machine learning 40
3.9 Reinforcement learning system structure in this thesis 41
4.1 Procedural form of Q-learning 57
4.2 The world framework and the blimp body coordinates 59
4.3 Flow charts of generating the blimp angular speed in simulation 67
4.4 Flow charts of getting best actions 68
4.5 Flow charts of getting environment 69
4.6 Flow charts of states judgment 70
4.7 Flow charts of Webots running procedure 71
5.1 Webots simulator environment 73
5.2 Outputs of simulation data from Webots 74
5.3 Time counter of Webots 75
5.4 Blimp body coordinate in the virtual environment 76
5.5 Definition of world coordinate in the virtual world 77
5.6 Blimp navigation task 78
5.7 Initial setting of four-quadrant target trials 79
5.8 Blimp navigation paths 80
5.9 Angular difference and orientation results obtained from Matlab 82
5.10 Plan view of the blimp body referencing frame 83
5.11 Referencing coordinates of the virtual world 84
5.12 Q-learning results processed by Matlab 86
5.13 Results of action sequence 88
5.14 Simulation results of state changes 89
5.15 3D track of blimp movement in Matlab 90
5.16 Turning performances in the short time learning process 91
5.17 Blimp action changes in the short time learning process 93
5.18 Turning performances in the long time learning process 95
5.19 Blimp state changes in the long time learning process 96
5.20 Q-value surface plot for restricted Q-learning 97
5.21 Extended, more sufficient Q-learning 98
5.22 Results of Q-value tables with differing amounts of learning iterations 100
5.23 Control performances in orientation and angular difference (16, 16) 101
5.24 Control performances as manifested in sequence of actions (16, 16) 102
5.25 Control performances in sequence of states 103
5.26 Control performances in orientation and angular difference (-16, 16) 104
5.27 Control performances in sequence of states 105
5.28 Other test results of blimp control under disturbance 106
5.29 Angular difference and orientation in continuous turning 107
5.30 Sequence of actions in continuous turning 109
5.31 Sequence of states in continuous turning 110
6.1 Backup diagram of one-step Q-learning 113
6.2 Different learning procedures 116
6.3 The first control strategy of blimp navigation tasks 121
6.4 Webots programme flowchart of first strategy 122
6.5 The second control strategy of blimp navigation tasks 123
6.6 Quadrants of body frame 124
6.7 Simulation results of angular difference and states in first scenario 125
6.8 Simulation results of angular difference and orientation in second scenario 126
6.9 Simulation results of blimp states in second scenario 127
List of Tables

3.1 Explanation of the evaluating variables selected to process in Matlab 45
4.1 State-judgment table 61
5.1 Variables to be evaluated in Matlab 81