FEATURES. Features. UCI Machine Learning Repository. Admin 9/23/13

Similar documents
Naïve Bayes. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Naïve Bayes. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Introduction to Pattern Recognition

5.1 Introduction. Learning Objectives

STANDARD SCORES AND THE NORMAL DISTRIBUTION

The Ideal Gas Constant

knn & Naïve Bayes Hongning Wang

BASKETBALL PREDICTION ANALYSIS OF MARCH MADNESS GAMES CHRIS TSENG YIBO WANG

Visual Traffic Jam Analysis Based on Trajectory Data

IDENTIFICATION OF WIND SEA AND SWELL EVENTS AND SWELL EVENTS PARAMETERIZATION OFF WEST AFRICA. K. Agbéko KPOGO-NUWOKLO

Warm-up. Make a bar graph to display these data. What additional information do you need to make a pie chart?

INTRODUCTION TO PATTERN RECOGNITION

Decision Trees. an Introduction

Introduction to Pattern Recognition

Introduction to Waves & Sound

Open Research Online The Open University s repository of research publications and other research outputs

hot hands in hockey are they real? does it matter? Namita Nandakumar Wharton School, University of

Descriptive Statistics. Dr. Tom Pierce Department of Psychology Radford University

Essen,al Probability Concepts

Fun Neural Net Demo Site. CS 188: Artificial Intelligence. N-Layer Neural Network. Multi-class Softmax Σ >0? Deep Learning II

Chapter 12 Practice Test

3D Inversion in GM-SYS 3D Modelling

Running head: DATA ANALYSIS AND INTERPRETATION 1

NCSS Statistical Software

POKEMON HACKS. Jodie Ashford Josh Baggott Chloe Barnes Jordan bird

The MACC Handicap System

Performance of Fully Automated 3D Cracking Survey with Pixel Accuracy based on Deep Learning

BIOL 101L: Principles of Biology Laboratory

Lecture 22: Multiple Regression (Ordinary Least Squares -- OLS)

EXPERIMENTAL RESULTS OF GUIDED WAVE TRAVEL TIME TOMOGRAPHY

Opleiding Informatica

SOLUTIONS Homework #1

Laboratory Activity Measurement and Density. Average deviation = Sum of absolute values of all deviations Number of trials

Stat 139 Homework 3 Solutions, Spring 2015

PREDICTING THE NCAA BASKETBALL TOURNAMENT WITH MACHINE LEARNING. The Ringer/Getty Images

CS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan

Analysis of Professional Cycling Results as a Predictor for Future Success

INSTRUMENT INSTRUMENTAL ERROR (of full scale) INSTRUMENTAL RESOLUTION. Tutorial simulation. Tutorial simulation

Chapter 6. Analysis of the framework with FARS Dataset

Solutionbank S1 Edexcel AS and A Level Modular Mathematics

Gait Recognition. Yu Liu and Abhishek Verma CONTENTS 16.1 DATASETS Datasets Conclusion 342 References 343

Homework Exercises Problem Set 1 (chapter 2)

CS 528 Mobile and Ubiquitous Computing Lecture 7a: Applications of Activity Recognition + Machine Learning for Ubiquitous Computing.

Standing Waves in a String

Technical Report. 5th Round Robin Test for Multi-Capillary Ventilation Calibration Standards (2016/2017)

Waves & Interference

PIQCS HACCP Minimum Certification Standards

b

Bayesian Optimized Random Forest for Movement Classification with Smartphones

Object Recognition. Selim Aksoy. Bilkent University

Predicting Horse Racing Results with Machine Learning

Lab 1. Adiabatic and reversible compression of a gas

Section I: Multiple Choice Select the best answer for each problem.

CC-Log: Drastically Reducing Storage Requirements for Robots Using Classification and Compression

A Proof-Producing CSP Solver 1

Introduction. AI and Searching. Simple Example. Simple Example. Now a Bit Harder. From Hammersmith to King s Cross

Feature Extraction for Musical Genre Classification MUS15

STT 315 Section /19/2014

Descriptive Stats. Review

b

Frequency Distributions

The Evolution of Transport Planning

G53CLP Constraint Logic Programming

Autodesk Moldflow Communicator Process settings

Appendix A Bridge Deck Section Images

Advanced PMA Capabilities for MCM

PREDICTING the outcomes of sporting events

ACC Grassroots Website - Guide for the Member / Player

intended velocity ( u k arm movements

Interpretable Discovery in Large Image Data Sets

Understanding Winter Road Conditions in Yellowstone National Park Using Cumulative Sum Control Charts

An Introduction to Statistical Process Control Charts (SPC) Steve Harrison

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

MECHANICAL WAVES AND SOUND

Hydrostatics Physics Lab XI

Bayesian Methods: Naïve Bayes

E STIMATING KILN SCHEDULES FOR TROPICAL AND TEMPERATE HARDWOODS USING SPECIFIC GRAVITY

Energy capture performance

Puyallup Tribe of Indians Shellfish Department

Estimating the Probability of Winning an NFL Game Using Random Forests

Satoshi Yoshida and Takuya Kida Graduate School of Information Science and Technology, Hokkaido University

THe rip currents are very fast moving narrow channels,

Shedding Light on Motion Episode 4: Graphing Motion

MJA Rev 10/17/2011 1:53:00 PM

Looking at Spacings to Assess Streakiness

Black Sea Bass Encounter

MATH 118 Chapter 5 Sample Exam By: Maan Omran

Exercise 1: Control Functions

σ = force / surface area force act upon In the image above, the surface area would be (Face height) * (Face width).

Evaluating and Classifying NBA Free Agents

COMPARISON OF VIBRATION AND PRESSURE SIGNALS FOR FAULT DETECTION ON WATER HYDRAULIC PROPORTIONAL VALVE

Introduction To Diffraction Teachers Supplement

Torrild - WindSIM Case study

Machine Learning Application in Aviation Safety

Logistic Regression. Hongning Wang

DATA SCIENCE SUMMER UNI VIENNA

Wednesday, September 20, 2017 Reminders. Week 3 Review is now available on D2L (through Friday) Exam 1, Monday, September 25, Chapters 1-4

Analyses of the Scoring of Writing Essays For the Pennsylvania System of Student Assessment

Unit 6, Lesson 1: Organizing Data

Wave Motion. interference destructive interferecne constructive interference in phase. out of phase standing wave antinodes resonant frequencies

Transcription:

Admin Assignment 2 This class will make you a better programmer! How did it go? How much time did you spend? FEATURES David Kauchak CS 451 Fall 2013 Assignment 3 out Implement perceptron variants See how they differ in performance Take a break from implementing algorithms after this (for 1-2 weeks) Features UCI Machine Learning Repository Terrain Unicycletype Weather Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Normal Snowy NO Trail Mountain Snowy YES Where do they come from? Go-For-Ride? http://archive.ics.uci.edu/ml/datasets.html 1

Provided features Provided features Predicting the age of abalone from physical measurements Predicting breast cancer recurrence Name / Data Type / Measurement Unit / Description ----------------------------- Sex / nominal / -- / M, F, and I (infant) Length / continuous / mm / Longest shell measurement Diameter / continuous / mm / perpendicular to length Height / continuous / mm / with meat in shell Whole weight / continuous / grams / whole abalone Shucked weight / continuous / grams / weight of meat Viscera weight / continuous / grams / gut weight (after bleeding) Shell weight / continuous / grams / after being dried Rings / integer / -- / +1.5 gives the age in years 1. Class: no-recurrence-events, recurrence-events 2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99. 3. menopause: lt40, ge40, premeno. 4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59. 5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39. 6. node-caps: yes, no. 7. deg-malig: 1, 2, 3. 8. breast: left, right. 9. breast-quad: left-up, left-low, right-up, right-low, central. 10. irradiated: yes, no. Provided features Raw data vs. features In many physical domains (e.g. biology, medicine, chemistry, engineering, etc.) the data has been collected and the relevant features identified we cannot collect more features from the examples (at least core features) In these domains, we can often just use the provided features In many other domains, we are provided with the raw data, but must extract/identify features For example image data text data audio data log data 2

Text: raw data Feature examples Raw data Features? Raw data Features Clinton said banana repeatedly last week on tv, banana, banana, banana (1, 1, 1, 0, 0, 1, 0, 0, ) banana clinton said california across tv wrong capital Occurrence of words Feature examples Feature examples Raw data Features Raw data Features Clinton said banana repeatedly last week on tv, banana, banana, banana Clinton said banana repeatedly last week on tv, banana, banana, banana clinton said california across tv wrong (4, 1, 1, 0, 0, 1, 0, 0, ) banana capital Frequency of word occurrence clinton said said banana california schools across the tv banana wrong way banana repeatedly (1, 1, 1, 0, 0, 1, 0, 0, ) capital city Do we retain all the information in the original document? Occurrence of bigrams 3

Feature examples Lots of other features Raw data Features clinton said said banana california schools across the tv banana wrong way banana repeatedly Clinton said banana repeatedly last week on tv, banana, banana, banana Other features? (1, 1, 1, 0, 0, 1, 0, 0, ) capital city POS: occurrence, counts, sequence Constituents Whether V1agra occurred 15 times Whether banana occurred more times than apple If the document has a number in it Features are very important, but we re going to focus on the models today How is an image represented? How is an image represented? images are made up of pixels for a color image, each pixel corresponds to an RGB value (i.e. three numbers) 4

Image features Image features for each pixel: R[0-255] G[0-255] B[0-255] for each pixel: R[0-255] G[0-255] B[0-255] Do we retain all the information in the original document? Other features for images? Lots of image features Audio: raw data Use patches rather than pixels (sort of like bigrams for text) Different color representations (i.e. L*A*B*) Texture features, i.e. responses to filters How is audio data stored? Shape features 5

Terrain Unicycletype Weather Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Normal Snowy NO Trail Mountain Snowy YES Go-For-Ride? 9/23/13 Audio: raw data Audio features frequencies represented in the data (FFT) frequencies over time (STFT)/responses to wave patterns (wavelets) Many different file formats, but some notion of the frequency over time Audio features? beat timber energy zero crossings Obtaining features Current learning model Very often requires some domain knowledge As ML algorithm developers, we often have to trust the experts to identify and extract reasonable features That said, it can be helpful to understand where the features are coming from training data (labeled examples) learn model/ classifier 6

Terrain Weather Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Normal Snowy NO Trail Mountain Snowy YES Go-For-Ride? Terrain Unicycletype Unicycletype Weather Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Normal Snowy NO Trail Mountain Snowy YES Go-For-Ride? 9/23/13 Pre-process training data training data (labeled examples) pre-process data learn model/ classifier What is an outlier? better training data What types of preprocessing might we want to do? An example that is inconsistent with the other examples What types of inconsistencies? An example that is inconsistent with the other examples - extreme feature values in one or more dimensions - examples with the same feature values but different labels 7

An example that is inconsistent with the other examples - extreme feature values in one or more dimensions - examples with the same feature values but different labels Removing conflicting examples Identify examples that have the same features, but differing values For some learning algorithms, this can cause issues (for example, not converging) In general, unsatisfying from a learning perspective Can be a bit expensive computationally (examining all pairs), though faster approaches are available Fix? Removing extreme outliers Throw out examples that have extreme values in one dimension An example that is inconsistent with the other examples - extreme feature values in one or more dimensions - examples with the same feature values but different labels How do we identify these? Throw out examples that are very far away from any other example Train a probabilistic model on the data and throw out very unlikely examples This is an entire field of study by itself! Often called outlier or anomaly detection. 8

Quick statistics recap What are the mean, standard deviation, and variance of data? Quick statistics recap mean: average value, often written as μ variance: a measure of how much variation there is in the data. Calculated as: n (x σ 2 i µ) 2 i=1 = n standard deviation: square root of the variance (written as σ) How can these help us with outliers? Outliers in a single dimension Examples in a single dimension that have values greater than kσ can be discarded (for k >>3) Even if the data isn t actually distributed normally, this is still often reasonable If we know the data is distributed normally (i.e. via a normal/gaussian distribution) 9

Outliers in general Outliers for machine learning - Calculate the centroid/center of the data - Calculate the average distance from center for all data - Calculate standard deviation and discard points too far away Again, many, many other techniques for doing this Some good practices: - Throw out conflicting examples - Throw out any examples with obviously extreme feature values (i.e. many, many standard deviations away) - Check for erroneous feature values (e.g. negative values for a feature that can only be positive) - Let the learning algorithm/other pre-processing handle the rest Feature pruning Bad features Good features provide us information that helps us distinguish between labels Each of you are going to generate a feature for our data set: pick 5 random binary numbers However, not all features are good f 1 f 2 label What makes a bad feature and why would we have them in our data? I ve already labeled these examples and I have two features 10

Bad features Each of you are going to generate some a feature for our data set: pick 5 random binary numbers f 1 f 2 label 1 0 1 1 0 Is there any problem with using your feature in addition to my two real features? 11