Re-evaluating automatic metrics for image captioning

Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, Erkut Erdem
Contact: kilickayamert@gmail.com, /mertkilickaya_

Background

Source: image. Target: caption. f(image) = "The cat sat on the mat"?

Evaluation
- Human
- Automatic
  - Borrowed from Machine Translation and Summarization: BLEU, METEOR, ROUGE
  - Developed for Image Captioning: CIDEr, SPICE

Reference: {The cat sat on the mat}
Candidate: {An orange cat sitting on mat}

BLEU (n-gram precision) = matched words / candidate words = {cat, on, mat} / {an, orange, cat, sitting, on, mat}

ROUGE (n-gram recall) = matched words / reference words = {cat, on, mat} / {the, cat, sat, on, the, mat}
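The precision and recall ratios on this slide can be reproduced with a few lines of Python. This is a minimal sketch for unigrams only, assuming whitespace tokenization and lowercasing. Full BLEU also uses higher-order n-grams and a brevity penalty, and ROUGE comes in several variants, so the function name `unigram_overlap` and its behaviour are illustrative rather than either metric's reference implementation.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    matched = sum((cand & ref).values())
    precision = matched / sum(cand.values())  # BLEU-style: divide by candidate length
    recall = matched / sum(ref.values())      # ROUGE-style: divide by reference length
    return precision, recall


precision, recall = unigram_overlap(
    "An orange cat sitting on mat",
    "The cat sat on the mat",
)
print(precision, recall)  # 0.5 0.5
```

Both captions have six tokens and share three words (cat, on, mat), so the two ratios coincide here. They diverge as soon as the candidate and the reference differ in length.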

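Reference: {The cat sat on the mat}
Candidate: {An orange kitty sitting on mat}

METEOR uses both ratios and also counts synonym matches (kitty ~ cat):
Recall: {kitty, sit, on, mat} / {the, cat, sat, on, the, mat}
Precision: {kitty, sit, on, mat} / {an, orange, kitty, sitting, on, mat}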

Reference: {The cat sat on the mat}
Candidate: {An orange kitty sitting on mat}

CIDEr reweights each matched word with a corpus-derived weight, f_corpus:

| Word | Weight |
|---|---|
| Kitty | 0.6 |
| Sit | 0.2 |
| On | 0.1 |
| Mat | 0.1 |

The weighted matches then enter the same two ratios:
{0.6 kitty, 0.2 sit, 0.1 on, 0.1 mat} / {the, cat, sat, on, the, mat}
{0.6 kitty, 0.2 sit, 0.1 on, 0.1 mat} / {an, orange, kitty, sitting, on, mat}
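One way to picture corpus reweighting is a TF-IDF weighted cosine similarity, which is the idea CIDEr builds on. The sketch below is a simplification under stated assumptions: unigrams only, no stemming, a single reference, and a three-caption toy corpus invented for illustration. The helper names `tfidf_vector`, `cosine` and `score` are hypothetical and do not come from the CIDEr code base.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Term frequency times inverse document frequency, per word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        word: (count / total) * math.log(num_docs / max(1, doc_freq.get(word, 0)))
        for word, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): each entry stands in for one image's reference caption.
corpus = [
    "the cat sat on the mat".split(),
    "a dog sat on the grass".split(),
    "a bird on a wire".split(),
]
doc_freq = Counter(word for caption in corpus for word in set(caption))


def score(candidate, reference):
    n = len(corpus)
    return cosine(
        tfidf_vector(candidate.split(), doc_freq, n),
        tfidf_vector(reference.split(), doc_freq, n),
    )


reference = "the cat sat on the mat"
score_a = score("a cat on a mat", reference)    # shares the rare words cat and mat
score_b = score("a bird on a wire", reference)  # shares only the ubiquitous word on
print(score_a > score_b, score_b)
```

The word "on" occurs in every toy caption, so its weight is log(1) = 0 and sharing it earns nothing, while the rarer "cat" and "mat" carry the score.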

What is missing?

| Metric | Proposed to Evaluate | Idea | Semantic Information |
|---|---|---|---|
| BLEU | Machine Translation | N-gram precision | - |
| ROUGE | Document Summarization | N-gram recall | - |
| METEOR | Machine Translation | N-gram with synonym matching | Synonym matching |
| CIDEr | Image Captioning | N-gram with corpus reweighting | - |

[Figure: number of possible captions plotted against visual complexity]

SPICE: captions -> scene graph -> METEOR-style matching

Reference: {The cat sat on the mat}
Candidate: {An orange kitty sitting on mat}

Parser output p:

| Tuple type | Reference | Candidate |
|---|---|---|
| p_objects | {cat, mat} | {kitty, mat} |
| p_attributes | {-} | {orange-kitty} |
| p_relations | {cat-sat, cat-on-mat} | {kitty-sit, kitty-on-mat} |

SPICE is computed from the F-scores over these tuples: f_objects, f_attributes, f_relations.
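The tuple matching can be sketched as an F-score over sets. The example below pools objects, attributes and relations into one set and computes a single F1, which is how SPICE is usually described. The `CANONICAL` table is a hypothetical stand-in for the WordNet-based synonym and lemma matching, and the tuples are typed in by hand rather than produced by a scene-graph parser.

```python
# Hypothetical normalisation table standing in for synonym and lemma matching.
CANONICAL = {"kitty": "cat", "sat": "sit", "sitting": "sit"}


def normalise(tuples):
    """Map every word in every tuple to its canonical form."""
    return {tuple(CANONICAL.get(word, word) for word in t) for t in tuples}


def tuple_f1(candidate_tuples, reference_tuples):
    """F1 over the pooled scene-graph tuples of candidate and reference."""
    cand = normalise(candidate_tuples)
    ref = normalise(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Tuples copied from the slide: objects, attributes, relations.
reference = {("cat",), ("mat",), ("cat", "sat"), ("cat", "on", "mat")}
candidate = {
    ("kitty",), ("mat",),
    ("kitty", "orange"),
    ("kitty", "sit"), ("kitty", "on", "mat"),
}

f1 = tuple_f1(candidate, reference)
print(round(f1, 3))  # 0.889
```

Four of the five candidate tuples match (precision 0.8) and all four reference tuples are covered (recall 1.0). Only the extra attribute orange-kitty is penalised.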

What if we have a relevant caption, with no overlapping content?

Reference: A man wearing a lifevest is sitting in a canoe
Candidate 1: A guy with a red jacket is standing on a boat
Candidate 2: A small white ferry rides through water

Word Mover Distance (WMD): captions -> word2vec (W2V) space -> Earth Mover Distance

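Reference: A man wearing a lifevest is sitting in a canoe

Candidate 1: A guy with a red jacket is standing on a boat
Distance to reference: 2.49 = 0.48 + 0.50 + 0.60 + 0.43 + 0.48

Candidate 2: A small white ferry rides through water
Distance to reference: 3.07 = 0.61 + 0.57 + 0.71 + 0.70 + 0.48

Candidate 1 has the smaller distance, so it is the closer match to the reference.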
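The sums above pair each content word of a candidate with one word of the reference and add up the embedding distances. A minimal sketch of that special case follows. It assumes both captions have the same number of content words with equal weights, in which case the optimal transport is a one-to-one matching that can be found by brute force. General WMD instead solves a transportation problem over normalised bag-of-words weights. The 2-D vectors in `TOY_EMBEDDINGS` are invented for illustration and are not word2vec values.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-D "embeddings"; real WMD uses word2vec vectors.
TOY_EMBEDDINGS = {
    "man": (0.0, 0.0), "guy": (0.1, 0.0),
    "canoe": (1.0, 1.0), "boat": (1.0, 1.2),
    "ferry": (1.0, 2.0), "water": (2.0, 2.0),
}


def matching_distance(candidate_words, reference_words):
    """Minimum total Euclidean distance over one-to-one word matchings."""
    if len(candidate_words) != len(reference_words):
        raise ValueError("this sketch only handles equal-length word lists")
    cand = [TOY_EMBEDDINGS[w] for w in candidate_words]
    ref = [TOY_EMBEDDINGS[w] for w in reference_words]
    n = len(ref)
    return min(
        sum(dist(cand[i], ref[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )


reference = ["man", "canoe"]
d1 = matching_distance(["guy", "boat"], reference)     # near-synonyms of the reference
d2 = matching_distance(["ferry", "water"], reference)  # related scene, different content
print(d1 < d2)  # True
```

Brute force over permutations is only practical for a handful of words. It is used here to keep the sketch dependency-free, not as a recommendation.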

We have BLEU, METEOR, ROUGE, CIDEr, SPICE and WMD. Which one is better?

Correlation Analysis with Williams Significance Test

For each caption (Caption 1, Caption 2, Caption 3, Caption 4, ...), record the human judgement and the scores given by Metric 1 and Metric 2. Then compare Corr(Human Judgement, Metric 1) with Corr(Human Judgement, Metric 2).
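The comparison can be sketched with a small Spearman rank correlation written from scratch: rank both lists (averaging tied ranks), then take the Pearson correlation of the ranks. The judgement and metric values below are hypothetical numbers for four captions, chosen only to show the mechanics.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for four captions.
human = [4, 1, 3, 2]
metric_1 = [0.9, 0.1, 0.7, 0.3]  # orders the captions exactly like the humans
metric_2 = [0.5, 0.4, 0.2, 0.6]  # ordering unrelated to the human one

rho_1 = spearman(human, metric_1)
rho_2 = spearman(human, metric_2)
print(round(rho_1, 3), round(rho_2, 3))
```

Because only ranks matter, a metric is rewarded for ordering captions as humans do, regardless of the scale of its scores.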

Spearman Correlation on Datasets

| Metric | Flickr-8k | Composite |
|---|---|---|
| WMD | 0.60 | 0.43 |
| SPICE | 0.64 | 0.42 |
| CIDEr | 0.56 | 0.42 |
| METEOR | 0.58 | 0.44 |
| BLEU | 0.44 | 0.38 |
| ROUGE | 0.44 | 0.39 |

Flickr-8k: SPICE > WMD > METEOR > CIDEr > BLEU > ROUGE
Composite: METEOR > WMD > SPICE > CIDEr > ROUGE > BLEU

Is this significant?

[Figures: Spearman correlation between metrics; Williams significance test]
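The Williams test asks whether two correlations that share one variable (here, the human judgement) differ significantly, taking into account how strongly the two metrics correlate with each other. Below is a sketch of the test statistic in its standard form. In the example call, the first two correlations are the Flickr-8k values for SPICE and WMD from the table, while the metric-to-metric correlation `r23 = 0.7` and the sample size `n = 1000` are assumed values, not results from the paper.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams t statistic (n - 3 degrees of freedom).

    r12: correlation between human judgement and metric 1
    r13: correlation between human judgement and metric 2
    r23: correlation between metric 1 and metric 2
    n:   number of scored captions
    """
    if n <= 3:
        raise ValueError("need more than 3 samples")
    # K is the determinant of the 3x3 correlation matrix.
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator


# r12 and r13 from the Flickr-8k column; r23 and n are assumptions.
t_stat = williams_t(r12=0.64, r13=0.60, r23=0.7, n=1000)
print(t_stat > 0)  # True: metric 1 correlates better with human judgement
```

A p-value follows from comparing the statistic against a t distribution with n - 3 degrees of freedom, for instance with `scipy.stats.t.sf` if SciPy is available. The higher the correlation between the two metrics, the smaller the gap in human correlation that can reach significance.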

Measuring Syntactic Properties

Metric values

| Description | Caption | BLEU | METEOR | ROUGE | CIDEr | SPICE | WMD |
|---|---|---|---|---|---|---|---|
| Original | A man wearing a red life jacket is sitting in a canoe on a lake | 1 | 1 | 1 | 10 | 1 | 1 |
| Candidate | A man wearing a life jacket is in a small boat on a lake | 0.45 | 0.28 | 0.68 | 2.19 | 0.40 | 0.19 |
| Synonyms | A guy wearing a life vest is in a small boat on a lake | 0.20 | 0.17 | 0.57 | 0.65 | 0.00 | 0.10 |
| Redundancy | A man wearing a life jacket is in a small boat on a lake at sunset | 0.45 | 0.28 | 0.66 | 2.01 | 0.36 | 0.18 |
| Word order | In a small boat on a lake a man is wearing a life jacket | 0.26 | 0.26 | 0.38 | 1.32 | 0.40 | 0.19 |

Robustness Analysis

| Distractor Type | Caption |
|---|---|
| Gold Caption | A man wearing a life jacket is in a small boat on a lake with a ferry in view |
| Replace Scene | A man wearing a life jacket is in a small boat on takeoff with a ferry in view |
| Replace Person | A woman in a blue shirt and headscarf is in a small boat on a lake with a ferry in view |
| Share Person | A man is selecting a chair from a stack under a shady awning |
| Share Scene | A black and brown dog is playing on the ice at the edge of a lake |

Accuracy

| Case | #Instances | BLEU | METEOR | ROUGE | CIDEr | SPICE | WMD |
|---|---|---|---|---|---|---|---|
| Replace Scene | 2514 | 0.62 | 0.69 | 0.63 | 0.83 | 0.54 | 0.76 |
| Replace Person | 5817 | 0.73 | 0.77 | 0.78 | 0.78 | 0.67 | 0.80 |
| Share Scene | 2621 | 0.79 | 0.85 | 0.79 | 0.81 | 0.70 | 0.87 |
| Share Person | 4596 | 0.78 | 0.85 | 0.78 | 0.83 | 0.67 | 0.88 |
| Overall | 15548 | 0.73 | 0.79 | 0.75 | 0.81 | 0.65 | 0.83 |
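One plausible reading of these numbers, given that the robustness corpus is built around binary forced-choice tasks, is that accuracy is the fraction of instances in which a metric scores the gold caption above the distractor. The sketch below implements that reading. It is an assumption about the protocol rather than the authors' evaluation code, and the score pairs are made up.

```python
def forced_choice_accuracy(score_pairs):
    """Fraction of (gold_score, distractor_score) pairs where gold wins.

    Ties count as failures, since the metric did not prefer the gold caption.
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("no instances to evaluate")
    wins = sum(1 for gold, distractor in pairs if gold > distractor)
    return wins / len(pairs)


# Hypothetical metric scores for four (gold, distractor) instances.
toy_pairs = [(0.8, 0.3), (0.4, 0.6), (0.7, 0.7), (0.9, 0.2)]
accuracy = forced_choice_accuracy(toy_pairs)
print(accuracy)  # 0.5
```

Under this reading, chance level is 0.5, which puts the 0.54 of SPICE on Replace Scene barely above guessing.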

Conclusion
- Word Mover Distance helps as a semantic measure and opens up the opportunity to apply the semantic relevancy literature to this task.
- Metrics tend to be affected by small disturbances such as scene changes, synonyms, or word reordering.
- Is the current success of image captioning metrics an illusion?
- Correlation alone is not enough. It should always be coupled with significance testing of the improvements.
- There is considerable room for improvement through novel metrics, and the best practices proposed in this paper can help demonstrate their contribution.

References

[BLEU] Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002.

[ROUGE] Lin, Chin-Yew. "ROUGE: A package for automatic evaluation of summaries." Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Vol. 8. 2004.

[METEOR] Banerjee, Satanjeev, and Alon Lavie. "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments." Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Vol. 29. 2005.

[CIDEr] Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. "CIDEr: Consensus-based image description evaluation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[SPICE] Anderson, Peter, et al. "SPICE: Semantic propositional image caption evaluation." European Conference on Computer Vision. Springer International Publishing, 2016.

[Williams Test] Graham, Yvette. "Improving Evaluation of Machine Translation Quality Estimation." ACL (1). 2015.

[Robustness Corpus] Hodosh, Micah, and Julia Hockenmaier. "Focused evaluation for image description with binary forced-choice tasks." Workshop on Vision and Language, Annual Meeting of the Association for Computational Linguistics. Vol. 3. 2016.

[Word Mover Distance] Kusner, Matt J., et al. "From Word Embeddings To Document Distances." ICML. Vol. 15. 2015.