A Chinese Domain Term Extractor

2009 International Conference on Machine Learning and Computing
IPCSIT vol. 3 (2011) © (2011) IACSIT Press, Singapore

Jinbin Fu, Zhifei Wang, Jintao Mao
Beijing Institute of Technology; Harbin University

Abstract. A novel method based on statistical models of the language features of domain terms is proposed. Chinese domain terms have three features: domain cohesiveness, domain relevancy and domain consensus. Each feature is expressed by its own statistical model, and the models are integrated to extract domain terms. The relative entropy between N-gram language models is adopted to express the cohesiveness feature; the difference in the distribution of terms between the domain corpus and a balanced corpus expresses the domain relevancy feature; and the entropy of terms in the domain corpus denotes the domain consensus feature. Experimental results show that the method extracts domain terms with good precision and recall.

Keywords: term extraction, domain cohesiveness, domain relevancy, domain consensus.

1. Introduction

The vocabulary of the Chinese language contains thousands of terms, and accurate identification of terms is important in a variety of contexts. Term extraction is a core part of knowledge systems and an important task in natural language processing, and it can be applied in a variety of fields such as ontology construction, text classification and information retrieval. Furthermore, terms can be used to study the development of domain question answering systems. In this paper we mainly explore Chinese terms of the computer domain, using 30 thousand sentences taken from the user interaction log of a real-world web-based intelligent question answering system.

A new method is proposed, based on statistical models of the language features of terms. Chinese domain terms have three features: domain cohesiveness, domain relevancy and domain consensus. Each feature is computed by its own statistical model, and the models are integrated to extract domain terms.
The relative entropy between N-gram language models is adopted to express the cohesiveness feature; the difference in the distribution of terms between the domain corpus and a balanced corpus expresses the domain relevancy feature; and the entropy of terms in the domain corpus denotes the domain consensus feature. Experiments show that the method extracts computer domain terms with good precision and recall.

(Corresponding author. Tel.: +86 39057980; fax: +86 00-6895944. E-mail address: fujb@gmail.com.)

2. Related Work

Insofar as terms function as lexical units, their component words tend to co-occur more often, to resist substitution or paraphrase, to follow fixed syntactic patterns, and to display some degree of semantic noncompositionality [1]. However, none of these characteristics is amenable to a simple algorithmic interpretation. Various term extraction systems have been developed, such as Termight [2] and TERMS [3], among other methods [4-5]. Such systems typically rely on a combination of linguistic knowledge and statistical association measures. Grammatical patterns, such as adjective-noun or noun-noun sequences, are selected and then ranked statistically, and the resulting ranked list is either used directly or submitted for manual filtering. Linguistic filters are used in typical term extraction systems to reduce the number of a priori improbable terms and thus improve precision. The cohesiveness measure does the actual work of distinguishing between terms and plausible non-terms.

A variety of methods have been applied, ranging from simple frequency [3] and modified frequency measures such as C-value [6], through standard statistical significance tests such as the t-test, the chi-squared test [7] and log-likelihood [5], to information-based methods such as point-wise mutual information [8]. The main term cohesiveness measures are listed in Table 1.

Table 1. Term cohesiveness measure methods

Frequency [7]: $f(x,y)$, the frequency of the bigram $xy$.

T-score [7]: $\dfrac{f(x,y) - f(x)\,f(y)/N}{\sqrt{f(x,y)}}$, where $f(x)$ and $f(y)$ are the frequencies of $x$ and $y$ respectively, and $N$ is the total number of bigrams in the corpus.

Log-likelihood [5]: $2\,[\,ll(k_1/n_1, k_1, n_1) + ll(k_2/n_2, k_2, n_2) - ll(p, k_1, n_1) - ll(p, k_2, n_2)\,]$, where $ll(p, k, n) = k \log p + (n-k)\log(1-p)$, $k_1 = f(x,y)$, $n_1 = f(x,*)$, $k_2 = f(\neg x, y)$, $n_2 = f(\neg x, *)$ and $p = (k_1 + k_2)/(n_1 + n_2)$.

Chi-squared ($\chi^2$) [7]: $\sum_{i \in \{x, \neg x\},\, j \in \{y, \neg y\}} \dfrac{(f_{ij} - \xi_{ij})^2}{\xi_{ij}}$, where $f_{ij}$ is the observed and $\xi_{ij}$ the expected frequency of each cell, computed from $N$, $f(x)$ and $f(y)$.

Point-wise Mutual Information [8]: $\log \dfrac{p(xy)}{p(x)\,p(y)}$, where $p(xy)$ is the relative frequency of the bigram, and $p(x)$ and $p(y)$ are the relative frequencies of $x$ and $y$ respectively.

True Mutual Information [5]: $p(xy) \log \dfrac{p(xy)}{p(x)\,p(y)}$, with the same symbols as above.

C-value [6]: $\log_2 |a| \cdot f(a)$ if $a$ is not nested; otherwise $\log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right)$, where $f(a)$ is the frequency of $a$ in the corpus, $T_a$ is the list of term candidates that include $a$, and $P(T_a)$ is the length of that list.

However, the performance reported in all these studies is generally far from ideal, with precision falling rapidly after the very highest-ranked terms in the list. Schone and Jurafsky [9] evaluate the identification of terms without grammatical filtering on a 6.7 million word extract from the TREC databases, applying both WordNet and online dictionaries as gold standards. Once again, the general level of performance is low, with precision falling off rapidly as larger portions of the n-best list are included, but they report better performance with statistical and information-theoretic measures (including mutual information) than with frequency. The overall pattern appears to be one where lexical cohesiveness measures in general have very low precision and recall on unfiltered data, but perform far better when combined with other features that select linguistic patterns likely to function as terms.
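As a rough illustration of how the association measures in Table 1 are computed, the following sketch scores adjacent token pairs by frequency, t-score and point-wise mutual information. It assumes the standard textbook forms of these measures; the function name `bigram_scores` and the toy token list are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(x, y): (frequency, t_score, pmi)} for adjacent token pairs.

    Assumed (standard) definitions, not confirmed by the paper:
      t-score = (f(x,y) - f(x) f(y) / N) / sqrt(f(x,y))
      PMI     = log( p(xy) / (p(x) p(y)) )
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = sum(bigrams.values())   # N: total number of bigrams
    n_tokens = len(tokens)

    scores = {}
    for (x, y), f_xy in bigrams.items():
        expected = unigrams[x] * unigrams[y] / n_bigrams
        t_score = (f_xy - expected) / math.sqrt(f_xy)
        p_xy = f_xy / n_bigrams
        p_x = unigrams[x] / n_tokens
        p_y = unigrams[y] / n_tokens
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(x, y)] = (f_xy, t_score, pmi)
    return scores


# Toy English example; a Chinese corpus would be segmented first.
toy_tokens = ["hard", "disk", "error", "the", "disk",
              "is", "full", "hard", "disk", "error"]
toy_scores = bigram_scores(toy_tokens)
```

On unfiltered data such scores rank many non-terms highly, which is why the systems described above combine them with linguistic filters.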
The relatively low precision of lexical cohesiveness measures on unfiltered data no doubt has multiple explanations, but a logical candidate is the failure of the underlying statistical assumptions [7]. For instance, many of the tests assume a normal distribution, despite the highly skewed nature of natural language frequency distributions. In natural language, as first observed by Zipf [10], the frequencies of words and other linguistic units tend to follow highly skewed distributions in which there are a large number of rare events. Zipf's law for single-word frequency distributions postulates that the frequency of a word is inversely proportional to its rank in the frequency distribution.

More importantly, statistical and information-based metrics such as log-likelihood and mutual information measure significance relative to the assumption that the selection of component terms is statistically independent. But of course the possibilities for combinations of words are not random and independent. Use of linguistic filters such as "attributive adjective + noun" or "verb + modifying prepositional phrase" arguably has the effect of selecting a subset of the language for which the standard null hypothesis, that any word may freely be combined with any other word, may be much more accurate. The usual solution is therefore to impose a linguistic filter on the data, with the cohesiveness measures being applied only to the subset thus selected. For instance, if the universe of statistical possibilities is restricted to the set of sequences in which an adjective is followed by a noun, the null hypothesis that word choice is independent (i.e., that any adjective may precede any noun) is a reasonable idealization. It is thus worth considering whether there are any ways to bring additional information to bear on the problem of recognizing phrasal terms without presupposing statistical independence.

3. Term Extraction Based on Language Feature

In Chinese, a domain term is a word or phrase that occurs frequently in the domain corpus and expresses a concept, feature or relationship of the target domain. Term extraction for Chinese differs from term extraction for English: Chinese has no obvious morphological delimiters separating the words of a sentence, so term extraction from Chinese is more difficult than from English. We have observed that domain terms have three language features: cohesiveness, relevancy and consensus. The three features are modeled and integrated to evaluate domain terms.

3.1. Cohesiveness

The cohesiveness measures in Table 1 are mainly applied to English text, and many of them assume a normal distribution. Furthermore, statistical and information-based metrics measure significance relative to the independence assumption. In this paper, we use a cohesiveness measure to extract terms and filter the term candidates with linguistic POS rules. Because words in Chinese sentences are not separated by delimiters, we use the cohesiveness measure to determine the boundaries of a term. The cohesiveness of a term denotes the compactness of the words or characters that are its component elements.
We use N-gram language models to describe the degree of cohesiveness of terms. The simplest language model is the unigram model, which assumes that each word of a given word sequence is drawn independently. We denote by $LM_1$ the unigram model for the target domain corpus. We can also train a bigram model $LM_2$ on the corpus; it is the better model for describing two-character terms in the corpus. If we use the unigram model $LM_1$ instead of $LM_2$, we incur some loss with respect to the corpus. We assume that the amount of loss between using $LM_1$ and $LM_2$ is related to cohesiveness, and we use the relative entropy between the bigram model and the unigram model to express the cohesiveness of a term [3]. Cohesiveness is defined as follows:

\[ CO(w) = D_w(LM_2 \,\|\, LM_1) = p(w) \log \frac{p(w)}{q(w)} = p(xy) \log \frac{p(xy)}{p(x)\,p(y)} \tag{1} \]

where $w = xy$, $p(w) = p(xy)$ is the probability that the bigram of $x$ and $y$ occurs adjacently in the corpus, and $q(w) = p(x)\,p(y)$ is its probability under the unigram model.

Through the cohesiveness measure, adjacent characters are selected as term candidates, which yields two-character term candidates. Then, treating each of these candidates as a single character and re-computing cohesiveness, the two-character terms can be extended to multi-character term candidates. Linguistic POS rules are used to filter these term candidates, which reduces the number of improbable terms and thus improves precision. POS rules such as "adj + noun" or "noun + noun" are used as support rules, and POS rules such as "preposition + verb" are used as elimination rules. Chinese sentences also contain stop-words such as 虽然 and 但是; in the target domain in particular, some general nouns such as 北京 are also treated as stop-words. Taking the POS rules and stop-words into account, domain cohesiveness is defined as follows:

\[ DCO(w) = P_{stop} \cdot P_{pos} \cdot CO(w) \tag{2} \]
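A minimal sketch of the cohesiveness computation follows, assuming the point-wise relative entropy reading of Eq. (1), $CO(xy) = p(xy)\log\frac{p(xy)}{p(x)p(y)}$, estimated from raw character counts. The function names, the toy sentences, the penalty value 0.1 and the choice to fix $P_{pos} = 1$ (no POS tagger is used here) are all illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter


def cohesiveness(sentences):
    """CO(xy) = p(xy) * log(p(xy) / (p(x) * p(y))) for adjacent characters."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)                      # character counts
        bigrams.update(zip(sentence, sentence[1:]))    # adjacent pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    co = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi                                # bigram model
        q_xy = (unigrams[x] / n_uni) * (unigrams[y] / n_uni)  # unigram model
        co[x + y] = p_xy * math.log(p_xy / q_xy)
    return co


def domain_cohesiveness(co, stop_words, p_stop=0.1):
    """DCO = P_stop * P_pos * CO, with P_pos fixed to 1.0 in this sketch."""
    p_pos = 1.0  # assumption: POS rules are not modeled here
    return {w: (p_stop if w in stop_words else 1.0) * p_pos * score
            for w, score in co.items()}


# Hypothetical toy corpus from a computer troubleshooting setting.
toy_sentences = ["硬盘坏了", "硬盘已满", "更换硬盘", "内存已满"]
toy_co = cohesiveness(toy_sentences)
toy_dco = domain_cohesiveness(toy_co, stop_words={"已满"})
best_candidate = max(toy_co, key=toy_co.get)
```

Candidates whose score exceeds the threshold would then be merged into single units and rescored to grow multi-character candidates; candidates with a score at or below zero fall under any positive threshold regardless of the penalty factors.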

In Eq. (2), $P_{stop}$ is a penalty factor for stop-words and $P_{pos}$ is a penalty factor for the linguistic POS rules. A term candidate is passed to the next step if its domain cohesiveness value exceeds a fixed threshold; a list of known terms is used to determine the threshold.

3.2. Relevancy

Both terminological and non-terminological expressions (e.g. "last week" or "real time") can have high frequency in a corpus. The specificity of a terminological candidate with respect to the target domain is measured by comparative analysis between the target domain corpus and a balanced corpus. The relevancy of a term expresses the degree to which the term is exclusive to the underlying domain:

\[ DR(w) = p(w) \log \frac{p(w)}{q(w)} \tag{3} \]

where $p(w)$ is the probability of string $w$ in the target corpus and $q(w)$ is the probability of string $w$ in the balanced corpus. The term candidates are sorted by relevancy in descending order, and the terms relevant to the target domain are picked out with a threshold.

3.3. Consensus

Terms represent concepts whose meaning is agreed upon by large user communities in an underlying domain. We should take into account not only a term's overall occurrence in the target corpus but also its appearance in single documents. Important terms have a high and evenly spread frequency across all documents of the underlying domain. Distributed usage expresses a form of consensus tied to the consolidated semantics of a term within the target domain [11]. Consensus measures the distributed use of a term in a domain $D$. The distribution of a term $t$ in documents $d$ can be taken as a stochastic variable estimated throughout all $d \in D$. The entropy of this distribution expresses the consensus of $t$ in $D$. Consensus is expressed as follows:

\[ DC(w) = -\sum_{i=1}^{m} P(w \mid d_i) \log P(w \mid d_i) \tag{4} \]

\[ P(w \mid d_i) = \frac{freq(w, d_i)}{\sum_{d_j \in D} freq(w, d_j)} \tag{5} \]

where $P(w \mid d_i)$ is the conditional probability of term $w$ in document $d_i$ and $m$ is the number of documents in the domain. Through the consensus of terms, high-quality terms can be selected.
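The relevancy and consensus scores can be sketched as below. The sketch assumes that relevancy takes the point-wise relative entropy form $p(w)\log\frac{p(w)}{q(w)}$ and that consensus is the entropy of the term's normalized per-document frequencies; the function names, the smoothing constant and the toy counts are hypothetical choices made for illustration.

```python
import math


def relevancy(term, domain_counts, balance_counts, smooth=1e-9):
    """DR(w) = p(w) * log(p(w) / q(w)).

    p(w): relative frequency in the domain corpus.
    q(w): relative frequency in the balanced corpus, with a small
          smoothing constant (an assumption) to avoid division by zero.
    """
    p = domain_counts.get(term, 0) / sum(domain_counts.values())
    q = balance_counts.get(term, 0) / sum(balance_counts.values()) + smooth
    return p * math.log(p / q) if p > 0 else 0.0


def consensus(term, documents):
    """DC(w) = -sum_d P(w|d) log P(w|d), P(w|d) = freq(w,d) / sum_j freq(w,d_j)."""
    freqs = [doc.count(term) for doc in documents]
    total = sum(freqs)
    if total == 0:
        return 0.0
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)


# Hypothetical counts: "硬盘" is domain-specific, "上周" ("last week") is not.
toy_domain = {"硬盘": 30, "上周": 10, "其他": 60}
toy_balance = {"硬盘": 1, "上周": 10, "其他": 89}
toy_docs = ["硬盘 故障 硬盘", "硬盘 已满", "内存 故障 内存 内存"]

dr_disk = relevancy("硬盘", toy_domain, toy_balance)
dr_last_week = relevancy("上周", toy_domain, toy_balance)
dc_fault = consensus("故障", toy_docs)    # spread over two documents
dc_memory = consensus("内存", toy_docs)   # concentrated in one document
```

A term that is frequent in both corpora gets a relevancy near zero, and a term confined to a single document gets a consensus of zero, so thresholds on both scores remove the two kinds of false candidate discussed above.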
4. Architecture of Term Extractor

The architecture of the domain term extractor is shown in Fig. 1. The sentences of the domain corpus are segmented and processed by a POS tagger, and the result is the input to the domain cohesiveness module. With the support of the POS rules, the stop-words and the known-term list, domain cohesiveness is computed, and the term candidates selected in this step are passed to the next step. With the support of the balanced corpus, domain relevancy and domain consensus are then computed, and the terms of the domain are extracted by the system.

Fig. 1: Architecture of Term Extractor. (Components shown: Corpus, Sentence, Segment/POS, POS rules, Stop-words, Known Terms, Cohesiveness, Term Candidates, Balance Corpus, Relevancy, Consensus, Terms.)

5. Experiment Evaluation

In the experiment, the user interaction log of a web-based intelligent question answering system in the computer troubleshooting domain serves as the domain corpus; it includes 3 thousand sentences. TanCorp [12], which includes 450 text files, serves as the balanced corpus. ICTCLAS [13] is adopted as the segmentation and POS tagging tool, and the recall and precision of term extraction are the evaluation criteria. For the purpose of evaluation, a "golden standard" of domain terms was constructed manually. Typical term extraction methods, namely the frequency-based C-value method and the information-theoretic mutual information method, serve as baseline systems. The evaluation results are listed in Table 2. They show that the language-feature method of this paper achieves better results than the C-value method and the mutual information method.

Table 2. Evaluation of term extraction methods

Methods | Extracted terms | Correct terms | Gold standard | Precision | Recall
Language Feature | 254 | 167 | 235 | 66% | 71%
C-Value | 247 | 154 | 235 | 62% | 66%
Mutual Information | 235 | 148 | 235 | 63% | 63%

6. Conclusion and Future Work

This paper presented a method for domain term extraction in the computer troubleshooting domain. The method is based on statistical models of the language features of terms: domain cohesiveness, domain relevancy and domain consensus. Future research can focus on improving the precision and recall of the extractor with linguistic knowledge.

7. References

[1] Manning C. D. and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[2] Dagan I. and K. W. Church, "Termight: Identifying and translating technical terminology", Proceedings of the Fourth Conference on Applied Natural Language Processing, 1994.
[3] Justeson J. S. and S. M. Katz, "Technical terminology: some linguistic properties and an algorithm for identification in text", Natural Language Engineering, 1995, pp. 359-37.
[4] Boguraev B. and C. Kennedy, "Applications of Term Identification Technology: Domain Description and Content Characterization", Natural Language Engineering, 1999, pp. 7-44.
[5] Patrick Pantel and Dekang Lin, "A Statistical Corpus-Based Term Extractor", Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2001.
[6] Katerina T. Frantzi, Sophia Ananiadou, and Jun-ichi Tsujii, "The C-value/NC-value method of automatic recognition for multi-word terms", Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, London, UK, 1998.
[7] Paul Deane, "A Nonparametric Method for Extraction of Candidate Phrasal Terms", Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 2005.
[8] Kenneth W. Church and Patrick Hanks, "Word Association Norms, Mutual Information, and Lexicography", Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, B.C., 1989.
[9] Patrick Schone and Daniel Jurafsky, "Is knowledge-free induction of multiword unit dictionary headwords a solved problem?", Proceedings of Empirical Methods in Natural Language Processing, 2001.
[10] George Kingsley Zipf, Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
[11] LIU Tao, LIU Bing-quan, XU Zhi-ming, and WANG Xiao-long, "Automatic Domain-Specific Term Extraction and Its Application in Text Classification", ACTA ELECTRONICA SINICA, 2007, pp. 38-33.
[12] Songbo Tan, A Novel Refinement Approach for Text Categorization. ACM CIKM, 2005.
[13] Huaping Zhang, "Chinese Lexical Analysis Using Hierarchical Hidden Markov Model", Proceedings of the Second SIGHAN Workshop affiliated with the 41st ACL, Sapporo, Japan, 2003.