A Chinese Domain Term Extractor

2009 International Conference on Machine Learning and Computing
IPCSIT vol. 3 (2011) © (2011) IACSIT Press, Singapore

Jinbin Fu, Zhifei Wang, Jintao Mao
Beijing Institute of Technology
Harbin University

Abstract. A novel method based on a statistical model of domain term language features is proposed. Chinese domain terms have three features: domain cohesiveness, domain relevancy and domain consensus. Each feature is expressed by its own statistical model, and these models are integrated to extract domain terms. The relative entropy between N-gram language models is adopted to express the cohesiveness feature; the difference in the distribution of terms between the domain corpus and a balance corpus expresses the domain relevancy feature; and the entropy of terms in the domain corpus denotes the domain consensus feature. Experimental results show that this method extracts domain terms with good precision and recall.

Keywords: term extraction, domain cohesiveness, domain relevancy, domain consensus.

1. Introduction

The vocabulary of the Chinese language contains thousands of terms, and accurate identification of terms is important in a variety of contexts. Term extraction is a core part of knowledge systems and an important task in natural language processing, and it can be applied to a variety of fields such as ontology construction, text classification and information retrieval. Furthermore, terms let us study the development of domain question answering systems.

In this paper we mainly explore Chinese terms of the computer domain, and take 30 thousand sentences into account; the sentences come from the user interaction log of a real-world web intelligent question answering system. A new method is proposed, based on a statistical model of term language features. Chinese domain terms have three features: domain cohesiveness, domain relevancy and domain consensus. Each feature is computed by its own statistical model, and these models are integrated to extract domain terms.
The relative entropy between N-gram language models is adopted to express the cohesiveness feature; the difference in the distribution of terms between the domain corpus and a balance corpus expresses the domain relevancy feature; and the entropy of terms in the domain corpus denotes the domain consensus feature. Experiments show that this method extracts computer domain terms with good precision and recall.

Corresponding author. Tel.: +86 39057980; fax: +86 00-6895944. E-mail address: fujb@gmail.com.

2. Related Work

Insofar as terms function as lexical units, their component words tend to co-occur more often, to resist substitution or paraphrase, to follow fixed syntactic patterns, and to display some degree of semantic noncompositionality [1]. However, none of these characteristics is amenable to a simple algorithmic interpretation. Various term extraction systems have been developed, such as Termight [2] and TERMS [3], among other methods [4-5]. Such systems typically rely on a combination of linguistic knowledge and statistical association measures. Grammatical patterns, such as adjective-noun or noun-noun sequences, are selected and then ranked statistically, and the resulting ranked list is either used directly or submitted for manual filtering.

The linguistic filters are used in typical term extraction systems to reduce the number of a priori improbable terms and thus improve precision. The cohesiveness measure does the actual work of distinguishing between terms and plausible non-terms. A variety of methods have been applied, ranging from simple frequency [3], modified frequency measures such as c-values [6], and standard statistical significance tests such as the t-test, the chi-squared test [7] and log-likelihood [5], to information-based methods, e.g. point-wise mutual information [8]. The main term cohesiveness measures are listed in Table 1.

Table 1. Term Cohesiveness Measure Methods

| Method | Formula | Interpretation |
|---|---|---|
| Frequency [7] | $f_{xy}$ | $f_{xy}$ is the frequency of the bigram $xy$ |
| T-Score [7] | $\dfrac{f_{xy} - f_x f_y / N}{\sqrt{f_{xy}}}$ | $f_x$, $f_y$ are the frequencies of $x$ and $y$ respectively; $N$ is the number of bigrams in the corpus |
| Log-likelihood [5] | $ll\left(\frac{k_1}{n_1}, k_1, n_1\right) + ll\left(\frac{k_2}{n_2}, k_2, n_2\right) - ll\left(\frac{k_1+k_2}{n_1+n_2}, k_1, n_1\right) - ll\left(\frac{k_1+k_2}{n_1+n_2}, k_2, n_2\right)$ | $k_1 = f(xy)$, $n_1 = f(x*)$, $k_2 = f(\bar{x}y)$, $n_2 = f(\bar{x}*)$; $ll(p, k, n) = k\log(p) + (n-k)\log(1-p)$ |
| Chi-squared ($\chi^2$) [7] | $\sum_{i\in\{x,\bar{x}\},\, j\in\{y,\bar{y}\}} \dfrac{\left(f(i,j) - f(i)f(j)/N\right)^2}{f(i)f(j)/N}$ | $f(x)$, $f(y)$ are the frequencies of $x$ and $y$ respectively |
| Point-wise Mutual Information [8] | $\log_2 \dfrac{p(xy)}{p(x)\,p(y)}$ | $p(xy)$ is the relative frequency of the bigram; $p(x)$, $p(y)$ are the relative frequencies of $x$ and $y$ respectively |
| True Mutual Information [5] | $p(xy)\log_2 \dfrac{p(xy)}{p(x)\,p(y)}$ | The same as above |
| C-Value [6] | $\log_2\lvert a\rvert \cdot f(a)$ if $a$ is not nested; otherwise $\log_2\lvert a\rvert \left(f(a) - \frac{1}{P(T_a)}\sum_{b\in T_a} f(b)\right)$ | $f(a)$ is the frequency of $a$ in the corpus; $T_a$ is the list of term candidates that include $a$; $P(T_a)$ is the length of that list |

However, the performance reported in all these studies is generally far from ideal, with precision falling rapidly after the very highest ranked terms in the list. Schone and Jurafsky [9] evaluate the identification of terms without grammatical filtering on a 6.7 million word extract from the TREC databases, applying both WordNet and online dictionaries as gold standards. Once again, the general level of performance is low, with precision falling off rapidly as larger portions of the n-best list are included, but they report better performance with statistical and information-theoretic measures (including mutual information) than with frequency. The overall pattern appears to be one where lexical cohesiveness measures in general have very low precision and recall on unfiltered data, but perform far better when combined with other features which select linguistic patterns likely to function as terms.
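To make two of the association measures from Table 1 concrete, the following is a minimal sketch of the t-score and point-wise mutual information for adjacent token pairs. It is an illustration under stated assumptions rather than the implementation used by any of the cited systems: the function name `bigram_association` and the toy token list are hypothetical, and all probabilities are normalised by the bigram count N for simplicity.

```python
import math
from collections import Counter


def bigram_association(tokens):
    """Score every adjacent token pair with the t-score and point-wise MI.

    Returns a dict mapping (x, y) to a (t_score, pmi) tuple.
    Assumption: unigram and bigram probabilities share the normaliser N.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())  # N: total number of bigrams in the corpus

    scores = {}
    for (x, y), f_xy in bigrams.items():
        # Frequency expected if x and y were chosen independently.
        expected = unigrams[x] * unigrams[y] / n
        t_score = (f_xy - expected) / math.sqrt(f_xy)
        # log2( p(xy) / (p(x) p(y)) ) with p(.) = f(.) / N.
        pmi = math.log2(f_xy * n / (unigrams[x] * unigrams[y]))
        scores[(x, y)] = (t_score, pmi)
    return scores


# Hypothetical toy corpus, already tokenised.
tokens = ["hard", "disk", "is", "full", "hard", "disk", "failed"]
scores = bigram_association(tokens)
```

On this toy input the pair ("is", "full"), which occurs once, receives a higher point-wise mutual information than ("hard", "disk"), which occurs twice. This is one small illustration of why such measures tend to need additional filtering on unfiltered data.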
The relatively low precision of lexical cohesiveness measures on unfiltered data no doubt has multiple explanations, but a logical candidate is the inappropriateness of the underlying statistical assumptions [7]. For instance, many of the tests assume a normal distribution, despite the highly skewed nature of natural language frequency distributions. In natural language, as first observed by Zipf [10], the frequencies of words and other linguistic units tend to follow highly skewed distributions in which there are a large number of rare events. Zipf's law for single word frequency distributions postulates that the frequency of a word is inversely proportional to its rank in the frequency distribution. More importantly, statistical and information-based metrics such as log-likelihood and mutual information measure significance relative to the assumption that the selection of component terms is statistically independent. But of course the possibilities for combinations of words are not random and independent. Use of linguistic filters such as "attributive adjective + noun" or "verb + modifying prepositional phrase" arguably has the effect of selecting a subset of the language for which the standard null hypothesis (that any word may freely be combined with any other word) may be much more accurate, so the usual solution is to impose a linguistic filter on the data, with the cohesiveness measures being applied only to the subset thus selected. For instance, if the universe of statistical possibilities is restricted to the set of sequences in which an adjective is followed by a noun, the null hypothesis that word choice is independent (i.e., that any adjective may precede any noun) is a reasonable idealization. It is thus worth considering whether there are any ways to bring additional information to bear on the problem of recognizing phrasal terms without presupposing statistical independence.

3. Term Extraction Based on Language Feature

In the Chinese language, a domain term is considered to be a word or phrase that occurs frequently in the domain corpus and expresses a concept, feature or relationship of the target domain. In term extraction, Chinese is different from English: in Chinese, there are no obvious morphological delimiters to separate words in sentences. Hence, term extraction from Chinese is more difficult than from English. We have observed that domain terms have three language features: Cohesiveness, Relevancy and Consensus. The three features are modeled and integrated to evaluate domain terms.

3.1. Cohesiveness

The cohesiveness measures in Table 1 are mainly applied to English text, and many of the measures assume a normal distribution. Furthermore, statistical and information-based metrics measure significance relative to the independence assumption. In this paper, we use a cohesiveness measure to extract terms, and filter term candidates by linguistic POS rules. Because it is difficult to separate words by delimiters in Chinese sentences, we use the cohesiveness measure to determine the boundary of a term. The cohesiveness of a term denotes the compactness of the words or characters that are its component elements.
We use N-gram language models to describe the degree of cohesiveness of terms. The simplest language model is the unigram model, which assumes each word of a given word sequence is drawn independently. We denote the unigram model for the target domain corpus by $LM_1$. We can also train a bigram model $LM_2$ for the corpus; it is the better model to describe two-character terms in the corpus. If we use the unigram model $LM_1$ instead of $LM_2$, then we incur some loss with respect to the corpus. We assume that the amount of loss between using $LM_1$ and $LM_2$ is related to cohesiveness. We use the relative entropy between the bigram model and the unigram model to express the cohesiveness of a term [3]. The definition of cohesiveness is as follows, where the superscript $D$ marks the target domain corpus.

$$CO(w) = \delta_w\left(LM_2^D \,\|\, LM_1^D\right) = p(xy)\log\frac{p(xy)}{p(x)\,p(y)} \qquad (1)$$

Let $p(xy)$ be the probability of the bigram $w = xy$, i.e. of $x$ and $y$ occurring adjacently in the corpus. Through the cohesiveness, adjacent characters are selected as term candidates. After the above process, we obtain two-character term candidates. Then, by treating each of these term candidates as a single character and re-computing the cohesiveness, the two-character terms can be extended to multi-character term candidates.

Linguistic POS rules are used to filter these term candidates, which is useful to reduce the number of improbable terms and thus improve precision. POS rules such as "adj + noun" or "noun + noun" are used as support POS rules; POS rules such as "preposition + verb" are used as elimination POS rules. There are some stop-words in Chinese sentences, such as 虽然 ("although") and 但是 ("but"); in the target domain in particular, some general nouns such as 北京 ("Beijing") are also stop-words. Taking the POS rules and stop-words into account, the domain cohesiveness is defined as follows:

$$DCO(w) = P_{stop} \cdot P_{pos} \cdot CO(w) \qquad (2)$$
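The cohesiveness computation can be sketched as follows. This is a minimal illustration, assuming that the cohesiveness score is the point-wise relative entropy between a maximum-likelihood bigram model and a unigram model over characters. The function names, the toy character string and the penalty values are hypothetical; the paper does not state how the penalty factors are set.

```python
import math
from collections import Counter


def cohesiveness(units):
    """Return CO for every adjacent pair (x, y) in the sequence `units`.

    CO = p(xy) * log(p(xy) / (p(x) * p(y))), with maximum-likelihood
    estimates for the bigram model p(xy) and the unigram model p(x), p(y).
    """
    unigrams = Counter(units)
    bigrams = Counter(zip(units, units[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        q_xy = (unigrams[x] / n_uni) * (unigrams[y] / n_uni)
        scores[(x, y)] = p_xy * math.log(p_xy / q_xy)
    return scores


def domain_cohesiveness(co, p_stop=1.0, p_pos=1.0):
    """Scale CO by the stop-word and POS penalty factors (values assumed)."""
    return p_stop * p_pos * co


# Hypothetical toy input: "hard disk broke, hard disk slow" as characters.
chars = list("硬盘坏了硬盘慢")
co = cohesiveness(chars)
best = max(co, key=co.get)  # the pair 硬 + 盘 ("hard disk")
```

To extend two-character candidates to longer ones, a selected pair could be merged into a single unit in `units` and `cohesiveness` run again, which mirrors the re-computation step described in the text.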

In Eq. (2), $P_{stop}$ is a penalty factor for stop-words and $P_{pos}$ is a penalty factor for the linguistic POS rules. Term candidates are passed to the next step if their domain cohesiveness value surpasses a fixed threshold, and a known term list is used to determine the threshold.

3.2. Relevancy

Terminological and non-terminological expressions (e.g. "last week" or "real time") both occur with high frequency in a corpus. The specificity of a terminological candidate with respect to the target domain is measured via comparative analysis of the target domain corpus against the balance corpus. The relevancy of a term expresses the degree to which the term is exclusive to the underlying domain.

$$DR(w) = p(w)\log\frac{p(w)}{q(w)} \qquad (3)$$

Let $p(w)$ be the probability of string $w$ in the target corpus and $q(w)$ be the probability of string $w$ in the balance corpus. We can sort the term candidates by relevancy in descending order and, through a threshold, pick out the terms relevant to the target domain.

3.3. Consensus

Terms are representative of concepts whose meanings are agreed upon by large user communities in an underlying domain. We should take into account not only the overall occurrence of a term in the target corpus but also its appearance in single documents. Important terms have a high and evenly spread frequency across all documents in the underlying domain. Distributed usage expresses a form of consensus tied to the consolidated semantics of a term within the target domain [11]. Consensus measures the distributed use of a term in a domain $D$. The distribution of a term $t$ in documents $d$ can be taken as a stochastic variable estimated throughout all $d \in D$. The entropy of this distribution expresses the consensus of $t$ in $D$. The consensus is expressed as follows.

$$DC(w) = -\sum_{d=1}^{m} p(w \mid d)\log p(w \mid d) \qquad (4)$$

$$p(w \mid d) = \frac{freq(w, d)}{\sum_{d' \in D} freq(w, d')} \qquad (5)$$

Let $p(w \mid d)$ be the conditional probability of term $w$ in document $d$, and $m$ be the number of documents in the domain. Through the consensus of terms, high-quality terms can be selected.
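The relevancy and consensus scores can be sketched in the same way. This is a hedged illustration, assuming that relevancy is the point-wise relative entropy between the term's probability in the domain corpus and in the balance corpus, and that consensus is the entropy of the term's frequency distribution over domain documents. The function names, the `smoothing` constant (added so that terms absent from the balance corpus do not cause a division by zero) and the toy data are hypothetical.

```python
import math
from collections import Counter


def relevancy(term, domain_counts, balance_counts, smoothing=1e-9):
    """DR = p * log(p / q): p from the domain corpus, q from the balance corpus."""
    p = domain_counts[term] / sum(domain_counts.values())
    q = balance_counts[term] / sum(balance_counts.values()) + smoothing
    return p * math.log(p / q) if p > 0 else 0.0


def consensus(term, documents):
    """DC = entropy of the term's relative frequency across the documents.

    `documents` is a list of token lists, one per domain document.
    """
    freqs = [doc.count(term) for doc in documents]
    total = sum(freqs)
    if total == 0:
        return 0.0
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log(p) for p in probs)


# Hypothetical counts: 硬盘 ("hard disk") is domain-specific, 上周 ("last week") is not.
domain_counts = Counter({"硬盘": 8, "上周": 2})
balance_counts = Counter({"硬盘": 1, "上周": 9})
dr_disk = relevancy("硬盘", domain_counts, balance_counts)
dr_week = relevancy("上周", domain_counts, balance_counts)

# Hypothetical documents: 内存 is "memory", 蓝屏 is "blue screen".
docs = [["硬盘", "内存"], ["硬盘", "蓝屏"], ["内存", "内存"]]
dc_disk = consensus("硬盘", docs)    # spread evenly over two documents
dc_screen = consensus("蓝屏", docs)  # confined to a single document
```

In this toy data the domain-specific term gets a positive relevancy and the general expression a negative one, while a term used evenly across documents gets a higher consensus than a term confined to one document. How the three scores are combined into a final ranking is not specified here and is left out of the sketch.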
4. Architecture of Term Extractor

The architecture of the domain term extractor is shown in Fig. 1. The sentences of the domain corpus are segmented and processed by the POS tagger, and the result is the input to the domain cohesiveness module. With the support of POS rules, stop-words and the known term list, the domain cohesiveness is computed, and term candidates are selected for the next step. With the support of the balance corpus, domain relevancy and domain consensus are then computed, and the terms of the domain are extracted by the system.

[Figure 1 components: Sentence, Segment, POS, Cohesiveness, Relevancy, Consensus, Corpus, Balance Corpus, POS rules, Stop-words, Known Term, Term Candidates, Terms]

Fig. 1: Architecture of Term Extractor.

5. Experiment Evaluation

In the experiment, the user interaction log of a web intelligent question answering system in the computer troubleshooting domain serves as the domain corpus; it includes 3 thousand sentences. TanCorp [12], which includes 14150 text files, serves as the balance corpus. ICTCLAS [13] is adopted as the segmentation and POS tagging tool, and the recall and precision of the term extraction are taken as the evaluation criteria. For the purpose of evaluation, the "gold standard" of domain terms is constructed manually. Typical term extraction methods, namely the frequency-based C-value method and the information-theory-based mutual information method, serve as baseline systems. The results of the evaluation are listed in Table 2. The experimental results show that the language-feature-based method of this paper achieves better results than the C-value method and the mutual information method.

Table 2. The evaluation of term extraction methods

| Methods | Extracted terms | Correct terms | Gold standard | Precision | Recall |
|---|---|---|---|---|---|
| Language Feature | 254 | 167 | 235 | 66% | 71% |
| C-Value | 247 | 154 | 235 | 62% | 66% |
| Mutual Information | 235 | 148 | 235 | 63% | 63% |

6. Conclusion and Future Work

This paper presented a method for domain term extraction in the computer troubleshooting domain. The method is based on a statistical model of term language features: domain cohesiveness, domain relevancy and domain consensus. Future research can focus on improving the precision and recall of the extractor with linguistic knowledge.

7. References

[1] Manning C. D. and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[2] Dagan I. and K. W. Church, "Termight: Identifying and translating technical terminology", Proceedings of the Fourth Conference on Applied Natural Language Processing, 1994.
[3] Justeson J. S. and S. M. Katz, "Technical terminology: some linguistic properties and an algorithm for identification in text", Natural Language Engineering, 1995, pp. 359-37.
[4] Boguraev B. and C. Kennedy, "Applications of Term Identification Technology: Domain Description and Content Characterization", Natural Language Engineering, 1999, pp. 17-44.
[5] Patrick Pantel and Dekang Lin, "A Statistical Corpus-Based Term Extractor", Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2001.
[6] Katerina T. Frantzi, Sophia Ananiadou, and Jun-ichi Tsujii, "The C-value/NC-value method of automatic recognition for multi-word terms", Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, London, UK, 1998.
[7] Paul Deane, "A Nonparametric Method for Extraction of Candidate Phrasal Terms", Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 2005.
[8] Kenneth W. Church and Patrick Hanks, "Word Association Norms, Mutual Information, and Lexicography", Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, B.C., 1989.
[9] Patrick Schone and Daniel Jurafsky, "Is knowledge-free induction of multiword unit dictionary headwords a solved problem", Proceedings of Empirical Methods in Natural Language Processing, 2001.
[10] George Kingsley Zipf, Human Behavior and the Principle of Least Effort: Addison-Wesley, 1949.
[11] LIU Tao, LIU Bing-quan, XU Zhi-ming, and WANG Xiao-long, "Automatic Domain-Specific Term Extraction and Its Application in Text Classification", ACTA ELECTRONICA SINICA, 2007, pp. 328-332.
[12] Songbo Tan, A Novel Refinement Approach for Text Categorization. ACM CIKM, 2005.
[13] Huaping Zhang, "Chinese Lexical Analysis Using Hierarchical Hidden Markov Model", Proceedings of the Second SIGHAN Workshop affiliated with the 41st ACL, Sapporo, Japan, 2003.