Spam Message Classification Based on the Naïve Bayes Classification Algorithm


Bin Ning, Wu Junwei, Hu Feng

Abstract: A classification model based on the naïve Bayes algorithm is proposed to classify spam messages more effectively. Spam message classification models based on the naïve Bayes algorithm are constructed for both multi-classification and multi-two-classification through steps involving text preprocessing based on regular expressions and feature extraction based on Jieba segmentation and the TF-IDF (term frequency-inverse document frequency) algorithm. By further comparing the classification performance against the support vector machine and random forest algorithms, the naïve Bayes model based on multi-two-classification is shown to be the best.

Index Terms: Naïve Bayesian, spam message, classification

I. INTRODUCTION

As a convenient communication method with good mobility and low cost, the short message service has gradually affected more people's lives in the modern information era. However, with the increasing popularity of the short message service, the problem of spam messages has become increasingly serious, severely affecting not only people's normal lives but also social stability and public security [1]. Therefore, filtering spam messages has become an important task that must be solved urgently, and research on technology for the intelligent classification of spam messages is of great significance.

The technology currently used for filtering spam messages generally includes black-and-white list technology [2], rule matching [3] and so on. In black-and-white list technology, the list is maintained by a third party, and the method dynamically queries through DNS whether a certain IP address is in the list. However, the method is limited if the other side uses a dynamic or hidden IP. The fundamental principle of rule matching is to determine whether a message is spam by comparing it with preset rules. However, these preset rules are generally set statically without a credible knowledge learning strategy, so they have poor filtration efficiency and low filtration accuracy in application fields without obvious rules [4]. As a kind of content-based filtering technology, the naïve Bayes algorithm is highly regarded for its simple and easily understood theoretical rules, rapid classification speed and high classification accuracy; thus, it is widely applied in text filtering [5].

A message classification model based on the naïve Bayes algorithm is presented in this paper. Focusing on spam message data from a mobile operator, the original data set is first classified into seven types of message data. The classification performance of the naïve Bayes algorithm with multi-classification and with multi-two-classification is then analysed after text preprocessing and feature extraction based on the TF-IDF algorithm, and features of various dimensions obtained with various TF-IDF weights are studied further. Moreover, the classification efficiency of the multi-two-classification naïve Bayes algorithm is compared with that of the support vector machine and random forest algorithms, and the results show that the multi-two-classification naïve Bayes algorithm has the best classification efficiency.

Manuscript received December 5th, 2017; revised November 9th, 2018. This work is supported by the Guangzhou philosophy and social science 13th Five-Year project planning (07GZYB3, 07GZYB98), the Guangdong philosophy and social science 12th Five-Year project planning (GD4YGL0), the Guangdong supporting key discipline construction project of philosophy and social science (GDXK076), and the National Natural Science Fund of China (757053).
Bin Ning is with the School of Management, Guangdong University of Technology, Guangzhou, 5050, Guangdong, China (corresponding author: +8635805374; e-mail: bn_gdu@63.com).
Wu Junwei is with the School of Management, Guangdong University of Technology, Guangzhou, 5050, Guangdong, China (e-mail: 7944634@qq.com).
Hu Feng is with the School of Management, Guangdong University of Technology, Guangzhou, 5050, Guangdong, China (e-mail: phoenin@63.com).

II. NAÏVE BAYES ALGORITHM

As a classification method based on Bayes' theorem with the assumption of conditional independence among features [6], the naïve Bayes algorithm is a widely applied method of Bayesian learning. Its performance is comparable with that of the decision tree and neural network algorithms in some specific applications, but its computational complexity is much lower than that of other algorithms.

2.1 The classification principle and process of the naïve Bayes algorithm

The naïve Bayes classification is defined as follows [7]:
1) Let $x = \{a_1, a_2, \ldots, a_m\}$ be an item to be classified, where each $a_j$ is a feature attribute of $x$;
2) The classification set $C = \{y_1, y_2, \ldots, y_n\}$ is available;
3) Calculate $P(y_1 \mid x), P(y_2 \mid x), \ldots, P(y_n \mid x)$;
4) If $P(y_k \mid x) = \max\{P(y_1 \mid x), P(y_2 \mid x), \ldots, P(y_n \mid x)\}$, then $x \in y_k$.

The key now is how to calculate the conditional probabilities in step 3. We can perform the calculation as follows:

1) Find a collection of items to be classified whose classifications are known, which is called the training sample set.
2) Obtain the conditional probability of each feature attribute under each category, that is,

$$P(a_1 \mid y_1), P(a_2 \mid y_1), \ldots, P(a_m \mid y_1);\; P(a_1 \mid y_2), \ldots, P(a_m \mid y_2);\; \ldots;\; P(a_1 \mid y_n), \ldots, P(a_m \mid y_n) \tag{1}$$

3) If each feature attribute is conditionally independent, the following deduction can be obtained according to Bayes' theorem:

$$P(y_i \mid x) = \frac{P(x \mid y_i)\,P(y_i)}{P(x)} \tag{2}$$

As the denominator is constant for all classifications, only the numerator must be maximized. Furthermore, as each feature attribute is conditionally independent, we have the following:

$$P(x \mid y_i)\,P(y_i) = P(a_1 \mid y_i)\,P(a_2 \mid y_i) \cdots P(a_m \mid y_i)\,P(y_i) = P(y_i) \prod_{j=1}^{m} P(a_j \mid y_i) \tag{3}$$

2.2 The application of the naïve Bayes algorithm for text classification

As a classification method based on Bayes' theorem, the naïve Bayes algorithm can be used in text classification [8]. Let the training set be $D = \{d_1, \ldots, d_p\}$, the classification set be $C = \{c_1, \ldots, c_n\}$, the feature set be $T = \{t_1, \ldots, t_n\}$, the test text be $d_e = \{(t_1, w_1), \ldots, (t_n, w_n)\}$, and the weight set in the test text be $W = \{w_1, \ldots, w_n\}$. The basic principle of the naïve Bayes algorithm in text classification is as follows [9]:

1) For each classification $c_k$ in the classification set $C$, the probability of the test text $d_e$ given $c_k$ can be calculated according to the following formula:

$$P(d_e \mid c_k) = \prod_{w_i \in W} P(w_i \mid c_k) \tag{4}$$

In the training set $D$, let the number of texts of classification $c_k$ that have the feature with weight $w_i$ be $|D(w_i, c_k)|$, and let the number of texts in the training set be $|D|$. Each factor in formula (4) can then be calculated by maximum likelihood estimation:

$$P(w_i \mid c_k) = \frac{|D(w_i, c_k)|}{|D|} \tag{5}$$

2) For each classification $c_k$ in the classification set $C$, using the calculation result of formula (4), the posterior probability of the test text $d_e$ can be calculated by the Bayes formula:

$$P(c_k \mid d_e) = \frac{P(d_e \mid c_k)\,P(c_k)}{P(d_e)} \tag{6}$$

Let the number of texts with classification $c_k$ in the training set $D$ be $|D(c_k)|$, and let the number of texts in the training set be $|D|$. $P(c_k)$ in formula (6) can then be calculated by maximum likelihood estimation:

$$P(c_k) = \frac{|D(c_k)|}{|D|} \tag{7}$$

3) Use the posterior probabilities calculated by formula (6) in step 2 to form the set $bp = \{bp_1, \ldots, bp_n\}$. If the subscript of the maximum value in $bp$ is MaxIndex, the classification of the test text $d_e$ is $c_{\mathrm{MaxIndex}}$.
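Steps 1 to 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `train_nb` and `classify_nb` are hypothetical, the per-class estimate with add-one (Laplace) smoothing is an assumption introduced here to avoid zero probabilities, and logarithms are summed instead of multiplying probabilities to avoid underflow. Only the numerator of formula (6) is compared across classes.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: the class tag of each document."""
    class_docs = Counter(labels)        # number of training texts per class
    feat_docs = defaultdict(Counter)    # feat_docs[c][w]: texts of class c containing w
    for tokens, c in zip(docs, labels):
        for w in set(tokens):
            feat_docs[c][w] += 1
    return class_docs, feat_docs, len(docs)

def classify_nb(tokens, model):
    """Return the class with the largest log-numerator of formula (6), plus all scores."""
    class_docs, feat_docs, n_docs = model
    scores = {}
    for c, n_c in class_docs.items():
        log_p = math.log(n_c / n_docs)  # log P(c_k), as in formula (7)
        for w in set(tokens):
            # Assumption: smoothed per-class estimate of P(w_i | c_k)
            log_p += math.log((feat_docs[c][w] + 1) / (n_c + 2))
        scores[c] = log_p
    return max(scores, key=scores.get), scores

docs = [["win", "prize"], ["win", "cash"], ["meeting", "noon"], ["lunch", "noon"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(classify_nb(["win", "cash"], model)[0])
```

Taking the arg max over the scores corresponds to selecting MaxIndex in step 3.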
In formula (6), $P(d_e \mid c_k)$ and $P(c_k)$ are nonnegative, and $P(d_e)$ is a positive value that is the same for every classification, so for the sake of convenience, the numerator of formula (6) can be used as the statistical parameter in practical applications.

III. SPAM MESSAGE DATA CLASSIFICATION FROM A MOBILE OPERATOR

The naïve Bayes algorithm is adopted in this paper to address the classification of spam message data from a mobile operator. These data have been tagged and classified manually and can be divided into two types: reported data and arbitral data. The reported data are messages reported to the operator as spam by users, with the basic form "reported mobile phone number + message content + classification label". The classification tags contained in the reported data are SP decoy information, commercial advertisements, prostitution information, gambling information, propaganda information from the mobile company, mafia information, fraud information and reactionary information. The arbitral data are messages judged as spam by the operator according to current spam message filtration technology, with the basic form "message content + classification tag". Their classification tags include customized SP decoy information, commercial advertisement information, prostitution information, political information, and crime information such as that regarding gambling, the mafia, and fraud. The data forms are shown in Tables I and II (the data come from a mobile communication corporation).

TABLE I THE REPORTED MESSAGE DATA FORM
Field name | Field attribute | Field description
Reported mobile phone number | String | Text character type
Message content | String | Large text character type
Classification tag | String | Spam message classification tag

TABLE II THE ARBITRAL MESSAGE DATA FORM
Field name | Field attribute | Field description
Message content | String | Text character type
Classification tag | String | Spam message classification tag

As these two types of data are both spam messages with similar corresponding classification tags, we can combine their tags for the convenience of later classification. The combination process is shown in Table III:

TABLE III SPAM MESSAGE DATA INTRODUCTION
Combined tag | Reported data tag | Arbitral data tag
Political information | Reactionary information | Political information
Commercial information | Commercial advertisement, propaganda information from the mobile company | Commercial advertisement information
Prostitution information | Prostitution information | Prostitution information
Mafia information | SP decoy information, gambling information, mafia information | Customized SP decoy information, crime information such as gambling and mafia
Fraud information | Fraud information | Crime-fraud
Other spam messages | Other spam messages | -
Other information | Other information | -

Other information from the reported data can be used as normal messages for tests, while other spam messages from the reported data can be used as spam messages for tests. After combining those two types of spam messages, we perform duplicate removal. This step is needed because spam messages often include considerable repeated information sent to many users by fraudulent senders, which is useless for our classification. The comparison of the quantity of message data after duplicate removal is shown in Table IV.

TABLE IV INTRODUCTION OF THE DATA AFTER COMBINING THE SPAM MESSAGE CLASSIFICATION TAGS
Combined tag | Total | Quantity after duplicate removal | Percentage (%)
Political information | 538 | 784 | 4.57
Commercial information | 374844 | 96778 | 3.05
Prostitution information | 3975 | 9807 | 4.90
Mafia information | 538 | 0935 | 7.
Fraud information | 507890 | 839 | 5.58
Other spam messages | 3488 | 3576 | 0.46
Other information | 69686 | 676 | 39.6
Total | 5538787 | 3583 | 5.88

The number of messages after duplicate removal is greatly reduced and is only 6% of the original data number. Regarding the number of messages, there is a large amount of information with the commercial and mafia tags but much less information with the political and other spam messages tags. Regarding the duplicate removal percentage, the commercial, mafia and fraud tags contain many repeated messages and have a larger percentage of removed duplicates, while the amount of repeated messages with the political, prostitution and other information tags is relatively small. It can be seen that the spam messages are mainly tagged as commercial and mafia information, which generally contain a great amount of repeated messages. For the operation efficiency and accuracy of the classification model, we should further carry out text preprocessing on the spam messages to obtain the ideal text data form for the classification model.

3.1 Text Preprocessing

Text preprocessing is a necessary and essential process. During preprocessing, we should not only clean the data but also extract and save some important features [10], such as the mobile phone number, telephone number, URL, bank card number, WeChat, and QQ. These electronic addresses play an important role: screening the electronic addresses of a message when it is first entered into the classification model helps identify suspicious messages and further improves the detection efficiency. Therefore, each stage of preprocessing should be carried out precisely. The following cleaning process was finally obtained after continuous adjustment through experiments (Fig. 1):

Message sample data
1. Full-width and half-width conversion
2. Generalized character conversion
3. Simplified and traditional text conversion
4. Uppercase and lowercase conversion
Match time and delete
1. Match email, replace and save it
2. Match website, replace and save it
3. Match bank card No., replace and save it
4. Match mobile phone No., replace and save it
5. Match telephone number with fixed form, replace and save it
6. Match telephone number without fixed form, replace and save it
Delete the reported number
Convert special characters and radicals
Remove excessive numbers and English letters
Remove separators, keeping only Chinese text, English letters and numbers
Match the bank card No. and mobile phone No. again, replace and save them
Match QQ No., replace and save it
Match WeChat No., replace and save it
Match the telephone No. without fixed form again, replace and save it
Pretreatment completed
Fig. 1 The process structure of text preprocessing

3.2 Text Feature Extraction

Clean Chinese message texts suitable for analysis are obtained after the above text preprocessing. We then extract the text features. As the data are Chinese text instead of ordinary numerical data, it is necessary to perform word segmentation first, remove the stop words and finally convert the text into a computable form [11].
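A few of the cleaning steps of Fig. 1 (width conversion, case conversion, matching, replacing and saving electronic addresses, and keeping only Chinese text, letters and digits) can be sketched in Python with the standard `re` and `unicodedata` modules. The patterns, the placeholder names and the function name `preprocess` are illustrative assumptions, not the rules used in the paper; real rules for bank cards, telephone numbers, QQ and WeChat would need to be tuned on the data.

```python
import re
import unicodedata

# Hypothetical patterns (assumptions); re.ASCII keeps \w and \d from matching Chinese text
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.ASCII)),
    ("URL", re.compile(r"(?:https?://|www\.)[\w./?=&%-]+", re.ASCII)),
    ("BANKCARD", re.compile(r"(?<!\d)\d{16,19}(?!\d)", re.ASCII)),
    ("MOBILE", re.compile(r"(?<!\d)1\d{10}(?!\d)", re.ASCII)),
]

def preprocess(message):
    """Return the cleaned text and the electronic addresses saved from it."""
    # Full-width to half-width conversion, then lowercase conversion
    text = unicodedata.normalize("NFKC", message).lower()
    saved = {}
    for name, pattern in PATTERNS:
        found = pattern.findall(text)
        if found:
            saved[name] = found                # save the matched addresses
            text = pattern.sub(name, text)     # replace them with a placeholder
    # Remove separators: keep only Chinese characters, English letters and digits
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", text)
    return text, saved

cleaned, addresses = preprocess("请联系ＡＢＣ@163.com 或致电13812345678！")
print(cleaned, addresses)
```

Matching the longer digit runs (bank card) before the shorter ones (mobile phone) is a design choice of this sketch so that a card number is not partly consumed as a phone number.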
Jieba word segmentation, which is widely used for processing Chinese text in Python, is applied to accomplish the word segmentation process. Feature extraction is achieved with the TF-IDF algorithm, which is commonly used for text and is also implemented in Python. The detailed text feature extraction process is shown in Fig. 2:

Preprocessed message text data -> Jieba word segmentation -> Remove stop words -> TF-IDF feature extraction
Fig. 2 The process structure of text feature extraction

1. Jieba word segmentation

Jieba word segmentation is an effective algorithm for the word segmentation of Chinese text with high accuracy and fast speed, which is quite appropriate for text analysis [12]. Jieba word segmentation involves the following algorithms [13]:

1) Achieve efficient word graph scanning based on the Trie tree structure, and generate a directed acyclic graph (DAG) constituted by all possible word formations of the Chinese characters in a sentence;
2) Adopt dynamic programming to find the maximum-probability path and the maximum segmentation combination based on word frequency;
3) Adopt an HMM model for unregistered words based on Chinese word formation and the Viterbi algorithm.

2. TF-IDF calculation steps [14]

1) Calculate the term frequency:
Term frequency (TF) = number of times a certain word appears in the text
2) Calculate the inverse document frequency:
Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents including the word + 1))
(1 is added so that the denominator is never 0.)
3) Calculate the TF-IDF value:
TF-IDF value = TF × IDF
(The TF-IDF value is proportional to the frequency of a word in the text and inversely proportional to the number of times the word appears in the whole corpus, which accords with the previous analysis.)
4) Find the keywords:
After calculating the TF-IDF value of each word in the text, sort the values and select the several words with the highest values as the keywords.

3. The statistical analysis of features with different weights

The feature words of each type have their own characteristics, representative of that type and distinct from the other types, which agrees with the result of the TF-IDF algorithm and is suitable for text analysis [15]. However, because there is too much text content in each classification, the calculated weight values are generally small. Next, we make a statistical analysis of the features with different weights.

TABLE V TOTAL NUMBER OF FEATURES EXTRACTED BY THE TF-IDF ALGORITHM WITH DIFFERENT WEIGHT
Classification | 0.01 | 0.02 | 0.03 | 0.05
Commercial | 566 | 07 | 09 | 46
Other message | 8 | 8 | 50 | 3
Other spam message | 5 | 70 | 3 | 4
Mafia | 88 | 07 | 45 | 9
Prostitution | 357 | 8 | 04 | 43
Fraud | 347 | 93 | 5 | 64
Political | 35 | 30 | 76 | 43

Table V shows that if the threshold value of the weight is raised by one percentage point, the total number of features declines rapidly. When the weight is more than 0.05, the number of features of each classification is less than 00. When the weight is more than 0.0, the number of features of each classification is more than 00.
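The TF-IDF steps and the effect of the weight threshold can be sketched as follows. This is a minimal illustration, not the authors' code: each class is treated as one document, the term frequency is normalized by the document length (an assumption made here so that weights are small fractions), and the names `tfidf` and `select_features` and the toy corpus are hypothetical. The IDF uses the "+ 1" denominator described in step 2.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: dict mapping a class name to the token list of that class."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))           # number of documents containing each word
    weights = {}
    for name, tokens in corpus.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[name] = {
            # TF (normalized, an assumption) times IDF with the +1 denominator
            w: (c / total) * math.log(n_docs / (doc_freq[w] + 1))
            for w, c in counts.items()
        }
    return weights

def select_features(weights, threshold):
    """Keep, for each class, the words whose TF-IDF weight reaches the threshold."""
    return {name: sorted(w for w, v in ws.items() if v >= threshold)
            for name, ws in weights.items()}

corpus = {
    "fraud": ["bank", "bank", "transfer", "hello"],
    "commercial": ["sale", "sale", "discount", "hello"],
    "political": ["vote", "hello", "news", "news"],
}
weights = tfidf(corpus)
print(select_features(weights, 0.15))
print(select_features(weights, 0.10))
```

In this toy corpus the lower threshold keeps more words per class, which mirrors the trend reported in Table V.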
Clearly, our classification problem is a multiple classification problem, and each classification has its own obvious features. Therefore, we prefer to establish multiple two-classification models for such problems. Multiple two-classification models not only have obvious features but can also optimize the performance. One hundred dimensional features can be extracted for each classification, and the computation dimension can be simplified greatly, which is an optimization for our classification process.

3.3 Spam Message Classification Model

Text classification for spam messages can be performed after a series of text preprocessing and feature extraction steps.

3.3.1 Multi-classification Naïve Bayes Spam Message Classification Model

1. Model framework

The naïve Bayes algorithm performs well on classification problems with high dimensionality. Focusing on the problem of the multiple classification of spam messages, a multi-classification naïve Bayes spam message classification model is presented in this paper, which is shown in Fig. 3:

[Fig. 3 flowchart: the preprocessed text data of each classification are divided into a 70% training data set and a 30% test data set; both pass through Jieba word segmentation with the tag of each classification added, then feature extraction and document vector representation; the training data set trains the classifier, and the resulting classification model is applied to the test data set to give the classification result.]
Fig. 3 Multi-classification naïve Bayes spam message classification model

2. Realization process

The multi-classification naïve Bayes model focuses on multi-classification problems. The data set has seven classification tags; 30% of the data set forms the test data set, and the other 70% forms the training data set. Each message is processed with Jieba word segmentation and feature extraction and is finally converted into a document vector for model training and testing. Python is adopted to realize our model. The pseudocode of the model is given as follows:

test_data = []                                       // Test data set
train_data = []                                      // Training data set
for i = 1:N                                          // Number of spam message classifications
    fobj = file.open(file(i))                        // Read the preprocessed text data of each classification
    while True:
        raw = fobj.readline()                        // Read each message
        if raw:
            word_cut = jieba.cut(raw)                // Jieba word segmentation
            if test_data.length > 0.3 * fobj.length: // Judgment: test set 30%, training set 70%
                train_data.append(word_cut, i)       // Add data to the training set: word segmentation + classification
            else:
                test_data.append(word_cut, i)        // Add data to the test set: word segmentation + classification
        else:
            break
word_features = get_features()                       // Read the TF-IDF feature values
test_data = document_features(test_data)             // Test set document vector process
train_data = document_features(train_data)           // Training set document vector process
classify = NaiveBayesClassifier.train(train_data)    // Carry out classification training on the training data set
classify.test(test_data)                             // Carry out test inspection for classification

3. Experiment result and analysis

After the seven classifications are combined, the classification performance of the naïve Bayes algorithm with different TF-IDF weights and different feature dimensions is as shown in Table VI.

TABLE VI EXPERIMENT RESULT OF MULTI-CLASSIFICATION NAÏVE BAYES ALGORITHM MODEL
Weight | Feature dimension | Accuracy (%) | Time consumed (s)
0.05 | 54 | 63.7 | 3.98
0.03 | 533 | 69.74 | 9.89
0.02 | 963 | 73.80 | 5.97
0.01 | 98 | 78.78 | 9.6

In the multi-classification algorithm model shown in Table VI, the proportional sampling method is used, and 35,90 data are sampled for algorithm evaluation. The time consumed is the training time of the algorithm model, and the accuracy refers to the precision of the model on the test set. Table VI shows that the accuracy tends to increase as the weight decreases, i.e., as the feature dimension increases. However, the time consumed doubles, and the computational complexity increases greatly. Among the results, the accuracy with the 54-dimensional features at weight 0.05 is only 63%, which is far from the classification result we expect. Although the accuracy with the 98-dimensional features at weight 0.01 increases to 78%, the time consumed is nearly half a minute. The reason for the lower accuracy is that features of more than 000 dimensions cannot separate these seven classifications.
The more types of classification there are, the more interference and noise each classification receives. Therefore, the efficiency of the multi-classification naïve Bayes model with the seven-classification combination is far from satisfactory.

3.3.2 Multi-two-classification Naïve Bayes Spam Message Classification Model

1. Model framework

From the research on text feature extraction, we know that relatively obvious features can be obtained through TF-IDF feature extraction. The various types of spam messages have various feature words representing their characteristics, and different weight values lead to different total numbers of features. Spam message classification is a multi-classification problem involving seven types of data, so a single multi-classification model must consider the feature words of all seven types together as its classification features. As a result, the feature dimension becomes dramatically large, which leads to slow computation and low classification accuracy. According to the experiment results of the multi-classification naïve Bayes model in the above section, the model can be improved by dividing the multi-classification problem into multiple two-classification problems, which not only reduces the computational complexity but also improves the classification accuracy. A multi-two-classification naïve Bayes spam message classification model is constructed in this paper, which requires seven naïve Bayes two-classification models. The detailed process of each model is shown in Fig. 4:

[Fig. 4 flowchart: the preprocessed text data of each classification are separated into the target data set (classification tag 0) and the other types of data set (classification tag 1); each is divided into a 30% test data set and a 70% training data set and passes through Jieba word segmentation, feature extraction and document vector representation; the training data set trains the classifier, and the test set tests the classification model.]
Fig. 4 Naïve Bayes two-classification model
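The data preparation of Fig. 4 for one two-classification model can be sketched in Python as below. The function names `split`, `one_vs_rest_sets` and `document_features`, the fixed random seed and the Boolean document vector are assumptions made for illustration; the paper does not specify how the 30%/70% division is drawn.

```python
import random

def split(items, test_ratio, rng):
    """Shuffle a copy of items and return (training part, test part)."""
    items = items[:]
    rng.shuffle(items)
    n_test = round(len(items) * test_ratio)
    return items[n_test:], items[:n_test]

def one_vs_rest_sets(samples, target, test_ratio=0.3, seed=0):
    """samples: list of (tokens, class_name). The target class gets tag 0, all others tag 1."""
    rng = random.Random(seed)
    target_docs = [(tokens, 0) for tokens, name in samples if name == target]
    other_docs = [(tokens, 1) for tokens, name in samples if name != target]
    train_t, test_t = split(target_docs, test_ratio, rng)
    train_o, test_o = split(other_docs, test_ratio, rng)
    return train_t + train_o, test_t + test_o

def document_features(tokens, feature_words):
    """Document vector: whether each feature word of the target class occurs in the message."""
    present = set(tokens)
    return {w: (w in present) for w in feature_words}

samples = [(["bank", "transfer"], "fraud")] * 10 + [(["sale", "discount"], "commercial")] * 20
train, test = one_vs_rest_sets(samples, "fraud")
print(len(train), len(test))
print(document_features(["bank", "hello"], ["bank", "transfer"]))
```

Repeating `one_vs_rest_sets` once per class, each time with that class's own feature words, would give the seven two-classification data sets the model requires.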

. Relizion process To consruc muli-wo-clssificion nïve Byes clssificion, seven nïve Byes wo-clssificion s for he seven ypes of d should be buil firs. The clssificion g for one ype of d is 0, nd he clssificion g for oher ypes of d is. The feure vlue of he ype of d wih g 0 is used for feure exrcion, which helps reduce he feure dimension of ech clssificion nd improve he clssificion of his. Ech nïve Byes should go hrough processes such s d se division, Jieb word segmenion, feure exrcion nd documen vecor represenion, nd he s should finlly undergo rining nd esing. The following pseudocodes indice he relizion process of he nïve Byes wo-clssificion. es_d = [] // Tes d se rin_d = [] // Trining d se fori = :N // Number of spm messge clssificion fobj = file.open(file(i)) // Red d of ech clssificion preprocessed if fobj.nme = 'xxx' // If i is he rge clss, he clssificion g is 0 while True: rw = fobj.redline() // Red ech messge if rw: word_cu = jieb.cu(rw) // Jieb word segmenion ifes_d.lengh>0.3*fobj.lengh: //Judge, es se 30%, Trining se 70% rin_d.ppend(word_cu,0) // Add d o Trin se, word segmenion + ype es_d.ppend(word_cu,0) // Add d o es se, word segmenion + ype brek // Oher ypes of d, he clssificion g is while True: rw = fobj.redline() // Red ech messge if rw: word_cu = jieb.cu(rw) // Jieb word segmenion ifes_d.lengh>0.3*fobj.lengh: //Judge, es se 30%, Trining se 70% rin_d.ppend(word_cu,) // Add d o Trin se, word segmenion + ype es_d.ppend(word_cu,) // Add d o es se, word segmenion + ype brek word_feures = ge_feures('xxx') // Red cerin ype of TF-IDF feure vlue es_d = documen_feures(es_d) // Tes se documen vecor remen rin_d = documen_feures(rin_d) // Trining se documen vecor remen clssify = NïveByesClssifier.rin(rin_d) // Clssify he rining d se clssify.es(es_d) // Tes he clssificion 3. 
Experimen resuls nd nlysis The experimen resuls of he muli-wo-clssificion nïve Byes re shown in Tble VII: TABLE VII EXPERIMENTAL RESULTS OF THE MULTI-TWO-CLASSIFICATION NAÏVE BAYES CLASSIFICATION MODELS Clssificion Commercil Oher messge Oher spm messge Mfi Prosiuion Frud Poliicl TF-IDF weigh Feure dimension Accurcy (%) Time consumed (s) 0.05 46 76.57 0.86 0.03 09 80.43.07 0.0 07 83.68 3.94 0.0 566 88.5.6 0.05 3 86.04 0.6 0.03 50 85.8 0.93 0.0 8 86.3.59 0.0 8 87.09 3.44 0.05 4 98.0 0.7 0.03 3 98.3 0.60 0.0 70 97.86.34 0.0 5 97.87.96 0.05 9 73.7 0.36 0.03 45 76.79 0.88 0.0 07 80.36.95 0.0 88 84.63 5.56 0.05 43 97.4 0.84 0.03 04 97.58.98 0.0 8 97.73 3.5 0.0 357 97.88 6.99 0.05 64 9.79. 0.03 5 93.86.44 0.0 93 93.85 3.7 0.0 347 94.44 7.0 0.05 43 99.94 0.86 0.03 76 99.94.50 0.0 30 99.93.50 0.0 35 99.94 6.4 The bove muli-clss nd proporionl smpled 35,90 d poins is doped for he muli-wo-clssificion nïve Byes experimen. The experimen resuls show h he clssificion ccurcy ends o increse s he weigh decreses, i.e., he feure dimension increses. Alhough he rising rend is no significn, he ccurcy obined is quie sisfcory, nd he ime consumed by he lgorihm rining is lso close o our idels. Among hese seven clssificions, commercil, oher messge nd mfi clssificion hve lower ccurcy, which is lower hn 90% bu generlly higher hn 80%. The oher four clssificions, i.e., he prosiuion, frud, poliicl nd oher spm messge gs, hve higher ccurcy, which is more hn 90%. The ccurcy of poliicl clssificion is nerly 00%, which hs excellen efficiency for clssificion. Thus, i cn be seen h he clssificion performnce of muli-wo-clssificion is fr higher hn h of muli-clssificion. Regrding he ime consumpion of he lgorihm, if he feure dimension is low, he lgorihm is he fses when he weigh is 0.05, bu he overll ccurcy re is low. If he feure dimension is higher, he ccurcy re
will become higher when the weight is 0.01, but the time consumed will increase by almost 3 s or even more than 5 s. On the whole, when the weight is 0.02, the comprehensive classification performance is the best. At that weight, the feature dimension of each classification remains in the low hundreds, and the time consumed stays at a few seconds. Therefore, the multi-two-classification naïve Bayes algorithm has the best efficiency for these seven classifications of messages when the TF-IDF feature extraction algorithm uses a threshold of 0.02.

4. Classification scheme
The above is the experimental research of each classification model in the two-classification. Seven classification results will be obtained for every unknown message. In most cases, a unique classification tag will be obtained through statistical analysis of the conditional probability of each classification. However, when two or more classification tags share the maximum conditional probability, the message will be classified manually. The following diagram is the final process of the classification (Fig. 5).

[Flowchart: Unknown message -> Text preprocessing -> Feature extraction -> Multi-two-classification naïve Bayes -> When the classification probability is maximum, is there a unique classification tag? Y: classify as the tag of this classification. N: manually judge the classification tag.]
Fig. 5. The classifying solution of the multi-two-classification

3.4 Performance Comparison of Different Classification Algorithms with Multi-two-classifications
In this section, the naïve Bayes algorithm is compared with two other classification algorithms that are used often; one is the support vector machine, and the other is the random forest algorithm.

The support vector machine: The basic concept of the support vector machine is to achieve a compromise between the accuracy (for a given training set) and the machine capacity (the capability of the machine to learn any training set without mistake) for a specified learning task with limited training samples, and thus to obtain the best generalization performance [6].
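The compromise described above can be illustrated with the standard soft-margin objective of a linear support vector machine. This is only a minimal sketch: the source does not state which SVM formulation, kernel, library, or parameter values were used, so the linear hinge-loss form, the function name `soft_margin_objective`, the toy data, and the values of `C` are all assumptions made for illustration.

```python
def soft_margin_objective(w, b, samples, labels, C):
    """Soft-margin linear SVM objective: 0.5 * ||w||^2 + C * sum of hinge losses.

    The first term limits capacity (a small ||w|| means a wide margin);
    the second term penalizes training errors and margin violations.
    Labels must be +1 or -1. All names are illustrative assumptions.
    """
    capacity_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(samples, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge_total += max(0.0, 1.0 - y * score)  # zero when the point is outside the margin
    return capacity_term + C * hinge_total


# Hypothetical 2-D toy data: two points outside the margin, one inside it.
samples = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, 1]
w, b = [1.0, 0.0], 0.0

low_c = soft_margin_objective(w, b, samples, labels, C=0.1)    # capacity term dominates
high_c = soft_margin_objective(w, b, samples, labels, C=10.0)  # training-error term dominates
```

With a small `C`, the margin (capacity) term dominates the objective; with a large `C`, the same margin violation is penalized far more heavily, which pushes the learner toward fitting the training set more closely.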
The random forest: Taking the decision tree as the basic classifier, the random forest algorithm adopts the sampling-with-replacement method of the bagging algorithm for training set sampling and, following the random subspace method, extracts only part of the features for training. Finally, the classification result is determined by the vote of the trained decision trees [9].

The classification efficiency comparison between the naïve Bayes algorithm, the support vector machine, and the random forest algorithm is shown in Table VIII. The proportionally sampled data set is used, with a TF-IDF threshold of 0.02 for feature extraction.

TABLE VIII
EXPERIMENTAL RESULTS OF DIFFERENT CLASSIFICATION ALGORITHMS WITH MULTI-TWO-CLASSIFICATION

                      Naïve Bayes                Support vector machine     Random forest (0)
Classification        Accuracy (%)   Time (s)    Accuracy (%)   Time (s)    Accuracy (%)   Time (s)
Commercial            83.68          3.94        8.5            9.69        77.97          7.57
Other message         86.3           .59         86.76          0.07        8.6            .86
Other spam message    97.86          .34         99.6           .3          99.8           .89
Mafia                 80.36          .95         79.30          5.39        74.63          3.44
Prostitution          97.73          3.5         97.3           6.5         97.58          4.86
Fraud                 93.85          3.7         93.36          9.97        93.45          5.75
Political             99.93          .50         99.86          .95         99.94          3.0

The experiment results in Table VIII show that the accuracies of the naïve Bayes and support vector machine algorithms are higher and more similar to each other than the accuracies of the random forest algorithm, which are slightly lower for some classifications. In general, the naïve Bayes classification has high and stable accuracy values. Regarding the time consumed by the three algorithms, it is clear that the support vector machine and random forest are not as fast as the naïve Bayes, and their performance is unstable. On the whole, after comparing the different algorithms, the naïve Bayes has the best classification performance, which is not only feasible but also achieves the ideal result.

IV.
CONCLUSIONS
This paper puts forward a spam message classification model based on the naïve Bayes algorithm and estimates the performance of the naïve Bayes algorithm based on multi-classification and multi-two-classification in spam message classification, using spam message text preprocessing based on Java regular expressions and spam message feature extraction based on Jieba word segmentation and the TF-IDF algorithm. This paper further compares the classification performance of the naïve Bayes, support vector machine, and random forest algorithms with multi-two-classification. The experiment results show that the multi-two-classification naïve Bayes algorithm has the best efficiency of the three models. However, the data used in the process of spam message classification are only part of the sampled data. With the development of big data technology, how to perform feature extraction and text classification for bulk data on the basis of big data computation will be our future research direction.

REFERENCES
[1] C. Huang, "Research on SMS filtering technology on intelligent mobile phone," M.S. thesis, Huazhong University of Science and Technology, Wuhan, China, 0.
[2] J. Ma, Y. Zhang and Z. Wang, "A message topic model for multi-grain SMS spam filtering," International Journal of Technology and Human Interaction (IJTHI), vol. 12, no. 2, pp. 83-95, 2016.
[3] S. J. Delany, M. Buckley and D. Greene, "Review: SMS spam filtering: Methods and data," Expert Systems with Applications, vol. 39, no. 10, pp. 9899-9908, 2012.

[4] J. Fdez-Glez, D. Ruano-Ordás and J. R. Méndez, "A dynamic model for integrating simple web spam classification techniques," Expert Systems with Applications, vol. 42, no. 21, pp. 7969-7978, 2015.
[5] A. Harisinghaney, A. Dixit and S. Gupta, "Text and image based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm," in Optimization, Reliability, and Information Technology (ICROIT), 2014 International Conference on. IEEE, 2014, pp. 153-155.
[6] A. Gelman, J. B. Carlin and H. S. Stern, Bayesian data analysis. Boca Raton, FL: CRC Press, 2014.
[7] S. Ajaz, M. Nafis and V. Sharma, "Spam Mail Detection Using Hybrid Secure Hash Based Naïve Classifier," International Journal of Advanced Research in Computer Science, vol. 8, no. 5, 2017.
[8] G. Feng, B. An and F. Yang, "Relevance popularity: A term event based feature selection scheme for text classification," PLoS One, vol. 12, no. 4, pp. e0174341, 2017.
[9] D. M. Diab and K. M. El Hindi, "Using differential evolution for fine tuning naïve Bayesian classifiers and its application for text classification," Applied Soft Computing, vol. 54, pp. 183-199, 2017.
[10] P. Chandrasekar and K. Qian, "The Impact of Data Preprocessing on the Performance of a Naïve Bayes Classifier," in Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual. IEEE, 2016, pp. 618-619.
[11] X. Y. Wang, "A study on the Chinese text summarization method based on concept lattice," M.S. thesis, Beijing Institute of Technology, Beijing, China, 2015.
[12] L. Liu, Y. Lu and Y. Luo, "Detecting "Smart" Spammers On Social Network: A Topic Model Approach," arXiv preprint arXiv:1604.08504, 2016.
[13] D. Ye, P. Huang and K. Hong, "Chinese Microblogs Sentiment Classification using Maximum Entropy," in Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pp. 171-179, 2015.
[14] M. H. Arif, J. Li and M. Iqbal, "Sentiment analysis and spam detection in short informal text using learning classifier systems," Soft Computing, pp. 1-11, 2017.
[15] J. K. Yi and L. K. Tian, "A text feature selection algorithm based on class discrimination," Journal of Beijing University of Chemical Technology (Natural Science), vol. 40, pp. 7-75, 2013.
[16] M. Diale, C. Van Der Walt and T. Celik, "Feature selection and support vector machine hyper-parameter optimisation for spam detection," in Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech) 2016. IEEE, 2016, pp. 1-7.