Complexity of Data Tree Patterns over XML Documents

Claire David
LIAFA, University Paris 7 and CNRS, France
cdavid@liafa.jussieu.fr

Work partly supported by DocFlow (ANR-6-MDCA-5).

Abstract. We consider Boolean combinations of data tree patterns as a specification and query language for XML documents. Data tree patterns are tree patterns plus variable (in)equalities which express joins between attribute values. Data tree patterns are a simple and natural formalism for expressing properties of XML documents. We first consider the model checking problem (query evaluation) and show that it is DP-complete¹ in general and already NP-complete when we consider a single pattern. We then consider the satisfiability problem in the presence of a DTD. We show that it is in general undecidable and we identify several decidable fragments.

1 Introduction

The relational model and its popular query language SQL are widely used in database systems. However, it does not fit well in the ever changing Internet environment, since its structure is fixed by an initially specified schema which is difficult to modify. When exchanging and manipulating large amounts of data from different sources, a less structured and more flexible data model is preferable. This was the initial motivation for the Extensible Markup Language (XML) model which is now the standard for data exchange.

An XML document is structured as an unranked, labelled tree. The main difference with the relational model is that in XML, data is also extracted because of its position in the tree and not only because of its value. Consequently, all the tools manipulating XML data, like XML query languages and XML schema, combine navigational features with classical data extraction ones. XPath² is a typical example. It has a navigational core, known as Core-XPath and studied in [16], which is essentially a modal language that walks around in the tree. XPath also allows restricted tests on data attributes. It is the building block of most XML query languages (XQuery, XSLT...). Similarly, in order to specify integrity constraints in XML Schema, XML languages have navigational features for description of walks in the tree and selection of nodes. The nodes are for instance chosen according to a key or a foreign key [15].

¹ A DP problem is the intersection of an NP problem and a co-NP problem.
² Throughout the paper, XPath refers to XPath 1.0.

In this paper, we study an alternative formalism as a building block for querying and specifying XML data. It is based on Boolean combinations of data tree patterns. A data tree pattern is essentially a tree with child or descendant edges, labelled nodes and (in)equality constraints on data values. Intuitively, a document satisfies a data tree pattern if there exists an injective mapping from the tree pattern into the tree that respects edges, node labels and data value constraints. Using patterns, one can express properties on trees in a natural, visual and intuitive way. These properties can express queries, as well as some integrity constraints.

At first glance, the injectivity requirement does not seem important; however, it has some consequences in terms of expressive power. As we do not consider horizontal order between siblings, without injectivity data tree patterns are invariant by bisimulation. Data tree patterns with injective semantics are strictly more expressive than with non-injective semantics. For example, it is not possible to express desirable properties such as "a node has two a-labelled children" without injectivity. Another consequence of injectivity appears when considering conjunctions of data tree patterns. With non-injective semantics, the conjunction of two patterns would be equivalent to a new pattern obtained by merging the two patterns at the root. With injectivity this no longer works and we have to consider conjunctions of tree patterns. This difference appears when we study the complexity of the satisfiability problem: for one pattern the problem is PTime while it is intractable for a conjunction of patterns.

XPath and data tree patterns are incomparable in terms of expressiveness. Without data values, XPath queries are closed under bisimulation while data tree patterns are not. On the other hand, XPath allows negation of subformulas while we only allow negation of a full data tree pattern. For example, XPath can check whether a node has a-labelled children but no b-labelled child. This is not possible with Boolean combinations of tree patterns. In terms of data comparison, XPath allows very limited joins because XPath queries cannot compare more than two elements at a time, while a single pattern can compare simultaneously an arbitrary number of elements.

In this paper, to continue this comparison, we study the complexity of two questions related to data tree patterns: the model checking problem (query evaluation) and the satisfiability problem in the presence of a schema. The evaluation of XPath queries has been extensively studied (see [6] for a detailed survey). The evaluation problem is PTime for general XPath queries. In our case, this problem is more difficult: the combined complexity of the model checking problem for Boolean combinations of data tree patterns is intractable. We prove that it is DP-complete in general and already NP-complete when considering only one tree pattern. The satisfiability problem for XPath is undecidable in general [5]. However, for many fragments the problem is decidable with a complexity ranging from NP to NExpTime. Similarly, for Boolean combinations of data tree patterns the satisfiability problem is undecidable in general. We identify several decidable fragments by restraining the expressivity of tree patterns or by bounding the depth of the documents. The corresponding complexities range from NP to 2ExpTime.

Related Work: Tree patterns have already been investigated in a database context, often without data values [22, 3, 2]. The focus is usually optimisation techniques for efficient navigation [1, 12, 7]. In this work, we focus on the difficulty raised by data values and we are not interested in optimisation but in the worst case complexity for the model checking and satisfiability problems. Several papers considered the non-injective semantics of tree patterns with data constraints. First, [19] considered the satisfiability problem for one positive pattern while we consider Boolean combinations of tree patterns. Then, the authors of [2] consider the type checking problem, which is more powerful than unsatisfiability but incomparable to the satisfiability problem. Data tree patterns are used in [4] to specify data exchange settings. They study two problems: the first one is consistency of data exchange settings, the second one is query answering under data exchange settings. Given a conjunction of data tree patterns and a DTD, we can construct a data exchange setting such that the consistency of this setting is equivalent to the satisfiability of the conjunction of patterns in the presence of the DTD. However, the data tree patterns they consider are less expressive than ours, in that they cannot express inequality constraints on data values nor Boolean combinations of data tree patterns. The other problem considered in that paper is query answering. This problem seems related to our model checking problem. However, it does not seem possible to use their result or their proof techniques.

Fragment of XPath: In [14], the authors consider an XPath fragment (simple XPath) allowing only vertical navigation but augmented with data comparisons. Negation is disallowed, both in the navigation part and in the comparison part. A simple XPath expression can be viewed as a pattern with non-injective semantics and only data equality. They study the inclusion problem of such expressions with respect to special schemes (SXIC) containing integrity constraints like inclusion dependency. We cannot simulate inclusion dependency even with Boolean combinations of data tree patterns. Hence, their framework is incomparable to ours.

Conjunctive queries on trees: Conjunctive queries on trees can be expressed by tree patterns. They were considered in [17, 8] without data values. Very recently [9], an extension by schema constraints was proposed, and in very few cases they allow data comparison. Notice that, without sibling predicates, those conjunctive queries are strictly less expressive than our framework because they do not allow negation and do not have an injective semantics. It is shown that the query satisfiability problem is NP-complete, whereas the query validity problem is 2ExpTime-complete. Moreover, the validity of a disjunction of conjunctive queries is shown to be undecidable. This last result corresponds to our undecidability result but the proof is different.

Logics over infinite alphabets: Another related approach is to consider logics for trees over an infinite alphabet. In [10, 11], the authors study an extension of First Order Logic with two variables. In [13, 18], the focus is on temporal logic and µ-calculus. These works are very elegant, but the corresponding complexities are non primitive recursive. Our work can be seen as a continuation of this work aiming for lower complexities.

Structure: Section 2 contains the necessary definitions. In Section 3, we consider the model-checking problem. In Section 4, we consider the satisfiability problem in general and the restricted cases. Section 5 contains a summary of our results and a discussion. Omitted proofs can be found in the appendix available at http://www.liafa.jussieu.fr/~cdavid/publi/mfcs8.pdf.

2 Preliminary

In this paper, we consider XML documents that are modeled as unordered, unranked data trees, as considered e.g. in [10].

Definition 1 A data tree over a finite alphabet Σ is an unranked, unordered, labelled tree with data values. Every node v has a label v.l ∈ Σ and a data value v.d ∈ D, where D is an infinite domain.

We only consider equality tests between data values. The data part of a tree can thus be seen as an equivalence relation ∼ on its nodes. In the following, we write u ∼ v for two nodes u, v if u.d = v.d, and we use the term class without more precision to denote an equivalence class for the relation ∼. The data erasure of a data tree t over Σ is the tree obtained from t by ignoring the data value v.d of each node v of t.

[Fig. 1. Examples: (a) data tree, (b) data erasure, (c) pattern]

Data tree patterns are a natural way to express properties of data trees, or to query such trees. They describe a set of nodes through their relative positions in the tree, and (in)equalities between their data values.

Definition 2 A data tree pattern P = (p, C=, C≠) consists of:
- an unordered, unranked tree p, with nodes labelled either by Σ or by the wildcard symbol *, and edges labelled either by | (child edges) or by || (descendant edges), and
- two binary relations C= and C≠ on the set of nodes of p.
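Definitions 1 and 2 translate fairly directly into data structures. The following is only a minimal sketch: the class names, the field names and the "child"/"desc" edge tags are illustrative assumptions, not notation from the paper. It represents data trees, their data erasure, and patterns with their two constraint relations.

```python
from dataclasses import dataclass, field
from typing import Hashable, List, Optional, Tuple


@dataclass(eq=False)  # nodes are compared by identity, not by content
class DataNode:
    """Node of a data tree: a label from Sigma, a data value from D, unordered children."""
    label: str
    data: Hashable
    children: List["DataNode"] = field(default_factory=list)


@dataclass
class ErasedNode:
    """Node of a data erasure: same shape and labels, but no data value."""
    label: str
    children: List["ErasedNode"] = field(default_factory=list)


def erase(t: DataNode) -> ErasedNode:
    """Data erasure: forget v.d at every node v, keep labels and shape."""
    return ErasedNode(t.label, [erase(c) for c in t.children])


def same_class(u: DataNode, v: DataNode) -> bool:
    """u ~ v iff u.d == v.d (only equality tests on data values are allowed)."""
    return u.data == v.data


@dataclass(eq=False)
class PatternNode:
    """Pattern node; label None stands for the wildcard (assumed encoding)."""
    label: Optional[str]
    # Outgoing edges, each tagged "child" or "desc" (hypothetical tags).
    edges: List[Tuple[str, "PatternNode"]] = field(default_factory=list)


@dataclass
class Pattern:
    """P = (p, C_eq, C_neq): a pattern tree plus two binary relations on its nodes."""
    root: PatternNode
    eq: List[Tuple[PatternNode, PatternNode]] = field(default_factory=list)
    neq: List[Tuple[PatternNode, PatternNode]] = field(default_factory=list)
```

Using identity-based equality for nodes is a design choice of this sketch: two pattern nodes with the same label must remain distinguishable, since the constraint relations refer to specific nodes.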

A data tree t satisfies a pattern P = (p, C=, C≠), and we write t ⊨ P, if there exists an injective mapping f from the nodes of p to the nodes of t that is consistent with the labelling, the relative positions of nodes, the branching structure and the data constraints. Formally, we require the following:
- for every node v from p with v.l ∈ Σ, we have v.l = f(v).l,
- for every pair of nodes (u, v) from p, if (u, v) ∈ C= (resp. (u, v) ∈ C≠) then f(u) ∼ f(v) (resp. f(u) ≁ f(v)),
- for every pair of nodes (u, v) from p, if (u, v) is an edge of p labelled by | (resp. by ||), then f(v) is a child (resp. a descendant) of f(u),
- for any nodes u, v, z from p, if (u, v) and (u, z) are both edges of p labelled by ||, then f(v) and f(z) are not related by the descendant relation in t.

A mapping f as above is called a witness of the pattern P in the data tree t. Notice that the semantics does not preserve the least common ancestor and asks for an injective mapping between the nodes of a pattern and those of the tree. This enables patterns to express integrity constraints. We will discuss the impact of those choices in Section 5. Data tree patterns can describe properties that XPath cannot, see e.g. the pattern in Fig. 1 (XPath cannot talk simultaneously about the two -nodes and the two b-nodes).

We denote by Ptn(∼, |, ||) the set of data tree patterns and by BC(∼, |, ||) the set of Boolean combinations of data tree patterns. We will also consider restricted patterns that do not use child relations or do not use descendant relations (denoted respectively by Ptn(∼, ||) and Ptn(∼, |)). From these, we derive the corresponding classes of Boolean combinations. Finally, BC+ (resp. BC−) denotes conjunctions of patterns (resp. conjunctions of negated patterns). In proofs, we consider the parse tree of a Boolean formula ϕ over patterns, denoted by T(ϕ). The leaves of this tree are labelled by (possibly negated) patterns and inner nodes are labelled by conjunctions or disjunctions. Such trees are of linear size in the size of the formula and can be computed in PTime.
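The four witness conditions can be checked mechanically. The sketch below is a naive illustration, not an algorithm from the paper: it enumerates every injective mapping and is therefore exponential in the size of the pattern. The dictionary-based encoding (arrays `labels`, `data`, `parent`, edge tags "child" and "desc", `None` for the wildcard) is an assumption made for brevity.

```python
from itertools import permutations


def is_descendant(parent, anc, node):
    """True if `anc` is a proper ancestor of `node` in the tree given by `parent`."""
    node = parent[node]
    while node is not None:
        if node == anc:
            return True
        node = parent[node]
    return False


def is_witness(tree, pat, f):
    """Check the four conditions for a mapping f (tuple: pattern node -> tree node)."""
    labels, data, parent = tree["labels"], tree["data"], tree["parent"]
    # 1. Labels agree, except on wildcard pattern nodes (label None).
    for v, lab in enumerate(pat["labels"]):
        if lab is not None and labels[f[v]] != lab:
            return False
    # 2. Data equalities and inequalities.
    if any(data[f[u]] != data[f[v]] for u, v in pat["eq"]):
        return False
    if any(data[f[u]] == data[f[v]] for u, v in pat["neq"]):
        return False
    # 3. Child edges map to children, descendant edges to descendants.
    for u, v, kind in pat["edges"]:
        if kind == "child" and parent[f[v]] != f[u]:
            return False
        if kind == "desc" and not is_descendant(parent, f[u], f[v]):
            return False
    # 4. Two descendant edges leaving the same node map to incomparable nodes.
    desc = [(u, v) for u, v, kind in pat["edges"] if kind == "desc"]
    for i, (u, v) in enumerate(desc):
        for u2, z in desc[i + 1:]:
            if u == u2 and (is_descendant(parent, f[v], f[z])
                            or is_descendant(parent, f[z], f[v])):
                return False
    return True


def satisfies(tree, pat):
    """t |= P iff some injective mapping is a witness (brute force)."""
    n, m = len(tree["labels"]), len(pat["labels"])
    # permutations(range(n), m) yields exactly the injective m-tuples.
    return any(is_witness(tree, pat, f) for f in permutations(range(n), m))


# "A node has two a-labelled children": expressible only thanks to injectivity.
two_a = {"labels": [None, "a", "a"],
         "edges": [(0, 1, "child"), (0, 2, "child")], "eq": [], "neq": []}
one_a = {"labels": ["r", "a", "b"], "data": [0, 5, 5], "parent": [None, 0, 0]}
both_a = {"labels": ["r", "a", "a"], "data": [0, 5, 5], "parent": [None, 0, 0]}
print(satisfies(one_a, two_a), satisfies(both_a, two_a))
```

Checking a single candidate mapping with `is_witness` takes polynomial time; only the enumeration in `satisfies` is expensive.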
Given a pattern formula from BC(∼, |, ||), the main problems we are interested in are the model-checking on a data tree (evaluation) and the satisfiability problem, in the general case as well as for interesting fragments.

Because the general structure of XML documents is usually constrained, we may consider DTDs as additional inputs. DTDs are essentially regular constraints on the finite structure of the tree. Since we work on unordered, unranked trees, we use as DTDs an unordered version of hedge automata. A DTD is a bottom-up automaton A where the transition to a state q with label a is given by a Boolean combination of clauses of the form #q' ≤ k, where q' is a state and k a constant (unary encoded). A clause #q' ≤ k is satisfied if there are at most k children in state q'. Adding a DTD constraint does not change the complexity results for the model-checking, since checking whether the data erasure of a tree satisfies a DTD is PTime. Therefore, we do not mention DTDs in the model-checking part. We consider the following problems:

Problem 1. Given a data tree t and a pattern formula ϕ, the model-checking problem asks whether t satisfies ϕ.
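To make the DTD model concrete, here is a small sketch of such a bottom-up automaton running on a data erasure. It rests on simplifying assumptions that are not in the paper: the automaton is taken to be deterministic (the first applicable rule fixes the state of a node), trees are nested `(label, children)` tuples, and the helper names `at_most`, `neg`, `conj`, `disj`, `run` and `accepts` are hypothetical.

```python
from collections import Counter


def at_most(q, k):
    """Clause #q <= k: at most k children are in state q."""
    return lambda counts: counts[q] <= k


def neg(c):
    return lambda counts: not c(counts)


def conj(*cs):
    # conj() with no argument is the condition that always holds.
    return lambda counts: all(c(counts) for c in cs)


def disj(*cs):
    return lambda counts: any(c(counts) for c in cs)


def run(tree, rules):
    """Bottom-up run on an erased tree (label, [subtrees]).

    rules is a list of (label, state, condition); the first rule whose label
    matches and whose condition holds on the multiset of child states gives
    the state of the node. Returns None if no rule applies.
    """
    label, children = tree
    child_states = [run(c, rules) for c in children]
    if any(s is None for s in child_states):
        return None
    counts = Counter(child_states)  # missing states count as 0
    for lab, state, cond in rules:
        if lab == label and cond(counts):
            return state
    return None


def accepts(tree, rules, final_states):
    return run(tree, rules) in final_states


# Example: every b is a leaf-level node in state qb; an a-node is in state qa
# if it has at least one and at most two children in state qb.
rules = [
    ("b", "qb", conj()),
    ("a", "qa", conj(at_most("qb", 2), neg(at_most("qb", 0)))),
]
print(accepts(("a", [("b", []), ("b", [])]), rules, {"qa"}))
```

Each node is visited once and each rule is evaluated on a multiset of child states, which matches the remark that checking a data erasure against a DTD is polynomial. A nondeterministic automaton would need sets of candidate states per node instead of a single state.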

Problem 2. Given a pattern formula ϕ and a DTD L, the satisfiability problem in the presence of a DTD asks whether ϕ is satisfied by some data tree whose data erasure belongs to L.

3 Model Checking

Patterns provide a formalism for expressing properties. In this section, we see how efficiently we can evaluate them. Our main result is the exact complexity of the model-checking problem for pattern formulas from BC(∼, |, ||).

Theorem 3 The model-checking problem for BC(∼, |, ||) is DP-complete.

The complexity class DP is defined as the class of problems that are the conjunction of an NP problem and a co-NP problem [21]. In particular, DP includes both NP and co-NP. A typical DP-complete problem is SAT/UNSAT: given two propositional formulas ϕ1, ϕ2, it asks whether ϕ1 is satisfiable and ϕ2 is unsatisfiable. The key to the proof of Theorem 3 is the case where only one pattern is present. This problem is already NP-complete.

Proposition 4 The model-checking problem for a single pattern from Ptn(∼, |, ||) is NP-complete.

Proof. The upper bound is obtained by an algorithm guessing a witness for the pattern in the data tree and checking in PTime whether the witness is correct. The lower bound is more difficult. It is obtained by a reduction from 3SAT. Given a propositional formula ϕ in 3-CNF, we build a data tree t_ϕ and a pattern P_ϕ of polynomial size, such that t_ϕ ⊨ P_ϕ iff ϕ is satisfiable. Because we consider the model-checking problem, the data tree is fixed in the input. Thus, it must contain all possible valuations of the variables and at least all possible true valuations of each variable. Moreover, one positive data tree pattern should identify a true valuation of the formula and check its consistency. Hence, it does not seem possible to use previously published encodings of 3SAT into trees. The pattern selects one valuation per variable and per clause. Its structure ensures that only one valuation per variable and per literal is selected. The constraints on data ensure the consistency of the selection.

The data tree and the tree of the pattern depend only on the number of variables and clauses of the formula. Only the constraints on data of the pattern are specific to the formula. They encode the link between variables and clauses. Let Σ = {,, X, Y, Z, #, } be the finite alphabet. Assume that ϕ has k variables and n clauses. The data tree t_ϕ is composed of k copies of the tree t_v and n copies of the tree t_c as depicted in Figure 2. Even if we consider unordered trees, each copy of t_v corresponds to a variable of the formula and each copy of t_c to a clause. The tree t_ϕ involves exactly three classes, denoted as,,. Each subtree t_v, see Figure 2(b), contains the two possible values for a variable. The left (resp. right) branch of the tree represents true (resp. false).
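The clause gadget t_c of this reduction relies on splitting the ways of satisfying a three-literal clause into three mutually exclusive cases: the first literal is true; the first is false and the second is true; the first two are false and the third is true. The following sanity check (an illustration only, with hypothetical function names) confirms by enumeration that these cases never overlap and together cover exactly the assignments satisfying the clause.

```python
from itertools import product


def clause_cases(x, y, z):
    """Truth values of cases (1), (2), (3) for literal values x, y, z."""
    return [x, (not x) and y, (not x) and (not y) and z]


def cases_partition_the_clause():
    """Check all 8 assignments of the three literals."""
    for x, y, z in product([False, True], repeat=3):
        cases = clause_cases(x, y, z)
        if sum(cases) > 1:                 # two cases hold at once: not disjoint
            return False
        if any(cases) != (x or y or z):    # cases must match the disjunction
            return False
    return True


print(cases_partition_the_clause())
```

Because exactly one case holds for every satisfying assignment, a witness can select a single branch of t_c per clause without ambiguity.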

[Fig. 2. The data tree t_ϕ: (a) the data tree t_ϕ, (b) the subtree t_v, (c) the subtree t_c with its three branches (1), (2), (3)]

A clause is viewed here as the disjunction of three literals, say X, Y, and Z. Each subtree t_c, see Figure 2(c), is formed by three subtrees. Each of them represents one of the three disjoint possibilities for a clause to be true: (1) X is true, or (2) X is false and Y is true, or (3) X and Y are false and Z is true. We now turn to the definition of the tree pattern P_ϕ = (tp_ϕ, C=, C≠), depicted in Figure 3. Similarly to t_ϕ, the tree tp_ϕ is formed by k copies of tp_v (each of them implicitly corresponding to a variable) and n copies of tp_c (each of them implicitly corresponding to a clause).

[Fig. 3. The tree tp_ϕ: (a) the tree tp_ϕ, (b) subtrees tp_v and tp_c]

The form of the data erasures of t_ϕ and tp_ϕ ensures that any witness of P_ϕ in t_ϕ selects exactly one value per variable and one (satisfying) valuation for each clause. Note that this is ensured by the definition of witness, since the witness mapping is injective. It remains to define the data constraints C= and C≠ in order to guarantee that each clause is satisfied. Assume that the first literal of clause c is a positive variable x (resp. the negation of x). Then we add in C the -position (resp. the -position) of the subtree tp_v corresponding to the variable x together with the X-position of the subtree tp_c corresponding to the clause c. The same can be done with the literals Y and Z. Figure 4 gives the example of the pattern for the formula ϕ with only the clause a b c.

[Fig. 4. Pattern for ϕ = a b c]

We now prove that t_ϕ ⊨ P_ϕ iff ϕ is satisfiable. Assume that the formula ϕ is satisfiable. From any satisfying assignment of ϕ we derive a mapping of P_ϕ into t_ϕ: the subtree tp_v corresponding to the variable v is mapped on the left branch of the corresponding t_v if the value of v is true, and on the right branch otherwise. Since each clause is satisfied, one of the three cases represented by the subtree t_c happens, and we can map the tp_c corresponding to the clause on the matching branch of the corresponding t_c. The converse is similar.

In the proofs of Proposition 4 and Theorem 3, the patterns use only the child predicate. We can do the same with similar patterns using only the descendant predicate. As a consequence, we have:

Theorem 5 The model-checking problem for both fragments BC(∼, |) and BC(∼, ||) is DP-complete.

Corollary 6 The model-checking problem for BC+(∼, |), BC+(∼, ||) and for BC+(∼, |, ||) is NP-complete.

Similarly, we can see that the model-checking problem of a (conjunction of) negated pattern(s) is co-NP-complete. Notice that in the proof of Theorem 3, the pattern formula of the lower bound is a conjunction of one pattern and the negation of one pattern. Thus, the model-checking problem is already DP-complete for a conjunction of one pattern and one negated pattern.

The model checking problem for conjunctive queries is also exponential (NP) in relational databases. However, the algorithms work very well in practice, when models or queries are simple. In particular, when the query is acyclic, the problem becomes polynomial. The worst cases that lead to exponential behaviors do not appear often. It would be interesting to know how the algorithms following from our proofs behave on practical cases, and whether we can find some restriction on the patterns that would lead to efficient evaluation in practice.

4 Satisfiability

In this section, we study the satisfiability problem in the presence of DTDs. Checking satisfiability of a query is useful for optimization of query evaluation or minimization techniques. In terms of schema design, satisfiability corresponds to checking the consistency of the specification. We show that the satisfiability problem is undecidable in general. However, the reduction needs the combination of negation, child and descendant operations. Indeed, removing any one of these features yields decidability, and we give the corresponding precise complexities.

4.1 Undecidability

Theorem 7 The satisfiability problem for BC(∼, |, ||) in the presence of a DTD is undecidable.

Proof sketch. We prove the undecidability by a reduction from the acceptance problem of two-counter machines (or Minsky machines). Our reduction builds a DTD and a pattern formula of size polynomial in the size of the machine whose models are exactly the encodings of the accepting runs. The encoding of a run can be split in three parts:
1. The general structure of the tree, which depends only on the data erasure, and is controlled by the DTD.
2. The internal consistency of a configuration.
3. The evolution of counter values between two successive configurations.

The global structure contains a branch that is labelled by the sequence of transitions. Ensuring that a tree is of this shape is done by the DTD. It recognizes the data erasure of sequences of configurations. In particular, it checks that a counter is zero when this is required by the transition. It also ensures that the sequence of transitions respects the machine's rules (succession of control states, initial and final configurations). The data values allow us to control the evolution of the counters between two consecutive configurations. In order to do so, we need to guarantee a certain degree of structure and continuity of the values through a run. The data structure and the evolution of counters are ensured by the pattern formula.

The proof uses only conjunctions of negated patterns. Thus, the satisfiability problem is already undecidable for the BC−(∼, |, ||) fragment in the presence of a DTD. Alternatively, the DTD can be replaced by a pattern formula. To do so, we need a few positive patterns to constrain initial and final configurations in the coding. Thus, the satisfiability problem is undecidable for BC(∼, |, ||) without DTDs. It is interesting to notice that the satisfiability problem of BC(∼, |, ||) is undecidable on word models. We will discuss this in Section 5.

4.2 Decidable Restrictions

We can obtain decidability by restraining either the expressive power of pattern formulas or the data trees considered. For the first part, using only one kind of edge predicate (| or ||) leads to decidability. For the second part, restricting the trees to bounded depth leads to decidability. We provide the exact complexities.

Restricted Fragments: The proof of undecidability uses both | and || in the pattern to count unbounded values of the counters. If we restrict the expressivity of patterns to use either | or ||, we cannot do this anymore and the problem becomes decidable. The key to both lower bounds is that patterns can still count up to a polynomial value and thus compare positions of a tree of polynomial depth. We use this idea to encode exponential size configurations of a Turing machine into the leaves of polynomial depth subtrees.

Theorem 8 The satisfiability problem of BC(∼, |) in the presence of a DTD is 2ExpTime-complete.

Proof sketch. The upper bound is obtained by a small model property. We can prove that a pattern formula ϕ of BC(∼, |) is satisfiable in the presence of a given DTD iff it has a model with a number of classes that is at most doubly exponential in the size of the formula. We can recognize the data erasure of such small models with an automaton of size doubly exponential in the size of the formula. Because emptiness of such automata is PTime, we have the 2ExpTime upper bound. The lower bound is obtained by a coding of accepting runs of AExpSpace Turing machines. We can build a DTD and a pattern formula from BC(∼, |) such that a data tree is a model of the pattern formula and respects the DTD iff it is the encoding of an accepting run of the machine.

Theorem 9 The satisfiability problem of BC(∼, ||) in the presence of a DTD is NExpTime-complete.

Bounded Depth restrictions: In the context of XML documents, looking at the satisfiability problem restricted to data trees of bounded depth is a crucial restriction. This restriction leads to decidability for BC(∼, |, ||).

Problem 3. Consider a pattern formula ϕ, an integer d and a DTD L. The problem of bounded depth satisfiability in the presence of a DTD asks whether ϕ is satisfiable by a data tree of depth smaller than d whose data erasure belongs to L.

Theorem 10 If d is fixed, the bounded depth satisfiability problem in the presence of a DTD for BC(∼, |, ||) is Σ₂-complete.

Theorem 11 If d is part of the input, the bounded depth satisfiability problem in the presence of a DTD for BC(∼, |, ||) is NExpTime-complete.

Other remarks: All the lower bound results of this section only use conjunctions of negated patterns. Thus these results hold for the BC− fragments.

Proposition 12 The satisfiability problem of a single pattern is PTime.

Proposition 13 The satisfiability problem for BC+(∼, |, ||) is NP-complete in the presence of a DTD.

5 Conclusion

The table below summarizes our results. bnd Sat (resp. bnd_f Sat) stands for Bounded depth Satisfiability when the bound is part of the input (resp. fixed). The gray parts of the table (the rows labelled Data Word) give complexity results for data word models. Data words are the linear model corresponding to data trees. This model is studied in the verification area [11, 13]. Data patterns can also be considered for data words. The proofs are more complex and will be available in a longer version.

Fragments        Model-Checking   Satisfiability      bnd Sat             bnd_f Sat
BC(∼, |, ||)     DP-complete      Undecidable         NExpTime-complete   Σ₂-complete
BC(∼, |)         DP-complete      2ExpTime-complete   NExpTime-complete   Σ₂-complete
  Data Word      PTime            PSpace-complete
BC(∼, ||)        DP-complete      NExpTime-complete   NExpTime-complete   Σ₂-complete
  Data Word      DP-complete      Σ₂-complete
BC−(∼, |, ||)    co-NP-complete   Undecidable         NExpTime-complete   Σ₂-complete
  Data Word      co-NP-complete   Undecidable
BC+(∼, |, ||)    NP-complete      NP-complete         NP-complete         NP-complete
  Data Word      NP-complete      NP-complete

Discussion: In our framework we use the unordered version of trees. If we consider the next-sibling predicate, the situation is different. For the model checking problem all results hold with similar proofs. However, the complexity of the satisfiability problem can increase when negation is allowed. In particular, the satisfiability problem for bounded depth trees becomes undecidable since we can encode data words. Recall that our pattern formalism does not preserve the least common ancestor. All results hold if we add the least common ancestor.

An important issue of semi-structured databases is the containment problem. Given a DTD and two pattern formulas, we want to know whether every tree satisfying the DTD and the first formula also satisfies the second one. When the set of formulas we consider is closed under negation, we can decide whether a formula ϕ1 is more constraining than ϕ2 by checking the satisfiability of ¬ϕ2 ∧ ϕ1. In Boolean combinations, we have closure under negation, hence the inclusion problem reduces to the satisfiability problem.
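Spelled out as a formula, and writing L for the DTD as in Problem 2 (the function name erasure is shorthand introduced here for the data erasure), the reduction from containment to satisfiability reads:

```latex
\[
\Bigl(\forall t \ \text{with}\ \mathrm{erasure}(t) \in L:\quad
      t \models \varphi_1 \;\Rightarrow\; t \models \varphi_2\Bigr)
\quad\Longleftrightarrow\quad
\varphi_1 \wedge \neg\varphi_2 \ \text{is not satisfiable in the presence of } L .
\]
```

The right-hand side is an instance of Problem 2 only when the formula class contains the negation of φ2, which is why this argument applies to Boolean combinations but not directly to the positive fragment.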
For the positive fragment, the precise complexity seems harder to state and the question is left open.

In terms of expressiveness, our pattern formalism is incomparable to XPath. In terms of tractability, evaluation of XPath queries is in PTime whereas model-checking of one data tree pattern is already NP-hard. An open question is to find good notions of constraints in order to isolate interesting fragments with lower complexity. Considering the complexity of the satisfiability problem, XPath and our pattern formalism behave similarly.

In this paper, we only consider patterns as filters in order to define properties on the data trees. Defining a query language would be a natural extension of this work. To do this, some of the variables of the patterns can be chosen as output variables.
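The reduction of containment to satisfiability mentioned in the discussion can be illustrated with a small sketch. It is not the decision procedure of this paper. It assumes that formulas and the DTD are given as Boolean predicates, and it replaces a real satisfiability test by brute-force search over a finite, bounded set of candidate models. For simplicity the models are short data words (the linear model mentioned above) rather than data trees. All names (`satisfiable`, `contained`, `words`, `two_a_same_value`, `two_a`) are hypothetical and introduced only for this illustration.

```python
from itertools import product
from typing import Callable, Iterator, Sequence, Tuple

# Assumption: a model is a data word, i.e. a tuple of (label, data value) pairs.
Word = Tuple[Tuple[str, int], ...]
Formula = Callable[[Word], bool]


def words(labels: Sequence[str], values: Sequence[int], max_len: int) -> Iterator[Word]:
    """Enumerate every data word of length at most max_len (bounded universe)."""
    letters = list(product(labels, values))
    for n in range(max_len + 1):
        yield from product(letters, repeat=n)


def satisfiable(formula: Formula, dtd: Formula, candidates: Sequence[Word]) -> bool:
    """True if some candidate satisfies both the schema and the formula.

    Brute force over a finite candidate set; a stand-in for a real
    satisfiability procedure, which need not exist in general.
    """
    return any(dtd(w) and formula(w) for w in candidates)


def contained(phi1: Formula, phi2: Formula, dtd: Formula, candidates: Sequence[Word]) -> bool:
    """phi1 is more constraining than phi2 iff (not phi2) and phi1 is unsatisfiable."""
    return not satisfiable(lambda w: phi1(w) and not phi2(w), dtd, candidates)


def two_a_same_value(w: Word) -> bool:
    """Two distinct positions labelled 'a' carry the same data value."""
    a_values = [d for (label, d) in w if label == "a"]
    return len(a_values) != len(set(a_values))


def two_a(w: Word) -> bool:
    """At least two positions are labelled 'a'."""
    return sum(1 for (label, _) in w if label == "a") >= 2


def any_word(w: Word) -> bool:
    """Trivial schema accepting every word (stands in for the DTD)."""
    return True


universe = list(words(("a", "b"), (0, 1), 3))

print(contained(two_a_same_value, two_a, any_word, universe))  # True
print(contained(two_a, two_a_same_value, any_word, universe))  # False
```

The second call fails because the word with two `a` positions carrying different data values satisfies `two_a` but not `two_a_same_value`, so the conjunction with the negated formula is satisfiable. The answers are only valid relative to the bounded universe chosen here.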

References

1. S. Al-Khalifa, H. V. Jagadish, J. M. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In ICDE, pages 141-153. IEEE, 2002.
2. N. Alon, T. Milo, F. Neven, D. Suciu, and V. Vianu. XML with data values: typechecking revisited. J. Comput. Syst. Sci., 66(4):688-727, 2003.
3. S. Amer-Yahia, S. Cho, L. V. S. Lakshmanan, and D. Srivastava. Minimization of tree pattern queries. SIGMOD Rec., 30(2):497-508, 2001.
4. M. Arenas and L. Libkin. XML data exchange: consistency and query answering. In PODS, pages 13-24, 2005.
5. M. Benedikt, W. Fan, and F. Geerts. XPath satisfiability in the presence of DTDs. In PODS, pages 25-36, 2005.
6. M. Benedikt and C. Koch. XPath Leashed. To appear in ACM Computing Surveys.
7. V. Benzaken, G. Castagna, and C. Miachon. CQL: a pattern-based query language for XML. In BDA, pages 469-490, 2004.
8. H. Björklund, W. Martens, and T. Schwentick. Conjunctive Query Containment over Trees. In DBPL, LNCS 4797, pages 66-80. Springer, 2007.
9. H. Björklund, W. Martens, and T. Schwentick. Optimizing Conjunctive Queries over Trees using Schema Information. To appear in MFCS, 2008.
10. M. Bojanczyk, C. David, A. Muscholl, T. Schwentick, and L. Segoufin. Two-variable logic on data trees and XML reasoning. In PODS, pages 10-19, 2006.
11. M. Bojanczyk, A. Muscholl, T. Schwentick, L. Segoufin, and C. David. Two-Variable Logic on Words with Data. In LICS, pages 7-16. IEEE, 2006.
12. N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In SIGMOD Conference, pages 310-321. ACM, 2002.
13. S. Demri and R. Lazic. LTL with the Freeze Quantifier and Register Automata. In LICS, pages 17-26. IEEE, 2006.
14. A. Deutsch and V. Tannen. Containment and integrity constraints for xpath. In KRDB, volume 45 of CEUR Workshop Proceedings. CEUR-WS.org, 2001.
15. W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. J. ACM, 49(3):368-406, 2002.
16. G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms for processing XPath queries. ACM Trans. Database Syst., 30(2):444-491, 2005.
17. G. Gottlob, C. Koch, and K. U. Schulz. Conjunctive queries over trees. J. ACM, 53(2):238-272, 2006.
18. M. Jurdzinski and R. Lazic. Alternation-free modal mu-calculus for data trees. In LICS, pages 131-140. IEEE, 2007.
19. L. V. S. Lakshmanan, G. Ramesh, H. Wang, and Z. J. Zhao. On Testing Satisfiability of Tree Pattern Queries. In VLDB, pages 120-131, 2004.
20. A. Neumann and H. Seidl. Locating matches of tree patterns in forests. In FSTTCS, volume 1530 of LNCS, pages 134-145. Springer, 1998.
21. C. H. Papadimitriou and M. Yannakakis. The Complexity of Facets (and Some Facets of Complexity). J. Comput. Syst. Sci., 28(2):244-259, 1984.
22. Y. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In ICDE, pages 443-454. IEEE, 2003.