Query with Assumptions for Probabilistic Relational Databases

Users may have prior knowledge about a probabilistic database and may wish to query the database under that knowledge, which cannot be expressed in the clauses of a conventional SQL query. A naive approach is to query a new database version generated by transforming the original probabilistic database to satisfy the user's prior knowledge; however, it is impractical to generate a different version of the probabilistic database for each piece of prior knowledge. In this paper, we propose the concept of a query with assumptions, which allows users to describe their prior knowledge with a newly introduced ASSUMPTION clause in SQL. We also propose an approach for obtaining the result of a query based on its assumption clauses. Experimental studies show that our approach performs better than the naive approach.

In applications of probabilistic databases, users commonly have prior knowledge (denoted C) from other sources [19]. In such situations, users cannot obtain what they really want from a probabilistic database PDB by conventional queries. The probability inferred by a conventional query is the a priori probability P(Q), whereas what users need is the a posteriori probability, that is, the conditional probability P(Q|C). The problem is thus how to evaluate a query Q given C.
Consider the following example. Example 1: Assume that a probabilistic table PT1 (Tab. 1) records information about persons suspected of a crime. Table V_P (Tab. 2) records a set of mutually independent Boolean variables, each associated with the probability of being true. According to the degree of suspicion and the correlations among suspects, each suspect is associated with the probability that the logical expression in column f is true. For example, since the logical expression in f for tuple t1 is v1, which is true with probability 0.5, Jim is responsible for the crime with probability 0.5. As the logical expressions of tuples t1 and t3 are mutually exclusive, Dan and Jim did not participate in the crime together.
Each assignment of the variables in V_P represents a possible instance of PT1: the instance containing all the tuples whose logical expressions are true under the given assignment. A probabilistic database is a joint probability distribution over the assignments of these variables, so PT1 has four possible worlds (see Tab. 3). wi(x1, x2) denotes the i-th possible world, in which v1 and v2 take the values x1 and x2 respectively, and P(wi) denotes the probability of wi. Suppose a detective has a personal point of view, based on experience or research, when checking information about suspects with black hair. For example, the detective suspects that Jim participated in the crime, which means that the detective has prior knowledge about PT1 (the tuple t1 must be present in PT1). Consider the following conventional query Q1: select name from PT1 where color = 'black'. According to the semantics of probabilistic query evaluation, Q1 is evaluated on every possible world of PT1 separately. The result is a set of possible result instances with the same probability distribution. The final result of the probabilistic query is the union of all possible result tuples, and the probability of each result tuple is the sum of the probabilities of all result instances that contain it. In this example, Dan and Jone are the result of executing Q1 on PT1 (see Fig. 1).

Result of Q1
Name  P
Dan   0.5
Jone  0.3

Result of AQ
Name  P
Jone  0.6

Figure 1 Result of Q1 and Result of AQ

However, the results obtained by the conventional query Q1 do not meet the detective's needs, because they are computed over all four possible worlds of PT1, two of which do not satisfy the detective's prior knowledge. Moreover, the prior knowledge cannot be described in a WHERE clause. We therefore propose the query with assumptions, in which users can describe their prior knowledge as an assumption in the query. The query with assumptions AQ in this example can be written as follows:
select name from PT1 where color='black' assumption exist name='Jim';
where assumption is the keyword that introduces the user's prior knowledge.
The result of this query with assumptions is shown in Fig. 1: the degree of suspicion of Jone is 0.6, and Dan did not participate in the crime. Let P(t ∈ Q1) be the probability that tuple t is in the result of Q1, let C be the assumption "exist name='Jim'" about PT1, and let AQ be Q1 executed under assumption C. Then P(t ∈ AQ) = P(t ∈ Q1|C), and by the definition of conditional probability,

$$P(t \in AQ) = P(t \in Q1 \mid C) = \frac{P(t \in Q1 \wedge C)}{P(C)} \qquad (1)$$

In this example, only two possible worlds, w1 and w2, satisfy the given assumption, so Q1 is evaluated against only these two possible worlds. Jone, denoted t_Jone, is the result from w1, and the result from w2 is empty. Thus, P(t_Jone ∈ Q1∧C) = P(w1) = 0.3 and P(C) = P(w1) + P(w2) = 0.5.
By Eq. (1), the probability that t_Jone is in the result of AQ, namely the degree to which Jone is suspected of the crime, is

$$P(t_{Jone} \in AQ) = \frac{0.3}{0.5} = 0.6$$

This information is obtained from the probabilistic database together with the detective's prior knowledge, and thus meets the detective's needs.
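The arithmetic of Example 1 can be checked by brute-force enumeration of possible worlds. The sketch below is illustrative only: Tab. 1 and Tab. 2 are not reproduced in this text, so the formulas f(t1) = v1, f(t3) = ¬v1, f(t4) = v1 ∧ v2, the probabilities P(v1) = 0.5 and P(v2) = 0.6, and the hair colors are assumptions chosen to agree with the probabilities quoted above (tuple t2 is omitted). The function names are hypothetical.

```python
from itertools import product

# Assumed variable probabilities (Tab. 2 is not reproduced in the text).
P_VAR = {"v1": 0.5, "v2": 0.6}

# Assumed tuples: (name, hair color, formula over a variable assignment).
TUPLES = [
    ("Jim", "brown", lambda a: a["v1"]),
    ("Dan", "black", lambda a: not a["v1"]),
    ("Jone", "black", lambda a: a["v1"] and a["v2"]),
]


def worlds():
    """Yield (assignment, probability) for every assignment of the variables."""
    names = sorted(P_VAR)
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        p = 1.0
        for n in names:
            p *= P_VAR[n] if a[n] else 1.0 - P_VAR[n]
        yield a, p


def present(a):
    """Tuples whose formula is true under assignment a."""
    return [(name, color) for name, color, f in TUPLES if f(a)]


def q1(instance):
    """select name from PT1 where color = 'black'"""
    return {name for name, color in instance if color == "black"}


def assumption(instance):
    """assumption exist name = 'Jim'"""
    return any(name == "Jim" for name, _ in instance)


def evaluate(query, cond=lambda inst: True):
    """Return {tuple: P(tuple in query result | cond)} by world enumeration."""
    p_cond, joint = 0.0, {}
    for a, p in worlds():
        inst = present(a)
        if not cond(inst):
            continue  # world violates the assumption
        p_cond += p
        for t in query(inst):
            joint[t] = joint.get(t, 0.0) + p
    return {t: p / p_cond for t, p in joint.items()}


print(evaluate(q1))              # conventional query Q1
print(evaluate(q1, assumption))  # query with assumptions AQ
```

Under these assumed inputs, the first call reproduces the Q1 column of Fig. 1 and the second reproduces the AQ column, up to floating-point rounding.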
Prior knowledge may change the probability distribution of the information in a probabilistic relational database and therefore cannot be handled by a conventional query. What users need is not just the result of a conventional query, but the result conditioned on their prior knowledge. Fig. 2 shows the research problem considered in this paper. The conditioning_based approach executes conventional queries over a posteriori probabilistic relational database, which is the result of conditioning the probabilistic relational database. However, different users may have different prior knowledge. In Example 1, while one detective suspects that Jim must be involved in the crime, other detectives may have different opinions or assumptions about the suspects. It is too costly to generate a new version of the probabilistic database for each query with different assumptions and then delete that version after the query.
The aim of this study is to enable users to obtain the result of a query under their prior knowledge without producing a new version of the probabilistic database. In Example 1, the query with assumptions lets the detective obtain the degree of suspicion of each suspect under an assumption, while other detectives can at the same time obtain information based on their own views or assumptions.
The main contributions of this paper are as follows: (1) A new ASSUMPTION clause is introduced into SQL syntax. Multiple ASSUMPTION clauses can be appended to a conventional SQL query, and in each ASSUMPTION clause users can describe their prior knowledge as the existence or non-existence of tuples satisfying specified conditions. (2) We propose a new lineage_based evaluation approach for processing the ASSUMPTION clause. The result of a query with assumptions is obtained from the result of the conventional query by computing, for each result tuple, its conditional probability under the given prior knowledge. We also provide an improved method for calculating conditional probabilities that incorporates the probability calculation into the lineage computation, so that shared sub-expressions are not re-evaluated. (3) We conduct an experimental study of the proposed algorithm. The experimental results show that the lineage_based approach obtains the correct result for queries with assumptions and is more efficient than the conditioning_based approach.

RELATED WORKS

Probabilistic Relational Database
Studies on uncertain data representation can be divided into two categories [20,21]. The first is based on simple correlation assumptions [22,23]: existence probabilities are associated with individual tuples, and the tuples in the probabilistic relational database are mutually independent or mutually exclusive. The second can express complex correlations between tuples [24,25].
Approaches to query evaluation in probabilistic relational databases can also be divided into two categories [26]. The first evaluates the query and computes the probabilities of the results separately [27,28]; lineage expressions of the result tuples are used for correct confidence computation, without restricting the query plan. The second integrates probabilistic inference with the query evaluation step [29], so that standard data management techniques can be used to speed up probabilistic inference, but it is suitable only for queries that have safe plans. The first approach is better suited to evaluating queries with assumptions, since the answer tuples of a query with assumptions are computed from the conventional query, and the confidences of the result tuples are computed as the conditional probability of the lineage under the given assumption.

Figure 2 The research problem considered in this paper (diagram labels: Conditioning PDB on C, PDB|C, Evaluate Q, Results)

Conditioning
As their authors claim, [30][31][32] are the only three existing works on conditioning probabilistic relational databases. Conditioning a probabilistic relational database removes the possible worlds that violate the additional knowledge. Our work differs from conditioning probabilistic databases in two respects.
Firstly, the scenarios are different. Conditioning a probabilistic database is mainly useful when administrators add new evidence to a database of a priori probabilities and update it to a posteriori probabilistic database that takes the evidence into account, so it focuses on how to obtain the posteriori probabilistic database after conditioning. A query with assumptions, in contrast, is useful when different ordinary users query a probabilistic database with their own prior knowledge or assumptions and want results that take their assumptions into account without affecting the probabilistic relational database. Secondly, the conditioning approach is not appropriate for answering queries with assumptions, because it is impractical to generate a posteriori probabilistic database for each query with different assumptions.

Definition 1:
A query with assumptions AQ_PDB(Q, C) over a probabilistic database PDB is a query Q evaluated over only those possible worlds of PDB that satisfy assumption C, where Q is a conventional query over the probabilistic database and C is an assumption about the presence or absence of tuples in PDB.
Suppose {RT_AQ, P_AQ} is the result of the query with assumptions AQ_PDB(Q, C), where RT_AQ is the set of tuples in the result and P_AQ gives the probability that each tuple in RT_AQ is present. Then RT_AQ is the set of result tuples of Q executed over all possible worlds of PDB that satisfy assumption C. Let W = {w_1, …, w_n} be the set of possible worlds of PDB. The set of result tuples is

$$RT_{AQ} = \bigcup_{w_i \in W,\ C \sim w_i} Q(w_i)$$

where C ~ w_i means that w_i satisfies C and Q(w_i) is the result of Q over w_i. For each tuple t ∈ RT_AQ, P_AQ(t) = P(t ∈ Q|C) = P(t ∈ Q∧C)/P(C), where P(t ∈ Q∧C) is the total probability of all possible worlds that satisfy C and include t in the result of query Q, and P(C) is the total probability of all possible worlds that satisfy assumption C.

Syntax
The assumptions supported in this paper are limited to those that do not introduce new possible worlds beyond those of the probabilistic database. Since the presence or absence of each tuple in the probabilistic database determines a possible world, an assumption can be converted into the presence or absence of several tuples. An assumption in a query can therefore state the presence or absence of tuples satisfying specified conditions. Because a constraint on the number of present tuples is inconvenient for users to convert into the presence or absence of individual tuples, an interface for count assumptions is also provided; a count assumption is converted automatically into the presence or absence of tuples. Based on these considerations, the syntax for assumptions in a query is defined as follows:

Transformed Assumption Expression
Assumption clauses need to be transformed into a logical expression over tuple identifiers in the probabilistic database (denoted C_E), which represents all possible worlds of PDB that satisfy the assumption. The transformation proceeds as follows. Let C_E(Ci) be the expression transformed from each <Ci>. <Ci>::=<c_exist>.
Let C_E(exp_j) be the expression transformed from each <exp_j>.
For a count assumption, the tuples related to the assumption are obtained by a query over PDB. Suppose R = {t_1, …, t_k} is the set of result tuples, s = <int_count>, and p = k(k−1)…(k−s+1)/s!. Then C_E is a disjunctive normal form containing p conjunctive clauses:

$$C_E = M_1 \vee M_2 \vee \cdots \vee M_p$$

Each M_i is a conjunctive clause that contains either t_j or ¬t_j for every tuple t_j in R, and exactly s of the t_j appear un-negated in M_i. Writing S_i for the i-th s-element subset of R, M_i is defined as

$$M_i = \Big(\bigwedge_{t_j \in S_i} t_j\Big) \wedge \Big(\bigwedge_{t_j \in R \setminus S_i} \neg t_j\Big)$$

When the ASSUMPTION clause contains n assumptions, each assumption Ci is transformed separately, and the expression for the whole assumption clause is the conjunction of the C_E(Ci).
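The expansion of a count assumption into a disjunctive normal form can be sketched as follows. The function names and the representation of a clause as a list of (tuple identifier, polarity) pairs are illustrative choices, not part of the paper, and the sketch assumes the intended meaning is that exactly s of the k related tuples are present, which gives C(k, s) clauses.

```python
from itertools import combinations
from math import comb


def count_assumption_dnf(tuple_ids, s):
    """Return the DNF for 'exactly s of tuple_ids are present'.

    Each clause is a list of (tuple_id, is_positive) pairs covering every
    tuple in tuple_ids; the DNF is the disjunction of all returned clauses.
    """
    clauses = []
    for chosen in combinations(tuple_ids, s):
        chosen = set(chosen)
        clauses.append([(t, t in chosen) for t in tuple_ids])
    return clauses


def show(clause):
    """Render one conjunctive clause as text, negating absent tuples."""
    return " ∧ ".join(t if pos else "¬" + t for t, pos in clause)


R = ["t1", "t2", "t3"]  # hypothetical tuples related to the assumption
dnf = count_assumption_dnf(R, 2)
assert len(dnf) == comb(len(R), 2)  # number of clauses is C(k, s)
for clause in dnf:
    print(show(clause))
```

Because the number of clauses is C(k, s), the expression can become large when many tuples are related to the assumption; the sketch makes no attempt to simplify it.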

Process Procedure of a Query with Assumptions
Conditioning_based approach: For a query with assumptions AQ_PDB(Q, C), the conditioning_based approach evaluates the conventional query Q over a posteriori version of the probabilistic relational database, which is generated by conditioning.
However, a different posteriori probabilistic database must be generated for each assumption clause and deleted after the query. We therefore consider an alternative way to process queries with assumptions.
Since AQ_PDB(Q, C) is a conventional query Q executed under a specified assumption C, and the result of Q over PDB can be obtained by existing methods [9], we study the relationship between Q and AQ_PDB(Q, C) and obtain the result of AQ_PDB(Q, C) from the result of Q and this relationship.
Theorem 1: Given a query with assumptions AQ_PDB(Q, C), let {RT_AQ, P_AQ} be its result, where RT_AQ is the set of tuples in the result and P_AQ gives the existence probability of each tuple in RT_AQ; and let {RT_Q, L} be the result of evaluating Q without probability inference, where RT_Q is the set of result tuples and L gives the lineage of each result tuple. Then RT_AQ ⊆ RT_Q and, for each tuple t ∈ RT_AQ,

$$P_{AQ}(t) = P(L(t) \mid C_E) = \frac{P(L(t) \wedge C_E)}{P(C_E)}$$

where L(t) is the lineage of tuple t and C_E is the transformed expression for assumption C in the query with assumptions.
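Proof: Suppose PDB{W, P} is a probabilistic database, where W = {w_1, ..., w_n} is its set of possible worlds, n is the number of possible worlds, and P is the probability distribution over W. Suppose PDB'{W', P'} is the probabilistic database obtained by transforming PDB to satisfy assumption C [11]:

$$W' = \{ w_i \mid w_i \in W,\ C \sim w_i \}, \qquad P'(w_i) = \frac{P(w_i)}{P(C_E)}, \qquad P(C_E) = \sum_{w_i \in W,\ C \sim w_i} P(w_i)$$

where C ~ w_i means that w_i satisfies C. AQ_PDB(Q, C) is equivalent to executing the conventional query Q over PDB'{W', P'}.
For every t ∈ RT_Q and any given possible world, t is in the result of Q over that world exactly when L(t) is true in it.
Namely, when Q is executed over PDB', the probability of t is the total probability under P' of the worlds in W' in which L(t) is true:

$$P_{AQ}(t) = \sum_{w_i \in W',\ L(t)\ \text{true in}\ w_i} P'(w_i) = \frac{\sum_{w_i \in W,\ C \sim w_i,\ L(t)\ \text{true in}\ w_i} P(w_i)}{P(C_E)} = \frac{P(L(t) \wedge C_E)}{P(C_E)} = P(L(t) \mid C_E)$$

End proof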
By Theorem 1, a query with assumptions AQ_PDB(Q, C) can be processed as follows. The algorithm we propose for queries with assumptions is shown in Fig. 3.

For example, consider the following query with assumptions:
select name from PT1 where color='black' and Sex='F' assumption exist name='Jim' and not exist name='Dan';
First, the conventional query Q3 contained in this query with assumptions is evaluated:
select name from PT1 where color='black' and Sex='F'
We obtain Jone (t4) in the result. The transformed expression for the assumption is C_E = t1 ∧ ¬t3.
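The conditional probability of Theorem 1 for this example can be sketched as below. The formulas f(t1) = v1, f(t3) = ¬v1, f(t4) = v1 ∧ v2 and the variable probabilities are assumptions that agree with the numbers quoted in Example 1, not values taken from Tab. 1 and Tab. 2, and prob and cond_prob are hypothetical helpers that use plain enumeration of variable assignments.

```python
from itertools import product

P_VAR = {"v1": 0.5, "v2": 0.6}  # assumed variable probabilities

# Assumed formula f(t) for each tuple identifier, as a function of an assignment.
F = {
    "t1": lambda a: a["v1"],
    "t3": lambda a: not a["v1"],
    "t4": lambda a: a["v1"] and a["v2"],
}


def prob(expr):
    """P(expr is true), summing over all assignments of the variables."""
    names = sorted(P_VAR)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        if expr(a):
            w = 1.0
            for n in names:
                w *= P_VAR[n] if a[n] else 1.0 - P_VAR[n]
            total += w
    return total


def cond_prob(lineage, c_e):
    """P(lineage | c_e) = P(lineage AND c_e) / P(c_e)."""
    return prob(lambda a: lineage(a) and c_e(a)) / prob(c_e)


# Lineage of the result tuple Jone (t4) and assumption C_E = t1 AND NOT t3,
# both evaluated after replacing each tuple identifier t with f(t).
lineage_t4 = lambda a: F["t4"](a)
c_e = lambda a: F["t1"](a) and not F["t3"](a)

print(cond_prob(lineage_t4, c_e))
```

No new database version is built here: only the lineage of the result tuple and the transformed assumption expression are evaluated.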

Improved Conditional Probability Calculation
We next give an algorithm for computing P(L(rs)|C_E), the existence probability of a result tuple rs of a query with assumptions, where L(t) is the lineage of a tuple t and C_E is the transformed expression for assumption C in the query with assumptions. Both L(t) and C_E are logical expressions over tuple identifiers in the probabilistic database.
Since the truth value of f(t) determines the presence or absence of tuple t, and the probability that f(t) is true or false is determined by the probabilities associated with the variables of which it is composed, we compute the probability of a logical expression over tuple identifiers by replacing each tuple identifier t with f(t).
The existence probability inferred for a result tuple rs of a conventional query Q is the a priori probability P(L(rs)), whereas the existence probability of a result tuple rs of a query with assumptions AQ_PDB(Q, C) is the a posteriori probability, that is, the conditional probability P(L(rs)|C_E). Our goal is for computing P(L(rs)|C_E) to have the same time complexity as computing P(L(rs)).
Definition 2: Expression of Variables EV: Suppose X is a logical expression over tuple identifiers in a PDB. EV(X) is the expression obtained from X by replacing every tuple identifier t in X with f(t).
Definition 3: Set of Variables SV: Suppose Y is a logical expression over variables in a PDB. SV(Y) is the set of variables that appear in Y.
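Definitions 2 and 3 can be illustrated with a small symbolic sketch. The nested-tuple encoding of expressions and the function names ev and sv are illustrative assumptions, and the formulas in F are assumed formulas that agree with Example 1 rather than the contents of Tab. 1.

```python
# Expressions are nested tuples: ("var", name), ("tuple", id),
# ("not", e), ("and", e1, e2), ("or", e1, e2).

# Assumed formulas f(t) for Example 1 (Tab. 1 is not reproduced in the text).
F = {
    "t1": ("var", "v1"),
    "t3": ("not", ("var", "v1")),
    "t4": ("and", ("var", "v1"), ("var", "v2")),
}


def ev(x):
    """EV(X): replace every tuple identifier t in X with f(t)."""
    op = x[0]
    if op == "tuple":
        return F[x[1]]
    if op == "var":
        return x
    return (op,) + tuple(ev(sub) for sub in x[1:])


def sv(y):
    """SV(Y): the set of variables that appear in Y."""
    if y[0] == "var":
        return {y[1]}
    if y[0] == "tuple":
        raise ValueError("apply ev() first: SV is defined on variable expressions")
    result = set()
    for sub in y[1:]:
        result |= sv(sub)
    return result


c_e = ("and", ("tuple", "t1"), ("not", ("tuple", "t3")))  # t1 AND NOT t3
print(ev(c_e))
print(sv(ev(c_e)))
print(sv(ev(("tuple", "t4"))))
```

The size of SV(EV(X)) is what drives the cost of enumeration-based inference, since every variable in the set doubles the number of assignments to consider.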

DISCUSSION OF TIME COMPLEXITY OF ALGORITHM FOR QUERY WITH ASSUMPTIONS
For a conventional query Q over a tuple-correlated probabilistic database PDB, let T(Q_PDB) be the time for evaluating the set of result tuples and their lineage, and let T(P_Q) be the time for computing the probabilities of the result tuples.
Given a query with assumptions AQ(Q, C) over a probabilistic database PDB, we compare the lineage_based approach with the conditioning_based approach (described in section 3.3). We also compare the time for AQ(Q, C) with that for Q.

Conventional Query Q
Let {RT, L} be the result of Q without probability inference.
To compute the probability that an arbitrary logical expression is true, the inference method enumerates the assignments of the variables involved and sums the probabilities of the assignments that make the expression true. The time complexity of probability inference is therefore exponential in the number of variables involved.
For each t ∈ RT, the time complexity of computing the probability of t, P(L(t)), is O(2^|SVT(L(t))|), where SVT(L(t)) denotes the set of variables that appear in L(t) after each tuple identifier t_i in L(t) is replaced with f(t_i).

AQ(Q, C) by our Approach
(1) Naive probability inference method
According to the algorithm Assumption_Query (described in section 3.3), our approach first evaluates the conventional query Q over the probabilistic database PDB without probability inference, and then computes P(L(t)|C) as the probability of each result tuple t. The first step takes T(Q_PDB) time, as for a conventional query.

P(L(t)|C) = P(L(t)∧C)/P(C).
Computing P(C) takes O(2^|SVT(C)|) time, where SVT(C) denotes the set of variables that appear in C after each tuple identifier t_i in C is replaced with f(t_i). P(C) is computed only once and is then reused in the probability calculation of every result tuple.
A naive inference method for computing P(L(t)∧C) is to enumerate the assignments of the variables in SVT(L(t)∧C). The time complexity of the naive inference method for computing P(L(t)∧C) is O(2^|SVT(L(t)∧C)|), where SVT(L(t)∧C) denotes the set of variables that appear in L(t)∧C after each tuple identifier t in L(t) and C is replaced with f(t).

AQ(Q, C) by the conditioning_based Approach
The conditioning_based approach needs a preprocessing step that generates a posteriori probabilistic database PDB', after which the conventional query Q is evaluated over PDB'. Since PDB' has the same set of tuples R as the original database PDB, the time for evaluating Q over PDB' without probability inference is the same as for Q over PDB. The preprocessing step takes O(|R|·2^|E|) time, where |R| is the number of tuples and |E| is the number of variables in the probabilistic relational database.
Let {RT, L'} be the result of Q over PDB' without probability inference, and let T(P'_Q) be the time for computing the probabilities of the result tuples.
The set of variables in PDB' and the logical formula of each tuple may differ from those in PDB. Therefore, T(P'_Q) may be greater or less than T(P_Q).

Analysis
For a query with assumptions, the lineage_based approach and the conditioning_based approach take the same time for evaluating the conventional query without probability inference. However, the preprocessing step that generates the posteriori probabilistic database in the conditioning_based approach costs far more time than the probability inference of the result tuples in our approach. Furthermore, the lineage_based approach avoids generating a new version of the probabilistic database.
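The cost terms of the two approaches can be collected side by side. This is only a summary of terms given in the text, with the lineage_based total written for the naive inference method; the labels T_conditioning and T_lineage are introduced here for illustration.

```latex
\begin{align*}
T_{\text{conditioning}} &= O\!\left(|R| \cdot 2^{|E|}\right) + T(Q_{PDB})
  + \sum_{t \in RT} O\!\left(2^{|SVT(L'(t))|}\right) \\
T_{\text{lineage}} &= T(Q_{PDB}) + O\!\left(2^{|SVT(C)|}\right)
  + \sum_{t \in RT} O\!\left(2^{|SVT(L(t) \wedge C)|}\right)
\end{align*}
```

The preprocessing term grows with the number of tuples in the whole database, whereas every term of the lineage_based total other than T(Q_PDB) depends only on the assumption and the result tuples.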
Next, we compare the time for a query with assumptions AQ(Q, C) by our approach with that of the conventional query Q.
The naive probability inference method takes more time for the probability calculation of the result tuples than the conventional query does. The improved probability inference method, in contrast, takes no more time than the conventional query unless there is a result tuple t satisfying |SVT(L(t))| < |SVT(C)| and SVT(L(t)) ∩ SVT(C) ≠ ∅.
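The role of the overlap between the two variable sets can be illustrated as follows. Because the variables in V_P are mutually independent, a lineage that shares no variable with the assumption is independent of it, so P(L(t)|C) = P(L(t)) and no joint enumeration is needed. The sketch below uses this shortcut and reuses P(C) across result tuples; it illustrates the condition only and is not the improved inference algorithm of the paper. All names and probabilities in it are hypothetical.

```python
from itertools import product

P_VAR = {"v1": 0.5, "v2": 0.6, "v3": 0.7}  # hypothetical probabilities


def prob(variables, pred):
    """P(pred) by enumerating assignments of the given variables only."""
    names = sorted(variables)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        if pred(a):
            w = 1.0
            for n in names:
                w *= P_VAR[n] if a[n] else 1.0 - P_VAR[n]
            total += w
    return total


def cond_prob(lineage, assumption, p_c=None):
    """P(lineage | assumption); each argument is (variable set, predicate)."""
    l_vars, l_pred = lineage
    c_vars, c_pred = assumption
    if not (l_vars & c_vars):
        # Disjoint variable sets: independent, so the condition has no effect.
        return prob(l_vars, l_pred)
    if p_c is None:
        p_c = prob(c_vars, c_pred)
    joint = prob(l_vars | c_vars, lambda a: l_pred(a) and c_pred(a))
    return joint / p_c


C = ({"v1"}, lambda a: a["v1"])
L_shared = ({"v1", "v2"}, lambda a: a["v1"] and a["v2"])
L_disjoint = ({"v3"}, lambda a: a["v3"])

p_c = prob(*C)  # computed once and reused for every result tuple
print(cond_prob(L_shared, C, p_c))    # enumerates v1 and v2 jointly
print(cond_prob(L_disjoint, C, p_c))  # shortcut: equals P(L_disjoint)
```

In the second call the enumeration covers only the variables of the lineage, so the cost matches that of the conventional query for that tuple.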

EXPERIMENTS
In this section, we evaluate the efficiency of the lineage_based approach for answering queries with assumptions over probabilistic relational databases.

Experiment Setup
Probabilistic databases: The data set consists of a variable table V_P and tuple-correlated probabilistic databases. V_P contains a set of mutually independent Boolean variables {e_1, e_2, …, e_10}, whose probabilities are chosen at random. The tuple-correlated probabilistic databases are obtained from relational databases produced by TPC-H 2.14.4, with each tuple t associated with a logical formula f(t) composed of variables in V_P.

Lineage_based Approach, the Conditioning_based Approach
The conditioning_based approach incurs the cost of generating a posteriori probabilistic database and of processing a conventional query. We generate posteriori probabilistic databases by transforming the probabilistic database at 0.01 TPC-H scale for the two assumption clauses C1 and C2. The posteriori PDB for C1 (with 6 variables) has 4 fewer variables than the original probabilistic database, while that for C2 (with 14 variables) has 4 more. Tab. 4 shows the running time of AQ(Q, C1) and AQ(Q, C2) under the lineage_based approach and the conditioning_based approach over the probabilistic database at each TPC-H scale.
For a query AQ(Q, C), the time cost of the conditioning_based approach increases much faster than that of the lineage_based approach as the database scales, since it needs much more time to generate the posteriori PDB. When |RT| grows, the time cost of the lineage_based approach increases but remains much lower than that of the conditioning_based approach. When C = C1, the lineage L' of the result tuples used for probability computation in the conditioning_based approach includes fewer variables than the lineage in the lineage_based approach; nevertheless, the time spent generating the posteriori probabilistic database is much greater than the total time for processing AQ(Q, C) by the lineage_based approach. This demonstrates that the conditioning_based approach costs much more time than the lineage_based approach for AQ(Q, C), whether C = C1 or C = C2. The reason is that, in the conditioning_based approach, generating the posteriori probabilistic database re-computes the logical formula of each tuple in the PDB in O(|R|·2^|E|) time, so its total cost is O(|R|·2^|E|) + T(Q_PDB) + ∑_{t∈RT} O(2^|SVT(L'(t))|), while the total cost of the lineage_based approach is T(Q_PDB) plus the time for probability inference over the result tuples. Thus, the conditioning_based approach always costs much more time than the lineage_based approach, regardless of whether the posteriori PDB contains more or fewer variables than the original PDB.

Query with Assumptions, Conventional Query
Based on the analysis in section 4.4, a query with assumptions AQ_PDB(Q, C) processed by the lineage_based approach has the same time complexity as the conventional query Q in the step of query evaluation without probability inference, denoted T(Q_PDB). When the number of variables in the assumption, |SVT(C_E)|, is less than the number of variables in the lineage of each tuple, |SVT(L)|, the probability inference of AQ_PDB(Q, C) also has the same time complexity as that of Q; otherwise, the probability inference of AQ_PDB(Q, C) costs more time than that of Q. Fig. 6 shows that, when D_scale and |RT| are fixed, queries with |SVT(L)| = 10 cost much more time than those with |SVT(L)| = 8. This is because the time cost of the probability calculation of the results grows exponentially as |SVT(L)| increases, and the probability calculation accounts for a large proportion of the total time cost of query processing. Therefore, when |SVT(L)| increases, the total time cost of queries grows significantly.
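As a rough check of this exponential growth, under the enumeration-based cost model of the complexity analysis and ignoring constant factors (an assumption, since the measured times also include query evaluation), the number of assignments enumerated per result tuple is:

```latex
\[
2^{8} = 256, \qquad 2^{10} = 1024, \qquad \frac{2^{10}}{2^{8}} = 4 .
\]
```

Each result tuple with |SVT(L)| = 10 therefore requires about four times as many assignments as one with |SVT(L)| = 8.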

DISCUSSION
For queries with assumptions, the lineage_based approach is more efficient than the conditioning_based approach, even when the generated posteriori probabilistic database includes fewer variables than the original probabilistic database. This is because, in the lineage_based approach, the time saved by not generating a new version of the probabilistic database outweighs the time spent on probability computation for the result tuples. The lineage_based approach takes no more time for a query with assumptions than for the conventional query as long as the number of variables in the transformed expression of the assumption is less than that in the lineage expression of each result tuple of the conventional query.

CONCLUSIONS
When users have prior knowledge about a probabilistic database, they cannot obtain results that reflect this knowledge from the probabilistic database by conventional queries, because such knowledge is difficult to describe in the clauses of a conventional query statement. We therefore propose the query with assumptions, a conventional query evaluated under a given assumption, which enables users to obtain information from the probabilistic database based on their prior knowledge.
The conditioning_based approach generates a posteriori probabilistic database for each query with assumptions, which consumes too many resources. Our approach obtains the result of a query with assumptions directly from the original probabilistic relational database, without conditioning.
The experimental results show that our approach performs much better than the conditioning_based approach. Under our approach, a query with assumptions performs approximately the same as a conventional query when the transformed expression of the assumption does not contain more variables than the lineage of any tuple in the result of the conventional query.
The assumptions supported in this paper are limited to those that do not introduce new possible worlds beyond those of the probabilistic database. As future work, we plan to consider assumptions that introduce new possible worlds.