-
Notifications
You must be signed in to change notification settings - Fork 4
Queries
In the following descriptions, the terms "node", "input relation", and "output relation" are
used frequently. Congregation is a query compiler, and represents operations to itself as nodes
in a DAG. For each node, there is an input relation, which represents the input data to that
node (operation), and an output relation, which represents the data after the operation has been
performed. Therefore, each query (excepting a few special ones) takes one or more input nodes, and
returns a single node object.
Each query documented below will be formatted as follows:
- Method signature
- Query description
- A list of query parameter descriptions
-
create(
name: < str >,
columns: < list >,
stored_with: < set >,
input_path: < str > )-
Description: Each dataset to be used in a workflow must be instantiated
by a call tocreate()
. It is analogous to the SQLUSE
command. -
name: The name of the output relation on this node. For create nodes,
this will also correspond to the name of the file on your local filesystem
(omit the .csv extension). -
columns: A list of column objects.
-
stored_with: The party or parties that store this data. Note that a
stored_with set of a single party indicates that the input data is
in plaintext form, while a multi-element stored_with set indicates
that the input data is secret shared across the given parties. -
input_path: An optional argument to indicate a file path to the input
file which corresponds to this dataset. If not input_path argument is
supplied, congregation will resolve this filepath by concatenating the
general.data_path value (from the configuration file) with thename
value passed to create(). Thus, if your general.data_path was set as
/tmp/data/
, and thename
value to this function wasin1
, congregation
would resolve the path to this dataset as/tmp/data/in1.csv
(if no input_path
argument was supplied.
-
-
aggregate(
input_op_node: < Node >,
name: < str >,
group_col_names: < list, None >,
agg_col_name: < str >,
agg_type: < str >,
agg_out_col_name: < str, None > = None
)-
Description: Perform some aggregation type on the column with name
agg_col_name
, with respect to the columns ingroup_col_names
. For
example, if we had some relation stored in a variable calledstat
with
two columns [age
,income
], and we wanted to find the averageincome
of this relation with respect toage
, we would write that as follows:
avg_income = aggregate(stat, 'avg_income', ['age'], 'income', 'mean')
-
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
group_col_names: Either a list of columns to be grouped in this relation,
or None if only a simple (non-keyed) aggregation is being performed. \ -
agg_col_name: Name of the column which is being aggregated over.
-
agg_type: The type of aggregation to be performed. Currently, the following
aggregation types are supported:- Sum ("sum")
- Mean ("mean")
- Standard deviation ("std_dev")
- Variance ("variance")
-
agg_out_col_name: (optional) The name of the output aggregation column.
If this argument is left blank, the column being aggregated over will retain
its original name.
-
-
min_max_median(
input_op_node: < Node >,
name: < str >,
group_col_names: < list, None >,
target_col_name: < str >,
)-
Description: Outputs the minimum, maximum, and median values of the column
with nametarget_col_name
. Note thatgroup_col_names
here is a placeholder and
not currently available for MPC workflows. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
group_col_names: Either a list of columns to be grouped in this relation,
or None if only a simple (non-keyed) aggregation is being performed. \ -
target_col_name: Name of the column which is being aggregated over.
-
-
deciles(
input_op_node: < Node >,
name: < str >,
group_col_names: < list, None >,
target_col_name: < str >, \
)-
Description: Outputs the 1...9-decile values of the column with name
target_col_name
. Note thatgroup_col_names
here is a placeholder and
not currently available for MPC workflows. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
group_col_names: Either a list of columns to be grouped in this relation,
or None if only a simple (non-keyed) aggregation is being performed. \ -
target_col_name: Name of the column which is being aggregated over.
-
-
aggregate_count(
input_op_node: < Node >,
name: < str >,
group_col_names: < list >
)-
Description: Much like
aggregate()
above, this performs an aggregation
with respect to some provided group columns. Here, though, what is produced
is a simple count column which indicates how many elements were in each output
group. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
group_col_names: Either a list of columns to be grouped in this relation,
or None if only a simple (non-keyed) aggregation is being performed. Note
that for this operator, a non-keyed aggregation amounts to a simple row
count of the input relation.
-
-
all_stats(
input_op_node: < Node >,
name: < str >,
group_col_names: < list, None >,
target_col_name: < str >, \
)-
Description: Outputs all statistical operators (sum, mean, variance, standard
deviation, min/max/median, deciles, row count) as part of a single relation. Note
thatgroup_col_names
here is a placeholder and not currently available for MPC
workflows. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
group_col_names: Either a list of columns to be grouped in this relation,
or None if only a simple (non-keyed) aggregation is being performed. \ -
target_col_name: Name of the column which is being aggregated over.
-
-
project(
input_op_node: < Node >,
name: < str >,
selected_col_names: < list >
)-
Description: Either drop columns from an input relation (by
omitting them fromselected_col_names
), reorder the columns in
and existing relation, or both. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
selected_col_names: A list of the names of columns that should be part
of the output relation. Note that the order passed in this list will be
preserved in the output relation.
-
-
add(
input_op_node: < Node >,
name: < str >,
target_col_name: < str >,
operands: < list >
)-
Description: Add all values from the
operands
list to the values
of columntarget_col_name
. Values in theoperands
list can be names
of columns in the input relation, scalar values, or a mix of both. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
target_col_name: Name of the column to be operated upon. This can be
either a column already present in the input relation (in which case
it will be summed with each element of theoperands
list), or the
name of a new column which will be part of the output relation. In the
latter case, a new column will be created from the sum of the elements
of theoperands
list. -
operands: A list elements to be summed with the target column, if it exists.
If the target column is a new column, then a column is created with the name
target_col_name
, whose value will be the sum of the elements of the
operands
list. The elements in theoperands
list can either be numeric
(int
orfloat
), or the names of columns themselves (str
). For example, an
operands
list of [3, "a", "c"] would yield 3 added to each row of column "a",
added to each row of column "c".
-
-
subtract(
input_op_node: < Node >,
name: < str >,
target_col_name: < str >,
operands: < list >
)-
Description: Subtract all values from the
operands
list from the values
of columntarget_col_name
. Values in theoperands
list can be names
of columns in the input relation, scalar values, or a mix of both. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
target_col_name: Name of the column to be operated upon. This can be
either a column already present in the input relation (in which case
each element of theoperands
list will be subtracted from it), or the
name of a new column which will be part of the output relation. In the
latter case, a new column will be created from the difference of the elements
of theoperands
list. -
operands: A list elements to be subtracted from the target column, if it exists.
If the target column is a new column, then a column is created with the name
target_col_name
, whose value will be the difference of the elements of the
operands
list. The elements in theoperands
list can either be numeric
(int
orfloat
), or the names of columns themselves (str
). For example, an
operands
list of [3, "a", "c"] would yield 3 minus each row of column "a",
minus each row of column "c".
-
-
multiply(
input_op_node: < Node >,
name: < str >,
target_col_name: < str >,
operands: < list >
)-
Description: Multiply all values from the
operands
list with the values
of columntarget_col_name
. Values in theoperands
list can be names of
columns in the input relation, scalar values, or a mix of both. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
target_col_name: Name of the column to be operated upon. This can be
either a column already present in the input relation (in which case
it will be multiplied by each element of theoperands
list), or the
name of a new column which will be part of the output relation. In the
latter case, a new column will be created from the product of the elements
of theoperands
list. -
operands: A list elements to be multiplied by the target column, if it exists.
If the target column is a new column, then a column is created with the name
target_col_name
, whose value will be the product of the elements of the
operands
list. The elements in theoperands
list can either be numeric
(int
orfloat
), or the names of columns themselves (str
). For example, an
operands
list of [3, "a", "c"] would yield 3 multiplied by each row of
column "a", multipled by each row of column "c".
-
-
divide(
input_op_node: < Node >,
name: < str >,
target_col_name: < str >,
operands: < list >
)-
Description: Divide all values from the
operands
list with the values
of columntarget_col_name
in sequence. For example, iftarget_col_name
is
a
, andoperands
is [b
, 7,c
], then the output will be the values ofa
divided by the values ofb
, divided by 7, divided by the values ofc
. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
target_col_name: Name of the column to be operated upon. This can be
either a column already present in the input relation (in which case
it will be divided by each element of theoperands
list), or the
name of a new column which will be part of the output relation. In the
latter case, a new column will be created from the division of the elements
of theoperands
list. -
operands: A list of elements that the target column will be divided by, if
it exists. Note that, unlikemultiply()
, ordering fordivide()
is important.
For example, if we hadtarget_col
= "a", andoperands
= [3, "b"], the output
would be equal to the rows of column "a" divided by 3, and then divided by the
rows of column "b". If column "a" doesn't already exist, a new column "a" would
be created, whose rows would be equal to 3 divided by the rows of column "b".
-
-
limit(
input_op_node: < Node >,
name: < str >,
limit_num: < int >
)-
Description: Drop all rows beyond
limit_num
from the input relation. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
limit_num: Number of rows to be part of the output relation. E.g. - a
limit_num
of 10 would output the first 10 rows of the input relation.
-
-
distinct(
input_op_node: < Node >,
name: < str >,
selected_col_names: < list, None > = None
)-
Description: Drop all non-unique rows from the input relation with respect
to the columns inselected_col_names
. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
selected_col_names: The columns in the input relation to apply the
distinct operator to. For example, aselected_col_names
value of ["a", "b"]
would output all distinct rows of the input relation with respect to columns
"a" and "b". Note thatdistinct
will also only output those columns which are
part of theselected_col_names
list, and will drop all other columns from the
input relation.
-
-
filter_by(
input_op_node: < Node >,
name: < str >,
filter_col_name: < str >,
operator: < str >,
filter_against: < str, int >
)-
Description: Drop all rows from the input relation that do not satisfy some
filter condition. The filter condition itself is composed of both theoperator
andfilter_against
values. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
filter_col_name: Column to apply the filter operation on. Only those columns
which satisfy the filter operation will be part of the output relation. -
operator: Comparison function to apply. The following operators are currently
supported: ">", "<", "==", ">=", "<=" -
filter_against: Either the name of the column, or the scalar to compare the
filter column against. -
Example :
If
filter_col_name
equals "a",operator
equals ">", and
filter_against
equals 7, then only those rowsi
which satisfy
i[a] > 7
will be included in the output relation.If we instead set
filter_against
to "b", then only those rowsi
which satisfyi[a] > i[b]
will be included in the output relation.
-
-
sort_by(
input_op_node: < Node >,
name: < str >,
sort_by_col_name: < str >,
increasing: < bool, None > = True
)-
Description: Sort the rows of the input relation with respect to the
values in one of its columns. -
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
sort_by_col_name: The name of the column to key the sort on.
-
increasing: Whether rows will be sorted in increasing order (True),
or decreasing order (False).
-
-
num_rows(
input_op_node: < Node >,
name: < str >,
count_col_name: < str > = "num_rows"
)-
Description: Count the number of rows in the input relation.
-
input_op_node: Input node.
-
name: The name of the output relation on this node.
-
count_col_name: The output of this operation is a 1X1 relation which
indicates how many rows were present in the input relation.count_col_name
is the name that will be given to the single column in this output relation.
-
-
collect(
input_op_node: < Node >,
target_parties: < set >
)-
Description: Terminates a workflow and returns the data in the output relation
ofinput_op_node
to the parties specified bytarget_parties
. -
input_op_node: Input node.
-
target_parties: The set of compute parties that the output should be collected
by. Note that at the time of this writing, this only controls whether a party
writes the output to file, and does not control whether that output is actually
revealed to them in a cryptographic sense.
-
-
join(
left_input_node: < Node >,
right_input_node: < Node >,
name: < str >,
left_col_names: < list >,
right_col_names: < list >
)-
Description: Performs a simple join between the output relations of
left_input_node
andright_input_node
. Note that equality checks between columns are done in the order
indicated byleft_col_names
andright_col_names
, and both of those lists must
contain the same number of column names. -
left_input_node: Left side input node.
-
right_input_node: Right side input node.
-
name: The name of the output relation on this node.
-
left_col_names: The names of the group by columns in the left input relation.
-
right_col_names: The names of the group by columns in the right input relation.
-
-
concat(
input_op_nodes: < list >,
name: < str >,
column_names: < list, None > = None
)-
Description: Performs a vertical concatenation of all nodes in the
input_op_nodes
list. Note that the number of columns in the output relations of each node in this list
must be equal. -
input_op_nodes: List of nodes whose output relations will be concatenated.
-
name: The name of the output relation on this node.
-
column_names: Option to rename the columns of the output relation on this
node. If unspecified, the column names from the first node ofinput_op_nodes
will be used.
-