Queries

In the following descriptions, the terms "node", "input relation", and "output relation" are
used frequently. Congregation is a query compiler, and represents operations to itself as nodes
in a DAG. For each node, there is an input relation, which represents the input data to that
node (operation), and an output relation, which represents the data after the operation has been
performed. Therefore, each query (excepting a few special ones) takes one or more input nodes, and
returns a single node object.

Each query documented below will be formatted as follows:

- Method signature
- Query description
- A list of query parameter descriptions

create(
   name: < str >,
   columns: < list >,
   stored_with: < set >,
   input_path: < str > )
- Description: Each dataset to be used in a workflow must be instantiated
  by a call to create(). It is analogous to the SQL USE command.
- name: The name of the output relation on this node. For create nodes,
  this will also correspond to the name of the file on your local filesystem
  (omit the .csv extension).
- columns: A list of column objects.
- stored_with: The party or parties that store this data. Note that a
  stored_with set of a single party indicates that the input data is
  in plaintext form, while a multi-element stored_with set indicates
  that the input data is secret shared across the given parties.
- input_path: An optional argument to indicate a file path to the input
  file which corresponds to this dataset. If not input_path argument is
  supplied, congregation will resolve this filepath by concatenating the
  general.data_path value (from the configuration file) with the name
  value passed to create(). Thus, if your general.data_path was set as
  /tmp/data/, and the name value to this function was in1, congregation
  would resolve the path to this dataset as /tmp/data/in1.csv (if no input_path
  argument was supplied.
aggregate(
   input_op_node: < Node >,
   name: < str >,
   group_col_names: < list, None >,
   agg_col_name: < str >,
   agg_type: < str >,
   agg_out_col_name: < str, None > = None
)
- Description: Perform some aggregation type on the column with name
  agg_col_name, with respect to the columns in group_col_names. For
  example, if we had some relation stored in a variable called stat with
  two columns [age, income], and we wanted to find the average income
  of this relation with respect to age, we would write that as follows:
  avg_income = aggregate(stat, 'avg_income', ['age'], 'income', 'mean')
- input_op_node: Input node.
- name: The name of the output relation on this node.
- group_col_names: Either a list of columns to be grouped in this relation,
  or None if only a simple (non-keyed) aggregation is being performed. \
- agg_col_name: Name of the column which is being aggregated over.
- agg_type: The type of aggregation to be performed. Currently, the following
  aggregation types are supported:
  - Sum ("sum")
  - Mean ("mean")
  - Standard deviation ("std_dev")
  - Variance ("variance")
- agg_out_col_name: (optional) The name of the output aggregation column.
  If this argument is left blank, the column being aggregated over will retain
  its original name.
min_max_median(
   input_op_node: < Node >,
   name: < str >,
   group_col_names: < list, None >,
   target_col_name: < str >,
)
- Description: Outputs the minimum, maximum, and median values of the column
  with name target_col_name. Note that group_col_names here is a placeholder and
  not currently available for MPC workflows.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- group_col_names: Either a list of columns to be grouped in this relation,
  or None if only a simple (non-keyed) aggregation is being performed. \
- target_col_name: Name of the column which is being aggregated over.
deciles(
   input_op_node: < Node >,
   name: < str >,
   group_col_names: < list, None >,
   target_col_name: < str >, \
)
- Description: Outputs the 1...9-decile values of the column with name
  target_col_name. Note that group_col_names here is a placeholder and
  not currently available for MPC workflows.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- group_col_names: Either a list of columns to be grouped in this relation,
  or None if only a simple (non-keyed) aggregation is being performed. \
- target_col_name: Name of the column which is being aggregated over.
aggregate_count(
   input_op_node: < Node >,
   name: < str >,
   group_col_names: < list >
)
- Description: Much like aggregate() above, this performs an aggregation
  with respect to some provided group columns. Here, though, what is produced
  is a simple count column which indicates how many elements were in each output
  group.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- group_col_names: Either a list of columns to be grouped in this relation,
  or None if only a simple (non-keyed) aggregation is being performed. Note
  that for this operator, a non-keyed aggregation amounts to a simple row
  count of the input relation.
all_stats(
   input_op_node: < Node >,
   name: < str >,
   group_col_names: < list, None >,
   target_col_name: < str >, \
)
- Description: Outputs all statistical operators (sum, mean, variance, standard
  deviation, min/max/median, deciles, row count) as part of a single relation. Note
  that group_col_names here is a placeholder and not currently available for MPC
  workflows.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- group_col_names: Either a list of columns to be grouped in this relation,
  or None if only a simple (non-keyed) aggregation is being performed. \
- target_col_name: Name of the column which is being aggregated over.
project(
   input_op_node: < Node >,
   name: < str >,
   selected_col_names: < list >
)
- Description: Either drop columns from an input relation (by
  omitting them from selected_col_names), reorder the columns in
  and existing relation, or both.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- selected_col_names: A list of the names of columns that should be part
  of the output relation. Note that the order passed in this list will be
  preserved in the output relation.
add(
   input_op_node: < Node >,
   name: < str >,
   target_col_name: < str >,
   operands: < list >
)
- Description: Add all values from the operands list to the values
  of column target_col_name. Values in the operands list can be names
  of columns in the input relation, scalar values, or a mix of both.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- target_col_name: Name of the column to be operated upon. This can be
  either a column already present in the input relation (in which case
  it will be summed with each element of the operands list), or the
  name of a new column which will be part of the output relation. In the
  latter case, a new column will be created from the sum of the elements
  of the operands list.
- operands: A list elements to be summed with the target column, if it exists.
  If the target column is a new column, then a column is created with the name
  target_col_name, whose value will be the sum of the elements of the
  operands list. The elements in the operands list can either be numeric
  (int or float), or the names of columns themselves (str). For example, an
  operands list of [3, "a", "c"] would yield 3 added to each row of column "a",
  added to each row of column "c".
subtract(
   input_op_node: < Node >,
   name: < str >,
   target_col_name: < str >,
   operands: < list >
)
- Description: Subtract all values from the operands list from the values
  of column target_col_name. Values in the operands list can be names
  of columns in the input relation, scalar values, or a mix of both.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- target_col_name: Name of the column to be operated upon. This can be
  either a column already present in the input relation (in which case
  each element of the operands list will be subtracted from it), or the
  name of a new column which will be part of the output relation. In the
  latter case, a new column will be created from the difference of the elements
  of the operands list.
- operands: A list elements to be subtracted from the target column, if it exists.
  If the target column is a new column, then a column is created with the name
  target_col_name, whose value will be the difference of the elements of the
  operands list. The elements in the operands list can either be numeric
  (int or float), or the names of columns themselves (str). For example, an
  operands list of [3, "a", "c"] would yield 3 minus each row of column "a",
  minus each row of column "c".
multiply(
   input_op_node: < Node >,
   name: < str >,
   target_col_name: < str >,
   operands: < list >
)
- Description: Multiply all values from the operands list with the values
  of column target_col_name. Values in the operands list can be names of
  columns in the input relation, scalar values, or a mix of both.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- target_col_name: Name of the column to be operated upon. This can be
  either a column already present in the input relation (in which case
  it will be multiplied by each element of the operands list), or the
  name of a new column which will be part of the output relation. In the
  latter case, a new column will be created from the product of the elements
  of the operands list.
- operands: A list elements to be multiplied by the target column, if it exists.
  If the target column is a new column, then a column is created with the name
  target_col_name, whose value will be the product of the elements of the
  operands list. The elements in the operands list can either be numeric
  (int or float), or the names of columns themselves (str). For example, an
  operands list of [3, "a", "c"] would yield 3 multiplied by each row of
  column "a", multipled by each row of column "c".
divide(
   input_op_node: < Node >,
   name: < str >,
   target_col_name: < str >,
   operands: < list >
)
- Description: Divide all values from the operands list with the values
  of column target_col_name in sequence. For example, if target_col_name is
  a, and operands is [b, 7, c], then the output will be the values of a
  divided by the values of b, divided by 7, divided by the values of c.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- target_col_name: Name of the column to be operated upon. This can be
  either a column already present in the input relation (in which case
  it will be divided by each element of the operands list), or the
  name of a new column which will be part of the output relation. In the
  latter case, a new column will be created from the division of the elements
  of the operands list.
- operands: A list of elements that the target column will be divided by, if
  it exists. Note that, unlike multiply(), ordering for divide() is important.
  For example, if we had target_col = "a", and operands = [3, "b"], the output
  would be equal to the rows of column "a" divided by 3, and then divided by the
  rows of column "b". If column "a" doesn't already exist, a new column "a" would
  be created, whose rows would be equal to 3 divided by the rows of column "b".
limit(
   input_op_node: < Node >,
   name: < str >,
   limit_num: < int >
)
- Description: Drop all rows beyond limit_num from the input relation.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- limit_num: Number of rows to be part of the output relation. E.g. - a
  limit_num of 10 would output the first 10 rows of the input relation.
distinct(
   input_op_node: < Node >,
   name: < str >,
   selected_col_names: < list, None > = None
)
- Description: Drop all non-unique rows from the input relation with respect
  to the columns in selected_col_names.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- selected_col_names: The columns in the input relation to apply the
  distinct operator to. For example, a selected_col_names value of ["a", "b"]
  would output all distinct rows of the input relation with respect to columns
  "a" and "b". Note that distinct will also only output those columns which are
  part of the selected_col_names list, and will drop all other columns from the
  input relation.
filter_by(
   input_op_node: < Node >,
   name: < str >,
   filter_col_name: < str >,
   operator: < str >,
   filter_against: < str, int >
)
- Description: Drop all rows from the input relation that do not satisfy some
  filter condition. The filter condition itself is composed of both the operator
  and filter_against values.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- filter_col_name: Column to apply the filter operation on. Only those columns
  which satisfy the filter operation will be part of the output relation.
- operator: Comparison function to apply. The following operators are currently
  supported: ">", "<", "==", ">=", "<="
- filter_against: Either the name of the column, or the scalar to compare the
  filter column against.
- Example :
  
  If filter_col_name equals "a", operator equals ">", and
  filter_against equals 7, then only those rows i which satisfy
  i[a] > 7 will be included in the output relation.
  
  If we instead set filter_against to "b", then only those rows i
  which satisfy i[a] > i[b] will be included in the output relation.
sort_by(
   input_op_node: < Node >,
   name: < str >,
   sort_by_col_name: < str >,
   increasing: < bool, None > = True
)
- Description: Sort the rows of the input relation with respect to the
  values in one of its columns.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- sort_by_col_name: The name of the column to key the sort on.
- increasing: Whether rows will be sorted in increasing order (True),
  or decreasing order (False).
num_rows(
   input_op_node: < Node >,
   name: < str >,
   count_col_name: < str > = "num_rows"
)
- Description: Count the number of rows in the input relation.
- input_op_node: Input node.
- name: The name of the output relation on this node.
- count_col_name: The output of this operation is a 1X1 relation which
  indicates how many rows were present in the input relation. count_col_name
  is the name that will be given to the single column in this output relation.
collect(
input_op_node: < Node >,
target_parties: < set >
)
- Description: Terminates a workflow and returns the data in the output relation
  of input_op_node to the parties specified by target_parties.
- input_op_node: Input node.
- target_parties: The set of compute parties that the output should be collected
  by. Note that at the time of this writing, this only controls whether a party
  writes the output to file, and does not control whether that output is actually
  revealed to them in a cryptographic sense.
join(
   left_input_node: < Node >,
   right_input_node: < Node >,
   name: < str >,
   left_col_names: < list >,
   right_col_names: < list >
)
- Description: Performs a simple join between the output relations of left_input_node
  and right_input_node. Note that equality checks between columns are done in the order
  indicated by left_col_names and right_col_names, and both of those lists must
  contain the same number of column names.
- left_input_node: Left side input node.
- right_input_node: Right side input node.
- name: The name of the output relation on this node.
- left_col_names: The names of the group by columns in the left input relation.
- right_col_names: The names of the group by columns in the right input relation.
concat(
   input_op_nodes: < list >,
   name: < str >,
   column_names: < list, None > = None
)
- Description: Performs a vertical concatenation of all nodes in the input_op_nodes
  list. Note that the number of columns in the output relations of each node in this list
  must be equal.
- input_op_nodes: List of nodes whose output relations will be concatenated.
- name: The name of the output relation on this node.
- column_names: Option to rename the columns of the output relation on this
  node. If unspecified, the column names from the first node of input_op_nodes
  will be used.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Queries

Queries

Clone this wiki locally