Skip to content

Queries

bengetch edited this page Jul 21, 2021 · 7 revisions

Queries

In the following descriptions, the terms "node", "input relation", and "output relation" are
used frequently. Congregation is a query compiler, and represents operations to itself as nodes
in a DAG. For each node, there is an input relation, which represents the input data to that
node (operation), and an output relation, which represents the data after the operation has been
performed. Therefore, each query (excepting a few special ones) takes one or more input nodes, and
returns a single node object.

Each query documented below will be formatted as follows:

- Method signature
- Query description
- A list of query parameter descriptions
  • create(
       name: < str >,
       columns: < list >,
       stored_with: < set >,
       input_path: < str > )

    • Description: Each dataset to be used in a workflow must be instantiated
      by a call to create(). It is analogous to the SQL USE command.

    • name: The name of the output relation on this node. For create nodes,
      this will also correspond to the name of the file on your local filesystem
      (omit the .csv extension).

    • columns: A list of column objects.

    • stored_with: The party or parties that store this data. Note that a
      stored_with set of a single party indicates that the input data is
      in plaintext form, while a multi-element stored_with set indicates
      that the input data is secret shared across the given parties.

    • input_path: An optional argument to indicate a file path to the input
      file which corresponds to this dataset. If not input_path argument is
      supplied, congregation will resolve this filepath by concatenating the
      general.data_path value (from the configuration file) with the name
      value passed to create(). Thus, if your general.data_path was set as
      /tmp/data/, and the name value to this function was in1, congregation
      would resolve the path to this dataset as /tmp/data/in1.csv (if no input_path
      argument was supplied.

  • aggregate(
       input_op_node: < Node >,
       name: < str >,
       group_col_names: < list, None >,
       agg_col_name: < str >,
       agg_type: < str >,
       agg_out_col_name: < str, None > = None
    )

    • Description: Perform some aggregation type on the column with name
      agg_col_name, with respect to the columns in group_col_names. For
      example, if we had some relation stored in a variable called stat with
      two columns [age, income], and we wanted to find the average income
      of this relation with respect to age, we would write that as follows:
      avg_income = aggregate(stat, 'avg_income', ['age'], 'income', 'mean')

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • group_col_names: Either a list of columns to be grouped in this relation,
      or None if only a simple (non-keyed) aggregation is being performed. \

    • agg_col_name: Name of the column which is being aggregated over.

    • agg_type: The type of aggregation to be performed. Currently, the following
      aggregation types are supported:

      • Sum ("sum")
      • Mean ("mean")
      • Standard deviation ("std_dev")
      • Variance ("variance")
    • agg_out_col_name: (optional) The name of the output aggregation column.
      If this argument is left blank, the column being aggregated over will retain
      its original name.

  • min_max_median(
       input_op_node: < Node >,
       name: < str >,
       group_col_names: < list, None >,
       target_col_name: < str >,
    )

    • Description: Outputs the minimum, maximum, and median values of the column
      with name target_col_name. Note that group_col_names here is a placeholder and
      not currently available for MPC workflows.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • group_col_names: Either a list of columns to be grouped in this relation,
      or None if only a simple (non-keyed) aggregation is being performed. \

    • target_col_name: Name of the column which is being aggregated over.

  • deciles(
       input_op_node: < Node >,
       name: < str >,
       group_col_names: < list, None >,
       target_col_name: < str >, \
    )

    • Description: Outputs the 1...9-decile values of the column with name
      target_col_name. Note that group_col_names here is a placeholder and
      not currently available for MPC workflows.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • group_col_names: Either a list of columns to be grouped in this relation,
      or None if only a simple (non-keyed) aggregation is being performed. \

    • target_col_name: Name of the column which is being aggregated over.

  • aggregate_count(
       input_op_node: < Node >,
       name: < str >,
       group_col_names: < list >
    )

    • Description: Much like aggregate() above, this performs an aggregation
      with respect to some provided group columns. Here, though, what is produced
      is a simple count column which indicates how many elements were in each output
      group.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • group_col_names: Either a list of columns to be grouped in this relation,
      or None if only a simple (non-keyed) aggregation is being performed. Note
      that for this operator, a non-keyed aggregation amounts to a simple row
      count of the input relation.

  • all_stats(
       input_op_node: < Node >,
       name: < str >,
       group_col_names: < list, None >,
       target_col_name: < str >, \
    )

    • Description: Outputs all statistical operators (sum, mean, variance, standard
      deviation, min/max/median, deciles, row count) as part of a single relation. Note
      that group_col_names here is a placeholder and not currently available for MPC
      workflows.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • group_col_names: Either a list of columns to be grouped in this relation,
      or None if only a simple (non-keyed) aggregation is being performed. \

    • target_col_name: Name of the column which is being aggregated over.

  • project(
       input_op_node: < Node >,
       name: < str >,
       selected_col_names: < list >
    )

    • Description: Either drop columns from an input relation (by
      omitting them from selected_col_names), reorder the columns in
      and existing relation, or both.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • selected_col_names: A list of the names of columns that should be part
      of the output relation. Note that the order passed in this list will be
      preserved in the output relation.

  • add(
       input_op_node: < Node >,
       name: < str >,
       target_col_name: < str >,
       operands: < list >
    )

    • Description: Add all values from the operands list to the values
      of column target_col_name. Values in the operands list can be names
      of columns in the input relation, scalar values, or a mix of both.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • target_col_name: Name of the column to be operated upon. This can be
      either a column already present in the input relation (in which case
      it will be summed with each element of the operands list), or the
      name of a new column which will be part of the output relation. In the
      latter case, a new column will be created from the sum of the elements
      of the operands list.

    • operands: A list elements to be summed with the target column, if it exists.
      If the target column is a new column, then a column is created with the name
      target_col_name, whose value will be the sum of the elements of the
      operands list. The elements in the operands list can either be numeric
      (int or float), or the names of columns themselves (str). For example, an
      operands list of [3, "a", "c"] would yield 3 added to each row of column "a",
      added to each row of column "c".

  • subtract(
       input_op_node: < Node >,
       name: < str >,
       target_col_name: < str >,
       operands: < list >
    )

    • Description: Subtract all values from the operands list from the values
      of column target_col_name. Values in the operands list can be names
      of columns in the input relation, scalar values, or a mix of both.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • target_col_name: Name of the column to be operated upon. This can be
      either a column already present in the input relation (in which case
      each element of the operands list will be subtracted from it), or the
      name of a new column which will be part of the output relation. In the
      latter case, a new column will be created from the difference of the elements
      of the operands list.

    • operands: A list elements to be subtracted from the target column, if it exists.
      If the target column is a new column, then a column is created with the name
      target_col_name, whose value will be the difference of the elements of the
      operands list. The elements in the operands list can either be numeric
      (int or float), or the names of columns themselves (str). For example, an
      operands list of [3, "a", "c"] would yield 3 minus each row of column "a",
      minus each row of column "c".

  • multiply(
       input_op_node: < Node >,
       name: < str >,
       target_col_name: < str >,
       operands: < list >
    )

    • Description: Multiply all values from the operands list with the values
      of column target_col_name. Values in the operands list can be names of
      columns in the input relation, scalar values, or a mix of both.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • target_col_name: Name of the column to be operated upon. This can be
      either a column already present in the input relation (in which case
      it will be multiplied by each element of the operands list), or the
      name of a new column which will be part of the output relation. In the
      latter case, a new column will be created from the product of the elements
      of the operands list.

    • operands: A list elements to be multiplied by the target column, if it exists.
      If the target column is a new column, then a column is created with the name
      target_col_name, whose value will be the product of the elements of the
      operands list. The elements in the operands list can either be numeric
      (int or float), or the names of columns themselves (str). For example, an
      operands list of [3, "a", "c"] would yield 3 multiplied by each row of
      column "a", multipled by each row of column "c".

  • divide(
       input_op_node: < Node >,
       name: < str >,
       target_col_name: < str >,
       operands: < list >
    )

    • Description: Divide all values from the operands list with the values
      of column target_col_name in sequence. For example, if target_col_name is
      a, and operands is [b, 7, c], then the output will be the values of a
      divided by the values of b, divided by 7, divided by the values of c.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • target_col_name: Name of the column to be operated upon. This can be
      either a column already present in the input relation (in which case
      it will be divided by each element of the operands list), or the
      name of a new column which will be part of the output relation. In the
      latter case, a new column will be created from the division of the elements
      of the operands list.

    • operands: A list of elements that the target column will be divided by, if
      it exists. Note that, unlike multiply(), ordering for divide() is important.
      For example, if we had target_col = "a", and operands = [3, "b"], the output
      would be equal to the rows of column "a" divided by 3, and then divided by the
      rows of column "b". If column "a" doesn't already exist, a new column "a" would
      be created, whose rows would be equal to 3 divided by the rows of column "b".

  • limit(
       input_op_node: < Node >,
       name: < str >,
       limit_num: < int >
    )

    • Description: Drop all rows beyond limit_num from the input relation.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • limit_num: Number of rows to be part of the output relation. E.g. - a
      limit_num of 10 would output the first 10 rows of the input relation.

  • distinct(
       input_op_node: < Node >,
       name: < str >,
       selected_col_names: < list, None > = None
    )

    • Description: Drop all non-unique rows from the input relation with respect
      to the columns in selected_col_names.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • selected_col_names: The columns in the input relation to apply the
      distinct operator to. For example, a selected_col_names value of ["a", "b"]
      would output all distinct rows of the input relation with respect to columns
      "a" and "b". Note that distinct will also only output those columns which are
      part of the selected_col_names list, and will drop all other columns from the
      input relation.

  • filter_by(
       input_op_node: < Node >,
       name: < str >,
       filter_col_name: < str >,
       operator: < str >,
       filter_against: < str, int >
    )

    • Description: Drop all rows from the input relation that do not satisfy some
      filter condition. The filter condition itself is composed of both the operator
      and filter_against values.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • filter_col_name: Column to apply the filter operation on. Only those columns
      which satisfy the filter operation will be part of the output relation.

    • operator: Comparison function to apply. The following operators are currently
      supported: ">", "<", "==", ">=", "<="

    • filter_against: Either the name of the column, or the scalar to compare the
      filter column against.

    • Example :

      If filter_col_name equals "a", operator equals ">", and
      filter_against equals 7, then only those rows i which satisfy
      i[a] > 7 will be included in the output relation.

      If we instead set filter_against to "b", then only those rows i
      which satisfy i[a] > i[b] will be included in the output relation.

  • sort_by(
       input_op_node: < Node >,
       name: < str >,
       sort_by_col_name: < str >,
       increasing: < bool, None > = True
    )

    • Description: Sort the rows of the input relation with respect to the
      values in one of its columns.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • sort_by_col_name: The name of the column to key the sort on.

    • increasing: Whether rows will be sorted in increasing order (True),
      or decreasing order (False).

  • num_rows(
       input_op_node: < Node >,
       name: < str >,
       count_col_name: < str > = "num_rows"
    )

    • Description: Count the number of rows in the input relation.

    • input_op_node: Input node.

    • name: The name of the output relation on this node.

    • count_col_name: The output of this operation is a 1X1 relation which
      indicates how many rows were present in the input relation. count_col_name
      is the name that will be given to the single column in this output relation.

  • collect(
       input_op_node: < Node >,
       target_parties: < set >
    )

    • Description: Terminates a workflow and returns the data in the output relation
      of input_op_node to the parties specified by target_parties.

    • input_op_node: Input node.

    • target_parties: The set of compute parties that the output should be collected
      by. Note that at the time of this writing, this only controls whether a party
      writes the output to file, and does not control whether that output is actually
      revealed to them in a cryptographic sense.

  • join(
       left_input_node: < Node >,
       right_input_node: < Node >,
       name: < str >,
       left_col_names: < list >,
       right_col_names: < list >
    )

    • Description: Performs a simple join between the output relations of left_input_node
      and right_input_node. Note that equality checks between columns are done in the order
      indicated by left_col_names and right_col_names, and both of those lists must
      contain the same number of column names.

    • left_input_node: Left side input node.

    • right_input_node: Right side input node.

    • name: The name of the output relation on this node.

    • left_col_names: The names of the group by columns in the left input relation.

    • right_col_names: The names of the group by columns in the right input relation.

  • concat(
       input_op_nodes: < list >,
       name: < str >,
       column_names: < list, None > = None
    )

    • Description: Performs a vertical concatenation of all nodes in the input_op_nodes
      list. Note that the number of columns in the output relations of each node in this list
      must be equal.

    • input_op_nodes: List of nodes whose output relations will be concatenated.

    • name: The name of the output relation on this node.

    • column_names: Option to rename the columns of the output relation on this
      node. If unspecified, the column names from the first node of input_op_nodes
      will be used.

Clone this wiki locally