Extend documentation for reduce_agg (facebookincubator#7160)

Summary:
Clarify requirements for reduce_agg inputs. Also, fix formatting issues throughout the documentation.

Pull Request resolved: facebookincubator#7160

Reviewed By: xiaoxmeng

Differential Revision: D50495517

Pulled By: mbasmanova

fbshipit-source-id: 130f714d89e9a3ff1d84ea24f859fee805902c12
mbasmanova · Oct 20, 2023 · 7acc337
1 parent dde33b8
commit 7acc337

Showing 4 changed files with 18 additions and 11 deletions.
diff --git a/velox/docs/configs.rst b/velox/docs/configs.rst
@@ -30,8 +30,7 @@ Generic Configuration
    * - table_scan_getoutput_time_limit_ms
      - integer
      - 5000
-     - TableScan operator will exit getOutput() method after this many milliseconds even if it has no data to return yet.
-     - Zero means 'no time limit'.
+     - TableScan operator will exit getOutput() method after this many milliseconds even if it has no data to return yet. Zero means 'no time limit'.
    * - abandon_partial_aggregation_min_rows
      - integer
      - 100,000
@@ -211,9 +210,7 @@ Spilling
    * - aggregation_spill_all
      - boolean
      - false
-     - If true and spilling has been triggered during the input processing, the spiller will spill all the remaining
-     - in-memory state to disk before output processing. This is to simplify the aggregation query OOM prevention in
-     - output processing stage.
+     - If true and spilling has been triggered during the input processing, the spiller will spill all the remaining in-memory state to disk before output processing. This is to simplify the aggregation query OOM prevention in output processing stage.
    * - join_spill_memory_threshold
      - integer
      - 0

diff --git a/velox/docs/develop/operators.rst b/velox/docs/develop/operators.rst
@@ -233,8 +233,7 @@ followed by the group ID column. The type of group ID column is BIGINT.
    * - Property
      - Description
    * - groupingSets
-     - List of grouping key sets. Keys within each set must be unique, but keys can repeat across the sets.
-     - Grouping keys are specified with their output names.
+     - List of grouping key sets. Keys within each set must be unique, but keys can repeat across the sets. Grouping keys are specified with their output names.
    * - groupingKeyInfos
      - The names and order of the grouping key columns in the output.
    * - aggregationInputs

diff --git a/velox/docs/functions/presto/aggregate.rst b/velox/docs/functions/presto/aggregate.rst
@@ -143,6 +143,17 @@ General Aggregate Functions
     The final state is returned. Throws an error if ``initialState`` is NULL or
     ``inputFunction`` or ``combineFunction`` returns a NULL.
 
+    Take care when designing ``initialState``, ``inputFunction`` and ``combineFunction``.
+    These need to support evaluating aggregation in a distributed manner using partial
+    aggregation on many nodes, followed by shuffle over group-by keys, followed by
+    final aggregation. Make sure that
+
+     combineFunction(s1, s2) = combineFunction(s2, s1) for any s1 and s2;
+
+     inputFunction(inputFunction(initialState, x), y) = combineFunction(inputFunction(initialState, x), inputFunction(initialState, y)) for any x and y
+
+    Check out `blog post about reduce_agg <https://velox-lib.io/blog/reduce-agg>`_ for more context.
+
     Note that reduce_agg doesn't support evaluation over sorted inputs.::
 
         -- Compute sum (for illustration purposes only; use SUM aggregate function in production queries).
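
Note on the hunk above: the two conditions added to the reduce_agg documentation can be checked on small inputs before writing a query. The following is a minimal Python sketch, not Velox code. The helper names (`reduce_agg_single_node`, `reduce_agg_distributed`, `is_commutative`, `is_consistent`) are hypothetical, and the partitioning only imitates partial aggregation followed by final aggregation.

```python
from functools import reduce
from itertools import product


def reduce_agg_single_node(values, initial_state, input_fn):
    """Fold every input value into one state, as a single node would."""
    return reduce(input_fn, values, initial_state)


def reduce_agg_distributed(partitions, initial_state, input_fn, combine_fn):
    """Imitate partial aggregation per partition, then merge the partial states."""
    partials = [reduce(input_fn, part, initial_state) for part in partitions]
    return reduce(combine_fn, partials)


def is_commutative(combine_fn, states):
    """Check combineFunction(s1, s2) == combineFunction(s2, s1) on sample states."""
    return all(
        combine_fn(s1, s2) == combine_fn(s2, s1)
        for s1, s2 in product(states, repeat=2)
    )


def is_consistent(initial_state, input_fn, combine_fn, values):
    """Check that folding x then y equals combining the two single-value states."""
    return all(
        input_fn(input_fn(initial_state, x), y)
        == combine_fn(input_fn(initial_state, x), input_fn(initial_state, y))
        for x, y in product(values, repeat=2)
    )


def sum_input(state, x):
    return state + x


def sum_combine(s1, s2):
    return s1 + s2


def sub_combine(s1, s2):
    # Deliberately broken combine function: subtraction is not commutative.
    return s1 - s2


values = [1, 2, 3, 4]
partitions = [[1, 2], [3, 4]]

# Sum with initial state 0 satisfies both conditions, so results agree.
good_single = reduce_agg_single_node(values, 0, sum_input)                    # 10
good_distributed = reduce_agg_distributed(partitions, 0, sum_input, sum_combine)  # 10

# Sum with initial state 1 violates the second condition: every partition
# contributes its own copy of the initial state.
bad_single = reduce_agg_single_node(values, 1, sum_input)                     # 11
bad_distributed = reduce_agg_distributed(partitions, 1, sum_input, sum_combine)   # 12
```

With initial state 1 the answer depends on how the rows happen to be partitioned, which is the kind of nondeterminism the two conditions are meant to rule out. Passing these checks on a handful of sample values is only a sanity test, not a proof that the functions are safe for all inputs.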

diff --git a/velox/docs/functions/spark/datetime.rst b/velox/docs/functions/spark/datetime.rst
@@ -34,21 +34,21 @@ These functions support TIMESTAMP and DATE input types.
 
     Returns Returns the day of year of the date/timestamp. ::
 
-    SELECT dayofyear('2016-04-09'); -- 100
+        SELECT dayofyear('2016-04-09'); -- 100
 
 .. spark:function:: dayofmonth(date) -> integer
 
     Returns the day of month of the date/timestamp. ::
 
-    SELECT dayofmonth('2009-07-30'); -- 30
+        SELECT dayofmonth('2009-07-30'); -- 30
 
 .. spark:function:: dayofweek(date/timestamp) -> integer
 
     Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday).
     We can use `dow` as alias for ::
 
-    SELECT dayofweek('2009-07-30'); -- 5
-    SELECT dayofweek('2023-08-22 11:23:00.100'); -- 3
+        SELECT dayofweek('2009-07-30'); -- 5
+        SELECT dayofweek('2023-08-22 11:23:00.100'); -- 3
 
 .. function:: dow(x) -> integer