From 351a5f8c004a449013ab25acbcfdd85e9e7868b8 Mon Sep 17 00:00:00 2001
From: Haejoon Lee
Date: Fri, 24 Nov 2023 19:38:31 +0900
Subject: [PATCH] [SPARK-46016][DOCS][PS] Fix pandas API support list properly

### What changes were proposed in this pull request?

This PR proposes to fix a critical issue in the [Supported pandas API documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html) where many essential APIs, such as `DataFrame.max`, `DataFrame.min`, `DataFrame.mean`, and `DataFrame.median`, were incorrectly marked as not implemented ("N"), as shown below:

[Screenshot 2023-11-24 at 12 37 49 PM]

The root cause was that the script used to generate the support list excluded functions inherited from parent classes. For instance, `CategoricalIndex.max` is supported because `CategoricalIndex` inherits it from the `Index` class, but since it is not implemented directly in `CategoricalIndex`, it was marked as unsupported:

[Screenshot 2023-11-24 at 12 30 08 PM]

### Why are the changes needed?

The current documentation misrepresents which pandas APIs are supported, which could significantly hinder user experience and adoption. Correcting these inaccuracies ensures that the documentation reflects the true capabilities of Pandas API on Spark and gives users reliable information.

### Does this PR introduce _any_ user-facing change?

No. This PR only updates the documentation to accurately reflect the current state of the supported pandas APIs.

### How was this patch tested?

Manually built the documentation and checked that the supported pandas API list is generated correctly, as shown below:

[Screenshot 2023-11-24 at 12 36 31 PM]

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43996 from itholic/fix_supported_api_gen.
Authored-by: Haejoon Lee
Signed-off-by: Hyukjin Kwon
(cherry picked from commit 132bb63a897f4f4049f34deefc065ed3eac6a90f)
Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/supported_api_gen.py | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/pandas/supported_api_gen.py b/python/pyspark/pandas/supported_api_gen.py
index 06591c5b26ad6..8c3cdec3671c1 100644
--- a/python/pyspark/pandas/supported_api_gen.py
+++ b/python/pyspark/pandas/supported_api_gen.py
@@ -138,23 +138,11 @@ def _create_supported_by_module(
         # module not implemented
         return {}
 
-    pd_funcs = dict(
-        [
-            m
-            for m in getmembers(pd_module, isfunction)
-            if not m[0].startswith("_") and m[0] in pd_module.__dict__
-        ]
-    )
+    pd_funcs = dict([m for m in getmembers(pd_module, isfunction) if not m[0].startswith("_")])
     if not pd_funcs:
         return {}
 
-    ps_funcs = dict(
-        [
-            m
-            for m in getmembers(ps_module, isfunction)
-            if not m[0].startswith("_") and m[0] in ps_module.__dict__
-        ]
-    )
+    ps_funcs = dict([m for m in getmembers(ps_module, isfunction) if not m[0].startswith("_")])
     return _organize_by_implementation_status(
         module_name, pd_funcs, ps_funcs, pd_module_group, ps_module_group
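
The effect of dropping the `m[0] in ...__dict__` condition can be illustrated with a minimal sketch. The two classes below are hypothetical stand-ins (they are not the real pandas or pandas-on-Spark classes), and `public_functions` with its `own_only` flag is an illustrative helper that does not exist in `supported_api_gen.py`. It only assumes that the objects passed to `getmembers` are classes, as in the `CategoricalIndex` / `Index` case described above.

```python
from inspect import getmembers, isfunction


class Index:
    """Hypothetical stand-in for a parent class."""

    def max(self):
        return "max"

    def _private(self):
        return None


class CategoricalIndex(Index):
    """Hypothetical stand-in for a subclass that inherits `max`."""

    def add_categories(self):
        return "add_categories"


def public_functions(cls, own_only):
    """Collect public functions of `cls`.

    own_only=True mimics the removed filter (name must be in cls.__dict__);
    own_only=False mimics the filter after this patch.
    """
    return dict(
        m
        for m in getmembers(cls, isfunction)
        if not m[0].startswith("_") and (not own_only or m[0] in cls.__dict__)
    )


old_style = public_functions(CategoricalIndex, own_only=True)
new_style = public_functions(CategoricalIndex, own_only=False)

print(sorted(old_style))  # ['add_categories']
print(sorted(new_style))  # ['add_categories', 'max']
```

With the `__dict__` check in place, the inherited `max` is missing from the result, which is consistent with inherited APIs having been reported as "N". Without it, `getmembers` returns inherited functions as well, while names starting with an underscore are still filtered out.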