[SPARK-41586][PYTHON] Introduce pyspark.errors and error classes for PySpark

### What changes were proposed in this pull request?

This PR proposes to introduce `pyspark.errors` and error classes to unify and improve the errors generated by PySpark under a single path.

To summarize, this PR includes the changes below:
- Add `python/pyspark/errors/error_classes.py` to support error classes for PySpark.
- Add `ErrorClassesReader` to manage `error_classes.py`.
- Add `PySparkException` to handle the errors generated by PySpark.
- Add `check_error` for error class testing.
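
For example, with the `COLUMN_IN_LIST` error class added here, raising an error through the new framework looks roughly like the minimal sketch below (the call site is illustrative; see the `lit` traceback further down for the real wiring):
```python
from pyspark.errors import PySparkException

# Raise an error identified by an error class rather than a free-form string.
# The <func_name> placeholder in the COLUMN_IN_LIST template is filled in
# from message_parameters.
raise PySparkException(
    error_class="COLUMN_IN_LIST",
    message_parameters={"func_name": "lit"},
)
# pyspark.errors.exceptions.PySparkException:
#     [COLUMN_IN_LIST] lit does not allow a column in a list.
```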

This is an initial PR that introduces an error framework for PySpark to facilitate error management and provide better, more consistent error messages to users.

While active work is being done on the [SQL side to improve error messages](https://issues.apache.org/jira/browse/SPARK-37935), so far there has been no corresponding work on error messages in PySpark.

This PR therefore also initiates the error message improvement effort on the PySpark side.

Eventually, error messages will be shown as below, for example:

- PySpark, `PySparkException` (thrown by Python driver):
```python
>>> from pyspark.sql.functions import lit
>>> lit([df.id, df.id])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/utils.py", line 334, in wrapped
    return f(*args, **kwargs)
  File ".../spark/python/pyspark/sql/functions.py", line 176, in lit
    raise PySparkException(
pyspark.errors.exceptions.PySparkException: [COLUMN_IN_LIST] lit does not allow a column in a list.
```

- PySpark, `AnalysisException` (thrown from the JVM side and captured on the PySpark side):
```python
>>> df.unpivot("id", [], "var", "val").collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 3296, in unpivot
    jdf = self._jdf.unpivotWithSeq(jids, jvals, variableColumnName, valueColumnName)
  File ".../spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 209, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: [UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids;
'Unpivot ArraySeq(id#2L), ArraySeq(), var, [val]
+- LogicalRDD [id#2L, int#3L, double#4, str#5], false
```

- Spark, `AnalysisException`:
```scala
scala> df.select($"id").unpivot(Array($"id"), Array.empty, variableColumnName = "var", valueColumnName = "val")
org.apache.spark.sql.AnalysisException: [UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids;
'Unpivot ArraySeq(id#0L), ArraySeq(), var, [val]
+- Project [id#0L]
   +- Range (0, 10, step=1, splits=Some(16))
```

**Next steps** after this PR include:
- Migrate more errors into `PySparkException` across all modules (e.g., Spark Connect, pandas API on Spark, ...).
- Migrate more error tests into error class tests by using `check_error` (see the sketch after this list).
- Define more error classes in `error_classes.py`.
- Add documentation.
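
As a rough illustration of the testing direction (the exact `check_error` helper is not shown in this page's diff, so the sketch below asserts directly on the exception instead):
```python
import unittest

from pyspark.errors import PySparkException


class ColumnInListTests(unittest.TestCase):
    def test_column_in_list(self) -> None:
        # Assert on the error class and message parameters rather than on a
        # brittle, fully rendered message string.
        with self.assertRaises(PySparkException) as pe:
            raise PySparkException(
                error_class="COLUMN_IN_LIST",
                message_parameters={"func_name": "lit"},
            )
        self.assertEqual(pe.exception.getErrorClass(), "COLUMN_IN_LIST")
        self.assertEqual(pe.exception.getMessageParameters(), {"func_name": "lit"})
```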

### Why are the changes needed?

Centralizing error messages and introducing identified error classes provides the following benefits:
- Errors are searchable via the unique class names and properly classified.
- Reduces the cost of future maintenance for PySpark errors.
- Provides consistent and actionable error messages to users.
- Facilitates translating error messages into different languages.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests and ran the existing static analysis tools (`dev/lint-python`).

Closes #39387 from itholic/SPARK-41586.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
itholic authored and HyukjinKwon committed Jan 16, 2023
1 parent 47068db commit b8100b5
Showing 16 changed files with 398 additions and 3 deletions.
2 changes: 1 addition & 1 deletion dev/pyproject.toml
@@ -31,4 +31,4 @@ required-version = "22.6.0"
line-length = 100
target-version = ['py37']
include = '\.pyi?$'
extend-exclude = 'cloudpickle'
extend-exclude = 'cloudpickle|error_classes.py'
10 changes: 10 additions & 0 deletions dev/sparktestsupport/modules.py
@@ -756,6 +756,16 @@ def __hash__(self):
    ],
)

pyspark_errors = Module(
    name="pyspark-errors",
    dependencies=[],
    source_file_regexes=["python/pyspark/errors"],
    python_test_goals=[
        # unittests
        "pyspark.errors.tests.test_errors",
    ],
)

sparkr = Module(
    name="sparkr",
    dependencies=[hive, mllib],
1 change: 1 addition & 0 deletions dev/tox.ini
@@ -30,6 +30,7 @@ per-file-ignores =
    # Examples contain some unused variables.
    examples/src/main/python/sql/datasource.py: F841,
    # Exclude * imports in test files
    python/pyspark/errors/tests/*.py: F403,
    python/pyspark/ml/tests/*.py: F403,
    python/pyspark/mllib/tests/*.py: F403,
    python/pyspark/pandas/tests/*.py: F401 F403,
1 change: 1 addition & 0 deletions python/docs/source/reference/index.rst
@@ -35,3 +35,4 @@ Pandas API on Spark follows the API specifications of latest pandas release.
   pyspark.mllib
   pyspark
   pyspark.resource
   pyspark.errors
29 changes: 29 additions & 0 deletions python/docs/source/reference/pyspark.errors.rst
@@ -0,0 +1,29 @@
..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..  http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

======
Errors
======

.. currentmodule:: pyspark.errors

.. autosummary::
    :toctree: api/

    PySparkException.getErrorClass
    PySparkException.getMessageParameters
3 changes: 3 additions & 0 deletions python/mypy.ini
@@ -106,6 +106,9 @@ ignore_errors = True
[mypy-pyspark.testing.*]
ignore_errors = True

[mypy-pyspark.errors.tests.*]
ignore_errors = True

; Allow non-strict optional for pyspark.pandas

[mypy-pyspark.pandas.*]
26 changes: 26 additions & 0 deletions python/pyspark/errors/__init__.py
@@ -0,0 +1,26 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
PySpark exceptions.
"""
from pyspark.errors.exceptions import PySparkException


__all__ = [
"PySparkException",
]
30 changes: 30 additions & 0 deletions python/pyspark/errors/error_classes.py
@@ -0,0 +1,30 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import json


ERROR_CLASSES_JSON = """
{
  "COLUMN_IN_LIST": {
    "message": [
      "<func_name> does not allow a column in a list."
    ]
  }
}
"""

ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)
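
Since `ERROR_CLASSES_MAP` is plain Python data, a quick sanity check is possible in a REPL; a minimal sketch, assuming a PySpark checkout on the path:
```python
from pyspark.errors.error_classes import ERROR_CLASSES_MAP

# Each top-level key is an error class name; its "message" value is a list
# of lines that are joined into the final message template.
template_lines = ERROR_CLASSES_MAP["COLUMN_IN_LIST"]["message"]
print("\n".join(template_lines))
# <func_name> does not allow a column in a list.
```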
76 changes: 76 additions & 0 deletions python/pyspark/errors/exceptions.py
@@ -0,0 +1,76 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from typing import Dict, Optional, cast

from pyspark.errors.utils import ErrorClassesReader


class PySparkException(Exception):
"""
Base Exception for handling errors generated from PySpark.
"""

def __init__(
self,
message: Optional[str] = None,
error_class: Optional[str] = None,
message_parameters: Optional[Dict[str, str]] = None,
):
# `message` vs `error_class` & `message_parameters` are mutually exclusive.
assert (message is not None and (error_class is None and message_parameters is None)) or (
message is None and (error_class is not None and message_parameters is not None)
)

self.error_reader = ErrorClassesReader()

if message is None:
self.message = self.error_reader.get_error_message(
cast(str, error_class), cast(Dict[str, str], message_parameters)
)
else:
self.message = message

self.error_class = error_class
self.message_parameters = message_parameters

def getErrorClass(self) -> Optional[str]:
"""
Returns an error class as a string.
.. versionadded:: 3.4.0
See Also
--------
:meth:`PySparkException.getMessageParameters`
"""
return self.error_class

def getMessageParameters(self) -> Optional[Dict[str, str]]:
"""
Returns a message parameters as a dictionary.
.. versionadded:: 3.4.0
See Also
--------
:meth:`PySparkException.getErrorClass`
"""
return self.message_parameters

def __str__(self) -> str:
return f"[{self.getErrorClass()}] {self.message}"
16 changes: 16 additions & 0 deletions python/pyspark/errors/tests/__init__.py
@@ -0,0 +1,16 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
47 changes: 47 additions & 0 deletions python/pyspark/errors/tests/test_errors.py
@@ -0,0 +1,47 @@
# -*- encoding: utf-8 -*-
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import unittest

from pyspark.errors.utils import ErrorClassesReader


class ErrorsTest(unittest.TestCase):
    def test_error_classes(self):
        # Check that error classes are sorted alphabetically.
        error_reader = ErrorClassesReader()
        error_class_names = list(error_reader.error_info_map)
        for i in range(len(error_class_names) - 1):
            self.assertTrue(
                error_class_names[i] < error_class_names[i + 1],
                f"Error class [{error_class_names[i]}] should be placed "
                f"after [{error_class_names[i + 1]}]",
            )


if __name__ == "__main__":
    import unittest
    from pyspark.errors.tests.test_errors import *  # noqa: F401

    try:
        import xmlrunner

        testRunner = xmlrunner.XMLTestRunner(output="target/test-reports", verbosity=2)
    except ImportError:
        testRunner = None
    unittest.main(testRunner=testRunner, verbosity=2)
116 changes: 116 additions & 0 deletions python/pyspark/errors/utils.py
@@ -0,0 +1,116 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import re
from typing import Dict

from pyspark.errors.error_classes import ERROR_CLASSES_MAP


class ErrorClassesReader:
"""
A reader to load error information from error_classes.py.
"""

def __init__(self) -> None:
self.error_info_map = ERROR_CLASSES_MAP

def get_error_message(self, error_class: str, message_parameters: Dict[str, str]) -> str:
"""
Returns the completed error message by applying message parameters to the message template.
"""
message_template = self.get_message_template(error_class)
# Verify message parameters.
message_parameters_from_template = re.findall("<([a-zA-Z0-9_-]+)>", message_template)
assert set(message_parameters_from_template) == set(message_parameters), (
f"Undifined error message parameter for error class: {error_class}. "
f"Parameters: {message_parameters}"
)
table = str.maketrans("<>", "{}")

return message_template.translate(table).format(**message_parameters)

def get_message_template(self, error_class: str) -> str:
"""
Returns the message template for corresponding error class from error_classes.py.
For example,
when given `error_class` is "EXAMPLE_ERROR_CLASS",
and corresponding error class in error_classes.py looks like the below:
.. code-block:: python
"EXAMPLE_ERROR_CLASS" : {
"message" : [
"Problem <A> because of <B>."
]
}
In this case, this function returns:
"Problem <A> because of <B>."
For sub error class, when given `error_class` is "EXAMPLE_ERROR_CLASS.SUB_ERROR_CLASS",
and corresponding error class in error_classes.py looks like the below:
.. code-block:: python
"EXAMPLE_ERROR_CLASS" : {
"message" : [
"Problem <A> because of <B>."
],
"subClass" : {
"SUB_ERROR_CLASS" : {
"message" : [
"Do <C> to fix the problem."
]
}
}
}
In this case, this function returns:
"Problem <A> because <B>. Do <C> to fix the problem."
"""
error_classes = error_class.split(".")
len_error_classes = len(error_classes)
assert len_error_classes in (1, 2)

# Generate message template for main error class.
main_error_class = error_classes[0]
if main_error_class in self.error_info_map:
main_error_class_info_map = self.error_info_map[main_error_class]
else:
raise ValueError(f"Cannot find main error class '{main_error_class}'")

main_message_template = "\n".join(main_error_class_info_map["message"])

has_sub_class = len_error_classes == 2

if not has_sub_class:
message_template = main_message_template
else:
# Generate message template for sub error class if exists.
sub_error_class = error_classes[1]
main_error_class_subclass_info_map = main_error_class_info_map["subClass"]
if sub_error_class in main_error_class_subclass_info_map:
sub_error_class_info_map = main_error_class_subclass_info_map[sub_error_class]
else:
raise ValueError(f"Cannot find sub error class '{sub_error_class}'")

sub_message_template = "\n".join(sub_error_class_info_map["message"])
message_template = main_message_template + " " + sub_message_template

return message_template
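
The template fill-in works by translating the `<...>` placeholders into `str.format` fields; a minimal sketch of the round trip:
```python
from pyspark.errors.utils import ErrorClassesReader

reader = ErrorClassesReader()

# The raw template keeps the <func_name> placeholder as-is...
print(reader.get_message_template("COLUMN_IN_LIST"))
# <func_name> does not allow a column in a list.

# ...while get_error_message maps "<" / ">" to "{" / "}" and applies
# str.format with the supplied parameters.
print(reader.get_error_message("COLUMN_IN_LIST", {"func_name": "lit"}))
# lit does not allow a column in a list.
```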