Commit

modified environment requirements to tie to pyparsing 2.4.7 (#168)
* modified environment requirements to tie to pyparsing 2.4.7

* changes to Pipfile

* updated schema parsing

* changes for pyparsing compatibility

* changed readme etc for revised release
ronanstokes-db authored Mar 11, 2023
1 parent ef04c7b commit 93caee4
Showing 13 changed files with 54 additions and 27 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/onrelease.yml
@@ -30,6 +30,9 @@ jobs:
- name: Install
run: pip install pipenv

- name: Install dependencies
run: pipenv install --dev

- name: Build dist
run: pipenv run python setup.py sdist bdist_wheel

3 changes: 3 additions & 0 deletions .github/workflows/push.yml
@@ -37,6 +37,9 @@ jobs:
- name: Install
run: pip install pipenv

- name: Install dependencies
run: pipenv install --dev

- name: Run tests
run: make test

6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -3,20 +3,22 @@
## Change History
All notable changes to the Databricks Labs Data Generator will be documented in this file.

### Unreleased
### Version 0.3.2

#### Changed
* Adjusted column build phase separation (i.e which select statement is used to build columns) so that a
column with a SQL expression can refer to previously created columns without use of a `baseColumn` attribute
* Changed build labelling to comply with PEP440

#### Fixed
* Fixed compatibility of build with older versions of runtime that rely on `pyparsing` version 2.4.7

#### Added
* Parsing of SQL expressions to determine column dependencies

#### Notes
* This does not change actual order of column building - but adjusts which phase columns are built in
* The enhancements to build ordering do not change the actual order of column building -
but adjusts which phase columns are built in


### Version 0.3.1
3 changes: 2 additions & 1 deletion CONTRIBUTING.md
@@ -14,7 +14,8 @@ warrant that you have the legal authority to do so.
# Building the code

## Package Dependencies
See the contents of the file `python/require.txt` to see the Python package dependencies
See the contents of the file `python/require.txt` to see the Python package dependencies.
Dependent packages are not installed automatically by the `dbldatagen` package.

## Python compatibility

19 changes: 11 additions & 8 deletions Pipfile
@@ -6,13 +6,6 @@ verify_ssl = true
[dev-packages]
pytest = "*"
pytest-cov = "*"

numpy = "1.22.0"
pyspark = "3.1.3"
pyarrow = "1.0.1"
pandas = "1.1.3"
pyparsing = ">=2.4.7,<3.0.9"

sphinx = ">=2.0.0,<3.1.0"
nbsphinx = "*"
numpydoc = "0.8"
@@ -21,6 +14,16 @@ ipython = "7.31.1"
pydata-sphinx-theme = "*"
recommonmark = "*"
sphinx-markdown-builder = "*"
bumpversion = "*"

[packages]
numpy = "==1.22.0"
pyspark = "==3.1.3"
pyarrow = "==4.0.1"
wheel = "==0.38.4"
pandas = "==1.2.4"
setuptools = "==65.6.3"
pyparsing = "==2.4.7"

[requires]
python_version = "3.8"
python_version = ">=3.8.10"
17 changes: 16 additions & 1 deletion README.md
@@ -60,7 +60,7 @@ details of use and many examples.

Release notes and details of the latest changes for this specific release
can be found in the Github repository
[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.2a0/CHANGELOG.md)
[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.2/CHANGELOG.md)

# Installation

@@ -126,6 +126,21 @@ examples.

The Github repository also contains further examples in the examples directory

## Spark and Databricks Runtime Compatibility
The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime including
older LTS versions at least from 10.4 LTS and later. It also aims to be compatible with Delta Live Table runtimes
including `current` and `preview`.

While we don't specifically drop support for older runtimes, changes in PySpark APIs or
APIs from dependent packages such as `numpy`, `pandas`, `pyarrow` and `pyparsing` may cause issues with older
runtimes.

Installing `dbldatagen` explicitly does not install releases of dependent packages so as to preserve the curated
set of packages installed in any Databricks runtime environment.

When building on local environments, the `Pipfile` and requirements files are used to determine the versions
tested against for releases and unit tests.

## Project Support
Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
2 changes: 1 addition & 1 deletion dbldatagen/_version.py
@@ -33,7 +33,7 @@ def get_version(version):
return version_info


__version__ = "0.3.2a0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version__ = "0.3.2" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version_info__ = get_version(__version__)


8 changes: 4 additions & 4 deletions dbldatagen/schema_parser.py
@@ -273,18 +273,18 @@ def _cleanseSQL(cls, sql_string):

# skip over quoted identifiers even if they contain quotes
quoted_ident = pp.QuotedString(quoteChar="`", escQuote="``")
quoted_ident.set_parse_action(lambda s, loc, toks: f"`{toks[0]}`")
quoted_ident.setParseAction(lambda s, loc, toks: f"`{toks[0]}`")

stringForm1 = pp.Literal('r') + pp.QuotedString(quoteChar="'")
stringForm2 = pp.Literal('r') + pp.QuotedString(quoteChar='"')
stringForm3 = pp.QuotedString(quoteChar="'", escQuote=r"\'")
stringForm4 = pp.QuotedString(quoteChar='"', escQuote=r'\"')
stringForm = stringForm1 ^ stringForm2 ^ stringForm3 ^ stringForm4
stringForm.set_parse_action(lambda s, loc, toks: "' '")
stringForm.setParseAction(lambda s, loc, toks: "' '")

parser = quoted_ident ^ stringForm

transformed_string = parser.transform_string(sql_string)
transformed_string = parser.transformString(sql_string)

return transformed_string

@@ -312,7 +312,7 @@ def columnsReferencesFromSQLString(cls, sql_string, filter=None):
ident = pp.Word(pp.alphas, pp.alphanums + "_") | pp.QuotedString(quoteChar="`", escQuote="``")
parser = ident

references = parser.search_string(cleansed_sql_string)
references = parser.searchString(cleansed_sql_string)

results = set([item for sublist in references for item in sublist])
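The two hunks above replace the snake_case pyparsing method names with their camelCase equivalents, which are the names available in pyparsing 2.4.7. The following is a minimal, simplified sketch of the same cleanse-then-search flow, not the module's actual code: it handles only backtick identifiers and single-quoted strings, and the variable names `cleanser`, `string_literal` and `sql` are illustrative assumptions. It assumes `pyparsing` is installed and relies on the camelCase names still being accepted by pyparsing 3.x as compatibility synonyms.

```python
import pyparsing as pp

# Backtick-quoted identifiers; a doubled backtick is an escaped backtick.
quoted_ident = pp.QuotedString(quoteChar="`", escQuote="``")
# QuotedString strips the quotes, so the parse action puts them back.
quoted_ident.setParseAction(lambda s, loc, toks: f"`{toks[0]}`")

# Single-quoted string literals are blanked out so that their contents
# are not mistaken for column names in the search step below.
string_literal = pp.QuotedString(quoteChar="'")
string_literal.setParseAction(lambda s, loc, toks: "' '")

cleanser = quoted_ident | string_literal

sql = "concat(`first name`, 'x y')"  # hypothetical input expression
cleansed = cleanser.transformString(sql)

# Collect every bare or backtick-quoted identifier in the cleansed text.
ident = pp.Word(pp.alphas, pp.alphanums + "_") | pp.QuotedString(quoteChar="`", escQuote="``")
references = {tok for match in ident.searchString(cleansed) for tok in match}
```

Note that function names such as `concat` are also collected at this stage, which is presumably why `columnsReferencesFromSQLString` accepts a `filter` argument.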

2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -28,7 +28,7 @@
author = 'Databricks Inc'

# The full version, including alpha/beta/rc tags
release = "0.3.2a0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
release = "0.3.2" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion python/.bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.3.2a0
current_version = 0.3.2
commit = False
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+){0,1}(?P<release>\D*)(?P<build>\d*)
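The `parse` pattern above can be checked against the two version strings touched by this commit. The sketch below uses only Python's `re` module and does not invoke bumpversion itself; the helper name `parse_version` is hypothetical.

```python
import re

# The `parse` pattern copied from python/.bumpversion.cfg.
PARSE = r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+){0,1}(?P<release>\D*)(?P<build>\d*)"


def parse_version(version):
    """Return the named groups of the pattern for a version string, or None."""
    match = re.match(PARSE, version)
    return match.groupdict() if match else None


old = parse_version("0.3.2a0")  # pre-release label before this commit
new = parse_version("0.3.2")    # final release label after this commit
```

For `0.3.2a0` the `release` and `build` groups capture `a` and `0`; for `0.3.2` both are empty strings, so the same pattern handles the pre-release and the final label.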
8 changes: 4 additions & 4 deletions python/dev_require.txt
@@ -1,14 +1,14 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.19.2
numpy==1.22.0
pandas==1.2.4
pickleshare==0.7.5
py4j==0.10.9
pyarrow==4.0.0
pyspark>=3.1.2
pyarrow==4.0.1
pyspark>=3.1.3
python-dateutil==2.8.1
six==1.15.0
pyparsing>=2.4.7, <= 3.0.9
pyparsing==2.4.7

# The following packages are required for development only
wheel==0.36.2
6 changes: 3 additions & 3 deletions python/require.txt
@@ -4,11 +4,11 @@ numpy==1.22.0
pandas==1.2.5
pickleshare==0.7.5
py4j==0.10.9
pyarrow==4.0.0
pyspark>=3.1.2
pyarrow==4.0.1
pyspark>=3.1.3
python-dateutil==2.8.1
six==1.15.0
pyparsing>=2.4.7, <= 3.0.9
pyparsing==2.4.7

# The following packages are required for development only
wheel==0.36.2
2 changes: 1 addition & 1 deletion setup.py
@@ -31,7 +31,7 @@

setuptools.setup(
name="dbldatagen",
version="0.3.2a0",
version="0.3.2",
author="Ronan Stokes, Databricks",
description="Databricks Labs - PySpark Synthetic Data Generator",
long_description=long_description,