Merge development branch #452

Merged on Oct 22, 2024 (42 commits)
42 commits
b941d3b
Bump pre-commit-ci/lite-action from 1.0.2 to 1.0.3
dependabot[bot] Oct 1, 2024
3a69679
Update number of positional args
camposandro Oct 8, 2024
7798376
Run CI on development branch
camposandro Oct 8, 2024
5054ddb
update hipscat target
smcguire-cmu Oct 8, 2024
0e38536
Merge pull request #426 from astronomy-commons/dependabot/github_acti…
smcguire-cmu Oct 8, 2024
e384736
WIP: sketching out from_lists
dougbrn Oct 8, 2024
76e2af4
Override catalog's `__len__` method (#429)
camposandro Oct 8, 2024
2554f1d
change to nest_lists
dougbrn Oct 9, 2024
ef1940d
change to nest_lists
dougbrn Oct 9, 2024
9bac17f
implement at healpix_dataset level
dougbrn Oct 9, 2024
8875ac5
WIP: sketching out from_lists
dougbrn Oct 8, 2024
59c294c
change to nest_lists
dougbrn Oct 9, 2024
ec092bc
change to nest_lists
dougbrn Oct 9, 2024
11548ea
implement at healpix_dataset level
dougbrn Oct 9, 2024
07d65b0
keep current
dougbrn Oct 9, 2024
1a58311
main test written
dougbrn Oct 9, 2024
642c268
wrap reduce healpix_dataset.py
smcguire-cmu Sep 9, 2024
b45d7db
wrap in catalog
smcguire-cmu Sep 9, 2024
fdfdfad
update signature to match
smcguire-cmu Sep 9, 2024
bc32520
unit test
smcguire-cmu Sep 13, 2024
04f82e4
isort
smcguire-cmu Sep 13, 2024
549f729
wip
smcguire-cmu Oct 3, 2024
d842ece
add reduce append_columns
smcguire-cmu Oct 3, 2024
714bee3
add append_columns test
smcguire-cmu Oct 8, 2024
d75758e
add docstring
smcguire-cmu Oct 8, 2024
1d9fc7f
add unit test
smcguire-cmu Oct 8, 2024
783ed13
pr
smcguire-cmu Oct 10, 2024
b613df7
isort
smcguire-cmu Oct 10, 2024
be78673
Merge pull request #414 from astronomy-commons/sean/reduce
smcguire-cmu Oct 10, 2024
68eb665
main test
dougbrn Oct 10, 2024
c5f1713
Merge branch 'development' into from_lists_wrapper
dougbrn Oct 10, 2024
e9872a3
lint fix
dougbrn Oct 10, 2024
9c3a6a1
test compute
dougbrn Oct 11, 2024
37e258a
Explode kwargs for ra and dec in from_dataframe (#437)
camposandro Oct 14, 2024
ab3f57a
Merge pull request #431 from astronomy-commons/from_lists_wrapper
dougbrn Oct 14, 2024
34974b4
use alignment moc in crossmatched/joined catalogs
smcguire-cmu Oct 14, 2024
fef8f8b
remove unused import
smcguire-cmu Oct 14, 2024
c7dd949
update docs hipscat branch
smcguire-cmu Oct 15, 2024
964a207
Merge pull request #438 from astronomy-commons/sean/alignment-moc
smcguire-cmu Oct 15, 2024
a6516d3
Resolve merge conflicts
camposandro Oct 22, 2024
f5c0d51
Fix unit test
camposandro Oct 22, 2024
c4b6e8a
Merge branch 'main' into development
camposandro Oct 22, 2024
2 changes: 1 addition & 1 deletion .github/workflows/pre-commit-ci.yml
@@ -31,5 +31,5 @@ jobs:
           extra_args: --all-files --verbose
         env:
           SKIP: "check-lincc-frameworks-template-version,no-commit-to-branch,check-added-large-files,validate-pyproject,sphinx-build,pytest-check"
-      - uses: pre-commit-ci/lite-action@v1.0.2
+      - uses: pre-commit-ci/lite-action@v1.0.3
         if: failure() && github.event_name == 'pull_request' && github.event.pull_request.draft == false
1 change: 1 addition & 0 deletions src/.pylintrc
@@ -280,6 +280,7 @@ ignored-parents=
# Maximum number of arguments for function / method.
max-args=10

# Maximum number of positional arguments.
max-positional-arguments=15

# Maximum number of attributes for a class (see R0902).
155 changes: 145 additions & 10 deletions src/lsdb/catalog/catalog.py
@@ -92,6 +92,19 @@
         return self._ddf._meta

     def query(self, expr: str) -> Catalog:
+        """Filters catalog and respective margin, if it exists, using a complex query expression
+
+        Args:
+            expr (str): Query expression to evaluate. The column names that are not valid Python
+                variable names should be wrapped in backticks, and any variable values can be
+                injected using f-strings. The use of '@' to reference variables is not supported.
+                More information about pandas query strings is available
+                `here <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html>`__.
+
+        Returns:
+            A catalog that contains the data from the original catalog that complies with the query
+            expression. If a margin exists, it is filtered according to the same query expression.
+        """
         catalog = super().query(expr)
         if self.margin is not None:
             catalog.margin = self.margin.query(expr)
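
The docstring above asks for backticks around column names that are not valid Python variable names, and for f-strings in place of '@' references. A minimal sketch of building such an expression string is shown here. The helper `quote_column`, the column names, and `mag_limit` are illustrative assumptions and are not part of lsdb.

```python
def quote_column(name: str) -> str:
    """Wrap a column name in backticks when it is not a valid Python identifier."""
    # Hypothetical helper for illustration only.
    return name if name.isidentifier() else f"`{name}`"


mag_limit = 20.5

# The value is written into the string with an f-string instead of "@mag_limit",
# because the docstring states that '@' references are not supported.
expr = f"{quote_column('r mag')} < {mag_limit} and {quote_column('ra')} > 10"
print(expr)
```

The resulting string would then be passed as `catalog.query(expr)`, where `catalog` is assumed to be an already loaded catalog.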
@@ -206,8 +219,11 @@
             catalog_name=output_catalog_name,
             ra_column=self.hc_structure.catalog_info.ra_column + suffixes[0],
             dec_column=self.hc_structure.catalog_info.dec_column + suffixes[0],
+            total_rows=0,
         )
+        hc_catalog = hc.catalog.Catalog(
+            new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf), moc=alignment.moc
+        )
-        hc_catalog = hc.catalog.Catalog(new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf))
         return Catalog(ddf, ddf_map, hc_catalog)

     def cone_search(self, ra: float, dec: float, radius_arcsec: float, fine: bool = True) -> Catalog:
@@ -422,9 +438,11 @@
             catalog_name=output_catalog_name,
             ra_column=self.hc_structure.catalog_info.ra_column + suffixes[0],
             dec_column=self.hc_structure.catalog_info.dec_column + suffixes[0],
+            total_rows=0,
         )
+        hc_catalog = hc.catalog.Catalog(
+            new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf), moc=alignment.moc
+        )
-
-        hc_catalog = hc.catalog.Catalog(new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf))
         return Catalog(ddf, ddf_map, hc_catalog)

     def join(
@@ -471,10 +489,10 @@
                 catalog_name=output_catalog_name,
                 ra_column=self.hc_structure.catalog_info.ra_column + suffixes[0],
                 dec_column=self.hc_structure.catalog_info.dec_column + suffixes[0],
+                total_rows=0,
             )
-
             hc_catalog = hc.catalog.Catalog(
-                new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf)
+                new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf), moc=alignment.moc
             )
             return Catalog(ddf, ddf_map, hc_catalog)
         if left_on is None or right_on is None:
@@ -494,9 +512,11 @@
             catalog_name=output_catalog_name,
             ra_column=self.hc_structure.catalog_info.ra_column + suffixes[0],
             dec_column=self.hc_structure.catalog_info.dec_column + suffixes[0],
+            total_rows=0,
         )
+        hc_catalog = hc.catalog.Catalog(
+            new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf), moc=alignment.moc
+        )
-
-        hc_catalog = hc.catalog.Catalog(new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf))
         return Catalog(ddf, ddf_map, hc_catalog)

     def join_nested(
@@ -549,11 +569,68 @@
         if output_catalog_name is None:
             output_catalog_name = self.hc_structure.catalog_info.catalog_name

-        new_catalog_info = self.hc_structure.catalog_info.copy_and_update(catalog_name=output_catalog_name)
-
-        hc_catalog = hc.catalog.Catalog(new_catalog_info, alignment.pixel_tree)
+        new_catalog_info = self.hc_structure.catalog_info.copy_and_update(
+            catalog_name=output_catalog_name, total_rows=0
+        )
+        hc_catalog = hc.catalog.Catalog(
+            new_catalog_info, alignment.pixel_tree, schema=get_arrow_schema(ddf), moc=alignment.moc
+        )
         return Catalog(ddf, ddf_map, hc_catalog)

+    def nest_lists(
+        self,
+        base_columns: list[str] | None,
+        list_columns: list[str] | None = None,
+        name: str = "nested",
+    ) -> Catalog:
+        """Creates a new catalog with a set of list columns packed into a
+        nested column.
+
+        Args:
+            base_columns (list-like or None): Any columns that have non-list values in the input catalog.
+                These will simply be kept as identical columns in the result.
+            list_columns (list-like or None): The list-value columns that should be packed into a nested column.
+                All columns in the list will attempt to be packed into a single
+                nested column with the name provided in `name`. All columns
+                in list_columns must have pyarrow list dtypes, otherwise the
+                operation will fail. If None, it is defined as all columns not in
+                `base_columns`.
+            name (str): The name of the output column the `list_columns` are packed into.
+
+        Returns:
+            A new catalog with specified list columns nested into a new nested column.
+
+        Note:
+            As noted above, all columns in `list_columns` must have a pyarrow
+            ListType dtype. This is needed for proper meta propagation. To convert
+            a list column to this dtype, you can use this command structure:
+            `nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))})`
+            Where pa.int64 above should be replaced with the correct dtype of the
+            underlying data accordingly.
+            Additionally, it's a known issue in Dask
+            (https://github.com/dask/dask/issues/10139) that columns with list
+            values will by default be converted to the string type. This will
+            interfere with the ability to recast these to pyarrow lists. We
+            recommend setting the following dask config setting to prevent this:
+            `dask.config.set({"dataframe.convert-string":False})`
+        """
+        new_ddf = super().nest_lists(
+            base_columns=base_columns,
+            list_columns=list_columns,
+            name=name,
+        )
+
+        catalog = Catalog(new_ddf._ddf, self._ddf_pixel_map, self.hc_structure)
+
+        if self.margin is not None:
+            catalog.margin = self.margin.nest_lists(
+                base_columns=base_columns,
+                list_columns=list_columns,
+                name=name,
+            )
+
+        return catalog
+
     def dropna(
         self,
         *,
Codecov / codecov/patch warning: added line 626 of src/lsdb/catalog/catalog.py was not covered by tests.
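
To illustrate what "packing list columns into a nested column" means in the `nest_lists` docstring, here is a plain-Python sketch that works on a list of dict rows. It is a conceptual illustration only, not how lsdb or its underlying nested dataframe library implements the operation. The function `nest_lists_rows`, the sample data, and the requirement that list columns have equal length within a row are assumptions of this sketch.

```python
def nest_lists_rows(rows, base_columns, list_columns=None, name="nested"):
    """Pack list-valued columns of each row into a single nested column (toy version)."""
    result = []
    for row in rows:
        # Mirror the documented default: when list_columns is None,
        # every column not listed in base_columns is treated as a list column.
        cols = list_columns if list_columns is not None else [c for c in row if c not in base_columns]
        lengths = {len(row[c]) for c in cols}
        if len(lengths) > 1:
            # Assumption of this sketch: lists in one row must line up element by element.
            raise ValueError("list columns must have equal lengths within a row")
        n = lengths.pop() if lengths else 0
        # One record per list position, holding the i-th element of every list column.
        nested = [{c: row[c][i] for c in cols} for i in range(n)]
        new_row = {c: row[c] for c in base_columns}
        new_row[name] = nested
        result.append(new_row)
    return result


rows = [
    {"id": 1, "ra": 10.0, "time": [1, 2, 3], "flux": [0.5, 0.7, 0.6]},
    {"id": 2, "ra": 20.0, "time": [4], "flux": [1.2]},
]
nested_rows = nest_lists_rows(rows, base_columns=["id", "ra"], name="lightcurve")
print(nested_rows[1])
```

Base columns are carried over unchanged, while `time` and `flux` end up as fields of the per-row `lightcurve` records.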
@@ -564,6 +641,58 @@
         subset: IndexLabel | None = None,
         ignore_index: bool = False,
     ) -> Catalog:
+        """Remove missing values for one layer of nested columns in the catalog.
+
+        Parameters
+        ----------
+        axis : {0 or 'index', 1 or 'columns'}, default 0
+            Determine if rows or columns which contain missing values are
+            removed.
+
+            * 0, or 'index' : Drop rows which contain missing values.
+            * 1, or 'columns' : Drop columns which contain missing values.
+
+            Only a single axis is allowed.
+
+        how : {'any', 'all'}, default 'any'
+            Determine if row or column is removed from catalog, when we have
+            at least one NA or all NA.
+
+            * 'any' : If any NA values are present, drop that row or column.
+            * 'all' : If all values are NA, drop that row or column.
+        thresh : int, optional
+            Require that many non-NA values. Cannot be combined with how.
+        on_nested : str or bool, optional
+            If not False, applies the call to the nested dataframe in the
+            column with label equal to the provided string. If specified,
+            the nested dataframe should align with any columns given in
+            `subset`.
+        subset : column label or sequence of labels, optional
+            Labels along other axis to consider, e.g. if you are dropping rows
+            these would be a list of columns to include.
+
+            Access nested columns using `nested_df.nested_col` (where
+            `nested_df` refers to a particular nested dataframe and
+            `nested_col` is a column of that nested dataframe).
+        ignore_index : bool, default ``False``
+            If ``True``, the resulting axis will be labeled 0, 1, …, n - 1.
+
+            .. versionadded:: 2.0.0
+
+        Returns
+        -------
+        Catalog
+            Catalog with NA entries dropped from it.
+
+        Notes
+        -----
+        Operations that target a particular nested structure return a dataframe
+        with rows of that particular nested structure affected.
+
+        Values for `on_nested` and `subset` should be consistent in pointing
+        to a single layer, multi-layer operations are not supported at this
+        time.
+        """
         catalog = super().dropna(
             axis=axis, how=how, thresh=thresh, on_nested=on_nested, subset=subset, ignore_index=ignore_index
         )
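
The `how`, `thresh` and `subset` rules described in the `dropna` docstring can be illustrated for the row-dropping case with a small plain-Python sketch. This is not the library's implementation: `is_na`, `keep_row` and `kept_indices` are hypothetical helpers, the nested (`on_nested`) case is left out, and for simplicity `thresh` simply takes precedence over `how` here, whereas the docstring says the two cannot be combined.

```python
import math


def is_na(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_row(row: dict, how: str = "any", thresh=None, subset=None) -> bool:
    """Decide whether a row survives, following the documented rules (toy version)."""
    cols = subset if subset is not None else list(row)
    non_na = sum(not is_na(row[c]) for c in cols)
    if thresh is not None:
        # "Require that many non-NA values."
        return non_na >= thresh
    if how == "any":
        # Drop the row if any value is NA, so keep it only when none are.
        return non_na == len(cols)
    if how == "all":
        # Drop the row only if all values are NA.
        return non_na > 0
    raise ValueError(f"unknown how: {how!r}")


def kept_indices(rows, **kwargs):
    return [i for i, row in enumerate(rows) if keep_row(row, **kwargs)]


rows = [
    {"ra": 1.0, "flux": 2.0},
    {"ra": float("nan"), "flux": 3.0},
    {"ra": None, "flux": None},
]
print(kept_indices(rows, how="any"))
print(kept_indices(rows, how="all"))
print(kept_indices(rows, subset=["flux"]))
```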
@@ -577,3 +706,9 @@
                 ignore_index=ignore_index,
             )
         return catalog
+
+    def reduce(self, func, *args, meta=None, **kwargs) -> Catalog:
+        catalog = super().reduce(func, *args, meta=meta, **kwargs)
+        if self.margin is not None:
+            catalog.margin = self.margin.reduce(func, *args, meta=meta, **kwargs)
+        return catalog
Codecov / codecov/patch warning: added line 713 of src/lsdb/catalog/catalog.py was not covered by tests.
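
The `query`, `nest_lists`, `dropna` and `reduce` wrappers in this diff share one pattern: apply the operation to the main catalog, then apply the same operation to the margin when one exists. The sketch below shows that pattern with a toy class. `ToyCatalog`, its row-wise `reduce`, and `mean_flux` are hypothetical stand-ins for illustration and do not reflect the real lsdb classes or the real `reduce` semantics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCatalog:
    """Toy stand-in for a catalog with an optional margin catalog."""

    rows: list
    margin: Optional["ToyCatalog"] = None

    def reduce(self, func, *args, **kwargs) -> "ToyCatalog":
        # Apply the function to every row of the main catalog.
        result = ToyCatalog([func(row, *args, **kwargs) for row in self.rows])
        # Propagate the same call, with the same arguments, to the margin if present.
        if self.margin is not None:
            result.margin = self.margin.reduce(func, *args, **kwargs)
        return result


def mean_flux(row: dict) -> dict:
    return {"id": row["id"], "mean_flux": sum(row["flux"]) / len(row["flux"])}


main = ToyCatalog(
    rows=[{"id": 1, "flux": [1.0, 3.0]}],
    margin=ToyCatalog(rows=[{"id": 2, "flux": [2.0, 4.0]}]),
)
reduced = main.reduce(mean_flux)
print(reduced.rows, reduced.margin.rows)
```

Keeping the margin in step with the main catalog means later operations that rely on the margin see data that has been through the same transformation.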