Add IN Carta test data from Colas Lab #141

Merged on Jan 11, 2024 (16 commits)
3 changes: 3 additions & 0 deletions .gitignore
@@ -135,3 +135,6 @@ dmypy.json

# parsl ignores
runinfo

# test data ignores
tests/data/in-carta/colas-lab/data
13 changes: 13 additions & 0 deletions CITATION.cff
@@ -178,3 +178,16 @@ references:
    notes: >-
      MapReduce techniques are used via Parsl apps and workflow configuration
      to help achieve scalable data engineering for CytoTable.
  - authors:
      - name: "Colas Lab"
    date-accessed: "2024-01-09"
    title: Colas Lab Example IN Carta Dataset
    type: data
    notes: >-
      Colas Lab provided access to a dataset created with IN Carta for
      use within CytoTable tests for furthering development efforts.
      A modified testing dataset appears within this project
      under `tests/data/in-carta/colas-lab`.
      See:
      - https://sbpdiscovery.org/our-scientists/alexandre-colas-phd
      - https://www.moleculardevices.com/products/cellular-imaging-systems/acquisition-and-analysis-software/in-carta-image-analysis-software
60 changes: 60 additions & 0 deletions tests/data/in-carta/colas-lab/shrink_colas_lab_data_for_tests.py
@@ -0,0 +1,60 @@
"""
Shrink IN Carta datasets from the Colas Lab, provided as a collection of CSVs.

Note: built to be run from the CytoTable poetry dev environment, from the project base, e.g.:
`poetry run python tests/data/in-carta/colas-lab/shrink_colas_lab_data_for_tests.py`
"""

import pathlib

import duckdb
from pyarrow import csv

# set a path for local and target data dir
SOURCE_DATA_DIR = "tests/data/in-carta/colas-lab/data"
TARGET_DATA_DIR = "tests/data/in-carta/colas-lab"

# build a collection of schemas
schema_collection = []
for data_file in pathlib.Path(SOURCE_DATA_DIR).rglob("*.csv"):
    with duckdb.connect() as ddb:
        # read the csv file as a pyarrow table and extract detected schema
        schema_collection.append(
            {
                "file": data_file,
                "schema": ddb.execute(
                    f"""
                    SELECT *
                    FROM read_csv_auto('{data_file}')
                    """
                )
                .arrow()
                .schema,
            }
        )

# determine if the schemas are exactly alike
for schema in schema_collection:
    for schema_to_compare in schema_collection:
        # compare every schema to all others
        if schema["file"] != schema_to_compare["file"]:
            if not schema["schema"].equals(schema_to_compare["schema"]):
                raise TypeError("Unequal schemas detected.")


for data_file in pathlib.Path(SOURCE_DATA_DIR).rglob("*.csv"):
    with duckdb.connect() as ddb:
        # read the csv file as a pyarrow table and output to a new csv
        csv.write_csv(
            data=ddb.execute(
                f"""
                SELECT *
                FROM read_csv_auto('{data_file}') as data_file
                /* select only the first three objects to limit the dataset */
                WHERE data_file."OBJECT ID" in (1,2,3)
                /* select rows C and D to limit the dataset */
                AND data_file."ROW" in ('C', 'D')
                """
            ).arrow(),
            output_file=f"{TARGET_DATA_DIR}/test-{pathlib.Path(data_file).name}",
        )
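
Note for readers of this diff: the two steps in the script (check that every file has the same layout, then keep only a small subset of rows) can be sketched without DuckDB or pyarrow. The block below is a minimal illustration using only Python's standard `csv` module, not the script's actual method. The function names `headers_match` and `shrink_rows` and the sample data are hypothetical. The header check compares column names only, which is weaker than the pyarrow schema equality used in the script because column types are not checked. Object ids are compared as strings here, whereas the SQL filter compares integers.

```python
import csv
import io

# Column names taken from the SQL filter in the script above.
OBJECT_ID_COL = "OBJECT ID"
ROW_COL = "ROW"


def headers_match(csv_texts):
    """Return True when every CSV text shares the first one's header row.

    Hypothetical helper: checks column names only, not column types.
    """
    headers = [next(csv.reader(io.StringIO(text)), None) for text in csv_texts]
    if not headers:
        return True
    # equality is transitive, so comparing each header to the first suffices
    return all(header == headers[0] for header in headers[1:])


def shrink_rows(csv_text, object_ids=("1", "2", "3"), rows=("C", "D")):
    """Keep only records whose object id and plate row are in the allowed sets.

    Hypothetical helper mirroring the WHERE clause of the script's query.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for record in reader:
        if record[OBJECT_ID_COL] in object_ids and record[ROW_COL] in rows:
            writer.writerow(record)
    return output.getvalue()


# Made-up sample data for illustration only (not from the Colas Lab dataset).
SAMPLE = "OBJECT ID,ROW,VALUE\n1,C,0.5\n2,A,0.7\n4,D,0.9\n3,D,1.1\n"
SHRUNK = shrink_rows(SAMPLE)
```

Comparing each header to the first one keeps the check linear in the number of files, whereas the script compares every pair. Both approaches reject the same inputs, since equality is transitive.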

Large diffs are not rendered by default.
