Attempt to Improve core concepts / Formatting Sections #576

Open
wants to merge 3 commits into
base: trunk
89 changes: 34 additions & 55 deletions docs/dataset/core-concepts/dataset.md
@@ -6,72 +6,51 @@ search:

# :kolena-dataset-20: Dataset

A **dataset** is a structured assembly of datapoints, designed for model evaluation.
Each datapoint in a dataset is a comprehensive unit that combines data
traditionally segmented into test samples, ground truth, and metadata.
This structure is immutable, meaning once a datapoint is added,
it cannot be altered without creating a new version of the dataset.
This immutability ensures the integrity and traceability of the data used in testing models.
A **dataset** is a version-controlled collection of datapoints.
Datapoints within a given version are immutable.
This immutability ensures the integrity and traceability of the data used in model evaluation.

## Datapoints

Datapoints are integral components within the dataset structure used for evaluating models.
They are versatile and immutable objects that encompass the role traditionally played by
test samples, ground truth, and metadata. Key characteristics of datapoints include:
Datapoints can be thought of as the "units" of model evaluation.
Datapoints may represent a variety of media including images, video, documents, text, or even 3D point clouds.
Datapoints can have properties, known as "fields", which may hold primitive values like strings or numbers,
nested objects, or media assets and annotations like [`bounding boxes`][kolena.annotation.BoundingBox].

- **Unified Object Structure**:
Datapoints replace the need for separate entities like test samples and ground truth.
They are singular, grab-bag objects that can embody various types of data,
including images, as indicated by the presence of a data_type field.
Key characteristics of datapoints include:

- **Immunity to Change**: Once a datapoint is added to a dataset, it cannot be altered.
Any update to a datapoint results in the creation of a new datapoint, and this action consequently versions the dataset.
- **Flexible Data Structure**:
Datapoints are generic data containers, allowing for customization.
Datapoints can represent whatever "unit" of testing is relevant for your problem.
In computer vision, a datapoint may represent an image with associated bounding box ground truths.
For language models, a datapoint may represent text prompts.

- **Exclusive Association with Datasets**:
Datapoints are exclusive to the dataset they belong to and are not shared across different datasets.
This exclusivity ensures clear demarcation and management of data within specific datasets.

- **Role in Data Ingestion**: Datapoints play a central role in the data ingestion process.
They are represented in a DataFrame structure with special treatment for certain columns like `locator` and `text`.

- **Extension of Data Classes**: Datapoints extend data classes, allowing for flexibility and customization.
For instance, they can include annotation objects like [`BoundingBox`][kolena.annotation.BoundingBox],
and these objects can be further extended as needed.

In essence, datapoints in this context are versatile,
immutable data units that are exclusively associated with a specific dataset,
playing a crucial role in model evaluation by
encapsulating various types of data and annotations within a unified object structure.
- **Immutability**: Datapoints within a given dataset version cannot be altered.
Any update to a datapoint results in the creation of a new dataset version.

### How to generate datapoints?
- **Traceability**: Updates to datapoints between versions are recorded for auditing purposes.

Structure your dataset as a *CSV file*. Each row in the file should represent a distinct datapoint.

- **Reserved Columns**: for models that process images, audio, or video,
include a `locator` column with valid URL paths to the respective files.
For text-based models, include a `text` column that contains the input text data directly in the CSV.

- **Additional Fields**: Include relevant metadata depending on the data type.
For instance, image datasets might have metadata like `image_width`, `image_height`, etc.
Similarly, other data types can have their respective metadata fields that are useful for model processing.
- **Isolation Between Datasets**:
Datapoints are exclusive to the dataset they belong to and are not shared across different datasets.
This exclusivity ensures clear demarcation and management of data.

- **Data Consistency and Format**: It's crucial to maintain data consistency.
URLs should be correctly encoded, text should be properly formatted,
and numerical values should adhere to their respective formats.
### How Are Datasets Represented?

- **Data Accessibility**: Ensure the data, especially if linked through URLs, is accessible for processing.
In the case of cloud storage, appropriate permissions should be in place to allow access.
Datasets are represented using tables, with each row representing a distinct datapoint.
One column, or a combination of columns, serves as the primary key of this table and uniquely
identifies each datapoint. These are known as the **ID Field(s)**.

### ID Fields
Likewise, Model Results for a dataset are also represented using tables.
Model Results must contain the same ID Fields as your dataset so that
Kolena can associate model results with the appropriate datapoints.
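
The relationship between ID fields and model results can be sketched in plain Python.
This is a minimal illustration, not Kolena's implementation: the `id` field name, the sample rows,
and the helpers `key_of`, `check_primary_key`, and `join_results` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def key_of(row: Row, id_fields: Sequence[str]) -> Tuple[object, ...]:
    """Return the tuple of ID field values that identifies a row."""
    return tuple(row[field] for field in id_fields)


def check_primary_key(rows: List[Row], id_fields: Sequence[str]) -> bool:
    """True when every row has a distinct combination of ID field values."""
    keys = [key_of(row, id_fields) for row in rows]
    return len(keys) == len(set(keys))


def join_results(datapoints: List[Row], results: List[Row], id_fields: Sequence[str]) -> List[Row]:
    """Attach each result row to the datapoint with the same ID field values."""
    by_key = {key_of(dp, id_fields): dp for dp in datapoints}
    # A result whose ID values match no datapoint raises KeyError here.
    return [{**by_key[key_of(r, id_fields)], **r} for r in results]


# Hypothetical data: "id" is an assumed ID field name.
datapoints = [
    {"id": 1, "text": "first prompt"},
    {"id": 2, "text": "second prompt"},
]
results = [
    {"id": 2, "score": 0.4},
    {"id": 1, "score": 0.9},
]

assert check_primary_key(datapoints, ["id"])
joined = join_results(datapoints, results, ["id"])
```

Because the match is made on the ID values alone, the result rows do not need to be in the same order as the datapoints.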

When you upload your dataset, you will need to specify one or more ID fields. These fields should form a primary
key for the dataset. In other words, each datapoint should have a distinct combination of values in the specified ID
field(s).
??? note "Selecting ID Fields"
    We recommend that your ID fields be convenient to generate and pass around with your model results.
    Typically, this means selecting a single short string or integer field as the ID.

Kolena uses the ID field(s) in your dataset to associate your model results with the appropriate datapoints, and
you will need to include the ID field(s) as fields in any model results you upload. For this reason, we recommend
that your ID fields be convenient to generate and pass around with your model results. Typically, this means selecting
a single short string or integer field as the ID.
For more information on how to specify your ID fields, see the relevant documentation on
[formatting your datasets](../formatting-your-datasets.md#what-defines-a-datapoint).

For more information on how to specify your ID fields, see the relevant documentation on
[formatting your datasets](../formatting-your-datasets.md#what-defines-a-datapoint).
Datasets can be created using CSV or Parquet files, or more generally from Pandas DataFrames.
See our detailed documentation to learn more about how to [format your datasets](../formatting-your-datasets.md).
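
As a rough sketch of the CSV route, the snippet below writes a small table with one row per datapoint and reads it back.
It uses only Python's standard library. The `id` and `label` column names and the `s3://example-bucket/...` paths
are placeholders chosen for illustration, not values taken from Kolena's documentation.

```python
import csv
import io

# Hypothetical datapoints: one dict per row, with "id" as the assumed ID field.
rows = [
    {"id": "a1", "locator": "s3://example-bucket/images/0001.png", "label": "horse"},
    {"id": "a2", "locator": "s3://example-bucket/images/0002.png", "label": "ship"},
]

# Write the rows as CSV text; each row becomes one datapoint.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "locator", "label"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the CSV back to confirm that the table round-trips unchanged.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
```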
6 changes: 3 additions & 3 deletions docs/dataset/core-concepts/index.md
@@ -14,9 +14,9 @@ testing. For a quick overview, refer to the [Quickstart Guide](../quickstart.md)

---

A Dataset represents a structured assembly of datapoints, essential for model evaluation. Each datapoint
within a dataset combines various data types, such as images or text, traditionally segmented into test samples and
metadata.
A dataset is a version-controlled collection of datapoints.
Datapoints represent customizable "units" of testing relevant to your problem,
such as images or text with associated ground truths.

</div>

38 changes: 17 additions & 21 deletions docs/dataset/formatting-your-datasets.md
@@ -6,9 +6,9 @@ icon: kolena/studio-16

## What is a Dataset

A [dataset](../dataset/core-concepts/dataset.md) is a structured assembly of datapoints, designed for model evaluation.
Each datapoint in a dataset is a comprehensive unit that combines data traditionally segmented into test samples,
ground truth, and metadata.
A [dataset](../dataset/core-concepts/dataset.md) is a version-controlled collection of datapoints.
Datapoints represent customizable "units" of testing relevant to your problem,
such as images or text with associated ground truths.

### What defines a Datapoint

@@ -20,13 +20,13 @@ dataset with the following columns:
|---------------------------------------------------------------|--------------|----------|-----|
| `s3://kolena-public-examples/cifar10/data/horse0000.png` | horse | 153.994 | 84.126 |

From this you can see that image `horse0000.png` has the ground_truth classification of `horse`,
and has brightness and contrast data.
This datapoint is the image `horse0000.png` with a classification of `horse`
and brightness and contrast data.

When uploading a dataset to Kolena, it is important to be able to differentiate between each datapoint. This is
accomplished by configuring an `id_field` - a unique identifier for a datapoint. You can select any field that is
unique across your data, or generate one if no unique identifiers exist for your dataset. Below are some common patterns
for generating/selecting a unique identifier if your data does not have a natural ID field:
Each datapoint is uniquely identified by a column or set of columns known as `id_field`(s).
`id_field`s also allow model results to be associated with a datapoint. You can select any field that is
unique across your data, or generate one if no unique identifiers exist for your dataset. Some common patterns
for generating or selecting an `id_field` include:

- If your datapoints contain a `locator` field pointing to the external files representing your model inputs, the `locator`
field is usually used as the ID field.
@@ -35,11 +35,13 @@ for generating/selecting a unique identifier if your data does not have a natura
as the ID field.
- For other kinds of datapoints, we recommend generating and saving a UUID for each datapoint to use as the ID field.
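
The UUID pattern from the last bullet might look like the sketch below.
The helper `add_uuid_ids` and the `id` field name are hypothetical.
The generated IDs should be saved with the data so that the same values can be attached to model results later.

```python
import uuid
from typing import Dict, List


def add_uuid_ids(rows: List[Dict[str, object]], id_field: str = "id") -> List[Dict[str, object]]:
    """Return copies of the rows with a random UUID string stored under id_field."""
    return [{**row, id_field: str(uuid.uuid4())} for row in rows]


# Hypothetical datapoints with no natural identifier.
rows = [{"value": 1.5}, {"value": 2.5}]
with_ids = add_uuid_ids(rows)
```

`uuid.uuid4()` produces random identifiers, so rerunning the script yields different IDs.
This is why the IDs need to be persisted rather than regenerated on each run.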

Kolena will attempt to infer common `id_field`s (eg. `locator`, `text`) based on what is present in the dataset during import.
This can be overridden by explicitly declaring id fields when importing via the Web App from the [:kolena-dataset-16: Datasets](https://app.kolena.com/redirect/datasets)
Kolena will attempt to infer common `id_field`s (e.g. `locator`, `text`) based on what is present in the dataset during import.
This can be overridden by explicitly declaring `id_field`s when importing via the Web App from the [:kolena-dataset-16: Datasets](https://app.kolena.com/redirect/datasets)
page, or the SDK by using the [`upload_dataset`](../reference/dataset/index.md#kolena.dataset.dataset.upload_dataset)
function.

### Special Fields

Kolena will look for the following fields when displaying datapoints:

| Field Name | Description |
@@ -66,15 +68,9 @@ fields like `word_count` may be useful for text datasets.

## How are Datasets viewed

Kolena allows you to visualize your datasets by use of the Studio. The studio experience depends on the type of data
relevant to your problem.

The first experience is the Gallery view which allows you to view your data in a grid. This is useful as you can see
chunks of your data (images, video, audio, text) and view results without having to view each datapoint individually.

The second experience is the Tabular view, used when your data is a set of columns and values.
An example of this is the [:kolena-widget-16: Rain Forcast ↗](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/rain_forecast)
dataset.
Kolena allows you to visualize your datasets using the Studio.
Datapoints represented as images, video, audio, point clouds, or text can be viewed within a gallery layout,
while all other datapoints may be viewed within a tabular layout.

In order to use the Gallery view you will need to have the `locator` or `text` fields specified in the dataset.
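
As a simplified reading of this rule, a pre-upload check could look like the following.
The helper `supports_gallery` is hypothetical. It only tests for the presence of the two field names
mentioned above and does not reproduce Kolena's actual logic.

```python
from typing import Iterable


def supports_gallery(columns: Iterable[str]) -> bool:
    """True when the dataset has a `locator` or `text` column."""
    names = set(columns)
    return "locator" in names or "text" in names


# Hypothetical column sets for two datasets.
image_columns = ["locator", "ground_truth", "brightness", "contrast"]
tabular_columns = ["date", "temperature", "humidity"]
```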

@@ -242,7 +238,7 @@ If you wanted to add a thumbnail to the classification data shown above it would
|---------------------------------------------------------------|--------------------------------------------------------------------|--------------|----------|-----|
| `s3://kolena-public-examples/cifar10/data/horse0000.png` | `s3://kolena-public-examples/cifar10/data/thumbnail/horse0000.png` | horse | 153.994 | 84.126 |

## Formatting Results
## Formatting Model Results

### Formatting results for Object Detection

2 changes: 1 addition & 1 deletion docs/dataset/index.md
@@ -31,7 +31,7 @@ standardization and comparison of multiple models.

---

A Dataset represents a structured assembly of datapoints, essential for model evaluation.
A Dataset represents a version-controlled collection of datapoints, essential for model evaluation.

- [:kolena-quality-standard-16: Quality Standard](./core-concepts/quality-standard.md)
