kolenaIO · dylangrandmont · Apr 19, 2024
@@ -6,72 +6,51 @@ search:
 
 # :kolena-dataset-20: Dataset
 
-A **dataset** is a structured assembly of datapoints, designed for model evaluation.
-Each datapoint in a dataset is a comprehensive unit that combines data
-traditionally segmented into test samples, ground truth, and metadata.
-This structure is immutable, meaning once a datapoint is added,
-it cannot be altered without creating a new version of the dataset.
-This immutability ensures the integrity and traceability of the data used in testing models.
+A **dataset** is a version-controlled collection of datapoints.
+Datapoints within a given version are immutable.
+This immutability ensures the integrity and traceability of the data used in model evaluation.
 
 ## Datapoints
 
-Datapoints are integral components within the dataset structure used for evaluating models.
-They are versatile and immutable objects that encompass the role traditionally played by
-test samples, ground truth, and metadata. Key characteristics of datapoints include:
+Datapoints can be thought of as the "units" of model evaluation.
+Datapoints may represent a variety of media including images, video, documents, text, or even 3D point clouds.
+Datapoints can have properties, known as "fields", which may represent primitive values like strings or numbers,
+nested objects, or media assets and annotations like [`bounding boxes`][kolena.annotation.BoundingBox].
 
-- **Unified Object Structure**:
-  Datapoints replace the need for separate entities like test samples and ground truth.
-  They are singular, grab-bag objects that can embody various types of data,
-  including images, as indicated by the presence of a data_type field.
+Key characteristics of datapoints include:
 
-- **Immunity to Change**: Once a datapoint is added to a dataset, it cannot be altered.
-  Any update to a datapoint results in the creation of a new datapoint, and this action consequently versions the dataset.
+- **Flexible Data Structure**:
+  Datapoints are generic data containers, allowing for customization.
+  Datapoints can represent whatever "unit" of testing is relevant for your problem.
+  In computer vision, a datapoint may represent an image with associated bounding box ground truths.
+  For language models, a datapoint may represent text prompts.
 
-- **Exclusive Association with Datasets**:
-  Datapoints are exclusive to the dataset they belong to and are not shared across different datasets.
-  This exclusivity ensures clear demarcation and management of data within specific datasets.
-
-- **Role in Data Ingestion**: Datapoints play a central role in the data ingestion process.
-  They are represented in a DataFrame structure with special treatment for certain columns like `locator` and `text`.
-
-- **Extension of Data Classes**: Datapoints extend data classes, allowing for flexibility and customization.
-  For instance, they can include annotation objects like [`BoundingBox`][kolena.annotation.BoundingBox],
-  and these objects can be further extended as needed.
-
-In essence, datapoints in this context are versatile,
-immutable data units that are exclusively associated with a specific dataset,
-playing a crucial role in model evaluation by
-encapsulating various types of data and annotations within a unified object structure.
+- **Immutability**: Datapoints within a given dataset version cannot be altered.
+  Any update to a datapoint results in the creation of a new dataset version.
 
-### How to generate datapoints?
+- **Traceability**: Updates to datapoints between versions are recorded for auditing purposes.
 
-Structure your dataset as a *CSV file*. Each row in the file should represent a distinct datapoint.
-
-- **Reserved Columns**: for models that process images, audio, or video,
-  include a `locator` column with valid URL paths to the respective files.
-  For text-based models, include a `text` column that contains the input text data directly in the CSV.
-
-- **Additional Fields**: Include relevant metadata depending on the data type.
-  For instance, image datasets might have metadata like `image_width`, `image_height`, etc.
-  Similarly, other data types can have their respective metadata fields that are useful for model processing.
+- **Isolation Between Datasets**:
+  Datapoints are exclusive to the dataset they belong to and are not shared across different datasets.
+  This exclusivity ensures clear demarcation and management of data.
 
-- **Data Consistency and Format**: It's crucial to maintain data consistency.
-  URLs should be correctly encoded, text should be properly formatted,
-  and numerical values should adhere to their respective formats.
+### How Are Datasets Represented?
 
-- **Data Accessibility**: Ensure the data, especially if linked through URLs, is accessible for processing.
-  In the case of cloud storage, appropriate permissions should be in place to allow access.
+Datasets are represented using tables, with each row representing a distinct datapoint.
+One column, or a combination of columns, serves as the primary key to this table and uniquely
+identify each datapoint. These are known as the **ID Field(s)**.
 
-### ID Fields
+Likewise, Model Results for a dataset are also represented using tables.
+Model Results must contain the same ID Fields as your dataset so that
+Kolena can associate model results with the appropriate datapoints.
 
-When you upload your dataset, you will need to specify one or more ID fields. These fields should form a primary
-key for the dataset. In other words, each datapoint should have a distinct combination of values in the specified ID
-field(s).
+??? note "Selecting ID Fields"
+    We recommend
+    that your ID fields be convenient to generate and pass around with your model results. Typically, this means selecting
+    a single short string or integer field as the ID.
 
-Kolena uses the ID field(s) in your dataset to associate your model results with the appropriate datapoints, and
-you will need to include the ID field(s) as fields in any model results you upload. For this reason, we recommend
-that your ID fields be convenient to generate and pass around with your model results. Typically, this means selecting
-a single short string or integer field as the ID.
+    For more information on how to specify your ID fields, see the relevant documentation on
+    [formatting your datasets](../formatting-your-datasets.md#what-defines-a-datapoint).
 
-For more information on how to specify your ID fields, see the relevant documentation on
-[formatting your datasets](../formatting-your-datasets.md#what-defines-a-datapoint).
+Datasets can be created using CSV or Parquet files, or more generally from Pandas DataFrames.
+See our detailed documentation to learn more about how to [format your datasets](../formatting-your-datasets.md).
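
As a rough illustration of the table representation described in this hunk, the sketch below builds a dataset table and a model results table as Pandas DataFrames and joins them on a shared ID field. The column names (`locator`, `ground_truth`, `prediction`), the bucket path, and the join itself are assumptions made for illustration; this is not how Kolena performs the association internally.

```python
import pandas as pd

# Hypothetical dataset table: one row per datapoint, with "locator" as the ID field.
dataset = pd.DataFrame(
    {
        "locator": ["s3://example-bucket/img0.png", "s3://example-bucket/img1.png"],
        "ground_truth": ["horse", "dog"],
    }
)

# Hypothetical model results table: it carries the same ID field as the dataset.
results = pd.DataFrame(
    {
        "locator": ["s3://example-bucket/img1.png", "s3://example-bucket/img0.png"],
        "prediction": ["dog", "cat"],
    }
)

id_fields = ["locator"]

# The ID field(s) must act as a primary key: no two datapoints may share a value.
assert not dataset.duplicated(subset=id_fields).any()

# Associate each result with its datapoint; validate= raises if keys are not unique.
joined = dataset.merge(results, on=id_fields, how="left", validate="one_to_one")
joined["correct"] = joined["ground_truth"] == joined["prediction"]
print(joined)
```

Because the results are matched by ID rather than by row position, the two tables do not need to be in the same order.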
@@ -14,9 +14,9 @@ testing. For a quick overview, refer to the [Quickstart Guide](../quickstart.md)
 
     ---
 
-    A Dataset represents a structured assembly of datapoints, essential for model evaluation. Each datapoint
-    within a dataset combines various data types, such as images or text, traditionally segmented into test samples and
-    metadata.
+    A dataset is a version-controlled collection of datapoints.
+    Datapoints represent customizable "units" of testing relevant to your problem,
+    such as images or text with associated ground truths.
 
 </div>
 

@@ -6,9 +6,9 @@ icon: kolena/studio-16
 
 ## What is a Dataset
 
-A [dataset](../dataset/core-concepts/dataset.md) is a structured assembly of datapoints, designed for model evaluation.
-Each datapoint in a dataset is a comprehensive unit that combines data traditionally segmented into test samples,
-ground truth, and metadata.
+A [dataset](../dataset/core-concepts/dataset.md) is a version-controlled collection of datapoints.
+Datapoints represent customizable "units" of testing relevant to your problem,
+such as images or text with associated ground truths.
 
 ### What defines a Datapoint
 
@@ -20,13 +20,13 @@ dataset with the following columns:
 |---------------------------------------------------------------|--------------|----------|-----|
 | `s3://kolena-public-examples/cifar10/data/horse0000.png`        | horse        |     153.994     |    84.126  |
 
-From this you can see that image `horse0000.png` has the ground_truth classification of `horse`,
-and has brightness and contrast data.
+This datapoint is the image `horse0000.png` with a classification of `horse`
+and brightness and contrast data.
 
-When uploading a dataset to Kolena, it is important to be able to differentiate between each datapoint. This is
-accomplished by configuring an `id_field` - a unique identifier for a datapoint. You can select any field that is
-unique across your data, or generate one if no unique identifiers exist for your dataset. Below are some common patterns
-for generating/selecting a unique identifier if your data does not have a natural ID field:
+Each datapoint is uniquely identified by a column or set of columns known as `id_field`(s).
+`id_field`s also allow model results to be associated with a datapoint. You can select any field that is
+unique across your data, or generate one if no unique identifiers exist for your dataset. Some common patterns
+for generating or selecting an `id_field` include:
 
 - If your datapoints contain a `locator` field pointing to the external files representing your model inputs, the `locator`
   field is usually used as the ID field.
@@ -35,11 +35,13 @@ for generating/selecting a unique identifier if your data does not have a natura
   as the ID field.
 - For other kinds of datapoints, we recommend generating and saving a UUID for each datapoint to use as the ID field.
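
The UUID recommendation in the last bullet can be sketched with the Python standard library alone. The records and the field name `id` are hypothetical; any field name can be used as long as it is declared as the ID field at import time.

```python
import uuid

# Hypothetical datapoints that have no natural unique identifier.
records = [
    {"temperature": 21.5, "humidity": 0.40},
    {"temperature": 21.5, "humidity": 0.40},  # identical values, so no natural key
    {"temperature": 18.0, "humidity": 0.75},
]

# Assign a random UUID to each record; "id" is an illustrative field name.
for record in records:
    record["id"] = str(uuid.uuid4())

# Check that the new field uniquely identifies every datapoint.
ids = [record["id"] for record in records]
assert len(set(ids)) == len(ids)
```

The generated IDs should be saved alongthe data, since regenerating them later would produce different values and model results could no longer be matched to their datapoints.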
 
-Kolena will attempt to infer common `id_field`s (eg. `locator`, `text`) based on what is present in the dataset during import.
-This can be overridden by explicitly declaring id fields when importing via the Web App from the [:kolena-dataset-16: Datasets](https://app.kolena.com/redirect/datasets)
+Kolena will attempt to infer common `id_field`s (e.g. `locator`, `text`) based on what is present in the dataset during import.
+This can be overridden by explicitly declaring `id_field`s when importing via the Web App from the [:kolena-dataset-16: Datasets](https://app.kolena.com/redirect/datasets)
 page, or the SDK by using the [`upload_dataset`](../reference/dataset/index.md#kolena.dataset.dataset.upload_dataset)
 function.
 
+### Special Fields
+
 Kolena will look for the following fields when displaying datapoints:
 
 | Field Name | Description                                                                                                                              |
@@ -66,15 +68,9 @@ fields like `word_count` may be useful for text datasets.
 
 ## How are Datasets viewed
 
-Kolena allows you to visualize your datasets by use of the Studio. The studio experience depends on the type of data
-relevant to your problem.
-
-The first experience is the Gallery view which allows you to view your data in a grid. This is useful as you can see
-chunks of your data (images, video, audio, text) and view results without having to view each datapoint individually.
-
-The second experience is the Tabular view, used when your data is a set of columns and values.
-An example of this is the [:kolena-widget-16: Rain Forcast ↗](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/rain_forecast)
-dataset.
+Kolena allows you to visualize your datasets by use of the Studio.
+Datapoints represented as images, video, audio, point clouds, or text can be viewed within a gallery layout,
+while all other datapoints may be viewed within a tabular layout.
 
 In order to use the Gallery view you will need to have the `locator` or `text` fields specified in the dataset.
 
@@ -242,7 +238,7 @@ If you wanted to add a thumbnail to the classification data shown above it would
 |---------------------------------------------------------------|--------------------------------------------------------------------|--------------|----------|-----|
 | `s3://kolena-public-examples/cifar10/data/horse0000.png`        | `s3://kolena-public-examples/cifar10/data/thumbnail/horse0000.png` | horse        |     153.994     |    84.126  |
 
-## Formatting Results
+## Formatting Model Results
 
 ### Formatting results for Object Detection
 

@@ -31,7 +31,7 @@ standardization and comparison of multiple models.
 
     ---
 
-    A Dataset represents a structured assembly of datapoints, essential for model evaluation.
+    A Dataset represents a version-controlled collection of datapoints, essential for model evaluation.
 
 - [:kolena-quality-standard-16: Quality Standard](./core-concepts/quality-standard.md)