Although RRD files generally contain a single recording, they may occasionally contain multiple recordings.

For such RRDs, the `load_archive()` function can be used:

<!-- NOLINT_START -->

```python
import rerun as rr

archive = rr.dataframe.load_archive("/path/to/file.rrd")  # placeholder path

print(f"The archive contains {archive.num_recordings()} recordings.")
for recording in archive.all_recordings():
    ...
```

<!-- NOLINT_END -->

The overall content of the recording can be inspected using the `schema()` method:
Expand All @@ -45,7 +46,6 @@ schema.index_columns() # list of all index columns (timelines)
schema.component_columns() # list of all component columns
```

### Creating a view

The first step for getting data out of a recording is to create a view, which requires specifying an index column and some content to include.
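
For example, a view over a `frame_nr` timeline containing every entity can be created as follows (a sketch; the index name and the `/**` content expression match the examples later on this page):

```python
view = recording.view(index="frame_nr", contents="/**")
```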

A view has several APIs to further filter the rows it will return.

<!-- TODO(rerun-io/landing#521): change these headers to h4 when these are properly supported -->

#### Filtering by time range

Rows may be filtered to keep only a given range of values from its index column:

```python
view = view.filter_range_sequence(0, 10)
```

This API exists for both temporal and sequence timelines, and for various units (see the example after this list):
- `view.filter_range_sequence(start_frame, end_frame)` (takes `int` arguments)
- `view.filter_range_seconds(start_second, end_second)` (takes `float` arguments)
- `view.filter_range_nanos(start_nano, end_nano)` (takes `int` arguments)
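
For instance, keeping only the data logged between 1 and 5 seconds on a temporal timeline (a sketch; the time values are placeholders):

```python
view = view.filter_range_seconds(1.0, 5.0)
```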

#### Filtering by index value

Rows may be filtered to keep only those whose index corresponds to a specific set of values:

```python
view = view.filter_index_values([0, 5, 10])
```

Note that a precise match is required. Since Rerun internally stores times as `int64`, this API is only available for integer arguments (nanos or sequence numbers); floating point seconds would risk false mismatches due to numerical conversion.

#### Filtering by column not null

Rows where a specific column has null values may be filtered out using the `filter_is_not_null()` method. When using this method, only rows for which a logging event exists for the provided column are returned.
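
For example (a sketch; `/points:Position3D` is a placeholder for any component column present in the recording):

```python
view = view.filter_is_not_null("/points:Position3D")
```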

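The same API can also be used to sample the index at regular intervals by passing explicitly computed index values. A minimal sketch, assuming a nanosecond-based timeline sampled every millisecond:

```python
import numpy as np

# sample every millisecond (1e6 nanoseconds) over the first second of data
view = view.filter_index_values(np.arange(0, 1_000_000_000, 1_000_000, dtype=np.int64))
```
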
In this case, the view will return rows in multiples of 1e6 nanoseconds (i.e. for each millisecond).

Note that this feature is typically used in conjunction with `fill_latest_at()` (see the next section) to enable arbitrary resampling of the original data.

### Filling empty values with latest-at data

By default, the rows returned by the view may be sparse and contain values only for the columns where a logging event actually occurred at the corresponding index value. The view can optionally replace these empty cells using a latest-at query. This means that, for each such empty cell, the view traces back to find the last logged value and uses it instead. This is enabled by calling the `fill_latest_at()` method:

```python
view = view.fill_latest_at()
```

Once the view is fully set up (possibly using the filtering features previously described), its content can be read using the `select()` method. This method optionally allows specifying which subset of columns should be produced:

```python
# select all columns
record_batches = view.select()
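
# a sketch (not from the original example): select only specific columns,
# assuming string selectors of the form "timeline" or "entity_path:Component"
record_batches = view.select("frame_nr", "/points:Position3D")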
```

The `select()` method returns a [`pyarrow.RecordBatchReader`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), which is essentially an iterator over a stream of `pyarrow.RecordBatch`es containing the query results.

In the rest of this page, we explore how these `RecordBatch`es can be ingested in some of the popular data science packages.

## Load data to a PyArrow `Table`

The `RecordBatchReader` provides a [`read_all()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_all) method which directly produces a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table):

```python
import rerun as rr

recording = rr.dataframe.load_recording("/path/to/file.rrd")  # placeholder path
view = recording.view(index="frame_nr", contents="/**")
table = view.select().read_all()
```

## Load data to a Pandas dataframe <!-- NOLINT -->

The `RecordBatchReader` provides a [`read_pandas()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_pandas) method which returns a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):

```python
import rerun as rr

recording = rr.dataframe.load_recording("/path/to/file.rrd")  # placeholder path
view = recording.view(index="frame_nr", contents="/**")
df = view.select().read_pandas()
```

## Load data to a Polars dataframe

The [Polars](https://pola.rs) package provides a `from_arrow()` function which can ingest the `pyarrow.Table` produced by `read_all()`:

```python
import rerun as rr
import polars as pl

recording = rr.dataframe.load_recording("/path/to/file.rrd")  # placeholder path
view = recording.view(index="frame_nr", contents="/**")
df = pl.from_arrow(view.select().read_all())
```

## Load data to a DuckDB relation

A [DuckDB](https://duckdb.org) relation can be created directly using the `pyarrow.RecordBatchReader` returned by `select()`. A minimal sketch, assuming a recording and view set up as in the previous examples:
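
```python
import duckdb
import rerun as rr

recording = rr.dataframe.load_recording("/path/to/file.rrd")  # placeholder path
view = recording.view(index="frame_nr", contents="/**")

# duckdb.arrow() creates a relation from Arrow data, including record batch readers
rel = duckdb.arrow(view.select())
```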