
Data

The NautilusTrader platform defines a range of built-in data types crafted specifically to represent a trading domain:

  • OrderBookDelta (L1/L2/L3): Most granular order book updates.
  • OrderBookDeltas (L1/L2/L3): Bundles multiple order book deltas.
  • OrderBookDepth10: Aggregated order book snapshot (10 levels per side).
  • QuoteTick: Top-of-book best bid and ask prices and sizes.
  • TradeTick: A single trade/match event between counterparties.
  • Bar: OHLCV bar data, aggregated using a specific aggregation method.
  • Instrument: General base class for a tradable instrument.
  • InstrumentStatus: An instrument-level status event.
  • InstrumentClose: An instrument closing price.

Each of these data types inherits from Data, which defines two fields:

  • ts_event: UNIX timestamp (nanoseconds) when the data event occurred.
  • ts_init: UNIX timestamp (nanoseconds) when the object was initialized.
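
For example, both fields are available on every built-in data type, such as a QuoteTick. The following minimal sketch constructs one directly (argument names may vary between versions; all values are illustrative):

from nautilus_trader.model.data import QuoteTick
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.model.objects import Price, Quantity


tick = QuoteTick(
    instrument_id=InstrumentId.from_str("BTCUSDT.BINANCE"),
    bid_price=Price.from_str("60000.00"),
    ask_price=Price.from_str("60000.10"),
    bid_size=Quantity.from_str("1.5"),
    ask_size=Quantity.from_str("1.0"),
    ts_event=1_700_000_000_000_000_000,  # when the quote occurred at the venue
    ts_init=1_700_000_000_000_001_000,   # when the object was initialized locally
)

assert tick.ts_init >= tick.ts_event  # initialization never precedes the event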

This inheritance ensures chronological data ordering (vital for backtesting), while also enhancing analytics.

Consistency is key: data flows through the platform in exactly the same way across all system environment contexts (backtest, sandbox, live), primarily via the MessageBus to the DataEngine and on to subscribed or registered handlers.

For those seeking customization, the platform supports user-defined data types. Refer to the advanced Custom data guide for further details.
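
As a taste of what this looks like, the following minimal sketch subclasses Data and implements the two required properties (the MyDataPoint name and its value field are hypothetical):

from nautilus_trader.core.data import Data


class MyDataPoint(Data):
    """A hypothetical user-defined data type carrying a single value."""

    def __init__(self, value: float, ts_event: int, ts_init: int) -> None:
        self.value = value
        self._ts_event = ts_event
        self._ts_init = ts_init

    @property
    def ts_event(self) -> int:
        # UNIX timestamp (nanoseconds) when the data event occurred
        return self._ts_event

    @property
    def ts_init(self) -> int:
        # UNIX timestamp (nanoseconds) when the object was initialized
        return self._ts_init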

Loading data

NautilusTrader facilitates data loading and conversion for three main use cases:

  • Populating the BacktestEngine directly to run backtests.
  • Persisting data in the Nautilus-specific Parquet format to the data catalog via ParquetDataCatalog.write_data(...), for later use with a BacktestNode.
  • Research purposes (ensuring data is consistent between research and backtesting).

Regardless of the destination, the process remains the same: converting diverse external data formats into Nautilus data structures.

To achieve this, two main components are necessary:

  • A type of DataLoader (normally specific per raw source/format) which can read the data and return a pd.DataFrame with the correct schema for the desired Nautilus object.
  • A type of DataWrangler (specific per data type) which takes this pd.DataFrame and returns a list[Data] of Nautilus objects.

Data loaders

Data loader components are typically specific to the raw source/format, and per integration. For instance, Binance order book data is stored in raw CSV files with an entirely different format from Databento Binary Encoding (DBN) files.

Data wranglers

Data wranglers are implemented per specific Nautilus data type and can be found in the nautilus_trader.persistence.wranglers module. The following wranglers currently exist:

  • OrderBookDeltaDataWrangler
  • QuoteTickDataWrangler
  • TradeTickDataWrangler
  • BarDataWrangler
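
For example, a QuoteTickDataWrangler can be used as follows. This is a minimal sketch: the column names expected in the DataFrame may differ between versions, and the data values are illustrative only:

import pandas as pd

from nautilus_trader.persistence.wranglers import QuoteTickDataWrangler
from nautilus_trader.test_kit.providers import TestInstrumentProvider


instrument = TestInstrumentProvider.default_fx_ccy("AUD/USD")

# Assumption: a UTC timestamp-indexed DataFrame with bid/ask price columns
df = pd.DataFrame(
    {"bid_price": [0.6601, 0.6602], "ask_price": [0.6603, 0.6604]},
    index=pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 00:00:01"], utc=True),
)

wrangler = QuoteTickDataWrangler(instrument=instrument)
ticks = wrangler.process(df)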

Warning

At the risk of causing confusion, there is also a growing number of DataWrangler v2 components. These take a pd.DataFrame, typically with a different fixed-width Nautilus Arrow v2 schema, and output pyo3 Nautilus objects which are only compatible with the new version of the Nautilus core, currently in development.

These pyo3-provided data objects are not compatible anywhere the legacy Cython objects are currently used (for example, when adding directly to a BacktestEngine).

Transformation pipeline

Process flow:

  1. Raw data (e.g., CSV) is input into the pipeline.
  2. DataLoader processes the raw data and converts it into a pd.DataFrame.
  3. DataWrangler further processes the pd.DataFrame to generate a list of Nautilus objects.
  4. The Nautilus list[Data] is the output of the data loading process.

This diagram illustrates how raw data is transformed into Nautilus data structures.

  ┌──────────┐    ┌──────────────────────┐                  ┌──────────────────────┐
  │          │    │                      │                  │                      │
  │          │    │                      │                  │                      │
  │ Raw data │    │                      │  `pd.DataFrame`  │                      │
  │ (CSV)    ├───►│      DataLoader      ├─────────────────►│     DataWrangler     ├───► Nautilus `list[Data]`
  │          │    │                      │                  │                      │
  │          │    │                      │                  │                      │
  │          │    │                      │                  │                      │
  └──────────┘    └──────────────────────┘                  └──────────────────────┘

Concretely, this would involve:

  • BinanceOrderBookDeltaDataLoader.load(...) which reads CSV files provided by Binance from disk, and returns a pd.DataFrame.
  • OrderBookDeltaDataWrangler.process(...) which takes the pd.DataFrame and returns list[OrderBookDelta].

The following example shows how to accomplish the above in Python:

from nautilus_trader import TEST_DATA_DIR
from nautilus_trader.persistence.loaders import BinanceOrderBookDeltaDataLoader
from nautilus_trader.persistence.wranglers import OrderBookDeltaDataWrangler
from nautilus_trader.test_kit.providers import TestInstrumentProvider


# Load raw data
data_path = TEST_DATA_DIR / "binance" / "btcusdt-depth-snap.csv"
df = BinanceOrderBookDeltaDataLoader.load(data_path)

# Set up a wrangler
instrument = TestInstrumentProvider.btcusdt_binance()
wrangler = OrderBookDeltaDataWrangler(instrument)

# Process into a list of `OrderBookDelta` Nautilus objects
deltas = wrangler.process(df)

Data catalog

The data catalog is a central store for Nautilus data, persisted in the Parquet file format.

We have chosen Parquet as the storage format for the following reasons:

  • It performs much better than CSV/JSON/HDF5/etc in terms of compression ratio (storage size) and read performance.
  • It does not require any separate running components (for example a database).
  • It is quick and simple to get up and running with.

The Arrow schemas used for the Parquet format are either single sourced in the core persistence Rust crate, or available from the /serialization/arrow/schema.py module.

Note

2023-10-14: The current plan is to eventually phase out the Python schemas module, so that all schemas are single sourced in the Rust core.

Initializing

The data catalog can be initialized from a NAUTILUS_PATH environment variable, or by explicitly passing in a path-like object.

The following example shows how to initialize a data catalog for a path where data has already been written to disk.

from pathlib import Path
from nautilus_trader.persistence.catalog import ParquetDataCatalog


CATALOG_PATH = Path.cwd() / "catalog"

# Create a new catalog instance
catalog = ParquetDataCatalog(CATALOG_PATH)
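
Alternatively, when the NAUTILUS_PATH environment variable is set, the catalog can be constructed from it. A minimal sketch, assuming the from_env() class method (exact path resolution may vary by version):

import os

# Assumption: from_env() resolves the catalog directory relative to NAUTILUS_PATH
os.environ["NAUTILUS_PATH"] = str(CATALOG_PATH.parent)

catalog = ParquetDataCatalog.from_env()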

Writing data

New data can be stored in the catalog, which is effectively writing the given data to disk in the Nautilus-specific Parquet format. All Nautilus built-in Data objects are supported, and any data which inherits from Data can be written.

The following example shows the above list of Binance OrderBookDelta objects being written.

catalog.write_data(deltas)

Basename template

Nautilus makes no assumptions about how data may be partitioned between files for a particular data type and instrument ID.

The basename_template keyword argument is an additional optional naming component for the output files. The template should include placeholders that will be filled in with actual values at runtime. These values can be automatically derived from the data or provided as additional keyword arguments.

For example, using a basename template like {date} for AUD/USD.SIM quote tick data, and assuming "date" is a provided or derivable field, could result in a filename like "2023-01-01.parquet" under the "quote_tick/audusd.sim/" catalog directory. If not provided, a default naming scheme will be applied. This parameter should be specified as a keyword argument, like write_data(data, basename_template="{date}").
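
A short sketch of such a call (quote_ticks here is a hypothetical list of QuoteTick objects for AUD/USD.SIM):

# "date" is assumed to be provided or derivable from the data
catalog.write_data(quote_ticks, basename_template="{date}")
# Could result in e.g. "2023-01-01.parquet" under "quote_tick/audusd.sim/"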

Warning

Any data which already exists under a filename will be overwritten. If a basename_template is not provided, then it's very likely that existing data for the data type and instrument ID will be overwritten. To prevent data loss, ensure that the basename_template (or the default naming scheme) generates unique filenames for different data sets.

Rust Arrow schema implementations are available for the following data types (enhanced performance):

  • OrderBookDelta
  • QuoteTick
  • TradeTick
  • Bar

Reading data

Any stored data can then be read back into memory:

import pandas as pd
import pytz

from nautilus_trader.core.datetime import dt_to_unix_nanos


start = dt_to_unix_nanos(pd.Timestamp("2020-01-03", tz=pytz.utc))
end = dt_to_unix_nanos(pd.Timestamp("2020-01-04", tz=pytz.utc))

deltas = catalog.order_book_deltas(instrument_ids=[instrument.id.value], start=start, end=end)

Streaming data

When running backtests in streaming mode with a BacktestNode, the data catalog can be used to stream the data in batches.

The following example shows how to achieve this by initializing a BacktestDataConfig configuration object:

from nautilus_trader.config import BacktestDataConfig
from nautilus_trader.model.data import OrderBookDelta


data_config = BacktestDataConfig(
    catalog_path=str(catalog.path),
    data_cls=OrderBookDelta,
    instrument_id=instrument.id,
    start_time=start,
    end_time=end,
)

This configuration object can then be passed into a BacktestRunConfig, which in turn is passed into a BacktestNode as part of a run. See the Backtest (high-level API) tutorial for further details.
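
As a minimal sketch of how the pieces fit together (the venue and engine configuration values below are illustrative assumptions only):

from nautilus_trader.backtest.node import BacktestNode
from nautilus_trader.config import BacktestEngineConfig
from nautilus_trader.config import BacktestRunConfig
from nautilus_trader.config import BacktestVenueConfig


venue_config = BacktestVenueConfig(
    name="BINANCE",
    oms_type="NETTING",
    account_type="CASH",
    starting_balances=["10_000 USDT"],
)

run_config = BacktestRunConfig(
    engine=BacktestEngineConfig(),  # strategies would be configured here
    data=[data_config],
    venues=[venue_config],
)

node = BacktestNode(configs=[run_config])
results = node.run()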
