DataFrame to DMatrix Conversion Spec #874

@tqchen

This is a centralized issue for specifying how a dataframe (pandas, R's data.frame) can be converted into a DMatrix. Dataframes can be a helpful data source. Such a specification would allow direct data ingestion from dataframes, avoid memory-copy issues, and possibly ease external memory integration.

Currently this is straightforward for continuous features. It is less obvious how to do it for categorical features and sparse input.

Goal

Let us not aim to do complicated things, for example automatically indexing all the factors (categorical features) or accepting string input types.

Instead, have a _minimum_ specification of how to represent sparse input and categorical features, so that they can be quickly converted to a sparse matrix type. Let the dataframe solutions do jobs such as feature engineering.

Example Proposal 1

All the categorical columns must already be mapped to unique integers, with no overlap between columns. So column C1 will be in [0, n) and column C2 will be in [n, n+m), where n is the number of unique categories in C1 and m is the number of unique categories in C2.
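
To make the index ranges concrete, here is a minimal sketch of what a caller would have to do under this proposal. It assumes each column has already been coded locally in [0, size) (for example from pandas categorical codes); the helper name `globally_index` is hypothetical and not part of any existing API.

```python
from itertools import accumulate


def globally_index(columns, cardinalities):
    """Shift per-column codes so each column owns a disjoint integer range.

    columns:       list of lists, column j holds codes in [0, cardinalities[j])
    cardinalities: number of unique categories per column, e.g. [n, m]
    """
    # Column j starts at the sum of the sizes before it: [0, n, n+m, ...]
    offsets = [0] + list(accumulate(cardinalities))[:-1]
    shifted = []
    for codes, size, offset in zip(columns, cardinalities, offsets):
        for code in codes:
            if not 0 <= code < size:
                raise ValueError(f"code {code} outside [0, {size})")
        shifted.append([code + offset for code in codes])
    return shifted


c1 = [0, 2, 1, 0]  # n = 3 categories, stays in [0, 3)
c2 = [1, 0, 1, 1]  # m = 2 categories, moves to [3, 5)
shifted = globally_index([c1, c2], [3, 2])
print(shifted)  # [[0, 2, 1, 0], [4, 3, 4, 4]]
```

With this layout the integers themselves carry the column boundaries, so no extra size information would need to reach the constructor.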

Example Proposal 2

Map each existing categorical column into integers that are unique within that column. C1 will be in [0, n) and C2 will be in [0, m). When constructing the DMatrix, also pass the size of each column, [n, m], to the constructor.
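
As a rough illustration of how the constructor could use the sizes, the sketch below expands locally coded columns into CSR arrays with one nonzero per categorical column per row. One-hot expansion is an assumption here, not something this proposal fixes, and `one_hot_csr` is a hypothetical helper name.

```python
def one_hot_csr(columns, sizes):
    """Build CSR arrays from locally coded categorical columns.

    columns: list of lists, column j holds codes in [0, sizes[j])
    sizes:   per-column category counts, e.g. [n, m]
    Returns (indptr, indices, data, num_col).
    """
    # Derive each column's starting position from the sizes passed in.
    offsets, total = [], 0
    for size in sizes:
        offsets.append(total)
        total += size

    num_rows = len(columns[0]) if columns else 0
    indptr, indices, data = [0], [], []
    for row in range(num_rows):
        for codes, offset in zip(columns, offsets):
            indices.append(offset + codes[row])  # global feature index
            data.append(1.0)
        indptr.append(len(indices))  # end of this row's entries
    return indptr, indices, data, total


c1 = [0, 2, 1]  # codes in [0, 3)
c2 = [1, 0, 1]  # codes in [0, 2)
indptr, indices, data, num_col = one_hot_csr([c1, c2], [3, 2])
print(indptr)   # [0, 2, 4, 6]
print(indices)  # [0, 4, 2, 3, 1, 4]
print(num_col)  # 5
```

The difference from Proposal 1 is only where the offset bookkeeping lives: here the dataframe side keeps simple per-column codes and the conversion code computes the offsets from [n, m].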
