[!div class="op_single_selector" title1="Select the version of Azure Machine Learning developer platform you use:"]
[!INCLUDE sdk v2] [!INCLUDE cli v2]
Azure ML supports a Table type (`mltable`) that lets you define a blueprint for how data files materialize into a Pandas or Spark data frame (rows and columns). Azure ML Tables have these two key features:
- An `MLTable` file. A YAML-based file that defines the materialization blueprint. In the `MLTable` file, you can specify:
  - The storage location(s) of the data - local, in the cloud, or on a public http(s) server. Azure ML supports globbing patterns, so these locations can specify sets of filenames with wildcard characters (`*`, `?`, `[abc]`, `[a-z]`).
  - Read transformations - for example, the file format type (delimited text, Parquet, Delta, JSON), delimiters, headers, etc.
  - Column type conversions (enforce schema).
  - New column creation, using folder structure information - for example, creation of a year and month column, using the `{year}/{month}` folder structure in the path.
  - Subsets of data to materialize - for example, filter rows, keep/drop columns, take random samples.
- A fast and efficient engine to materialize the data into a Pandas or Spark dataframe, according to the blueprint defined in the `MLTable` file. The engine relies on Rust, a language known for high speed and high memory efficiency.
In this article you'll learn:
[!div class="checklist"]
- When to use Tables instead of Files or Folders.
- How to install the MLTable SDK.
- How to define the materialization blueprint using an `MLTable` file.
- Examples that show use of Tables in Azure ML.
- How to use Tables during interactive development (for example, in a notebook).
- An Azure subscription. If you don't already have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.
- An Azure Machine Learning workspace.
Tabular data doesn't require Azure ML Tables (`mltable`). You can use Azure ML File (`uri_file`) and Folder (`uri_folder`) types, and provide your own parsing logic, to materialize the data into a Pandas or Spark data frame. In cases where you have a simple CSV file or Parquet folder, you'll find it easier to use Azure ML Files/Folders rather than Tables.
Let's assume you have a single CSV file on a public http server that you'd like to read into Pandas. Two lines of Python code can read the data:
import pandas as pd
pd.read_csv("https://azuremlexamples.blob.core.windows.net/datasets/iris.csv")
With Azure ML Tables, you would first need to define the `MLTable` file:
# /data/iris/MLTable
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
- file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
Then read the data into Pandas:
import mltable
# load the folder containing MLTable file
tbl = mltable.load("/data/iris")
tbl.to_pandas_dataframe()
For a simple CSV file, defining an `MLTable` creates unnecessary extra work. Instead, you'll find Azure ML Tables (`mltable`) much more useful for these scenarios:
- The schema of your data is complex and/or that schema frequently changes.
- You only need a subset of data (for example: a sample of rows or files, specific columns, etc.).
- AutoML jobs that require tabular data.
We explained when to avoid Azure ML Tables. Here, we'll see a motivating example of when Azure ML Tables can help your workflow. Imagine a scenario where you have many text files in a folder:
├── my_data
│ ├── file1.txt
│ ├── file1_use_this.txt
│ ├── file2.txt
│ ├── file2_use_this.txt
.
.
.
│ ├── file1000.txt
│ ├── file1000_use_this.txt
Each text file has this structure:
store_location date zip_code amount x y z staticvar1 staticvar2
Seattle 20/04/2022 12324 123.4 true no 0 2 4
.
.
.
London 20/04/2022 XX358YY 156 true yes 1 2 4
The data has these important characteristics:
- Only files with the suffix `_use_this.txt` have the relevant data. We can ignore other file names that don't have that suffix.
- Date data has the format `DD/MM/YYYY`; it should not be read as a string.
- The x, y, z columns are booleans, not strings. Values in the data can be `yes`/`no`, `1`/`0`, or `true`/`false`.
- The store location is an index that is useful for generating data subsets.
- The files are encoded in `ascii` format.
- Every file in the folder contains the same header.
- The first million records for zip_code are numeric, but later they become alphanumeric.
- Some static variables in the data aren't useful for machine learning.
You can materialize the above text files into a data frame using Pandas and an Azure ML Folder (`uri_folder`):
import argparse
import glob
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input_folder", type=str)
args = parser.parse_args()

path = os.path.join(args.input_folder, "*_use_this.txt")
files = glob.glob(path)

# create empty list to hold each file's dataframe
dfl = []

# dict of column types (dates are handled separately with parse_dates)
col_types = {
    "zip_code": str,
    "x": bool,
    "y": bool,
    "z": bool
}

# enumerate files into a list of dfs
for f in files:
    csv = pd.read_table(
        f,
        delimiter=" ",
        header=0,
        usecols=["store_location", "zip_code", "date", "amount", "x", "y", "z"],
        dtype=col_types,
        parse_dates=["date"],
        dayfirst=True,  # dates are DD/MM/YYYY
        encoding="ascii",
        true_values=["yes", "1", "true"],
        false_values=["no", "0", "false"]
    )
    dfl.append(csv)

# concatenate the list of dataframes
df = pd.concat(dfl)

# set the index column
df = df.set_index("store_location")
However, problems can occur when:

- The schema changes (for example, a column name changes): all consumers of the data must independently update their Python code. Other examples include type changes, added or removed columns, encoding changes, etc.
- The data size increases: if the data becomes too large for Pandas to process, all consumers of the data must switch to a more scalable library (PySpark/Dask).
Azure ML Tables allow the data asset creator to define the materialization blueprint in a single file. Consumers can then easily materialize the data into a data frame, without writing their own Python parsing logic. The creator of the data asset defines an `MLTable` file:
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# paths are relative to the location of the MLTable file
paths:
  - pattern: ./*_use_this.txt

traits:
  index_columns:
    - "store_location"

transformations:
  - read_delimited:
      encoding: ascii
      header: all_files_same_headers
      delimiter: " "
  - keep_columns: ["store_location", "zip_code", "date", "amount", "x", "y", "z"]
  - convert_column_types:
      - columns: date
        column_type:
          datetime:
            formats:
              - "%d/%m/%Y"
      - columns: ["x", "y", "z"]
        column_type:
          boolean:
            mismatch_as: error
            true_values:
              - "yes"
              - "true"
              - "1"
            false_values:
              - "no"
              - "false"
              - "0"
Consumers can then materialize the data into a data frame with three lines of Python code:
import mltable
tbl = mltable.load("./my_data")
df = tbl.to_pandas_dataframe()
In this scenario, Azure ML Tables, rather than Files or Folders, offer these key benefits:
- Consumers don't need to create their own Python parsing logic to materialize the data into Pandas or Spark.
- One location (the MLTable file) can handle schema changes, to avoid required code changes in multiple locations.
| Type | V2 API | V1 API | Canonical Scenarios | V2/V1 API Difference |
|---|---|---|---|---|
| **File**<br>Reference a single file | `uri_file` | `FileDataset` | Read/write a single file - the file can have any format. | A type new to V2 APIs. In V1 APIs, files always mapped to a folder on the compute target filesystem; this mapping required an `os.path.join`. In V2 APIs, the single file is mapped, so you can refer to that location in your code. |
| **Folder**<br>Reference a single folder | `uri_folder` | `FileDataset` | You must read/write a folder of Parquet/CSV files into Pandas/Spark.<br><br>Deep learning with image, text, audio, or video files located in a folder. | In V1 APIs, `FileDataset` had an associated engine that could take a file sample from a folder. In V2 APIs, a Folder is a simple mapping to the compute target filesystem. |
| **Table**<br>Reference a data table | `mltable` | `TabularDataset` | You have a complex schema subject to frequent changes, or you need a subset of large tabular data.<br><br>AutoML with Tables. | In V1 APIs, the Azure ML back-end stored the data materialization blueprint, so `TabularDataset` only worked if you had an Azure ML workspace. `mltable` stores the data materialization blueprint in your storage, which means you can use it disconnected from Azure ML - for example, locally or on-premises. In V2 APIs, you'll find it easier to transition from local to remote jobs. |
MLTable is pre-installed on the Azure ML Compute Instance, Azure ML Spark, and the DSVM. You can install the `mltable` Python library with this command:
pip install mltable
Note
- MLTable is a separate library from `azure-ai-ml`.
- If you use a Compute Instance/Spark/DSVM, we recommend that you keep the package up-to-date with `pip install -U mltable`.
- MLTable can be used completely disconnected from Azure ML (and the cloud). You can use MLTable anywhere you can get a Python session - for example, locally, on a cloud VM, Databricks, Synapse, or an on-premises server (see the sketch below).
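As a minimal sketch of that disconnected use, the following reads a local CSV file with `mltable` alone - no Azure ML workspace involved. The file name `local_data.csv` is a hypothetical placeholder:

```python
import mltable

# a local CSV file - no cloud storage or Azure ML workspace required
# (the file name is a hypothetical placeholder)
path = {'file': './local_data.csv'}

# build the blueprint and save the MLTable file next to the data
tbl = mltable.from_delimited_files(paths=[path])
tbl.save(".")

# materialize into Pandas
df = tbl.to_pandas_dataframe()
print(df.head())
```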
The `MLTable` file is a YAML-based file that defines the materialization blueprint. In the `MLTable` file, you can specify:

- The data storage location(s), which can be local, in the cloud, or on a public http(s) server. Globbing patterns are also supported, so wildcard characters (`*`, `?`, `[abc]`, `[a-z]`) can specify sets of filenames.
- Read transformations - for example, the file format type (delimited text, Parquet, Delta, JSON), delimiters, headers, etc.
- Column type conversions (enforce schema).
- New columns, using folder structure information - for example, create a year and month column using the `{year}/{month}` folder structure in the path.
- Subsets of data to materialize - for example, filter rows, keep/drop columns, take random samples.
Although the `MLTable` file contains YAML, it must be named exactly `MLTable`, with no filename extension.
Caution
Filename extensions of yaml or yml, as in `MLTable.yaml` or `MLTable.yml`, will cause an `MLTable file not found` error when loading. The file must be named exactly `MLTable`.
You can author `MLTable` files with the Python SDK, or directly in an IDE (like VSCode). This example shows an MLTable file authored with the SDK:
import mltable

data_files = {
    'pattern': './*.parquet'
}

tbl = mltable.from_parquet_files(paths=[data_files])

# add additional transformations
# tbl = tbl.keep_columns()
# tbl = tbl.filter()

# save to local directory
tbl.save("<local_path>")
To directly author `MLTable` files, we recommend VSCode, because VSCode provides autocomplete. Additionally, VSCode lets you add Azure cloud storage to your workspace, for seamless `MLTable` file edits on cloud storage.
To enable autocomplete and IntelliSense for `MLTable` files in VSCode, associate the `MLTable` filename with YAML.
In VSCode, select File>Preferences>Settings. In the search bar of the Settings tab, type associations:
:::image type="content" source="media/how-to-mltable/vscode-mltable.png" alt-text="file association in VSCode":::
Under File: Associations select Add item and then enter the following:
- Item: MLTable
- Value: yaml
You can then author MLTable files with autocomplete and IntelliSense, if you include the following schema at the top of your `MLTable` file:
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
Tip
The Azure ML VSCode extension shows the available schemas and autocomplete when you type `$schema:`.
:::image type="content" source="media/how-to-mltable/vscode-storage-ext-0.png" alt-text="Autocomplete":::
We recommend co-locating the `MLTable` file with the underlying data, for example:
├── my_data
│ ├── MLTable
│ ├── file_1.txt
.
.
.
│ ├── file_n.txt
Co-locating the `MLTable` file with the data ensures a self-contained artifact that stores all needed resources in that one folder, regardless of whether that folder is stored on your local drive, in your cloud store, or on a public http server.
Because the `MLTable` file is co-located with the data, the `paths` defined in the `MLTable` file should be relative to the location of the `MLTable` file. For example, in the above scenario, where the `MLTable` file is in the same folder as the `txt` data, define the paths as:
type: mltable
# paths must be relative to the location of the MLTable file
paths:
- pattern: ./*.txt
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
Typically, you'll author your MLTable file either locally in an IDE (such as VSCode), or on a cloud-based VM such as an Azure ML Compute Instance. To create a self-contained MLTable artifact, you must store the `MLTable` file with the data. If you keep your MLTable artifact (data and `MLTable` file) in local storage, creating a data asset is the easiest way to get it into cloud storage, because the artifact uploads automatically.
If your data is already in cloud storage, you have several options to co-locate your `MLTable` file with the data.
VSCode has an Azure Storage VSCode extension that can directly create and author files on Cloud storage. These steps show how to do it:
- Install the Azure Storage VSCode extension.
- In the left-hand Activity Bar, select Azure, and find your subscription storage accounts (you can filter the UI by subscriptions): :::image type="content" source="media/how-to-mltable/vscode-storage-ext-1.png" alt-text="Screenshot of storage resources.":::
- Next, navigate to the container (filesystem) that has your data, and select the Open in Explorer button: :::image type="content" source="media/how-to-mltable/vscode-storage-ext-2.png" alt-text="Screenshot highlighting the container to open in explorer.":::
- Next, select Add to Workspace: :::image type="content" source="media/how-to-mltable/vscode-storage-ext-3.png" alt-text="Screenshot highlighting Add to Workspace.":::
- In the Explorer, you can see your data in cloud storage, and it will appear alongside your code files:
- To author an `MLTable` file in cloud storage directly, navigate to the folder that contains the data, right-click, and select New File. Name the file `MLTable` and proceed. :::image type="content" source="media/how-to-mltable/vscode-storage-ext-4.png" alt-text="Screenshot highlighting the data folder that will store the MLTable file.":::
The `azcopy` utility - pre-installed on the Azure ML Compute Instance and DSVM - allows you to upload and download files between Azure Storage and your compute:
azcopy login
SOURCE=<path-to-mltable-file> # for example: ./MLTable
DEST=https://<account_name>.blob.core.windows.net/<container>/<path>
azcopy cp $SOURCE $DEST
If you author files locally (or on a DSVM), you can use `azcopy` or Azure Storage Explorer, which lets you manage files in Azure Storage with a graphical user interface (GUI). Once you download and install Azure Storage Explorer, select the storage account and container to which you want to upload the `MLTable` file. Next:
- On the main pane's toolbar, select Upload, and then select Upload Files. :::image type="content" source="../media/vs-azure-tools-storage-explorer-blobs/blob-upload-files-menu.png" alt-text="Screenshot highlighting upload files.":::
- In the Select files to upload dialog box, select the `MLTable` file you want to upload.
- Select Open to begin the upload.
| Read Transformation | Parameters |
|---|---|
| `read_delimited` | `infer_column_types`: Boolean to infer column data types. Defaults to True. Type inference requires that the current compute can access the data source. Currently, type inference will only pull the first 200 rows.<br>`encoding`: Specify the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom` and `windows1252`. Default encoding: `utf8`.<br>`header`: the user can choose one of the following options: `no_header`, `from_first_file`, `all_files_different_headers`, `all_files_same_headers`. Defaults to `all_files_same_headers`.<br>`delimiter`: The separator used to split columns.<br>`empty_as_string`: Specify if empty field values should load as empty strings. The default (False) will read empty field values as nulls. Passing this setting as True will read empty field values as empty strings. If the values are converted to numeric or datetime, then this setting has no effect, as empty values will be converted to nulls.<br>`include_path_column`: Boolean to keep path information as a column in the table. Defaults to False. This setting is useful when reading multiple files, and you want to know from which file a specific record originated. Additionally, you can keep useful information in the file path.<br>`support_multi_line`: By default (`support_multi_line=False`), all line breaks, including line breaks in quoted field values, will be interpreted as a record break. This approach to data reading increases speed, and it offers optimization for parallel execution on multiple CPU cores. However, it may result in silent production of more records with misaligned field values. Set this value to True when the delimited files are known to contain quoted line breaks. |
| `read_parquet` | `include_path_column`: Boolean to keep path information as a table column. Defaults to False. This setting helps when you read multiple files, and you want to know from which file a specific record originated. Additionally, you can keep useful information in the file path. |
| `read_delta_lake` | `timestamp_as_of`: Timestamp to be specified for time-travel on the specific Delta Lake data.<br>`version_as_of`: Version to be specified for time-travel on the specific Delta Lake data. |
| `read_json_lines` | `include_path_column`: Boolean to keep path information as an MLTable column. Defaults to False. When you read multiple files, this setting helps identify the source file of a specific record. Additionally, you can keep useful information in the file path.<br>`invalid_lines`: How to handle lines that have invalid JSON. Supported values: `error` and `drop`. Defaults to `error`.<br>`encoding`: Specify the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom` and `windows1252`. Default is `utf8`. |
| Transformation | Description | Example(s) |
|---|---|---|
| `convert_column_types` | Adds a transformation step to convert the specified columns into their respective specified new types. Takes parameters `columns` (column names you would like to convert) and `column_type` (the type to which you want to convert). | `- convert_column_types:` Convert the Age column to integer.<br>`- convert_column_types:` Convert the date column to the format `dd/mm/yyyy`. Read `to_datetime` for more information about datetime conversion.<br>`- convert_column_types:` Convert the is_weekday column to a boolean; yes/true/1 values in the column will map to `True`, and no/false/0 values in the column will map to `False`. Read `to_bool` for more information about boolean conversion. |
| `drop_columns` | Adds a transformation step to remove the specified columns from the dataset. | `- drop_columns: ["col1", "col2"]` |
| `keep_columns` | Adds a transformation step to keep the specified columns, and remove all others from the dataset. | `- keep_columns: ["col1", "col2"]` |
| `extract_columns_from_partition_format` | Adds a transformation step to use the partition information of each path, and extract the information into columns based on the specified partition format. | `- extract_columns_from_partition_format: {column_name:yyyy/MM/dd/HH/mm/ss}` creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. |
| `filter` | Filter the data, leaving only the records that match the specified expression. | `- filter: 'col("temperature") > 32 and col("location") == "UK"'` Only leave rows where the temperature exceeds 32 and the location is the UK. |
| `skip` | Adds a transformation step to skip the first count rows of this MLTable. | `- skip: 10` Skip the first 10 rows. |
| `take` | Adds a transformation step to select the first count rows of this MLTable. | `- take: 5` Take only the first five rows. |
| `take_random_sample` | Adds a transformation step to randomly select each row of this MLTable with a probability chance. The probability must be in the range [0, 1]. May optionally set a random seed. | `- take_random_sample:` Take a 10 percent random sample of rows using a random seed of 123. |
An Azure ML data asset resembles web browser bookmarks (favorites). Instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.
You can create a Table data asset using:
az ml data create --name <name_of_asset> --version 1 --path <folder_with_MLTable> --type mltable
Note
The path points to the folder that contains the `MLTable` file. The path can be local or remote (a cloud storage URI). If the path is a local folder, the folder is automatically uploaded to the default Azure ML datastore in the cloud.
When you create a data asset from a local folder, only the data in the same folder as the `MLTable` file uploads to the default Azure ML datastore in the cloud before the asset is created. If a relative path in the `MLTable` `paths` section points to data that isn't in that folder, the data won't upload, and the relative path won't work.
You can create a data asset in Azure Machine Learning with this Python Code:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# my_path must point to the folder containing the MLTable artifact (MLTable file + data)
# Supported paths include:
# local: './<path>'
# blob: 'wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>'
# ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>'
# Datastore: 'azureml://datastores/<data_store_name>/paths/<path>'
my_path = '<path>'

my_data = Data(
    path=my_path,
    type=AssetTypes.MLTABLE,
    description="<description>",
    name="<name>",
    version='<version>'
)

ml_client.data.create_or_update(my_data)
Note
The path points to the folder containing the MLTable artifact.
In this example, you'll author an MLTable file locally, create an asset, and then use the data asset in an Azure ML job.
To complete this end-to-end example, first download a sample of the Green Taxi data from Azure Open Datasets into a local data folder:
mkdir data
cd data
wget https://azureopendatastorage.blob.core.windows.net/nyctlc/green/puYear%3D2013/puMonth%3D8/part-00172-tid-4753095944193949832-fee7e113-666d-4114-9fcb-bcd3046479f3-2742-1.c000.snappy.parquet
Create an `MLTable` file in the data folder:
cd data
touch MLTable
Save the following contents to the `MLTable` file:
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet
  - take_random_sample:
      probability: 0.5
      seed: 154
import mltable
import os
# change the working directory to the data directory
os.chdir("./data")
# define the path to the parquet files using a glob pattern
path = {
    'pattern': './*.parquet'
}
# load from parquet files
tbl = mltable.from_parquet_files(paths=[path])
# create a new table with a random sample of 50% of the rows
new_tbl = tbl.take_random_sample(0.5, 154)
# show the first few records
new_tbl.show()
# save MLTable file in the data directory
new_tbl.save(".")
Next, create a Data asset:
Execute the following command:
cd .. # come up one level from the data directory
az ml data create --name green-sample --version 1 --type mltable --path ./data
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
my_path = './data'
my_data = Data(
    path=my_path,
    type=AssetTypes.MLTABLE,
    name="green-sample",
    version='1'
)
ml_client.data.create_or_update(my_data)
Note
Your local data folder - containing the parquet file and MLTable - will automatically upload to cloud storage (default Azure ML datastore) on asset create.
Create a Python script called `read-mltable.py` in an `src` folder, containing:
# ./src/read-mltable.py
import argparse
import mltable
# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input', help='mltable artifact to read')
args = parser.parse_args()
# load mltable
tbl = mltable.load(args.input)
# show table
print(tbl.show())
To keep things simple, we only show how to load the table and print the first few records.
Your job will need a Conda file that includes the Python package dependencies. Save that Conda file as `conda_dependencies.yml`:
# ./conda_dependencies.yml
dependencies:
  - python=3.10
  - pip=21.2.4
  - pip:
      - mltable
      - azureml-dataprep[pandas]
Next, submit the job:
Create the following job YAML file:
# mltable-job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python read-mltable.py --input ${{inputs.my_mltable}}
inputs:
  my_mltable:
    type: mltable
    path: azureml:green-sample:1
compute: cpu-cluster
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  conda_file: conda_dependencies.yml
In the CLI, create the job:
az ml job create -f mltable-job.yml
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential
# Create a client
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
# get the data asset
data_asset = ml_client.data.get(name="green-sample", version="1")
job = command(
    command="python read-mltable.py --input ${{inputs.my_mltable}}",
    inputs={
        "my_mltable": Input(type="mltable", path=data_asset.id)
    },
    compute="cpu-cluster",
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./conda_dependencies.yml"
    ),
    code="./src"
)
ml_client.jobs.create_or_update(job)
This example assumes you have a CSV file stored in the following Azure Data Lake location:
abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>/<file-name>.csv
Note
You must update the `<>` placeholders with your Azure Data Lake filesystem and account name, along with the path on Azure Data Lake to your CSV file.
Create an `MLTable` file in the `abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>/` location:
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
- file: ./<file-name>.csv
transformations:
  - read_delimited:
      delimiter: ','
      empty_as_string: false
      encoding: utf8
      header: all_files_same_headers
      include_path_column: false
      infer_column_types: true
      support_multi_line: false
If you don't already use Option 1: Directly author `MLTable` in cloud storage with VSCode, you can upload your `MLTable` file with `azcopy`:
SOURCE=<local_path-to-mltable-file>
DEST=https://<account_name>.blob.core.windows.net/<filesystem>/<folder>
azcopy cp $SOURCE $DEST
import mltable
from mltable import MLTableHeaders, MLTableFileEncoding
from azure.storage.blob import BlobClient
from azure.identity import DefaultAzureCredential

# update the file name
my_path = {
    'file': './<file_name>.csv'
}

tbl = mltable.from_delimited_files(
    paths=[my_path],
    header=MLTableHeaders.all_files_same_headers,
    delimiter=',',
    encoding=MLTableFileEncoding.utf8,
    empty_as_string=False,
    include_path_column=False,
    infer_column_types=True,
    support_multi_line=False
)

# save the table to the local file system
local_folder = "local"
tbl.save(local_folder)

# upload the MLTable file to your storage account so that you have an artifact
storage_account_url = "https://<account_name>.blob.core.windows.net"
container_name = "<filesystem>"
data_folder_on_storage = '<folder>'

# get a blob client using default credential
blob_client = BlobClient(
    credential=DefaultAzureCredential(),
    account_url=storage_account_url,
    container_name=container_name,
    blob_name=f'{data_folder_on_storage}/MLTable'
)

# upload to cloud storage
with open(f'{local_folder}/MLTable', "rb") as mltable_file:
    blob_client.upload_blob(mltable_file)
Consumers of the data can load the MLTable artifact into Pandas or Spark with:
import mltable
# the URI points to the folder containing the MLTable file.
uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>"
tbl = mltable.load(uri)
df = tbl.to_pandas_dataframe()
[!INCLUDE preview disclaimer]
You must ensure that the `mltable` package is installed on the Spark cluster. For more information, read:
- Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
- Quickstart: Apache Spark jobs in Azure Machine Learning (preview).
# the URI points to the folder containing the MLTable file.
uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>"
df = spark.read.mltable(uri)
df.show()
This example assumes that you have a folder of parquet files stored in the following Azure Data Lake location:
abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>/
You'd like to take a 20% random sample of rows from all the parquet files in the folder.
Create an `MLTable` file in the `abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>/` location:
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
- pattern: ./*.parquet
transformations:
  - read_parquet:
      include_path_column: false
  - take_random_sample:
      probability: 0.20
      seed: 132
If you don't already use Option 1: Directly author `MLTable` in cloud storage with VSCode, you can upload your `MLTable` file with `azcopy`:
SOURCE=<local_path-to-mltable-file>
DEST=https://<account_name>.blob.core.windows.net/<filesystem>/<folder>
azcopy cp $SOURCE $DEST
import mltable
from azure.storage.blob import BlobClient
from azure.identity import DefaultAzureCredential

# update the glob pattern for your parquet files
my_path = {
    'pattern': './*.parquet'
}

tbl = mltable.from_parquet_files(paths=[my_path])
tbl = tbl.take_random_sample(probability=0.20, seed=132)

# save the table to the local file system
local_folder = "local"
tbl.save(local_folder)

# upload the MLTable file to your storage account
storage_account_url = "https://<account_name>.blob.core.windows.net"
container_name = "<filesystem>"
data_folder_on_storage = '<folder>'

# get a blob client using default credential
blob_client = BlobClient(
    credential=DefaultAzureCredential(),
    account_url=storage_account_url,
    container_name=container_name,
    blob_name=f'{data_folder_on_storage}/MLTable'
)

# upload to cloud storage
with open(f'{local_folder}/MLTable', "rb") as mltable_file:
    blob_client.upload_blob(mltable_file)
Consumers of the data can load the MLTable artifact into Pandas or Spark with:
import mltable
# the URI points to the folder containing the MLTable file.
uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>"
tbl = mltable.load(uri)
df = tbl.to_pandas_dataframe()
[!INCLUDE preview disclaimer]
You'll need to ensure that the `mltable` package is installed on the Spark cluster. For more information, read:
- Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
- Quickstart: Apache Spark jobs in Azure Machine Learning (preview).
# the URI points to the folder containing the MLTable file.
uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>"
df = spark.read.mltable(uri)
df.show()
This example assumes that you have data in Delta format stored in the following Azure Data Lake location:
abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>/
The folder has this structure:
├── 📁 <folder>
│ ├── 📁 _change_data
│ ├── 📁 _delta_index
│ ├── 📁 _delta_log
│ ├── 📄 part-0000-XXX.parquet
│ ├── 📄 part-0001-XXX.parquet
Also, you'd like to read the data as of a specific timestamp: `2022-08-26T00:00:00Z`.
Create an `MLTable` file in the `abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>/` location:
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
- folder: ./
transformations:
  - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'
If you don't use Option 1: Directly author `MLTable` in cloud storage with VSCode, you can upload your `MLTable` file with `azcopy`:
SOURCE=<local_path-to-mltable-file>
DEST=https://<account_name>.blob.core.windows.net/<filesystem>/<folder>
azcopy cp $SOURCE $DEST
import mltable
from azure.storage.blob import BlobClient
from azure.identity import DefaultAzureCredential

# update the folder path
my_path = {
    'folder': './'
}

tbl = mltable.from_delta_lake(
    paths=[my_path],
    timestamp_as_of='2022-08-26T00:00:00Z'
)

# save the table to the local file system
local_folder = "local"
tbl.save(local_folder)

# upload the MLTable file to your storage account
storage_account_url = "https://<account_name>.blob.core.windows.net"
container_name = "<filesystem>"
data_folder_on_storage = '<folder>'

# get a blob client using default credential
blob_client = BlobClient(
    credential=DefaultAzureCredential(),
    account_url=storage_account_url,
    container_name=container_name,
    blob_name=f'{data_folder_on_storage}/MLTable'
)

# upload to cloud storage
with open(f'{local_folder}/MLTable', "rb") as mltable_file:
    blob_client.upload_blob(mltable_file)
Consumers of the data can load the MLTable artifact into Pandas or Spark with:
import mltable
# the URI points to the folder containing the MLTable file.
uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>"
tbl = mltable.load(uri)
df = tbl.to_pandas_dataframe()
[!INCLUDE preview disclaimer]
You must ensure that the `mltable` package is installed on the Spark cluster. For more information, read:
- Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
- Quickstart: Apache Spark jobs in Azure Machine Learning (preview).
# the URI points to the folder containing the MLTable file.
uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/<folder>"
df = spark.read.mltable(uri)
df.show()
To access your data during interactive development (for example, in a notebook) without creating an MLTable artifact, use the `mltable` Python SDK. This code sample shows the general format to read data into Pandas with the `mltable` Python SDK:
import mltable
# define a path or folder or pattern
path = {
    'file': '<supported_path>'
    # alternatives:
    # 'folder': '<supported_path>'
    # 'pattern': '<supported_path>'
}
# create an mltable from paths
tbl = mltable.from_delimited_files(paths=[path])
# alternatives
# tbl = mltable.from_parquet_files(paths=[path])
# tbl = mltable.from_json_lines_files(paths=[path])
# tbl = mltable.from_delta_lake(paths=[path])
# materialize to pandas
df = tbl.to_pandas_dataframe()
df.head()
The `mltable` library supports tabular data reads from different path types:
| Location | Examples |
|---|---|
| A path on your local computer | `./home/username/data/my_data` |
| A path on a public http(s) server | `https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv` |
| A path on Azure Storage | `wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>`<br>`abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>` |
| A long-form Azure ML datastore | `azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path>` |
Note
`mltable` handles user credential passthrough for paths on Azure Storage and Azure ML datastores. If you don't have permission to the data on the underlying storage, you can't access the data.
`mltable` supports reading from:

- file(s) - for example: `abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv`
- folder(s) - for example: `abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/`
- glob pattern(s) - for example: `abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.csv`
- a combination of files, folders, and globbing patterns
This flexibility allows `mltable` to materialize data into a single data frame from a combination of local and cloud storage, and from combinations of files, folders, and glob patterns. For example:
import mltable

path1 = {
    'file': 'abfss://filesystem@account1.dfs.core.windows.net/my-csv.csv'
}

path2 = {
    'folder': './home/username/data/my_data'
}

path3 = {
    'pattern': 'abfss://filesystem@account2.dfs.core.windows.net/folder/*.csv'
}

tbl = mltable.from_delimited_files(paths=[path1, path2, path3])
`mltable` supports the following file formats:

- Delimited text (for example, CSV files): `mltable.from_delimited_files(paths=[path])`
- Parquet: `mltable.from_parquet_files(paths=[path])`
- Delta: `mltable.from_delta_lake(paths=[path])`
- JSON Lines: `mltable.from_json_lines_files(paths=[path])`
Update the placeholders (`<>`) in the code snippet with your specific information.
import mltable
path = {
    'file': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your specific information.
import mltable
path = {
    'file': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/<file_name>.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your specific information.
import mltable
path = {
    'file': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/<file>.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Tip
To avoid the need to remember the datastore URI format, copy-and-paste the datastore URI from the Studio UI with these steps:
- Select Data from the left-hand menu, followed by the Datastores tab.
- Select your datastore name, and then Browse.
- Find the file/folder you want to read into Pandas, and select the ellipsis (...) next to it. Select Copy URI from the menu, and then select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::
import mltable
path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
This example code shows how `mltable` can use glob patterns - such as wildcards - to ensure that only the parquet files are read.
Update the placeholders (`<>`) in the code snippet with your specific information.
import mltable
path = {
    'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your specific information.
import mltable
path = {
    'pattern': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your specific information.
import mltable
path = {
    'pattern': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Tip
To avoid the need to remember the datastore URI format, copy-and-paste the datastore URI from the Studio UI with these steps:
- Select Data from the left-hand menu, then the Datastores tab.
- Select your datastore name, and then Browse.
- Find the file/folder you want to read into Pandas, and select the ellipsis (...) next to it. Select Copy URI from the menu, and then select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::
Update the placeholders (`<>`) in the code snippet with your specific information.
Important
Access at the folder level is required to glob the pattern on a public HTTP server.
import mltable
path = {
    'pattern': '<https_address>/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
In this section, you'll learn how to read your Azure ML data assets into Pandas.
If you previously created a Table asset in Azure ML (an `mltable`, or a V1 `TabularDataset`), you can load that asset into Pandas with:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()
If you registered a File asset that you want to read into a Pandas data frame - for example, a CSV file - use this code:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
path = {
    'file': data_asset.path
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
If you registered a Folder asset (`uri_folder` or a V1 `FileDataset`) that you want to read into a Pandas data frame - for example, a folder containing CSV files - use this code:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
path = {
    'folder': data_asset.path
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()