
Feature request "DataFrameClient" transfer from version 1.7 to 2 - Pandas DataFrame not possible to ingest to InfluxDB 2 #79

Closed
Sutyke opened this issue Apr 2, 2020 · 14 comments · Fixed by #88


Sutyke commented Apr 2, 2020

Dear Great Maintainer,

I would like to request transferring a feature that is currently in InfluxDB 1.7 to 2.0.

There are 2 proposals:

  1. Transfer DataFrameClient from 1.7 to 2

In version 1.7, the DataFrameClient allows inserting a DataFrame into InfluxDB.

import pandas as pd
from influxdb import DataFrameClient

dbConnDF = DataFrameClient('localhost', 8086, 'd', 'password', 'securities')
# sort by the new 'date' index (sort_values() without a 'by' argument would fail here)
df = pd.read_parquet('/home/d/fi/01_Data/01_raw_data/qd.parquet').set_index('date').sort_index()
%time dbConnDF.write_points(df, 'securities', tag_columns=['symbol'], protocol="json", batch_size=10000)
%time d = dbConnDF.query("select * from securities")
d['securities']

  2. Create a helper function to convert a DataFrame into line protocol readable by InfluxDB 2

@Anaisdg created the function below:

https://github.com/Anaisdg/Influx_Pandas

import time

def lp(df, measurement, tag_key, field, value, datetime):
    # build one line-protocol string per DataFrame row:
    # <measurement>,<tag_key>=<tag value> <field name>=<field value> <ns timestamp>
    lines = [str(df[measurement][d]) + ","
             + str(tag_key) + "=" + str(df[tag_key][d])
             + " " + str(df[field][d]) + "=" + str(df[value][d])
             + " " + str(int(time.mktime(df[datetime][d].timetuple()))) + "000000000"
             for d in range(len(df))]
    return lines
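For illustration, a minimal sketch of calling this helper on a tiny DataFrame (all column names here are hypothetical):

import pandas as pd
from datetime import datetime

# hypothetical input: one column per line-protocol component
df = pd.DataFrame({'m': ['securities'],
                   'symbol': ['AAPL'],
                   'f': ['close'],
                   'v': [123.45],
                   'date': [datetime(2020, 4, 2)]})

# using the lp() helper defined above
lp(df, 'm', 'symbol', 'f', 'v', 'date')
# -> ['securities,symbol=AAPL close=123.45 <ns timestamp>']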

Thank you in advance for your help; it will save a lot of time for your customers.

Sutyke

@bednar bednar added the enhancement New feature or request label Apr 2, 2020
@bednar bednar added this to the 1.7.0 milestone Apr 2, 2020

bednar commented Apr 2, 2020

@Sutyke thanks for the feedback. I hope we will be able to implement it pretty soon.

@lifeisawavesorideit

This would be extremely helpful!


bednar commented May 2, 2020

@lifeisawavesorideit you can follow the progress at #88


bednar commented May 4, 2020

Hi @lifeisawavesorideit,

We just merged support for ingesting a Pandas DataFrame into the master branch.

You could use it with something like:

from datetime import timedelta

import pandas as pd

now = pd.Timestamp.now('UTC')
data_frame = pd.DataFrame(data=[["coyote_creek", 1.0], ["coyote_creek", 2.0]],
                          index=[now, now + timedelta(hours=1)],
                          columns=["location", "water_level"])

# the DatetimeIndex supplies the per-row timestamps
write_client.write(bucket.name, record=data_frame, data_frame_measurement_name='h2o_feet',
                   data_frame_tag_columns=['location'])

If you would like to test it, install the client via:

pip install git+https://github.com/influxdata/influxdb-client-python.git@master

Feedback is also welcome.
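For completeness, a hedged sketch of reading the data back into a DataFrame with the 2.x query API (the client variable, bucket name, and Flux query here are placeholders, not part of the merged change):

query_api = client.query_api()

# returns the result of the Flux query as a Pandas DataFrame
df = query_api.query_data_frame(
    'from(bucket: "my-bucket") |> range(start: -1h) '
    '|> filter(fn: (r) => r._measurement == "h2o_feet")')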

Regards


Anaisdg commented May 6, 2020

Hello @bednar thank you! Does this write method support batching as usual? Thank you.


bednar commented May 7, 2020

Hi @Anaisdg, yeah, it supports batching.


benb92 commented May 7, 2020

Hello,

Thanks for this function, it seems to work. This is the code I ran below to do the import. Do you need to specify the batch size anywhere? How would I amend the code below to do that? Or do you simply set data_frame_object to be a big DataFrame of 10^6 rows etc. and it will auto-batch it?

Re. multiprocessing - I saw that InfluxDB 2.0 would support it. This is fantastic - how do I do this with Python, or would this function automatically support that?

from influxdb_client import InfluxDBClient

bucket = "bucket_name"
org = "org_name"
token = "tokenID...."
client = InfluxDBClient(url="http://localhost:9999", token=token, org=org)
write_api = client.write_api()

write_api.write(bucket=bucket, record=data_frame_object, data_frame_measurement_name='measurement_name',
                data_frame_tag_columns=tag_columns)


bednar commented May 7, 2020

@Anaisdg, @bburden to clarify how the batching works:

the client reads the whole DataFrame and writes the data in batches into InfluxDB. If we have a very big DataFrame, performance will be limited by memory size...

If we want to store a potentially unbounded DataFrame, we could improve our implementation with streaming tweaks.

What is the expected size (rows, columns) of the DataFrame?
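For illustration, a minimal sketch of setting the batch size explicitly when creating the write API (reusing the placeholder names from the snippet above; WriteOptions defaults to batching writes):

from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import WriteOptions

client = InfluxDBClient(url="http://localhost:9999", token=token, org=org)
# batch_size controls how many rows are sent per request to InfluxDB
write_api = client.write_api(write_options=WriteOptions(batch_size=10_000))

write_api.write(bucket=bucket, record=data_frame_object,
                data_frame_measurement_name='measurement_name',
                data_frame_tag_columns=tag_columns)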


benb92 commented May 7, 2020

In this example I am doing a small historical import: 1.7 million rows, 8 columns, nothing major. Broadly, I have various flows where milliseconds matter. For (1) historical imports it is not so important, but it is still good to establish best practice. Is that best practice below? Generally the file sizes will be bigger for historical imports. Can I utilise multiprocessing somehow in the code below?

So just to double check - I would amend the above code with something like this:

write_api.write(bucket=bucket, record=data_frame_object, data_frame_measurement_name='measurement_name',
                data_frame_tag_columns=tag_columns,
                protocol="line", batch_size=10000)

Live, I will be importing minutely data, aspiring to 25 ms from grabbing the data from websockets to importing it into the InfluxDB database. Any tips on how to crank maximum performance out of InfluxDB for (1) big historical imports and (2) smaller live imports are appreciated!


bednar commented May 7, 2020

(1)
The client is able to run in a multiprocessing environment. The best approach for importing a large amount of data is to create a singleton instance of the write_api:

from influxdb_client.client.write_api import WriteOptions, WriteType

self.write_api = self.client.write_api(
    write_options=WriteOptions(write_type=WriteType.batching, batch_size=50_000, flush_interval=10_000))

and push data from separate processes:

self.write_api.write(bucket="my-bucket", record=next_record)

The most critical part is parsing the data into line protocol. Our implementation of writing a DataFrame is generic... so the best performance is achieved by directly creating line protocol from your DataFrame rows.

But everything depends on your data, and there is not one general approach.
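For illustration, a minimal sketch of building line protocol directly from DataFrame rows, assuming the frame has a DatetimeIndex and hypothetical 'symbol' and 'close' columns:

# row.Index is the pd.Timestamp index value; .value is nanoseconds since epoch
lines = [f"securities,symbol={row.symbol} close={row.close} {row.Index.value}"
         for row in df.itertuples()]

write_api.write(bucket="my-bucket", record=lines)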

(2)
It depends, but something like the synchronous approach will perform well:

from influxdb_client.client.write_api import SYNCHRONOUS

write_api = self.client.write_api(write_options=SYNCHRONOUS)
records = "mem,host=host1 used_percent=23.43234543\n" \
          "mem,host=host1 available_percent=15.856523"
write_api.write(bucket, org, records)

See also:
https://github.com/influxdata/influxdb-client-python/blob/master/examples/import_data_set_multiprocessing.py
https://github.com/influxdata/influxdb-client-python/blob/master/examples/import_data_set.py


cjelsa commented May 13, 2020

Would it be possible, next to data_frame_tag_columns=tag_columns, to also have a 'tags=' argument? This way a tag could be added that doesn't appear in the DF.

For example, I have a DF with timestamp, open, high, low, close (etc.) data. I would like to be able to add tags such as ticker, exchange, etc., which don't appear in the DF.
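Until such an argument exists, a hedged sketch of a workaround: add constant-value columns to the DataFrame and list them as tag columns (the tag names and values here are hypothetical):

# inject constant tags as columns so they can be referenced as tag columns
df['ticker'] = 'AAPL'
df['exchange'] = 'NASDAQ'

write_api.write(bucket, record=df, data_frame_measurement_name='ohlc',
                data_frame_tag_columns=['ticker', 'exchange'])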


bednar commented May 13, 2020

@cjelsa Yes, we could.

Could you please create a new issue for that?

Thanks for using our client.


galgal770 commented Mar 9, 2021

@bednar, following your reply to @bburden, I'm also trying to write a large amount of data directly from a DataFrame. The multiprocessing script you supplied (import_data_set_multiprocessing.py) processes each line of a CSV file and converts it to line protocol. As I understand it, two amendments are needed here:

"and push data from separate processes"

  1. How do I adjust the current process to read rows of a DataFrame (and not lines from a CSV file)? Where's the relevant endpoint? Any example?

"The most critical part is parsing the data into line protocol. Our implementation of writing a DataFrame is generic... so the best performance is achieved by directly creating line protocol from your DataFrame rows."

  2. Any example? How do I do that? Where is the relevant endpoint to do the conversion?

Is there any corresponding end-to-end documentation to support DataFrame ingestion in a multiprocessing environment?


bednar commented Mar 10, 2021

@galgal770

  1. How do I adjust the current process to read rows of a DataFrame (and not lines from a CSV file)? Where's the relevant endpoint? Any example?

You could use DataFrame.itertuples:

import rx
from rx import operators as ops
import pandas as pd

df = pd.read_csv("vix-daily.csv")

batches = rx \
    .from_iterable(df.itertuples(index=False)) \
    .pipe(ops.buffer_with_count(500))

batches.subscribe(on_next=lambda batch: print(f"my batch: {batch}"),
                  on_error=lambda ex: print(f'Unexpected error: {ex}'),
                  on_completed=lambda: print('Import finished!'))

  2. Any example? How do I do that? Where is the relevant endpoint to do the conversion?

DataFrame.itertuples returns namedtuples. Your implementation just concatenates these tuples into line protocol.
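For illustration, a minimal sketch of that concatenation step combined with the batching above (the 'Date' and 'Close' column names are hypothetical, write_api is assumed to be configured as earlier in the thread, and rx, ops, pd, and df come from the snippet above):

# hypothetical: map each namedtuple row to a line-protocol string
def to_line_protocol(row):
    timestamp = int(pd.Timestamp(row.Date).value)  # nanoseconds since epoch
    return f"vix close={row.Close} {timestamp}"

batches = rx \
    .from_iterable(df.itertuples(index=False)) \
    .pipe(ops.map(to_line_protocol),
          ops.buffer_with_count(500))

batches.subscribe(on_next=lambda batch: write_api.write("my-bucket", record=batch),
                  on_error=lambda ex: print(f'Unexpected error: {ex}'),
                  on_completed=lambda: print('Import finished!'))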

"Is there any corresponding end-to-end documentation to support DataFrame ingestion in a multiprocessing environment?"

No, but your code will be pretty much the same as import_data_set_multiprocessing.py, except for the two amendments above.

From my POV the best approach will be: read the DataFrame rows with itertuples, convert each row to line protocol, and push the batches from separate processes.
