Technical
Working with large CSV spreadsheets and JSON datasets can be a pain, particularly when they are too large to fit into memory, as is the case with our project. In situations like this, a combination of command-line tools and Python (>= v3.7) is an efficient way to explore and analyze our data. In this section, we will look at how to leverage tools like Pandas to explore our datasets.
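For files that do not fit into memory, one possible approach is to read the CSV in chunks. The following is a minimal sketch using the chunksize parameter of read_csv; the chunk size of 10,000 rows is an illustrative assumption, not a value fixed by the project.

import pandas as pd

# Sketch: stream the CSV in chunks of 10,000 rows instead of loading
# the whole file at once. The chunk size here is an assumption chosen
# for illustration; tune it to the available memory.
total_rows = 0
for chunk in pd.read_csv('dataset-uta4-sus/dataset/main_sheet.csv', chunksize=10000):
    total_rows += len(chunk)  # process each chunk, e.g. count its rows
print('Rows processed:', total_rows)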
Loading data from a CSV file into a Pandas DataFrame
is done with the read_csv
function. It reads data in this format, which, in our case, was generated from a Google Spreadsheet, so that we can later write it back out in a preferred format.
The first step is to import the project data. We often work with data in several formats and run into issues at the very start of our workflow. Below, we will see how to use the read_csv()
function from the Pandas library to deal with common issues when importing data, and why loading CSV files with Pandas has become standard practice across our project.
We are now ready to import the CSV file into Python using read_csv()
from the Pandas library. The following example shows a typical use case.
import pandas as pd

# Load the project's main sheet into a DataFrame
df = pd.read_csv('dataset-uta4-sus/dataset/main_sheet.csv')

# Preview the first five rows
print(df.head(5))
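read_csv() also exposes parameters for the most common import issues. The snippet below is a hedged sketch: the separator, encoding, and missing-value marker shown are illustrative assumptions about the file, not documented properties of our dataset.

import pandas as pd

# Illustrative assumptions: a semicolon-separated file encoded as
# UTF-8, with 'N/A' used as a missing-value marker.
df = pd.read_csv('dataset-uta4-sus/dataset/main_sheet.csv',
                 sep=';',            # column separator, if not a comma
                 encoding='utf-8',   # file encoding
                 na_values=['N/A'])  # extra markers to treat as NaN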
Now, if we want to export a Pandas DataFrame
to a CSV file, we can use the to_csv
function. When storing a DataFrame
object in a CSV file with the to_csv
method, we usually do not need to store the row indices (see next section) of the DataFrame
object. We can avoid that by passing a False
boolean value to the index
parameter, which keeps the output clean.
import pandas as pd
...

# Write the DataFrame back to disk; index=False omits the row indices
df.to_csv('dataset-uta4-sus/dataset/main_sheet_copy.csv', index=False)
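Since the project also deals with JSON datasets, the same DataFrame can be exported to JSON with to_json. The sketch below is illustrative only: the output path and the records orientation (a list of row objects) are assumptions chosen for readability, not project requirements.

import pandas as pd
...

# Hedged sketch: export the DataFrame as JSON. The path and the
# orient='records' layout are assumptions for illustration.
df.to_json('dataset-uta4-sus/dataset/main_sheet.json', orient='records')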