Technical

Working with large CSV spreadsheets and JSON datasets can be a pain, particularly when they are too large to fit into memory, as ours are. In cases like this, a combination of command-line tools and Python (>= v3.7) makes for an efficient way to explore and analyze our data. In this section, we will look at how to leverage tools like Pandas to explore the datasets in our Datasets list.
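
Because the full dataset may not fit into memory at once, one common approach is to stream the CSV in chunks using the chunksize parameter of read_csv. The following is a minimal sketch, assuming the main_sheet.csv path used in the examples further down this page; the chunk size of 10,000 rows is an arbitrary choice.

Example
import pandas as pd

# Stream the file in chunks of 10,000 rows instead of loading it all at once.
# The path matches the examples below; adjust it to your local copy.
row_count = 0
for chunk in pd.read_csv('dataset-uta4-sus/dataset/main_sheet.csv', chunksize=10000):
    # Each chunk is a regular DataFrame, so any Pandas operation works here.
    row_count += len(chunk)
print('Total rows:', row_count)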

Index

Reading and Writing
Contents
Reader
Writer
Important Links

Reading and Writing

The basic process of loading data from a CSV file into a Pandas DataFrame is achieved with the read_csv function. It reads data in this format, which in our case was generated by the Google Spreadsheet, while the companion to_csv function writes the data back out in our preferred format.

Contents

The first step is to import the project data. Often, we will work with data in several formats and run into issues at the very start of our workflow. In this section, we will see how we can use the read_csv() function from the Pandas library to deal with common issues when importing data (see the sketch below) and why loading CSV files with Pandas has become standard practice across our project.
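
As a minimal sketch of those import issues, read_csv exposes parameters for delimiters, encodings, and missing-value markers. The parameter values below are illustrative assumptions, not confirmed project settings.

Example
import pandas as pd

# Illustrative settings for common import issues; the real sheet may not
# need all of them.
df = pd.read_csv(
    'dataset-uta4-sus/dataset/main_sheet.csv',
    sep=',',                # explicit delimiter (Google Sheets exports use commas)
    encoding='utf-8',       # avoid decoding errors on non-ASCII characters
    na_values=['', 'N/A'],  # treat these strings as missing values
)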

Reader

We are now ready to import the CSV file into Python using read_csv() from the Pandas library. The following example shows one possible way of doing so for our dataset.

Example
import pandas as pd

# Load the CSV exported from the Google Spreadsheet and preview the
# first 5 rows to confirm it loaded as expected.
df = pd.read_csv('dataset-uta4-sus/dataset/main_sheet.csv')
print(df.head(5))
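
After loading, it usually helps to inspect the DataFrame's structure before any analysis. This short sketch continues from the df loaded above; all three calls are standard Pandas.

Example
# Continues from the df loaded above.
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # inferred type of each column
df.info()         # column names, non-null counts, and memory usage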

Writer

Now, if we want to export a Pandas DataFrame to a CSV file, we can use the to_csv function. When storing a DataFrame with to_csv, we usually will not need to store the leading index of each row (demonstrated below). We can avoid that by passing a False boolean value to the index parameter, which makes downstream processing easier.

Example
import pandas as pd

# Load the dataset as in the Reader example above, then write a copy.
# index=False keeps the row index out of the output file.
df = pd.read_csv('dataset-uta4-sus/dataset/main_sheet.csv')
df.to_csv('dataset-uta4-sus/dataset/main_sheet_copy.csv', index=False)
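
To see what index=False changes, here is a minimal round-trip sketch using a throwaway DataFrame rather than the project files. Writing with the default settings stores the index as an extra column, which comes back as 'Unnamed: 0' when the file is read again.

Example
import pandas as pd

# Throwaway DataFrame for illustration only.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.to_csv('with_index.csv')                  # default: index stored as a column
df.to_csv('without_index.csv', index=False)  # index omitted

print(pd.read_csv('with_index.csv').columns.tolist())     # ['Unnamed: 0', 'a', 'b']
print(pd.read_csv('without_index.csv').columns.tolist())  # ['a', 'b']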

Important Links
