Loading Flat Files into Pandas DataFrames

Ingesting data is an important step in any data related project. Before you can analyze data, it needs to be accessed. I will explore data ingestion with a popular Python library named Pandas. The Pandas library is built upon Numpy which is a Python library used to work with arrays whereas Pandas is used to analyze and manipulate tabular data. This will be the first in a series of posts focusing on data ingestion in the data engineering process.

Pandas

Pandas was developed in 2008 mostly for use in financial quantitative analysis. Today, it is used in various disciplines. Pandas is often used to load and manipulate data. Additionally, it integrates well with many Python analysis and visualization libraries.

DataFrames

A core part of Pandas is the DataFrame. A DataFrame is a two-dimensional data structure with columns and rows. Columns are usually labeled with variable names. Rows have labels known as an index in Pandas. The default index for rows is the row number although you can also specify a column as the index. In terms of loading data into a DataFrame, Pandas is quite versatile in handling many data formats. We will start by looking at flat files.

Flat Files

Flat files are pretty simple in structure which makes them a popular choice for storing and sharing data. Flat files can be generated from spreadsheets, databases, software applications, online data sources, etc.

The data in flat files are stored as plain text with each line in the file representing one row. Column values are separated by a specified character called a delimiter. Typically, the delimiter is a comma resulting in a file with Comma Separated Values or .CSV file. However, other delimiters can be used for example a tab delimiter.

Below is an example of a .CSV file:

Below is an example of the same data with a tab delimiter:

Loading CSVs into a DataFrame

To load a CSV file into a Pandas DataFrame:

import pandas as pd
df = pd.read_csv("myfile.csv")
print(df.head(4))

Code explanation:

import pandas as pd
This python code imports the Pandas library then assigns it an alias “pd” which is the standard alias given to the Pandas library. This enables you to use pandas functions by simply typing pd. function_name rather than pandas.

df = pd.read_csv(“myfile.csv”)
pd.read_csv uses the Pandas read_csv function to import the CSV file from the file path specified in the function argument. In this examplee, the CSV file is located in the same directory as the python file so the file path is simply “myfile.csv” Finally the imported CSV file is assigned to a variable named df.

print(df.head(4))
The head function returns the first n rows in an object. Here, we return the 1st 4 rows from df DataFrame then display it using the print function.

Loading Other Flat Files into a DataFrame

Now, what if you have a flat file with a different delimiter such as tab? Instead of having a different function for every delimiter, Pandas allows you to use the sep key word argument with read_csv to import any flat file.

To load a tab delimited file into a Pandas DataFrame:

import pandas as pd
df = pd.read_csv("myfile.txt", sep="\t")
print(df.head(4))

Code explanation:

import pandas as pd
Again, we import pandas as pd.

df = pd.read_csv(“myfile.txt”, sep=”\t”)
Then just like before, we pass the file path string to read_csv. Please note that our tab delimited text file has a .txt extension hence “myfile.txt”. This time, we add the argument sep=”\t”. The “backslash t” represents a tab.

print(df.head(4))
Once again, we use the head function within the print function to display the first 4 rows of the df DataFrame as shown below.

Now, you should have a basic understanding of how to load flat files into a Pandas DataFrame. In a future article in this series, we will learn how to import a subset of columns from a flat file as well as how to import a flat file in chunks.