Pandas is one of the most widely used Python libraries in data science and analytics. It is a large library with immense capabilities. In this introductory blog post I will cover the basics of its famous data structure, the DataFrame, along with a handful of functions that are key to your success with Pandas.

import pandas as pd

Data Structures

First of all, there are two types of data structures in Pandas: the Series and the DataFrame.

Series

In Pandas, a Series is a one-dimensional data structure with an index attached to it, so every value has its own index label.
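As a quick sketch (the values here are made up purely for illustration), creating a Series looks like this:

# A Series is one column of values plus an index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # look up a value by its index label -> 20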

DataFrame

A DataFrame in Pandas is a two-dimensional data structure built up of rows and columns. Similarly to a Series, each row of a DataFrame has an index label attached to it. Additionally, the columns have headers.

Below is an example of what a DataFrame looks like. Notice the left-most column counting up from 0. This is the index of the DataFrame, which Pandas generates automatically (and keeps unique) when you do not supply one yourself.

Sample DataFrame: an example of a DataFrame with its default integer index.
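If you want to build a small DataFrame like this yourself, here is a minimal sketch (the column names match the later examples, but the values are made up for illustration):

df = pd.DataFrame({
    "column_1": [1.0, 7.5, 3.2],  # a numeric column
    "column_3": [1, 0, 1],        # a flag column, used again in the map example below
})
print(df)  # the left-most printed column is the auto-generated index: 0, 1, 2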

read_csv

The first function I am showing you is one most people already know about: the read_csv function. This function makes it very easy to import CSV data into a Pandas DataFrame, but what many people don’t realize is that when you are working with large CSV files, loading the whole file can be slow and memory-hungry.

By adding the nrows argument, you can specify how many rows of the CSV file to import. You can then look at a portion of the file, decide which columns you actually need, and, if you know the data types, save your computer some time by specifying those as well. These two parameters are called usecols and dtype respectively.

Here is an example of how all these parameters are used:

# Load only 10 rows of the CSV
df = pd.read_csv("large_data.csv", nrows=10)

###
# Figure out what columns you need
###

df = pd.read_csv(
    "large_data.csv",
    usecols=["column_1", "column_3"],
    dtype={"column_1": float, "column_3": str}
)

head

The head function returns the first rows of a DataFrame: the number you specify, or 5 rows by default. This comes in handy when you want to get an idea of what type of data you are working with and what a sample of the values looks like.

df.head(10)

map

The function name might be short, but it is very intuitively named. This function maps existing values to new values via key-value pairs.

Let’s say, for example, that you wanted to set all 1 values to “Yes” and all 0 values to “No”; this function will help you with that. First, we define a Python dictionary with the key-value pairs, and then we use the map function, like this:

yes_no_map = {1: "Yes", 0: "No"}
df['column_3_mapped'] = df['column_3'].map(yes_no_map)

It is worth noting that the keys in the map must match the column’s data type: if your column holds strings, the key should be “1” rather than 1, and vice versa. Any value that does not appear in the map is replaced with NaN.
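As a sketch, if column_3 had been read in as strings (as in the dtype example above), the map would need string keys:

# Keys must match the column's dtype: string values require string keys
yes_no_map_str = {"1": "Yes", "0": "No"}
df['column_3_mapped'] = df['column_3'].map(yes_no_map_str)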

apply

apply allows you to take one or more columns as inputs and send their values to a function (for example a user-defined function or a lambda function).

Let’s, for example, create a user-defined function that takes two inputs, x and y. With these inputs, the function checks whether x is less than 5 and y is equal to “Yes”.

def my_function(x, y):
    if x < 5 and y == "Yes":
        return 1
    else:
        return 0

Now let’s apply this function to our DataFrame from earlier with our mapped values.

df['applied_column'] = df.apply(
    lambda x: my_function(x['column_1'], x['column_3_mapped']), axis=1
)

The axis=1 parameter tells pandas that the apply function should be executed on each row.

It is important to keep in mind that if a built-in, vectorized function exists for what you are trying to do, it will very likely be faster than a user-defined function applied row by row through this method.
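As a sketch of the difference, the same logic as my_function can be expressed with vectorized comparisons instead of a row-wise apply:

# Vectorized equivalent: compare whole columns at once, then cast True/False to 1/0
df['applied_column'] = (
    (df['column_1'] < 5) & (df['column_3_mapped'] == "Yes")
).astype(int)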

loc

loc is brilliant for helping you filter the DataFrame. Let’s say you wanted to filter your DataFrame for all rows where column_3_mapped is equal to “Yes” and column_1 is less than 5. Below is how this could be accomplished with loc.

subset = df.loc[
    (df['column_1'] < 5) &
    (df['column_3_mapped'] == "Yes")
]

Another way to select a subset of a DataFrame is to give the DataFrame a true/false (boolean) mask, like below:

df_filter = df['id_column'].isin(['car_3', 'car_6'])
df[df_filter]

This will give you all rows where id_column is in the given list of IDs.

groupby

The groupby function does exactly what it says; it groups your DataFrame by the specified columns and applies the specified aggregations. Below are a couple of ways to accomplish this.

grouped = df.groupby(['column_3_mapped'])['column_1'].min() # sum, max, and other functions available.

grouped = df.groupby(['column_3_mapped']).agg({'column_1': ['sum', 'min']})

In the above examples, we grouped by the column_3_mapped column, and aggregated the column_1 column.
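One thing to be aware of is that the grouping column becomes the index of the result. A minimal sketch, if you prefer a flat DataFrame with column_3_mapped as a regular column again:

# Move the grouping column out of the index and back into a column
grouped = grouped.reset_index()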

to_csv

If you are done with the data wrangling and want the end result in the DataFrame exported as a CSV, you can call the to_csv function. A few parameters may come in handy when employing this function. The first is index=False, which you set if you do not want the index written out to the CSV. Another handy parameter is float_format, which lets you format float numbers, for example limiting them to 2 decimal places. I’ve included both parameters in the example below so that you have a frame of reference for what this might look like (the float format is set to 2 decimal places).

df.to_csv(
    "output_filename.csv",
    float_format='%.2f',
    index=False
)

Summary

This blog post aimed to cover only a handful of basic Pandas functions. There are many, many more functions out there that are just as useful as these, but these give you a basic understanding of how to navigate a Pandas DataFrame and do some simple data wrangling in the process.

Drop me a comment below if you feel like I missed a vital, basic function, or just let me know what functions you are using on a daily basis.

I have half a decade of experience working with data science and data engineering in a variety of fields, both professionally and in academia. I have demonstrated advanced skills in developing machine learning algorithms, econometric models, intuitive visualizations and reporting dashboards in order to communicate data and technical terminology in an easy-to-understand manner for clients of varying backgrounds.
