Pandas is one of the most widely used Python libraries in data science and analytics. Pandas is a large library with immense capabilities. In this introductory blog post, I will walk through a handful of basic Pandas functions. First, import the library:

```python
import pandas as pd
```

## Data Structures

First of all, there are two types of data structures in pandas: **DataFrames** and **Series**.

### Series

In Pandas, a Series is a one-dimensional structure with a unique index attached to it.
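For instance, a Series can be created from a plain Python list; if you do not supply an index, pandas attaches a default integer index counting up from 0:

```python
import pandas as pd

# Create a Series from a list; pandas attaches a default integer index
s = pd.Series([10, 20, 30])
print(s.index.tolist())  # [0, 1, 2]
print(s[1])              # 20
```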

### DataFrame

A DataFrame in Pandas is a two-dimensional data structure built up of rows and columns. Similarly to a Series, each of a DataFrame’s rows has a unique index attached to it. Additionally, the columns have headers.

Below is an example of what a DataFrame looks like. Notice the left-most column counting up from 0. This is the index of the data frame and will always be unique.
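A minimal sketch that builds such a DataFrame (the column names here are made up for illustration):

```python
import pandas as pd

# A small illustrative DataFrame with invented column names
df = pd.DataFrame({
    "column_1": [1.5, 3.2, 7.8],
    "column_3": ["1", "0", "1"],
})
print(df)  # the unlabeled left-most column in the printout is the default index: 0, 1, 2
```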

## read_csv

The first function I am showing you is a function most people already know about; the *read_csv* function. This function makes it very easy to import CSV data into a Pandas DataFrame.

By adding the *nrows* argument, you can specify how many rows of the CSV file to import. Now you can look at a portion of the file and decide what columns you actually need; and if you know the data types, you can save your computer some time by specifying those as well. These two parameters are called *usecols* and *dtype* respectively.

Here is an example of how all these parameters are used:

```python
# Load only 10 rows of the CSV
df = pd.read_csv("large_data.csv", nrows=10)

# Figure out what columns you need, then import only those
df = pd.read_csv(
    "large_data.csv",
    usecols=["column_1", "column_3"],
    dtype={"column_1": float, "column_3": str}
)
```

## head

The *head* function returns the specified number of rows, or 5 rows by default if no number is specified. This comes in handy when you want to get an idea of what type of data you are working with and what a sample of the values look like.

```python
df.head(10)
```

## map

The function name might be short, but it is very intuitively named: this function *maps* existing values to new values, using the key-value pairs of a dictionary.

Let’s say, for example, that you wanted to set all “1” values to “Yes” and all “0” values to “No”; this function will help you with that. First, we define a Python dictionary with the key-value pairs, then use the *map* function, like this:

```python
yes_no_map = {1: "Yes", 0: "No"}
df['column_3_mapped'] = df['column_3'].map(yes_no_map)
```

It is worth noting that the keys in the map must match the column’s data type: if your column contains the string *“1”*, a dictionary key of the integer *1* will not match it, and vice versa. Unmatched values become NaN.
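A quick sketch of that pitfall, assuming a column of string values:

```python
import pandas as pd

s = pd.Series(["1", "0"])  # the values are strings

# Integer keys match nothing, so every value becomes NaN
print(s.map({1: "Yes", 0: "No"}).isna().all())   # True

# String keys match the string values as intended
print(s.map({"1": "Yes", "0": "No"}).tolist())   # ['Yes', 'No']
```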

## apply

*apply* allows you to take other columns as inputs and send them to a function (for example a user-defined function or lambda function).

Let’s define a function that returns 1 when *y* is equal to “Yes” and *x* is less than 5, and 0 otherwise.

```python
def my_function(x, y):
    if x < 5 and y == "Yes":
        return 1
    else:
        return 0
```

Now let’s apply this function to our DataFrame from earlier with our mapped values.

```python
df['applied_column'] = df.apply(
    lambda x: my_function(x['column_1'], x['column_3_mapped']),
    axis=1
)
```

The *axis=1* parameter tells pandas that the apply function should be executed once per row.

It is important to keep in mind that if a built-in, vectorized function exists for what you are trying to do, it will usually be much faster than *apply*, which calls your function once per row in plain Python.
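For example, the row-wise *apply* above can often be replaced with NumPy’s vectorized `np.where`. A sketch reusing the hypothetical columns from earlier:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column_1": [1, 6, 3],
    "column_3_mapped": ["Yes", "Yes", "No"],
})

# Vectorized equivalent of applying my_function row by row
df["applied_column"] = np.where(
    (df["column_1"] < 5) & (df["column_3_mapped"] == "Yes"), 1, 0
)
print(df["applied_column"].tolist())  # [1, 0, 0]
```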

## loc

This function is brilliant in helping you filter the DataFrame. Let’s say you wanted to filter your DataFrame for all values of column_3_mapped that were equal to “Yes”, and where column_1 was less than 5. Below is how this could be accomplished with the *loc* function.

```python
subset = df.loc[
    (df['column_1'] < 5) & (df['column_3_mapped'] == "Yes")
]
```

Another way to select a subset of a DataFrame is to give the DataFrame a Boolean mask (a true/false Series) like below:

```python
df_filter = df['id_column'].isin(['car_3', 'car_6'])
df[df_filter]
```

This will give you all rows where id_column is in the given list of IDs.

## groupby

The *groupby* function does exactly what it says: it groups your DataFrame by the specified columns and applies the specified aggregations. Below are a couple of ways to accomplish this.

```python
# sum, max, and other aggregation functions are also available
grouped = df.groupby(['column_3_mapped'])['column_1'].min()

# Multiple aggregations of the same column via a list
grouped = df.groupby(['column_3_mapped']).agg({'column_1': ['sum', 'min']})
```

In the above examples, we grouped by the *column_3_mapped* column and aggregated the *column_1* column.

## to_csv

If you are done with the data wrangling and require the end result in the DataFrame to be exported as a CSV, you can call the to_csv function. There are a few parameters that may come in handy when employing this function. The first is *index=False*. You will set this parameter if you do not want to output the index in the CSV. Another handy parameter is the *float_format* parameter. This parameter will let you format the float numbers, by for example allowing only 2 decimal places. I’ve included these parameters in the example below so that you have a frame of reference of what this might look like (the float format is set to 2 decimal places).

```python
df.to_csv(
    "output_filename.csv",
    float_format='%.2f',
    index=False
)
```

## Summary

This blog post aimed to cover only a handful of basic Pandas functions. There are many, many more functions out there that are just as useful as these, but these provide you with a basic understanding of how to navigate a Pandas DataFrame, and do some simple data wrangling in the process.

Drop me a comment below if you feel like I missed a vital, basic function, or just let me know what functions you are using on a daily basis.
