Processing, Programming

Reading on Pandas DataFrame

Motivation for the blog

So I got a little overwhelmed with a lot of ways we can CRUD on DataFrame and which to use when. I am writing this blog to gather my understanding and make it concrete.

Context for developers

DataFrame is a data structure available in the pandas library in python. You will see how do we use it now.

You can start thinking about the DataFrame as a tabular data structure. Much like in spreadsheets where each cell can be identified with the help of a (row,column). If you aware of a python list and coming to the level of list of lists which makes it a tabular data structure, let’s create that first so that we can appreciate what DataFrames offer.

food = [['apple', 'banana', 'pineapple'],
        ['potato', 'cabbage', 'cauliflower', 'eggplant'],
        ['wheat', 'oats', 'rice'],
        ['kidney beans', 'chickpeas', 'soybeans']]

We know in our minds that the first row is for “fruits”, then “vegetables”, then “grains”, then “legumes”. But this information is absent in the above list of lists. Accessing a particular food will require you to already being aware about which food is at which index. for example, If we need “fruits”, you should know that food[0] will give you all “fruits”; food[2][0] gives you “wheat”.

The next problem seems to be solved by using a dictionary object where we can define the same data as

food = {
    "fruits": ['apple', 'banana', 'pineapple'],
    "vegetables": ['potato', 'cabbage', 'cauliflower', 'eggplant'],
    "grains": ['wheat', 'oats', 'rice'],
    "legumes": ['kidney beans', 'chickpeas', 'soybeans']
}
food["fruits"] // ['apple', 'banana', 'pineapple']

With the dictionary data structure we can now access fruits by the lable name. If you notice, we can have different number of fruits, let’s say 10 fruits, and 12 vegetables. Both lists & dictionary data structure allow it.

  1. Lists did not allow labelling of rows and columns.
  2. Dictionary doesn’t allow reaching an element by number indexes.
  3. List of lists & Dictionary doesn’t guarantee to be table-like since it allows different length of child lists
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 1
----> 1 food[0]

KeyError: 0

Why do we care about tabular structure so much?

Think about a domain or an industry, let’s take the case of food itself first so that we can connect. The example data above mentions catagorical data. The number of categories are limited, there can not be millions of categories of food. If I try to make a tabular data in a spreadsheet, the columns can be fruits, vegetables, grains, legumes. These columns are independent of each other. Also, picking a row in the table doesn’t make much sense. Let’s visualise it a little.

fruitsvegetablesgrainslegumes
1applepotatowheatkidney beans
2bananacabbageoatschickpeas
3pineapplecauliflowerricesoybeans
4eggplant
categorical food data

What kind of data would make sense? Let’s take this data

fruitsvegetablesgrainslegumes
1applepotatowheatkidney beans
2applecabbagericesoybeans
3bananacauliflowerricesoybeans
4applepotatooatschickpeas
5pineappleeggplantricesoybeans
6bananapotatowheatkidney beans
7bananaeggplantoatschickpeas
8orangefrench beansNAchickpeas
9kiwiNAragiNA
meal intake data

Assume, each row is the meal intake of a person. This is not categorical data. Each row means something. Also assume, we collect this data for an entire neighbourhood, the number of rows can keep increasing with time. Add a few columns like id, name and scale the data to an entire country, the number of rows can be in millions.

This kind of tabular data makes sense. Each row is an event. Each column can be thought of as important details of the event (a.k.a dimensions). We can add more important-details-of-the-event by adding more columns. Let’s say we add a column called drink, or protein (chicken, fish, egg, etc.)

list of lists & dictonary data structures can hold this tabular data too. We will be able to do all sort of processing on it too. But we have to write a lot of efficient code, many times to CRUD, manipulate the data. This problem is solved by python libraries like numpy and pandas. These libraries provide data structures like Series and DataFrame to easily read/manipulate the data, combine the data, do sql like joins, choose smaller frames of the data (and manipulate it too). etc.

DataFrame

fruitsvegetablesgrainslegumes
0applepotatowheatkidney beans
1applecabbagericesoybeans
2bananacauliflowerricesoybeans
3applepotatooatschickpeas
4pineappleeggplantricesoybeans
5bananapotatowheatkidney beans
6bananaeggplantoatschickpeas
7orangefrench beansNAchickpeas
8kiwiNAragiNA
meal intake data

It’s a tabular data structure. The first horizontal block (here fruits vegetables grains legumes) are the columns. The first vertical block (here 0 1 2 3 4 5 6 7 8) are the index.

We will talk about creating a DataFrame a little later, for now let’s talk about the more interesting part of reading the data.

What do you want to read? This is the main topic of this blog.

specific columns
specific rows
specific columns more than 1
specific rows more than 1
smaller frames
specific row and column

So you see how different data can be extracted from the bigger tabular data. Tabular data stored in DataFrame allows ways to extract smaller datasets. Of course, DataFrame is capable of doing a lot of other things, but let’s focus on reading the data first.

What methods are available to read smaller datasets?

Let’s define our DataFrame. Remember we will talk about defining the DataFrame later. This is just one way to define it.

meal = [['apple', 'potato', 'wheat', 'kidney beans'],
        ['apple', 'cabbage', 'rice', 'soybeans'],
        ['banana', 'cauliflower', 'rice', 'soybeans'],
        ['apple', 'potato', 'oats', 'chickpeas'],
        ['pineapple', 'eggplant', 'rice', 'soybeans'],
        ['banana', 'potato', 'wheat', 'kidney beans'],
        ['banana', 'eggplant', 'oats', 'chickpeas'],
        ['orange', 'french beans', None, 'chickpeas'],
        ['kiwi', None, 'ragi', None]]

df = pd.DataFrame(meal, columns=['fruits', 'vegetables', 'grains', 'legumes'])

This is how it looks

Choosing specific column(s), all rows

fruits = df['fruits']
print(type(fruits)) // <class 'pandas.core.series.Series'>

A Series data structure allows accessing specific element by regular indexing fruits[0] // 'apple'

Accessing a single column on a DataFrame via the syntax df['column name'] will result in a Series

fruits = df[['fruits']]
print(type(fruits)) // <class 'pandas.core.frame.DataFrame'>
fruits_and_grains = df[['fruits', 'grains']]
print(type(fruits_and_grains)) // <class 'pandas.core.frame.DataFrame'>

Accessing one or more column on a DataFrame via syntax df[['column name', 'column name']] will result in a DataFrame

The important point to note here is that you are not just choosing specific columns. You are also choosing all the rows of those specific columns. If you need all the rows, go for this syntax df['column name'] or df[['column name', column name']]. Let’s call this column indexing.

Choosing a specific cell by row-number, column-number

This looks like another simple case. DataFrame doesn’t allow syntax like df[row-number, column-number], rather it allows this via a method called iloc. The syntax will look like df.iloc[row-number, column-number]. The column fruits in the DataFrame df is column-number 0, and so on.

If I want to find what legume(index 3, since 0 based indexing) was eaten in the 5th meal(index 4, since 0 based indexing), i’ll go for syntax like df.iloc[4, 3], which results in soybeans. check in the image below.

Choosing a smaller frame in the DataFrame

The interesting thing about iloc is that it allows another syntax i.e. df.iloc['row-range', 'column-range']. If you remember slicing in python lists, the row-range, column-range will follow the same format.

  1. :” represents all
  2. 1:” represents that leave the one at 0th index and take till last
  3. :5” represents that start from the 0th index and take till 4th index
  4. and so on

If I want to get fruits, vegetables, grains, legumes for 3rd till 7th index, my row-range will be 3:8 and my column-range will be : since I want to fetch all columns. entire syntax will be df.iloc[3:8, :]

if I want to get vegetables and grains for 3rd till 7th index, my row-range is still 3:8 and my new column-range will be 1:3. entire syntax will be df.iloc[3:8, 1:3]

If you remember in python list slicing, there was also an option to provide step. let’s try giving a step. If I want to get all even number meals, my row-range will look like ::2. Now df.iloc[::2, 1:3] results in

iloc will be useful when we have enough information about the digit of the row or column to be chosen.

Choosing a specific cell by row-name, column-name

When I use the word row-name i mean the the first vertical block of the DataFrame i.e. 0 1 2 3 4 5 6 7 8. In the terminology of DataFrame these are not really integer digits, they are “names”. Specifically called index. (for the curious, I can have indexes like a b c d e or india china srilanka. it’s also possible to have multiple indexes, but that is out of scope of this discussion)

When I use the word column-name. It’s simple, they are the column names, i.e. fruits vegetables grains legumes.

If I want to find what legume was consumed in meal name 5, we use the syntax df.loc[5, 'legume'] it results in kidney beans

Choosing specific cells by row-name-array, column-name-array

loc allows another syntax where I can pass array of row-names and array of column-names. i.e. df.loc[[3, 5, 6], ['vegetables', 'legumes']]

This is where it becomes a little bit more interesting. Sometimes, we don’t know the exact row-names and column-names to choose because there can be millions of rows and hundreds of columns.

Assume there are millions of rows and we can filter them based on a condition. Similarly let’s say there are hundreds of columns and we can choose specific columns by filtering on their names, let’s say we choose all the columns that start with solid_ for solid food and fluid_ for fluid foods.

In our example data, we have 9 meals. Let’s say I am able to somehow create an array for those 9 meals like so [False, False, False, True, False, True, True, False, False]. This array indicates that row-name 3, 5 & 6 are to be chosen since only they are true.

Let’s see what this syntax does df.loc[[False, False, False, True, False, True, True, False, False], ['vegetables', legumes']]

It gave me the same result as df.loc[[3,5,6], ['vegetables', 'legumes']]. We will talk about it a little later how passing multiple trues and falses are more interesting to us.

Similarily, let’s say we want to execute a similar syntax for column-name-array also. i.e. df.loc[[False, False, False, True, False, True, True, False, False], [False, True, False, True]]. This syntax is again same, we want 3,5,6 row-names and vegetables and legumes as the columns. the [False, True, False, True] indicates that fruits is false, vegetables is true, grains is false and legumes is true. This results in

Another interesting thing that you may have noticed is that I can use any way of providing row-name-array and column-name-array. I can use [3,5,6] row-name-array with ['vegetables', 'legumes']. I can use [True, False, ....] with ['vegetables', ...]. I can use [3,5,6] with [True, False, ...] and I can use [True, False, ...] with [True, False, ...].

Fun part is that I can also use row-name with column-name-array and row-name-array with column-name too. For example df.loc[1, [False, True, False, True]] will result in a Series.

Come to think of it, iloc also allows a mix and match of row-number/row-range and column-number/column-range.

Let’s sum it up now.

df[a], in this a can be

  1. column name => all rows of specific column
  2. range (python list type slicing syntax) => a bunch of rows
  3. all-row-boolean-array => specific rows, all columns

iloc[a,b], in this both a and b can be

  1. a digit (row-number or column-number) => specific row/column
  2. range (python list type slicing syntax) => a bunch of row/column
  3. all-row/column-boolean-array (True, False, ..) => specific rows/columns => this array has to be a python list NOT Series.

loc[a,b] in this both a and b can be

  1. name (row-name or column-name) => specific row/column
  2. name-array (row-name-array or column-name-array) => more than one rows/columns
  3. all-row/column-boolean-array (True, False, ..) => specific rows/columns

What do we mean by [True, True, False, … etc.]?

As promised, we will now talk about this way of passing all-row/column-boolean-array. And what does it mean. To be quick, its filters.

In our data, let’s say we want to filter all the meals which had apple in it.

df['fruits'] == 'apple' this will result in a Series of [True, False, True, False..]

Let’s write it in a better way. I want all rows and columns where apple was a part of the meal

condition = df['fruits'] == 'apple'
df[condition]
df.iloc[condition.tolist(), :] //since condition is pandas-Series, we need to convert it to python list
condition = df['fruits'] == 'apple'
df.loc[condition.tolist(), :]

Summary

df[a], in this a can be

  1. column name => all rows of specific column
  2. range (python list type slicing syntax) => a bunch of rows
  3. all-row-boolean-array => specific rows, all columns

iloc[a,b], in this both a and b can be

  1. a digit (row-number or column-number) => specific row/column
  2. range (python list type slicing syntax) => a bunch of row/column
  3. all-row/column-boolean-array (True, False, ..) => specific rows/columns => this array has to be a python list NOT Series.

loc[a,b] in this both a and b can be

  1. name (row-name or column-name) => specific row/column
  2. name-array (row-name-array or column-name-array) => more than one rows/columns
  3. all-row/column-boolean-array (True, False, ..) => specific rows/columns