Processing

Exploratory Data Analysis at a glance

All we get is data. Data in different locations. We understand the data from different angles and provide observations. It’s obvious that you need a problem statement first.

yours truly

Data in different locations

Data is present in physical and digital formats. If we want to do data analysis using software tools, we first need to convert all that physical data to a digital format.

Digital data is stored in a pre-defined format. For example, the number 56 can be stored in multiple ways.

  1. integer (programming languages have their own integer data type)
  2. float (programming languages have their own float data type)
  3. string (programming languages have their own implementation of the data type)
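
For instance, in Python the same value takes on a different data type depending on how it is written:

```python
# The same value 56, stored under three different data types
as_int = 56        # int
as_float = 56.0    # float
as_str = "56"      # str

print(type(as_int), type(as_float), type(as_str))
# <class 'int'> <class 'float'> <class 'str'>
```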

When I say data, I need a data type.

In digital systems, data exists in these places.

  1. memory aka RAM, used by a running process on a computer. A process can run on any runtime: Python has its own runtime, Java has its own runtime, and so on.
  2. in transit aka the network cable, when the data is moving from one system on the internet to another
  3. in database, which generally uses the filesystem on the computer (postgres, mysql, mongo, etc)
  4. filesystem itself (csv, excel, etc.)

To do data analysis on my computer, using a Jupyter notebook (for example), I need to bring all the required data from these different systems on the internet, over the network cable, into the memory of my Python runtime.

To do this, we fetch data from a database, we download a file to our local filesystem and then load it, or we fetch data from an API. There are multiple ways. Once this data is available in my Python runtime, in a variable, we can play with it.
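
For example, here is a minimal sketch of those three routes into a pandas dataframe; the file name, connection string, and API URL are placeholders, not real endpoints:

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# 1. From a file already on the local filesystem (CSV here, Excel works similarly)
df_file = pd.read_csv("matches.csv")  # hypothetical file name

# 2. From a database (Postgres here), which reads from its own storage on disk
engine = create_engine("postgresql://user:password@localhost:5432/analytics")  # placeholder credentials
df_db = pd.read_sql("SELECT * FROM matches", engine)

# 3. From an API, i.e. over the network cable, then into a dataframe
response = requests.get("https://example.com/api/matches")  # placeholder URL
df_api = pd.DataFrame(response.json())
```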

Trusting the data, and then making it trustworthy by cleaning it

The data we have is raw: you can look at it from different angles, but you can’t just start doing analysis on it yet. You just bought it from the market, so clean it first, then cook.

Generally, this is tabular data which we can load into a dataframe: data with rows and columns.

  1. missing data (remove it, fill it with something)
  2. change the datatype to a more suitable datatype, for example convert a column which is supposed to be int, but came in as string, to int (see the sketch after this list)
  3. sanity check (be sure that the data is in the right range, or is what it’s supposed to be; for example, a month column cannot have “guitar” in it)
  4. remove outliers (or take them out in a different dataframe)
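
A minimal sketch of steps 2 and 3, assuming a small made-up dataframe where a runs column arrived as strings and a month column should only contain month names:

```python
import pandas as pd

df = pd.DataFrame({
    "runs": ["45", "112", "7"],          # numbers that arrived as strings
    "month": ["Jan", "guitar", "Mar"],   # one value is clearly not a month
})

# Step 2: convert the column to the datatype it is supposed to have
df["runs"] = df["runs"].astype(int)
# pd.to_numeric(df["runs"], errors="coerce") is safer if some values may not parse

# Step 3: sanity check, keep only rows where the month is a valid value
valid_months = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}
df = df[df["month"].isin(valid_months)]
```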

Missing data

  • use df.isnull() to get a new DataFrame, each cell filled with either True/False. True means that it’s null.
  • use df.isnull().sum() to get a new Series with the column names and how many nulls each of them has. Take a call: if you think there are too many nulls in a particular column, remove that column.
  • use df.isnull().sum(axis=1) to find the nulls per row (axis=0, the default, counts per column). Take a call: if you think some rows have too many nulls, remove those rows.
  • for the remaining rows, take a call on how to fill them with some assumed data. It can be the mean, the median, the mode, or anything else (a sketch follows this list).
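
A minimal sketch of these calls, assuming made-up runs and ground columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"runs": [45, np.nan, 112], "ground": ["Eden Gardens", "Wankhede", None]})

df.isnull()              # DataFrame of True/False, True where a cell is null
df.isnull().sum()        # nulls per column
df.isnull().sum(axis=1)  # nulls per row

# Take a call: drop what you can't reasonably guess, fill the rest
df = df.dropna(subset=["ground"])                  # drop rows missing the ground
df["runs"] = df["runs"].fillna(df["runs"].mean())  # fill numeric gaps with the mean
```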

Remove outliers

It’s hard to go through all the data manually and find the outliers, so use visualisations to help you. Use a histogram/boxplot to spot them. Take a call: if you feel a value is actually an anomaly, remove that row.
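
A minimal sketch, assuming a numeric runs column; the boxplot is for eyeballing, and the 1.5 × IQR rule below is just one common way to flag candidates before taking a call:

```python
import matplotlib.pyplot as plt

df.boxplot(column="runs")   # points far outside the whiskers are candidates
plt.show()

# Flag values outside 1.5 * IQR, then decide what to do with them
q1, q3 = df["runs"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["runs"] < q1 - 1.5 * iqr) | (df["runs"] > q3 + 1.5 * iqr)]
df_clean = df.drop(outliers.index)   # or keep the outliers aside in their own dataframe
```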

Proceeding to analysis

Once we have clean data, we need a problem statement to proceed. Generally, we already have a problem statement.

  1. An event has occurred, here is the data, find some patterns or the root cause for it. Can we predict when it can happen again?
  2. A group of data columns (events, facts), when they occur, can lead to a particular outcome (target). In the entire dataset, can we find which are the cause data-columns and which are the effect data-column(s)?
  3. Correlation: are there any combinations of 2 data columns which are proportional or inversely proportional?

Single column analysis

  • histogram/boxplot: both give an idea about where most of the data lies, and near which bucket. A histogram shows those buckets explicitly.
  • if the data is numeric, we can directly use a histogram/boxplot
  • if the data is categorical, get the frequency distribution using value_counts() and then draw a bar chart where each bar is a category (see the sketch after this list)
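
A minimal sketch of both cases, assuming made-up column names runs (numeric) and ground (categorical):

```python
import matplotlib.pyplot as plt

# Numeric column: histogram and boxplot
df["runs"].plot(kind="hist", bins=20)
plt.show()
df.boxplot(column="runs")
plt.show()

# Categorical column: frequency distribution, then a bar chart
df["ground"].value_counts().plot(kind="bar")
plt.show()
```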

2 column analysis

  • if both columns are numeric, create a line chart, scatter plot, jointplot, or regplot; use pairplot if there are many 2-column combinations
  • if one column is categorical, create a bar chart or a boxplot (see the sketch after this list)
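
A minimal sketch, assuming seaborn is available and made-up column names balls_faced, runs (numeric) and ground (categorical):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Both columns numeric
sns.scatterplot(data=df, x="balls_faced", y="runs")
plt.show()
sns.regplot(data=df, x="balls_faced", y="runs")    # scatter plus a fitted regression line
plt.show()
sns.pairplot(df)                                   # every pairwise numeric combination at once
plt.show()

# One categorical, one numeric column
sns.boxplot(data=df, x="ground", y="runs")
plt.show()
```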

2 categorical and 1 numeric column analysis (heatmap)

This looks tricky, let’s understand this by an example.

How does Virat Kohli perform on Indian grounds, across months? The performance will be measured by total runs scored.

Remember there can be other analyses too. For example, we might want to find Virat Kohli’s performance by Indian grounds alone. This becomes a bar chart where each bar is a ground and the length of the bar is the aggregate on that ground; the aggregate can be anything like total-score (sum), average-score (mean), highest-score (max), etc. A similar analysis can be done to find Virat Kohli’s performance by month alone.

But we also want a more granular picture: how does Virat perform on Indian grounds, across months? The performance can be any aggregate, but we chose total-runs-scored.

The visualisation can be imagined as a tabular structure where rows are grounds, columns are months, and the value in each cell is total-runs-scored. Looking at this table alone may not give much insight, but the corresponding heatmap will.
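
A minimal sketch, assuming a made-up dataframe of innings with ground, month, and runs columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Rows are grounds, columns are months, each cell is the total runs scored there in that month
table = df.pivot_table(index="ground", columns="month", values="runs", aggfunc="sum")

sns.heatmap(table, annot=True, fmt=".0f")   # colour encodes total runs, numbers printed in cells
plt.show()
```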

Let’s say there is a column which contains numerical values. You can convert it into bins like high/medium/low using pd.qcut, which means you can convert a numerical value into a categorical value and then use this column in your heatmap.
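
For example, assuming a made-up strike_rate column that we want to use as a categorical dimension:

```python
import pandas as pd

# Split strike_rate into three equal-sized bins and label them
df["strike_rate_band"] = pd.qcut(df["strike_rate"], q=3, labels=["low", "medium", "high"])
# strike_rate_band is now categorical and can be used as the rows or columns of a heatmap
```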

1 Time series, 1 numeric analysis

  • line chart, bar chart
  • numeric per category => stacked bar chart
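
A minimal sketch, assuming a made-up dataframe with a date column, a numeric runs column, and a format column (Test/ODI/T20) as the category:

```python
import pandas as pd
import matplotlib.pyplot as plt

df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.to_period("M")

# Numeric value over time: line chart of total runs per month
df.groupby("month")["runs"].sum().plot(kind="line")
plt.show()

# Numeric per category over time: stacked bar chart, one colour per format
df.groupby(["month", "format"])["runs"].sum().unstack().plot(kind="bar", stacked=True)
plt.show()
```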