If you are into data science, you have probably heard the famous quote, “If you torture the data long enough, it will confess.” To make data yield its secrets, you need the skills to handle it, and there is no better place to start than Exploratory Data Analysis (EDA). EDA is the first step you take when you embark on any data science project.
EDA can be as painstaking as collecting or mining the data, but it is worth the effort. If the EDA goes wrong, the whole modelling exercise may go for a toss. That is why data scientists take the utmost care, and spend a large share of their time, exploring the data set and applying the principles of EDA before they move on to training a model.
John Tukey, widely regarded as the father of EDA, established the term with his 1977 book Exploratory Data Analysis. Statistically speaking, exploratory analysis is the phase of data analysis in which analysts discover and learn the basic statistical values and interrelationships of the data points in a given dataset. EDA can also be defined as the numerical and graphical examination of data characteristics and relationships before formal, rigorous statistical analyses are applied. EDA rewards the user with a better understanding of the intricacies of the relationships among the data. In addition, EDA can be used to probe the validity of the assumptions made by formal statistical tests.
So what is exploratory data analysis all about?
Let's assume that you are going on a vacation. You already have a list of the main attractions you want to visit, and you check where you can stay. In short, you put together a complete itinerary before the trip. Exploratory data analysis plays the same role for data scientists: it is the planning they do before training their model.
EDA mainly does two things:
1. It helps clean up the dataset.
2. It also gives us a better understanding of all the variables and the relationships between them.
Accordingly, EDA has three main components:
- Understanding the variables
- Cleaning the dataset
- Analyzing the relationships between all variables
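As a minimal sketch of these three components, assume a tiny made-up dataset held as a list of dictionaries. The column names (age, income) and the helper functions are hypothetical and use only the Python standard library; in practice a data-frame library such as pandas would usually do this work.

```python
from statistics import mean

# Hypothetical toy dataset: one exact duplicate row and one missing value.
rows = [
    {"age": 25, "income": 30000},
    {"age": 32, "income": 42000},
    {"age": 32, "income": 42000},   # exact duplicate
    {"age": 41, "income": None},    # missing income
    {"age": 50, "income": 61000},
]

def missing_counts(data):
    """Understanding the variables: count missing values per column."""
    columns = data[0].keys()
    return {col: sum(1 for r in data if r[col] is None) for col in columns}

def clean(data):
    """Cleaning the dataset: drop rows with missing values and exact duplicates."""
    seen, result = set(), []
    for r in data:
        key = tuple(sorted(r.items()))
        if None in r.values() or key in seen:
            continue
        seen.add(key)
        result.append(r)
    return result

def pearson(xs, ys):
    """Analyzing relationships: Pearson correlation of two numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

cleaned = clean(rows)
ages = [r["age"] for r in cleaned]
incomes = [r["income"] for r in cleaned]

print("Missing per column:", missing_counts(rows))
print("Rows after cleaning:", len(cleaned))
print("Age/income correlation:", round(pearson(ages, incomes), 3))
```

Dropping incomplete rows is only one way to treat missing values; whether to drop, impute, or flag them depends on the data and the business question.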
EDA is NOT about making fancy or even aesthetically pleasing visualizations. The goal is to answer questions with data. Aim to create figures that someone can look at for a couple of seconds and understand what is going on.
The Methods of Exploratory Data Analysis
Exploratory data analysis is carried out using methods like:
- Univariate visualization – This is the simplest type of analysis, where the data analyzed consists of a single variable. Univariate analysis is mainly used to describe the data and spot patterns in it.
- Bivariate visualization – This type of analysis is used to determine the relationship between two variables and the significance of that relationship.
- Multivariate visualization – When the data sets are more complex, multivariate analysis is used to trace relationships between several fields at once. Examining the variables jointly, instead of running many separate tests, can help limit Type I errors. It is, however, generally unsuitable for small data sets.
- Dimensionality reduction – This analysis helps identify which parameters account for most of the variation in the data, and it speeds up processing by reducing the volume of data.
Using these methods, a data scientist can get a grip on the problem at hand and select models that suit the data. After studying the distribution of the data, you can check whether there is any missing data and find ways to cope with it.
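For instance, a univariate look at a single numeric column might be sketched as follows. The quantity values and the summarize helper are made up for illustration, and the 1.5 × IQR cut-off is one common rule of thumb for flagging outliers, not the only option.

```python
from statistics import quantiles

# Hypothetical column of order quantities; None marks a missing entry.
quantity = [12, 15, None, 14, 13, 16, 15, 14, 95]

def summarize(column):
    """Univariate summary: missing count, five-number summary, IQR outliers."""
    present = sorted(v for v in column if v is not None)
    q1, q2, q3 = quantiles(present, n=4)  # quartiles, default "exclusive" method
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "missing": len(column) - len(present),
        "min": present[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": present[-1],
        "outliers": [v for v in present if v < low or v > high],
    }

summary = summarize(quantity)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A value flagged this way is a candidate for a closer look, not automatically an error; domain knowledge decides whether it is a data-entry mistake or a genuine bulk order.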
Exploratory Data Analysis: A simple example
Suppose you have to find the sales trend for an online clothing retailer. The data set consists of features like customer ID, invoice number, stock code, description, quantity, unit price, state, region, etc. Before starting, you can do your data pre-processing, that is, checking for outliers, missing values, etc.
You can also add new features. Depending on the business requirement, you can choose which features to add, for example a total amount computed from quantity and unit price. By grouping the data by state or region and summing the quantity or total amount, you can find out which regions give the retailer the maximum and minimum sales. Next, by grouping the total amount by year, you can find the sales trend over the given number of years. You can do the same for each month and find out which time of the year shows a spike or drop in sales. Using this same method, you can identify further problems and find ways to fix them.
As the example above shows, domain understanding and core knowledge are essential to doing EDA thoroughly. To handle outliers, treat missing values, and identify clusters properly, all of which must be done before you move on to the next step of model building, you need to know the business.
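The grouping steps described above could be sketched like this. The invoice lines, region names, and the total_by helper are hypothetical, and the total amount is assumed to be quantity multiplied by unit price.

```python
from collections import defaultdict
from datetime import date

# Hypothetical invoice lines: (region, invoice date, quantity, unit price).
invoices = [
    ("North", date(2021, 1, 15), 3, 20.0),
    ("South", date(2021, 6, 2), 1, 50.0),
    ("North", date(2022, 1, 20), 5, 20.0),
    ("South", date(2022, 12, 5), 4, 50.0),
    ("North", date(2022, 12, 18), 2, 20.0),
]

def total_by(records, key_func):
    """Sum the derived feature (quantity * unit price) within each group."""
    totals = defaultdict(float)
    for region, day, quantity, unit_price in records:
        totals[key_func(region, day)] += quantity * unit_price
    return dict(totals)

by_region = total_by(invoices, lambda region, day: region)
by_year = total_by(invoices, lambda region, day: day.year)
by_month = total_by(invoices, lambda region, day: day.month)

print("Best region:", max(by_region, key=by_region.get))
print("Worst region:", min(by_region, key=by_region.get))
print("Sales per year:", dict(sorted(by_year.items())))
print("Peak month:", max(by_month, key=by_month.get))
```

Passing the grouping key as a function keeps one helper reusable for region, year, and month; grouping by state or by stock code would only need a different key.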
Do share your thoughts about this article in the comments.