Khalida Douibi
4 min read · Jan 19, 2022

Data Profiling in Machine Learning

Data profiling is one of the main steps in data analysis. It aims to collect descriptive statistics and informative summaries of the data. A data scientist uses the results of this first step to discover business knowledge hidden in the data and to decide on the modelling strategy needed for the analysis.
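To make "descriptive statistics and informative summaries" a bit more concrete, here is a minimal sketch in plain pandas. The helper name basic_profile and the toy orders table are assumptions made up for illustration; a real profile would usually cover many more statistics.

```python
import pandas as pd


def basic_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with a few descriptive statistics."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),  # missing values are not counted
    })
    # Mean and standard deviation only make sense for numeric columns;
    # other columns are left as NaN after index alignment.
    numeric = df.select_dtypes(include="number")
    summary["mean"] = numeric.mean()
    summary["std"] = numeric.std()
    return summary


# Hypothetical toy data, invented for this example
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3, 3],
    "country": ["FR", "FR", "FR", "DZ", None, "DZ"],
    "amount": [20.0, 35.5, None, 12.0, 80.0, 15.0],
})

profile = basic_profile(orders)
print(profile)
```

Even a small table like this already shows which columns have missing values and how spread out the numeric ones are, which is the kind of information that guides the modelling choices.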

In this post, I will talk about data profiling in e-commerce as an example. Let’s suppose that we have some data collected from a company's website and we want to group together the typical or ideal clients of a given business in order to propose adequate services accordingly. This problem is called customer-profiling analysis.

Customer profiling is based mainly on statistical methods that allow a company to carry out its marketing strategy and make the best decisions while respecting its customers' preferences. As Orvel Ray Wilson, a marketing expert, said: "Customers buy for their reasons, not yours." [1] That’s why companies should align themselves with what the buyer wants, not the inverse!

The analysis of user browsing and purchasing data on a website helps to draw interesting conclusions about users' current needs and to anticipate their future behaviour (predictive analysis). Recently, several companies have also started to use another crucial source of information to meet this need: the analysis of user opinions (sentiment analysis), which brings out the dormant data contained in users' feedback on a purchase or an interaction on a website.
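As a rough illustration of what analyzing purchasing data can look like, the sketch below summarizes a purchase log per customer with recency, frequency and monetary value (often called RFM features). RFM is my own choice for this example and is not prescribed anywhere above; the column names, the rfm_table helper and the data are all hypothetical.

```python
import pandas as pd

# Hypothetical purchase log, invented for this example
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2022-01-02", "2022-01-10", "2021-11-20",
        "2022-01-05", "2022-01-12", "2022-01-15",
    ]),
    "amount": [30.0, 45.0, 120.0, 10.0, 15.0, 20.0],
})


def rfm_table(df: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Aggregate one row per customer: recency, frequency, monetary value."""
    grouped = df.groupby("customer_id").agg(
        last_order=("order_date", "max"),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )
    # Days elapsed between the snapshot date and the last purchase
    grouped["recency_days"] = (snapshot_date - grouped["last_order"]).dt.days
    return grouped[["recency_days", "frequency", "monetary"]]


rfm = rfm_table(purchases, pd.Timestamp("2022-01-19"))
print(rfm)
```

A table like this is a convenient starting point for both predictive models and customer segmentation.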

In summary, the main idea is to use intelligent data science and machine learning methods to analyze the data collected on a website and segment user profiles. On the one hand, this helps the company make the right decisions; on the other hand, it gives users an optimal experience by letting them find what they are looking for easily and quickly, and by anticipating their future needs.

MACHINE LEARNING FOR CUSTOMER PROFILING

Data science and machine learning tools help to analyze data in order to make decisions and anticipate future events. Let’s talk about some useful approaches to this problem. Depending on the quality of the available data and the main purpose of the experts, several ML modelling strategies can be considered.

  • First, we can model the problem as predictive causal analytics, which predicts the probability that an event will occur in the future based on the past. E.g., predict whether a customer will be interested in a product based on their history of purchases of other products. (read more about causal analysis and time series forecasting)
  • Prescriptive analytics modelling, used to build recommender systems, goes deeper: the model is not limited to using the past to make predictions, but also suggests several possible new actions together with their expected results. E.g., propose a recommender system that uses, for example, the customer reviews of a product. (read more about recommender systems, natural language processing and sentiment analysis)
  • Predictive segmentation modelling, for example by using unsupervised ML approaches to discover patterns in the data. E.g., user segmentation to discover similar profiles and offer targeted products and services. (read more about clustering and graph mining)
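The segmentation idea from the last bullet can be sketched with a small clustering example. This is only an illustration under my own assumptions: it uses scikit-learn's KMeans (one of many possible unsupervised methods), two made-up features per customer, and a number of clusters fixed to 3 by hand.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical features per customer: [orders per month, average basket value]
X = np.array([
    [1, 20.0], [2, 25.0], [1, 22.0],       # occasional buyers, small baskets
    [10, 30.0], [12, 28.0], [11, 35.0],    # frequent buyers
    [2, 300.0], [1, 280.0], [3, 320.0],    # rare buyers, high-value baskets
])

# Put both features on the same scale so that basket value does not dominate
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=3 is an assumption for this toy data, not a general rule
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

print(labels)  # one cluster id per customer
```

In practice the number of segments is not known in advance, so it is usually chosen by comparing several values (for example with a silhouette score) and by checking that the resulting groups make business sense.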

Data profiling in Python:

Let’s suppose that we want to explore a dataset by using a simple function in Python. Let’s start by defining the following simple function: data_profiling()

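import pandas as pd

# Note: newer releases of this library are published as ydata_profiling
from pandas_profiling import ProfileReport

def data_profiling(df):
    # Build the profiling report for the given DataFrame
    profile = ProfileReport(df, title="Dataset Profile")
    profile.to_widgets()  # if using Jupyter
    return profile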

Many other Python libraries exist for data profiling, such as skimpy, DataPrep, Sweetviz, and AutoViz:

#skimpy

from skimpy import skim

skim(df)

#Dataprep

from dataprep.eda import plot, plot_correlation, plot_missing, plot_diff, create_report

create_report(df)

#Sweetviz

import sweetviz as sv

report = sv.analyze(df)

report.show_html()  # render the report as an HTML page

Try it on your dataset and tell us more about your findings and interpretations in the comments 😉.

Author: Khalida Douibi, PhD

For more content related to Data Science and ML, visit my LinkedIn: https://www.linkedin.com/in/khalida-douibi/

#datascience #artificialintelligence #machinelearning #phd #innovation #research #datascientist

References:

https://github.com/ydataai/pandas-profiling

Written by Khalida Douibi

Sn. Data Scientist. PhD. Biomedical Informatics, Machine learning
