Clear representation of output of confusion matrix

What is the default output of confusion_matrix from sklearn?


In one of my recent projects, a transaction monitoring system was generating a large number of false-positive alerts (each alert is then manually investigated by the investigation team). We were asked to use machine learning to auto-close those false alerts. The evaluation criterion for the model was Negative Predictive Value (NPV): out of all the negative predictions the model makes, how many are actually correct.

NPV = True Negative / (True Negative + False Negative)
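As a quick sketch of how this maps onto sklearn's output (the labels and predictions here are made up for illustration), note that `confusion_matrix` returns the counts as `[[TN, FP], [FN, TP]]`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions
y_true = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Negative Predictive Value, per the formula above
npv = tn / (tn + fn)
```

Unpacking with `.ravel()` is a handy way to avoid mixing up which cell is which.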

The cost of a false negative is extremely high, because these are the cases where our model is saying they are…

Sklearn train test split is not enough. We need something better, and faster



Why do you need to split data?

You don’t want your model to over-learn from training data and perform poorly after being deployed in production. You need to have a mechanism to assess how well your model is generalizing. Hence, you need to separate your input data into training, validation, and testing subsets to prevent your model from overfitting and to evaluate your model effectively.

In this post, we will cover the following things.

  1. A brief definition of training, validation, and testing datasets
  2. Ready to use code for creating these datasets (2 methods)
  3. Understand the science behind dataset split ratio
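As a minimal sketch of the second item, a three-way split can be built from two chained calls to sklearn's `train_test_split`; the 60/20/20 ratio below is illustrative, not a recommendation:

```python
from sklearn.model_selection import train_test_split

# Dummy data standing in for real features and labels
X = list(range(100))
y = [i % 2 for i in range(100)]

# First carve out 20% for the test set...
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the remaining 80% into 75/25, giving 60/20/20 overall
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)
```

Fixing `random_state` makes the split reproducible across runs.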

Definition of Train-Valid-Test Split

Learn how to make more meaningful features from DateTime type variables to be used by Machine Learning Models



DateTime fields require feature engineering to turn them from raw data into insightful information that our Machine Learning models can use. This post is divided into 3 parts, plus a Bonus section towards the end. We will use a combination of built-in pandas and NumPy functions, as well as our own functions, to extract useful features.

  • Part 1 — Extract Date / Time Components
  • Part 2 — Create Boolean Flags
  • Part 3 — Calculate Date / Time Differences
  • Bonus — Feature Engineering in 2 lines of code using fast_ml
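The first three parts can be sketched with pandas' `.dt` accessor; the column names and dates below are hypothetical:

```python
import pandas as pd

# Small illustrative frame with one DateTime column
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-01-15 09:30", "2021-03-07 18:45"])
})

# Part 1 -- extract date/time components
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["hour"] = df["order_date"].dt.hour

# Part 2 -- create boolean flags (Saturday=5, Sunday=6)
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5

# Part 3 -- calculate date/time differences against a reference date
df["days_to_ref"] = (pd.Timestamp("2021-06-01") - df["order_date"]).dt.days
```

Each derived column is now a plain numeric or boolean feature a model can consume.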


Whenever I have worked on e-commerce related data, in some way…

To establish quantitative connections to largely qualitative questions is the heart of statistics


Almost all of us make decisions daily — however big, or small those decisions may be. And we spend a lot of time and effort in getting those decisions right.

Why is that? And what does taking a decision really mean?

Decision making is just this: choosing a plan of action when faced with ‘uncertainty’.

There are 2 ways of making a decision:

  1. The intuitive way, wherein one makes a decision based on ‘gut feeling’
  2. The data-driven way, wherein you use data or information to come up with a plan of action

This quantitative approach to decision making is, in essence…

Using the package fast_ml


Below are the usual steps involved in building the ML pipeline:

  1. Import Data
  2. Exploratory Data Analysis (EDA)
  3. Missing Value Imputation
  4. Outlier Treatment
  5. Feature Engineering
  6. Model Building
  7. Feature Selection
  8. Model Interpretation
  9. Save the model
  10. Model Deployment *
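A few of the steps above (imputation, feature engineering, model building, saving) can be chained in a minimal scikit-learn `Pipeline`. This is only a sketch with toy data, not the actual pipeline from the post:

```python
import pickle
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy feature matrix with missing values, standing in for step 1
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # step 3: imputation
    ("scale", StandardScaler()),                   # step 5: simple feature engineering
    ("model", LogisticRegression()),               # step 6: model building
])
pipe.fit(X, y)

# Step 9: serialize the whole fitted pipeline in one object
blob = pickle.dumps(pipe)
```

Bundling preprocessing and the model together ensures the same transforms run at prediction time.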

Problem Statement and Getting the Data

I’m using a relatively large and complicated dataset to demonstrate the process. Refer to the Kaggle competition — IEEE-CIS Fraud Detection.

Navigate to Data Explorer and you will see something like this:

Data Scientist’s Toolkit — bootstrapping, sampling, confidence intervals, hypothesis testing


Bootstrapping is a powerful, computer-based method for statistical inference that does not rely on many assumptions. It can form a sampling distribution from just a single sample of data, with no closed-form formula required. Beyond that, it is widely applied in other statistical inference tasks such as confidence intervals, regression models, and even machine learning.

In this article, we will primarily talk about two things:

  1. Building Confidence Intervals
  2. Hypothesis Testing
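The first item can be sketched in a few lines of NumPy: resample the one observed sample with replacement, record each resample's mean, and read the interval off the percentiles. The distribution and sizes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)  # the single observed sample

# Resample with replacement many times; each resample's mean is one
# draw from the bootstrap sampling distribution of the mean
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(5000)]

# A 95% confidence interval for the mean: 2.5th and 97.5th percentiles
ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])
```

No normality formula was needed; the percentiles of the resampled means do all the work.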

Link to the GitHub repository for the code and dataset.

I. Confidence Intervals

In the real world, we don’t really know our true population. For that it…

Master all — csv, tsv, zip, txt, api, json, sql …


Not so long ago!

Do you remember the time when data was sent to you on an external hard drive for your analysis or model building?

Now, as a data scientist, you are not limited to those means. There are several ways of storing and sharing data, as well as different sources for acquiring and augmenting data.

Below, I list several ways of gathering data for your analysis.

Table of contents:

  1. CSV file
  2. Flat File (tab, space, or any other separator)
  3. Text File (In a single file — reading data all at once)
  4. ZIP file
  5. Multiple Text Files (Data is split…
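Items 1, 2, and 4 can be sketched with pandas; in-memory buffers stand in for files on disk, and the column names are made up:

```python
import io
import zipfile
import pandas as pd

# 1. CSV file: comma-separated values
csv_text = "id,amount\n1,100\n2,250\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# 2. Flat file: same reader, just pass the separator explicitly
tsv_text = "id\tamount\n1\t100\n"
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# 4. ZIP file: write a CSV into an in-memory archive, then read it back
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", csv_text)
buf.seek(0)
with zipfile.ZipFile(buf) as zf, zf.open("data.csv") as f:
    df_zip = pd.read_csv(f)
```

The same `read_csv` entry point covers most delimited formats once the separator is known.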

Using PyTorch framework for Deep Learning


One of the most crucial parts of building a deep neural network is having a clear view of your data as it flows through the layers, undergoing changes in dimensions, alterations in shape, flattening, and then re-shaping…

We will refer to the LSTM Architecture that we have seen earlier in our Sentiment Analysis Tutorial. Link to the article here.
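As a rough sketch of tracking shapes through such a stack (the layer sizes here are hypothetical, loosely mirroring a sentiment-analysis LSTM rather than the exact architecture from that tutorial):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration
batch_size, seq_len, vocab_size = 4, 10, 1000
embed_dim, hidden_dim = 32, 64

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # token ids

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

e = embed(x)             # -> (batch, seq_len, embed_dim)
out, (h, c) = lstm(e)    # out -> (batch, seq_len, hidden_dim)
logits = fc(out[:, -1])  # last time step -> (batch, 1)
```

Printing or asserting `.shape` after each layer like this is the simplest way to keep that clear view as tensors change dimension.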

Looking at text data through the lens of Neural Nets


Word Embedding => A collective term for models that have learned to map a set of words or phrases in a vocabulary to vectors of numerical values.

Neural Networks are designed to learn from numerical data.

Word embedding is really all about improving the ability of networks to learn from text data by representing that data as lower-dimensional vectors. These vectors are called embeddings.

This technique is used to reduce the dimensionality of text data, but these models can also learn some interesting traits about words in a vocabulary.
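At its core, an embedding is just a lookup table mapping word indices to dense vectors. A bare-bones NumPy sketch (not any particular library's API; the vocabulary and dimensions are made up):

```python
import numpy as np

vocab = {"good": 0, "bad": 1, "movie": 2}
embed_dim = 4

rng = np.random.default_rng(42)
# One row per vocabulary word; each row is that word's embedding vector.
# In a real model these values would be learned during training.
embedding_matrix = rng.normal(size=(len(vocab), embed_dim))

def embed(sentence):
    """Map a list of words to their embedding vectors via table lookup."""
    idx = [vocab[w] for w in sentence]
    return embedding_matrix[idx]

vectors = embed(["good", "movie"])  # shape: (2, embed_dim)
```

Layers like `nn.Embedding` do exactly this lookup, with the matrix updated by backpropagation.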

How is it done?

The general approach for dealing with words in your text data is to…

Samarth Agrawal

Follow me for valuable tips, ideas, and code snippets for Machine Learning & Deep Learning.
