Clear representation of output of confusion matrix

What is the default output of confusion_matrix from sklearn? Image by Author

INTRODUCTION

In one of my recent projects, a transaction monitoring system was generating a lot of False Positive alerts (these alerts are then manually investigated by the investigation team). We were required to use machine learning to auto-close those false alerts. The evaluation criterion for the model was Negative Predictive Value (NPV): out of the total negative predictions made by the model, how many cases it identified correctly.

NPV = True Negative / (True Negative + False Negative)
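With sklearn, NPV can be computed directly from the confusion matrix. A minimal sketch on made-up labels (for binary labels `[0, 1]`, `ravel()` returns the counts in the order `tn, fp, fn, tp`):

```python
# Minimal sketch: computing Negative Predictive Value (NPV) from
# sklearn's confusion_matrix. The labels below are made up.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# With labels=[0, 1], ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
npv = tn / (tn + fn)
print(npv)  # → 0.8
```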

The cost of a false negative is extremely high because these are the cases where our model is saying they are…


Feature selection using fast_ml

Photo by Tolga Ulkan on Unsplash

Introduction

Two things distinguish top data scientists from others in most cases: Feature Creation and Feature Selection. That is, creating features that capture deeper or hidden insights about the business or customer, and then making the right choices about which features to use in your model.

Importance of Feature Selection in Machine Learning

Feature Selection is the process of reducing the number of input variables when developing a predictive model.

After an extensive Feature Engineering step, you end up with a large number of features. You may not want to use all of them in your model. You would be interested in feeding your model only the significant features, or removing the…
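The article goes on to do this with fast_ml. As a generic illustration of the idea (not the fast_ml API), dropping near-constant features with sklearn might look like this, on made-up data:

```python
# Illustrative sketch (not the fast_ml API): removing zero-variance
# features with sklearn's VarianceThreshold on a toy DataFrame.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "constant": [1, 1, 1, 1],        # zero variance: will be dropped
    "useful":   [0.2, 1.5, 3.1, 4.8],
})

selector = VarianceThreshold(threshold=0.0)  # default: drop zero-variance
selector.fit(df)
kept = df.columns[selector.get_support()]
print(list(kept))  # → ['useful']
```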


Feature Selection using fast_ml

Photo by Red Zeppelin on Unsplash

Introduction

Two things distinguish top data scientists from others in most cases: Feature Creation and Feature Selection. That is, creating features that capture deeper or hidden insights about the business or customer, and then making the right choices about which features to use in your model.

Importance of Feature Selection in Machine Learning

The quality of a Machine Learning model depends on your data: Garbage in, Garbage out. (Garbage here means bad data or noise in the data.)

After an extensive Feature Engineering step, you end up with a large number of features. You may not want to use all of them in your model. …


Sklearn's train_test_split is not enough. We need something better, and faster

Photo by Nathan Dumlao on Unsplash

INTRODUCTION

Why do you need to split data?

You don’t want your model to over-learn from training data and perform poorly after being deployed in production. You need to have a mechanism to assess how well your model is generalizing. Hence, you need to separate your input data into training, validation, and testing subsets to prevent your model from overfitting and to evaluate your model effectively.

In this post, we will cover the following things.

  1. A brief definition of training, validation, and testing datasets
  2. Ready to use code for creating these datasets (2 methods)
  3. Understand the science behind dataset split ratio
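One common pattern (sketched here with illustrative ratios, not necessarily the ones the article recommends) is to call sklearn's `train_test_split` twice:

```python
# Sketch: a three-way split by calling train_test_split twice.
# The 60/20/20 ratio below is just an example.
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in range(100)]

# First carve out 20% for the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the remaining 80% into 75/25 train/validation,
# giving 60/20/20 overall.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_valid), len(X_test))  # → 60 20 20
```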

Definition of Train-Valid-Test Split


Learn how to make more meaningful features from DateTime type variables to be used by Machine Learning Models

Feature Engineering of DateTime Variables. Image by Author.

INTRODUCTION

DateTime fields require Feature Engineering to turn them from raw data into insightful information that our Machine Learning Models can use. This post is divided into 3 parts and a Bonus section towards the end. We will use a combination of built-in pandas and NumPy functions, as well as our own functions, to extract useful features.

  • Part 1 — Extract Date / Time Components
  • Part 2 — Create Boolean Flags
  • Part 3 — Calculate Date / Time Differences
  • Bonus — Feature Engineering in 2 lines of code using fast_ml
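Parts 1–3 above can be sketched with pandas `dt` accessors (column names and dates here are made up for illustration):

```python
# Sketch of Parts 1-3 using pandas dt accessors on a toy DataFrame
# with hypothetical e-commerce columns.
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-01-05 10:30", "2021-03-14 22:15"]),
    "ship_date":  pd.to_datetime(["2021-01-07 09:00", "2021-03-16 08:45"]),
})

# Part 1: extract date / time components
df["order_month"] = df["order_date"].dt.month
df["order_hour"] = df["order_date"].dt.hour

# Part 2: create boolean flags (Saturday=5, Sunday=6)
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5

# Part 3: calculate date / time differences
df["days_to_ship"] = (df["ship_date"] - df["order_date"]).dt.days

print(df[["order_month", "is_weekend", "days_to_ship"]])
```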

BACKGROUND

Whenever I have worked on e-commerce-related data, in some way…


To establish quantitative connections to largely qualitative questions is the heart of statistics

INTRODUCTION

Almost all of us make decisions daily — however big, or small those decisions may be. And we spend a lot of time and effort in getting those decisions right.

Why is that? And what does taking a decision really mean?

Decision making is just this: choosing a plan of action when faced with ‘uncertainty’

There are 2 ways of making a decision:

  1. Intuitive way, wherein one takes a decision based on ‘gut feeling’
  2. Data-driven decision making, wherein you use data or information to come up with a plan of action

This quantitative approach to decision making is, in essence…


Using the package fast_ml

Photo by JJ Ying on Unsplash

Below are the usual steps involved in building the ML pipeline:

  1. Import Data
  2. Exploratory Data Analysis (EDA)
  3. Missing Value Imputation
  4. Outlier Treatment
  5. Feature Engineering
  6. Model Building
  7. Feature Selection
  8. Model Interpretation
  9. Save the model
  10. Model Deployment *
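A few of the steps above can be chained together as an sklearn `Pipeline`. A toy sketch (the data and component choices are illustrative, not the article's actual pipeline):

```python
# Toy sketch: steps 3 (missing value imputation) and 6 (model
# building) chained as an sklearn Pipeline on made-up data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # step 3
    ("model", LogisticRegression()),               # step 6
])
pipe.fit(X, y)
print(pipe.predict(X))
```

Chaining the steps this way also means the same imputation statistics learned on the training data get reused at prediction time, which matters once the model is saved and deployed (steps 9 and 10).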

Problem Statement and Getting the Data

I’m using a relatively large and complicated dataset to demonstrate the process. Refer to the Kaggle competition — IEEE-CIS Fraud Detection.

Navigate to Data Explorer and you will see something like this:


Data Scientist’s Toolkit — bootstrapping, sampling, confidence intervals, hypothesis testing

Photo by Nathan Dumlao on Unsplash

The bootstrap is a powerful, computer-based method for statistical inference that does not rely on too many assumptions. It can form a sampling distribution from just a single sample of data — no formula needed. It is also widely applied in other statistical inference tasks such as confidence intervals and regression models, and even in machine learning.

In this article we will primarily talk about two things:

  1. Building Confidence Intervals
  2. Hypothesis Testing
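The core idea for confidence intervals can be sketched in a few lines of NumPy: resample the one sample with replacement many times, and take percentiles of the resampled statistic. (The data here is synthetic.)

```python
# Minimal bootstrap sketch: a 95% percentile confidence interval
# for the mean, built from a single synthetic sample.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)  # our one sample

# Resample with replacement 2000 times and record each mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# The 2.5th and 97.5th percentiles bound the 95% interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo, hi)
```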

Link to the GitHub repo for the code and dataset.

I. Confidence Intervals

In the real world, we don’t really know the true population. For that, it…


Master them all: csv, tsv, zip, txt, api, json, sql …

Photo by Jakob Owens on Unsplash

Not so long ago!

Do you remember the time when data was sent to you on an external hard drive for your analysis or model building?

Now, as a data scientist, you are not limited to those means. There are several ways of storing and sharing data, as well as different sources for acquiring and augmenting data.

Below, I list several ways of gathering data for your analysis.

Table of contents:

  1. CSV file
  2. Flat File (tab, space, or any other separator)
  3. Text File (In a single file — reading data all at once)
  4. ZIP file
  5. Multiple Text Files (Data is split…
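A couple of the items above sketched with pandas (in-memory buffers stand in for real file paths here, so the snippet is self-contained):

```python
# Sketch of items 1, 2, and 4 with pandas. io.StringIO / io.BytesIO
# stand in for files on disk.
import io
import zipfile
import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"

# 1. CSV file
df_csv = pd.read_csv(io.StringIO(csv_text))

# 2. Flat file with a tab separator
df_tsv = pd.read_csv(io.StringIO("a\tb\n1\t2\n"), sep="\t")

# 4. ZIP file: pandas can read a zipped CSV directly
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("data.csv", csv_text)
buf.seek(0)
df_zip = pd.read_csv(buf, compression="zip")

print(df_csv.shape, df_tsv.shape, df_zip.shape)
```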

Samarth Agrawal

Follow me for valuable tips, ideas, and code snippets for Machine Learning & Deep Learning.
