In one of my recent projects, a transaction monitoring system was generating a large number of false-positive alerts (these alerts were then manually investigated by the investigation team). We were required to use machine learning to auto-close those false alerts. The evaluation criterion for the model was Negative Predictive Value (NPV), which measures how many of the model's total negative predictions were identified correctly.
NPV = True Negative / (True Negative + False Negative)
The cost of a false negative is extremely high because these are the cases where our model is saying they are…
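To make the metric concrete, here is a minimal sketch of computing NPV from a confusion matrix with scikit-learn; the labels below are made up for illustration and are not from the actual project:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = genuine alert, 0 = false-positive alert (safe to auto-close)
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Negative Predictive Value: of all alerts the model predicted as negative,
# the fraction that were actually negative
npv = tn / (tn + fn)
print(f"NPV = {tn} / ({tn} + {fn}) = {npv:.2f}")
```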
Why do you need to split data?
You don’t want your model to over-learn from the training data and then perform poorly after being deployed in production. You need a mechanism to assess how well your model generalizes. Hence, you separate your input data into training, validation, and testing subsets to guard against overfitting and to evaluate your model reliably.
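As a minimal sketch, here is one common way to carve out the three subsets with scikit-learn; the 60/20/20 ratio and the synthetic data are illustrative choices, not prescriptions from this post:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1,000 samples, 5 features, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First hold out 20% as the test set, touched only for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Then split the remaining 80% into training and validation sets;
# 0.25 of 80% yields a 60/20/20 split overall
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```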
In this post, we will cover the following things.
DateTime fields require feature engineering to turn them from raw data into insightful information that our machine learning models can use. This post is divided into three parts and a bonus section towards the end; we will use a combination of built-in pandas and NumPy functions, as well as our own functions, to extract useful features.
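As a small, hedged sketch of the built-in pandas side of this (the column names and timestamps are made up for illustration):

```python
import pandas as pd

# Hypothetical orders table with a raw timestamp column
df = pd.DataFrame({
    "order_ts": ["2021-03-01 09:15:00", "2021-03-06 22:40:00", "2021-12-25 13:05:00"]
})
df["order_ts"] = pd.to_datetime(df["order_ts"])

# The .dt accessor turns one timestamp into several model-ready features
df["year"] = df["order_ts"].dt.year
df["month"] = df["order_ts"].dt.month
df["day_of_week"] = df["order_ts"].dt.dayofweek  # Monday = 0
df["hour"] = df["order_ts"].dt.hour
df["is_weekend"] = (df["order_ts"].dt.dayofweek >= 5).astype(int)

print(df)
```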
Whenever I have worked on e-commerce-related data, in some way…
Almost all of us make decisions daily, however big or small those decisions may be. And we spend a lot of time and effort getting those decisions right.
Why is that? And what does making a decision really mean?
Decision making is just this: choosing a plan of action when faced with ‘uncertainty’.
There are 2 ways of making a decision:
This quantitative approach to decision making is, in essence…
Below are the usual steps involved in building the ML pipeline:
I’m using a relatively large and complex dataset to demonstrate the process. Refer to the Kaggle competition: IEEE-CIS Fraud Detection.
Navigate to the competition’s Data Explorer to see the data files it provides.
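As a minimal, hedged sketch of the first few pipeline steps on that data (the file and column names below match the competition's public data listing, but treat the rest as illustrative assumptions rather than this post's exact approach):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed file from the IEEE-CIS Fraud Detection data page (downloaded via Kaggle)
df = pd.read_csv("train_transaction.csv")

# 'isFraud' is the competition's target; keep numeric columns for simplicity
y = df["isFraud"]
X = df.drop(columns=["isFraud"]).select_dtypes("number").fillna(-999)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```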
Bootstrap is a powerful, computer-based method for statistical inference that doesn’t rely on too many assumptions. It is almost magical that you can form a sampling distribution from just one sample, with no formula needed for the inference. In fact, it is widely applied in other statistical inference tasks, such as constructing confidence intervals and regression models, and even in machine learning.
In this article, we will primarily talk about two things.
In the real world, we don’t really know our true population. For that it…
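To see the magic in miniature, here is a hedged sketch of a bootstrap confidence interval for a mean, built from a single sample with nothing but NumPy resampling; the sample data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# The one sample we actually observed (synthetic, for illustration)
sample = rng.exponential(scale=10, size=200)

# Resample with replacement many times, recording the statistic each time
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# The middle 95% of the bootstrap distribution gives a confidence interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```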
Not so long ago!
Do you remember the time when data was sent to you on an external hard drive for your analysis or model building?
Now, as a data scientist, you are not limited to those means. There are several ways of storing and sharing data, as well as different sources from which to acquire and augment it.
Below, I’m listing several ways of gathering data for your analysis.
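To make a few of those concrete, here is a hedged sketch of three common acquisition patterns in Python; every URL, file, and table name below is a placeholder, not a real source:

```python
import sqlite3

import pandas as pd
import requests

# 1. Flat files, read locally or straight from a URL (placeholder path)
df_csv = pd.read_csv("https://example.com/data.csv")

# 2. A SQL database (SQLite here; swap in your own connection and table)
conn = sqlite3.connect("analytics.db")  # placeholder database
df_sql = pd.read_sql("SELECT * FROM transactions", conn)  # placeholder table

# 3. A REST API that returns JSON records (placeholder endpoint)
resp = requests.get("https://api.example.com/v1/orders")
df_api = pd.DataFrame(resp.json())
```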
One of the most crucial parts of building a deep neural network is having a clear view of your data as it flows through the layers, undergoing changes in dimensions, alterations in shape, flattening, and then re-shaping…
We will refer to the LSTM architecture that we saw earlier in our Sentiment Analysis tutorial.
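To watch those shape changes concretely, here is a minimal Keras sketch; the vocabulary size, sequence length, and layer widths are illustrative, not the tutorial's exact values:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # (batch, seq_len) integer word indices -> (batch, seq_len, 32) dense vectors
    layers.Embedding(input_dim=10_000, output_dim=32),
    # (batch, seq_len, 32) -> (batch, 64): the LSTM keeps only its final hidden state
    layers.LSTM(64),
    # (batch, 64) -> (batch, 1): a single sentiment probability
    layers.Dense(1, activation="sigmoid"),
])

# Build with a fixed sequence length of 100 so summary() can show concrete shapes
model.build(input_shape=(None, 100))
model.summary()
```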
Word Embedding: a collective term for models that learn to map a set of words or phrases in a vocabulary to vectors of numerical values.
Neural Networks are designed to learn from numerical data.
Word Embedding is really all about improving the ability of networks to learn from text data by representing that data as lower-dimensional vectors, called embeddings.
This technique is used to reduce the dimensionality of text data, but these models can also learn some interesting traits about the words in a vocabulary.
The general approach for dealing with words in your text data is to…
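As a self-contained illustration of learned embeddings, here is a hedged sketch using gensim's Word2Vec; the toy corpus and hyperparameters are made up, and real models need far larger corpora and vectors:

```python
from gensim.models import Word2Vec

# A toy corpus: each "sentence" is a list of tokens (illustrative data)
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "fantastic"],
    ["the", "movie", "was", "terrible"],
    ["a", "great", "fantastic", "film"],
]

# Learn 8-dimensional embeddings for every word in the vocabulary
model = Word2Vec(sentences, vector_size=8, window=2, min_count=1, seed=0)

print(model.wv["movie"])               # the dense vector for one word
print(model.wv.most_similar("movie"))  # words whose vectors sit nearby
```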
Follow me for valuable tips, ideas, and code snippets for Machine Learning & Deep Learning.