Table of Contents
- What is KNN?
- Basic Definition of K-nearest Neighbors:
- Usage in Machine Learning and Data Science:
- How Does KNN Work?
- Detailed Breakdown of the Algorithm:
- Distance Calculation:
- Selecting Neighbors:
- Distance Metrics in Depth:
- Applications of KNN:
- Advantages and Disadvantages:
- Detecting Credit Card Fraud with K-Nearest Neighbors and Python
- Dataset Loading and Previewing:
- Data Preparation:
- Data Scaling:
- Model Training and Evaluation:
- Confusion Matrix Visualization:
- Conclusion:
What is KNN?
Basic Definition of K-nearest Neighbors:
The K-nearest neighbors (KNN) algorithm is a versatile and intuitive method rooted in the principle that items in a particular category or class will likely be found near other items in that same category or class. At its core, KNN is about leveraging the power of proximity, utilizing the idea that similar data points will cluster together.
Picture a scatter plot of labeled points with one new, unlabeled point dropped into it: if most of the 5 points closest to the new point belong to a particular class, the new point is assigned to that class. That snapshot is effectively how the algorithm works: it surveys the data landscape, identifies the new point's closest neighbors, and through these neighbors gets a sense of how the new point should be classified.
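To make the idea concrete, here is a minimal sketch that finds a new point's 5 nearest labeled neighbors and takes a majority vote. The two clusters and the query point are made up purely for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors
# Two made-up clusters of labeled points (illustrative data only)
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))   # class 0
class_b = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))   # class 1
X = np.vstack([class_a, class_b])
y = np.array([0] * 20 + [1] * 20)
# A new, unlabeled point whose class we want to infer
new_point = np.array([[1.6, 1.4]])
# Find the 5 nearest labeled points and take a majority vote
nn = NearestNeighbors(n_neighbors=5).fit(X)
distances, indices = nn.kneighbors(new_point)
neighbor_labels = y[indices[0]]
predicted_class = np.bincount(neighbor_labels).argmax()
print("Neighbor labels:", neighbor_labels, "-> predicted class:", predicted_class)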
Usage in Machine Learning and Data Science:
KNN holds a significant place in the field of machine learning and data science, primarily for its simplicity and effectiveness. While it's predominantly known for its applications in supervised learning, particularly classification, it has seen use in semi-supervised scenarios where data might be partly labeled. One of the standout aspects of KNN is its ability to adapt to non-linear decision boundaries, making it versatile when dealing with complex datasets.
How Does KNN Work?
Detailed Breakdown of the Algorithm:
Choosing 'k': Begin by deciding the number of neighbors, 'k'. The choice of 'k' is crucial. A smaller 'k' can be noisy and susceptible to outliers, whereas a large 'k' can smooth the decision boundaries excessively.
Distance Calculation:
For a new input data point, the algorithm computes its distance to every other point in the dataset. This step is pivotal in identifying its neighbors.
Selecting Neighbors:
The 'k' nearest data points to the new point, based on the calculated distances, are selected. These will be instrumental in making the final prediction.
Making the Prediction:
For classification tasks, the mode (most frequent class) of the 'k' selected points becomes the predicted class for the new data point. For regression tasks, a mean or median of the 'k' points provides the prediction.
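Putting the four steps together, here is a small from-scratch sketch using plain NumPy with Euclidean distance; the toy training points are made up for illustration.
import numpy as np
from collections import Counter
def knn_predict(X_toy, y_toy, x_new, k=5, task="classification"):
    # Step 2: distance from the new point to every training point
    distances = np.linalg.norm(X_toy - x_new, axis=1)
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote for classification, mean for regression
    if task == "classification":
        return Counter(y_toy[nearest]).most_common(1)[0][0]
    return y_toy[nearest].mean()
# Toy example (illustrative values only)
X_toy = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([5.5, 6.0]), k=3))  # expected: 1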
Distance Metrics in Depth:
The choice of a distance metric can greatly influence the algorithm's outcome:
Euclidean Distance:
Derived from the Pythagorean theorem, it captures the straight-line distance between two points in space.
Manhattan Distance:
Imagine navigating a grid-based path, like city blocks. This metric sums the absolute differences between two points along each dimension, i.e., their total vertical and horizontal distance.
Minkowski Distance:
An encompassing metric that generalizes both Euclidean and Manhattan distances through a parameter p: p = 2 gives Euclidean distance and p = 1 gives Manhattan distance.
Hamming Distance:
Especially useful for datasets with categorical variables, it counts the number of positions at which two strings (or vectors) of equal length differ. Several other metrics, such as cosine similarity and the Jaccard index, can also be employed, depending on the problem domain.
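As a quick illustration, SciPy exposes each of these metrics directly; the vectors below are made up, and note that SciPy's hamming returns the fraction of differing positions rather than a raw count.
from scipy.spatial import distance
a, b = [1, 2, 3], [4, 6, 3]
print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan (grid) distance
print(distance.minkowski(a, b, p=3))   # Minkowski with p = 3
print(distance.hamming([1, 0, 1, 1], [1, 1, 1, 0]))  # fraction of differing positions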
Applications of KNN:
Classification: KNN's poster child application. It determines the category of a new data point by examining its neighbors. For instance, in healthcare, it could classify whether a tumor is malignant or benign based on the characteristics of nearby tumors.
Regression: Venturing beyond categorical outcomes, KNN can predict continuous values. In real estate, it could forecast house prices based on the prices of neighboring properties (see the short sketch below).
Recommendation Systems: By leveraging user behaviors, KNN can be instrumental in collaborative filtering, suggesting products or media by finding users with similar tastes.
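For the regression use case mentioned above, here is a minimal sketch with scikit-learn's KNeighborsRegressor; the house sizes and prices are made up for illustration.
from sklearn.neighbors import KNeighborsRegressor
# Made-up training data: house size in square metres -> price
sizes = [[50], [60], [80], [100], [120]]
prices = [150_000, 180_000, 240_000, 300_000, 360_000]
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(sizes, prices)
# The prediction for a 90 m2 house is the mean price of its 3 nearest neighbors
print(reg.predict([[90]]))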
Advantages and Disadvantages:
Pros:
Straightforward Implementation: Newcomers to machine learning often appreciate KNN for its ease of understanding and execution.
Data Agnostic: Unlike certain algorithms that make strong assumptions about data distribution, KNN is more flexible, making it handy for real-world, messy data.
Adaptability: Its ability to handle both classification and regression makes it a tool of choice for various problems.
Cons:
Performance Issues: With large datasets, KNN can be a computational hog, requiring substantial time and resources.
Feature Sensitivity: KNN can stumble if presented with irrelevant or redundant features. Rigorous feature selection or dimensionality reduction often becomes necessary.
Normalization Needs: Different scales among features can mislead the algorithm. Normalizing or standardizing data becomes a prerequisite.
Struggles with High Dimensionality: The curse of dimensionality means that as features grow, distances between data points tend to converge, making KNN's decisions less reliable.
Detecting Credit Card Fraud with K-Nearest Neighbors and Python
Credit card fraud detection is a critical component of financial security. This post walks you through building a model to detect potentially fraudulent transactions using the K-Nearest Neighbors algorithm in Python, on a sample credit card transactions dataset.
1. Dataset Loading and Previewing:
Start by importing necessary libraries and loading the dataset. For demonstration, we are using a sample credit card dataset.
import pandas as pd
# Load the Dataset
file_path = '/content/creditcard.csv'
data = pd.read_csv(file_path)
# Display the first 5 rows of the dataset
print(data.head())
The dataset under consideration encompasses transactions executed through credit cards by cardholders in Europe. This compilation of data reveals transactions spanning two days, featuring 492 fraudulent transactions among a total of 284,807 transactions. The imbalanced nature of the dataset is evident, with the fraudulent or positive class transactions constituting merely 0.172% of the total transactions.
The dataset exclusively comprises numerical input variables, derived from a PCA (Principal Component Analysis) transformation. Due to stringent confidentiality constraints, the disclosure of original features and additional background details about the data is restricted. The features labeled V1, V2, …, V28 are the principal components extracted through PCA. The exceptions to PCA transformation within this dataset are the 'Time' and 'Amount' features. The 'Time' feature enumerates the seconds that have elapsed between each transaction and the first transaction in the dataset. The 'Amount' feature represents the transaction amount and can be employed for instance-dependent cost-sensitive learning. The 'Class' feature serves as the response variable, assuming the value 1 for fraudulent cases and 0 otherwise.
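A quick way to confirm this imbalance, assuming the file has loaded into data as above:
# Inspect the class distribution; 'Class' is 1 for fraud and 0 otherwise
print(data['Class'].value_counts())
print(data['Class'].value_counts(normalize=True) * 100)  # as percentages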
2. Data Preparation:
Before training our model, we need to preprocess the data. We'll remove any rows with missing target values and split the dataset into training and testing sets.
# Drop rows where 'Class' is NaN
data = data.dropna(subset=['Class'])
from sklearn.model_selection import train_test_split
# Define features (X) and target (y)
X = data.drop('Class', axis=1)
y = data['Class']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
3. Data Scaling:
Scaling the features is crucial when working with K-NN as it is distance-based.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
4. Model Training and Evaluation:
We will use the K-Nearest Neighbors classifier, train it on our training data, and evaluate its performance using accuracy and a classification report.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Print accuracy score
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
Classification Report
Class 0.0 (Non-fraudulent Transactions)
· Precision: 1.00 (100%) — Out of all the transactions predicted as non-fraudulent, 100% are actually non-fraudulent.
· Recall: 1.00 (100%) — Out of all the actual non-fraudulent transactions, 100% were correctly predicted by the model.
· F1-score: 1.00 (100%) — Harmonic mean of precision and recall for non-fraudulent transactions.
Class 1.0 (Fraudulent Transactions)
· Precision: 0.87 (87%) — Out of all the transactions predicted as fraudulent, 87% are actually fraudulent.
· Recall: 0.95 (95%) — Out of all the actual fraudulent transactions, 95% were correctly predicted by the model.
· F1-score: 0.91 (91%) — Harmonic mean of precision and recall for fraudulent transactions.
Accuracy Score
The accuracy of the model is 0.999570, suggesting that it is highly accurate in classifying transactions as fraudulent or non-fraudulent. However, in imbalanced datasets like this one, accuracy may not be the best metric for evaluating model performance, and one should also look at precision, recall, and F1-score.
Interpretation
This model performs exceptionally well in identifying non-fraudulent transactions, with perfect scores in precision, recall, and F1-score. It also performs well in identifying fraudulent transactions, with high precision and recall. In other words, the model rarely misclassifies non-fraudulent transactions as fraudulent (low false positive rate) and catches most actual fraudulent transactions (high true positive rate). Even so, given the severe consequences of missing a fraudulent transaction, there is always room to improve the model, especially in reducing false negatives (actual frauds that the model fails to catch).
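One possible next step, not part of the pipeline above, is to tune 'k' with cross-validation while scoring on recall for the fraud class rather than accuracy. A rough sketch follows; note that a grid search over the full training set can be slow.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Search a few values of k, optimizing recall on the fraudulent (positive) class
param_grid = {'n_neighbors': [3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='recall', cv=3)
search.fit(X_train, y_train)
print("Best k:", search.best_params_, "cross-validated recall:", search.best_score_)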
5. Confusion Matrix Visualization:
Visualizing the confusion matrix helps in understanding the model's performance more intuitively.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
A confusion matrix is used in classification to understand the performance of an algorithm, typically a supervised learning one. For binary classification, it is a table of the four possible combinations of predicted and actual values:
True Positive (TP): Model correctly predicted the Positive class as Positive.
True Negative (TN): Model correctly predicted the Negative class as Negative.
False Positive (FP): Model incorrectly predicted the Negative class as Positive.
False Negative (FN): Model incorrectly predicted the Positive class as Negative.
For the above confusion matrix:
True Negative (TN): 18575
Interpretation: 18,575 transactions were correctly classified as non-fraudulent by the model. This is the number of True Negatives.
False Positive (FP): 6
Interpretation: 6 transactions were wrongly classified as fraudulent by the model. This is a Type I error, and these are instances where the actual class is Negative (non-fraudulent), but the predicted class is Positive (fraudulent).
True Positive (TP): 41
Interpretation: 41 transactions were correctly classified as fraudulent by the model. This is the number of True Positives.
False Negative (FN): 2
Interpretation: 2 transactions were wrongly classified as non-fraudulent by the model. This is a Type II error, and these are instances where the actual class is Positive (fraudulent), but the predicted class is Negative (non-fraudulent).
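The four counts can also be read programmatically from the matrix computed above; for binary labels 0/1, scikit-learn orders the cells as TN, FP, FN, TP.
# Unpack the 2x2 confusion matrix into its four cells
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")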
Conclusion:
The K-Nearest Neighbors (KNN) algorithm, despite its simplicity, has proven to be a formidable method for classification tasks, including those that are intricate and critical, such as fraud detection in credit card transactions. The core principle of the algorithm is to classify a data point based on the majority class of its 'k' nearest neighbors, where 'k' is a user-defined number. This seemingly simple rule allows the KNN algorithm to adapt and be effective in a variety of scenarios and datasets. In the context of our endeavor to detect fraudulent transactions within a dataset predominantly comprised of non-fraudulent transactions, KNN exhibited significant proficiency. It was able to discern between fraudulent and legitimate transactions with remarkable precision and recall, ensuring that true fraudulent cases are identified while minimizing the risk of false alarms.
The K-Nearest Neighbors (KNN) algorithm is also suitably applied in areas like healthcare fraud, where it can detect patterns correlating to fraudulent claims and billing, and in cybersecurity, to identify malicious activities or intrusions by analyzing network traffic patterns and behaviors. In eCommerce, KNN helps in identifying fraudulent purchase transactions and reviewing customer behaviors to flag potentially deceptive actions.
To access and run the Python code yourself, visit the Pyfi GitHub.
Written by Numan Yaqoob, PhD candidate