Fraud Detection Intelligence

Bank Transaction
Analytics Report

Exploratory data analysis of a bank transaction dataset of 50,000 records, aimed at uncovering patterns and anomaly signals

Total Transactions
50,000
Avg $297.87 per transaction • $14.8M volume
Accounts
495
~101 transactions per account
Date Range
6 years
Jan 2020 — Dec 2025
Transaction Locations
43
cities across the United States
00 — Executive Summary
Pipeline at a Glance
End-to-end unsupervised fraud detection — from raw data to actionable risk scores
Transactions Analyzed
50,000
$14.8M total volume
ML Models
3
ISO Forest + DBSCAN + LOF
Expert Rules
7
94.7% recall on ML anomalies
Critical Alerts
24
0.05% highest risk
Hybrid Risk Distribution
Final Output
Top Risk Factors
Key Drivers
01 — Data Quality
Data Quality Overview
Dataset structure and completeness assessment
0
Missing Values
0
Duplicate Rows
15
Columns
2020–2025
Date Range
Transaction Basics
6 Fields
TransactionID: Unique Key
AccountID: 495 unique
TransactionAmount: Numeric
TransactionDate: Datetime
TransactionType: 2 types
AccountBalance: Numeric
Behavioral & Technical
7 Fields
TransactionDuration: Seconds
Channel: 3 types
Location: 43 cities
DeviceID: 681 unique
IP Address: Text
MerchantID: 100 unique
LoginAttempts: 1–5
Customer Demographics
2 Fields
CustomerAge: 18–80 yrs
CustomerOccupation: 4 types
Data Quality Verdict

No missing values or duplicates detected. The dataset is clean and ready for ML pipeline processing.
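A check of this kind can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library; the `quality_report` helper and the sample rows (including their IDs and amounts) are invented for the example and are not taken from the dataset.

```python
from collections import Counter


def quality_report(rows):
    """Count missing values and fully duplicated rows in a list of dict records."""
    # Assumption: a value counts as missing when it is None or an empty string.
    missing = sum(1 for row in rows for value in row.values() if value in (None, ""))
    # A row is a duplicate when every field matches an earlier row.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing_values": missing, "duplicate_rows": duplicates}


# Illustrative rows only (hypothetical values).
sample = [
    {"TransactionID": "TX000001", "AccountID": "AC00128", "TransactionAmount": 14.09},
    {"TransactionID": "TX000002", "AccountID": "AC00455", "TransactionAmount": 376.24},
    {"TransactionID": "TX000002", "AccountID": "AC00455", "TransactionAmount": 376.24},
    {"TransactionID": "TX000003", "AccountID": "", "TransactionAmount": 126.29},
]
report = quality_report(sample)
print(report)
```

On the real dataset both counters would be expected to come back as 0, matching the verdict above.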

02 — Transaction Overview
Transaction Patterns
Distribution and composition of banking transactions
Transaction Amount Distribution
Key Metric
Average
$297.87
Median
$209.36
Min
$0.24
Std Dev
$292.82
Max
$2,060
Insight

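The mean ($297.87) exceeds the median ($209.36), which indicates a right-skewed distribution: a small number of high-value transactions pull the average up. The standard deviation of $292.82 is nearly as large as the mean, showing wide dispersion.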
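One way to put a number on the skew is Pearson's median skewness coefficient, 3 × (mean − median) / standard deviation. The sketch below applies it to the summary statistics quoted in this section; the choice of this particular coefficient is an assumption made for illustration, not a method the report states it used.

```python
def pearson_median_skewness(mean, median, std_dev):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std_dev."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return 3 * (mean - median) / std_dev


# Summary statistics as quoted in the report.
skewness = pearson_median_skewness(mean=297.87, median=209.36, std_dev=292.82)
print(f"Pearson median skewness: {skewness:.2f}")  # a positive value means right-skewed
```

A value around 0.9 is clearly positive, which is consistent with the right-skew reading.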

Transaction Type Breakdown
Imbalanced
Debit 77.5%
Credit 22.5%
Debit : Credit Ratio
3.4 : 1
Insight

Debit transactions outnumber Credit by 3.4x. Fraud typically targets debit-side operations, making this imbalance critical for model design.

Channel Distribution
Balanced
Branch: 17,278 (34.6%)
ATM: 16,552 (33.1%)
Online: 16,170 (32.3%)
Insight

All three channels are evenly distributed (~33% each). Fraud monitoring must cover all channels equally.

Customer Occupation
Demographics
Student: 13,059 (26.1%)
Doctor: 12,578 (25.2%)
Engineer: 12,491 (25.0%)
Retired: 11,872 (23.7%)
03 — Behavioral Analysis
Behavioral Patterns
Time patterns, device usage, and variable relationships
Transactions by Day of Week
Time Pattern
Time Distribution Warning
Data Issue
95% of transactions recorded at Hour = 0 (midnight)
Hour = 0: 47,488
Other Hours: 2,512
Warning

Most records default to 00:00 — likely date-only entries without timestamps. Hour-based features should be used with caution in modeling.
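The timestamp check can be reproduced with a short sketch like the one below. It separates records in hour 0 from records stamped exactly 00:00:00, since only the latter look like date-only entries. The helper name, the timestamp format, and the sample values are assumptions for illustration.

```python
from datetime import datetime


def hour_zero_stats(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Summarise how many timestamps fall in hour 0 and how many are exactly midnight."""
    if not timestamps:
        raise ValueError("timestamps must not be empty")
    parsed = [datetime.strptime(ts, fmt) for ts in timestamps]
    hour_zero = sum(1 for dt in parsed if dt.hour == 0)
    # Exactly 00:00:00 suggests a date-only value that was padded with a default time.
    exact_midnight = sum(
        1 for dt in parsed if (dt.hour, dt.minute, dt.second) == (0, 0, 0)
    )
    return {
        "total": len(parsed),
        "hour_zero": hour_zero,
        "exact_midnight": exact_midnight,
        "hour_zero_share": hour_zero / len(parsed),
    }


# Hypothetical sample values.
sample_timestamps = [
    "2023-04-11 00:00:00",
    "2023-06-27 00:00:00",
    "2023-07-10 16:08:08",
    "2024-01-05 00:17:42",
]
stats = hour_zero_stats(sample_timestamps)
print(stats)

# The counts reported above give the 95% headline figure.
reported_share = 47_488 / 50_000
print(f"{reported_share:.1%}")
```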

Age vs Balance Relationship
Demographics
Average Balance
$5,122
Max Balance
$14,978
Correlation Matrix
Relationships
              Amount   Balance  Duration     Login       Age
Amount          1.00     -0.02      0.01     -0.02     -0.02
Balance        -0.02      1.00      0.01      0.01      0.32
Duration        0.01      0.01      1.00      0.03     -0.02
Login          -0.02      0.01      0.03      1.00      0.01
Age            -0.02      0.32     -0.02      0.01      1.00
Key Finding

Age vs Balance (r=0.32) is the only meaningful correlation. Older customers tend to have higher balances. Low inter-feature correlation is beneficial for ML — each feature contributes unique information.
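For readers who want to recompute a matrix like this, the following is a minimal sketch of the Pearson correlation in plain Python. The `toy` columns are invented values that only demonstrate the mechanics; they do not reproduce the r = 0.32 reported above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data, not drawn from the dataset.
toy = {
    "CustomerAge": [22, 35, 47, 58, 71],
    "AccountBalance": [900.0, 4200.0, 3100.0, 8800.0, 7600.0],
    "TransactionAmount": [120.5, 42.0, 610.0, 88.3, 305.9],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```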

04 — Anomaly Detection
Fraud Risk Indicators
Signals that may indicate fraudulent transaction activity
High Amount Outliers
2,375 transactions (4.75%) exceed IQR upper bound of $899.72. Maximum amount reaches $2,060.
Suspicious Login Attempts
1,924 transactions (3.85%) required 3–5 login attempts. Frequency increases with attempts — potential automated attack pattern.
Device Sharing — Critical
609 devices are shared across multiple accounts. One device serves up to 9 distinct accounts — strong indicator of account takeover.
Multi-Location Activity
266 accounts operate across 5+ cities. 428 accounts use 3+ devices — potential compromised account indicators.
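The amount-outlier indicator relies on an IQR upper bound. The report does not state the multiplier, so the sketch below assumes the conventional Tukey fence, Q3 + 1.5 × IQR, with linearly interpolated quartiles. The sample amounts are invented for illustration.

```python
import statistics


def iqr_upper_fence(values, k=1.5):
    """Upper Tukey fence: Q3 + k * (Q3 - Q1). The default k=1.5 is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Hypothetical amounts; only the last one is far from the rest.
amounts = [10, 20, 30, 40, 50, 60, 70, 80, 900]
fence = iqr_upper_fence(amounts)
outliers = [amount for amount in amounts if amount > fence]
print(f"upper fence = {fence}, outliers = {outliers}")
```

Applied to the full amount column, a rule of this shape would yield a single threshold comparable to the $899.72 bound quoted above, provided the same quartile method and multiplier were used.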
Login Attempts Distribution
Security
Anomalous Pattern

95.1% succeed on first attempt. Unusually, transactions increase from 3 to 5 attempts (619 → 645 → 660) instead of decreasing — suggests automated brute-force behavior.

Risk Summary Dashboard
Risk Matrix
Device Sharing: 89.4% of devices
Timestamp Quality: 95% missing time
Multi-Location Accounts: 53.7% of accounts
Amount Outliers: 4.75%
High Login Attempts (3+): 3.85%
Overall Assessment

Multiple fraud signals detected across dimensions. Device sharing is the most severe risk (89.4%). A composite risk score combining all signals is recommended.
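The device-sharing figure comes down to counting distinct accounts per device. A minimal sketch, assuming each transaction record carries `AccountID` and `DeviceID` fields as in the schema above (the sample records themselves are invented):

```python
from collections import defaultdict


def accounts_per_device(transactions):
    """Map each DeviceID to the number of distinct AccountIDs seen on it."""
    accounts = defaultdict(set)
    for txn in transactions:
        accounts[txn["DeviceID"]].add(txn["AccountID"])
    return {device: len(ids) for device, ids in accounts.items()}


# Hypothetical records.
transactions = [
    {"AccountID": "A1", "DeviceID": "D1"},
    {"AccountID": "A2", "DeviceID": "D1"},
    {"AccountID": "A1", "DeviceID": "D1"},
    {"AccountID": "A3", "DeviceID": "D2"},
    {"AccountID": "A1", "DeviceID": "D3"},
    {"AccountID": "A3", "DeviceID": "D3"},
    {"AccountID": "A4", "DeviceID": "D3"},
]
counts = accounts_per_device(transactions)
shared = {device: n for device, n in counts.items() if n >= 2}
shared_share = len(shared) / len(counts)
print(shared, f"{shared_share:.1%} of devices are shared")

# The headline figure follows from the reported counts: 609 of 681 devices.
reported_pct = round(609 / 681 * 100, 1)
print(reported_pct)
```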

05 — Model Evaluation
Anomaly Detection Results
3 Models evaluated — Isolation Forest, DBSCAN, LOF — with composite risk scoring
Consensus Anomalies
190
flagged by 2+ models (0.4%)
All 3 Agree
18
highest confidence anomalies
High + Critical
129
risk score > 0.5
Accounts at Risk
15
of 495 accounts have High+ txns
Internal Validation Metrics
Unsupervised
Model              Anomalies       Silhouette   CH Index   DBI
Isolation Forest   2,500 (5.0%)    0.3923       2,208      2.70
DBSCAN             24 (0.05%)      0.5900       114        1.23
LOF                2,500 (5.0%)    0.0113       29         17.06
Best Model

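Isolation Forest has the highest Calinski-Harabasz index (2,208) and a far better Silhouette than LOF (0.3923 vs 0.0113), so of the two models that flag 5% of transactions it separates Normal from Anomaly most clearly. DBSCAN flags far fewer transactions but separates them most tightly (highest Silhouette, 0.5900; lowest DBI, 1.23).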

Model Agreement
Consensus
0 models (Normal): 45,184 (90.4%)
1 model only: 4,626 (9.3%)
2 models agree: 172 (0.3%)
All 3 models agree: 18 (0.04%)
Overlap

Jaccard similarity is low (ISO vs LOF = 0.038) → each model catches a different kind of anomaly, which is why combining them in an ensemble is effective.
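Both the agreement counts and the Jaccard comparison can be derived from the sets of transaction IDs each model flags. The sketch below shows the idea with small invented ID sets; the variable names (`iso_flags`, `dbscan_flags`, `lof_flags`) are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agreement_counts(all_ids, *flag_sets):
    """Count how many transactions were flagged by 0, 1, 2, ... models."""
    votes = Counter()
    for flagged in flag_sets:
        votes.update(set(flagged))
    return Counter(votes[txn_id] for txn_id in all_ids), votes


# Hypothetical flags over ten transactions.
all_ids = list(range(1, 11))
iso_flags = {1, 2, 3, 4}
dbscan_flags = {1}
lof_flags = {1, 4, 5, 9}

counts, votes = agreement_counts(all_ids, iso_flags, dbscan_flags, lof_flags)
consensus = sorted(txn_id for txn_id in all_ids if votes[txn_id] >= 2)
print(dict(counts), consensus, round(jaccard(iso_flags, lof_flags), 3))
```

In the report's terms, `consensus` corresponds to the transactions flagged by 2+ models.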

Key Risk Drivers
Impact
Factor               Risk Premium   Detail
Login Attempts 3+    +164.3%        Strongest single risk indicator
ATM Channel          Highest risk   Avg risk score 0.0851
Student Occupation   Highest risk   Avg risk score 0.0893
Amount-to-Balance    +527%          Anomaly ratio 6x higher than normal
Anomaly vs Normal Profile
Statistical (p<0.001)
Transaction Amount: +99.7% higher
Daily Txn Count: +79.0% higher
Login Attempts: +67.5% higher
Amount Z-Score: +3,704% higher
Unique Devices: +13.8% higher
All Significant

Every feature differs between anomalous and normal transactions with statistical significance (Mann-Whitney U test, p<0.001).
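To show what the test computes, here is a minimal pure-Python sketch of a two-sided Mann-Whitney U test. It uses the normal approximation without tie or continuity correction, which is a simplification chosen for brevity; a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would normally be preferred. The sample values are invented.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values at positions i..j-1 share the average of ranks i+1..j.
        avg_rank = (i + 1 + j) / 2
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, p


# Hypothetical transaction amounts for the two groups.
normal_amounts = [20.0, 35.5, 41.0, 58.2, 60.0, 75.4, 90.1, 110.0]
anomaly_amounts = [250.0, 410.5, 388.0, 720.3, 305.0, 950.0, 199.9, 640.0]
u_stat, p_value = mann_whitney_u(normal_amounts, anomaly_amounts)
print(u_stat, p_value)
```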

06 — Rule Engine & Hybrid Scoring
Expert Rules + ML Fusion
7 expert fraud rules combined with ML scores — Hybrid Risk Score captures 94.7% of ML-detected anomalies
Rule Recall
94.7%
rules catch ML-detected anomalies
Critical Risk
24
0.05% — highest risk transactions
High Risk
3,028
6.1% of all transactions
Rules Triggered
61.9%
transactions trigger ≥ 1 rule
7 Expert Fraud Rules
Rule Engine
Rule                          Trigger Rate   Severity
Login Attempts ≥ 3            3.85%          Critical
Device Shared ≥ 5 Accounts    45.92%         Critical
Amount Z-Score > 2            3.39%          High
Amount / Balance > 0.8        5.62%          High
Multi-Location (8+ Cities)    18.79%         High
Rapid Transaction (< 1hr)     2.41%          Medium
High Velocity (3+ txn/day)    0.12%          Medium
Top Trigger

Device Sharing triggers most often (45.92%) → a single device used by many accounts is a clear fraud signal.
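The seven rules can be expressed as simple predicates over per-transaction features. The sketch below uses the thresholds from the rule list; the feature names (`login_attempts`, `device_account_count`, and so on), the severity weights, and the way fired rules are normalised into a rule score are all assumptions for illustration, since the report does not specify them.

```python
# (rule name, severity, predicate over a dict of engineered features)
RULES = [
    ("Login Attempts >= 3", "Critical", lambda t: t["login_attempts"] >= 3),
    ("Device Shared >= 5 Accounts", "Critical", lambda t: t["device_account_count"] >= 5),
    ("Amount Z-Score > 2", "High", lambda t: t["amount_zscore"] > 2),
    ("Amount / Balance > 0.8", "High",
     lambda t: t["balance"] > 0 and t["amount"] / t["balance"] > 0.8),
    ("Multi-Location (8+ Cities)", "High", lambda t: t["account_city_count"] >= 8),
    ("Rapid Transaction (< 1hr)", "Medium",
     lambda t: t["minutes_since_prev_txn"] is not None and t["minutes_since_prev_txn"] < 60),
    ("High Velocity (3+ txn/day)", "Medium", lambda t: t["txns_same_day"] >= 3),
]

# Hypothetical weights; the report does not say how severities map to a score.
SEVERITY_WEIGHT = {"Critical": 3, "High": 2, "Medium": 1}


def evaluate_rules(txn):
    """Return the rules that fire and a rule score normalised to [0, 1]."""
    fired = [(name, severity) for name, severity, check in RULES if check(txn)]
    max_total = sum(SEVERITY_WEIGHT[severity] for _, severity, _ in RULES)
    score = sum(SEVERITY_WEIGHT[severity] for _, severity in fired) / max_total
    return fired, score


# Hypothetical transaction features.
txn = {
    "login_attempts": 4,
    "device_account_count": 2,
    "amount_zscore": 0.3,
    "amount": 900.0,
    "balance": 1000.0,
    "account_city_count": 3,
    "minutes_since_prev_txn": None,  # first transaction on the account
    "txns_same_day": 1,
}
fired, rule_score = evaluate_rules(txn)
print(fired, round(rule_score, 3))
```

Keeping each rule as a named predicate preserves the interpretability benefit: every score can be traced back to the specific rules that fired.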

Hybrid Risk Distribution
ML + Rules
Hybrid Score = 0.5 × ML Score + 0.5 × Rule Score
Low 44.7%
Medium 49.2%
High 6.1%
Critical 0.05%
Hybrid Approach

The Hybrid Score combines ML with rules → 94.7% recall on consensus anomalies, plus the interpretability of rule-based reasoning.
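The fusion formula itself is a plain weighted average. The sketch below implements Hybrid Score = 0.5 × ML Score + 0.5 × Rule Score and maps the result to a risk band. It assumes both inputs are already scaled to [0, 1], and the band cut-offs in `RISK_BANDS` are hypothetical placeholders, since the report does not list the thresholds it used.

```python
def hybrid_score(ml_score, rule_score, ml_weight=0.5):
    """Weighted average of an ML score and a rule score, both assumed to be in [0, 1]."""
    for value in (ml_score, rule_score, ml_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must lie in [0, 1]")
    return ml_weight * ml_score + (1.0 - ml_weight) * rule_score


# Hypothetical cut-offs, checked from the highest band down.
RISK_BANDS = [(0.75, "Critical"), (0.50, "High"), (0.25, "Medium"), (0.0, "Low")]


def risk_level(score):
    """Return the first band whose cut-off the score meets."""
    for cutoff, label in RISK_BANDS:
        if score >= cutoff:
            return label
    return "Low"


score = hybrid_score(ml_score=0.8, rule_score=0.4)
print(round(score, 2), risk_level(score))
```

With equal weights, neither the ML models nor the rules can push a transaction into the top band alone unless the other side contributes at least some signal.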

Full Pipeline Architecture
End-to-End
Raw Data (50K Transactions) → EDA (Data Quality) → Features (~30 Derived) → ML Models (ISO + DBSCAN + LOF) → Rule Engine (7 Expert Rules) → Hybrid Score (ML + Rules)
07 — Next Steps
Roadmap & Recommendations
Completed phases and strategic action plan for the fraud detection pipeline

Feature Engineering

~30 derived features: z-scores, velocity, device sharing, amount ratios, rapid txn flags

Anomaly Detection

3 models deployed: Isolation Forest, DBSCAN, LOF with composite risk scoring

Model Evaluation

Internal metrics, sensitivity analysis, feature importance, statistical significance tests

Rule Engine + Dashboard

7 expert rules, Hybrid Score (ML+Rules), Streamlit dashboard, executive report

Business Recommendations
Strategic
Pain Point          Evidence                                      Recommended Action                              Priority
Device Sharing      609 devices shared, max 9 accounts            Implement device fingerprinting & alerts        Critical
Brute Force Login   3.85% with 3+ attempts, increasing pattern    Rate limiting + CAPTCHA after 2 failures        Critical
Amount Anomalies    4.75% outlier transactions above $899         ML-based dynamic thresholds per account         High
Geo Anomalies       266 accounts active in 5+ cities              Location velocity check (impossible travel)     High
Data Quality        95% of timestamps default to 00:00            Improve data pipeline for time field capture    Medium