Exploratory Data Analysis of a dataset of 50,000 bank transactions, aimed at uncovering patterns and anomaly signals.
| Column | Description |
|---|---|
| TransactionID | Unique Key |
| AccountID | 495 unique |
| TransactionAmount | Numeric |
| TransactionDate | Datetime |
| TransactionType | 2 types |
| AccountBalance | Numeric |
| TransactionDuration | Seconds |
| Channel | 3 types |
| Location | 43 cities |
| DeviceID | 681 unique |
| IP Address | Text |
| MerchantID | 100 unique |
| LoginAttempts | 1–5 |
| CustomerAge | 18–80 yrs |
| CustomerOccupation | 4 types |
No missing values or duplicates detected. The dataset is clean and ready for ML pipeline processing.
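A check like the one described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the sample rows and the `quality_report` helper are hypothetical and are not taken from the original analysis.

```python
from collections import Counter

def quality_report(rows, key="TransactionID"):
    """Count missing cells per column and list key values that occur more than once."""
    missing = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None or value == "":
                missing[col] += 1
    key_counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return dict(missing), duplicates

# Hypothetical rows for illustration only (deliberately dirty).
rows = [
    {"TransactionID": "TX1", "TransactionAmount": 120.0},
    {"TransactionID": "TX2", "TransactionAmount": None},
    {"TransactionID": "TX2", "TransactionAmount": 35.5},
]
missing, duplicates = quality_report(rows)
print(missing, duplicates)
```

On a clean dataset both results would be empty, which is the situation the report describes.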
For TransactionAmount, the mean exceeds the median, which indicates a right-skewed distribution: a small number of high-value transactions pull the average up. A standard deviation of $292 shows significant dispersion.
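The mean-versus-median comparison can be reproduced with the standard `statistics` module. The amounts below are made up for illustration; only the comparison logic reflects the text.

```python
import statistics

def skew_summary(amounts):
    """Return mean, median, sample standard deviation and a simple right-skew flag."""
    mean = statistics.fmean(amounts)
    median = statistics.median(amounts)
    return {
        "mean": mean,
        "median": median,
        "stdev": statistics.stdev(amounts),
        "right_skewed": mean > median,  # heuristic: a long right tail lifts the mean
    }

# Hypothetical amounts: one large transaction dominates the mean.
amounts = [20.0, 35.0, 50.0, 80.0, 1500.0]
summary = skew_summary(amounts)
print(summary)
```

Because the mean is so sensitive to the tail, robust statistics (median, quantiles) are usually the safer baseline for amount thresholds.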
Debit transactions outnumber Credit transactions by 3.4x. Because fraud typically targets debit-side operations, this imbalance needs to be accounted for in model design.
All three channels are evenly distributed (~33% each). Fraud monitoring must cover all channels equally.
Most records (about 95%) carry a time of 00:00, which suggests date-only entries stored without a real timestamp. Hour-based features should be used with caution in modeling.
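One way to quantify this issue is to measure the share of timestamps that fall exactly on midnight. The sketch below assumes a `YYYY-MM-DD HH:MM:SS` string format, which the report does not state, and uses invented sample values.

```python
from datetime import datetime

def midnight_share(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Fraction of timestamps whose time component is exactly 00:00:00."""
    parsed = [datetime.strptime(t, fmt) for t in timestamps]
    at_midnight = sum(1 for d in parsed if (d.hour, d.minute, d.second) == (0, 0, 0))
    return at_midnight / len(parsed)

# Hypothetical sample; the format string is an assumption.
sample = [
    "2023-04-11 00:00:00",
    "2023-06-27 00:00:00",
    "2023-07-10 16:29:14",
    "2023-05-05 00:00:00",
]
share = midnight_share(sample)
print(f"{share:.0%} of timestamps are at midnight")
```

If the share is high, hour-of-day features mostly encode "time missing" rather than real behavior and could be replaced by a single has-time flag.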
| | Amount | Balance | Duration | Login | Age |
|---|---|---|---|---|---|
| Amount | 1.00 | -0.02 | 0.01 | -0.02 | -0.02 |
| Balance | -0.02 | 1.00 | 0.01 | 0.01 | 0.32 |
| Duration | 0.01 | 0.01 | 1.00 | 0.03 | -0.02 |
| Login | -0.02 | 0.01 | 0.03 | 1.00 | 0.01 |
| Age | -0.02 | 0.32 | -0.02 | 0.01 | 1.00 |
Age vs Balance (r=0.32) is the only meaningful correlation: older customers tend to have higher balances. All other pairwise correlations are near zero (|r| ≤ 0.03), so the features carry little linearly redundant information, which is beneficial for ML.
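The values in the matrix are presumably Pearson correlation coefficients (the report does not name the method). A minimal sketch of that computation follows, with hypothetical ages and balances.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not drawn from the dataset.
ages = [22, 35, 47, 58, 70]
balances = [1200.0, 900.0, 5200.0, 3100.0, 7400.0]
r = pearson_r(ages, balances)
print(f"r = {r:.2f}")
```

Pearson's r only measures linear association, so a near-zero value does not rule out a nonlinear relationship between two features.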
95.1% of transactions succeed on the first login attempt. Unusually, the transaction count rises as attempts go from 3 to 5 (619 → 645 → 660) instead of falling, which may indicate automated brute-force behavior.
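The arithmetic behind this observation can be checked directly from the reported counts. The sketch assumes the 50,000-transaction total stated at the top of the report is the denominator.

```python
# Counts for 3, 4 and 5 login attempts as reported above.
tail_counts = {3: 619, 4: 645, 5: 660}
total_transactions = 50_000  # dataset size stated in the report

counts = [tail_counts[k] for k in sorted(tail_counts)]
# A naturally decaying retry distribution would make this False.
is_increasing = all(a < b for a, b in zip(counts, counts[1:]))
share_3_plus = sum(counts) / total_transactions

print(f"increasing tail: {is_increasing}, share with 3+ attempts: {share_3_plus:.2%}")
```

Under that assumption the three counts add up to about 3.85% of all transactions, which matches the trigger rate reported for the Login Attempts ≥ 3 rule.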
Multiple fraud signals were detected across dimensions. Device sharing is the most severe risk: 89.4% of devices (609 of 681) are used by more than one account. A composite risk score combining all signals is recommended.
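Device sharing can be measured by counting the distinct accounts seen on each device. The sketch below is one possible way to do it; the `(DeviceID, AccountID)` pairs are invented and the `shared_devices` helper is hypothetical.

```python
from collections import defaultdict

def shared_devices(pairs, min_accounts=2):
    """Map each device used by at least `min_accounts` distinct accounts to that count."""
    accounts_by_device = defaultdict(set)
    for device_id, account_id in pairs:
        accounts_by_device[device_id].add(account_id)
    return {d: len(a) for d, a in accounts_by_device.items() if len(a) >= min_accounts}

# Hypothetical (DeviceID, AccountID) pairs, one per transaction.
pairs = [
    ("D1", "AC1"), ("D1", "AC2"), ("D2", "AC3"),
    ("D1", "AC1"), ("D3", "AC4"), ("D3", "AC5"),
]
shared = shared_devices(pairs)
share = len(shared) / len({device for device, _ in pairs})
print(shared, f"{share:.1%} of devices are shared")
```

The per-device account count produced here is also the kind of feature a "device shared by N or more accounts" rule would consume.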
| Model | Anomalies | Silhouette | CH Index | DBI |
|---|---|---|---|---|
| Isolation Forest | 2,500 (5.0%) | 0.3923 | 2,208 | 2.70 |
| DBSCAN | 24 (0.05%) | 0.5900 | 114 | 1.23 |
| LOF | 2,500 (5.0%) | 0.0113 | 29 | 17.06 |
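DBSCAN has the highest Silhouette score (0.5900) and the lowest DBI (1.23), but it flags only 24 transactions (0.05%): it catches few anomalies and separates them tightly. Isolation Forest has by far the highest Calinski-Harabasz index (2,208) and the better Silhouette of the two models that flag 5% of transactions, so it gives the clearest Normal/Anomaly separation at that volume.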
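For reference, the Silhouette column uses the standard silhouette coefficient. The report appears to treat the Normal and Anomaly labels as two clusters, which is an assumption on our part. For a point $i$, let $a(i)$ be its mean distance to the other points with the same label and $b(i)$ its mean distance to the points of the nearest other label:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

The reported score is the mean of $s(i)$ over all points. Higher Silhouette and higher Calinski-Harabasz values indicate better separation, while a lower Davies-Bouldin index (DBI) is better.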
Jaccard similarity between the models' anomaly sets is low (Isolation Forest vs LOF = 0.038), so each model catches a different kind of anomaly, which is why combining them in an ensemble is effective.
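Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch, with hypothetical transaction IDs standing in for each model's flagged set:

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical flagged transaction IDs from two detectors.
iso_flags = {"TX01", "TX02", "TX03", "TX04"}
lof_flags = {"TX04", "TX05", "TX06"}

score = jaccard(iso_flags, lof_flags)
consensus = iso_flags & lof_flags  # anomalies both models agree on
print(f"Jaccard = {score:.3f}, consensus = {sorted(consensus)}")
```

A low score means the flagged sets barely overlap, so the intersection (consensus) is small but high-confidence, while the union gives broader coverage.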
| Factor | Risk Premium | Detail |
|---|---|---|
| Login Attempts 3+ | +164.3% | Strongest single risk indicator |
| ATM Channel | Highest risk | Avg risk score 0.0851 |
| Student Occupation | Highest risk | Avg risk score 0.0893 |
| Amount-to-Balance | +527% | Anomaly ratio 6x higher than normal |
Every feature differs significantly between normal and anomalous transactions (Mann-Whitney U test, p < 0.001).
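The Mann-Whitney U test compares two groups by rank rather than by mean, which suits skewed features such as transaction amounts. The following is a simplified sketch using the normal approximation without tie or continuity correction; it is not necessarily the exact procedure behind the reported p-values, and the sample values are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U, two-sided p) via the normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p

# Hypothetical feature values for normal vs anomalous transactions.
normal = [12, 15, 18, 20, 22, 25, 27, 30]
anomalous = [95, 110, 130, 150, 175, 210, 240, 300]
u_stat, p_value = mann_whitney_u(normal, anomalous)
print(u_stat, p_value)
```

With tens of thousands of rows even tiny differences reach p < 0.001, so an effect size (for example U divided by n1·n2) is a useful complement to the p-value.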
| Rule | Trigger Rate | Severity |
|---|---|---|
| Login Attempts ≥ 3 | 3.85% | Critical |
| Device Shared ≥ 5 Accounts | 45.92% | Critical |
| Amount Z-Score > 2 | 3.39% | High |
| Amount / Balance > 0.8 | 5.62% | High |
| Multi-Location (8+ Cities) | 18.79% | High |
| Rapid Transaction (< 1hr) | 2.41% | Medium |
| High Velocity (3+ txn/day) | 0.12% | Medium |
Device Sharing triggers most often (45.92%): a single device used by many accounts is a clear fraud signal.
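The rule table above maps naturally onto a list of named predicates evaluated against precomputed features. This is a sketch only: the feature names (`login_attempts`, `accounts_on_device`, and so on) are hypothetical, and the thresholds are copied from the table.

```python
# Each rule: (name, severity, predicate over a feature dict). Feature names are assumed.
RULES = [
    ("Login Attempts >= 3", "Critical", lambda f: f["login_attempts"] >= 3),
    ("Device Shared >= 5 Accounts", "Critical", lambda f: f["accounts_on_device"] >= 5),
    ("Amount Z-Score > 2", "High", lambda f: f["amount_zscore"] > 2),
    ("Amount / Balance > 0.8", "High",
     lambda f: f["balance"] > 0 and f["amount"] / f["balance"] > 0.8),
    ("Multi-Location (8+ Cities)", "High", lambda f: f["cities_for_account"] >= 8),
    ("Rapid Transaction (< 1hr)", "Medium", lambda f: f["hours_since_prev_txn"] < 1),
    ("High Velocity (3+ txn/day)", "Medium", lambda f: f["txns_same_day"] >= 3),
]

def triggered_rules(features):
    """Return (name, severity) for every rule whose predicate holds."""
    return [(name, severity) for name, severity, check in RULES if check(features)]

# Hypothetical feature values for one transaction.
features = {
    "login_attempts": 4, "accounts_on_device": 2, "amount_zscore": 2.7,
    "amount": 900.0, "balance": 1000.0, "cities_for_account": 3,
    "hours_since_prev_txn": 30.0, "txns_same_day": 1,
}
hits = triggered_rules(features)
print(hits)
```

Keeping rules as data makes it straightforward to report a trigger rate per rule and to show an analyst exactly which rules fired for a given transaction.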
The Hybrid Score combines ML with rules, reaching 94.7% recall on consensus anomalies while keeping the interpretability of rule-based reasoning.
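The report does not give the Hybrid Score formula. One plausible form is a weighted blend of a normalized ML anomaly score and a rule-severity score; the sketch below illustrates that idea together with the recall calculation. The severity weights and the 50/50 blend are assumptions, not the author's values.

```python
# Assumed severity weights; the report does not specify them.
SEVERITY_WEIGHT = {"Critical": 1.0, "High": 0.6, "Medium": 0.3}

def hybrid_score(ml_score, severities, ml_weight=0.5):
    """Blend an ML score in [0, 1] with the strongest triggered rule severity."""
    rule_part = max((SEVERITY_WEIGHT[s] for s in severities), default=0.0)
    return ml_weight * ml_score + (1 - ml_weight) * rule_part

def recall(flagged, reference):
    """Share of reference anomalies that appear in the flagged set."""
    reference = set(reference)
    return len(set(flagged) & reference) / len(reference)

# Hypothetical transaction: ML score 0.8, two rules fired.
score = hybrid_score(0.8, ["High", "Medium"])

# Hypothetical IDs: hybrid alerts vs anomalies that the ML models agree on.
hybrid_alerts = {"TX1", "TX2", "TX9"}
consensus_anomalies = {"TX1", "TX2", "TX3"}
r = recall(hybrid_alerts, consensus_anomalies)
print(f"hybrid score = {score:.2f}, recall = {r:.1%}")
```

Taking the maximum severity keeps the score readable (one dominant reason per alert); summing weights would instead reward transactions that trip many minor rules.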
~30 derived features: z-scores, velocity, device sharing, amount ratios, rapid txn flags
3 models deployed: Isolation Forest, DBSCAN, LOF with composite risk scoring
Internal metrics, sensitivity analysis, feature importance, statistical significance tests
7 expert rules, Hybrid Score (ML+Rules), Streamlit dashboard, executive report
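As an illustration of one derived feature from the list above, the amount z-score can be computed against an account's own transaction history. This is a sketch with invented amounts; whether the original pipeline standardizes per account or globally is not stated.

```python
import statistics

def amount_zscores(amounts):
    """Z-score of each amount relative to the mean and population std of the list."""
    mean = statistics.fmean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return [0.0] * len(amounts)  # constant history: nothing stands out
    return [(a - mean) / stdev for a in amounts]

# Hypothetical history for a single account: nine routine payments and one spike.
history = [100.0] * 9 + [1000.0]
zscores = amount_zscores(history)
flags = [z > 2 for z in zscores]  # same threshold as the "Amount Z-Score > 2" rule
print([round(z, 2) for z in zscores], flags)
```

Z-scores are themselves sensitive to the outliers they are meant to detect, so a median/MAD variant may be worth considering for heavily skewed amounts.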
| Pain Point | Evidence | Recommended Action | Priority |
|---|---|---|---|
| Device Sharing | 609 devices shared, max 9 accounts | Implement device fingerprinting & alerts | Critical |
| Brute Force Login | 3.85% with 3+ attempts, increasing pattern | Rate limiting + CAPTCHA after 2 failures | Critical |
| Amount Anomalies | 4.75% outlier transactions above $899 | ML-based dynamic thresholds per account | High |
| Geo Anomalies | 266 accounts active in 5+ cities | Location velocity check (impossible travel) | High |
| Data Quality | 95% of timestamps default to 00:00 | Improve data pipeline for time field capture | Medium |
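The "impossible travel" check recommended for geo anomalies can be sketched as a speed test between consecutive transactions of the same account. The coordinates below are approximate values for New York and Los Angeles, and the 900 km/h threshold is an assumption (roughly airliner cruise speed), not a figure from the report.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_impossible_travel(prev, curr, max_speed_kmh=900.0):
    """True if moving between two (lat, lon, time) events would exceed max_speed_kmh."""
    (lat1, lon1, t1), (lat2, lon2, t2) = prev, curr
    hours = abs((t2 - t1).total_seconds()) / 3600
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        return distance > 0
    return distance / hours > max_speed_kmh

# Approximate city coordinates; timestamps are hypothetical.
new_york = (40.7128, -74.0060, datetime(2023, 7, 10, 9, 0))
los_angeles = (34.0522, -118.2437, datetime(2023, 7, 10, 11, 0))
distance = haversine_km(*new_york[:2], *los_angeles[:2])
suspicious = is_impossible_travel(new_york, los_angeles)
print(f"{distance:.0f} km in 2 h -> impossible travel: {suspicious}")
```

This check depends on reliable timestamps, so it is only as good as the time field: with most times recorded as 00:00, same-day transactions in different cities cannot be ordered or timed precisely.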