Why Regex Isn't Enough: The Hidden Dangers of PII Detection
As developers, our first instinct when asked to "find a credit card number in text" is to write a Regular Expression. We fire up Regex101, type \d{16}, and call it a day. But in the world of Data Loss Prevention (DLP), this naive approach is a recipe for disaster.
The Problem with Pure Regex
Regular Expressions are powerful pattern matchers, but they have no notion of whether the digits they match are arithmetically valid. A regex sees a run of 16 digits; it cannot tell a live Visa card number from a 16-digit order ID or a microsecond Unix timestamp.
⚠️ The False Positive Trap: If you scrub every 16-digit number, you will also redact values such as the microsecond timestamp 1704412800000000 or internal record IDs, making your logs far less useful for debugging.
A Common (Bad) Example
Consider this pattern, typical of what you find in Stack Overflow answers:
const lazyRegex = /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g;
✅ What it catches (Correctly):
4532-1234-5678-9014 (A well-formed card number with a valid check digit)
❌ What it also catches (Incorrectly):
1234-5678-1234-5678 (A generic test ID)
4444 4444 4444 4444 (A placeholder)
If you are building a PII scrubber, you need precision, not just pattern matching.
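To see the problem concretely, the short TypeScript sketch below runs the same pattern over three strings: a widely used Visa test number and the two look-alikes listed above. The names `samples` and `matched` are illustrative only.

```typescript
// Same pattern as above: four groups of four digits, optional "-" or " " between groups.
const lazyRegex = /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g;

// Illustrative inputs: one well-known test card number and two non-cards.
const samples: string[] = [
  "4111 1111 1111 1111",
  "1234-5678-1234-5678",
  "4444 4444 4444 4444",
];

// String.prototype.match with a /g regex returns every match, or null if none.
const matched: string[] = samples.filter(
  (s) => (s.match(lazyRegex) || []).length > 0
);

console.log(matched.length); // 3: the pattern cannot tell the card from the look-alikes
```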
Enter the Luhn Algorithm
Real credit card numbers (Visa, MasterCard, Amex) aren't random. Their final digit is a check digit computed with the Luhn Algorithm (also known as the Modulus 10 algorithm). This formula exists so that accidental typos, such as a single mistyped digit, can be detected.
If a sequence of numbers doesn't pass the Luhn check, it is NOT a valid credit card, regardless of what the regex says.
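One way to see what the check digit does: for a fixed prefix, only one of the ten possible final digits passes. The sketch below is an illustration under assumptions, not SafetyLayer's code; `passesLuhn` is a hypothetical helper and the 15-digit prefix is arbitrary.

```typescript
// Hypothetical helper: standard Luhn check over a digits-only string.
function passesLuhn(digits: string): boolean {
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false; // the rightmost digit is never doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits.charAt(i));
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

// Arbitrary 15-digit prefix chosen for illustration.
const prefix = "453212345678901";
const validEndings: string[] = [];
for (let last = 0; last <= 9; last++) {
  if (passesLuhn(prefix + String(last))) validEndings.push(String(last));
}

console.log(validEndings); // [ '4' ]
```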
The Algorithm Logic
- Drop the last digit (the check digit).
- Reverse the remaining digits.
- Multiply the digits in odd positions (1st, 3rd, 5th, ...) by 2.
- Subtract 9 from any result over 9.
- Add all the digits together.
- The sum plus the check digit must be divisible by 10.
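The steps above can be followed literally in TypeScript. This is a minimal sketch that assumes the input has already been reduced to digits; `luhnSteps` is a hypothetical name used only for this illustration.

```typescript
// Hypothetical helper that mirrors the listed steps one by one.
function luhnSteps(cardDigits: string): boolean {
  if (!/^\d{2,}$/.test(cardDigits)) return false;

  // 1. Drop the last digit (the check digit).
  const checkDigit = Number(cardDigits.charAt(cardDigits.length - 1));
  const payload = cardDigits.slice(0, -1);

  // 2. Reverse the remaining digits.
  const reversed = payload.split("").reverse();

  // 3-5. Double the digits in odd positions (1st, 3rd, ...), subtract 9
  //      from any result over 9, and add everything together.
  let total = 0;
  reversed.forEach((ch, idx) => {
    let d = Number(ch);
    if (idx % 2 === 0) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    total += d;
  });

  // 6. The total plus the check digit must be divisible by 10.
  return (total + checkDigit) % 10 === 0;
}

console.log(luhnSteps("4111111111111111")); // true
console.log(luhnSteps("4444444444444444")); // false
```

For the first input the doubled-and-summed payload comes to 29, and adding the check digit 1 gives 30, which is divisible by 10.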
How SafetyLayer Implements Validation
At SafetyLayer, we use a two-step verification process to avoid scrubbing data that isn't PII.
- Loose Regex Detection: We use a broad regex to find potential candidates (groups of 13-19 digits).
- Algorithmic Verification: We strip non-digits and run the Luhn check. Only if it passes do we scrub.
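A minimal sketch of how such a two-step flow might look. Everything named here (`candidatePattern`, `passesLuhn`, `scrubCards`, and the `[REDACTED_CARD]` placeholder) is hypothetical and chosen for illustration; it is not SafetyLayer's actual code.

```typescript
// Step 1 (hypothetical pattern): 13 to 19 digits, each optionally preceded by "-" or " ".
const candidatePattern = /\b\d(?:[ -]?\d){12,18}\b/g;

// Step 2 (hypothetical helper): standard Luhn check over a digits-only string.
function passesLuhn(digits: string): boolean {
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false; // the rightmost digit is never doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits.charAt(i));
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

// Replace a candidate only when its digits pass the Luhn check.
function scrubCards(text: string): string {
  return text.replace(candidatePattern, (candidate) => {
    const digitsOnly = candidate.replace(/\D/g, "");
    return passesLuhn(digitsOnly) ? "[REDACTED_CARD]" : candidate;
  });
}

const input = "card 4111-1111 1111-1111, order 1234-5678-1234-5678";
console.log(scrubCards(input));
// card [REDACTED_CARD], order 1234-5678-1234-5678
```

One caveat with a pattern this broad: a neighbouring number separated from the card by a single space can be absorbed into the candidate, so a production scrubber would need to handle that case.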
The Code
Here is a simplified version of the TypeScript logic we use in the browser:
// The "Luhn Check": mathematical validation of a digits-only string
export const luhnCheck = (val: string): boolean => {
  // Reject empty input and anything that is not purely digits
  if (!/^\d+$/.test(val)) {
    return false;
  }
  let checksum = 0;
  let multiplier = 1;
  // Process the string from right to left; the check digit itself is not doubled
  for (let i = val.length - 1; i >= 0; i--) {
    // Extract the digit and apply the current multiplier
    let calc = Number(val.charAt(i)) * multiplier;
    // If the result is double digits, sum them (e.g., 18 → 1+8=9)
    if (calc > 9) {
      calc -= 9;
    }
    checksum += calc;
    // Flip the multiplier (1 to 2, 2 to 1)
    multiplier = multiplier === 1 ? 2 : 1;
  }
  // Valid if divisible by 10
  return checksum % 10 === 0;
};
💡 Did you know? SafetyLayer performs this check locally in your browser in under 2 milliseconds.
Why This Matters for Security
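By combining Regex with Algorithmic Validation, we achieve two goals:
- Fewer False Negatives: We catch oddly formatted cards (e.g., 4111-1111 1111-1111).
- Far Fewer False Positives: We ignore most long server IDs or phone numbers that happen to look like cards. A random digit string still has about a 1-in-10 chance of passing the Luhn check, so the check filters out most look-alikes rather than every one.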
When you use SafetyLayer, you aren't just doing a "Find & Replace." You are running a sophisticated validation engine that ensures your data remains clean, usable, and safe.
Test it yourself: Paste a test credit card number (for example, 4111 1111 1111 1111) into the input box above. Change one digit, and watch the system ignore it because the math no longer adds up!
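The same experiment can be run in code. This sketch assumes a hypothetical `passesLuhn` helper and the test number 4111111111111111; it tries every possible single-digit substitution and counts how many still pass.

```typescript
// Hypothetical helper: standard Luhn check over a digits-only string.
function passesLuhn(digits: string): boolean {
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false; // the rightmost digit is never doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits.charAt(i));
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

const original = "4111111111111111";
let survivors = 0;
for (let pos = 0; pos < original.length; pos++) {
  for (let d = 0; d <= 9; d++) {
    const replacement = String(d);
    if (replacement === original.charAt(pos)) continue; // skip the unchanged digit
    const mutated =
      original.slice(0, pos) + replacement + original.slice(pos + 1);
    if (passesLuhn(mutated)) survivors++;
  }
}

console.log(passesLuhn(original), survivors); // true 0
```

Every single-digit change is caught. The Luhn check does not catch every kind of typo, though: swapping an adjacent 0 and 9, for instance, leaves the checksum unchanged.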