Skip to main content

Command Palette

Search for a command to run...

Data Cleaning in the Browser: Turning Web Tables into Analysis-Ready Data

Published
8 min read
Data Cleaning in the Browser: Turning Web Tables into Analysis-Ready Data

You've found the perfect data table on a website. You export it. You open it in Excel or load it into Pandas. And then the problems start.

  • Numbers are strings: "1,234,567" instead of 1234567

  • Decimals are inconsistent: some use ., others use ,

  • Dates are all over the place: 02/15/2024, 15-02-2024, 2024.02.15

  • Boolean values are text: "Yes", "No", "✓", "✗"

  • Null values are disguised: "-", "N/A", "—", "n/a", ""

You spend 30 minutes cleaning data before you can even start your analysis. Every. Single. Time.

When I built HTML Table Exporter, I decided to tackle this at the source. Clean the data during export, not after. Here's how.

The Challenge: International Number Formats

The same number is written differently around the world:

FormatRegionExample
1,234,567.89US, UKComma thousands, dot decimal
1.234.567,89Germany, Brazil, SpainDot thousands, comma decimal
1 234 567,89France, RussiaSpace thousands, comma decimal
12,34,567.89IndiaLakh/crore grouping

When you scrape a table, you don't know which format you're dealing with. And the same table might mix formats across columns.

The Heuristic

I developed a heuristic that correctly handles most real-world cases:

function normalizeNumberString(value) {
  if (typeof value !== "string") return value;
  const v = value.trim();
  if (!v) return value;

  // Only attempt if contains digits
  if (!/[0-9]/.test(v)) return value;

  // Step 1: Clean currency and percentage symbols
  let cleaned = v
    .replace(/ /g, " ")
    .replace(/^(USD|EUR|GBP|JPY|CHF|CAD|AUD|CNY|INR|BRL|R\$|MXN|KRW)\s*/i, "")
    .replace(/[$€£¥₹₽₩₪฿₫₴₦]/g, "")
    .replace(/%/g, "")
    .replace(/\s+/g, "");

  // Step 2: Handle negative signs
  const isNegative = cleaned.match(/^-/) ? true : false;
  if (isNegative) {
    cleaned = cleaned.substring(1).trim();
  }

  // Step 3: Simple case - digits only
  if (/^\d+$/.test(cleaned)) {
    return isNegative ? `-${cleaned}` : cleaned;
  }

  // Step 4: Detect separators
  const hasComma = cleaned.includes(",");
  const hasDot = cleaned.includes(".");

  if (!hasComma && !hasDot) {
    return isNegative ? `-${cleaned}` : cleaned;
  }

  // Step 5: The tricky part - which is decimal, which is thousands?
  let decimalSeparator = ".";
  let thousandsSeparator = ",";

  if (hasComma && hasDot) {
    // Both present: the LAST one is the decimal
    const lastCommaPos = cleaned.lastIndexOf(",");
    const lastDotPos = cleaned.lastIndexOf(".");

    if (lastCommaPos > lastDotPos) {
      // "1.234.567,89" → European format
      decimalSeparator = ",";
      thousandsSeparator = ".";
    }
    // else: "1,234,567.89" → US format (default)
  } else if (hasComma && !hasDot) {
    const parts = cleaned.split(",");

    if (parts.length === 2) {
      const afterComma = parts[1];
      // 1-2 digits after comma → decimal
      // 3+ digits after comma → thousands
      if (afterComma.length <= 2) {
        decimalSeparator = ",";
        thousandsSeparator = ".";
      } else {
        decimalSeparator = null;
        thousandsSeparator = ",";
      }
    } else {
      // Multiple commas → thousands separators
      decimalSeparator = null;
      thousandsSeparator = ",";
    }
  } else if (hasDot && !hasComma) {
    const parts = cleaned.split(".");

    if (parts.length === 2) {
      const afterDot = parts[1];

      if (afterDot.length <= 2) {
        // "1.23" → decimal
        decimalSeparator = ".";
      } else if (afterDot.length >= 4) {
        // "3.14159" → decimal (precision)
        decimalSeparator = ".";
      } else if (afterDot.length === 3) {
        // AMBIGUOUS: "1.234" 
        // In HTML tables, this is usually THOUSANDS
        decimalSeparator = null;
        thousandsSeparator = ".";
      }
    } else {
      // Multiple dots → thousands separators
      decimalSeparator = null;
      thousandsSeparator = ".";
    }
  }

  // Step 6: Remove thousands separator, normalize decimal
  let result = cleaned;
  if (thousandsSeparator) {
    result = result.split(thousandsSeparator).join("");
  }
  if (decimalSeparator && decimalSeparator !== ".") {
    result = result.replace(decimalSeparator, ".");
  }

  return isNegative ? `-${result}` : result;
}

Test Cases That Drove the Design

These are real values from tables I've scraped:

InputExpectedFormat
"1.234.567,89""1234567.89"LATAM/EU
"1,234,567.89""1234567.89"US/UK
"$ 1.200,50""1200.50"Brazilian Real
"€ 1 234,56""1234.56"French Euro
"12.5%""12.5"Percentage
"- 5.000""-5000"Negative thousands
"2.400""2400"Ambiguous → thousands
"3.14159""3.14159"Scientific decimal

The "2.400" case is the hardest. Is it 2.4 or 2,400? In the context of HTML tables (populations, prices, quantities), it's almost always 2,400. For scientific data, you'd want different behavior—which is why the PRO version lets you override this.

Boolean Normalization

Tables use all kinds of representations for true/false:

const COMMON_TRUE_VALUES = [
  "yes", "true", "1", "si", "sí", "oui", "ja", 
  "✓", "✔", "☑", "x", "active", "enabled"
];

const COMMON_FALSE_VALUES = [
  "no", "false", "0", "non", "nein",
  "✗", "✘", "☐", "-", "inactive", "disabled"
];

But there's a trap: numeric columns might contain "0" and "1" as actual numbers, not booleans.

function applyBooleanNormalization(rows, booleans) {
  if (!booleans?.true || !booleans?.false) return rows;

  const trueSet = new Set(booleans.true.map(v => String(v).toLowerCase().trim()));
  const falseSet = new Set(booleans.false.map(v => String(v).toLowerCase().trim()));

  return rows.map((row, rowIndex) => {
    if (rowIndex === 0) return row; // Don't touch header

    return row.map((cell) => {
      if (cell == null) return cell;
      const cellStr = String(cell).toLowerCase().trim();

      // CRITICAL: Skip purely numeric values
      // This prevents "0" → "false" in numeric columns like tax_rate
      if (/^-?\d+(\.\d+)?$/.test(cellStr)) {
        return cell;
      }

      if (trueSet.has(cellStr)) return "true";
      if (falseSet.has(cellStr)) return "false";

      return cell;
    });
  });
}

Null Value Detection

Every data source has its own way of saying "no data":

const COMMON_NULL_VALUES = [
  "", "-", "–", "—",           // Dashes
  "n/a", "N/A", "NA", "na",    // Not available
  "null", "NULL", "nil",       // Programming nulls
  "none", "None", "NONE",      // Python-style
  "#N/A", "#REF!", "#VALUE!",  // Excel errors
  ".", "..", "...",            // Placeholder dots
  "TBD", "TBA",                // To be determined
  "unknown", "Unknown"         // Unknown
];
function applyNullValues(rows, nullValues = []) {
  if (!nullValues?.length) return rows;

  const nullSet = new Set(
    nullValues.map(v => String(v).toLowerCase().trim())
  );

  return rows.map((row, rowIndex) => {
    if (rowIndex === 0) return row; // Don't touch header

    return row.map((cell) => {
      if (cell == null) return null;
      const cellStr = String(cell).toLowerCase().trim();
      return nullSet.has(cellStr) ? null : cell;
    });
  });
}

Type Inference

For SQL and typed formats, knowing the column type matters. I infer types by sampling the data:

function inferColumnType(columnValues, maxSamples = 100) {
  const nonNullValues = columnValues.filter(v => v != null && v !== "");
  const nullCount = columnValues.length - nonNullValues.length;

  if (nonNullValues.length === 0) {
    return { type: "string", confidence: 0, nullable: true };
  }

  const sample = nonNullValues.slice(0, maxSamples);

  const typeCounts = {
    integer: 0,
    number: 0,
    boolean: 0,
    date: 0,
    string: 0
  };

  for (const val of sample) {
    const type = inferValueType(val);
    if (type !== "null") {
      if (type === "integer") {
        typeCounts.integer++;
        typeCounts.number++; // Integer is also a number
      } else {
        typeCounts[type]++;
      }
    }
  }

  const total = sample.length;

  // Priority: BOOLEAN > DATE > INTEGER > NUMBER > STRING
  // A type is selected if 90%+ of values match

  if (typeCounts.boolean >= total * 0.9) {
    return { type: "boolean", confidence: typeCounts.boolean / total };
  }

  if (typeCounts.date >= total * 0.9) {
    return { type: "date", confidence: typeCounts.date / total };
  }

  // For numeric columns, use the most specific type
  if (typeCounts.number >= total * 0.9) {
    // If all numbers are integers, use integer
    if (typeCounts.integer >= total * 0.9) {
      return { type: "integer", confidence: typeCounts.integer / total };
    }
    return { type: "number", confidence: typeCounts.number / total };
  }

  return { type: "string", confidence: 1 };
}

Preset Profiles: One-Click Workflows

All these transformations are configurable. The PRO version includes preset profiles for common workflows:

For Pandas

const PANDAS_PROFILE = {
  cleaningPresets: {
    trimStrings: true,
    normalizeNumbers: true,
    nullValues: ["", "-", "N/A", "n/a", "NA", "null", "—"],
    booleans: {
      true: ["yes", "true", "1", "si", "sí"],
      false: ["no", "false", "0"]
    }
  },
  export: {
    format: "csv",
    delimiter: ",",
    namingCase: "snake"  // "Player Name" → "player_name"
  }
};

For Excel (European)

const EXCEL_EU_PROFILE = {
  cleaningPresets: {
    trimStrings: true,
    normalizeNumbers: false  // Keep European format for Excel
  },
  export: {
    format: "xlsx",
    delimiter: ";"  // European CSV uses semicolon
  }
};

For DuckDB/SQL

const DUCKDB_PROFILE = {
  cleaningPresets: {
    trimStrings: true,
    normalizeNumbers: true,
    nullValues: ["", "-", "N/A", "null"]
  },
  export: {
    format: "sql",
    namingCase: "snake",
    inferTypes: true  // Generate typed CREATE TABLE
  }
};

The Pipeline

Here's how all the cleaning steps chain together:

function applyCleaningPresets(tableInfo, presets) {
  let rows = cloneRows(tableInfo.rows);

  // 1. Trim whitespace (almost always wanted)
  if (presets.trimStrings) {
    rows = applyTrimStrings(rows);
  }

  // 2. Normalize null values (before other transformations)
  if (presets.nullValues?.length) {
    rows = applyNullValues(rows, presets.nullValues);
  }

  // 3. Normalize booleans (before numbers, to avoid 0/1 conflicts)
  if (presets.booleans) {
    rows = applyBooleanNormalization(rows, presets.booleans);
  }

  // 4. Normalize numbers (after nulls are handled)
  if (presets.normalizeNumbers) {
    rows = applyNumberNormalization(rows);
  }

  // 5. Normalize dates (if enabled)
  if (presets.normalizeDates) {
    rows = applyDateNormalization(rows);
  }

  return { ...tableInfo, rows };
}

Order matters! Null values must be converted before number normalization, otherwise "-" might be interpreted as a negative sign.

Why Do This in the Browser?

You might ask: why not clean data after export, using Pandas or R?

  1. Privacy. The data never leaves your browser. No upload to a server. No API calls. This matters for sensitive data (financial reports, HR tables, medical data).

  2. Consistency. Apply the same cleaning to every export, automatically. No forgetting a step.

  3. Speed. One click to export clean data. No post-processing script to run.

  4. Non-technical users. Your colleague who needs clean data but doesn't know Python can use the same workflow you do.

Try It

  1. Install HTML Table Exporter

  2. Find a table with messy data (financial sites are great for this)

  3. Click the extension, select the table

  4. In PRO: choose a cleaning profile or customize

  5. Export and see the difference

Your analysis starts with clean data, not cleaning data.


Learn more at gauchogrid.com/html-table-exporter or get it from the Chrome Web Store. What's the messiest data you've encountered in a web table? Share your horror stories in the comments.