The Table Extraction Edge Cases That Took Weeks to Solve

When I started building HTML Table Exporter, I thought table extraction would be straightforward. Iterate through rows, grab cell text, done.

Then I encountered real-world tables.

Wikipedia tables that duplicate columns horizontally. Sports statistics sites with two-level grouped headers. Financial tables with merged cells spanning multiple rows. Each edge case required its own detection and handling logic.

Here are the patterns that took the most time to solve.

Edge Case 1: Wikipedia's Horizontally Duplicated Tables

Some Wikipedia tables display the same column structure multiple times horizontally to save vertical space:

| Rank | Name    | Value | Rank | Name    | Value |
|------|---------|-------|------|---------|-------|
| 1    | Alpha   | 100   | 6    | Foxtrot | 50    |
| 2    | Bravo   | 95    | 7    | Golf    | 45    |
| 3    | Charlie | 90    | 8    | Hotel   | 40    |

Visually, this shows ranks 1-5 on the left and 6-10 on the right. But if you export it directly, you get 6 columns instead of 3, and the data relationships are broken.

The Detection

The pattern is recognizable:

Header row has repeated column names
The repetition is exact (same names in same order)
First column of each "section" is typically numeric (ranks)

function detectHorizontalDuplication(headerRow) {
  if (!headerRow || headerRow.length < 4) return null;
  
  const uniqueHeaders = [...new Set(headerRow.map(h => h.toLowerCase().trim()))];
  
  // If we have significantly fewer unique headers than total,
  // there might be duplication
  if (uniqueHeaders.length >= headerRow.length * 0.7) {
    return null; // Not enough repetition
  }
  
  // Try to find the repetition point
  for (let splitPoint = 2; splitPoint <= headerRow.length / 2; splitPoint++) {
    const leftHalf = headerRow.slice(0, splitPoint);
    const rightHalf = headerRow.slice(splitPoint, splitPoint * 2);
    
    // Check if patterns match
    const matches = leftHalf.every((h, i) => 
      h.toLowerCase().trim() === (rightHalf[i] || "").toLowerCase().trim()
    );
    
    if (matches && splitPoint * 2 <= headerRow.length) {
      return { splitPoint, sections: Math.floor(headerRow.length / splitPoint) };
    }
  }
  
  return null;
}

The Transformation

Once detected, transform the wide table into a tall one:

function transformHorizontallyDuplicatedTable(matrix, splitPoint) {
  const headerRow = matrix[0].slice(0, splitPoint);
  const newRows = [headerRow];
  
  for (let rowIdx = 1; rowIdx < matrix.length; rowIdx++) {
    const originalRow = matrix[rowIdx];
    
    // Extract each section as a separate row
    for (let section = 0; section < Math.floor(originalRow.length / splitPoint); section++) {
      const start = section * splitPoint;
      const sectionData = originalRow.slice(start, start + splitPoint);
      
      // Skip if section is empty
      if (sectionData.every(cell => !cell || !cell.trim())) continue;
      
      newRows.push(sectionData);
    }
  }
  
  return newRows;
}

Result: 6 columns become 3, and the data stacks vertically as it should.

Edge Case 2: FBREF's Grouped Column Headers

Sports statistics sites like FBREF use two-level headers:

|        |        | Playing Time      | Performance    |
| Player | Nation | MP | Starts | Min | Gls | Ast     |

The first row contains group names that span multiple sub-columns. The second row contains the actual column headers.

The Challenge

After extracting the matrix (with colspan expansion), the group row looks like:

["", "", "Playing Time", "Playing Time", "Playing Time", "Performance", "Performance"]

The values repeat because colspan="3" becomes three cells with the same value.

The Detection

Group header rows have specific characteristics:

Empty cells at the beginning (columns without groups)
Multiple unique non-empty values (unlike title rows which have one)
High ratio of consecutive repeated values
Next row has more unique values (the actual headers)

function detectGroupHeaderRow(row, nextRow) {
  if (!row || !nextRow || row.length < 4) return false;
  
  // Must have empty first cell (ungrouped columns like Player, Nation)
  const firstCellEmpty = !(row[0] || "").trim();
  if (!firstCellEmpty) return false;
  
  // Count unique non-empty values
  const uniqueValues = new Set(
    row.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );
  
  // Title rows have ONE unique value; group rows have MULTIPLE
  if (uniqueValues.size <= 1) return false;
  
  // Check for consecutive repeats (colspan expansion signature)
  let consecutiveRepeats = 0;
  for (let i = 1; i < row.length; i++) {
    if ((row[i] || "").trim() === (row[i-1] || "").trim()) {
      consecutiveRepeats++;
    }
  }
  
  const repeatRatio = consecutiveRepeats / (row.length - 1);
  return repeatRatio > 0.3;
}

The Merge

Combine group and sub-header rows into composite names:

function mergeGroupAndSubHeaders(groupRow, subHeaderRow) {
  return subHeaderRow.map((subHeader, index) => {
    const group = (groupRow[index] || "").trim();
    const sub = (subHeader || "").trim();
    
    if (!group) return sub;  // No group for this column
    if (!sub) return group;  // No sub-header (shouldn't happen)
    
    return `\({group}: \){sub}`;  // "Playing Time: MP"
  });
}

Result: Clear, hierarchical column names like "Playing Time: MP", "Performance: Gls".

Wikipedia templates often include navigation links at the start of titles:

v t e World Heritage Sites in France

The "v t e" stands for "view | talk | edit" and links to template management. It's noise in exported data.

The Patterns

The prefix appears in several formats:

v t e (space-separated)
v | t | e (pipe-separated)
[v] [t] [e] (bracket-separated)

function cleanWikipediaNavPrefix(text) {
  if (!text) return text;
  
  return text
    .replace(/^\s*v\s+t\s+e\s+/i, "")
    .replace(/^\s*v\s*\|\s*t\s*\|\s*e\s+/i, "")
    .replace(/^\s*\[v\]\s*\[t\]\s*\[e\]\s+/i, "")
    .trim();
}

This runs on table names extracted from captions or title rows.

Edge Case 4: Multi-Level Headers with Units

Some tables have headers spanning multiple rows to show units or categories:

|       | Area        | Population |
|       | km² | mi²   | millions   |
| USA   | 9.8 | 3.8   | 331        |

The first header row shows the category; the second shows the unit.

Detection vs. Group Headers

This looks similar to grouped headers but has a key difference: the second row contains units or sub-categories, not column names that should replace the first row.

The heuristic: if the second row values are very short (1-5 characters) and look like units (km², %, $), treat it as a unit row rather than a sub-header row.

The Merge Strategy

For unit rows, append rather than replace:

function mergeHeaderWithUnits(headerRow, unitRow) {
  return headerRow.map((header, index) => {
    const unit = (unitRow[index] || "").trim();
    if (!unit) return header;
    
    // Common unit patterns
    if (/^[km²³%\(€£¥°]+\)/i.test(unit) || unit.length <= 3) {
      return `\({header} (\){unit})`;  // "Area (km²)"
    }
    
    return header;  // Don't append if it doesn't look like a unit
  });
}

Edge Case 5: Nested Tables

Tables inside table cells are common in layout-heavy pages. When extracting, you usually want the outer table's structure, not recursively nested content.

function extractCellText(cell) {
  if (!cell) return "";
  
  // Clone to avoid modifying the DOM
  const clone = cell.cloneNode(true);
  
  // Remove nested tables entirely
  clone.querySelectorAll("table").forEach(el => el.remove());
  
  // Also remove script/style/template noise
  clone.querySelectorAll("script, style, noscript, template, link")
    .forEach(el => el.remove());
  
  // Get clean text
  return clone.textContent.replace(/\s+/g, " ").trim();
}

The nested table's content becomes part of the cell's text (if there's readable content) rather than creating structural confusion.

Edge Case 6: Layout Tables vs. Data Tables

Not every <table> element contains data. Many are used for page layout.

Detection criteria for layout tables:

Very few rows (< 2)
Only 1 column
Most cells are empty
Low content density

function isLayoutTable(matrix) {
  if (!matrix || matrix.length < 2) return true;
  
  const rowCount = matrix.length;
  const colCount = matrix[0]?.length || 0;
  
  if (rowCount === 1 || colCount === 1) return true;
  
  // Count cells with meaningful content
  let cellsWithContent = 0;
  let totalCells = 0;
  
  for (const row of matrix) {
    for (const cell of row) {
      totalCells++;
      const text = (cell || "").trim();
      if (text.length > 1 || /\d/.test(text)) {
        cellsWithContent++;
      }
    }
  }
  
  // Less than 30% content = likely layout
  return (cellsWithContent / totalCells) < 0.3;
}

The Meta-Lesson

Each edge case was discovered through real-world testing. The initial implementation worked for simple tables. Then users (and my own testing) found tables that broke it.

The pattern:

Export fails or looks wrong
Inspect the source table's HTML
Identify the pattern causing the issue
Write detection logic
Write transformation logic
Add tests for that pattern

After handling these six patterns, the extraction works on the vast majority of web tables. There are always more edge cases, but the common ones are covered.

Want table extraction that handles these edge cases automatically? Check out HTML Table Exporter on the Chrome Web Store.

The Table Extraction Edge Cases That Took Weeks to Solve

Edge Case 1: Wikipedia's Horizontally Duplicated Tables

The Detection

The Transformation

Edge Case 2: FBREF's Grouped Column Headers

The Challenge

The Detection

The Merge

Edge Case 3: Wikipedia's "v t e" Navigation Prefix

The Patterns

Edge Case 4: Multi-Level Headers with Units

Detection vs. Group Headers

The Merge Strategy

Edge Case 5: Nested Tables

Edge Case 6: Layout Tables vs. Data Tables

The Meta-Lesson

Comments

More from this blog

Designing Export Profiles for Different Data Workflows

Why I Chose Vanilla JavaScript for a Chrome Extension (No Framework)

Lessons from Users Who Use My Tool Differently Than I Expected

Why I Built for a Niche (And Ignored the 'Scale' Advice)

Command Palette

Edge Case 1: Wikipedia's Horizontally Duplicated Tables

The Detection

The Transformation

Edge Case 2: FBREF's Grouped Column Headers

The Challenge

The Detection

The Merge

Edge Case 3: Wikipedia's "v t e" Navigation Prefix

The Patterns

Edge Case 4: Multi-Level Headers with Units

Detection vs. Group Headers

The Merge Strategy

Edge Case 5: Nested Tables

Edge Case 6: Layout Tables vs. Data Tables

The Meta-Lesson

Comments

More from this blog