Skip to main content

Command Palette

Search for a command to run...

The Table Extraction Edge Cases That Took Weeks to Solve

Updated
8 min read
The Table Extraction Edge Cases That Took Weeks to Solve

When I started building HTML Table Exporter, I thought table extraction would be straightforward. Iterate through rows, grab cell text, done.

Then I encountered real-world tables.

Wikipedia tables that duplicate columns horizontally. Sports statistics sites with two-level grouped headers. Financial tables with merged cells spanning multiple rows. Each edge case required its own detection and handling logic.

Here are the patterns that took the most time to solve.

Edge Case 1: Wikipedia's Horizontally Duplicated Tables

Some Wikipedia tables display the same column structure multiple times horizontally to save vertical space:

| Rank | Name    | Value | Rank | Name    | Value |
|------|---------|-------|------|---------|-------|
| 1    | Alpha   | 100   | 6    | Foxtrot | 50    |
| 2    | Bravo   | 95    | 7    | Golf    | 45    |
| 3    | Charlie | 90    | 8    | Hotel   | 40    |

Visually, this shows ranks 1-5 on the left and 6-10 on the right. But if you export it directly, you get 6 columns instead of 3, and the data relationships are broken.

The Detection

The pattern is recognizable:

  • Header row has repeated column names

  • The repetition is exact (same names in same order)

  • First column of each "section" is typically numeric (ranks)

function detectHorizontalDuplication(headerRow) {
  if (!headerRow || headerRow.length < 4) return null;
  
  const uniqueHeaders = [...new Set(headerRow.map(h => h.toLowerCase().trim()))];
  
  // If we have significantly fewer unique headers than total,
  // there might be duplication
  if (uniqueHeaders.length >= headerRow.length * 0.7) {
    return null; // Not enough repetition
  }
  
  // Try to find the repetition point
  for (let splitPoint = 2; splitPoint <= headerRow.length / 2; splitPoint++) {
    const leftHalf = headerRow.slice(0, splitPoint);
    const rightHalf = headerRow.slice(splitPoint, splitPoint * 2);
    
    // Check if patterns match
    const matches = leftHalf.every((h, i) => 
      h.toLowerCase().trim() === (rightHalf[i] || "").toLowerCase().trim()
    );
    
    if (matches && splitPoint * 2 <= headerRow.length) {
      return { splitPoint, sections: Math.floor(headerRow.length / splitPoint) };
    }
  }
  
  return null;
}

The Transformation

Once detected, transform the wide table into a tall one:

function transformHorizontallyDuplicatedTable(matrix, splitPoint) {
  const headerRow = matrix[0].slice(0, splitPoint);
  const newRows = [headerRow];
  
  for (let rowIdx = 1; rowIdx < matrix.length; rowIdx++) {
    const originalRow = matrix[rowIdx];
    
    // Extract each section as a separate row
    for (let section = 0; section < Math.floor(originalRow.length / splitPoint); section++) {
      const start = section * splitPoint;
      const sectionData = originalRow.slice(start, start + splitPoint);
      
      // Skip if section is empty
      if (sectionData.every(cell => !cell || !cell.trim())) continue;
      
      newRows.push(sectionData);
    }
  }
  
  return newRows;
}

Result: 6 columns become 3, and the data stacks vertically as it should.

Edge Case 2: FBREF's Grouped Column Headers

Sports statistics sites like FBREF use two-level headers:

|        |        | Playing Time      | Performance    |
| Player | Nation | MP | Starts | Min | Gls | Ast     |

The first row contains group names that span multiple sub-columns. The second row contains the actual column headers.

The Challenge

After extracting the matrix (with colspan expansion), the group row looks like:

["", "", "Playing Time", "Playing Time", "Playing Time", "Performance", "Performance"]

The values repeat because colspan="3" becomes three cells with the same value.

The Detection

Group header rows have specific characteristics:

  • Empty cells at the beginning (columns without groups)

  • Multiple unique non-empty values (unlike title rows which have one)

  • High ratio of consecutive repeated values

  • Next row has more unique values (the actual headers)

function detectGroupHeaderRow(row, nextRow) {
  if (!row || !nextRow || row.length < 4) return false;
  
  // Must have empty first cell (ungrouped columns like Player, Nation)
  const firstCellEmpty = !(row[0] || "").trim();
  if (!firstCellEmpty) return false;
  
  // Count unique non-empty values
  const uniqueValues = new Set(
    row.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );
  
  // Title rows have ONE unique value; group rows have MULTIPLE
  if (uniqueValues.size <= 1) return false;
  
  // Check for consecutive repeats (colspan expansion signature)
  let consecutiveRepeats = 0;
  for (let i = 1; i < row.length; i++) {
    if ((row[i] || "").trim() === (row[i-1] || "").trim()) {
      consecutiveRepeats++;
    }
  }
  
  const repeatRatio = consecutiveRepeats / (row.length - 1);
  return repeatRatio > 0.3;
}

The Merge

Combine group and sub-header rows into composite names:

function mergeGroupAndSubHeaders(groupRow, subHeaderRow) {
  return subHeaderRow.map((subHeader, index) => {
    const group = (groupRow[index] || "").trim();
    const sub = (subHeader || "").trim();
    
    if (!group) return sub;  // No group for this column
    if (!sub) return group;  // No sub-header (shouldn't happen)
    
    return `\({group}: \){sub}`;  // "Playing Time: MP"
  });
}

Result: Clear, hierarchical column names like "Playing Time: MP", "Performance: Gls".

Edge Case 3: Wikipedia's "v t e" Navigation Prefix

Wikipedia templates often include navigation links at the start of titles:

v t e World Heritage Sites in France

The "v t e" stands for "view | talk | edit" and links to template management. It's noise in exported data.

The Patterns

The prefix appears in several formats:

  • v t e (space-separated)

  • v | t | e (pipe-separated)

  • [v] [t] [e] (bracket-separated)

function cleanWikipediaNavPrefix(text) {
  if (!text) return text;
  
  return text
    .replace(/^\s*v\s+t\s+e\s+/i, "")
    .replace(/^\s*v\s*\|\s*t\s*\|\s*e\s+/i, "")
    .replace(/^\s*\[v\]\s*\[t\]\s*\[e\]\s+/i, "")
    .trim();
}

This runs on table names extracted from captions or title rows.

Edge Case 4: Multi-Level Headers with Units

Some tables have headers spanning multiple rows to show units or categories:

|       | Area        | Population |
|       | km² | mi²   | millions   |
| USA   | 9.8 | 3.8   | 331        |

The first header row shows the category; the second shows the unit.

Detection vs. Group Headers

This looks similar to grouped headers but has a key difference: the second row contains units or sub-categories, not column names that should replace the first row.

The heuristic: if the second row values are very short (1-5 characters) and look like units (km², %, $), treat it as a unit row rather than a sub-header row.

The Merge Strategy

For unit rows, append rather than replace:

function mergeHeaderWithUnits(headerRow, unitRow) {
  return headerRow.map((header, index) => {
    const unit = (unitRow[index] || "").trim();
    if (!unit) return header;
    
    // Common unit patterns
    if (/^[km²³%\(€£¥°]+\)/i.test(unit) || unit.length <= 3) {
      return `\({header} (\){unit})`;  // "Area (km²)"
    }
    
    return header;  // Don't append if it doesn't look like a unit
  });
}

Edge Case 5: Nested Tables

Tables inside table cells are common in layout-heavy pages. When extracting, you usually want the outer table's structure, not recursively nested content.

function extractCellText(cell) {
  if (!cell) return "";
  
  // Clone to avoid modifying the DOM
  const clone = cell.cloneNode(true);
  
  // Remove nested tables entirely
  clone.querySelectorAll("table").forEach(el => el.remove());
  
  // Also remove script/style/template noise
  clone.querySelectorAll("script, style, noscript, template, link")
    .forEach(el => el.remove());
  
  // Get clean text
  return clone.textContent.replace(/\s+/g, " ").trim();
}

The nested table's content becomes part of the cell's text (if there's readable content) rather than creating structural confusion.

Edge Case 6: Layout Tables vs. Data Tables

Not every <table> element contains data. Many are used for page layout.

Detection criteria for layout tables:

  • Very few rows (< 2)

  • Only 1 column

  • Most cells are empty

  • Low content density

function isLayoutTable(matrix) {
  if (!matrix || matrix.length < 2) return true;
  
  const rowCount = matrix.length;
  const colCount = matrix[0]?.length || 0;
  
  if (rowCount === 1 || colCount === 1) return true;
  
  // Count cells with meaningful content
  let cellsWithContent = 0;
  let totalCells = 0;
  
  for (const row of matrix) {
    for (const cell of row) {
      totalCells++;
      const text = (cell || "").trim();
      if (text.length > 1 || /\d/.test(text)) {
        cellsWithContent++;
      }
    }
  }
  
  // Less than 30% content = likely layout
  return (cellsWithContent / totalCells) < 0.3;
}

The Meta-Lesson

Each edge case was discovered through real-world testing. The initial implementation worked for simple tables. Then users (and my own testing) found tables that broke it.

The pattern:

  1. Export fails or looks wrong

  2. Inspect the source table's HTML

  3. Identify the pattern causing the issue

  4. Write detection logic

  5. Write transformation logic

  6. Add tests for that pattern

After handling these six patterns, the extraction works on the vast majority of web tables. There are always more edge cases, but the common ones are covered.


Want table extraction that handles these edge cases automatically? Check out HTML Table Exporter on the Chrome Web Store.