The Table Extraction Edge Cases That Took Weeks to Solve

When I started building HTML Table Exporter, I thought table extraction would be straightforward. Iterate through rows, grab cell text, done.
Then I encountered real-world tables.
Wikipedia tables that duplicate columns horizontally. Sports statistics sites with two-level grouped headers. Financial tables with merged cells spanning multiple rows. Each edge case required its own detection and handling logic.
Here are the patterns that took the most time to solve.
Edge Case 1: Wikipedia's Horizontally Duplicated Tables
Some Wikipedia tables display the same column structure multiple times horizontally to save vertical space:
| Rank | Name | Value | Rank | Name | Value |
|------|---------|-------|------|---------|-------|
| 1 | Alpha | 100 | 6 | Foxtrot | 50 |
| 2 | Bravo | 95 | 7 | Golf | 45 |
| 3 | Charlie | 90 | 8 | Hotel | 40 |
Visually, this shows ranks 1-5 on the left and 6-10 on the right. But if you export it directly, you get 6 columns instead of 3, and the data relationships are broken.
The Detection
The pattern is recognizable:
Header row has repeated column names
The repetition is exact (same names in same order)
First column of each "section" is typically numeric (ranks)
function detectHorizontalDuplication(headerRow) {
if (!headerRow || headerRow.length < 4) return null;
const uniqueHeaders = [...new Set(headerRow.map(h => h.toLowerCase().trim()))];
// If we have significantly fewer unique headers than total,
// there might be duplication
if (uniqueHeaders.length >= headerRow.length * 0.7) {
return null; // Not enough repetition
}
// Try to find the repetition point
for (let splitPoint = 2; splitPoint <= headerRow.length / 2; splitPoint++) {
const leftHalf = headerRow.slice(0, splitPoint);
const rightHalf = headerRow.slice(splitPoint, splitPoint * 2);
// Check if patterns match
const matches = leftHalf.every((h, i) =>
h.toLowerCase().trim() === (rightHalf[i] || "").toLowerCase().trim()
);
if (matches && splitPoint * 2 <= headerRow.length) {
return { splitPoint, sections: Math.floor(headerRow.length / splitPoint) };
}
}
return null;
}
The Transformation
Once detected, transform the wide table into a tall one:
function transformHorizontallyDuplicatedTable(matrix, splitPoint) {
const headerRow = matrix[0].slice(0, splitPoint);
const newRows = [headerRow];
for (let rowIdx = 1; rowIdx < matrix.length; rowIdx++) {
const originalRow = matrix[rowIdx];
// Extract each section as a separate row
for (let section = 0; section < Math.floor(originalRow.length / splitPoint); section++) {
const start = section * splitPoint;
const sectionData = originalRow.slice(start, start + splitPoint);
// Skip if section is empty
if (sectionData.every(cell => !cell || !cell.trim())) continue;
newRows.push(sectionData);
}
}
return newRows;
}
Result: 6 columns become 3, and the data stacks vertically as it should.
Edge Case 2: FBREF's Grouped Column Headers
Sports statistics sites like FBREF use two-level headers:
| | | Playing Time | Performance |
| Player | Nation | MP | Starts | Min | Gls | Ast |
The first row contains group names that span multiple sub-columns. The second row contains the actual column headers.
The Challenge
After extracting the matrix (with colspan expansion), the group row looks like:
["", "", "Playing Time", "Playing Time", "Playing Time", "Performance", "Performance"]
The values repeat because colspan="3" becomes three cells with the same value.
The Detection
Group header rows have specific characteristics:
Empty cells at the beginning (columns without groups)
Multiple unique non-empty values (unlike title rows which have one)
High ratio of consecutive repeated values
Next row has more unique values (the actual headers)
function detectGroupHeaderRow(row, nextRow) {
if (!row || !nextRow || row.length < 4) return false;
// Must have empty first cell (ungrouped columns like Player, Nation)
const firstCellEmpty = !(row[0] || "").trim();
if (!firstCellEmpty) return false;
// Count unique non-empty values
const uniqueValues = new Set(
row.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
);
// Title rows have ONE unique value; group rows have MULTIPLE
if (uniqueValues.size <= 1) return false;
// Check for consecutive repeats (colspan expansion signature)
let consecutiveRepeats = 0;
for (let i = 1; i < row.length; i++) {
if ((row[i] || "").trim() === (row[i-1] || "").trim()) {
consecutiveRepeats++;
}
}
const repeatRatio = consecutiveRepeats / (row.length - 1);
return repeatRatio > 0.3;
}
The Merge
Combine group and sub-header rows into composite names:
function mergeGroupAndSubHeaders(groupRow, subHeaderRow) {
return subHeaderRow.map((subHeader, index) => {
const group = (groupRow[index] || "").trim();
const sub = (subHeader || "").trim();
if (!group) return sub; // No group for this column
if (!sub) return group; // No sub-header (shouldn't happen)
return `\({group}: \){sub}`; // "Playing Time: MP"
});
}
Result: Clear, hierarchical column names like "Playing Time: MP", "Performance: Gls".
Edge Case 3: Wikipedia's "v t e" Navigation Prefix
Wikipedia templates often include navigation links at the start of titles:
v t e World Heritage Sites in France
The "v t e" stands for "view | talk | edit" and links to template management. It's noise in exported data.
The Patterns
The prefix appears in several formats:
v t e(space-separated)v | t | e(pipe-separated)[v] [t] [e](bracket-separated)
function cleanWikipediaNavPrefix(text) {
if (!text) return text;
return text
.replace(/^\s*v\s+t\s+e\s+/i, "")
.replace(/^\s*v\s*\|\s*t\s*\|\s*e\s+/i, "")
.replace(/^\s*\[v\]\s*\[t\]\s*\[e\]\s+/i, "")
.trim();
}
This runs on table names extracted from captions or title rows.
Edge Case 4: Multi-Level Headers with Units
Some tables have headers spanning multiple rows to show units or categories:
| | Area | Population |
| | km² | mi² | millions |
| USA | 9.8 | 3.8 | 331 |
The first header row shows the category; the second shows the unit.
Detection vs. Group Headers
This looks similar to grouped headers but has a key difference: the second row contains units or sub-categories, not column names that should replace the first row.
The heuristic: if the second row values are very short (1-5 characters) and look like units (km², %, $), treat it as a unit row rather than a sub-header row.
The Merge Strategy
For unit rows, append rather than replace:
function mergeHeaderWithUnits(headerRow, unitRow) {
return headerRow.map((header, index) => {
const unit = (unitRow[index] || "").trim();
if (!unit) return header;
// Common unit patterns
if (/^[km²³%\(€£¥°]+\)/i.test(unit) || unit.length <= 3) {
return `\({header} (\){unit})`; // "Area (km²)"
}
return header; // Don't append if it doesn't look like a unit
});
}
Edge Case 5: Nested Tables
Tables inside table cells are common in layout-heavy pages. When extracting, you usually want the outer table's structure, not recursively nested content.
function extractCellText(cell) {
if (!cell) return "";
// Clone to avoid modifying the DOM
const clone = cell.cloneNode(true);
// Remove nested tables entirely
clone.querySelectorAll("table").forEach(el => el.remove());
// Also remove script/style/template noise
clone.querySelectorAll("script, style, noscript, template, link")
.forEach(el => el.remove());
// Get clean text
return clone.textContent.replace(/\s+/g, " ").trim();
}
The nested table's content becomes part of the cell's text (if there's readable content) rather than creating structural confusion.
Edge Case 6: Layout Tables vs. Data Tables
Not every <table> element contains data. Many are used for page layout.
Detection criteria for layout tables:
Very few rows (< 2)
Only 1 column
Most cells are empty
Low content density
function isLayoutTable(matrix) {
if (!matrix || matrix.length < 2) return true;
const rowCount = matrix.length;
const colCount = matrix[0]?.length || 0;
if (rowCount === 1 || colCount === 1) return true;
// Count cells with meaningful content
let cellsWithContent = 0;
let totalCells = 0;
for (const row of matrix) {
for (const cell of row) {
totalCells++;
const text = (cell || "").trim();
if (text.length > 1 || /\d/.test(text)) {
cellsWithContent++;
}
}
}
// Less than 30% content = likely layout
return (cellsWithContent / totalCells) < 0.3;
}
The Meta-Lesson
Each edge case was discovered through real-world testing. The initial implementation worked for simple tables. Then users (and my own testing) found tables that broke it.
The pattern:
Export fails or looks wrong
Inspect the source table's HTML
Identify the pattern causing the issue
Write detection logic
Write transformation logic
Add tests for that pattern
After handling these six patterns, the extraction works on the vast majority of web tables. There are always more edge cases, but the common ones are covered.
Want table extraction that handles these edge cases automatically? Check out HTML Table Exporter on the Chrome Web Store.



