Parsing HTML Tables Is Harder Than You Think

When I started building HTML Table Exporter, I thought parsing tables would be the easy part. Tables are structured data, right? Rows and columns. How complicated could it be?

Three months later, I had a folder called edge-cases with 47 test files and a deep appreciation for how creative web developers can be when abusing the <table> element.

This article documents the non-obvious problems you'll encounter when extracting data from HTML tables in the wild, and the solutions that actually work.

The Basics Aren't Basic

Let's start with what should be simple. A well-formed HTML table looks like this:

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Item A</td>
      <td>100</td>
    </tr>
  </tbody>
</table>

Parsing this is trivial. Query the <th> elements for headers, iterate through <tr> elements in <tbody>, extract <td> content. Done.

But here's what tables look like in the real world:

<table>
  <tr>
    <td><strong>Name</strong></td>
    <td><strong>Value</strong></td>
  </tr>
  <tr>
    <td>Item A</td>
    <td>100</td>
  </tr>
</table>

No <thead>. No <th>. No <tbody>. The headers are just bold text in regular cells. This is valid HTML, and it's everywhere.

Solution: You can't rely on semantic markup. Instead, use heuristics:

function detectHeaderRow(table) {
  const firstRow = table.querySelector('tr');
  if (!firstRow) return null;

  const cells = firstRow.querySelectorAll('td, th');

  // Heuristic 1: All cells contain <th> elements
  const allTh = Array.from(cells).every(cell => cell.tagName === 'TH');
  if (allTh) return 0;

  // Heuristic 2: All cells have bold/strong text
  const allBold = Array.from(cells).every(cell => {
    const text = cell.textContent.trim();
    const boldText = cell.querySelector('strong, b')?.textContent.trim();
    return text === boldText;
  });
  if (allBold) return 0;

  // Heuristic 3: First row has different background color
  const firstRowBg = getComputedStyle(firstRow).backgroundColor;
  const secondRow = table.querySelectorAll('tr')[1];
  if (secondRow) {
    const secondRowBg = getComputedStyle(secondRow).backgroundColor;
    if (firstRowBg !== secondRowBg) return 0;
  }

  return null; // No clear header detected
}

This catches maybe 80% of cases. The remaining 20% require context-specific rules or user intervention.

The Colspan/Rowspan Problem

This is where things get genuinely difficult. Consider this table:

<table>
  <tr>
    <th>Product</th>
    <th colspan="2">Sales</th>
  </tr>
  <tr>
    <th></th>
    <th>Q1</th>
    <th>Q2</th>
  </tr>
  <tr>
    <td>Widget</td>
    <td>100</td>
    <td>150</td>
  </tr>
</table>

Visually, this renders as:

| Product | Sales      |
|         | Q1  | Q2   |
| Widget  | 100 | 150  |

But if you naively extract cell contents row by row, you get:

Row 0: ["Product", "Sales"]           // 2 cells
Row 1: ["", "Q1", "Q2"]               // 3 cells
Row 2: ["Widget", "100", "150"]       // 3 cells

The column count is inconsistent, and there's no way to know that "Sales" spans two columns without reading the colspan attribute.

Solution: Build a coordinate grid that accounts for spans.

function buildTableGrid(table) {
  const rows = table.querySelectorAll('tr');
  const grid = [];
  const occupied = {}; // Track cells occupied by rowspan

  rows.forEach((row, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];

    let colIndex = 0;
    const cells = row.querySelectorAll('td, th');

    cells.forEach(cell => {
      // Skip columns occupied by previous rowspans
      while (occupied[`${rowIndex},${colIndex}`]) {
        grid[rowIndex][colIndex] = occupied[`${rowIndex},${colIndex}`];
        colIndex++;
      }

      const colspan = parseInt(cell.getAttribute('colspan')) || 1;
      const rowspan = parseInt(cell.getAttribute('rowspan')) || 1;
      const value = cell.textContent.trim();

      // Fill the grid for this cell's span
      for (let r = 0; r < rowspan; r++) {
        for (let c = 0; c < colspan; c++) {
          const targetRow = rowIndex + r;
          const targetCol = colIndex + c;

          if (!grid[targetRow]) grid[targetRow] = [];
          grid[targetRow][targetCol] = value;

          // Mark as occupied for future rows
          if (r > 0) {
            occupied[`${targetRow},${targetCol}`] = value;
          }
        }
      }

      colIndex += colspan;
    });
  });

  return grid;
}

Now the grid correctly represents the visual layout:

Row 0: ["Product", "Sales", "Sales"]
Row 1: ["", "Q1", "Q2"]
Row 2: ["Widget", "100", "150"]

This is still not perfect—you might want "Sales > Q1" and "Sales > Q2" as hierarchical headers—but at least the structure is consistent.

Numbers That Aren't Numbers

Here's a cell value: 1,234.56

Is that:

One thousand two hundred thirty-four point five six? (US format)
One point two three four five six? (Some European formats use comma as decimal)
A string with formatting that should be preserved?

And what about:

$1,234.56 (currency)
1,234.56% (percentage)
(1,234.56) (negative in accounting notation)
1.234,56 € (European currency)
1 234,56 (space as thousands separator)

Detecting and normalizing these correctly is essential if the exported data will be used for calculations.

Solution: Build a detection pipeline with locale awareness.

const numberPatterns = [
  // US/UK: 1,234.56 or 1234.56
  {
    regex: /^[$€£]?\s*-?\(?\d{1,3}(?:,\d{3})*(?:\.\d+)?\)?%?$/,
    parse: (s) => parseFloat(s.replace(/[$€£,\s()%]/g, '')) * (s.includes('(') ? -1 : 1) * (s.includes('%') ? 0.01 : 1)
  },
  // European: 1.234,56 or 1234,56
  {
    regex: /^[$€£]?\s*-?\(?\d{1,3}(?:\.\d{3})*(?:,\d+)?\)?%?$/,
    parse: (s) => parseFloat(s.replace(/[$€£.\s()%]/g, '').replace(',', '.')) * (s.includes('(') ? -1 : 1) * (s.includes('%') ? 0.01 : 1)
  },
  // Space as thousands: 1 234,56
  {
    regex: /^[$€£]?\s*-?\d{1,3}(?:\s\d{3})*(?:,\d+)?%?$/,
    parse: (s) => parseFloat(s.replace(/[$€£\s%]/g, '').replace(',', '.')) * (s.includes('%') ? 0.01 : 1)
  }
];

function detectNumber(value) {
  const trimmed = value.trim();

  for (const pattern of numberPatterns) {
    if (pattern.regex.test(trimmed)) {
      return {
        isNumber: true,
        value: pattern.parse(trimmed),
        original: trimmed
      };
    }
  }

  return { isNumber: false, value: trimmed, original: trimmed };
}

The key insight is that you often need to analyze the entire column to determine the format. If 95% of values in a column match the European format, the outliers are probably data entry errors, not a different locale.

The Invisible Character Problem

This one is insidious. The cell looks like it contains 123, but your number parsing fails. Why?

const cellValue = cell.textContent;
console.log(cellValue);        // "123"
console.log(cellValue.length); // 4 (!)

There's an invisible character. Common culprits:

Character	Code Point	Name
`\u00A0`	160	Non-breaking space
`\u200B`	8203	Zero-width space
`\u200C`	8204	Zero-width non-joiner
`\u200D`	8205	Zero-width joiner
`\uFEFF`	65279	Byte order mark
`\u2007`	8199	Figure space
`\u202F`	8239	Narrow no-break space

These appear in web content constantly, often inserted by CMS systems, copy-paste from Word documents, or internationalization libraries.

Solution: Normalize aggressively before processing.

function normalizeText(text) {
  return text
    // Replace various spaces with regular space
    .replace(/[\u00A0\u2007\u202F\u2000-\u200A]/g, ' ')
    // Remove zero-width characters
    .replace(/[\u200B-\u200D\uFEFF]/g, '')
    // Normalize unicode (handles composed vs decomposed characters)
    .normalize('NFKC')
    // Collapse multiple spaces
    .replace(/\s+/g, ' ')
    .trim();
}

Apply this to every cell value before any other processing. It will save you hours of debugging.

Tables Inside Tables

Some websites nest tables for layout purposes:

<table class="data">
  <tr>
    <td>
      <table class="inner-layout">
        <tr><td>Actual Value</td></tr>
      </table>
    </td>
  </tr>
</table>

If you query for all <td> elements, you'll get duplicates and incorrect nesting. The text content extraction will also be wrong because textContent includes all descendant text.

Solution: Use direct child selectors and explicit recursion control.

function extractDirectText(element) {
  let text = '';

  for (const node of element.childNodes) {
    if (node.nodeType === Node.TEXT_NODE) {
      text += node.textContent;
    } else if (node.nodeType === Node.ELEMENT_NODE) {
      // Skip nested tables entirely
      if (node.tagName === 'TABLE') continue;
      // Recurse into other elements (spans, divs, etc.)
      text += extractDirectText(node);
    }
  }

  return text.trim();
}

function getDirectRows(table) {
  // Only get <tr> elements that are direct children of this table
  // (or its tbody/thead, but not nested tables)
  const rows = [];

  const processContainer = (container) => {
    for (const child of container.children) {
      if (child.tagName === 'TR') {
        rows.push(child);
      } else if (['THEAD', 'TBODY', 'TFOOT'].includes(child.tagName)) {
        processContainer(child);
      }
      // Explicitly don't recurse into nested TABLE elements
    }
  };

  processContainer(table);
  return rows;
}

Dates Are The Worst

I saved the worst for last. A cell contains 01/02/03. What date is this?

January 2, 2003 (US: MM/DD/YY)
February 1, 2003 (EU: DD/MM/YY)
2001 February 3 (ISO-ish: YY/MM/DD)
January 2, 1903 (if YY < 30, is it 19xx or 20xx?)

There is no reliable way to detect date formats from a single value. You need context:

function detectDateFormat(values) {
  const validAsUS = [];
  const validAsEU = [];

  for (const value of values) {
    const match = value.match(/^(\d{1,2})[\/\-.](\d{1,2})[\/\-.](\d{2,4})$/);
    if (!match) continue;

    const [, a, b, c] = match;
    const year = c.length === 2 ? (parseInt(c) > 50 ? 1900 + parseInt(c) : 2000 + parseInt(c)) : parseInt(c);

    // Check if valid as MM/DD/YYYY
    if (parseInt(a) <= 12 && parseInt(b) <= 31) {
      const dateUS = new Date(year, parseInt(a) - 1, parseInt(b));
      if (dateUS.getMonth() === parseInt(a) - 1) {
        validAsUS.push(value);
      }
    }

    // Check if valid as DD/MM/YYYY
    if (parseInt(b) <= 12 && parseInt(a) <= 31) {
      const dateEU = new Date(year, parseInt(b) - 1, parseInt(a));
      if (dateEU.getMonth() === parseInt(b) - 1) {
        validAsEU.push(value);
      }
    }
  }

  // If some dates are only valid in one format, that's our answer
  const onlyUS = validAsUS.filter(v => !validAsEU.includes(v));
  const onlyEU = validAsEU.filter(v => !validAsUS.includes(v));

  if (onlyUS.length > 0 && onlyEU.length === 0) return 'US';
  if (onlyEU.length > 0 && onlyUS.length === 0) return 'EU';

  // Look for values like 13/01/2024 (day > 12, must be DD/MM)
  for (const value of values) {
    const match = value.match(/^(\d{1,2})[\/\-.](\d{1,2})/);
    if (match) {
      if (parseInt(match[1]) > 12) return 'EU';
      if (parseInt(match[2]) > 12) return 'US';
    }
  }

  return 'AMBIGUOUS';
}

The strategy: analyze all date values in a column. If any value has a day > 12 in the first position, the format must be DD/MM. If any has day > 12 in the second position, it must be MM/DD. If all values are ambiguous (both parts ≤ 12), you have to guess or ask the user.

Lessons Learned

After building a parser that handles all these cases, here's what I know:

Never trust semantic HTML. Assume tables will be malformed and build detection heuristics.
Build a grid first. Converting the DOM structure to a 2D array that accounts for colspan/rowspan makes everything else easier.
Normalize early. Clean invisible characters and whitespace before any other processing.
Context matters. Single-cell parsing is often impossible. Look at the whole column to detect formats.
Some ambiguity is permanent. Accept that edge cases exist and provide ways for users to override automatic detection.

If you're building something that needs to extract tables from arbitrary websites, I hope this saves you some of the trial and error I went through.

Learn more about HTML Table Exporter or try it free on the Chrome Web Store. It handles all of these edge cases automatically.

Have you encountered other edge cases I didn't cover? I'd love to hear about them in the comments.

Parsing HTML Tables Is Harder Than You Think

The Basics Aren't Basic

The Colspan/Rowspan Problem

Numbers That Aren't Numbers

The Invisible Character Problem

Tables Inside Tables

Dates Are The Worst

Lessons Learned

Comments (1)

More from this blog

The Table Extraction Edge Cases That Took Weeks to Solve

Designing Export Profiles for Different Data Workflows

Why I Chose Vanilla JavaScript for a Chrome Extension (No Framework)

Lessons from Users Who Use My Tool Differently Than I Expected

Why I Built for a Niche (And Ignored the 'Scale' Advice)

Command Palette

The Basics Aren't Basic

The Colspan/Rowspan Problem

Numbers That Aren't Numbers

The Invisible Character Problem

Tables Inside Tables

Dates Are The Worst

Lessons Learned

Comments (1)

More from this blog