Skip to main content

Command Palette

Search for a command to run...

Parsing HTML Tables Is Harder Than You Think

Updated
8 min read
Parsing HTML Tables Is Harder Than You Think

When I started building HTML Table Exporter, I thought parsing tables would be the easy part. Tables are structured data, right? Rows and columns. How complicated could it be?

Three months later, I had a folder called edge-cases with 47 test files and a deep appreciation for how creative web developers can be when abusing the <table> element.

This article documents the non-obvious problems you'll encounter when extracting data from HTML tables in the wild, and the solutions that actually work.

The Basics Aren't Basic

Let's start with what should be simple. A well-formed HTML table looks like this:

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Item A</td>
      <td>100</td>
    </tr>
  </tbody>
</table>

Parsing this is trivial. Query the <th> elements for headers, iterate through <tr> elements in <tbody>, extract <td> content. Done.

But here's what tables look like in the real world:

<table>
  <tr>
    <td><strong>Name</strong></td>
    <td><strong>Value</strong></td>
  </tr>
  <tr>
    <td>Item A</td>
    <td>100</td>
  </tr>
</table>

No <thead>. No <th>. No <tbody>. The headers are just bold text in regular cells. This is valid HTML, and it's everywhere.

Solution: You can't rely on semantic markup. Instead, use heuristics:

function detectHeaderRow(table) {
  const firstRow = table.querySelector('tr');
  if (!firstRow) return null;

  const cells = firstRow.querySelectorAll('td, th');

  // Heuristic 1: All cells contain <th> elements
  const allTh = Array.from(cells).every(cell => cell.tagName === 'TH');
  if (allTh) return 0;

  // Heuristic 2: All cells have bold/strong text
  const allBold = Array.from(cells).every(cell => {
    const text = cell.textContent.trim();
    const boldText = cell.querySelector('strong, b')?.textContent.trim();
    return text === boldText;
  });
  if (allBold) return 0;

  // Heuristic 3: First row has different background color
  const firstRowBg = getComputedStyle(firstRow).backgroundColor;
  const secondRow = table.querySelectorAll('tr')[1];
  if (secondRow) {
    const secondRowBg = getComputedStyle(secondRow).backgroundColor;
    if (firstRowBg !== secondRowBg) return 0;
  }

  return null; // No clear header detected
}

This catches maybe 80% of cases. The remaining 20% require context-specific rules or user intervention.

The Colspan/Rowspan Problem

This is where things get genuinely difficult. Consider this table:

<table>
  <tr>
    <th>Product</th>
    <th colspan="2">Sales</th>
  </tr>
  <tr>
    <th></th>
    <th>Q1</th>
    <th>Q2</th>
  </tr>
  <tr>
    <td>Widget</td>
    <td>100</td>
    <td>150</td>
  </tr>
</table>

Visually, this renders as:

| Product | Sales      |
|         | Q1  | Q2   |
| Widget  | 100 | 150  |

But if you naively extract cell contents row by row, you get:

Row 0: ["Product", "Sales"]           // 2 cells
Row 1: ["", "Q1", "Q2"]               // 3 cells
Row 2: ["Widget", "100", "150"]       // 3 cells

The column count is inconsistent, and there's no way to know that "Sales" spans two columns without reading the colspan attribute.

Solution: Build a coordinate grid that accounts for spans.

function buildTableGrid(table) {
  const rows = table.querySelectorAll('tr');
  const grid = [];
  const occupied = {}; // Track cells occupied by rowspan

  rows.forEach((row, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];

    let colIndex = 0;
    const cells = row.querySelectorAll('td, th');

    cells.forEach(cell => {
      // Skip columns occupied by previous rowspans
      while (occupied[`${rowIndex},${colIndex}`]) {
        grid[rowIndex][colIndex] = occupied[`${rowIndex},${colIndex}`];
        colIndex++;
      }

      const colspan = parseInt(cell.getAttribute('colspan')) || 1;
      const rowspan = parseInt(cell.getAttribute('rowspan')) || 1;
      const value = cell.textContent.trim();

      // Fill the grid for this cell's span
      for (let r = 0; r < rowspan; r++) {
        for (let c = 0; c < colspan; c++) {
          const targetRow = rowIndex + r;
          const targetCol = colIndex + c;

          if (!grid[targetRow]) grid[targetRow] = [];
          grid[targetRow][targetCol] = value;

          // Mark as occupied for future rows
          if (r > 0) {
            occupied[`${targetRow},${targetCol}`] = value;
          }
        }
      }

      colIndex += colspan;
    });
  });

  return grid;
}

Now the grid correctly represents the visual layout:

Row 0: ["Product", "Sales", "Sales"]
Row 1: ["", "Q1", "Q2"]
Row 2: ["Widget", "100", "150"]

This is still not perfect—you might want "Sales > Q1" and "Sales > Q2" as hierarchical headers—but at least the structure is consistent.

Numbers That Aren't Numbers

Here's a cell value: 1,234.56

Is that:

  • One thousand two hundred thirty-four point five six? (US format)

  • One point two three four five six? (Some European formats use comma as decimal)

  • A string with formatting that should be preserved?

And what about:

  • $1,234.56 (currency)

  • 1,234.56% (percentage)

  • (1,234.56) (negative in accounting notation)

  • 1.234,56 € (European currency)

  • 1 234,56 (space as thousands separator)

Detecting and normalizing these correctly is essential if the exported data will be used for calculations.

Solution: Build a detection pipeline with locale awareness.

const numberPatterns = [
  // US/UK: 1,234.56 or 1234.56
  {
    regex: /^[$€£]?\s*-?\(?\d{1,3}(?:,\d{3})*(?:\.\d+)?\)?%?$/,
    parse: (s) => parseFloat(s.replace(/[$€£,\s()%]/g, '')) * (s.includes('(') ? -1 : 1) * (s.includes('%') ? 0.01 : 1)
  },
  // European: 1.234,56 or 1234,56
  {
    regex: /^[$€£]?\s*-?\(?\d{1,3}(?:\.\d{3})*(?:,\d+)?\)?%?$/,
    parse: (s) => parseFloat(s.replace(/[$€£.\s()%]/g, '').replace(',', '.')) * (s.includes('(') ? -1 : 1) * (s.includes('%') ? 0.01 : 1)
  },
  // Space as thousands: 1 234,56
  {
    regex: /^[$€£]?\s*-?\d{1,3}(?:\s\d{3})*(?:,\d+)?%?$/,
    parse: (s) => parseFloat(s.replace(/[$€£\s%]/g, '').replace(',', '.')) * (s.includes('%') ? 0.01 : 1)
  }
];

function detectNumber(value) {
  const trimmed = value.trim();

  for (const pattern of numberPatterns) {
    if (pattern.regex.test(trimmed)) {
      return {
        isNumber: true,
        value: pattern.parse(trimmed),
        original: trimmed
      };
    }
  }

  return { isNumber: false, value: trimmed, original: trimmed };
}

The key insight is that you often need to analyze the entire column to determine the format. If 95% of values in a column match the European format, the outliers are probably data entry errors, not a different locale.

The Invisible Character Problem

This one is insidious. The cell looks like it contains 123, but your number parsing fails. Why?

const cellValue = cell.textContent;
console.log(cellValue);        // "123"
console.log(cellValue.length); // 4 (!)

There's an invisible character. Common culprits:

CharacterCode PointName
\u00A0160Non-breaking space
\u200B8203Zero-width space
\u200C8204Zero-width non-joiner
\u200D8205Zero-width joiner
\uFEFF65279Byte order mark
\u20078199Figure space
\u202F8239Narrow no-break space

These appear in web content constantly, often inserted by CMS systems, copy-paste from Word documents, or internationalization libraries.

Solution: Normalize aggressively before processing.

function normalizeText(text) {
  return text
    // Replace various spaces with regular space
    .replace(/[\u00A0\u2007\u202F\u2000-\u200A]/g, ' ')
    // Remove zero-width characters
    .replace(/[\u200B-\u200D\uFEFF]/g, '')
    // Normalize unicode (handles composed vs decomposed characters)
    .normalize('NFKC')
    // Collapse multiple spaces
    .replace(/\s+/g, ' ')
    .trim();
}

Apply this to every cell value before any other processing. It will save you hours of debugging.

Tables Inside Tables

Some websites nest tables for layout purposes:

<table class="data">
  <tr>
    <td>
      <table class="inner-layout">
        <tr><td>Actual Value</td></tr>
      </table>
    </td>
  </tr>
</table>

If you query for all <td> elements, you'll get duplicates and incorrect nesting. The text content extraction will also be wrong because textContent includes all descendant text.

Solution: Use direct child selectors and explicit recursion control.

function extractDirectText(element) {
  let text = '';

  for (const node of element.childNodes) {
    if (node.nodeType === Node.TEXT_NODE) {
      text += node.textContent;
    } else if (node.nodeType === Node.ELEMENT_NODE) {
      // Skip nested tables entirely
      if (node.tagName === 'TABLE') continue;
      // Recurse into other elements (spans, divs, etc.)
      text += extractDirectText(node);
    }
  }

  return text.trim();
}

function getDirectRows(table) {
  // Only get <tr> elements that are direct children of this table
  // (or its tbody/thead, but not nested tables)
  const rows = [];

  const processContainer = (container) => {
    for (const child of container.children) {
      if (child.tagName === 'TR') {
        rows.push(child);
      } else if (['THEAD', 'TBODY', 'TFOOT'].includes(child.tagName)) {
        processContainer(child);
      }
      // Explicitly don't recurse into nested TABLE elements
    }
  };

  processContainer(table);
  return rows;
}

Dates Are The Worst

I saved the worst for last. A cell contains 01/02/03. What date is this?

  • January 2, 2003 (US: MM/DD/YY)

  • February 1, 2003 (EU: DD/MM/YY)

  • 2001 February 3 (ISO-ish: YY/MM/DD)

  • January 2, 1903 (if YY < 30, is it 19xx or 20xx?)

There is no reliable way to detect date formats from a single value. You need context:

function detectDateFormat(values) {
  const validAsUS = [];
  const validAsEU = [];

  for (const value of values) {
    const match = value.match(/^(\d{1,2})[\/\-.](\d{1,2})[\/\-.](\d{2,4})$/);
    if (!match) continue;

    const [, a, b, c] = match;
    const year = c.length === 2 ? (parseInt(c) > 50 ? 1900 + parseInt(c) : 2000 + parseInt(c)) : parseInt(c);

    // Check if valid as MM/DD/YYYY
    if (parseInt(a) <= 12 && parseInt(b) <= 31) {
      const dateUS = new Date(year, parseInt(a) - 1, parseInt(b));
      if (dateUS.getMonth() === parseInt(a) - 1) {
        validAsUS.push(value);
      }
    }

    // Check if valid as DD/MM/YYYY
    if (parseInt(b) <= 12 && parseInt(a) <= 31) {
      const dateEU = new Date(year, parseInt(b) - 1, parseInt(a));
      if (dateEU.getMonth() === parseInt(b) - 1) {
        validAsEU.push(value);
      }
    }
  }

  // If some dates are only valid in one format, that's our answer
  const onlyUS = validAsUS.filter(v => !validAsEU.includes(v));
  const onlyEU = validAsEU.filter(v => !validAsUS.includes(v));

  if (onlyUS.length > 0 && onlyEU.length === 0) return 'US';
  if (onlyEU.length > 0 && onlyUS.length === 0) return 'EU';

  // Look for values like 13/01/2024 (day > 12, must be DD/MM)
  for (const value of values) {
    const match = value.match(/^(\d{1,2})[\/\-.](\d{1,2})/);
    if (match) {
      if (parseInt(match[1]) > 12) return 'EU';
      if (parseInt(match[2]) > 12) return 'US';
    }
  }

  return 'AMBIGUOUS';
}

The strategy: analyze all date values in a column. If any value has a day > 12 in the first position, the format must be DD/MM. If any has day > 12 in the second position, it must be MM/DD. If all values are ambiguous (both parts ≤ 12), you have to guess or ask the user.

Lessons Learned

After building a parser that handles all these cases, here's what I know:

  1. Never trust semantic HTML. Assume tables will be malformed and build detection heuristics.

  2. Build a grid first. Converting the DOM structure to a 2D array that accounts for colspan/rowspan makes everything else easier.

  3. Normalize early. Clean invisible characters and whitespace before any other processing.

  4. Context matters. Single-cell parsing is often impossible. Look at the whole column to detect formats.

  5. Some ambiguity is permanent. Accept that edge cases exist and provide ways for users to override automatic detection.

If you're building something that needs to extract tables from arbitrary websites, I hope this saves you some of the trial and error I went through.


Learn more about HTML Table Exporter or try it free on the Chrome Web Store. It handles all of these edge cases automatically.

Have you encountered other edge cases I didn't cover? I'd love to hear about them in the comments.

T

So true — HTML tables look simple until you actually have to parse them programmatically. Colspan, rowspan, nested tables, inconsistent markup... it is a rabbit hole. Good breakdown of the edge cases.