The trouble with tables: Overcoming the challenge of successfully extracting data trapped within tables
By Martin Staniforth – Technical Product Director at Eigen Technologies
As a company specializing in document extraction, it probably won’t surprise you to hear that we’ve been investing in ways to extract data more successfully from tables. As an efficient and effective means to organize and present related information, tables appear everywhere, from annual reports and claims forms to financial statements, invoices, insurance slips and vendor scheduling agreements.
But what may surprise you is how hard the problem of table extraction is to solve. Part of the challenge of table extraction comes from the subjective nature of the task. It’s not always obvious to the human eye what is a table and what is not.
So, to demonstrate that point, let’s play a quick game of ‘Table or not table?’ …
Example one – Applicable Rate within a Credit Agreement::
This is a nice easy one to start with. The explicit bounding boxes in this example are very clearly cells. The table has a well-defined structure and its author intended for us to read each row of the table with reference to the column headers.
Example two – Secured Leverage Ratio within a Credit Agreement:
This one is slightly trickier, but this is also a table. Looking at this example, you can start to sympathize with the complexity of the task. Your mind draws the boundaries of the cells with ‘Fiscal Quarter’ on one line of text, matching the ‘Leverage Ratio’, which requires two lines of text. The table row boundaries are inferred by indented text.
Example three – Definitions of Terms within a CLO document:
What’s the first word that comes into your head when you see this one? I think ‘definitions’, which are kind of like tables. The content has far more text than the previous examples but there is a clear separation between each pair of key terms and definitions, followed by the next pair.
Example four – Board of Directors within an Annual Report:
And this one? I see ‘lists with headings’, which could also be described as a collection of single-column tables, but ‘table’ isn’t the first word that comes to mind.
Example five – Extract from an Annual Report:
This last example doesn’t register on my tables radar, but maybe it should. The structure could be inferred as three columns with headers. It could just be the content layout that leads my eyes to see three columns of text rather than a three-column table.
So, identifying tables is clearly a subjective task. But does any of this matter? Our goal at Eigen is to extract information from documents to make it useful for our customers. When we ask, ‘Table or not table?’ are we asking the wrong question? Is ‘Should we respect the structure provided by the author when we extract this information?’ a better question to ask?
How does Eigen handle table detection and reconstruction?
Eigen refers to the first stage of identifying all tables in a document as ‘Detection and Reconstruction’ (or ‘Detection’ for short). There’s no user set-up required for this step as our models come pre-trained on hundreds of thousands of examples meaning customers can get up and running instantly.
We provide a rich reconstructed representation of the table which includes the geometry of the table and cells, and even the coordinates of the bounding boxes of the characters within those cells. These reconstructed tables can be exported as XLS or HTML formats, or as a detailed representation via our APIs.
We have heavily invested in creating vast datasets of heterogeneous tables from different sources, well beyond what is currently available in open-source datasets. Thanks to our R&D capabilities, this effort also included the generation of synthetic tables datasets with the same attributes as those seen in legal, financial and insurance documents, to tackle specific edge-cases of prime importance to our customers. This means we can process tables with no explicit boundaries and can recognize merged cells, to give our customers far better results than alternative solutions can provide.
How does Eigen ensure you capture data from all ‘tables’?
Given the subjective nature of the ‘Table or not table’ question, we’ve calibrated our model to be more lenient on what we consider to be a table. In the field of machine learning we say we’re favouring recall over precision – it’s better to find false positives than to ignore true positives.
This is great for ensuring we capture as much of the document’s true structure as possible but does create a second problem. We now have many tables (or table-like structures) to sift through to find the specific ones of interest. To remedy this, we’ve developed ‘Secondary Models’ that take the output from our table detection model and refine it further.
The next challenge is homing in on the information you specifically need. Do you need the full table, specific columns or even a specific cell? If we provide you with all the tables from a set of financial statements but don’t tell you which table is which, it’s unlikely we have solved your problem. You’re likely interested in the balance sheet, income or cash flow statements. If we find you the income statement, you may then ask for the ‘Current Year Gross Profit’.
You’ll always want more, but thankfully we’ve developed the tools to give you more.
Using Eigen to extract specific tables
Your use case and data needs will dictate the amount and type of information you need to retrieve from the tables in your documents. The example below is a Margin Matrix from a Credit Agreement. The content of the table describes how much interest a lender will expect to receive based on how leveraged the borrower is.
In this scenario we’re rolling the dice on a future event. This is what we call ‘Table Extraction’ and enables the user to pull out the table and all its contents. Users can annotate examples of the tables they are interested in, and Eigen will find the equivalent tables in unseen documents.
You can also use this approach to extract all the cells from the table. We have the tools to wrap each row into a single record for downstream processing and storage.
Using Eigen to extract specific cells from within a table
In some scenarios, you’ll only want to extract specific information from a table. Take a look at the table below.
It’s a fine-looking table, with no doubt lots of useful information contained within it. However, it’s an example where not all cells are created equal. I can see some ‘star-quarterback appeal’ for one cell in particular. The end of year balance!
For these instances, where you want to extract a specific cell or a selection of cells, we switch to ‘Table Cell Extraction’. This is intended for those needle-in-the-haystack-style table extraction problems.
How do other vendors approach the challenge of table extraction?
Data extraction is our bread and butter, and we pride ourselves on being excellent at what we do for our customers. Many, if not all, of our competitors would (and do) say the same. However, we believe that Eigen is the only platform that enables a business user to extract data from specific tables, including down to the cell level. Other table extraction solutions will give you all the tables, or all the contents of the table to sift through or require some amount of coding to get to the data you need.
Eigen is leading the way in intelligent document processing
The field of natural language processing (NLP) is increasingly merging with other machine learning disciplines. This multi-modal approach of combining the text with the visual layers of a document can only lead to better and more nuanced results. We’re excited to be at the forefront of this area of research, bringing the best of what’s possible to our customers.
Find out more about IDP or request a platform demo.
- World Economic forum 2020
- Gartner Cool Vendor 2020
- AI 100 2021
- Lazard T100
- FT Intelligent Business 2019
- FT Intelligent Business 2020
- CogX Awards 2019
- CogX Awards 2021
- CogX Awards Best AI Product in Insurance
- Forbes The AI 50 2022
- Ai BreakThrough Award 2022
- FStech 2023 awards shortlisted