Table Structure Detection and Data Extraction

Automatic Table Information Extraction using Image Segmentation

Arkaprava Patra
9 min read · Jun 18, 2021
Image : Nanonets

Background

In recent years, Computer Vision techniques have been rapidly adopted across a wide range of tasks, from Autonomous vehicles and Medical diagnosis to Military surveillance.

Another pertinent task would be the extraction of specific information from documents or images. With the widespread use of mobile phones and scanners to photograph and upload documents, the need for extracting the information trapped in unstructured document images such as retail receipts, insurance claim forms and financial invoices has become more acute. A major hurdle to this task is that these images/documents often contain information in the form of tables and extracting data accurately from tables can be a tricky task.

The Challenges

  • While some progress has been made in table detection, extracting the table contents is still a challenge since it involves more fine-grained recognition of the table structure (rows & columns).
  • Prior approaches have attempted to solve the table detection and structure recognition problems independently using two separate models.

In this blog, we'll explore a relatively new Deep Learning method, TableNet, which provides an end-to-end solution to both table detection and data extraction.

Goal

Image : govwebworks

Our objective is to build an ML system that automates the extraction of data from the table(s) present in a document. It identifies the presence of tables in an image file and accurately pulls the data out of the table cells.

The Real-World Constraints

  • Only the data present inside the tables (if any) should be extracted, and it must be extracted accurately. Erroneous extracted data would defeat the purpose of automating the task.
  • There are no strict latency constraints, but the data should be fetched and populated within a couple of minutes at most.

Dataset

For this exercise, we will use the Marmot Dataset, which was also used to train the TableNet model in the original research paper. It contains about 500 images and their corresponding annotation files.

Let’s check a sample Image and its corresponding Annotation …

Sample Image :

This image has a table that needs to be detected by the model.

Annotation file :

It contains the coordinates of the “columns” of the tables present in the image. Each <bndbox> block represents one bounding box and holds its x and y coordinates (xmin, ymin, xmax, ymax).

These annotation files need to be processed to generate the Table and Column Masks for each image, which we then use to train our model.
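Below is a minimal sketch of how these annotation files could be parsed with Python's ElementTree, assuming a Pascal-VOC-style layout where each <bndbox> carries xmin/ymin/xmax/ymax children (the exact tag names may differ in your copy of the dataset):

```python
import xml.etree.ElementTree as ET

def parse_column_boxes(xml_path):
    """Read every <bndbox> block from one annotation file.

    Assumes each <bndbox> has <xmin>, <ymin>, <xmax>, <ymax> children;
    adjust the tag names if your copy of the dataset differs.
    """
    tree = ET.parse(xml_path)
    boxes = []
    for bndbox in tree.getroot().iter("bndbox"):
        xmin = int(float(bndbox.find("xmin").text))
        ymin = int(float(bndbox.find("ymin").text))
        xmax = int(float(bndbox.find("xmax").text))
        ymax = int(float(bndbox.find("ymax").text))
        boxes.append((xmin, ymin, xmax, ymax))
    return boxes
```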

The following is an example of an Image and its corresponding Table Mask and Column Mask :

Image : TableNet Paper

ML Problem Formulation

The idea is to :

  1. First, use the annotation files to generate the Table Mask and Column Mask images.
  2. Then train the TableNet model to predict both the Table and Column Masks for any input image.
  3. Segment the image using the predicted masks so that only the tables are highlighted and the background is dark.
  4. Extract the text from the table cells using an OCR (Optical Character Recognition) tool.

Problem Type

We have to segment out the Tables and Columns from the input image. Thus, it is an Image Segmentation problem.

Image segmentation is used in digital image processing and analysis to partition an image into multiple parts or regions, often based on the characteristics of the pixels in the image.

It’s not useful to process the entire image at the same time as there will be regions in the image which do not contain any information. By dividing the image into segments, we can make use of the important segments as per our task.

We can also think of Image segmentation as a pixel-wise classification task. In our case, we have two classes for each Mask —

Table and Background(for Table Mask) || Column and Background(for Column Mask).

Our model has to determine whether each pixel belongs to a table/column or is part of the background.

Performance Metrics :

The evaluation metric used in this classification task is the F1-score.

  • Precision is the measure of correctly identified positive cases out of all predicted positive cases.
  • Recall is the measure of correctly identified positive cases out of all actual positive cases.
  • F1-Score is the harmonic mean of Precision and Recall. It is used as the evaluation metric since it penalizes false positives and false negatives equally, and it is a preferred metric when class imbalance exists. A minimal NumPy sketch of these pixel-wise metrics follows the list.
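The sketch below computes these metrics for a single binary mask prediction; it assumes the ground-truth and predicted masks are 0/1 arrays of the same shape, and the small epsilon guards against division by zero:

```python
import numpy as np

def pixel_f1(y_true, y_pred, eps=1e-8):
    """Pixel-wise precision, recall and F1 for one binary mask."""
    y_true = y_true.astype(bool).ravel()
    y_pred = y_pred.astype(bool).ravel()
    tp = np.sum(y_true & y_pred)    # mask pixels predicted correctly
    fp = np.sum(~y_true & y_pred)   # background predicted as mask
    fn = np.sum(y_true & ~y_pred)   # mask pixels that were missed
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1
```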

Data Preprocessing

After downloading the Dataset, we use the annotations to generate the Masks(Table & Column) for each image.

As we saw earlier, the annotation file contains the coordinates (xmin, ymin, xmax, ymax) of each column across the tables in the image. We have to use these coordinates to extract the column segments as well as the table segments.
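One possible way to rasterize these coordinates into masks is sketched below. It assumes the column boxes have already been parsed (as in the earlier snippet) and approximates the table mask as the bounding extent of all column boxes, which works when every column of a table is annotated; images with several tables would first need the boxes grouped per table.

```python
import numpy as np

def make_masks(image_shape, column_boxes):
    """Rasterize column boxes into a column mask and an approximate table mask.

    image_shape : (height, width) of the source document image.
    column_boxes: list of (xmin, ymin, xmax, ymax) tuples.
    """
    h, w = image_shape
    column_mask = np.zeros((h, w), dtype=np.uint8)
    table_mask = np.zeros((h, w), dtype=np.uint8)
    for xmin, ymin, xmax, ymax in column_boxes:
        column_mask[ymin:ymax, xmin:xmax] = 255
    if column_boxes:
        # Table region approximated as the bounding extent of all column boxes.
        xs_min = min(b[0] for b in column_boxes)
        ys_min = min(b[1] for b in column_boxes)
        xs_max = max(b[2] for b in column_boxes)
        ys_max = max(b[3] for b in column_boxes)
        table_mask[ys_min:ys_max, xs_min:xs_max] = 255
    return table_mask, column_mask
```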

Let’s check how the generated Masks look :

As we can see, the mask generator does a decent enough job of accurately generating the required Mask Images.

Train & Test Dataset

Splitting

The whole data is split into Train and Test Datasets in the 80:20 ratio.

Processing Final Datasets

The images and their masks are all resized to a fixed shape (1024, 1024) and normalized so that pixel values are scaled between 0 and 1.

The data is loaded in batches during Training or Testing.
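A minimal TensorFlow input-pipeline sketch for this step could look like the following; the file-path arguments and the batch size are illustrative assumptions, not the exact setup from the project:

```python
import tensorflow as tf

IMG_SIZE = (1024, 1024)

def load_example(image_path, table_mask_path, column_mask_path):
    """Decode, resize to 1024x1024 and scale pixel values to [0, 1]."""
    def _load(path, channels):
        img = tf.io.decode_image(tf.io.read_file(path), channels=channels,
                                 expand_animations=False)
        img = tf.image.resize(img, IMG_SIZE)
        return tf.cast(img, tf.float32) / 255.0
    image = _load(image_path, channels=3)
    table_mask = _load(table_mask_path, channels=1)
    column_mask = _load(column_mask_path, channels=1)
    return image, (table_mask, column_mask)

def make_dataset(image_paths, table_mask_paths, column_mask_paths, batch_size=2):
    """Batched, prefetched dataset of (image, (table_mask, column_mask))."""
    ds = tf.data.Dataset.from_tensor_slices(
        (image_paths, table_mask_paths, column_mask_paths))
    ds = ds.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```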

Now that we're done with the data preparation, we can start building and training models.

Modelling(TableNet)

Architecture

The proposed TableNet model is based on the Encoder-Decoder model for semantic segmentation.

The model consists of two sub-sections —

  1. The Encoder Network : The Encoder consists of a combination of Convolution, ReLU and Pooling layers that downsample the spatial resolution of the input, producing lower-resolution feature maps. In essence, it extracts the important features from the input image. In the TableNet architecture, the VGG-19 network (up to the bottleneck layer) is used as the encoder.
  2. The Decoder Network : The downsampled feature maps from the Encoder are passed through two conv2D layers and then through one 1x1 conv2D layer. Since we need the model to produce a full-resolution semantic prediction of the masks, the decoder takes the low-resolution output of the Encoder and upsamples it into full-resolution feature maps (i.e. the masks). In addition, with the help of the skip-pooling technique, the low-resolution feature maps of the decoder are combined with the high-resolution features of the encoder, as shown in the diagram above. The TableNet model has two branches in the Decoder Network: the Table Branch (generates the Table Mask) and the Column Branch (generates the Column Mask). A condensed Keras sketch of this architecture follows the list.
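Below is that sketch. It uses the VGG-19 backbone from tf.keras.applications with skip connections from block3_pool and block4_pool; the filter counts, the shared "neck" and the use of bilinear upsampling (rather than the transposed convolutions in the paper) are simplifying assumptions.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

def build_tablenet(input_shape=(1024, 1024, 3)):
    """Encoder-decoder sketch: VGG-19 encoder, two decoder branches."""
    inputs = layers.Input(shape=input_shape)
    vgg = VGG19(include_top=False, weights="imagenet", input_tensor=inputs)
    pool3 = vgg.get_layer("block3_pool").output   # high-res skip connection
    pool4 = vgg.get_layer("block4_pool").output   # mid-res skip connection
    pool5 = vgg.get_layer("block5_pool").output   # bottleneck features

    # Shared "neck" on top of the encoder: conv + dropout blocks.
    x = layers.Conv2D(512, 3, activation="relu", padding="same")(pool5)
    x = layers.Dropout(0.5)(x)
    x = layers.Conv2D(512, 3, activation="relu", padding="same")(x)
    x = layers.Dropout(0.5)(x)

    def decoder_branch(feat, name):
        """Upsample back to input resolution, merging pool4 and pool3 on the way."""
        y = layers.Conv2D(512, 1, activation="relu")(feat)
        y = layers.UpSampling2D(2, interpolation="bilinear")(y)   # 1/32 -> 1/16
        y = layers.Concatenate()([y, pool4])
        y = layers.UpSampling2D(2, interpolation="bilinear")(y)   # 1/16 -> 1/8
        y = layers.Concatenate()([y, pool3])
        y = layers.Conv2D(2, 1)(y)                                # per-class score maps
        y = layers.UpSampling2D(8, interpolation="bilinear")(y)   # 1/8 -> full size
        return layers.Softmax(name=name)(y)

    table_mask = decoder_branch(x, "table_mask")
    column_mask = decoder_branch(x, "column_mask")
    return Model(inputs, [table_mask, column_mask], name="tablenet_sketch")
```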

Decoder Structures

Final Model

Model Training

Image : article

TableNet requires both table and structure annotated data for training. There are two computation graphs (table & column) which require training. Each training sample is a tuple of a document image, table mask and column mask.

In the initial phase of training, the table branch and column branch are computed in the ratio of 2:1. This is done because, although the table branch and column branch are different, the encoder is shared between them. During the initial iterations, the learning focuses on generating large active tabular regions, which in subsequent training specializes to column regions. After around 50 iterations with a batch size of 1, this training scheme is modified.

The model is then trained in the ratio of 1:1 for both branches until convergence. Using the same training pattern, the model is trained for another 50 iterations with a batch size of 1 and learning rate of 0.0001.

The Adam optimizer is used to optimize training, with parameters beta1=0.8, beta2=0.99 and epsilon=1e-08. Convergence and overfitting behavior were monitored by observing performance on the validation set.
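A minimal TensorFlow sketch of one possible implementation of this schedule is shown below. The 2:1 phase is interpreted here as "two table-branch updates for every column-branch update", and the model and train_ds objects from the earlier sketches are assumed; both choices are interpretations rather than the exact training code of the project.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4,
                                     beta_1=0.8, beta_2=0.99, epsilon=1e-8)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(model, image, mask, branch_index):
    """One optimisation step on a single branch (0 = table, 1 = column)."""
    # Masks arrive as float images in [0, 1]; turn them into integer class labels.
    labels = tf.cast(tf.squeeze(mask, -1) > 0.5, tf.int32)
    with tf.GradientTape() as tape:
        preds = model(image, training=True)        # [table_pred, column_pred]
        loss = loss_fn(labels, preds[branch_index])
    grads = tape.gradient(loss, model.trainable_variables)
    # Gradients for the unused branch head are None; drop them before applying.
    pairs = [(g, v) for g, v in zip(grads, model.trainable_variables) if g is not None]
    optimizer.apply_gradients(pairs)
    return loss

# Initial 2:1 schedule: two table-branch updates for every column-branch update.
# Later phases switch to a 1:1 schedule until convergence.
for step, (image, (table_mask, column_mask)) in enumerate(train_ds):
    if step % 3 < 2:
        train_step(model, image, table_mask, 0)    # table branch
    else:
        train_step(model, image, column_mask, 1)   # column branch
```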

Here's a snapshot of the above-mentioned process :

Let’s check the results , now that the model has been trained…

Training Plots

Checking the Model’s Mask Predictions

Example 1
Example 2
Example 3

Scoring

The model performance was tested against the complete Test Data :

Model Performance

The trained model achieves an F1-Score of 0.94 for Table Mask predictions and an F1-Score of 0.85 for Column Mask predictions.

Table Data Extraction by tesseract-OCR

OCR stands for Optical Character Recognition. OCR systems transform a two-dimensional image of text, which could contain machine-printed or handwritten text, from its image representation into machine-readable text.

Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or using an API to extract printed text from images.

Step 1

The input image (1024, 1024, 3) is fed to the model, which produces two outputs: a Table Mask and a Column Mask, each of shape (1, 1024, 1024, 2). We take the maximum-probability class from each mask and save the Column and Table Mask images to disk. We then use these masks to set the alpha channel of the image and thus obtain the masked image.
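A minimal sketch of this step, assuming class index 1 of the softmax output corresponds to the table/column class and using Pillow for the alpha-channel trick:

```python
import numpy as np
from PIL import Image

def apply_mask(image_path, model):
    """Predict the masks for one image and black out everything outside the table."""
    image = Image.open(image_path).convert("RGB").resize((1024, 1024))
    batch = np.expand_dims(np.asarray(image, dtype=np.float32) / 255.0, axis=0)

    table_pred, column_pred = model.predict(batch)          # each (1, 1024, 1024, 2)
    # argmax over the two classes; assumes channel 1 is the table/column class.
    table_mask = np.argmax(table_pred[0], axis=-1).astype(np.uint8) * 255
    column_mask = np.argmax(column_pred[0], axis=-1).astype(np.uint8) * 255
    Image.fromarray(table_mask).save("table_mask.png")
    Image.fromarray(column_mask).save("column_mask.png")

    # Use the table mask as the image's alpha channel: table pixels stay visible,
    # background pixels become fully transparent.
    rgba = image.convert("RGBA")
    rgba.putalpha(Image.fromarray(table_mask))
    rgba.save("masked_image.png")
    return rgba
```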

Step 2

We further process the masked image to make the horizontal and vertical lines in the detected tables clearer, and finally use pytesseract to extract the text present in the table cells.
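A hedged sketch of this step using OpenCV and pytesseract; the Otsu thresholding and the --psm 6 page-segmentation mode are starting-point assumptions rather than the exact post-processing used in the project:

```python
import cv2
import numpy as np
import pytesseract

def extract_table_text(masked_image_path):
    """Run Tesseract on the masked image produced in step 1."""
    img = cv2.imread(masked_image_path, cv2.IMREAD_UNCHANGED)   # keep alpha channel
    if img.shape[-1] == 4:
        # Paint transparent (non-table) pixels black so OCR ignores them.
        img[img[:, :, 3] == 0] = 0
        img = img[:, :, :3]
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding sharpens text strokes and faint cell borders before OCR.
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # --psm 6 treats the region as one uniform block of text; other page
    # segmentation modes may suit multi-column tables better.
    return pytesseract.image_to_string(thresh, config="--psm 6")
```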

Let’s check the Final Output with an Example…

Input Image :

Masked Image :

Text from OCR:

Deployment

The final Model is deployed in a web app designed using Streamlit.

Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. Because of the ease with which one can develop a data science web app, many developers use it in their daily workflow.
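Below is a minimal sketch of what such a Streamlit app could look like; predict_masks_and_extract is a hypothetical wrapper around the model-inference and OCR steps sketched earlier, not a function from the actual project.

```python
import streamlit as st
from PIL import Image

st.title("Table Data Extraction")

uploaded = st.file_uploader("Upload a document image", type=["png", "jpg", "jpeg", "bmp"])
if uploaded is not None:
    image = Image.open(uploaded)
    st.image(image, caption="Input document")
    if st.button("Extract table data"):
        with st.spinner("Running TableNet + OCR..."):
            # Hypothetical pipeline: predict masks, apply them, then run pytesseract.
            masked_image, table_text = predict_masks_and_extract(image)
        st.image(masked_image, caption="Detected table region")
        st.text_area("Extracted text", table_text, height=300)
```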

Conclusion & Future Work

  1. We built a model that can extract table information from any input image. Due to limited resources, we could only train our model for about 100 epochs. The F1-score can be improved if the model is trained for around 5000 epochs, as done in the research paper.
  2. I also tried ResNet-50 as the encoder network instead of VGG and checked the results; it did not perform as well. Of course, this network too could have been trained for more epochs.
  3. I could extract the text in the rows with reasonable accuracy, but couldn't get the column divisions right, especially when the image contained multiple tables. Handling that would make this a fully functional Table Data Extraction tool.

For the full-length code, you can check out this github link.

For any suggestions/questions, you can connect with me on LinkedIn.

References

  1. https://arxiv.org/pdf/2001.01469.pdf
  2. https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47
  3. https://www.tensorflow.org/tutorials/images/segmentation
  4. https://www.appliedaicourse.com/
