How to Extract Data from PDFs Using Machine Learning

PDF mining is one of the most searched topics around the world. Data in several formats needs to be extracted from PDFs, but because the format is designed to present content rather than expose it as structured data, it can be difficult to fetch the required data set. The data in a PDF can be an image, tabular, textual, etc.

In this blog, we shall discuss techniques for extracting tabular data using Machine Learning.

Following are the prerequisites for successful data extraction from PDFs:

  • Java 8+
  • Python 3.5+
  • Python libraries (tabula-py and camelot-py, covered below)

Tabular data can be extracted using one of two libraries: the Tabula library or the Camelot library. Let us study both in detail.

1. Tabula library

The Tabula library (tabula-py) is a Python wrapper for tabula-java, used to extract data in four different formats:

  • Pandas data frame
  • JSON
  • CSV (Comma-Separated Values)
  • TSV (Tab-Separated Values)

How to install Tabula?

The Tabula wrapper, tabula-py, can be installed via pip:

Input:

pip install tabula-py

Predefined Methods to extract tabular data:

  • To check the Python, Java, and OS versions before initiating tabula-py, use tabula.environment_info(), as shown below.

Input:

import tabula
tabula.environment_info()

Output:

The Python version, Java version, and operating system details of your environment are printed.
  • By default, read_pdf() reads only page 1 and returns the result as a list of DataFrames. You may alternatively inspect the available options with ?read_pdf or ?tabula.io.build_options.

Input:

import tabula

# Read a local PDF into a list of DataFrames
dfs = tabula.read_pdf("demo.pdf", pages="all")

# Read a remote PDF into a list of DataFrames
dfs2 = tabula.read_pdf("https://link of your pdf file", pages="all")
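
Besides returning DataFrames, tabula-py can also write the extracted tables straight to one of the file formats listed earlier (CSV, TSV, or JSON) with convert_into(); a minimal sketch, reusing the demo.pdf placeholder and a hypothetical output file name:

Input:

import tabula

# Convert every table on every page of the PDF directly into a CSV file
tabula.convert_into("demo.pdf", "demo_tables.csv", output_format="csv", pages="all")

# Passing output_format="json" or output_format="tsv" writes JSON or TSV instead
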
  • For detailed help on read_pdf (defined in the tabula.io module), use Python's built-in help() function: help(tabula.read_pdf)

Input:

help(tabula.read_pdf)

Output:

The full docstring for tabula.read_pdf, describing each of its parameters, is printed.
  • For the build_options function in the tabula.io module, use help(tabula.io.build_options)

Input:

help(tabula.io.build_options)

Output:

The docstring for tabula.io.build_options is printed.
  • To read a specific area of a given page, pass the dimensions of the table to be extracted (top, left, bottom, right) via the area option: tabula.read_pdf(pdf_path, area=[136, 150, 210, 455], pages=4)

Input:

tabula.read_pdf("demo.pdf", area=[136, 150, 210, 455], pages=1)

Output:

The table found inside the specified area of page 1 is returned.
  • To extract a table whose cells are separated by ruling lines, set the lattice option to True:

Input:

tabula.read_pdf(pdf_path5, pages="5", lattice=True, pandas_options={"header": [0, 1]}, area=[0, 0, 75, 150], relative_area=True, multiple_tables=False)

The Tabula app also offers templates, which store the area options set in its GUI.

To leverage a template, follow the link here.
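
If you have exported such a template as a JSON file from the Tabula app, tabula-py can apply it with read_pdf_with_template(); a minimal sketch, where demo.pdf and demo_template.json are placeholder file names:

Input:

import tabula

# Extract tables using the areas stored in a Tabula GUI template
dfs = tabula.read_pdf_with_template("demo.pdf", "demo_template.json")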

2. Camelot library

Before installing the Camelot-py library, you need to install Ghostscript, which Camelot depends on.
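
How Ghostscript is installed depends on your operating system; a minimal sketch for two common cases (assuming Ubuntu/Debian with apt, or macOS with Homebrew):

Input:

# Ubuntu/Debian
sudo apt-get install ghostscript

# macOS (Homebrew)
brew install ghostscript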

How to install Camelot-py

Camelot can then be installed as the camelot-py package via pip:

Input:

pip install camelot-py

Predefined Methods to extract tabular data:

  • To extract tables from multiple pages, use: CamelotTables = camelot.read_pdf('xyz.pdf', flavor='stream', pages='1-6')

Input:

import camelot

# Read the tables from the PDF (the lattice flavor is used by default)
tables = camelot.read_pdf('demo.pdf')

# Export all detected tables to CSV files, bundled into a ZIP archive
tables.export('demo.csv', f='csv', compress=True)

# Parsing report of the first table, for example:
tables[0].parsing_report
# {
#     'accuracy': 99.02,
#     'whitespace': 12.24,
#     'order': 1,
#     'page': 1
# }

# Save the first table as a CSV file
tables[0].to_csv('demo.csv')

# Get the first table as a pandas DataFrame
tables[0].df
  • To access one of the tables detected in the PDF file, index the returned list, e.g. CamelotTables[2] (2 is the index); the total number of tables is given by CamelotTables.n
  • To get the parsing report of a table, use CamelotTables[0].parsing_report (see the sketch below)
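
A minimal sketch of these calls, assuming CamelotTables was created with camelot.read_pdf() as in the example above:

Input:

import camelot

# xyz.pdf is a placeholder file name
CamelotTables = camelot.read_pdf('xyz.pdf', flavor='stream', pages='1-6')

# Total number of tables detected
print(CamelotTables.n)

# Parsing report of the first table
print(CamelotTables[0].parsing_report)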

This way, you can mine tabular data from PDFs using Machine Learning. However, some people may find the process complicated, so if you require any help, do not hesitate to get in touch with an expert at DEV IT here.