1 of 21

THE LORD OF THE POLICIES

DATA SCIENCE JOURNEY

2 of 21

ROADMAP

  • An Unexpected Journey
  • The Fellowship of the Policies
  • One Model to rule them ALL
  • The Desolation of PyTorch
  • The Battle of the Impressions

3 of 21

UNEXPECTED INSURANCE POLICIES

Can you imagine how many different types of documents there are?

4 of 21

UNDERSTANDING THE PROBLEM

5 of 21

THE JOURNEY OF TWO BURGLARS

01 TAKE A PHOTO OF YOUR DOCUMENT

02 OUR MODEL EXTRACTS MEANINGFUL DATA

03 THE DATA IS IN THE RIGHT PLACE IN JUST 5 SECONDS!

6 of 21

THE FELLOWSHIP OF THE POLICIES

Master

Marko Bogoevski

Master

Risto Trajanov

With his keen eyesight for error, sensitive debugging, and excellent codemanship, Master Marko was valuable to the Fellowship in their journey across the realm of the Policies. He was well known for befriending the code warrior Master Risto, despite the long wars between their kin in the past.

Master Risto was a well-respected code warrior in the realm of the Policies during the Debugging Years. He was a member of the Fellowship of the Policies and was the only one of the code warriors to fight alongside the bowman, Master Marko, in the war against the Generator at the end of the Third Month of the Internship. After the defeat of the Generator, he was given lordship of the Documents at Team’s Temple.

7 of 21

STARTING LINE - SHIRE

01

.PDF documents

Later converted to .JPG

JSON target files for every document

.JPG images

8 of 21

PREPROCESS

01 PDF to JPG converter

02 OCR on the images to extract the text segments and their coordinates

03 Transform the coordinates, images and target JSONs into the input format for the model

04 DATA READY FOR THE MODEL
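The transformation step can be illustrated with a minimal sketch. It assumes each OCR segment carries its text and a pixel bounding box, and that the target JSON is a flat mapping from field name to expected text. The `Segment` class, the `to_model_input` function and the `"other"` label are hypothetical names chosen for this example, not the project's actual format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One OCR text segment with its pixel bounding box (hypothetical layout)."""
    text: str
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width in pixels
    h: float  # height in pixels


def to_model_input(segments, image_width, image_height, targets):
    """Turn OCR segments and a target mapping into per-segment records.

    `targets` maps a field name to its expected text, mimicking a
    per-document target JSON (assumed schema).
    """
    # Invert the mapping so each expected text points at its field name.
    text_to_label = {v.strip().lower(): k for k, v in targets.items()}
    records = []
    for seg in segments:
        records.append({
            "text": seg.text,
            # Normalise the box to [0, 1] so the page size does not matter.
            "box": (
                seg.x / image_width,
                seg.y / image_height,
                (seg.x + seg.w) / image_width,
                (seg.y + seg.h) / image_height,
            ),
            # Segments that match no target are labelled "other".
            "label": text_to_label.get(seg.text.strip().lower(), "other"),
        })
    return records


segments = [
    Segment("Policy number", 100, 50, 200, 20),
    Segment("AB-12345", 320, 50, 120, 20),
]
records = to_model_input(segments, 1000, 500, {"policy_number": "AB-12345"})
```

Exact text matching is the simplest possible labelling rule; it breaks as soon as the OCR splits or misreads a value, which is why a fuzzier comparison is usually needed in practice.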

9 of 21

OCR module

  • Optical character recognition system
  • Many viable options
  • Tesseract OCR
  • Inconsistency in segmenting

10 of 21

OUR SOLUTIONS – ONE MODEL TO RULE THEM ALL

VISUAL CHARACTERISTICS

TEXT SEGMENTS

11 of 21

GLCNN

Graph Learning Layer + Graph Convolution Layer(s) = Graph Learning Convolution Neural Network

The graph learning layer provides an optimal adaptive graph representation for graph convolutional layers.
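The two building blocks can be sketched in plain Python to show how they fit together. This is an illustration under stated assumptions, not the model used in the project: the graph learning layer is assumed to score each pair of nodes as ReLU(aᵀ|xᵢ − xⱼ|) and normalise each row with a softmax, and the graph convolution is assumed to be ReLU(S·X·W). The names `graph_learning`, `graph_convolution`, `a` and `W` are hypothetical, and the weights are fixed by hand instead of being trained.

```python
import math


def graph_learning(X, a):
    """Learn a row-normalised adjacency matrix from node features.

    S[i][j] = softmax over j of ReLU(sum_k a[k] * |X[i][k] - X[j][k]|)
    """
    n = len(X)
    S = []
    for i in range(n):
        scores = [
            max(0.0, sum(ak * abs(xi - xj) for ak, xi, xj in zip(a, X[i], X[j])))
            for j in range(n)
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        S.append([e / total for e in exps])
    return S


def graph_convolution(S, X, W):
    """One graph convolution step: ReLU(S @ X @ W)."""
    n, d_in, d_out = len(X), len(W), len(W[0])
    # Aggregate neighbour features with the learned adjacency.
    SX = [[sum(S[i][j] * X[j][k] for j in range(n)) for k in range(d_in)]
          for i in range(n)]
    # Project the aggregated features and apply ReLU.
    return [[max(0.0, sum(SX[i][k] * W[k][o] for k in range(d_in)))
             for o in range(d_out)] for i in range(n)]


# Three nodes (text segments) with two features each.
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
a = [1.0, 1.0]                            # graph learning weights
W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]    # convolution weights, 2 -> 3
S = graph_learning(X, a)
H = graph_convolution(S, X, W)
```

In a real model both `a` and `W` would be trained jointly, so the adjacency adapts to whatever graph structure helps the downstream labelling task.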

12 of 21

Input

Output

13 of 21

Results so far…

14 of 21

Results so far

15 of 21

Postprocessing of output

  • Model outputs page level files
  • Compare each output segment with OCR segments and interpolate missing information
  • Choose the best option from each image
  • Merge the results and try to combine insurance types with sums and fees
  • Improves results vastly
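The comparison of output segments with OCR segments could look like the following sketch. It assumes the model output is a mapping from field name to possibly truncated text, and it uses fuzzy string similarity from the standard library to pull in the closest full OCR segment. The function names and the 0.6 threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def repair_with_ocr(predicted, ocr_segments, threshold=0.6):
    """Replace each predicted value with its closest OCR segment.

    Values whose best match falls below `threshold` are kept unchanged.
    """
    if not ocr_segments:
        return dict(predicted)
    repaired = {}
    for field, value in predicted.items():
        best = max(ocr_segments, key=lambda seg: similarity(value, seg))
        repaired[field] = best if similarity(value, best) >= threshold else value
    return repaired


ocr_segments = [
    "Policy number: AB-12345",
    "Premium: 1,250.00 EUR",
    "Travel insurance",
]
predicted = {"policy_number": "number: AB-123", "insurance_type": "Travel insur"}
repaired = repair_with_ocr(predicted, ocr_segments)
```

Here the truncated predictions are completed from the OCR text, while a prediction with no plausible counterpart is left as the model produced it.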

16 of 21

Problems we encountered

  • Lack of data due to sensitivity of the information and privacy
  • Low statistical significance of results
  • No basis for fine-tuning of the model
  • OCR module dependency, invalid labeling

Two potential solutions: generate own data and relabel segments
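The data generation idea can be sketched as follows. This is a toy illustration, assuming a fake policy is a few text lines filled from random values plus a matching target JSON; the field names, value ranges and `generate_policy` function are hypothetical and much simpler than a generator that renders full document images.

```python
import json
import random

INSURANCE_TYPES = ["Travel", "Household", "Vehicle"]  # hypothetical values
LETTERS = "ABCDEFGH"


def generate_policy(rng):
    """Return (text_lines, target) for one fake policy page."""
    prefix = rng.choice(LETTERS) + rng.choice(LETTERS)
    target = {
        "policy_number": f"{prefix}-{rng.randint(10000, 99999)}",
        "insurance_type": rng.choice(INSURANCE_TYPES),
        "premium": f"{rng.randint(50, 5000)}.00",
    }
    lines = [
        f"Policy number: {target['policy_number']}",
        f"Type: {target['insurance_type']} insurance",
        f"Premium: {target['premium']} EUR",
    ]
    # Shuffle the layout so the model cannot rely on a fixed line order.
    rng.shuffle(lines)
    return lines, target


rng = random.Random(0)  # fixed seed keeps the generated data reproducible
lines, target = generate_policy(rng)
target_json = json.dumps(target, indent=2)
```

Because every value comes from a small template, data produced this way is less varied than real policies, which is one way a generation process can bias results.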

17 of 21

Extracts from the generator (+ JSON and OCR-generated files)

18 of 21

Relabeling app

19 of 21

Results on fake (generated) data

  • Aggregation and postprocessing techniques included
  • No visual features aiding the model
  • Slight bias due to the generation process

20 of 21

THE DESOLATION OF TECHNOLOGIES

PyTorch

21 of 21

THE BATTLE OF IMPRESSIONS

  • Rich in experience
  • Whole new spectrum of knowledge
  • Teamwork excellence