
Automated Deep Learning Invoice Processing

The document proposes an automated invoice processing system utilizing deep learning techniques to enhance efficiency and reduce manual data entry for companies handling high volumes of invoices. By implementing a Convolutional Neural Network (CNN) based approach, the system aims to accurately extract relevant information from various invoice formats while addressing challenges such as low-resolution images. The proposed solution is expected to lead to significant cost savings, increased productivity, and reduced error rates in invoice processing.


Automated Invoice Processing System

Lama Alkhaled1, Ng Yee Fei2


1 Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, Luleå, Sweden
2 Faculty of Computing, Asia Pacific University, Kuala Lumpur, Malaysia
([Link]@[Link])

2023 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM) | 979-8-3503-2315-3/23/$31.00 ©2023 IEEE | DOI: 10.1109/IEEM58616.2023.10406704

Abstract - Many companies still rely on manual data entry methods for managing their invoices. Some of these companies deal with a high volume of invoices in various formats daily, resulting in time-consuming processes and resource wastage. To address this issue, this paper proposes an efficient automated invoice processing system using deep learning. The system aims to reduce workload and enhance productivity for companies. In addition, a comprehensive review and comparison of existing techniques and similar systems has been conducted to identify the most suitable solution for this scenario. The proposed work utilizes advanced deep learning computer vision techniques: a simple Convolutional Neural Network (CNN) based on RPN and LeNet-5 is used to detect and classify text objects on invoice documents. Scanned invoices were used to assess the system's performance, namely a dataset of 1000 scanned English invoices from the Scanned Receipts OCR and Information Extraction (SROIE) dataset. The system predicts and extracts specific regions such as invoice number, date, payer information, and total amount from the invoices. However, it has been observed that low-resolution and unclear invoices can negatively impact the accuracy of OCR (Optical Character Recognition) pattern-matching methods. To mitigate this issue, an image pre-processing method has been incorporated, which reduces image noise and corrects page skew to achieve better performance.

Keywords - Invoice processing; Deep learning; optical character processing; CNN, RPN, LeNet

I. INTRODUCTION

Filing business records, particularly invoices, is an important task for companies, especially in banking, insurance, hospitals, and many other organizations. These records can take different forms, such as handwritten documents, machine-printed documents, and receipts [1,2], and even historical or visual documents [3]. Most companies still rely on paper invoices to process payments and maintain their accounts. To process a large number of invoices, staff need to spend hours manually searching through invoices and listing items in a ledger. In addition, staff need to verify the correctness of the information entered into the database systems. Manual checking is time-consuming, expensive, and error-prone because of the large number of invoices, different layouts, and delivery formats [4]. Nowadays, many businesses use optical character recognition (OCR) to process their documents, especially invoices. OCR is a method of processing and analyzing valuable information from a collection of image data. The information collected can speed up the search process by automatically converting the image data to text for different applications such as natural scene understanding [5], car license plate recognition [6], and finance [7]. However, the existing systems on the market still require staff to generate a template before extracting the invoice data. Whenever a new invoice format is encountered, staff are required to create an additional template for that invoice. This process is inefficient and non-intelligent for the end-user. Given these issues, there is a clear need to replace the manual invoice processing system with an automated solution. The proposed approach utilizes Convolutional Neural Network (CNN) architectures to learn effectively from large volumes of invoice data in various formats and to extract relevant information [8]. This eliminates the need for manual rule generation or template setup prior to invoice processing. Consequently, organizations can efficiently manage the complexities and diverse formats of invoices. Implementing this automated system will bring numerous benefits to the organization. Firstly, it will lead to cost and resource savings by eliminating the need for manual extraction of invoice data. Secondly, productivity will increase as the system eliminates the laborious task of creating templates and rules. Lastly, the system will mitigate the risk of errors during the invoice processing stage.

II. LITERATURE REVIEW

An important step in automating invoice processing is object detection in different types of invoices. However, many invoices are unstructured images from which it is difficult for a system to extract data. Generally, many open-source optical character recognition (OCR) engines only extract the textual content from a document image without giving the output coordinates. To increase the accuracy of data extraction, machine learning is necessary.
Conventional supervised learning-based techniques such as SVM and MLP require the optimal filters to be determined manually through intensive experiments. Chang et al. proposed an alarm system for edible oil manufacture that uses a support vector machine (SVM) and reaches around 80% accuracy. However, SVM relies on manually designed filters, which is time-consuming. Khan et al. (2010) report that SVM classification accuracy is best in most cases but that the procedure is

0188
Authorized licensed use limited to: Don Bosco Institute of Technology-Bengaluru. Downloaded on April 09,2025 at 17:35:02 UTC from IEEE Xplore. Restrictions apply.
time-consuming because of the many parameters and the demand for computation time. Al-Masri (2021) proposed the use of SVM to diagnose gear faults in mechanical engineering, combining mathematical statistical feature extraction, a particle swarm algorithm, and support vector machines. An intelligent diagnosis model was developed, and its reliability was tested through experiments.
In recent years, deep learning, specifically CNN, has become an important topic and has had a strong impact on image classification, scene recognition, tracking, and object detection [10]. The advantage of CNN is the automated retrieval of filters and classifiers from the training data for maximum functional elimination [11]. In “Invoice Classification Using Deep Features and Machine Learning Techniques”, Tarawneh et al. proposed a system that uses a pre-trained deep convolutional neural network (CNN). The method is based on extracting features via the AlexNet model, and the authors indicate that the deep features of AlexNet allow high classification rates [2]. In addition, Afzal et al. (2015) proposed a CNN model trained on millions of samples; the accuracy of their approach is around 77.6%. Harley et al. (2015) agreed that applying a CNN approach to document image representation gives better results than current handcrafted alternatives. Another neural network that has seen a high level of progress in the field of object detection is the Faster Region-based Convolutional Neural Network (Faster R-CNN) [14]. Faster R-CNN’s central concept is to use the Region Proposal Network (RPN) to generate candidate regions of the object, and convolution layers to recognize each area as a class of object [16]. Jiang et al. (2017) asserted that Faster R-CNN is able to automatically acquire a feature representation from data and allows end-to-end learning of all layers, based on a comparison with non-neural-based and neural-based methods. In recent years, many researchers have used this object detector to solve different problems. Zhu et al. applied Faster R-CNN to train a detection network that detects the category and position of books automatically. The model comprised a region proposal network and an object detection network that were connected to predict and identify the bounding boxes of the books.
Furthermore, there is another object detector similar to Faster R-CNN called YOLO, which stands for You Only Look Once. Both use a network structure based on anchor boxes, and both use bounding box regression. Compared with Faster R-CNN, YOLO uses a simplified model that concurrently learns classification and bounding box regression. YOLOv8 is a mature version of YOLO that uses multi-label classification and the concept of feature pyramid networks [17].

III. PROPOSED WORK

A. Problem definition

In general, in the domain of invoice and bill extraction, the invoice document is regarded as a sparse arrangement of multiple text blocks rather than as a single continuous collection of sentences [18]. Most invoice documents display semi-structured data, key fields, and company details.

Fig. 1. Different samples of SROIE.

Object detection is employed to predict the classes of the identified objects within the image. This approach utilizes algorithms for image classification and localization. However, prior to applying the object detection algorithm, it is essential to perform image preprocessing to standardize the input file. This preprocessing step not only helps normalize the image but also aids in accurately identifying the correct classes of objects and obtaining improved results in OCR (Optical Character Recognition).

B. Data Preparation

Many images used in this study are scanned document images, which exhibit irregularities and unnecessary variations that make them difficult to define directly. Additionally, low-resolution and unclear documents negatively impact the pattern-matching capabilities of OCR processes. Singh (2013) noted that existing OCR processors can achieve character recognition levels of up to 99% on high-quality documents. However, dealing with scanned images is quite different from working with pure text and using NLP [19-20]: lower output resolutions from scanners lead to higher error rates. To address these issues, image pre-processing techniques are employed to enhance image features using digital computer methods such as skew correction, smoothing, and noise removal. Automated invoice processing often requires appropriate scaling between JPEG and PNG image formats for grayscale conversion and image normalization. Grayscale is preferred in computer vision tasks because it reduces processing time: RGB images have three channels whereas grayscale images have only one [21]. Additionally, the color information in RGB images is unnecessary for invoice processing. Skew correction is

also vital in image pre-processing, as scanning documents can introduce skew due to human factors [22]. After correcting the skew, image-denoising methods are applied to ensure a clear background. Scanned images often suffer from background noise, which reduces the accuracy of text extraction. In denoising techniques, blur functions are utilized to remove unwanted noise without compromising image features. Gaussian blur is effective in reducing noise and eliminating speckles. Furthermore, image binarization plays a critical role in document image processing. This approach converts a grayscale image into black-and-white and can be performed through local or global thresholding. Global thresholding, using a single threshold for the entire document, is more suitable for extracting all documents without limitations. Considering the distinct contrast between text lines and the document context, Otsu's method has been applied to the input grayscale image. Following skew correction and noise removal, morphological operations are necessary to remove imperfections in the image structure. Dilation and erosion are the fundamental morphological operations.

C. Proposed Model

Faster R-CNN is a widely recognized model in the realm of deep learning-based object detection. Its foundational model is R-CNN, which operates as follows:
1. Selective Search: The input image is scanned to identify potential objects, generating approximately 2000 region proposals.
2. Each of these region proposals is fed into a Convolutional Neural Network (CNN).
3. The output from each CNN is then passed through a Support Vector Machine (SVM) to classify the region and a linear regressor to refine the bounding box if an object is present.

R-CNN is notably slow. Fast R-CNN addressed this issue by conducting feature extraction across the entire image prior to proposing regions. Instead of running separate CNNs for each of the 2000 overlapping regions, it utilized a single CNN for the entire image. Additionally, Fast R-CNN replaced the SVM with a softmax layer. However, the bottleneck persisted due to the time-consuming selective search algorithm used for generating region proposals.
The significant advancement introduced by Faster R-CNN was the replacement of the slow selective search algorithm with a fast neural network. This was accomplished through the introduction of the Region Proposal Network (RPN). The most common detectors used are YOLO, Faster R-CNN, and SSD. The proposed model in this work tries to maintain the high accuracy of YOLO and reduce the complexity of Faster R-CNN. The adapted detector uses the RPN concept from Faster R-CNN and implements the simple LeNet-5 network for feature extraction, as shown in Fig. 2.

Fig. 2. Proposed detector and classifier based on combining RPN and LeNet.

D. Data Extraction

Optical Character Recognition (OCR) is a technique employed to convert scanned documents into machine-readable text characters. These text characters represent the textual content present in the physical document. The textual content may exist in an unstructured form or adhere to a predefined physical document format without any specific layout. In essence, OCR aims to enable computers to recognize and interpret optical symbols without human intervention. OCR has found extensive use in various applications, including 3D object recognition, invoice and receipt processing, mailroom automation, and more. The Tesseract OCR engine v4 was used in the proposed work to recognize the characters.

Fig. 3. Tesseract OCR process flow.

IV. EXPERIMENTAL RESULTS

The experiments were conducted using the SROIE dataset. SROIE comprises 1000 complete scanned English receipt images, with each image containing approximately four important text fields such as goods name, unit price, and total cost. The dataset is divided into two subsets: a training/validation set (trainval) and a test set (test). The trainval set consists of 600 receipt images, provided to participants along with their corresponding annotations. The test set consists of 400 images. Annotations for each

image in the dataset include text bounding boxes (bbox) and the transcript of each text bbox. Bounding box locations are represented as rectangles with four vertices, ordered clockwise from the top. Annotations for each image are stored in a text file with a matching filename [23]. The evaluation aims to measure the performance of the system as a commercial product, which integrates both text localization and recognition tasks.

Table 1. Experimental results
Model                                   Recall    Precision   F1 Score
YOLO + OCR                              75.06%    79.68%      77.30%
Faster RCNN + OCR                       72.02%    74.65%      73.3%
BiLSTM + OCR                            65.84%    76.62%      70.83%
Proposed Model without preprocessing    78.26%    84.12%      81.08%
Proposed Model                          84.88%    88.62%      86.71%

V. COMMERCIAL SYSTEMS

According to Aslan (2016), electronic invoices are becoming more popular in the market. Despite that, smaller businesses still issue paper-based invoices. Different organizations use different approaches to invoice processing. Besides the manual process, many companies use automated invoice processing tools to save time and cost. Three examples of similar tools currently available on the market for organizations to manage their invoices are compared below.

Table 2. Comparison Table of Similar Commercial Systems
Criteria       SmartSoft (smartSoft, 2019)                            Rossum (Rossum, 2019)                         Sypht (Sypht, 2020)
API Support    Yes                                                    Yes                                           Yes
Platform       Desktop based                                          Cloud-based service                           Cloud-based service
Cost           High                                                   High                                          High
Pros           Export directly to SAP or ERP system.                  No rules or templates for extracting the data Zero configurations or manual annotations.
Cons           Define pre-defined templates before data extraction.   Slower prediction time per document           It has limited characteristics of text.

Based on the comparison, the research reveals that Rossum and Sypht offer trial versions for users to test and experience the system, whereas SmartSoft requires payment for system usage. All three platforms, SmartSoft, Rossum, and Sypht, provide API support. Rossum extracts data using Artificial Intelligence without requiring rules or templates, and Sypht requires zero configuration or manual annotation. In contrast, SmartSoft requires template definition prior to data extraction. Sypht demonstrates higher prediction accuracy and faster processing per document compared to Rossum. However, Sypht has limitations in terms of text characteristics, such as LSTM. Data plays a crucial role in achieving successful outcomes in the training model for automated invoice processing. Developers have also recognized the significance of considering inbound and progressive data in predictive outcomes. Key elements, such as tables, company logos, and total amounts, significantly impact the predictive outcome, which in turn affects the invoice processing results. Additionally, the image processing stages are important in the overall process, as blurry or inaccurately captured images can adversely impact the outcome. The proposed system allows users to upload invoices in image or PDF format. Upon uploading, users have the option to select multiple invoices for simultaneous extraction. The system displays the detected coordinates of specific labels and their confidence in the invoice image.

Fig. 4. Proposed System Home Page.

Fig. 5. Proposed System Extract Preview.

Fig. 6. Proposed System Extract Preview on multiple pages.
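
As a quick consistency check on Table 1, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The following Python sketch recomputes F1 from the recall and precision values reported in Table 1. The function and variable names are illustrative assumptions and are not taken from the authors' implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %, reported F1 %) copied from Table 1.
# The dictionary name and layout are illustrative only.
TABLE_1 = {
    "YOLO + OCR": (75.06, 79.68, 77.30),
    "Faster RCNN + OCR": (72.02, 74.65, 73.3),
    "BiLSTM + OCR": (65.84, 76.62, 70.83),
    "Proposed Model without preprocessing": (78.26, 84.12, 81.08),
    "Proposed Model": (84.88, 88.62, 86.71),
}


def recompute_f1(table):
    """Return {model name: recomputed F1} using only the recall and precision columns."""
    return {name: f1_score(p, r) for name, (r, p, _reported) in table.items()}


if __name__ == "__main__":
    for name, f1 in recompute_f1(TABLE_1).items():
        reported = TABLE_1[name][2]
        print(f"{name}: recomputed F1 = {f1:.2f}%, reported = {reported}%")
```

Under these assumptions, the recomputed values agree with the reported F1 column to within rounding.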

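
The global thresholding step described in the Data Preparation subsection can be illustrated with a minimal pure-Python sketch of Otsu's method, which selects the threshold that maximizes the between-class variance of the grayscale histogram. The paper does not describe its implementation, so the function names and the toy pixel values below are assumptions for illustration only; a real pipeline would more likely rely on an image-processing library.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold t that maximizes between-class variance.

    `pixels` is a flat list of integers in [0, 255]. Pixels <= t form one
    class (e.g. dark text) and pixels > t form the other (e.g. background).
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break               # all pixels are in the lower class
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D grayscale image (list of rows) to black (0) and white (255)."""
    return [[255 if v > threshold else 0 for v in row] for row in image]


if __name__ == "__main__":
    # Toy 2x3 "scan": dark text pixels around 10, light background around 200.
    toy = [[10, 10, 12], [200, 210, 200]]
    t = otsu_threshold([v for row in toy for v in row])
    print(t, binarize(toy, t))
```

A single global threshold of this kind is what distinguishes global from local thresholding: every pixel of the document is compared against the same value.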
VI. CONCLUSION

The proposed automated system has showcased its superiority compared to other suggested detectors and possesses numerous functionalities found in current commercialized systems. The significant accomplishment of this system is its efficiency and effectiveness in extracting invoice data. However, there are still challenges that must be tackled to enhance its performance. A notable challenge involves accurately extracting data due to the wide range of formats and unstructured characteristics of invoices, which vary in type and size. The trained model faces difficulties in effectively extracting features from text-image combinations in diverse formats, resulting in varying levels of accuracy across different invoice formats.

REFERENCES

[1] Mokayed, H. and Mohamed, A., 2014. A robust thresholding technique for generic structured document classifier using ordinal structure fuzzy logic. International Journal of Innovative Computing, Information and Control, 10(4), pp.1543-1554.
[2] Tarawneh, A.S., Hassanat, A.B., Chetverikov, D., Lendak, I. and Verma, C., 2019, April. Invoice Classification Using Deep Features and Machine Learning Techniques. In 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT) (pp. 855-859). IEEE.
[3] Kanchi, S., Pagani, A., Mokayed, H., Liwicki, M., Stricker, D. and Afzal, M.Z., 2022. EmmDocClassifier: Efficient Multimodal Document Image Classifier for Scarce Data. Applied Sciences, 12(3), p.1457.
[4] Baviskar, D., Ahirrao, S. and Kotecha, K., 2021. Multi-layout invoice document dataset (MIDD): a dataset for named entity recognition. Data, 6(7), p.78.
[5] Mokayed, H., Shivakumara, P., Saini, R., Liwicki, M., Hin, L.C. and Pal, U., 2021. Anomaly detection in natural scene images based on enhanced fine-grained saliency and fuzzy logic. IEEE Access, 9, pp.129102-129109.
[6] Mokayed, H., Quan, T.Z., Alkhaled, L. and Sivakumar, V., 2023. Real-time human detection and counting system using deep learning computer vision techniques. In Artificial Intelligence and Applications (Vol. 1, No. 4).
[7] Ha, H.T., Medved’, M., Nevěřilová, Z. and Horák, A., 2018. Recognition of OCR invoice metadata block types. In Text, Speech, and Dialogue: 21st International Conference, TSD 2018, Brno, Czech Republic, September 11-14, 2018, Proceedings 21 (pp. 304-312). Springer International Publishing.
[8] Mokayed, Hamam, Thomas Clark, Lama Alkhaled, Mohamad Ali Marashli, and Hum Yan Chai. "On Restricted Computational Systems, Real-time Multi-tracking and Object Recognition Tasks are Possible." In 2022 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 1523-1528. IEEE, 2022.
[9] Al-Masri, A.N.A. and Mokayed, H., 2021. Intelligent fault diagnosis of gears based on deep learning feature extraction and particle swarm support vector machine state recognition. Journal of Intelligent Systems and Internet of Things, 4(1), pp.26-40.
[10] Zhu, B., Wu, X., Yang, L., Shen, Y. and Wu, L., 2016, July. Automatic detection of books based on Faster R-CNN. In 2016 Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC) (pp. 8-12). IEEE.
[11] Kim, K.W., Hong, H.G., Nam, G.P. and Park, K.R., 2017. A study of deep CNN-based classification of open and closed eyes using a visible light camera sensor. Sensors, 17(7), p.1534.
[12] Afzal, Muhammad Zeshan, et al. "Deepdocclassifier: Document classification with deep convolutional neural network." 2015 13th international conference on document analysis and recognition (ICDAR). IEEE, 2015.
[13] Harley, Adam W., Alex Ufkes, and Konstantinos G. Derpanis. "Evaluation of deep convolutional nets for document image classification and retrieval." In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 991-995. IEEE, 2015.
[14] Girshick, R., 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 1440-1448).
[15] Jiang, H. and Learned-Miller, E., 2017, May. Face detection with the faster R-CNN. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (pp. 650-657). IEEE.
[16] Nguyen, C.C., Tran, G.S., Nghiem, T.P., Doan, N.Q., Gratadour, D., Burie, J.C. and Luong, C.M., 2018, April. Towards real-time smile detection based on faster region convolutional neural network. In 2018 1st International Conference on Multimedia Analysis and Pattern Recognition (MAPR) (pp. 1-6). IEEE.
[17] Benjdira, Bilel, et al. "Car detection using unmanned aerial vehicles: Comparison between faster r-cnn and yolov3." 2019 1st International Conference on Unmanned Vehicle Systems-Oman (UVS). IEEE, 2019.
[18] Holt, X. and Chisholm, A., 2018, December. Extracting structured data from invoices. In Proceedings of the Australasian Language Technology Association Workshop 2018 (pp. 53-59).
[19] Alkhaled, L., Adewumi, T. and Sabry, S.S., 2023. Bipol: A Novel Multi-Axes Bias Evaluation Metric with Explainability for NLP. arXiv preprint arXiv:2304.04029.
[20] Adewumi, T., Södergren, I., Alkhaled, L., Sabry, S.S., Liwicki, F. and Liwicki, M., 2023. Bipol: Multi-axes Evaluation of Bias with Explainability in Benchmark Datasets. arXiv preprint arXiv:2301.12139.
[21] Kanan, C. and Cottrell, G.W., 2012. Color-to-grayscale: does the method matter in image recognition?. PloS one, 7(1), p.e29740.
[22] Gedraite, E.S. and Hadad, M., 2011, September. Investigation on the effect of a Gaussian Blur in image filtering and segmentation. In Proceedings ELMAR-2011 (pp. 393-396). IEEE.
[23] Z. Huang et al., "ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction," 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 2019, pp. 1516-1520, doi: 10.1109/ICDAR.2019.00244.
[24] Aslan, E., Karakaya, T., Unver, E. and Akgül, Y.S., 2016. A Part based Modeling Approach for Invoice Parsing. In VISIGRAPP (3: VISAPP) (pp. 392-399).
