1 of 30

Applying Business Intelligence for Analyzing Big Data Using High-Performance Compressed Columnar Database System for Data Warehouse

Paper Submission ID: 136

Mohammad Shohal Bhuiyan

Department of Computer Science & Engineering,

Dhaka University of Engineering & Technology (DUET), Gazipur

Prof. Dr. Mohammad Abdur Rouf

Dhaka University of Engineering & Technology (DUET), Gazipur

Email: marouf.cse@student.duet.ac.bd

Md. Mostafijur Rahman

Dhaka University of Engineering & Technology (DUET), Gazipur

Email: 20104003@student.duet.ac.bd

2 of 30

Outline

  • Introduction
    • Problem definition
    • Motivation
  • Existing Work
  • Work in Progress
  • Future work
  • Outcome of the project


3 of 30


Introduction

  • Electronic data (structured, semi-structured, and unstructured) is generated by every business, and its volume has reached the terabyte threshold.
  • Big data are generated from different sources. Analyzing them produces analytical reports that help businesses make better decisions.

4 of 30

Business Intelligence


Business Intelligence (BI) combines:

  • Business analytics
  • Data mining
  • Data visualization
  • Data tools and infrastructure
  • Best practices that help organizations make more data-driven decisions.

5 of 30

Big Data

  • Big data is data that contains greater variety, arriving in increasing volumes and with more velocity.
  • These massive volumes of data can be used to address business problems.

6 of 30

Data Warehouse

  • A data warehouse is a type of data management system designed to enable and support business intelligence (BI) activities, especially analytics.

7 of 30

Problem Definition

  • Separate systems exist for BI, big data analysis, and data warehousing; there is no combined application for these three.
  • In the literature, several data models are used for data storage and retrieval in data warehouses.
  • Making a data warehouse available is not easy, and to react to changing business requirements, its design, contents, and physical characteristics must change on a timely basis.

8 of 30

Motivation

  • Although storage options such as Redundant Array of Independent Disks (RAID) and Network Attached Storage (NAS) are available, their price is too high for common use.
  • Conventional database technology cannot provide the desired performance. New algorithms and techniques are needed to achieve the following goals:
    • Reducing storage cost without loss of information
    • Improving performance
  • The proposed algorithm focuses on an effective data warehouse that may enable companies to achieve management efficiency by utilizing big data through efficient BI without investing in additional infrastructure.

9 of 30

Existing Work

We have used the following existing technologies in this research:

  • TPC-DS – a decision support benchmark, used for the input dataset
  • Huffman encoding – used for data compression
  • Oracle Database 21c and Microsoft SQL Server 2017 – used for comparing performance with our system

10 of 30


Our Approach

11 of 30

Basic Step Diagram of BI

Fig. 1. Basic step diagram of the proposed algorithm

  • The dataset is collected from TPC-DS.
  • Data from different sources are gathered using ETL.
  • Data are compressed column-wise in the data warehouse.
  • Non-volatile data are stored in the DW.
  • OLAP (Online Analytical Processing) runs over the DW.
  • Report views are generated.
  • Business intelligence is derived from the different reports.
  • The decision support system (DSS) turns the reports into actionable decisions.

12 of 30

Proposed Prototype

Fig. 2. Basic diagram of the proposed algorithm

13 of 30

Proposed Prototype

Fig. 3. Proposed high-performance compressed columnar database system for data warehouse

14 of 30

Compressed Columnar Dictionary

  • The Huffman algorithm encodes data stored in the dictionary as a bit stream.
  • Lines 4 and 5 of the algorithm insert the compressed data into memory.
  • The indexing structure is built up step by step.
  • High performance is achieved when retrieving data.
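As a rough illustration (a minimal sketch, not the exact HPMDB implementation), building a Huffman code table for a column's dictionary entries can look like this in C++; all names here are illustrative:

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Minimal Huffman code construction: frequent symbols get shorter bit strings.
struct Node {
    long freq;
    char sym;        // valid only for leaves
    Node *left = nullptr, *right = nullptr;
};

struct Cmp {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the tree and record the bit string for every leaf symbol.
static void collect(const Node* n, const std::string& bits,
                    std::map<char, std::string>& table) {
    if (!n->left && !n->right) { table[n->sym] = bits.empty() ? std::string("0") : bits; return; }
    if (n->left)  collect(n->left,  bits + '0', table);
    if (n->right) collect(n->right, bits + '1', table);
}

std::map<char, std::string> buildHuffmanTable(const std::string& data) {
    std::map<char, long> freq;
    for (char c : data) ++freq[c];
    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
    for (auto& [sym, f] : freq) pq.push(new Node{f, sym});
    while (pq.size() > 1) {                       // merge the two rarest subtrees
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, 0, a, b});
    }
    std::map<char, std::string> table;
    if (!pq.empty()) collect(pq.top(), "", table);
    return table;
}

// Encode a value as a bit string using the table.
std::string encode(const std::string& data,
                   const std::map<char, std::string>& table) {
    std::string out;
    for (char c : data) out += table.at(c);
    return out;
}
```

For "ABRACADABRA", 'A' (5 occurrences) receives a code no longer than 'D' (1 occurrence), so the encoded bit stream is far shorter than the 8 bits per character of the raw form.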

15 of 30

Data Insertion Using Compressed Columnar

  • Random numbers are generated and inserted into the database.
  • Data are stored using a columnar mechanism.
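The column-wise insertion idea can be sketched as follows (names such as ColumnStore are illustrative, not the actual HPMDB structures): instead of appending whole rows, each attribute value is appended to its own column vector, so a query scans only the columns it needs:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Row-wise, a record is stored together; column-wise, each attribute
// lives in its own contiguous vector, which compresses and scans better.
struct ColumnStore {
    std::unordered_map<std::string, std::vector<int>> columns;

    // Insert one record: append each field to its column.
    void insert(const std::vector<std::pair<std::string, int>>& row) {
        for (const auto& [name, value] : row)
            columns[name].push_back(value);
    }

    // A predicate scan touches only the one column it filters on.
    std::size_t countGreaterEqual(const std::string& col, int threshold) const {
        std::size_t n = 0;
        for (int v : columns.at(col))
            if (v >= threshold) ++n;
        return n;
    }
};
```

Inserting random tuples and then counting, say, c_birth_year >= 1991 reads a single contiguous vector rather than every record.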

16 of 30

Data Compression

  • Information is segregated based on domains.
  • Repeated information is reduced by assigning a standard code to each distinct value.
  • Data are converted to a columnar representation, as the columnar representation is faster than the row-wise representation.

Fig. 4. Data compression technique

17 of 30

Proposed Prototype

  • Based on our proposed algorithms, we completed the ETL part.
  • We use our existing hardware to store the data.
  • A regular laptop is used for queries and reporting.

Fig. 4. Prototype of BI

18 of 30


Experimental Results

19 of 30

Performance - Storage capacity

Number of Tuples (Million) | Est. Storage, Uncompressed (GB) | Actual Storage, SQL Server Ss (GB) | Actual Storage, Oracle So (GB) | Est. Storage, HPMDB (GB) | Actual Storage, HPMDB Sh (GB)
1                          | 0.02                            | 0.34                               | 0.34                          | 0.018                    | 0.011
5                          | 0.1                             | 2.3                                | 2.3                           | 0.09                     | 0.055
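The savings follow directly from the table: at 1 million tuples HPMDB uses 0.011 GB against SQL Server's 0.34 GB (roughly 31:1), and at 5 million tuples 0.055 GB against 2.3 GB (roughly 42:1). A one-line helper makes the arithmetic explicit:

```cpp
#include <cassert>

// Compression ratio of a baseline store over the compressed store
// (figures taken from the table above).
double compressionRatio(double baselineGB, double compressedGB) {
    return baselineGB / compressedGB;
}
// compressionRatio(0.34, 0.011) ~= 30.9 for 1M tuples
// compressionRatio(2.3, 0.055)  ~= 41.8 for 5M tuples
```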

20 of 30

Performance - Query time difference

List of SELECT queries:

Q1: SELECT * FROM Customer WHERE c_birth_year >= 1991

Q2: SELECT * FROM Customer WHERE c_birth_year >= 1991 AND c_birth_country = 'Bangladesh'

Q3: SELECT * FROM Customer WHERE c_birth_year >= 1991 AND c_last_review_date_sk = '2022-03-21'

Fig. 5. Query time difference with existing schemes

21 of 30


Performance - Trace of Elapsed Time

Fig. 6. Trace of Elapsed Time on Indexed and Non-Indexed Form of Compressed Database

22 of 30


Estimated Storage Space Comparison

Fig. 7. Estimated Storage Space Comparison in Different Architecture

23 of 30

Technologies Used

Data Format

  • Unstructured
  • Semi-structured
  • Structured

Programming Language

  • C++

24 of 30


Impacts and Constraints

25 of 30

Economic Impact

  • Maximizes resource utilization, as it can be implemented on existing hardware.
  • Reduces business risk and increases profitability by using data in a cost-effective way.
  • Supports sustainable business growth.

26 of 30

Future Work

  • Transferring the data (ETL) using ML algorithms.
  • Deploying IoT devices to efficiently collect and send the data to the DW.

27 of 30

Key Outcomes

  • Our solution increases the usability of BI without investing in additional infrastructure.
  • It supports decision-making based on historical data rather than assumptions and gut feeling.

28 of 30

Conclusion

  • The actionable analyzed data will help in making vital decisions.
  • The system enables extracting analytical information from big data.
  • It improves the performance of database transactions by using a dictionary-based columnar representation.
  • It provides a compressed columnar database system for the data warehouse.
  • We also found competitive results compared with commercial databases.

29 of 30

Reference

[1] F. Faerber, A. Kemper, P. Å. Larson, J. Levandoski, T. Neumann, and A. Pavlo, “Main memory database systems,” Found. Trends Databases, vol. 8, no. 1–2, pp. 1–130, 2017.

[2] H. Zhang, G. Chen, B. C. Ooi, K. L. Tan, and M. Zhang, “In-Memory Big Data Management and Processing: A Survey,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1920–1948, 2015.

[3] A. Kemper and T. Neumann, “HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots,” Proc. - Int. Conf. Data Eng., pp. 195–206, 2011.

[4] V. Sikka, F. Färber, A. Goel, and W. Lehner, “Sap Hana,” Proc. VLDB Endow., vol. 6, no. 11, pp. 1184–1185, 2013.

[5] N. Pansare, V. Borkar, C. Jermaine, and T. Condie, “Online aggregation for large MapReduce jobs,” Proc. VLDB Endow., vol. 4, no. 11, pp. 1135–1145, 2011.

[6] A. Habib, A. S. M. Latiful Hoque, and M. R. Hussain, “H-HIBASE: Compression enhancement of HIBASE technique using huffman coding,” J. Comput., vol. 8, no. 5, pp. 1175–1183, 2013.

[7] M. A. Rouf, “Scalable Storage in Compressed Representation for Terabyte Data Management,” MSc Thesis, Dept. of CSE, BUET, 2006.

[8] “TPC-DS is a decision support benchmark.” [Online]. Available: http://www.tpc.org/tpcds/.

[9] Y. He et al., “RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems,” Proc. - Int. Conf. Data Eng., 2011.


30 of 30


Any Questions?