1 of 30

Applying Business Intelligence for Analyzing Big Data Using High-Performance Compressed Columnar Database System for Data Warehouse

Paper Submission ID: 136

Mohammad Shohal Bhuiyan

Department of Computer Science & Engineering,

Dhaka University of Engineering & Technology (DUET), Gazipur

Prof. Dr. Mohammad Abdur Rouf

Dhaka University of Engineering & Technology (DUET), Gazipur

Email: marouf.cse@student.duet.ac.bd

Md. Mostafijur Rahman

Dhaka University of Engineering & Technology (DUET), Gazipur

Email: 20104003@student.duet.ac.bd

2 of 30

Outline

  • Introduction
    • Problem definition
    • Motivation
  • Existing Work
  • Work in Progress
  • Future work
  • Outcome of the project


3 of 30


Introduction

  • Electronic data (structured, semi-structured, and unstructured) is generated by every business, and its volume has reached the terabyte threshold.
  • Big data are generated from different sources. Analyzing them produces analytical reports that help businesses make better decisions.

4 of 30

Business Intelligence


Business Intelligence (BI) combines:

  • Business analytics
  • Data mining
  • Data visualization
  • Data tools and infrastructure
  • Best practices that help organizations make more data-driven decisions.

5 of 30

Big Data

  • Big data is data that contains greater variety, arriving in increasing volumes and with more velocity.
  • These massive volumes of data can be used to address business problems.

6 of 30

Data Warehouse

  • A data warehouse is a type of data management system designed to enable and support business intelligence (BI) activities, especially analytics.

7 of 30

Problem Definition

  • Separate systems exist for BI, big data analysis, and data warehousing; there is no combined application for these three.
  • In the literature, several data models are used for data storage and retrieval in data warehouses.
  • Making a data warehouse available is not easy, and to react to changing business requirements, its design, contents, and physical characteristics must change on a timely basis.

8 of 30

Motivation

  • Although storage options such as Redundant Array of Independent Disks (RAID) and Network Attached Storage (NAS) are available, their price is too high for common use.
  • Conventional database technology cannot provide the desired performance. New algorithms and techniques are needed to achieve the following goals:
    • Reducing storage cost without loss of information
    • Improving performance
  • The proposed algorithm focuses on an effective data warehouse that may enable companies to achieve management efficiency by utilizing big data through efficient BI without investing in additional infrastructure.

9 of 30

Existing Work

We have used the following existing technologies in this research:

  • TPC-DS – a decision support benchmark, used for the input dataset
  • Huffman encoding – used for data compression
  • Oracle Database 21c and Microsoft SQL Server 2017 – used for comparing performance with our system

10 of 30


Our Approach

11 of 30

Basic Step Diagram of BI

Fig. 1. Basic step diagram of the proposed algorithm

  • The dataset is collected from TPC-DS.
  • Data from different sources are gathered using ETL.
  • Data are compressed column-wise in the data warehouse.
  • Non-volatile data are stored in the DW.
  • OLAP (Online Analytical Processing) runs over the DW.
  • Report views are generated.
  • Business intelligence is derived from the different reports.
  • The decision support system (DSS) turns the reports into actionable decisions.

12 of 30

Proposed Prototype

Fig. 2. Basic diagram of the proposed algorithm

13 of 30

Proposed Prototype

Fig. 3. Proposed high-performance compressed columnar database system for data warehouse

14 of 30

Compressed Columnar Dictionary

  • The Huffman algorithm encodes data stored in the dictionary as a bit stream.
  • Lines 4 and 5 of the algorithm insert the compressed data into memory.
  • The indexing structure is built up step by step.
  • High performance is achieved when retrieving data.
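As a rough illustration (a minimal sketch, not the exact HPMDB implementation), building a Huffman code table for a column's dictionary entries can look like this in C++; all names here are illustrative:

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Minimal Huffman code construction: frequent symbols get shorter bit strings.
struct Node {
    long freq;
    char sym;        // valid only for leaves
    Node *left = nullptr, *right = nullptr;
};

struct Cmp {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the tree and record the bit string for every leaf symbol.
static void collect(const Node* n, const std::string& bits,
                    std::map<char, std::string>& table) {
    if (!n->left && !n->right) { table[n->sym] = bits.empty() ? std::string("0") : bits; return; }
    if (n->left)  collect(n->left,  bits + '0', table);
    if (n->right) collect(n->right, bits + '1', table);
}

std::map<char, std::string> buildHuffmanTable(const std::string& data) {
    std::map<char, long> freq;
    for (char c : data) ++freq[c];
    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
    for (auto& [sym, f] : freq) pq.push(new Node{f, sym});
    while (pq.size() > 1) {                       // merge the two rarest subtrees
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, 0, a, b});
    }
    std::map<char, std::string> table;
    if (!pq.empty()) collect(pq.top(), "", table);
    return table;
}

// Encode a value as a bit string using the table.
std::string encode(const std::string& data,
                   const std::map<char, std::string>& table) {
    std::string out;
    for (char c : data) out += table.at(c);
    return out;
}
```

For "ABRACADABRA", 'A' (5 occurrences) receives a code no longer than 'D' (1 occurrence), so the encoded bit stream is far shorter than the 8 bits per character of the raw form.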

15 of 30

Data Insertion Using Compressed Columnar

  • Random numbers are generated and inserted into the database.
  • Data are stored using a columnar mechanism.
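The column-wise insertion idea can be sketched as follows (names such as ColumnStore are illustrative, not the actual HPMDB structures): instead of appending whole rows, each attribute value is appended to its own column vector, so a query scans only the columns it needs:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Row-wise, a record is stored together; column-wise, each attribute
// lives in its own contiguous vector, which compresses and scans better.
struct ColumnStore {
    std::unordered_map<std::string, std::vector<int>> columns;

    // Insert one record: append each field to its column.
    void insert(const std::vector<std::pair<std::string, int>>& row) {
        for (const auto& [name, value] : row)
            columns[name].push_back(value);
    }

    // A predicate scan touches only the one column it filters on.
    std::size_t countGreaterEqual(const std::string& col, int threshold) const {
        std::size_t n = 0;
        for (int v : columns.at(col))
            if (v >= threshold) ++n;
        return n;
    }
};
```

Inserting random tuples and then counting, say, c_birth_year >= 1991 reads a single contiguous vector rather than every record.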

16 of 30

Data Compression

  • Information is segregated based on domains.
  • Repeated information is reduced by assigning a standard code to each distinct value.
  • Data are converted to a columnar representation, as the columnar representation is faster than the row-wise representation.

Fig. 4. Data compression technique

17 of 30

Proposed Prototype

  • Based on our proposed algorithms, we completed the ETL part.
  • We use our existing hardware to store the data.
  • A regular laptop is used for queries and reporting.

Fig. 4. Prototype of BI

18 of 30


Experimental Results

19 of 30

Performance - Storage capacity

Number of Tuples (Million) | Est. Storage, Uncompressed (GB) | Actual Storage, SQL Server Ss (GB) | Actual Storage, Oracle So (GB) | Est. Storage, HPMDB (GB) | Actual Storage, HPMDB Sh (GB)
1                          | 0.02                            | 0.34                               | 0.34                          | 0.018                    | 0.011
5                          | 0.1                             | 2.3                                | 2.3                           | 0.09                     | 0.055
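The savings follow directly from the table: at 1 million tuples HPMDB uses 0.011 GB against SQL Server's 0.34 GB (roughly 31:1), and at 5 million tuples 0.055 GB against 2.3 GB (roughly 42:1). A one-line helper makes the arithmetic explicit:

```cpp
#include <cassert>

// Compression ratio of a baseline store over the compressed store
// (figures taken from the table above).
double compressionRatio(double baselineGB, double compressedGB) {
    return baselineGB / compressedGB;
}
// compressionRatio(0.34, 0.011) ~= 30.9 for 1M tuples
// compressionRatio(2.3, 0.055)  ~= 41.8 for 5M tuples
```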

20 of 30

Performance - Query time difference

List of SELECT queries:

Q1: SELECT * FROM Customer WHERE c_birth_year >= 1991

Q2: SELECT * FROM Customer WHERE c_birth_year >= 1991 AND c_birth_country = 'Bangladesh'

Q3: SELECT * FROM Customer WHERE c_birth_year >= 1991 AND c_last_review_date_sk = '2022-03-21'

Fig. 5. Query time difference with existing schemes

21 of 30


Performance - Trace of Elapsed Time

Fig. 6. Trace of Elapsed Time on Indexed and Non-Indexed Form of Compressed Database

22 of 30


Estimated Storage Space Comparison

Fig. 7. Estimated Storage Space Comparison in Different Architecture

23 of 30

Technologies Used

Data Format

  • Unstructured
  • Semi-structured
  • Structured

Programming Language

  • C++

24 of 30


Impacts and Constraints

25 of 30

Economic Impact

  • Maximizes resource utilization, as it can be implemented on existing hardware.
  • Reduces business risk and increases profitability by using data in a cost-effective way.
  • Supports sustainable business growth.

26 of 30

Future Work

  • Transferring the data (ETL) using ML algorithms.
  • Deploying IoT devices to efficiently collect and send the data to the DW.

27 of 30

Key Outcomes

  • Our solution increases the usability of BI without investing in additional infrastructure.
  • It supports decision-making based on historical data rather than assumptions and gut feeling.

28 of 30

Conclusion

  • The actionable analyzed data will help in making vital decisions.
  • The system enables extracting analytical information from big data.
  • It improves the performance of database transactions by using a dictionary-based columnar representation.
  • It provides a compressed columnar database system for the data warehouse.
  • We also found competitive results compared with commercial databases.

29 of 30

Reference

[1] F. Faerber, A. Kemper, P. Å. Larson, J. Levandoski, T. Neumann, and A. Pavlo, “Main memory database systems,” Found. Trends Databases, vol. 8, no. 1–2, pp. 1–130, 2017.

[2] H. Zhang, G. Chen, B. C. Ooi, K. L. Tan, and M. Zhang, “In-Memory Big Data Management and Processing: A Survey,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1920–1948, 2015.

[3] A. Kemper and T. Neumann, “HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots,” Proc. - Int. Conf. Data Eng., pp. 195–206, 2011.

[4] V. Sikka, F. Färber, A. Goel, and W. Lehner, “Sap Hana,” Proc. VLDB Endow., vol. 6, no. 11, pp. 1184–1185, 2013.

[5] N. Pansare, V. Borkar, C. Jermaine, and T. Condie, “Online aggregation for large MapReduce jobs,” Proc. VLDB Endow., vol. 4, no. 11, pp. 1135–1145, 2011.

[6] A. Habib, A. S. M. Latiful Hoque, and M. R. Hussain, “H-HIBASE: Compression enhancement of HIBASE technique using huffman coding,” J. Comput., vol. 8, no. 5, pp. 1175–1183, 2013.

[7] M. A. Rouf, “Scalable Storage in Compressed Representation for Terabyte Data Management,” MSc Thesis, Dept. of CSE, BUET, 2006.

[8] “TPC-DS is a decision support benchmark.” [Online]. Available: http://www.tpc.org/tpcds/.

[9] Y. He et al., “RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems,” Proc. - Int. Conf. Data Eng., 2011.


30 of 30


Any Questions?