Mining Data Streams

UNIT-5

What is Data Stream?

  • A data stream is a continuous, fast-changing, ordered sequence of data transmitted at very high speed.
  • It is an ordered sequence of information over a specific interval.
  • Data transmitted by the sender appears at the receiver’s side almost immediately.
  • Streaming does not mean downloading the data or storing it on storage devices.

Sources of Data Stream

Data streams come from many sources. A few widely used ones are listed below:

  • Internet traffic
  • Sensor data
  • Real-time ATM transactions
  • Live event data
  • Call records
  • Satellite data
  • Audio streaming
  • Video streaming
  • Real-time surveillance systems
  • Online transactions

What are Data Streams in Data Mining?

  • Data stream mining is the process of extracting knowledge and valuable insights from a continuous stream of data using stream processing software.
  • The knowledge extracted in data stream mining is represented as models and patterns over potentially infinite streams of information.

Characteristics of Data Stream in Data Mining

A data stream in data mining has the following characteristics:

  • Continuous Stream of Data: The data stream is an infinite, continuous stream, which results in big data. In data streaming, multiple data streams are often transmitted simultaneously.
  • Time Sensitive: Data streams are time-sensitive, and their elements carry timestamps. Each element is relevant only for a certain period, after which it loses its significance.
  • Data Volatility: Data streams are volatile, so the raw data is generally not stored. Once mining and analysis are done, the information is summarized or discarded.
  • Concept Drift: Data streams are unpredictable. The data, and the patterns in it, change or evolve over time, so a model learned earlier may stop fitting later data.
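The time-sensitivity and volatility characteristics above can be illustrated with a small sketch. It assumes a sliding time window as the summarization technique, which is only one of several possible choices; the class name `SlidingWindowMean` and its parameters are hypothetical and chosen for illustration.

```python
from collections import deque


class SlidingWindowMean:
    """Keep only a running summary of the most recent stream elements."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.items = deque()   # (timestamp, value) pairs still inside the window
        self.total = 0.0       # running sum of the values inside the window

    def add(self, timestamp, value):
        # Accept the new element (elements are assumed to arrive in time order).
        self.items.append((timestamp, value))
        self.total += value
        # Discard elements that have lost their relevance (too old).
        cutoff = timestamp - self.window_seconds
        while self.items and self.items[0][0] <= cutoff:
            _, old_value = self.items.popleft()
            self.total -= old_value

    def mean(self):
        # Summary of the current window; None if nothing has arrived yet.
        return self.total / len(self.items) if self.items else None


window = SlidingWindowMean(window_seconds=10)
window.add(0, 1.0)
window.add(5, 3.0)
mean_before = window.mean()   # both elements are still relevant
window.add(12, 5.0)           # the element from time 0 expires here
mean_after = window.mean()
```

Because old elements are dropped as new ones arrive, the summary also follows gradual changes in the data, which is one simple way to cope with concept drift.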

Time-series data mining

  • A time series is a sequence of data points recorded at specific time points, most often at regular time intervals (seconds, hours, days, months, etc.).
  • Every organization generates a high volume of data every single day, be it sales figures, revenue, traffic, or operating costs.
  • Time series data mining can generate valuable information for long-term business decisions, yet it is underutilized in most organizations.

Time Series Data Mining

  • A time series represents a collection of values obtained from sequential measurements over time.
  • Time series data mining draws on our natural ability to visualize the shape of data. A time series is an ordered sequence of data points, typically at uniform time intervals.
  • Time Series Analysis comprises methods for analyzing time-series data in order to extract meaningful statistics, rules and patterns.
  • These rules and patterns might be used to build forecasting models that are able to predict future developments.
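As a rough illustration of a forecasting model built from a time series, the sketch below uses simple exponential smoothing. This technique is an assumption chosen for brevity and is not named in the slides; the function name `exponential_smoothing_forecast` and the smoothing factor `alpha` are hypothetical.

```python
def exponential_smoothing_forecast(values, alpha):
    """Return a one-step-ahead forecast using simple exponential smoothing."""
    if not values:
        raise ValueError("values must not be empty")
    level = values[0]                       # start from the first observation
    for x in values[1:]:
        # New level = weighted mix of the latest value and the previous level.
        level = alpha * x + (1 - alpha) * level
    return level                            # forecast for the next time step


forecast = exponential_smoothing_forecast([10, 20], alpha=0.5)
```

A larger `alpha` makes the forecast react faster to recent values, while a smaller one smooths out short-term noise.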

Does the database play a vital role in Time Series mining?

A database is a collection of data gathered from different sources and stored in a structured or unstructured format.

A time series database consists of sequences of values or events that change with time. Data are recorded at regular intervals.

Applications of Time Series Mining:

1. Financial:

  • Used for stock price evaluation
  • Used for the measurement of inflation

2. Industry:

  • Used to determine power consumption

3. Scientific:

  • Used to analyze experimental results

4. Meteorological:

  • Concerned with the processes and phenomena of the atmosphere, mainly for weather forecasting

Components of a time series:

1. Trend

2. Cycle

3. Seasonal

4. Irregular

1. Long-term or trend movements:

The general direction in which a time series moves over a long interval of time. It shows the general tendency of the data to increase or decrease over a long period. It is represented by a trend curve.

2. Cyclic movements or cycle variations:

Long-term oscillations about a trend line or curve, for example business cycles. These oscillations have a period of more than a year.

3. Seasonal movements or seasonal variations:

Almost identical patterns that a time series appears to follow during corresponding months of successive years. This variation can be observed only if the data are recorded hourly, daily, weekly, or monthly. For example, cake sales rise sharply around Christmas and New Year.

4. Irregular or random movements:

These fluctuations are unforeseen, uncontrollable, and unpredictable. They are not regular variations and are purely random. Examples include labor disputes, floods, and announced personnel changes in a company.
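The four components can be separated numerically. The sketch below assumes an additive model (value = trend + seasonal + irregular) and estimates the trend with a centered moving average; the cyclic component is not isolated and stays inside the trend estimate. The function names and the sample data are hypothetical, and the series is assumed to span at least two full periods.

```python
def centered_moving_average(values, period):
    """Estimate the trend; positions too close to either end stay None."""
    n, half = len(values), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        if period % 2 == 1:
            trend[t] = sum(values[t - half:t + half + 1]) / period
        else:
            # Even period: the two end points of the window get half weight.
            inner = sum(values[t - half + 1:t + half])
            ends = 0.5 * values[t - half] + 0.5 * values[t + half]
            trend[t] = (inner + ends) / period
    return trend


def decompose_additive(values, period):
    """Split a series into trend, seasonal, and irregular parts."""
    trend = centered_moving_average(values, period)
    # Average the detrended values separately for each position in the season.
    buckets = [[] for _ in range(period)]
    for t, (v, tr) in enumerate(zip(values, trend)):
        if tr is not None:
            buckets[t % period].append(v - tr)
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period            # force seasonal effects to sum to zero
    seasonal = [r - offset for r in raw]
    # Whatever is left after removing trend and season is the irregular part.
    irregular = [
        None if tr is None else v - tr - seasonal[t % period]
        for t, (v, tr) in enumerate(zip(values, trend))
    ]
    return trend, seasonal, irregular


# Hypothetical quarterly data: a linear trend plus a repeating seasonal pattern.
pattern = [3, -1, -4, 2]
series = [10 + 2 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, irregular = decompose_additive(series, period=4)
```

On this synthetic series the irregular part is essentially zero, because the data were built from a trend and a seasonal pattern only; real data would leave a nonzero remainder.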

Example 1: Weather conditions

Example 2: Stock exchange

Example 3: Cluster monitoring of data usage in network operations

Example 4: Health monitoring (ECG report)

Sequence pattern mining

  • Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns. An example of a sequential pattern is that users who purchase a Canon digital camera are likely to purchase an HP color printer within a month.
  • Sequence: A sequence is formally defined as an ordered list of items {s1, s2, s3, …, sn}. As the name suggests, the order in which the items occur matters. It can be thought of as a transaction, such as the items purchased together in a basket.
  • Subsequence: A subsequence is obtained from a sequence by deleting zero or more items while keeping the remaining items in their original order. Suppose {a, b, g, q, y, e, c} is a sequence. A subsequence of it can be {a, b, c} or {y, e}. Observe that the items of a subsequence need not be consecutive in the sequence. Subsequences are extracted from the sequences in the database, and the generalized sequence patterns are ultimately derived from them.
  • Sequence pattern: A subsequence is called a pattern when it is found in multiple sequences. The database consists of sequences, and the goal of the GSP algorithm is to mine the sequence patterns from a large database. A subsequence qualifies as a pattern when its frequency is equal to or greater than the minimum “support” value. For example, with a minimum support of 2, <a, b> is a sequence pattern mined from the sequences {b, x, c, a}, {a, b, q}, and {a, u, b}: it occurs, in order, in the last two.
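The definitions of subsequence and support above can be checked with a short sketch. It assumes each sequence is a tuple of single items; the function names `is_subsequence` and `support` are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order."""
    remaining = iter(sequence)
    # Each `in` test consumes the iterator up to the match, preserving order.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, sequence) for sequence in database)


database = [
    ("b", "x", "c", "a"),
    ("a", "b", "q"),
    ("a", "u", "b"),
]
ab_support = support(("a", "b"), database)
non_consecutive = is_subsequence(("a", "b", "c"), ("a", "b", "g", "q", "y", "e", "c"))
```

Here `ab_support` is 2 because the first sequence contains b before a, not a before b.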

Introduction

  • Sequence Pattern Mining, a subfield of Data Mining, is the process of identifying frequently occurring ordered events or subsequences as patterns.
  • It is highly useful for retail, telecommunications, and other businesses since it helps them detect sequential patterns for targeted marketing, customer retention, and many other tasks.

What is Sequence Pattern Mining?

When you are performing Sequence Pattern Mining, you are essentially:

  • Finding frequently occurring patterns
  • Comparing sequences
  • Finding missing sequence items
  • Building efficient indexes for sequence information

Applications of Sequence Pattern Mining:

Sequence Pattern Mining finds applications in multiple fields ranging from science, business, and finance to meteorology and geology. Some of them are listed below:

  • Determination of buying patterns (“If a person bought product A, they are likely to purchase product B”)
  • Stock trading (where else do people make huge bets on patterns than in the stock market?)
  • Analyzing DNA and protein sequences in computational biology
  • Studying website logs to identify a user’s online behavior
  • Predicting natural disasters based on past indicative patterns.
  • Studying telephone calling patterns

Types of Sequence Pattern Mining Problems

  1. String Mining: This is the subset of Sequence Pattern Mining that deals with text data in a sequence. The data is drawn from a limited alphabet of characters. For example, a DNA sequence contains only the letters ‘A’, ‘T’, ‘C’, and ‘G’, so its analysis falls within String Mining. Similarly, finding patterns in ASCII character sequences falls under String Mining.
  2. Itemset Mining: This is the broader subset of Sequence Pattern Mining that aims to find patterns in ordered datasets. Itemset Mining is generally used in marketing and sales applications (increasing co-purchases of items that are frequently bought together, cross-promoting products, managing inventory, setting price levels, and so on).
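A minimal sketch of String Mining on a DNA sequence follows. It assumes the task is to find contiguous substrings of a fixed length `k` that occur at least `min_count` times; the function name and the sample string are hypothetical.

```python
from collections import Counter


def frequent_substrings(text, k, min_count):
    """Return every length-k substring that occurs at least min_count times."""
    # Slide a window of width k over the text and count each substring.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return {sub: n for sub, n in counts.items() if n >= min_count}


dna = "ATGATGCATG"   # made-up sequence over the alphabet A, T, C, G
frequent = frequent_substrings(dna, k=3, min_count=2)
```

Overlapping occurrences are counted separately, which is the usual convention when scanning a string for repeated motifs.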

Sequence Database

A sequence database is a collection of sequences, each of which is an ordered list of elements or events. It is represented as a set of tuples <SID, S>, where SID is the sequence ID and S is the sequence.

GSP (Generalized Sequential Pattern Mining)

  • This Sequence Pattern Mining algorithm takes a bottom-up approach to finding frequent patterns. Initially, every element is considered a candidate of length 1. Based on the minimum support, the frequent sequences of length 1 are identified.
  • Next, candidates of length 2 are constructed from the frequent sequences of length 1. Apriori pruning discards any candidate that is a supersequence of an infrequent sequence, since it cannot itself be frequent. This process repeats, increasing the length by one each time, until no more candidates can be generated or no frequent sequence is found. The algorithm thus outputs all the frequent sequences in the dataset, starting from length 1.
  • While Apriori pruning reduces the search space, the algorithm still scans the database multiple times and can generate a large number of candidates if the minimum support is low.
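The level-wise procedure described above can be sketched as follows. This is a simplified illustration, not the full GSP algorithm: it assumes every element of a sequence is a single item and ignores the time constraints and taxonomies that GSP also supports. All function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, sequence) for sequence in database)


def gsp_single_items(database, min_support):
    """Return {pattern: support} for every frequent sequential pattern."""
    # Level 1: every distinct item is a candidate of length 1.
    items = sorted({item for sequence in database for item in sequence})
    frequent = [(i,) for i in items if support((i,), database) >= min_support]
    result = {p: support(p, database) for p in frequent}
    while frequent:
        known = set(frequent)
        # Join: p and q combine when p without its first item equals
        # q without its last item; the candidate is one item longer.
        candidates = {p + (q[-1],) for p in frequent for q in frequent
                      if p[1:] == q[:-1]}
        # Apriori pruning: dropping any one item must leave a frequent pattern.
        candidates = [c for c in sorted(candidates)
                      if all(c[:i] + c[i + 1:] in known for i in range(len(c)))]
        # One database scan per level to count the surviving candidates.
        frequent = []
        for c in candidates:
            count = support(c, database)
            if count >= min_support:
                frequent.append(c)
                result[c] = count
    return result


database = [("b", "x", "c", "a"), ("a", "b", "q"), ("a", "u", "b")]
patterns = gsp_single_items(database, min_support=2)
```

With a minimum support of 2, only a, b, and the pattern <a, b> survive on this small database. Lowering `min_support` to 1 makes the candidate sets grow quickly, which is the cost noted in the last bullet.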