A data stream is data that is continuously generated by thousands of data sources and arrives simultaneously in small packets. A data stream can include data from many sources, such as log files, e-commerce data, clickstream data, social media data, financial transactions, and location-service data generated by customers.
A data stream consists of a fast and infinite series of data. Because of these two characteristics, today's computers are insufficient, in both memory and processing capacity, to process the flowing data. To overcome this problem, either existing batch-processing algorithms must be modified and rewritten, or new methods, algorithms, and platforms must be developed for the data stream.
Real-time data stream mining differs slightly from conventional batch processing, because the entire data set cannot be accessed at any time t. This nature of the data stream creates several difficulties in processing the data:
· The data stream is continuous and infinite, so it is impossible to store and process such large data in memory. Mechanisms that can process large-scale data with less memory therefore need to be developed (a memory-bounded sampling sketch is given after this list).
· Incoming data must be processed at high speed and answered in real time. For this reason, a single-pass strategy is generally used: each item is read once and then discarded. Traditional methods, by contrast, can access the data several times; that is, conventional methods allow random access, which is not possible for a data stream (a single-pass statistics sketch follows this list).
· The data stream may evolve (drift) over time, so earlier and later data may be inconsistent with each other.
· In many algorithms, parameters are set by expert opinion. Tuning these parameters is even harder for a data stream, because the whole data set is never available.
· Algorithms that run on a data stream have to relearn a pattern each time it reappears.
· Concept drifts need to be detected accurately, and detection alone is not enough: the drifts must also be managed (a simple drift detector is sketched after this list).
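As an illustration of processing a large-scale stream within a fixed memory budget, the following minimal Python sketch (not part of the original text) uses reservoir sampling, the classical Algorithm R, to keep a uniform random sample of k items from a stream of unknown length in O(k) memory and a single pass.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown length, in O(k) memory and a single pass (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item     # replace a random reservoir slot
    return reservoir

# Example: sample 5 items from a stream of one million integers.
print(reservoir_sample(range(1_000_000), 5))
```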
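The single-pass constraint can also be made concrete with Welford's online algorithm, shown below as a minimal sketch: summary statistics are updated as each item arrives, and no item is ever stored or revisited, in contrast to batch methods that assume random access.

```python
def running_stats(stream):
    """Welford's single-pass algorithm: the mean and variance are
    updated incrementally, so each item is read once and discarded."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n           # update the running mean
        m2 += delta * (x - mean)   # update the sum of squared deviations
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

# Example: statistics of a stream without random access to it.
print(running_stats(x * 0.5 for x in range(10_000)))
```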
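For concept-drift detection, the Page-Hinkley test is one classical choice; the sketch below is only an illustration with hypothetical parameter values (delta, threshold), not the specific method of any system discussed here. It raises an alarm when the cumulative deviation of the stream from its running mean rises far above its historical minimum.

```python
import random

class PageHinkley:
    """Minimal Page-Hinkley detector for an upward shift in the stream
    mean; the delta and threshold defaults are illustrative values."""
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # alarm level for the drift signal
        self.n, self.mean = 0, 0.0  # item count and running mean
        self.cum, self.min_cum = 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n   # single-pass mean update
        self.cum += x - self.mean - self.delta  # cumulative deviation
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold  # drift alarm

# Example: the stream mean jumps from 0 to 3 at t = 1000.
detector = PageHinkley()
stream = [random.gauss(0, 1) for _ in range(1000)] + \
         [random.gauss(3, 1) for _ in range(1000)]
for t, x in enumerate(stream):
    if detector.update(x):
        print(f"drift detected at t={t}")
        break
```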