5 Aralık 2017 Salı

Data Stream


Data stream is data that is continuously generated from thousands of data sources and arrives simultaneously as small data packets. The data stream can include data from various sources such as log files, e-commerce data, click flow data, social media data, financial transactions and location services data generated by customers.

The data stream consists of fast and infinite series of data. Due to these two characteristics of the data stream, it is seen that today's computers are insufficient to process the flowing data in terms of both their memory potential and processing capacity. In order to overcome this problem, either the old batch processing algorithms need to be modified and rewritten, or new methods, algorithms and platforms need to be developed for the data stream.


Real-time data stream mining is slightly different from conventional batch processing. Because the entire data cannot be accessed at any time t. Due to this nature of the data stream, there are some difficulties in processing the data:
·         The data stream is continuous and infinite. It is impossible to store and process such large data in memory. Therefore, mechanisms that can process large-scale data with less memory should be developed.
·         It is necessary to process the incoming data with high speeds and give real-time responses. For this reason, generally single pass strategy is used in the data processing. However, traditional methods can access data several times. That is, while conventional methods can make random access, it is not possible for data stream.
·         Data stream may evolve (drift) over time. Thus, there may be inconsistency between the first and subsequent data.
·         In many algorithms, parameters are determined by expert opinion. Setting these parameters is even more difficult for  data stream, due to the lack of whole data.
·         Algorithms that run on data stream have to relearn a repeating pattern.
·         Concept drifts need to be accurately detected. it is not only necessary to detect concept drifts, but also to manage them.