Thursday, November 8, 2018

How to manage every aspect of Big Data





In the era of Big Data, unless we understand what Big Data is, we will never get it right.

It’s important to understand why the analytics tools we use on small data sets cannot simply be reused once our data grows. For example, to find the simple average of ‘n’ numbers, we just add them up and divide the sum by ‘n’. The same approach is followed whether ‘n’ is 100 or 100 billion.
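
As a minimal sketch, the small-data version is just a sum and a division (the list below is a made-up example):

    numbers = [12, 7, 3, 25, 9]            # any small list of numbers
    average = sum(numbers) / len(numbers)  # add them all, divide by n
    print(average)                         # 11.2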

However, if all the numbers are large and positive, the running sum of 100 billion such numbers can overflow a machine’s fixed-width arithmetic, and the data itself may not fit in the computer’s memory at once. We need to adjust the algorithm appropriately to find the average. That’s the extra bit of surgery needed for handling Big Data.
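
A common adjustment, shown here only as a hedged sketch since the article does not prescribe one, is to keep a running mean and update it one value at a time, so neither the full data set nor a huge intermediate sum is ever held at once. Python integers do not overflow, but the same idea is what makes streaming or fixed-width implementations safe:

    def running_mean(stream):
        """Incremental mean: never materializes the whole data set or a giant sum."""
        mean = 0.0
        count = 0
        for x in stream:
            count += 1
            mean += (x - mean) / count   # fold each new value into the current mean
        return mean

    # Works the same whether the stream holds 100 values or 100 billion:
    print(running_mean(iter([12, 7, 3, 25, 9])))   # 11.2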

Decoding complex stats

Data analytics mostly comprises statistical methodologies such as regression analysis, classification and clustering techniques, and standard estimation and testing procedures. While most of these methods are neatly developed in the statistical literature and easily applied to small and moderate-sized data, one might need to adapt them intelligently, or devise novel techniques, for unusual formats of data. But the real challenge, even for standard ready-to-use techniques, lies in the limitations of working with data that has a huge number of variables.

One reason is the presence of ‘spurious’ or nonsense correlations among variables. The more variables we handle, the more such correlations we encounter. And unless we can sift out the unimportant variables, we cannot carry out a meaningful analysis of the data.
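
This is easy to see in a small simulation (a hedged illustration with made-up sizes, not any particular data set): with thousands of pure-noise variables, some of them will look convincingly correlated with the outcome by chance alone.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 10_000                    # few observations, many candidate variables
    y = rng.standard_normal(n)            # outcome that is pure noise
    X = rng.standard_normal((n, p))       # predictors that are also pure noise

    # correlation of every column of X with y
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    print("largest absolute correlation:", np.abs(corr).max())
    # Typically around 0.4 here -- it looks meaningful, yet is entirely spurious.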

It’s theoretically challenging too. And even a standard regression analysis with loads of data and, say, 10,000 variables calls for additional computational techniques.
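
One widely used family of such techniques is lasso-type penalized regression, which shrinks many coefficients to exactly zero and so does the sifting automatically. The sketch below uses scikit-learn’s Lasso on simulated data; the library choice, the sizes and the penalty strength are assumptions made for illustration, not something the article specifies:

    import numpy as np
    from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

    rng = np.random.default_rng(1)
    n, p = 1_000, 10_000
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only five variables truly matter
    y = X @ beta + rng.standard_normal(n)

    model = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
    kept = np.flatnonzero(model.coef_)
    print("variables the penalty keeps:", kept)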

Managing information

So, how do we handle this ocean of data? With virtually everything now connected through the Internet of Things, a gigantic amount of data is generated continuously, and the horizon of data is expanding faster than ever. An IBM report from mid-2017 estimated that 2.5 quintillion bytes of data are created every day, and according to a 2015 Forbes article, by 2020 about 1.7 megabytes of new information per second is expected to be created for every human being on the planet.

Storing data

Big Data is a boon and a curse at the same time. Are we really capable of leveraging it? With the present expertise, the answer is ‘no’. We need to devise statistical techniques that can accommodate data at this scale. Only the top statisticians and computational experts, working together, are likely to produce such techniques, and even then on a case-by-case basis.

Understanding the power of data

Consider the example of multiplication. We need additional techniques for multiplying two big numbers, say numbers with hundreds of digits. We multiply one number by each digit of the other, one digit at a time, starting from the units place and carrying in our heads as we go. Finally, we add up all the rows of partial products. This algorithm for multiplication builds on our knowledge of the multiplication tables, combined with some special techniques. It can be interpreted as a Big Data problem, and special techniques are needed to solve it.
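
Written out as code, the schoolbook procedure looks something like this (a sketch working on digit strings, so the numbers can be far longer than any built-in fixed-width type):

    def long_multiply(a: str, b: str) -> str:
        """Schoolbook multiplication of two non-negative integers given as digit strings."""
        result = [0] * (len(a) + len(b))            # room for every partial product
        for i, da in enumerate(reversed(a)):        # digits of a, units place first
            carry = 0
            for j, db in enumerate(reversed(b)):    # multiply by each digit of b
                total = result[i + j] + int(da) * int(db) + carry
                result[i + j] = total % 10          # keep one digit in this column
                carry = total // 10                 # carry the rest to the next column
            result[i + len(b)] += carry
        digits = "".join(map(str, reversed(result))).lstrip("0")
        return digits or "0"

    print(long_multiply("123456789", "987654321"))  # 121932631112635269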

Consider another simple mathematical problem — sorting. Suppose we have to sort five numbers in increasing order. In our elementary classes we could easily sort them just by looking at them; certainly some algorithm runs inside our heads to do it. But we cannot sort 100 numbers, let alone 100,000, just by looking at them. We need an explicit algorithm to reach the answer. In that sense, we have been tackling Big Data problems for years.
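
One such explicit algorithm, close to what we do by hand with a few numbers, is insertion sort (a toy sketch; real systems use faster methods, such as merge sort or the language’s built-in sort, for large inputs):

    def insertion_sort(nums):
        """Sort a list in increasing order, roughly the way we sort a handful of numbers by hand."""
        nums = list(nums)                          # work on a copy
        for i in range(1, len(nums)):
            current = nums[i]
            j = i - 1
            while j >= 0 and nums[j] > current:    # shift larger values one slot right
                nums[j + 1] = nums[j]
                j -= 1
            nums[j + 1] = current                  # drop the value into its place
        return nums

    print(insertion_sort([42, 7, 19, 3, 88]))      # [3, 7, 19, 42, 88]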



