In the era of Big Data, unless we understand what Big Data actually is, we will never get it right. It is important to know why the analytics tools we have used for small data sets cannot simply be scaled up when our data grows. For example, to find the simple average of ‘n’ numbers, we just add them up and divide the sum by ‘n’. The same approach works whether ‘n’ is 100 or 100 billion. However, if the numbers are large and positive, the sum of 100 billion of them might be too large to hold in the computer’s standard number representation. We need to adjust the algorithm so that the average can be found without forming that enormous sum. That is the extra bit of surgery needed for handling Big Data.
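As a concrete sketch (in Python, purely for illustration), one such adjustment is to maintain a running mean that is updated one number at a time, so no intermediate quantity ever grows to the size of the full sum:

```python
def running_mean(numbers):
    """Average a stream of numbers without ever forming their full sum.

    After the k-th value x, the mean m is updated as m <- m + (x - m) / k,
    so the intermediate quantity stays on the scale of the data values
    instead of growing with the total.
    """
    mean = 0.0
    for k, x in enumerate(numbers, start=1):
        mean += (x - mean) / k
    return mean

# The result matches the naive sum-and-divide approach.
print(running_mean([10, 20, 30, 40]))  # 25.0
```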
Decoding complex stats
Data analytics mostly comprises statistical methodologies such as regression analysis, classification and clustering techniques, standard estimation and testing procedures, and so on. While most of these methods are neatly developed in the statistical literature and easily applied to small or moderate-sized data, one might need to manipulate them intelligently, or devise novel techniques, for unusual formats of data. But the real challenge, even for standard ready-to-use techniques, lies in the limitations of using data with a huge number of variables.
One reason is the presence of ‘spurious’ or nonsense correlations among different variables. The more variables we handle, the more such correlations we encounter. And unless we can sift out the unimportant variables, we cannot carry out a meaningful analysis of the data.
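A small simulation (my own illustration, assuming Python with NumPy, not taken from the article) shows how easily such nonsense correlations appear: with only 30 observations of 500 completely independent variables, some pairs still look strongly correlated purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 30, 500                      # few observations, many variables
data = rng.standard_normal((n_obs, n_vars))  # all variables are independent

corr = np.corrcoef(data, rowvar=False)       # 500 x 500 sample correlation matrix
np.fill_diagonal(corr, 0.0)                  # ignore self-correlations
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"Largest 'spurious' correlation: {corr[i, j]:.2f} "
      f"(variables {i} and {j} are, in truth, unrelated)")
```

None of these variables is related to any other, yet the largest sample correlation is typically sizeable, well beyond what a naive analysis would treat as noise.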
It is theoretically challenging too. In addition, even a standard regression analysis, with loads of data and, say, 10,000 variables, requires additional computational techniques.
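As one illustration (my own, not from the article, assuming Python with NumPy and scikit-learn), such a regression is typically fitted with an iterative, penalised solver, for instance stochastic gradient descent with a sparsity-inducing penalty, rather than the textbook closed-form formula:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(1)
n_obs, n_vars = 5000, 1000                  # scaled-down version of the scenario
X = rng.standard_normal((n_obs, n_vars))
true_coef = np.zeros(n_vars)
true_coef[:10] = rng.standard_normal(10)    # only a handful of variables matter
y = X @ true_coef + 0.1 * rng.standard_normal(n_obs)

# Stochastic gradient descent with an L1 penalty: processes the data in
# small pieces and pushes unimportant coefficients towards zero.
model = SGDRegressor(penalty="l1", alpha=1e-4, max_iter=1000)
model.fit(X, y)
print("Coefficients clearly away from zero:",
      int(np.sum(np.abs(model.coef_) > 0.05)))
```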
Managing information
So, how do we handle this ocean of data? With virtually everything now connected through the Internet of Things, a gigantic amount of data is generated continuously, and the ever-expanding horizon of data is growing faster than ever. An IBM report from mid-2017 estimated that 2.5 quintillion bytes of data are created per day, and according to a Forbes article (2015), by 2020 about 1.7 megabytes of new information per second was expected to be created for every human being on the planet.
Storing data
Big Data is a boon and a curse at the same time. Are we really capable of leveraging it? With present expertise, the answer is ‘no’. We need to devise statistical techniques that can accommodate such data. Only the top statisticians and computational experts, working together, might produce such techniques, and even then on a case-by-case basis.
Understanding the power of data
Consider the example of multiplication. We need some additional technique for multiplying two big numbers, say with hundreds of digits. We use our memory of the multiplication tables, multiply one number by each digit of the other, one at a time, starting from the units place, and finally add up all the rows. This algorithm for multiplication is a derivative of the knowledge of tables combined with some special techniques. It can be interpreted as a Big Data problem, and special techniques are needed to solve it.
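A short sketch of that pencil-and-paper procedure (in Python, my own illustration) makes the ‘special technique’ explicit:

```python
def long_multiply(a: str, b: str) -> str:
    """Grade-school multiplication of two non-negative integers given as
    digit strings: multiply by one digit at a time, starting from the
    units place, keep track of the carries, then add up the rows."""
    result = [0] * (len(a) + len(b))
    for i, da in enumerate(reversed(a)):
        for j, db in enumerate(reversed(b)):
            result[i + j] += int(da) * int(db)
            result[i + j + 1] += result[i + j] // 10   # carry to the next place
            result[i + j] %= 10
    digits = "".join(map(str, reversed(result))).lstrip("0")
    return digits or "0"

a, b = "12345678901234567890", "98765432109876543210"
assert long_multiply(a, b) == str(int(a) * int(b))  # agrees with built-in arithmetic
print(long_multiply(a, b))
```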
Consider another simple mathematical problem: sorting. Suppose we are asked to sort five numbers in increasing order. In our elementary classes, we could easily sort them just by looking at the numbers; certainly some algorithm runs within our brain to do so. But we cannot sort 100 numbers, let alone 100,000, just by looking at them. We need an explicit algorithm to reach the answer. In that sense, we have been tackling Big Data problems for years.
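For anything beyond a handful of numbers we fall back on an explicit procedure such as merge sort, sketched below (in Python, as an illustration):

```python
def merge_sort(values):
    """Sort a list in increasing order: split it in half, sort each half,
    then merge the two sorted halves."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left, right = merge_sort(values[:mid]), merge_sort(values[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([42, 7, 19, 3, 88]))  # [3, 7, 19, 42, 88]
```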
Source: https://www.tribuneindia.com/news/jobs-careers/big-data-and-how-to-manage-it-all/679796.html