Big Data: What’s the Big Deal? – A Guide

The largest tech companies are racing to provide two new, related services to big business: cloud computing and big data analytics.

The internet, mobile computing, the internet of things – all these provide new sources of data that did not exist ten years ago, and the quantity of data is enormous. Questions arise: how do you access that data, how do you organize it, and how do you turn it to your advantage?

These represent huge changes in the landscape of enterprise computing, and billions of dollars are at stake. Wikibon estimates total big data spending in 2018 will reach almost $50 billion, while A. T. Kearney and ABI both predict $114 billion. Granted, their definitions most likely differ, but in any case these are huge numbers.

Main players include: [image: logos of the major cloud and big data vendors]

What is it all about?

There has been a lot of talk about these services, but many people are still hazy on what it all means. Therefore, I am providing a kind of primer for those not completely sure of what this movement is about.

I – Cloud services

The basic concept of cloud services is well known. A company with large server computers offers customers the ability to store their data on those servers. Dropbox and iCloud provide this simplest of services to individuals and small businesses. The stored objects – be they word processing documents, photographs, or other document types – can typically be shared with other users, as the owner of the document sees fit. Some systems, such as Microsoft’s Office 365, Google’s (GOOG) Google Docs, and Apple’s iWork suite, allow for collaboration.

There are, however, much more elaborate services that vary along a continuum of customer engagement.

The first would be the simple Disk in the Cloud type service mentioned above.
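As a concrete illustration, storing and sharing a file through a disk-in-the-cloud service can come down to a few API calls. Here is a minimal sketch using Dropbox’s Python SDK; the access token and file paths are placeholders invented for the example:

    import dropbox

    # Authenticate against the service (placeholder token)
    dbx = dropbox.Dropbox("YOUR_ACCESS_TOKEN")

    # Upload a local document to cloud storage
    with open("report.docx", "rb") as f:
        dbx.files_upload(f.read(), "/documents/report.docx")

    # Create a link so the file can be shared with other users
    link = dbx.sharing_create_shared_link_with_settings("/documents/report.docx")
    print(link.url)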

Benefits of cloud computing [Source: Capitalfm.com]

Software as a Service – SaaS

Here a company can license particular software on a subscription basis and have the actual application programs hosted in the cloud by a provider. A user will typically log in via a web browser to access the functionality of the licensed programs.

The advantage to subscribers is that they do not need to maintain and upgrade application versions or licenses. Additionally, they do not need to maintain the physical infrastructure to support the applications, as this is all done by the service provider.

All types of software are used via this method, including office suites, CAD, databases, etc. Microsoft’s Office 365 is a well-known example, but any software can potentially be run from a remote server.

Platform as a Service – PaaS

In the PaaS models, cloud providers deliver a computing platform, typically including operating system, programming language execution environment, database, and web server. Application developers can develop and run their software solutions on a cloud platform without the cost and complexity of buying and managing the underlying hardware and software layers. [Wikipedia – emphasis added]

This allows the user to do more than simply run particular software; it is designed for developers. The first such services were for web applications in JavaScript, but services such as Google App Engine now support many languages. Oracle has its own PaaS service for the development of Oracle databases.

The advantages here are, once again, that the costs of purchasing and managing the physical computers are borne by the service provider. Additionally, added resources are typically available immediately, with no need to purchase and install your own hardware.
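To make the idea concrete, here is a minimal sketch of the kind of web application one might deploy to a PaaS such as Google App Engine. The platform supplies the operating system, language runtime, and web server; the developer manages essentially only this code:

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def hello():
        # The PaaS provider supplies the OS, runtime, and web server;
        # the developer deploys only this application code.
        return "Hello from the cloud!"

    if __name__ == "__main__":
        # Local testing only; in production the platform runs the app.
        app.run(host="127.0.0.1", port=8080)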

Infrastructure as a Service – IaaS

This last service model gives the customer the most control. The service hosts the physical machines that the customer needs, and the customer is free to manage the operating system, software, and internet connections as they see fit. The physical computer resides at the service provider’s site and is managed there.

More typically now, rather than a real, individual computer, the customer gets a virtual machine running on a larger system under the supervision of a program called a hypervisor. This allows the service provider to spread the load of many users over a large server farm. It also protects the user from outages: if a dedicated physical machine failed, it would simply go down, but a virtual machine can be kept available across the provider’s server farm.

Many variations on the model are available. For example, if the customer needs extra-high security, specific physical machines can be dedicated to him. If even that is not acceptable, the most sensitive data can be stored on a local machine, with only less critical data and applications hosted by the service provider, in what is called a hybrid system.
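In practice, provisioning an IaaS virtual machine is typically a single API call. Here is a minimal sketch using AWS’s boto3 library; the region and machine image ID are placeholders invented for the example:

    import boto3

    # Connect to the IaaS provider (credentials come from the environment)
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Request a virtual machine; the provider's hypervisor decides
    # which physical host actually runs it.
    response = ec2.run_instances(
        ImageId="ami-xxxxxxxx",   # placeholder machine image ID
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])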

II – Big Data

Big Data and Analytics: two of the most talked about and least understood topics facing the modern CIO. – [Alix Partners]

We are used to databases, from a simple contact list to the forms we fill out on a web page. We understand that they have records, and that each record has specific fields of data. Those who are more knowledgeable may understand the concept of a relational database: a customer has one record with a name and a unique ID, and each invoice need only link to that customer record via the ID. All this was fine for many years.
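For readers who want to see the relational idea concretely, here is a minimal sketch using Python’s built-in sqlite3 module; the table names and data are invented for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # One record per customer, identified by a unique ID
    cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    # Each invoice links back to its customer via that ID alone
    cur.execute("""CREATE TABLE invoices (
                       id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customers(id),
                       amount REAL)""")

    cur.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
    cur.execute("INSERT INTO invoices VALUES (101, 1, 250.0)")
    cur.execute("INSERT INTO invoices VALUES (102, 1, 99.5)")

    # Join the tables to recover the customer name for every invoice
    for row in cur.execute("""SELECT c.name, i.id, i.amount
                              FROM invoices i JOIN customers c
                              ON i.customer_id = c.id"""):
        print(row)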

Now, however, there is a double problem.

  1. There is too much data.
  2. Much of the data is unstructured.

For a large enterprise, data is coming from everywhere: from clicks and data entry on its own web sites, from notes taken at the call center, from posts in user forums, etc. And that is all internal. External sites such as Facebook generate data, potentially about your customers; your customers may use your mobile app; other companies may have statistical data relevant to you; and now, with the Internet of Things, even your products, or other companies’ products, may be generating useful data.

And a large amount of that data is unstructured – that is, not in a standardized format. This is true on two levels. First, the data may not be organized as a standardized database. Second, much of the data, even if held in a structured DB, is natural-language text from which meaning must be derived. This includes data from Twitter, as well as comments recorded in a user forum, emails, news articles, and research papers. In all, there is an enormous amount of data that can be tapped for insight into one’s customers and into other production issues.

The area of health is a realm of its own, with very different objectives but just as much interest in big data. Today, with many people using constant activity monitoring from devices such as Fitbit or the Apple Watch, or personal blood sugar and other monitors, data is accumulating at an extraordinary rate. The diverse sources make it impossible to expect consistency with any particular structural model (although systems such as Apple’s HealthKit do provide this for their users).

Thus, all this data needs to be not only collected and stored, but interpreted as well.

Hadoop & MapReduce

Data needs to be stored so it can be processed, and big data poses the obvious problem: there is a lot of it, so how should one store it?

Hadoop is an open-source framework for storing and processing very large data sets in a distributed manner. That is, the data is not stored on a single piece of hardware but is spread across a cluster of nodes, physical or virtual. A cluster can run on one machine or many, and the hardware may be at one location or several. The Hadoop Distributed File System (HDFS) manages the storage.

The MapReduce system can then process the data, again in a distributed fashion: each node independently performs the specified processing on its local data. The Map functions filter and organize the data, while the Reduce functions perform summary processing. Wikipedia uses the example of a large set of students, in which Map creates separate lists of individuals, one list for each first name, and Reduce counts the number of students for each name. In both cases, the Hadoop system handles the summarization across nodes, so the user who queries the system does not need to know how the data is distributed.
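To make that example concrete, here is a toy, single-process sketch of the same map/reduce logic in Python. In a real Hadoop job, the map step would run on each node and the framework would handle the shuffle between them; the student names are invented:

    from collections import defaultdict

    # Toy dataset: student records spread over several nodes
    clusters = [
        ["Alice", "Bob", "Alice"],      # data held on node 1
        ["Bob", "Carol", "Alice"],      # data held on node 2
    ]

    # Map phase: each node independently emits (name, 1) pairs
    mapped = [[(name, 1) for name in cluster] for cluster in clusters]

    # Shuffle/Reduce phase: pairs are grouped by name and summed,
    # giving the count of students per first name across all nodes
    counts = defaultdict(int)
    for node_output in mapped:
        for name, one in node_output:
            counts[name] += one

    print(dict(counts))   # {'Alice': 3, 'Bob': 2, 'Carol': 1}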

Many of the companies listed above provide Hadoop integration and solutions.


  • Note: It should always be remembered that Big Data does not mean simply large amounts of data; the mix of structured and unstructured data is typically just as important.

Analytics

The analysis of data is obviously not limited to Big Data systems, but has been going on for a long time in more traditional database systems.

Alix Partners, cited above, defines:

Analytics (n.) – the art of combining statistics, machine learning, and business knowledge to derive business insights, predict behaviors, and inform decision making.

This definition is oriented toward business intelligence, but it applies to other fields as well (e.g., health, science, government).
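As a small sketch of what that combination of statistics and machine learning looks like in code, here is a toy predictive model using scikit-learn. The churn scenario, features, and data are entirely synthetic, invented for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Invented features: e.g., monthly spend and support calls per customer
    X = rng.normal(size=(500, 2))
    # Invented rule: customers with many support calls tend to churn
    y = (X[:, 1] + rng.normal(scale=0.5, size=500) > 0.5).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a simple statistical model to predict the behavior
    model = LogisticRegression().fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))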

It is precisely the application of varieties of analytics to big data that is the source of the present opportunity. There is a rush for companies to bring Big Data analytic engines to customers, and all the big companies have entries.

Watson

The most notable entry is Watson, by IBM. This computer system gained fame when it beat two expert players on the game show Jeopardy! in 2011. One of its key features is Natural Language Processing (NLP), an extraordinarily difficult task that has stymied computer scientists for decades. There are many problems, but one of the most challenging is that natural language is so ambiguous: so many words have multiple meanings that it is necessary to understand the context of the current discourse, and frequently its history as well, to understand any particular sentence.
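A toy sketch can show why context matters. Real NLP systems such as Watson are vastly more sophisticated; the word lists below are invented for illustration:

    # Toy illustration of lexical ambiguity: the word "bank" resolved
    # by checking which sense's context words appear nearby.
    SENSES = {
        "bank/finance": {"money", "deposit", "loan", "account"},
        "bank/river": {"water", "fishing", "shore", "mud"},
    }

    def disambiguate(sentence):
        words = set(sentence.lower().split())
        # Pick the sense whose context vocabulary overlaps the most
        return max(SENSES, key=lambda s: len(SENSES[s] & words))

    print(disambiguate("She opened a bank account to deposit money"))  # finance
    print(disambiguate("He sat on the bank fishing in the water"))     # river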

Since 2011, Watson has been redesigned to handle other business tasks, with tremendous success in certain medical fields.

IBM’s Watson competing on Jeopardy! [Source: hothardware]

Example

Big Data storage and analysis solutions are driving a soon-to-be $100+ billion industry that is deemed essential. Alix Partners gives one example:

One large aircraft manufacturer can now analyze six years of air traffic data in real-time, using Massively Parallel Processing (MPP) databases. What used to take a week to report is now available to business users immediately and interactively. The quality of their decision-making is transformed.

This instantly available data can then be combined with mobile platforms to put immediate analysis literally in the hands of field personnel. The MobileFirst partnership between IBM and Apple is perhaps the best example of this.

Summary

The discussion above gives a brief outline of the major technologies in the field, and should help explain them to those who have felt a bit hazy about them.

Please leave your comments if you have anything to add.


Related: Neural Basis of Intuition
