

What is Big Data and How Will it Revolutionize the Health Industry? - PART I

Big Data is one of those new terms that has been getting a lot of media coverage. If you’re like me, you have been confused by what it even means. The short answer is that Big Data is a new approach for organizing and analyzing the massive amounts of data being generated each day. Big Data allows for insights that were practically impossible under traditional approaches. We are at the doorstep of a revolution, yet we still haven’t maximized our potential with old techniques and approaches.

Before we dive into the future of Big Data, it helps to first realize how much data modern society is producing each day. Eric Schmidt of Google noted that “from the dawn of civilization until 2003, humankind generated five exabytes of data. Now we produce five exabytes every two days…and the pace is accelerating.” (An exabyte is one billion gigabytes.) More recently, it was estimated that we produce two and a half quintillion bytes of data every day as of 2012. These include everything from your credit card purchases, to the photos you take on your phone, to your social media posts. Everything is being digitized and captured, and more data are being produced constantly. Every phone call you make is logged. Every song you play in iTunes is documented somewhere.
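To get a feel for the scale, here is a small back-of-the-envelope sketch of the arithmetic behind Schmidt's figure. It assumes the decimal (SI) definition of an exabyte, 10^18 bytes; the variable names are just for illustration.

```python
# Decimal (SI) units: an assumption, since the quote does not say which definition it uses.
EXABYTE = 10**18   # bytes
GIGABYTE = 10**9   # bytes

# "Five exabytes every two days" works out to this many bytes per day.
bytes_per_day = 5 * EXABYTE // 2

print(f"{bytes_per_day:,} bytes per day")
print(f"{bytes_per_day // GIGABYTE:,} gigabytes per day")
```

In other words, five exabytes every two days is two and a half quintillion bytes a day.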

We have an astounding volume, variety, and velocity of data–“the three Vs.” This is where Big Data comes in. Big Data is a new approach to storing, reading, and analyzing these data, which are distributed over many different platforms, and are not standardized. This differs from the traditional mode of data analysis.  The traditional approach has been to organize and build what are called “relational databases,” then apply statistical analysis methods to answer specific questions–often the databases are built for the purpose of answering those specific questions.

A relational database is simply a set of data tables, each made up of rows and columns, which are joined together by one or more columns used as an identifier. For example, if you have a student ID card, then the university has a table of all student IDs, with personal information about each student. Then, there would be another table, say one with course registrations by student. Every time you register for a course, a new row is created with your student ID and the course number. Because each table makes use of your student ID as an identifier, an analyst can find your information from each table–to create a class roster, say, or to print out your schedule for this semester. We can find all the student ID numbers registered for a particular course, then find information on each student from the other table.
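For readers who like to see it concretely, the student example can be sketched in a few lines of Python using the built-in sqlite3 module. The table names, column names, and students are all made up for illustration; a real university system would look quite different.

```python
import sqlite3

# An in-memory database with the two hypothetical tables from the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE registrations (student_id INTEGER, course TEXT);
""")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben"), (3, "Cy")],
)
# Each registration adds one row: the student ID plus the course number.
conn.executemany(
    "INSERT INTO registrations VALUES (?, ?)",
    [(1, "STAT101"), (2, "STAT101"), (2, "HP200"), (3, "HP200")],
)

def roster(course):
    """Names of all students registered for a course, joined on student_id."""
    rows = conn.execute(
        "SELECT s.name FROM registrations r "
        "JOIN students s ON s.student_id = r.student_id "
        "WHERE r.course = ? ORDER BY s.name",
        (course,),
    )
    return [name for (name,) in rows]

def schedule(student_id):
    """All courses one student is registered for."""
    rows = conn.execute(
        "SELECT course FROM registrations WHERE student_id = ? ORDER BY course",
        (student_id,),
    )
    return [course for (course,) in rows]

print(roster("STAT101"))
print(schedule(2))
```

The student ID is the only thing the two tables share, and that is all the join needs.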

A shopper rewards program works the same way. One table records your reward number and all your personal information. Another records each purchase you make, tagged with that same reward number. Companies can use what are called data mining techniques on this database to encourage more sales. For example, retailers already send catalogs and specials to their customers. If they know your shopping history, they can customize the mailers they send you to highlight items you are more likely to buy. Even just knowing the gender of the customer allows them to segment their advertisements and get a better return on investment. The more they know about your preferences, and the preferences of people like you, the better they can customize their engagement with you.
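As a toy illustration of the idea, the sketch below picks the product category each customer buys most often, which a retailer might then feature in that customer's mailer. The reward numbers, categories, and the "most frequent category" rule are all assumptions made for the example; real data mining methods are far more sophisticated.

```python
from collections import Counter, defaultdict

# Hypothetical purchase history: (reward number, product category).
purchases = [
    ("R100", "garden"), ("R100", "garden"), ("R100", "kitchen"),
    ("R200", "toys"), ("R200", "toys"), ("R200", "garden"),
]

def favorite_categories(purchases):
    """Map each reward number to the category that customer buys most often."""
    counts = defaultdict(Counter)
    for reward_number, category in purchases:
        counts[reward_number][category] += 1
    return {rn: c.most_common(1)[0][0] for rn, c in counts.items()}

print(favorite_categories(purchases))
```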

But, even with sophisticated techniques like data mining, and with massive transaction databases, we are still not in the world of Big Data. The examples I just gave are part of the traditional approach. The tables are organized in advance, data are captured and recorded neatly in the tables, and normal methods of analysis are used. This is not Big Data–this is just lots of data.

Big Data, unlike this traditional approach, does not need to use relational databases in its analyses. The data are not “collected” in the same way. Oftentimes, the data are being collected (or archived) without the intent of analyzing them later. Big Data does not need a predefined structure. Data do not have to be neatly organized in tables with rows and columns, as they are in relational databases.

Nearly everything we do in modern society leaves a digital footprint. Big Data allows us to use and analyze these data by applying specific techniques. Primarily, Big Data makes use of Hadoop for file storage and data retrieval at scale. Hadoop, an open source framework developed largely at Yahoo and based on research published by Google, is the primary Big Data tool. Hadoop uses a distributed file system that presents a single hierarchy of directories while splitting the raw data into chunks, usually 64 MB each, and saving them across multiple nodes. The data are not cleaned, organized, or transformed in any way, and no business rules are applied. Big Data, using Hadoop, allows users to query those data and gain meaningful insights. Facebook, as an example, uses Hadoop to store the massive data generated by its users every single second.
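To make the chunking idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: it only simulates cutting a file into 64 MB chunks and handing them out to nodes in turn. The node names and the round-robin rule are assumptions for illustration; Hadoop's actual placement logic is more involved and also keeps multiple copies of each chunk.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def assign_round_robin(blocks, nodes):
    """Hand blocks out to nodes in turn (a simplification, not Hadoop's policy)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

# A hypothetical 200 MB file spread over three made-up nodes.
file_size = 200 * 1024 * 1024
blocks = split_into_blocks(file_size)
placement = assign_round_robin(blocks, ["node-a", "node-b", "node-c"])

for node, node_blocks in placement.items():
    print(node, [length // (1024 * 1024) for _, length in node_blocks], "MB")
```

The file's contents are never inspected or reshaped along the way; the bytes are simply cut up and stored.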

Practitioners of Big Data believe in the “sushi principle”; that is, data should be raw, fresh, and ready to consume. Don’t cook the data! Keep it in its raw form. 

Because Hadoop is open source, and runs on commodity hardware rather than specialized hardware, storing data with it is much cheaper and simpler than with traditional methods. The difficulty arises later, in querying and analyzing the data.

Before, specialists were required to build the data sets, create the schema, and capture the data in a consistent way. Big Data removes the need for these skills at the front end, since Hadoop standardizes the approach to storage.

Big Data requires expertise and creativity at the querying end. Querying can be complicated, since the data are being retrieved from multiple sources, which are not organized in a standardized way. SQL is becoming the standard query language in Big Data, as it has long been in traditional relational databases.
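Here is a small sketch of why the querying end takes more work. It is plain Python rather than Hadoop or SQL, and the records and their two formats are invented for the example. The point is that the structure has to be worked out when the data are read, not when they are stored.

```python
import json

# Hypothetical raw records from different sources, stored exactly as they arrived.
raw_records = [
    '{"user": "ana", "event": "song_play", "title": "Blue"}',
    'ben|photo_upload|IMG_0042.jpg',
    '{"user": "ana", "event": "photo_upload", "file": "IMG_0007.jpg"}',
    'not a parseable line',
]

def parse(line):
    """Return (user, event) if the line fits a known format, else None."""
    try:
        record = json.loads(line)
        return record["user"], record["event"]
    except (json.JSONDecodeError, KeyError, TypeError):
        pass  # not JSON (or missing fields); try the pipe-delimited format
    parts = line.split("|")
    if len(parts) == 3:
        return parts[0], parts[1]
    return None

def count_events(lines, event):
    """Count records of one event type, skipping lines that cannot be parsed."""
    parsed = (parse(line) for line in lines)
    return sum(1 for p in parsed if p is not None and p[1] == event)

print(count_events(raw_records, "photo_upload"))
```

Every new source format means another branch in the parsing step, which is where the expertise and creativity come in.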

Because Hadoop is so new, it has been said that the only people with 10 years’ experience in it are the ones who developed it in the first place.

This provides a huge opportunity for data scientists in the future, and Big Data will surely create a huge demand for analysts who can work within the architecture.

Magdi Stino

Health Policy PhD Candidate



© 2011 University of the Sciences in Philadelphia • 600 South 43rd Street • Philadelphia, PA 19104 • 215.596.8800