I've got one of those careers dotted & dashed with Big Data and Advanced Analytics before the terms were even coined. Beginning in 1995, a small team of geeks took on the task of creating one of the first commercial data warehouses (not using NCR) to store more than a Terabyte of data. At the time, the concept of the Data Warehouse was in its infancy (I've got a proof copy of Bill Inmon's first book that I was asked to review) & Ralph Kimball hadn't yet published his work on dimensional table design. To be quite honest, we were forming the science of Big Data Analytics before we had even loaded a single byte of data. The challenge of the MCI projects S.A.M.S & D.E.S.I. (later coined warehouseMCI) was not just scaling storage capacity to hold more than a Terabyte of data, but making it useful - and that was a struggle. After months of benchmarking (using a draft version of the TPC-D) we concluded that data scalability was a matter of partitioning data across a series of loosely coupled computers (Shared Nothing Architecture). We ended up selecting the IBM SP2 MPP as the iron, and the front-end database was Informix Release 8 (I can still hear the groans from the developers, as most had expected Oracle to win this battle outright). The contest between databases (IBM DB2 UDB & Informix 8) was heated; both products demonstrated that they could load & query at a Terabyte scale where Oracle had failed miserably. At MCI, Informix won out (a decision made possible because IBM had dropped the ball), but now we had this scalable database that was highly dependent upon the query optimizer to create a parallel process capable of running over 104 different nodes and partitioned data. One of the first challenges was to recognize that the environment wasn't ready for multiple users, hence the requirement to "batch" all queries and restrict user access.
The second challenge was that in a fully partitioned database, joining data sent copies of data between ALL nodes for nearly every query. Lord help you if the fact table join was unbalanced or required massive lookups (huge is relative, as personal data on every potential customer was a factless table in the billions of rows), and given we were running on a 32-bit OS, the amount of real memory on each node was limited. The enemy wasn't the users, but rather the lack of horsepower in a single node, its single-core CPU and limited amount of RAM. To get SQL to scale for multiple users there had to be MORE CPU & MORE MEMORY - and that wasn't going to happen on an MPP at the time. Thus, we had users batch up queries (we termed this the RAD interface - named after Rad, the developer who worked with the users to gather their queries manually). Users were subsequently redirected to Data Marts running on Sun SMP platforms that didn't have to contend with node partitioning (they had 32 CPUs and more memory than a single node on the MPP). Think about it: here was this "massive" data warehouse, containing a billion rows of marketing data, detailed transactional information, and a healthy dose of consumer data which drove the essence of marketing machines like "Friends & Family" - but not really user friendly. Kind of ironic, actually, when I look back: selecting a platform that limited the users but opened up the valve on massive data storage - Yin & Yang.
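For readers who never lived through shared-nothing MPP, the join problem above can be sketched in a few lines. This is a toy illustration of my own (the node count, table names, and functions are all hypothetical, not Informix's or DB2's actual machinery): when two tables aren't co-partitioned on the join key, both sides must be redistributed - shuffled - across the interconnect before any node can join its local slice.

```python
from collections import defaultdict

NODES = 4  # toy stand-in for the 104-node MPP


def partition(rows, key_index):
    """Hash-partition rows across nodes by the join key (shared-nothing layout)."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key_index]) % NODES].append(row)
    return parts


def shuffle_join(fact, dim):
    """Redistribute both tables on the join key, then join locally per node.

    The two partition() calls are the expensive part in a real MPP: every
    node potentially sends rows to every other node over the network.
    """
    fact_parts = partition(fact, 0)
    dim_parts = partition(dim, 0)
    joined = []
    for node in range(NODES):  # each node joins only its local slice
        lookup = {row[0]: row for row in dim_parts[node]}
        for row in fact_parts[node]:
            if row[0] in lookup:
                joined.append(row + lookup[row[0]][1:])
    return joined


fact = [(1, "call-1"), (2, "call-2"), (3, "call-3")]
dim = [(1, "Alice"), (2, "Bob")]
print(shuffle_join(fact, dim))
```

A skewed key (one customer with millions of fact rows) lands entirely on one node's limited 32-bit memory, which is exactly the unbalanced-join pain described above.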
Jump forward a few years (1998), and I landed at Citicorp, running more Terabyte benchmarks & building yet another massive data warehouse - this time a 156-node IBM MPP front-ended by DB2 UDB. The developers didn't groan as much, but the solution to load data into this monster was a custom-built ETL behemoth made possible by rocket scientists from Boeing and Applied Parallel Technology (later renamed Torrent, swallowed up by Informix, then Ascential, and finally by IBM - we now know this product as Information Server). Citibank had more data than you could shake a stick at (I won several Winter Corporation awards in 1999 for building the biggest open-systems database, most rows, etc.), as the DW was composed of all of Citibank's global customers, including every credit card transaction over the previous eight years. The focus again was directed at marketing, but we had additional internal users, like credit card fraud, private banking, and personal banking. But once again, entry into the DW was "guarded": queries were batched up and queued, and direct access was still limited to but a few individuals worthy enough to gain "access" to the pearly gates. We had SAS on twelve nodes connected directly to the same MPP to support advanced analytics, but data was extracted through ETL processes and redistributed to those nodes. In late 1999, the single-processor node was going away; soon we had these "new" SMP nodes with multiple processors (dual and quad). Perhaps there was hope for multi-user queries, but a re-org from the merger of Citi and Travelers left everyone in a jolt - orders were to reconstruct the data warehouse, but this time using Oracle. Time to abandon ship, and the technical masses fled.
Let's kick the stone a few more times, building a few more data warehouses at the US Treasury (SQL Server) and Corporate Express (Oracle), and then I land at a "data company", Acxiom (2005). Now, what was unique about Acxiom wasn't that they were "cheap" (they were), but that they went out and took advantage of a few acres of re-purposed Dell servers from Walmart in what was their approach to HUGE data. Acxiom built a one-off system called "the Hive", which was essentially a Grid Computing system composed of a few thousand servers with attached storage, loosely coupled by a network and directed from a managing set of nodes termed the Apiary. Programs were once again batched & submitted to run in parallel, but this time not against a database - this time we had a flat file system (they were too cheap to license a database across an acre of computers - could you blame them?). If you asked me, it was kind of insane: using flat file systems, not allowing users direct access to the data, and requiring them to wait for extractions to hit a "database node". Little did I realize this was the first Big Data system, less the Hadoop elephant but complete with a file management system, data partitioning & file indexing.
My next jump was to IBM (circa 2006), where I went from ETL architect to selling Lab Services, and it was there that I was approached by one of the technical directors to create a billable project for something called "Hadoop"; my immediate response was "Gesundheit!" When I asked what Hadoop was, they quietly mentioned that it was "the future" of very large data systems, and to keep it under my hat. Blink-blink, deja vu? Could something be changing in the wind?
Now, years later, as I reflect upon all my "Big Data" efforts, there were several key points. First, to get things to scale you had to reach out to more than one computer - you had to have thousands of CPUs acting independently but in some coordinated fashion, bringing back results to users (not just one user). And when you start gathering up thousands of nodes to act as a data storage component, you have to wonder, "why pay the cost for a big database?" Maybe Acxiom wasn't all that crazy cheap after all. But the one inconsistency in all the prior attempts was preventing users direct access to massive amounts of data without the intervention of IT. But the times they are a-changin'...
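That scale-out pattern - many independent workers over partitioned data, with results gathered back for the user - is essentially what the Hadoop world later formalized as map-reduce. A minimal sketch of the idea (a word count over "node-local" chunks; the function names and data are mine, not any product's API):

```python
from collections import Counter
from functools import reduce


def map_on_node(chunk):
    """Each 'node' counts words in its own local slice, fully independently."""
    return Counter(word for line in chunk for word in line.split())


def gather(partials):
    """Reduce step: merge the per-node partial results into one answer."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Data is partitioned across nodes; no node ever sees the whole set.
node_chunks = [
    ["big data big", "analytics"],
    ["big analytics analytics"],
]

# In a real cluster the map step runs in parallel, one task per node.
partials = [map_on_node(chunk) for chunk in node_chunks]
print(gather(partials))  # Counter({'big': 3, 'analytics': 3, 'data': 1})
```

The point of the sketch: the map step needs no coordination at all, so it scales by simply adding nodes - exactly the property the MPPs and the Hive were chasing, minus the big-database license.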
Hence this blog. I'll be using it as my record of learning about Big Data Analytics - yes, I've mushed the term Big Data together with Analytics, because one doesn't exist without the other. There is a wealth of information hidden in large amounts of data, and restricting user access is like turning out the lights in a library and telling everyone to read a book in the dark. Wait, that's old school; how about requiring people to pedal a bike to generate electricity to access the internet? Better example: we only allow one person per town access to the internet, and the rest of us have to either wait our turn or read a book.
Honestly, Big Data is nothing new - the term may have been coined a few years ago, and it truly seems to be laced with confusion. You've got CIOs of large companies all declaring that this "Hadoop thing is ridiculous", but then you've got folks like Google, Facebook, and Yahoo that have built their industry upon it. I'm not here to debate the fact that 80% of the world's data is unstructured (or semi-structured), nor am I going to rant about how data is increasing on an exponential scale, where 90% of the data in the world today was created in the last two years. The internet of things is vast: it's billions of users, every cellphone and digital device; it's multimedia exploding beyond the television; it's nearly anything powered by electricity or connected to transportation (planes, trains, ships, trucks, buses, etc.); it's medical systems so complex that MRI scans might some day drive robotic surgical devices, and yet so simple (pill monitors to ensure your granddad is taking his prescription) - and they are all beginning to generate data on a massive scale. When we talk about variety of data, we need to include elements that just can't be queried by SQL - show me the query that works on PDF, XML, JSON, MPEG, Video, and JPEG, and I'll eat my hat. Get the picture - Big Data is real.
IBM (my favorite, let's pick on them first) says that Big Data is composed of Volume, Velocity, Variety and Veracity. The first three are a given; the fourth V comes from years of getting beat up over structured data issues (duplicate values, missing data, and just the unimaginable unstructured content). So, IBM marketeers added Veracity - pointing out the uncertainty of big data. I like to think of it as a "Google Query" where you get a result in the form of a percentage of confidence. When your confidence level exceeds your fear factor that whatever you're reading is wrong, you've got the right answer - or so says Google.
Did IBM get the four V's right, or were they just trying to frighten people into buying DataStage, Streams, and MDM Server? I'll give them a Google confidence of less than 28%.
I think what IBM missed in their assumption is that it's not the Veracity of data, but rather the Voracity of the consumers of data. Voracity, as in being voracious: having a huge appetite and acting excessively eager when it comes time to feed. Voracity is what is truly driving Big Data Analytics. Businesses are refocusing their efforts not to engineer new solutions, but to use data-driven engineering solutions to reshape their enterprise. GE is using data to drive its engineering processes to improve its engines (jet, train, power - they all have data collection elements sending back terabytes of data). Pharmaceutical companies are using Real World Experience (or Data) to expand clinical data from just a handful of participants to a world that uses drugs to combat illnesses & disease. The goal for Pharma is to identify patterns, drug therapies, and combinations; reduce risk; improve patient outcomes; and yes, shorten the long, drawn-out process of clinical trials & drug manufacturing. Think about it: it's a global population of data, so why base drug manufacturing decisions on just a few hundred participants in a controlled sample? For data scientists, the more data the better (period). It's the voracity of business: every person totes around at least one smart device, and we're attempting to understand their buying habits - can we influence the sale by sending out notifications whenever we detect a customer is nearby?
How's this for an SQL challenge: watch streaming video to identify whether the patient who has been admitted for acute or severe sepsis (a life-threatening condition from infection) is improving or degrading. Sepsis or septic shock results in tissue and organ damage in less than twenty minutes, and has a death rate of 40-60% (higher for some forms of infection). The clock is ticking - tissue is dying; is the patient improving? Here's a fun question that I'm trying to look at: could the Seattle Seahawks have made a better play selection on the last play of Super Bowl XLIX (49)? Depends on whether you're a Seahawks fan or not, but the choice to throw was probably ill-advised at best. Could they have used video-based analytics to make a better call? Sure, all the coaches on both teams were well equipped with tablet devices to review plays - but where's that query? That natural language question: what play should I run next?
We live in a digital world where everything is connected; people, applications, and corporations have a voracious appetite for data & knowledge of predictable outcomes. This is why Big Data and Analytics go hand in hand - and it's going to consume all things data. The winners will be the systems that allow users direct access to petabytes of knowledge, not from one database but from a cloud that spans the globe.