If you want to keep your big-data jigsaw in order, the experts advise integrating your applications. But when it comes to seeing the big picture it's all about value-driven analytics.
Big Data comes in many forms. It comes as customer information and transactions contained in customer-relationship management and enterprise resource-planning systems and HTML-based web stores. It comes as information generated by machine-to-machine applications collecting data from smart meters, manufacturing sensors, equipment logs, trading systems data and call detail records compiled by fixed and mobile telecommunications companies.
Big data can come with big differences. Some say that the 'three Vs' of big data should more properly be tagged as the 'four HVs': high-volume, high-variety, high-velocity and high-veracity. Apply those tags to the mountains of information posted on social network and blogging sites, including Facebook, Twitter and YouTube; the deluge of text contained in email and instant messages; not to mention audio and video files. It is evident, then, that it is not necessarily the 'big-ness' of information that presents big-data applications and services with their greatest challenge, but the variety of that information and the speed at which it must be ingested, processed, aggregated, filtered, organised and fed back in a meaningful way for businesses to get some value out of it.
Architecture and performance
That complexity is exacerbated by the broad range of system elements required to accommodate that information and its associated data flows. Though the bigger software companies – notably IBM, Oracle, EMC, and others – are busy trying to change things, big data tends to rely on a hotchpotch of multiple software applications, hardware and services rather than single, unified platforms.
A rough and simplified view of a common big-data model runs as follows: information is acquired or ingested from many different sources, both relational (structured) and non-relational (unstructured); it is then passed to distributed file systems and processing engines – based on Apache's Hadoop software framework, say – for integration and organisation; it is aggregated in data warehouses or other storage repositories; finally it is pushed out to analytics or business intelligence (BI) applications. These in turn present the results of specific queries in some sort of graphical report format that is easily understood by end-users. The value added by each of these component stages is being scrutinised, however, by so-called Big Data 2.0 analysts who have started arguing that concepts like 'data warehouses' and 'business intelligence' are losing currency. They may have a point; they may be grinding axes.
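The four stages described above can be sketched as a toy pipeline in a few lines of Python (an illustration only, not real Hadoop; the record fields and stage names are invented for the example):

```python
# Toy end-to-end big-data pipeline: ingest -> process -> aggregate -> report.
from collections import defaultdict

def ingest(sources):
    """Stage 1: pull raw records from heterogeneous sources into one stream."""
    for source in sources:
        yield from source

def process(records):
    """Stage 2: map each record into (key, value) pairs, MapReduce-style."""
    for record in records:
        yield record["region"], record["sales"]

def aggregate(pairs):
    """Stage 3: roll the pairs up into a 'warehouse' summary table."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def report(summary):
    """Stage 4: present the results in a format end-users can read."""
    return [f"{region}: {total:.2f}" for region, total in sorted(summary.items())]

# Two invented sources: a CRM extract and a web-store feed.
crm = [{"region": "EMEA", "sales": 120.0}, {"region": "APAC", "sales": 75.0}]
web = [{"region": "EMEA", "sales": 30.0}]
print(report(aggregate(process(ingest([crm, web])))))
# -> ['APAC: 75.00', 'EMEA: 150.00']
```

The point of the sketch is the shape, not the scale: each stage consumes the previous stage's output, which is exactly where the integration and performance problems discussed below arise.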
In fact, "You need a multi-fold approach," says Clive Longbottom, co-founder and service director at analyst Quocirca. "There will be those applications that are dependent on a database, and so that's fine, leave them as are."
Then there are those that are totally unstructured (actually very few, as even Microsoft Office documents are formatted within XML schemas, Longbottom points out). "Then there are all the ones in between, which is where Hadoop comes in: filtering and using MapReduce so that it can be decided just where the resulting data should reside – in a SQL database or in a schema-less, NoSQL equivalent. Once all this has been done, then big-data analytics can be brought to bear." To put it another way, big-data methodologies are not an end in themselves, but rather the starting point of a longer, even more resource-intensive process.
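The routing decision Longbottom describes – a map/reduce pass that sorts records into a SQL database or a schema-less NoSQL equivalent – might look like this in a simplified sketch (the schema and record contents are invented for illustration):

```python
# Sketch: a map/reduce pass that decides where each record should reside --
# a SQL store for records matching a fixed schema, a NoSQL store otherwise.
EXPECTED_SCHEMA = {"id", "name", "amount"}   # invented schema for illustration

def map_phase(records):
    """Tag each record with its destination store."""
    for record in records:
        target = "sql" if set(record) == EXPECTED_SCHEMA else "nosql"
        yield target, record

def reduce_phase(pairs):
    """Group the tagged records by destination."""
    stores = {"sql": [], "nosql": []}
    for target, record in pairs:
        stores[target].append(record)
    return stores

records = [
    {"id": 1, "name": "Acme", "amount": 9.5},       # fits the schema -> SQL
    {"id": 2, "text": "free-form log entry ..."},   # schema-less -> NoSQL
]
stores = reduce_phase(map_phase(records))
```

Only once records have landed in the appropriate store can the analytics layer be brought to bear, which is Longbottom's point about big-data methodologies being a starting point rather than an end.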
Data-delivery chain challenges
With so many different elements in the big-data workflow, developers face serious challenges in adapting legacy applications, or building new software, to support it. Presenting, for example, a simple query model able to address both structured and unstructured data, and both relational and non-relational databases, can be tricky in itself.
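A minimal sketch of such a unified query model – a single facade fanning one question out to a relational store and a schema-less one (all names and data invented for the example):

```python
# One query facade over two back-ends: a SQL table and a list of documents.
import sqlite3

# Structured back-end.
sql = sqlite3.connect(":memory:")
sql.execute("CREATE TABLE customers (name TEXT, country TEXT)")
sql.execute("INSERT INTO customers VALUES ('Acme', 'UK')")
sql.commit()

# Schema-less back-end: documents with no fixed shape.
documents = [{"name": "Beta", "country": "UK", "notes": "free-form text"}]

def find_by_country(country):
    """Single query model spanning structured and unstructured data."""
    hits = [row[0] for row in sql.execute(
        "SELECT name FROM customers WHERE country = ?", (country,))]
    hits += [d["name"] for d in documents if d.get("country") == country]
    return hits

print(find_by_country("UK"))   # -> ['Acme', 'Beta']
```

Even in this toy form, the facade has to reconcile two query idioms (SQL versus predicate-over-documents); at production scale that reconciliation is precisely what makes the problem tricky.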
Big-data platforms also have to cater for batch-processing jobs, which can vary hugely in size and duration, ranging from a few minutes to days at a time – a situation which can greatly complicate workflow, especially where automated information-feeds between different applications are concerned.
On top of that, developers are also under pressure to minimise data trafficking. Because the processing of large data-sets is often performed in a different location to where the information itself resides, it first has to be transmitted over a network connection, meaning bandwidth fluctuation can affect both application availability and performance.
This is especially true in environments where privacy and data-sovereignty concerns demand that sensitive data does not leave on-premise servers, and where many users at multiple sites collaborate to submit, process or analyse large data-sets or reports. And where information cannot be readily fed into the various different big-data system elements because the sheer volume of information involved swamps one or more of them, further bottlenecks on the ingress or processing side can bring the whole flow to a halt.
Most organisations will probably not want to spend more ICT budget on installing supplementary cabling just for the periodical big-data 'shunts' – building a sort of 'big-data bypass', as it were – especially at a time when strategists are claiming that the compelling appeal of IP-based networking is that you can constrain costs by running multiple traffic-stream types over the same hardwired infrastructure.
"Certainly, [big-data movement] performance will be one of the problems," agrees Quocirca's Clive Longbottom. "If what is required is real-time analytics, then the architectural model has too many steps in it. In this case, multi-analytics may be required, with the SQL and NoSQL worlds analysing as fast as they can, and a third level analysing those results." Does it sound like it's getting really complicated?
Integration and connectors 'key to success'
As such, big-data software vendors have worked hard to integrate those different elements to minimise the potential pain, building connectors that are able to transfer information between the various databases, processing engines, data warehouses and analytics and reporting applications. In some cases, the software vendors have produced hundreds of different components for various bits of third-party software, not only analytics and business-intelligence tools, but also systems and cloud-management suites, customer-relationship management applications, and other information repositories.
Oracle, for example, has built a separate product, Big Data Connectors, as part of its broader big-data portfolio, specifically designed to format and transfer data between Hadoop MapReduce and its 11g database. Dell Boomi, IBM, Informatica and Microsoft have all released similar connectors that link the Hadoop Distributed File System (HDFS) to their database, data-warehouse and systems-management platforms. Expect others to attempt to mount the bandwagon at some point soon.
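In essence, such a connector reads the key/value output of a MapReduce job and bulk-loads it into a relational table. A generic sketch of that job, using Python's built-in SQLite rather than any vendor's actual API (the table, column names and sample data are invented; real products expose far richer options):

```python
# Generic sketch of what a Hadoop-to-database connector does: parse the
# tab-separated output of a MapReduce job and bulk-load it into a table.
import sqlite3

# Stand-in for an HDFS 'part' file produced by a MapReduce job.
mapreduce_output = "EMEA\t150.0\nAPAC\t75.0\n"

rows = [line.split("\t") for line in mapreduce_output.strip().splitlines()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_summary (region TEXT, total REAL)")
conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)
conn.commit()

for region, total in conn.execute(
        "SELECT region, total FROM sales_summary ORDER BY region"):
    print(region, total)
```

The value the commercial connectors add on top of this pattern is in handling type mapping, parallel loads and failure recovery at scale.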
Hardware integration as an antidote
Another approach to addressing potential performance problems has seen the widespread emergence of dedicated appliances customised for specific workloads that combine all of the necessary software with dedicated hardware to ensure the data does not have to travel far. The components can be pre-certified by the manufacturer as being fully-integrated and interoperable.
Examples include Oracle's Big Data Appliance, EMC's Greenplum range and, more recently, IBM's PureData Systems. In these three cases the strategy arguably has as much to do with hardware vendors (Oracle spent $7.4bn acquiring Sun Microsystems in 2010) finding new ways to maximise revenue from existing server and storage business lines as it does with big-data innovation, with some potential buyers understandably cagey about the considerable investment required, plus the prospect of reprised vendor lock-in.
Software companies such as Teradata have also seen the advantage of customised, pre-integrated hardware/software solutions, partly due to performance concerns, but also because very few companies are adjudged to have the requisite in-house big-data expertise (analysts have pointed to a dearth of big-data 'scientists' all round) needed to get the best out of the technology. The company recently partnered with unnamed hardware manufacturers to produce a 'unified' big-data appliance combining Hortonworks' version of the Hadoop MapReduce data-set processing engine, the Aster database along with an SQL query language able to query it, and 50 pre-built analytical applets, all on a specialised server and storage engine.
"The value is that you get everything in a single package – the back-end integration stuff, and the hardware and software to enable system monitoring," says Teradata Labs president, Scott Gnau. "It is easy to plug in and get going, to expose big data and big-data analytics to knowledge workers. The idea is to get big-data analytics 'for the masses'." And that's going to take some doing, as mass-access will (with current technology) have to be via the Web, which, outside of the dark arts of specialist algorithms from the likes of Google and Amazon, is assuredly not optimised to handle big-data processing.
While the customised, pre-configured appliance is aimed at on-premise deployments, other companies are entrusting big data to the cloud. Microsoft has added Hadoop support to its hosted Windows Azure platform-as-a-service (PaaS) software development suite; and Web giants Google and Amazon have again trumpeted the use of their own proprietary technology to deliver 'analytics in the cloud' services – which allow companies to move their own data into hosted big-data platforms that do the mega number-crunching for them (at an agreed price, of course – savings-hungry users will soon find that in-cloud service charges can scale just as effectively as anything else).
Collaboration and data visualisation between sectors
Collaboration capabilities can play a significant part in big-data initiatives, especially in public-sector and joint public/private-sector projects. The United States government has agreed funding of over $200m a year to six agencies looking to organise and analyse massive digital data-sets. Perhaps concerningly, this includes the Defense Advanced Research Projects Agency (DARPA) which is looking to develop new methods of analysing text documents and message traffic for the purposes of surveillance.
"The sheer volume of information creates a background clutter; let me put this in some context," said DARPA acting director Kaigham J Gabriel at the launch of the XDATA initiative in March 2012. "The Atlantic Ocean is roughly 350 million cubic kilometres in volume, or nearly 100 billion billion gallons of water. If each gallon of water represented a byte or character, the Atlantic Ocean would be able to store, just barely, all the data generated by the world in 2010. Looking for a specific message or page in a document would be the equivalent of searching the Atlantic Ocean for a single 55-gallon drum."
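Gabriel's arithmetic holds up to a quick back-of-envelope check (the constants below are approximate):

```python
# Back-of-envelope check of the Atlantic Ocean analogy.
atlantic_km3 = 350e6                # ~350 million cubic kilometres
litres_per_km3 = 1e12               # 1 km^3 = 10^12 litres
litres_per_gallon = 3.785           # US gallon

gallons = atlantic_km3 * litres_per_km3 / litres_per_gallon
print(f"{gallons:.2e}")             # ~9.2e19, i.e. nearly 100 billion billion
```

At one byte per gallon that is roughly 9 x 10^19 bytes, or about 90 exabytes – the same order of magnitude as estimates of the data generated worldwide in 2010.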
The XDATA programme aims to develop computational techniques and software tools for analysing semi-structured (tabular, relational, categorical, metadata) and unstructured (text documents, message traffic) data with a special emphasis on visualising the results. Full details are yet to emerge, but it seems likely that as in the past the government agencies will work closely with established hardware and software vendors to meet their unique requirements.
This will include identifying and providing tools that enable engineers, mathematicians, analysts and academics to collectively share, work on and evaluate the same data sets, and is expected to see greater use of applications armed with advanced social collaboration, search and visualisation tools currently emerging from innovative start-up software firms such as Tableau, DataDog, Qliktech and Edgespring. Visualisation of data-set output in analytical reports is a particularly crucial factor in many big-data initiatives, according to IBM, and is one aspect of the software to which the company is paying close attention.
IBM big-data evangelist James G Kobielus says: "IBM has many products on the database side as well as analytics applications and development tools, and it has made it a strategy to cross them together into an integrated platform based on commonalities – analytical and transactional, common storage, CPU usage, pattern analysis and data warehousing, for example... But the challenge for IBM is visual tooling; much of the software development on big-data platforms is based on visual modelling tools that generate Java code, perform queries, pull the data, do the calculations, and so on."
Despite the investment that some of the big IT companies are putting into developing and expanding big-data applications and services, there is of course no guarantee that end-user demand or interest will match their enthusiasm – or that clients and potential clients will not choose to implement their own big-data solutions rather than buy expensive hardware and software platforms deliberately customised to make the job easier. That said, many big-data advocates are putting forward convincing arguments that a big-data strategy will become an integral aspect of IT management as allied to revenue generation within two-to-three years.
There is some shoehorning of existing approaches – but vendors such as Recommind, CommVault and EMC (with its Greenplum acquisition) are taking a more inclusive view. Depending on the types and volume of data involved, the availability of requisite in-house skills, and how companies wish to see results presented, off-the-shelf applications can in some cases be combined to do the same job. One approach is to link SQL or NoSQL databases using master data management or modelling, backed up by master data records, then to bring in unstructured data on top using Hadoop; cross-function search and report tools such as HP's Autonomy and CommVault's Simpana, along with standard analytics packages able to provide graphical references and mixed data reports, can then suffice.
Tony Speakman, vice president of sales at database company Filemaker, points out that the average firm does not really need to look at implementing back-end big-data initiatives unless it is processing over a terabyte of information per 24-hour period, and that IT personnel should be wise to some of the marketing messages being played out in the market; they should also be careful about retrofitting a modern big-data approach to legacy information repositories and data sets that in some cases would not translate well into Hadoop or similar platforms anyway.
"Lots of people want to jump on the bandwagon here and repackage something that they have had for a long time," Filemaker's Speakman adds. "In some cases that would be appropriate; in others it is a cynical use of marketing terminology. The ability to modify those sorts of legacy tools is probably more trouble than it is worth, and adding new types of data with a new structure to old legacy data sets can be tricky... If you have a legacy database that has been in place for 15 years, [are you now] going to rewrite it? That would probably take you two years, by which time the requirement has changed anyway" – but then, that's an endgame any IT practitioner will be familiar with.