Hadoop, as a pivotal piece of the data mining renaissance, offers the ability to tackle large data sets in ways that weren't previously feasible due to time and cost constraints. But Hadoop can't do everything quite yet, especially when it comes to real-time workflows. Fortunately, a couple of innovative efforts within the Hadoop ecosystem, such as Hypertable and HBase, are filling the gaps, while at the same time providing a glimpse of where Hadoop's full capabilities might be headed.
Currently, Hadoop is used by companies to analyze big data, typically with records and counts in the billions. These staggering numbers often come from Internet applications at scale and the log data they generate. But going through the log data happens in batch jobs outside the standard workflow of the website, not upon every search. As an example, a search site might track each search query and the terms entered. If it then wants to examine the most popular requests to provide an "assist" or "suggest" capability, it can use Hadoop to crunch through a year or two of prior searches, producing a list of popular combinations and their frequencies. This multistep process takes time, however, and cannot execute upon every user search.
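The offline batch step described above boils down to a MapReduce-style frequency count. Here is a minimal local sketch of that idea in Python; the function names, normalization logic, and sample log are illustrative assumptions, not code from any actual search site:

```python
from collections import Counter

def map_queries(log_lines):
    """Map phase: emit a (normalized_query, 1) pair per log line."""
    for line in log_lines:
        query = line.strip().lower()  # crude normalization, for illustration
        if query:
            yield (query, 1)

def reduce_counts(pairs):
    """Reduce phase: sum the counts for each distinct query."""
    counts = Counter()
    for query, n in pairs:
        counts[query] += n
    return counts

def top_queries(log_lines, k):
    """End-to-end: return the k most frequent queries with counts."""
    return reduce_counts(map_queries(log_lines)).most_common(k)

# Hypothetical sample of a search log
sample_log = [
    "hadoop tutorial",
    "hbase",
    "Hadoop Tutorial",
    "hypertable",
    "hadoop tutorial",
]
print(top_queries(sample_log, 2))
```

In a real deployment these two phases would run across the cluster (for instance, via Hadoop Streaming) over a year or two of logs, and the resulting ranked list would feed the suggest feature, which is exactly why the result is precomputed rather than recalculated on every user search.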
Continues @ http://gigaom.com