Monday, 28 March 2016

What Is Apache Hadoop?


The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
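
To make the MapReduce programming model concrete, below is a minimal word-count job written for Hadoop Streaming, which lets any executable act as the mapper and reducer. This is a sketch, not an official example: the file names are assumptions, and the streaming jar location varies by installation.

    #!/usr/bin/env python
    # mapper.py -- read lines from stdin, emit one "word<TAB>1" pair per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    #!/usr/bin/env python
    # reducer.py -- sum the counts per word; Hadoop delivers mapper output
    # sorted by key, so equal words arrive as one contiguous run.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Such a job would be submitted with the streaming jar, e.g. hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <input> -output <output> (the exact jar path depends on the installation).
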
Other Hadoop-related projects at Apache include:
  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features for diagnosing their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation (a short PySpark sketch follows this list).
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.
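
To illustrate the programming model Spark offers over Hadoop data, here is a minimal PySpark word count. It is a sketch under assumptions: the HDFS input path is invented, and the 2016-era SparkContext API is used.

    # wordcount.py -- count words in a text file with PySpark.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("hdfs:///data/input.txt")   # input path is an assumption
                .flatMap(lambda line: line.split())   # split lines into words
                .map(lambda word: (word, 1))          # pair each word with 1
                .reduceByKey(lambda a, b: a + b))     # sum the 1s per word
    for word, count in counts.take(10):               # peek at ten results
        print(word, count)
    sc.stop()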

Getting Started

To get started, begin here:
  1. Learn about Hadoop by reading the documentation.
  2. Download Hadoop from the release page.
  3. Discuss Hadoop on the mailing list.

Download Hadoop

Please head to the releases page to download a release of Apache Hadoop.

Who Uses Hadoop?

A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page.

News

11 February, 2016: Release 2.6.4 available

A point release for the 2.6 line.
Please see the Hadoop 2.6.4 Release Notes for the list of 46 critical bug fixes since the previous release, 2.6.3.

25 January, 2016: Release 2.7.2 (stable) available

A point release for the 2.7 line.
Please see the Hadoop 2.7.2 Release Notes for the list of 155 bug fixes and patches since the previous release 2.7.1.

17 December, 2015: Release 2.6.3 available

A point release for the 2.6 line.
Please see the Hadoop 2.6.3 Release Notes for the list of 35 critical bug fixes since the previous release, 2.6.2.

28 October, 2015: Release 2.6.2 available

A point release for the 2.6 line.
Please see the Hadoop 2.6.2 Release Notes for the list of 15 critical bug fixes since the previous release, 2.6.1.

23 September, 2015: Release 2.6.1 available

A point release for the 2.6 line.
Please see the Hadoop 2.6.1 Release Notes for the list of 158 critical bug fixes since the previous release, 2.6.0.

06 July, 2015: Release 2.7.1 (stable) available

A point release for the 2.7 line. This release is now considered stable.
Please see the Hadoop 2.7.1 Release Notes for the list of 131 bug fixes and patches since the previous release 2.7.0. Please look at the 2.7.0 section below for the list of enhancements enabled by this first stable release of 2.7.x.

21 April, 2015: Release 2.7.0 available

Apache Hadoop 2.7.0 contains a number of significant enhancements. A few of them are noted below.
  • IMPORTANT notes
    • This release drops support for JDK6 runtime and works with JDK 7+ only.
    • This release is not yet ready for production use. Critical issues are being ironed out via testing and downstream adoption. Production users should wait for a 2.7.1/2.7.2 release.
  • Hadoop Common
    • Support Windows Azure Storage - Blob as a file system in Hadoop.
  • Hadoop HDFS
    • Support for file truncate
    • Support for quotas per storage type
    • Support for files with variable-length blocks
  • Hadoop YARN
    • Make YARN authorization pluggable
    • Automatic shared, global caching of YARN localized resources (beta)
  • Hadoop MapReduce
    • Ability to limit running Map/Reduce tasks of a job
    • Speed up FileOutputCommitter for very large jobs with many output files.
Full information about this milestone release is available at Hadoop Releases.

18 November, 2014: Release 2.6.0 available

Apache Hadoop 2.6.0 contains a number of significant enhancements such as:
  • Hadoop Common
    • Key management server (beta)
    • Credential provider (beta)
  • Hadoop HDFS
    • Heterogeneous Storage Tiers - Phase 2
      • Application APIs for heterogeneous storage
      • SSD storage tier
      • Memory as a storage tier (beta)
    • Support for Archival Storage
    • Transparent data at rest encryption (beta)
    • Operating secure DataNode without requiring root access
    • Hot swap drive: support add/remove data node volumes without restarting data node (beta)
    • AES support for faster wire encryption
  • Hadoop YARN
    • Support for long running services in YARN
      • Service Registry for applications
    • Support for rolling upgrades
      • Work-preserving restarts of ResourceManager
      • Container-preserving restart of NodeManager
    • Support node labels during scheduling
    • Support for time-based resource reservations in Capacity Scheduler (beta)
    • Global, shared cache for application artifacts (beta)
    • Support running of applications natively in Docker containers (alpha)
Full information about this milestone release is available at Hadoop Releases.

19 November, 2014: Release 2.5.2 available

Full information about this milestone release is available at Hadoop Releases.

12 September, 2014: Release 2.5.1 available

Full information about this milestone release is available at Hadoop Releases.

11 August, 2014: Release 2.5.0 available

Full information about this milestone release is available at Hadoop Releases.

30 June, 2014: Release 2.4.1 available

Full information about this milestone release is available at Hadoop Releases.

27 June, 2014: Release 0.23.11 available

Full information about this milestone release is available at Hadoop Releases.

07 April, 2014: Release 2.4.0 available

Full information about this milestone release is available at Hadoop Releases.

20 February, 2014: Release 2.3.0 available

Full information about this milestone release is available at Hadoop Releases.

11 December, 2013: Release 0.23.10 available

Full information about this milestone release is available at Hadoop Releases.

15 October, 2013: Release 2.2.0 available

Apache Hadoop 2.x reaches GA milestone! Full information about this milestone release is available at Hadoop Releases.

25 August, 2013: Release 2.1.0-beta available

Apache Hadoop 2.x reaches beta milestone! Full information about this milestone release is available at Hadoop Releases.

27 December, 2011: Release 1.0.0 available

Hadoop reaches 1.0.0! Full information about this milestone release is available at Hadoop Releases.

March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards

Described by the judging panel as a "Swiss army knife of the 21st century", Apache Hadoop picked up the innovator of the year award for having the potential to change the face of media innovations.
See The Guardian web site.

Sunday, 20 March 2016

BIG DATA

Big Data History and Current Considerations

While the term “big data” is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:
Volume. Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
At SAS, we consider two additional dimensions when it comes to big data:
Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so with unstructured data.
Complexity. Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it’s necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control.



Why Is Big Data Important?

The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
  • Determining root causes of failures, issues and defects in near-real time.
  • Generating coupons at the point of sale based on the customer’s buying habits.
  • Recalculating entire risk portfolios in minutes.
  • Detecting fraudulent behavior before it affects your organization.


Who Uses Big Data?

Big data affects organizations across practically every industry. See how each industry can benefit from this onslaught of information.

Banking

With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to manage big data. While it’s important to understand customers and boost their satisfaction, it’s equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics.

Education

Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big data, they can identify at-risk students, make sure students are making adequate progress, and can implement a better system for evaluation and support of teachers and principals.

Government

When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime. But while there are many advantages to big data, governments must also address issues of transparency and privacy.

Health Care

Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly, accurately – and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed effectively, health care providers can uncover hidden insights that improve patient care.

Manufacturing

Armed with insight that big data can provide, manufacturers can boost quality and output while minimizing waste – processes that are key in today’s highly competitive market. More and more manufacturers are working in an analytics-based culture, which means they can solve problems faster and make more agile business decisions.

Retail

Customer relationship building is critical to the retail industry – and the best way to manage that is to manage big data. Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business. Big data remains at the heart of all those things.



How It Works


Before discovering how big data can work for your business, you should first understand where it comes from. The sources for big data generally fall into one of three categories:

Streaming data

This category includes data that reaches your IT systems from a web of connected devices. You can analyze this data as it arrives and make decisions on what data to keep, what not to keep and what requires further analysis.
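
As a toy illustration of that keep-or-discard decision, the sketch below routes readings as they arrive; the record shape and thresholds are invented for the example.

    # stream_filter.py -- decide per record whether to keep it, flag it for
    # deeper analysis, or drop it as it arrives (thresholds are illustrative).
    def route(readings, keep_at=0.5, alert_at=0.9):
        for reading in readings:
            value = reading["value"]
            if value >= alert_at:
                yield ("analyze_further", reading)  # unusual reading: flag it
            elif value >= keep_at:
                yield ("keep", reading)             # normal reading: store it
            # below keep_at: discard without storing

    stream = [{"sensor": "s1", "value": v} for v in (0.2, 0.7, 0.95)]
    for decision, reading in route(stream):
        print(decision, reading)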

Social media data

The data on social interactions is an increasingly attractive set of information, particularly for marketing, sales and support functions. It's often in unstructured or semistructured forms, so it poses a unique challenge when it comes to consumption and analysis.

Publicly available sources

Massive amounts of data are available through open data sources like the US government’s data.gov, the CIA World Factbook or the European Union Open Data Portal.


After identifying all the potential sources for data, consider the decisions you’ll need to make once you begin harnessing information. These include:

How to store and manage it

Whereas storage would have been a problem several years ago, there are now low-cost options for storing data if that’s the best strategy for your business.

How much of it to analyze

Some organizations don't exclude any data from their analyses, which is possible with today’s high-performance technologies such as grid computing or in-memory analytics. Another approach is to determine upfront which data is relevant before analyzing it.

How to use any insights you uncover

The more knowledge you have, the more confident you’ll be in making business decisions. It’s smart to have a strategy in place once you have an abundance of information at hand.


The final step in making big data work for your business is to research the technologies that help you make the most of big data and big data analytics. Consider:

  • Cheap, abundant storage.
  • Faster processors.
  • Affordable open source, distributed big data platforms, such as Hadoop.
  • Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
  • Cloud computing and other flexible resource allocation arrangements.


Wednesday, 16 March 2016

Data Warehouse vs. Data Mining

Data Mining: What is Data Mining?

Overview

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
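
As a small, hedged illustration of "finding correlations among fields", the sketch below computes pairwise correlations over a toy table; the column names and numbers are invented.

    # correlations.py -- pairwise correlations across the fields of a small
    # table, a toy stand-in for mining a much larger relational database.
    import pandas as pd

    sales = pd.DataFrame({
        "ad_spend": [10, 20, 30, 40, 50],
        "visits":   [120, 180, 260, 330, 410],
        "revenue":  [1.1, 2.3, 2.9, 4.2, 5.0],
    })
    print(sales.corr())  # Pearson correlation matrix between all field pairs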

Continuous Innovation

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Example

For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays.
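
The arithmetic behind such a finding is association analysis: count how often items co-occur in shopping baskets, then turn the counts into support and confidence. The sketch below does this for one candidate rule over invented transactions.

    # associations.py -- support and confidence for the rule "diapers => beer"
    # over a toy list of market baskets (data invented for the example).
    baskets = [
        {"diapers", "beer", "milk"},
        {"diapers", "beer"},
        {"diapers", "bread"},
        {"beer", "chips"},
    ]
    n = len(baskets)
    both = sum(1 for b in baskets if {"diapers", "beer"} <= b)
    diapers = sum(1 for b in baskets if "diapers" in b)

    support = both / n           # share of all baskets containing both items
    confidence = both / diapers  # share of diaper baskets that also have beer
    print("support=%.2f confidence=%.2f" % (support, confidence))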

Data, Information, and Knowledge

Data

Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
  • operational or transactional data, such as sales, cost, inventory, payroll, and accounting
  • nonoperational data, such as industry sales, forecast data, and macroeconomic data
  • metadata - data about the data itself, such as logical database design or data dictionary definitions

Information

The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.

Knowledge

Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data Warehouses

Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data.
With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.
For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.
WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.
The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.
By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
  • Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
  • Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities (see the clustering sketch after this list).
  • Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
  • Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
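
To make the clustering case concrete, here is a minimal k-means example with scikit-learn that groups toy customers into market segments; the features, numbers, and choice of three clusters are all assumptions.

    # segments.py -- group customers into market segments with k-means,
    # illustrating the "clusters" relationship (toy data, invented features).
    import numpy as np
    from sklearn.cluster import KMeans

    # columns: [annual spend, visits per month] for eight toy customers
    customers = np.array([
        [200, 2], [220, 3], [210, 2],   # low spend, infrequent
        [800, 8], [850, 9], [790, 7],   # high spend, frequent
        [500, 5], [520, 4],             # in between
    ])
    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(model.labels_)           # cluster id assigned to each customer
    print(model.cluster_centers_)  # the profile of each segment
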
Data mining consists of five major elements (a minimal end-to-end sketch follows the list):
  • Extract, transform, and load transaction data onto the data warehouse system.
  • Store and manage the data in a multidimensional database system.
  • Provide data access to business analysts and information technology professionals.
  • Analyze the data by application software.
  • Present the data in a useful format, such as a graph or table.
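
A toy pass through those five elements might look like the following; the schema and figures are invented, and a real system would load into a warehouse rather than an in-memory table.

    # pipeline.py -- a miniature run through the five elements: extract/
    # transform/load, store, provide access, analyze, and present.
    import pandas as pd

    raw = pd.DataFrame({                       # 1. extract transaction data
        "store":  ["A", "A", "B", "B"],
        "item":   ["beer", "diapers", "beer", "milk"],
        "amount": [5.0, 12.0, 6.5, 3.0],
    })
    warehouse = raw.copy()                     # 2. "load" it into the store
    by_store = (warehouse                      # 3. access + 4. analyze
                .groupby("store")["amount"]
                .sum())
    print(by_store.to_string())                # 5. present as a simple table
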
Different levels of analysis are available:
  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique (a short sketch follows this list).
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
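
As an example of the nearest neighbor method, the sketch below classifies new records by the classes of the k = 3 most similar historical records, using scikit-learn; the features and labels are invented.

    # knn_demo.py -- nearest neighbor classification: each new record takes
    # the majority class of its k most similar historical records.
    from sklearn.neighbors import KNeighborsClassifier

    # historical records: [age, average purchase] -> responded to offer (0/1)
    X = [[25, 20], [30, 35], [45, 80], [50, 90], [23, 15], [48, 85]]
    y = [0, 0, 1, 1, 0, 1]

    model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(model.predict([[28, 30], [47, 75]]))  # expect [0 1]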

What technological infrastructure is required?

Today, data mining applications are available on systems of all sizes, for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological drivers:
  • Size of the database: the more data being processed and maintained, the more powerful the system required.
  • Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.
Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.