Big Data and Analytics By @MElRefaey | @BigDataExpo #BigData

This post is the first in a series that will explore Big Data and analytics tools. I will walk through easy steps to start working with tools such as Apache Hadoop, Pig and Mahout, use them to solve some large-scale analytics and learning problems, and shed light on some of the challenges we face while working with them.

1. Apache Hadoop
1.1 Overview

Hadoop is a framework that simplifies the processing of data sets distributed across clusters of servers. Two of the main components of Hadoop are HDFS and MapReduce. HDFS is the file system that Hadoop uses to store all its data. This file system spans all the nodes used by Hadoop; these nodes can be on a single server or spread across a large number of servers.

In this section, we will go through the instructions for getting Hadoop up and running, with the configuration needed to make it useful for other components and frameworks that integrate with or depend on Hadoop (e.g. Hive, Pig, HBase).

Note: The installation will be pseudo-distributed, i.e. all Hadoop daemons run on a single machine.

1.2 Tools and Versions
I've used the following tools and versions throughout this installation:

  • Ubuntu 14.04 LTS
  • Java 1.7.0_65 (java-7-openjdk-amd64)
  • Hadoop 2.5.1

1.3 Installation and Configurations

1. Install Java using the following commands:

apt-get update
apt-get install default-jdk

2. Create Security Keys using the following commands:

ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3. Download Hadoop tar file using:

wget http://www.webhostingreviewjam.com/mirror/apache/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz

4. Extract the tar file using:

tar -xzvf hadoop-2.5.1.tar.gz

5. Move the extracted files to a location that is easy to recognize and that lets you change the Hadoop version later without many modifications:

mv hadoop-2.5.1/ /usr/local/hadoop

6. Configure the following environment variables in the ~/.bashrc file (so that they are set every time a new shell starts):

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

7. Source the bashrc file so that the current shell picks up the changes:

source ~/.bashrc
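
As a quick sanity check after sourcing the file, a small helper can report which of the variables from step 6 are still unset. This is only a minimal sketch; the function name unset_hadoop_vars is made up for illustration and is not part of Hadoop.

```shell
# Hypothetical helper (not part of Hadoop): print the name of every
# variable from step 6 that is unset or empty in the current shell.
unset_hadoop_vars() {
  for name in JAVA_HOME HADOOP_INSTALL HADOOP_MAPRED_HOME \
              HADOOP_COMMON_HOME HADOOP_HDFS_HOME YARN_HOME; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}
```

Running unset_hadoop_vars prints nothing when all six variables are set; otherwise it prints one missing name per line.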

8. Edit hadoop-env.sh using vim:

vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

The hadoop-env.sh file should look like this:
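
The screenshot from the original post is not available here. As a rough sketch, the key line is presumably the one that hard-codes JAVA_HOME. Assuming the same JDK path as in step 6, it could be set with a helper like the one below; set_java_home is a hypothetical name, and sed -i assumes GNU sed as shipped with Ubuntu.

```shell
# Hypothetical helper: make sure the given hadoop-env.sh exports a fixed
# JAVA_HOME, replacing an existing "export JAVA_HOME=..." line if present.
set_java_home() {
  env_file="$1"
  java_home="$2"
  if grep -q '^export JAVA_HOME=' "$env_file"; then
    # Rewrite the existing line in place (GNU sed).
    sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$env_file"
  else
    # No such line yet: append one.
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
  fi
}
```

For the layout used in this post, the call would be set_java_home /usr/local/hadoop/etc/hadoop/hadoop-env.sh /usr/lib/jvm/java-7-openjdk-amd64.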


This makes the value of JAVA_HOME available to Hadoop whenever it starts.

9. Edit the core-site.xml file using vim as well:

vim /usr/local/hadoop/etc/hadoop/core-site.xml
The file will look like:
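
The original listing for core-site.xml is not reproduced here. A common minimal configuration for a single-node setup sets fs.defaultFS to a local HDFS URI; the sketch below assumes hdfs://localhost:9000 (the host and port are assumptions, not taken from the post) and wraps the file in a hypothetical helper, write_core_site.

```shell
# Hypothetical helper: write a minimal core-site.xml into the given
# configuration directory (overwrites any existing file there).
write_core_site() {
  conf_dir="$1"
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Default file system URI; host and port are assumptions. -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
}
```

With the paths used in this post, it would be called as write_core_site /usr/local/hadoop/etc/hadoop.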

10. Edit the YARN configuration file yarn-site.xml as follows:

vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
The file will look like:
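
The original yarn-site.xml listing is not available here. As a sketch under that caveat, a single-node setup usually needs at least the shuffle auxiliary service so that MapReduce jobs can run on YARN; the exact contents the author used are unknown, and write_yarn_site is a hypothetical helper name.

```shell
# Hypothetical helper: write a minimal yarn-site.xml into the given
# configuration directory (overwrites any existing file there).
write_yarn_site() {
  conf_dir="$1"
  cat > "$conf_dir/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Enables the shuffle service that MapReduce needs on YARN. -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}
```

With the paths used in this post, it would be called as write_yarn_site /usr/local/hadoop/etc/hadoop.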

11. Create and edit the mapred-site.xml file:

vim /usr/local/hadoop/etc/hadoop/mapred-site.xml
The file will contain the following property, which specifies the framework used for MapReduce:
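
The property in question is presumably mapreduce.framework.name set to yarn. A minimal sketch follows, with write_mapred_site as a hypothetical helper name. Hadoop 2.x distributions typically ship only mapred-site.xml.template in that directory, which is why the file has to be created first.

```shell
# Hypothetical helper: create a minimal mapred-site.xml in the given
# configuration directory (overwrites any existing file there).
write_mapred_site() {
  conf_dir="$1"
  cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Run MapReduce jobs on YARN rather than the local runner. -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
}
```

With the paths used in this post, it would be called as write_mapred_site /usr/local/hadoop/etc/hadoop.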


12. Edit the hdfs-site.xml file, in order to specify the directories that will be used as datanode and namenode on that server.

vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Create the two directories:
mkdir -p /usr/local/hadoop_store/hdfs/namenode
mkdir -p /usr/local/hadoop_store/hdfs/datanode
After editing, the file will contain the following properties:
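
The original listing is missing, so the following is a sketch of what such an hdfs-site.xml commonly contains: a replication factor of 1 (an assumption that suits a single node) and the two directories created above. The helper name write_hdfs_site is hypothetical.

```shell
# Hypothetical helper: write a minimal hdfs-site.xml that points the
# namenode and datanode at sub-directories of the given store directory.
write_hdfs_site() {
  conf_dir="$1"
  store_dir="$2"
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- One replica is enough on a single node (assumed value). -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:${store_dir}/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:${store_dir}/datanode</value>
  </property>
</configuration>
EOF
}
```

With the paths used in this post, it would be called as write_hdfs_site /usr/local/hadoop/etc/hadoop /usr/local/hadoop_store/hdfs.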


13. Format the new Hadoop file system using the following command:

hdfs namenode -format
Note: This operation needs to be done once before we start using Hadoop. If it is executed again after Hadoop has been used, it'll destroy all the data on the Hadoop filesystem.

14. Now that all configuration is done, we can start using Hadoop. First, run the following shell scripts:

start-dfs.sh
start-yarn.sh
To make sure everything is okay and the right processes are running, run the jps command, which should show the following:
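
The original screenshot is not available. In a pseudo-distributed setup started with start-dfs.sh and start-yarn.sh, jps would normally list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself). A small sketch of an automated check, where missing_daemons is a hypothetical helper that reads jps output on standard input:

```shell
# Hypothetical helper: read `jps` output on stdin and print the name of
# every expected Hadoop daemon that does not appear in it.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare the second column exactly, so that "SecondaryNameNode"
    # is not mistaken for "NameNode".
    if ! printf '%s\n' "$jps_output" |
         awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}
```

Used as jps | missing_daemons, it prints nothing when all five daemons are up.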



15. We can run the MapReduce examples that ship with the Hadoop bundle, but first we need to do the following:

Create the HDFS directories required to execute MapReduce jobs:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/mohamed
Then copy the input files to be processed into the distributed filesystem:
hdfs dfs -put {here is the path to the files to be copied} input
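
Once the input is in HDFS, one of the bundled examples can be launched. The post does not say which example was run, so the sketch below assumes the wordcount program from the examples jar that ships with Hadoop 2.5.1; run_wordcount is a hypothetical wrapper. Relative HDFS paths such as input resolve against /user/<current user>, which is why the directory created above is named after the author's login.

```shell
# Assumed locations; adjust to your installation.
HADOOP_INSTALL="${HADOOP_INSTALL:-/usr/local/hadoop}"
EXAMPLES_JAR="$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar"

# Hypothetical wrapper: run the bundled wordcount example.
#   $1 = HDFS input directory
#   $2 = HDFS output directory (must not exist yet)
run_wordcount() {
  hadoop jar "$EXAMPLES_JAR" wordcount "$1" "$2"
}
```

For example, run_wordcount input output, after which hdfs dfs -ls output lists the result files.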

16. We can check the web consoles for the resource manager, the HDFS nodes and running jobs, as shown in the following screens:
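
The screenshots are not reproduced here. With the default ports of Hadoop 2.x (an assumption, since the post does not state them), the NameNode console is served on port 50070 and the ResourceManager console on port 8088. A small sketch that prints the URLs, using a hypothetical helper print_web_uis:

```shell
# Hypothetical helper: print the default Hadoop 2.x web console URLs
# for the given host (defaults to localhost).
print_web_uis() {
  host="${1:-localhost}"
  echo "NameNode (HDFS): http://${host}:50070/"
  echo "ResourceManager: http://${host}:8088/"
}
```

Calling print_web_uis without arguments prints the two localhost URLs, which can then be opened in a browser.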







Issues and problems:

I've experienced some issues related to:

  • Formatting the HDFS, which I resolved by changing the permissions and ownership so that my user could format the namenode and datanode.
  • A problem connecting to the resource manager, with the following error:

ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); maxRetries=45
INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

I resolved it by adding a few properties to yarn-site.xml:
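
The exact properties are not preserved in this copy of the post. A plausible sketch, assuming the fix was to replace the 0.0.0.0 default with an explicit host for the ResourceManager endpoints (the ports below are the Hadoop 2.x defaults); rm_address_properties is a hypothetical helper that prints the fragment to paste inside the <configuration> element of yarn-site.xml:

```shell
# Hypothetical helper: print <property> entries that pin the
# ResourceManager endpoints to an explicit host (default: localhost).
rm_address_properties() {
  host="${1:-localhost}"
  cat <<EOF
<property>
  <name>yarn.resourcemanager.address</name>
  <value>${host}:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>${host}:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>${host}:8031</value>
</property>
EOF
}
```

After adding the printed properties, YARN needs to be restarted (stop-yarn.sh, then start-yarn.sh) for them to take effect.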



We have reached the end of our first post on big data and analytics. I hope you enjoyed reading it and experimenting with Hadoop installation and configuration. The next post will be about Apache Pig.

More Stories By Mohamed El-Refaey

I work as head of research and development at EDC (Egypt Development Center), a member of NTG. I previously worked for Qlayer (acquired by Sun Microsystems), where my passion for cloud computing started. I have more than 10 years of experience in software design and development in e-commerce, BPM, EAI, Web 2.0, banking applications, financial markets, Java and J2EE, HIPAA, SOX, BPEL and SOA. For the last two years I have focused on virtualization technology and cloud computing through studies, technical papers, research, and participation in international groups and events. I was awarded in recognition of innovation and thought leadership while working as an IT Specialist at EDS (an HP company). I am also a member of the Cloud Computing Interoperability Forum (CCIF) and of the UCI (Unified Cloud Interface) open source project, to which I contributed on the project architecture.
