Hadoop: The Definitive Guide, Fourth Edition: http://shop.oreilly.com/product/0636920033448.do
Code and Data: http://hadoopbook.com/code.html
Download the NCDC weather dataset: https://gist.github.com/rehevkor5/2e407950ca687b36fc54
—— Dot Formula: One keystroke to move and one keystroke to execute. ——
Dot Command: repeat the last change. For example, A;<Esc> appends a semicolon to the end of the current line; after that, j moves to the next line and . repeats the change there: one keystroke to move, one keystroke to execute.
It’s very easy to install and manage multiple active Node.js versions with Node Version Manager (NVM).
First, make sure your system has a C++ compiler; on OS X, Xcode will do. Then install or update nvm with the following command:
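A minimal sketch of the usual curl-based install, assuming the standard install script from the nvm repository (the version tag in the URL is only an example; check the nvm README for the current one):

    # download and run the nvm install script (version tag is an example)
    curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

    # then, in a new shell, install and switch between Node.js versions
    nvm install 18        # install a specific Node.js major version
    nvm use 18            # activate it for the current shell
    nvm ls                # list installed versions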
———————- Vim: The God of Editors ———————-
This is a simple Vim tutorial taken from Vim's built-in documentation. You can get the whole tutorial by typing vimtutor in a shell, or vimtutor -g for the GUI version. It is intended to give a brief overview of the Vim editor, just enough to allow you to use the editor fairly easily.
Python is a wonderful programming language for data analytics. Normally, I prefer to write Python code inside Jupyter Notebook (previously known as IPython Notebook), because it lets us create and share documents that contain live code, equations, visualizations, and explanatory text. Apache Spark is a fast, general engine for large-scale data processing, and PySpark is its Python API. So running PySpark inside Jupyter is a good starting point if you are interested in data science:
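One common way to wire the two together, sketched here under the assumption that Spark and Jupyter are already installed and that pyspark is on your PATH, is to tell Spark to use Jupyter as the driver's Python:

    # make the PySpark driver launch Jupyter Notebook instead of the plain Python shell
    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS=notebook

    # starting pyspark now opens a notebook server backed by the Spark driver
    pyspark

The two environment variables are read by the pyspark launcher; unset them (or open a new shell) to get the regular PySpark REPL back.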
npm (Node Package Manager) is the package management tool for Node.js.
Node.js is an open-source JavaScript runtime built on Chrome’s V8 JavaScript engine. It uses an event-driven, non-blocking I/O model that makes it lightweight and efficient. Note that Node.js is a server-side runtime environment rather than a language.
A package.json file is created first by running npm init:
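As a rough sketch (the generated fields below are only typical defaults; the real contents depend on your answers to the prompts and the name of your project folder):

    npm init -y          # -y accepts all defaults and writes package.json immediately
    cat package.json     # inspect the generated manifest

    # a freshly generated package.json looks roughly like:
    # {
    #   "name": "demo-app",
    #   "version": "1.0.0",
    #   "main": "index.js",
    #   "scripts": { "test": "echo \"Error: no test specified\" && exit 1" }
    # }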
The LoopBack framework is a set of Node.js modules that you can use independently or together to quickly build applications that expose REST APIs.
Loopback: http://loopback.io/
Getting started: http://loopback.io/getting-started/
Create a simple API: https://docs.strongloop.com/display/public/LB/Create+a+simple+API
LoopBack core concepts: https://docs.strongloop.com/display/public/LB/LoopBack+core+concepts
standalone mode
org.apache.spark.deploy.master.Master: onStart() => registerWithMaster()
In standalone mode the Master is an RPC endpoint; its onStart() runs when the cluster manager boots, bringing up the web UI and scheduling the worker-timeout check. On the other side of the protocol, an endpoint's onStart() calls registerWithMaster() to announce itself to the Master, which is how Workers (and application clients) join the cluster.
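To make the shape of that handshake concrete, here is a minimal, self-contained Scala sketch; the trait and class names are simplified stand-ins for Spark's (private) RPC classes, not the actual implementation:

    // Simplified stand-ins for Spark's RpcEndpoint / Master / Worker, illustrating
    // the onStart() => registerWithMaster() pattern described above.
    trait EndpointLike {
      def onStart(): Unit                       // invoked once when the endpoint starts
    }

    class MasterLike extends EndpointLike {
      private var workers = List.empty[String]
      override def onStart(): Unit =
        println("Master started: web UI up, worker-timeout checks scheduled")
      def register(workerId: String): Unit = {
        workers ::= workerId
        println(s"Registered $workerId (total workers: ${workers.size})")
      }
    }

    class WorkerLike(id: String, master: MasterLike) extends EndpointLike {
      override def onStart(): Unit = registerWithMaster()   // same shape as Worker.onStart()
      private def registerWithMaster(): Unit = master.register(id)
    }

    object StandaloneSketch extends App {
      val master = new MasterLike
      master.onStart()
      new WorkerLike("worker-1", master).onStart()
    }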
The job-scheduling flow: an action on an RDD built from something like sc.textFile(...) submits a job; the RDD objects form the lineage DAG; the DAGScheduler splits the DAG into stages at shuffle boundaries and handles errors between stages (for example, recomputing lost shuffle output); each stage is then handed off as a TaskSet to the TaskScheduler (org.apache.spark.scheduler.TaskScheduler), which launches the tasks on executors and handles errors inside a stage, such as retrying failed tasks.
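A tiny, self-contained job to watch this happen; a sketch that assumes a local Spark installation, where README.md is just a placeholder input path and the shuffle in reduceByKey is what creates the stage boundary:

    import org.apache.spark.{SparkConf, SparkContext}

    object SchedulerDemo extends App {
      val sc = new SparkContext(
        new SparkConf().setAppName("scheduler-demo").setMaster("local[2]"))

      val counts = sc.textFile("README.md")      // RDD objects: the lineage / DAG
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)                      // shuffle => DAGScheduler cuts a stage here

      counts.collect().take(5).foreach(println)  // the action submits the job; each stage becomes a TaskSet
      sc.stop()
    }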
Reading the Spark source code in IntelliJ IDEA is a good choice, and this tutorial introduces how to set it up. First, clone Spark to a local directory:
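A minimal sketch of the first step (the build command is optional and may take a while; importing the checkout as a Maven project in IntelliJ IDEA is the usual next step):

    git clone https://github.com/apache/spark.git
    cd spark
    # optional: build once so generated sources are available to the IDE
    ./build/mvn -DskipTests clean package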