Tutorials

https://vvcestudio.com.br/en/tutorial/bancodedados/hadoop/

Big Data Hadoop

What is Hadoop?

Hadoop is a distributed computing platform for storing and processing large volumes of data across a cluster of machines, designed with fault tolerance in mind.
It was inspired by Google's MapReduce and Google File System (GFS).

Architecture

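EdgeNode - The machine used to access the Hadoop cluster.
NameNode - The server that holds the metadata. It acts as the map of the cluster: it knows which blocks make up each file and which DataNodes store them.
DataNode - Where the data is stored and the processes run. The DataNodes together make up the Hadoop cluster. A large file is split into blocks and each block is replicated; for example, a file divided into 3 parts and duplicated ends up spread across 6 different DataNode servers.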
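
The split-and-replicate idea in the DataNode description can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the function names, the 300 MB file size, the round-robin placement, and the replication factor of 2 are assumptions chosen to reproduce the "3 parts on 6 DataNodes" example. The 128 MB block size matches the default in recent Hadoop versions; real HDFS defaults to 3 replicas and uses rack-aware placement.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block of a file (hypothetical helper)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplified on purpose: real HDFS placement also considers racks and load.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    cursor = 0
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(cursor + i) % len(datanodes)] for i in range(replication)
        ]
        cursor += replication
    return placement


# Assumed example values: a 300 MB file, 128 MB blocks, 6 DataNodes, 2 copies.
blocks = split_into_blocks(300 * MB, 128 * MB)
datanodes = [f"datanode{i}" for i in range(1, 7)]
block_map = place_replicas(len(blocks), datanodes, replication=2)

for block_id, nodes in block_map.items():
    print(f"block {block_id} ({blocks[block_id] // MB} MB) -> {nodes}")
```

The dictionary `block_map` plays the role of the NameNode's metadata here: it records where each block lives, while the DataNodes would hold the actual bytes. If one DataNode fails, every block it held still has a copy on another server.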

BATCH - Processing with MapReduce, Hive, or Spark.
STREAM - Runs a MapReduce job from Mapper and Reducer scripts.
Impala - (SQL) Ideal for extracting reports (SELECT queries).
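
To make the Mapper and Reducer scripts mentioned above more concrete, here is a minimal word-count sketch in Python. It follows the convention Hadoop Streaming uses (tab-separated `key<TAB>value` lines, sorted by key between the two phases), but it runs locally: the names `mapper`, `reducer`, and `run_local`, and the sample input, are assumptions for illustration, and the `sorted` call merely stands in for the shuffle-and-sort step the framework would perform.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; expects input already sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Local stand-in for a job: map, sort (shuffle), then reduce."""
    mapped = sorted(mapper(lines))
    return list(reducer(mapped))


# Hypothetical sample input.
sample = ["Hadoop stores data", "Hadoop processes data"]
result = run_local(sample)
for line in result:
    print(line)
```

On a real cluster the two functions would typically live in separate scripts that read from standard input and write to standard output, and they would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options, together with `-input` and `-output` paths.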

Ecosystem

Spark - Better than MapReduce and easier to use.
HBase - Non-relational database; does not allow part of a file to be modified in place.
Hive - Good for crunching data.
Impala - Also uses SQL, but smarter.
Parquet - Columnar file format for tables.
Sqoop - Data integration; reads data from external sources.
Hue - Web interface for advanced users.
Oozie - Workflow scheduler. Do not use [Control-M].
