yyyyb / -spark-

sparkstreaming notes

Flume Study #1

Open yyyyb opened 5 years ago

yyyyb commented 5 years ago

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

yyyyb commented 5 years ago

webserver (source side) ==> Flume ==> HDFS (destination)

yyyyb commented 5 years ago

Log-collection tools compared:
Flume: Cloudera/Apache, Java
Scribe: Facebook, C/C++, no longer maintained
Chukwa: Yahoo/Apache, Java, no longer maintained
Fluentd: Ruby
Logstash: the L in the ELK stack (ElasticSearch, Logstash, Kibana)

yyyyb commented 5 years ago

Flume architecture and core components:
1. Source: collects data
2. Channel: aggregates/buffers data
3. Sink: writes data out
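The three components above form a pipeline: the source puts events into the channel, and the sink takes them out. A toy Python sketch (not Flume code, just an illustration of the data flow) of a memory channel feeding a logger-style sink:

```python
from collections import deque

class MemoryChannel:
    """Buffers events between source and sink, like Flume's memory channel."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.queue = deque()

    def put(self, event):
        if len(self.queue) >= self.capacity:
            raise RuntimeError("channel full")
        self.queue.append(event)

    def take(self):
        # Return None when the channel is empty, like a non-blocking take.
        return self.queue.popleft() if self.queue else None

class LoggerSink:
    """Records each event it takes from the channel, like Flume's logger sink."""
    def __init__(self, channel):
        self.channel = channel
        self.seen = []

    def drain(self):
        while (event := self.channel.take()) is not None:
            self.seen.append(event)

# The "source" side simply puts collected lines into the channel.
channel = MemoryChannel()
for line in ["log line 1", "log line 2"]:
    channel.put(line)

sink = LoggerSink(channel)
sink.drain()
print(sink.seen)  # ['log line 1', 'log line 2']
```

The real components differ in detail (transactions, batching), but the put/take decoupling through a bounded channel is the essential shape.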

yyyyb commented 5 years ago

Flume environment setup.
Prerequisites:
1. Java Runtime Environment - Java 1.8 or later
2. Memory - Sufficient memory for configurations used by sources, channels or sinks
3. Disk Space - Sufficient disk space for configurations used by channels or sinks
4. Directory Permissions - Read/Write permissions for directories used by agent

1. Install the JDK
First, extract the JDK into the software directory:
tar -zxvf jdk-8uxxx-linux-x64.tar.gz -C ~/app/
Configure environment variables:
vi ~/.bash_profile

export JAVA_HOME=/path-where-the-jdk-was-extracted
export PATH=$JAVA_HOME/bin:$PATH

source ~/.bash_profile

yyyyb commented 5 years ago

2. Install Flume
Download from http://archive.cloudera.com/cdh5/cdh/5/ — find the matching CDH version and download flume-ng-1.6.0-cdh5.7.0.tar.gz.
Extract to the target directory:
tar -zxvf flume-ng-1.6.0-cdh5.7.0.tar.gz -C /target-directory
Configure environment variables:
vi ~/.bash_profile

export FLUME_HOME=/path-where-flume-was-extracted
export PATH=$FLUME_HOME/bin:$PATH

source ~/.bash_profile

yyyyb commented 5 years ago

3. Configure Flume
Edit flume-env.sh in the conf directory:
cp flume-env.sh.template flume-env.sh
vi flume-env.sh

JAVA_HOME=/App/jdk1.8.0_211

4. Verify: in the bin directory, run flume-ng version

yyyyb commented 5 years ago

The key to using Flume is writing the configuration file:
A) configure the Source
B) configure the Channel
C) configure the Sink
D) wire the three components together

yyyyb commented 5 years ago

example.conf: a single-node Flume configuration

The example from the official docs. a1 is the agent name, r1 the source name, k1 the sink name, c1 the channel name.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
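What the netcat source does is conceptually simple: listen on a TCP port and turn each newline-delimited line into an event. A small self-contained Python sketch (plain sockets standing in for the agent, an ephemeral port standing in for 44444) of that behavior:

```python
import socket
import threading

received = []

def serve(server_sock):
    """Accept one connection and collect each line, like a netcat source."""
    conn, _ = server_sock.accept()
    with conn:
        for line in conn.makefile("rb"):
            received.append(line.rstrip(b"\r\n"))

# Port 0 lets the OS pick a free port; the Flume config above uses 44444.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=serve, args=(server,))
t.start()

# The client side plays the role of `telnet localhost 44444`.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello\r\n")

t.join()
server.close()
print(received)  # [b'hello']
```

Each collected line would then become the body of one event, flowing through the memory channel to the logger sink.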

yyyyb commented 5 years ago

Start the agent:
bin/flume-ng agent \
-n $agent_name \ (agent name, e.g. -n (or --name) a1)
-c conf \ (conf directory, i.e. -c (or --conf) $FLUME_HOME/conf)
-f conf/flume-conf.properties.template \ (config file to run, e.g. -f (or --conf-file) $FLUME_HOME/conf/example.conf)
-Dflume.root.logger=INFO,console

Test with telnet:
telnet hadoop000 44444

Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }
An Event is the basic unit of data transfer in Flume:
Event = optional headers + byte array
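The structure can be sketched directly from that definition: optional string headers plus a byte-array body. A minimal Python model (illustrative only, not Flume's actual Event class), reproducing the hex dump from the log output above:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """Event = optional headers (key/value map) + byte-array body."""
    body: bytes
    headers: dict = field(default_factory=dict)

# "hello" followed by the carriage return that telnet sends: 68 65 6C 6C 6F 0D
e = Event(body=b"hello\r")
print(e.headers)                 # {}
print(e.body.hex(" ").upper())   # 68 65 6C 6C 6F 0D
```

The headers carry routing metadata (timestamps, host names) while the body is opaque bytes, which is why the logger sink prints it as a hex dump.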

yyyyb commented 5 years ago

Requirement 2: monitor a file and transfer its contents
Agent selection: exec source + memory channel + logger sink
Configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/data/data.log
a1.sources.r1.shell = /bin/sh -c

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
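The behavior `tail -F` gives the exec source is: pick up any lines appended to the file after the point being watched. A rough Python sketch of that follow semantics, using a temp file to simulate a log being appended to (illustration only, `tail -F` also handles rotation, which this does not):

```python
import os
import tempfile

def read_new_lines(path, offset):
    """Return (new_lines, new_offset) for data appended past `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    lines = [l.decode() for l in data.splitlines()]
    return lines, offset + len(data)

# Simulate /root/data/data.log being written to while the agent runs.
fd, path = tempfile.mkstemp()
os.close(fd)
offset = 0

with open(path, "a") as f:
    f.write("first event\n")
lines, offset = read_new_lines(path, offset)

with open(path, "a") as f:
    f.write("second event\n")
more, offset = read_new_lines(path, offset)

os.remove(path)
print(lines, more)  # ['first event'] ['second event']
```

Each appended line becomes one event body, exactly as new lines in data.log show up in the logger sink's output.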

yyyyb commented 5 years ago

Requirement 3: server A collects logs and sends them to server B
Technology selection:
Server A: exec source + memory channel + avro sink
Server B: avro source + memory channel + logger sink

exec-memory-avro.conf:

# Name the components on this agent
exec-memory-avro.sources = exec-source
exec-memory-avro.sinks = avro-sink
exec-memory-avro.channels = memory-channel

# Describe/configure the source
exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /root/data/data.log
exec-memory-avro.sources.exec-source.shell = /bin/sh -c

# Describe the sink
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = hadoop000
exec-memory-avro.sinks.avro-sink.port = 44444

# Use a channel which buffers events in memory
exec-memory-avro.channels.memory-channel.type = memory

# Bind the source and sink to the channel
exec-memory-avro.sources.exec-source.channels = memory-channel
exec-memory-avro.sinks.avro-sink.channel = memory-channel

avro-memory-logger.conf:

# Name the components on this agent
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel

# Describe/configure the source
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = hadoop000
avro-memory-logger.sources.avro-source.port = 44444

# Describe the sink
avro-memory-logger.sinks.logger-sink.type = logger

# Use a channel which buffers events in memory
avro-memory-logger.channels.memory-channel.type = memory

# Bind the source and sink to the channel
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel

Start avro-memory-logger first, then start exec-memory-avro.
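The start order matters because agent A's avro sink connects out to agent B's avro source, so B's listener must already be up. A two-hop Python sketch with plain TCP standing in for Avro (illustrative only, real Avro RPC adds framing and batching):

```python
import socket
import threading

logged = []

def agent_b(server_sock):
    """Plays avro-memory-logger: accept, read lines, 'log' them."""
    conn, _ = server_sock.accept()
    with conn:
        for line in conn.makefile("rb"):
            logged.append(line.rstrip(b"\n").decode())

# Start agent B first so its listener is up (the config binds hadoop000:44444;
# here we use loopback and an ephemeral port).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=agent_b, args=(server,))
t.start()

# Agent A (exec-memory-avro) then connects and forwards its collected lines.
with socket.create_connection(("127.0.0.1", port)) as sink:
    for line in ["event from server A"]:
        sink.sendall(line.encode() + b"\n")

t.join()
server.close()
print(logged)  # ['event from server A']
```

If A started first, its connection to port 44444 would be refused until B was listening — the same reason the note above says to start avro-memory-logger first.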