[GreenPlum] Building an Open-Source Big Data Environment (Cluster): 0. GreenPlum3.5

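Among MPP databases, EMC GreenPlum is probably the best known. I have used it from 4.3.2 three years ago through the later stable release 4.3.5, and I have tried building the open-source cluster in a test environment. This post summarizes that setup.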

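Three-node layout and host names (HMaster: 12 cores, 30 GB RAM; HDATA: 8 cores, 20 GB RAM):

192.168.111.140 HMASTER master,segment0 (for convenience and performance I also use the Master as a data node; if you don't want that, remove it from seg_hosts)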
192.168.111.141 HDATA01 standby,segment1
192.168.111.142 HDATA02 segment2

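Outline: I. Environment parameters
II. Install the software
III. Configure the node directories
IV. Initialize GP

Detailed steps:

I. Environment parameters
1. Create the user and group, and set the host names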

groupadd -g 600 gpadmin
useradd -u 601 gpadmin -g gpadmin -d /home/gpadmin
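passwd gpadmin    # choose a password that is easy to remember; I set it to gpadmin on every host

Changing the host names is omitted here; check both /etc/hosts and /etc/sysconfig/network on every machine.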
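
A minimal sketch of the /etc/hosts check, assuming the three addresses and host names from the node list at the top of this post. The function name missing_hosts_entries is hypothetical; it only reads a hosts file and reports which mappings are absent.

```shell
# Print each "IP NAME" pair from the node list that is missing in a hosts file.
# Hypothetical helper; the addresses and names are the ones used in this post.
missing_hosts_entries() {
    local hosts_file=$1
    while read -r ip name; do
        # Match the IP in column 1 and the name in any later column,
        # comparing names case-insensitively.
        if ! awk -v ip="$ip" -v name="$name" '
                $1 == ip {
                    for (i = 2; i <= NF; i++)
                        if (tolower($i) == tolower(name)) found = 1
                }
                END { exit !found }
            ' "$hosts_file"; then
            echo "$ip $name"
        fi
    done <<'EOF'
192.168.111.140 HMASTER
192.168.111.141 HDATA01
192.168.111.142 HDATA02
EOF
}
```

Running missing_hosts_entries /etc/hosts on each machine should print nothing once all three entries are in place.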

 

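2. Adjust system parameters (all machines, as root)
Step 1. Operating-system performance settings. Change all of them; otherwise errors are likely, for example from shmmax.

$ vi /etc/sysctl.conf    (reference settings follow)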
kernel.sem = 256 64000 100 512
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
kernel.shmmni = 4096
# Controls the total amount of shared memory, in pages
kernel.shmall = 4294967296
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 1
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the default maximum size of a message queue
kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.default.arp_filter = 1
net.core.netdev_max_backlog = 10000
vm.overcommit_memory = 2

$ sysctl -p    # apply the parameters immediately
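
To confirm that an edit actually landed in the file, the configured value can be compared with the running one. A small sketch, assuming a hypothetical helper name sysctl_conf_value; it parses a sysctl.conf-style file and is not part of any Greenplum tooling.

```shell
# Read the value of one key from a sysctl.conf-style file.
# Whitespace around "=" is ignored; if a key appears twice, the last one is printed.
sysctl_conf_value() {
    local file=$1 key=$2
    awk -F= -v k="$key" '
        /^[ \t]*#/ { next }
        {
            name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == k) {
                val = $2; gsub(/^[ \t]+|[ \t]+$/, "", val)
                found = val
            }
        }
        END { if (found != "") print found }
    ' "$file"
}
```

For example, sysctl_conf_value /etc/sysctl.conf kernel.shmmax should print the same number as sysctl -n kernel.shmmax after sysctl -p has run.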

 

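Step 2. Disk performance and character-set settings (Note: I ran my tests on my own Hadoop environment, so I did not adjust these myself; in real production, be sure to tune them as needed.)
1> In /etc/inittab, change the line id:5:initdefault: to id:3:initdefault:
id:3:initdefault:    (system runlevel = 3)

2> Disk read-ahead: set the block read-ahead to 16384. A reboot is required.
Add the following to /etc/rc.d/rc.local:
blockdev --setra 16384 /dev/sd*
To verify, run the following after the reboot:
blockdev --getra /dev/sd*
blockdev --getra /dev/vg0/*
Every device should report 16384.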
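
blockdev counts read-ahead in 512-byte sectors, so 16384 corresponds to 8 MiB per device. A small sketch of that conversion (the function name is hypothetical), useful when cross-checking against /sys/block/<device>/queue/read_ahead_kb, which reports KiB:

```shell
# Convert a blockdev read-ahead value (512-byte sectors) to KiB.
readahead_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}
```

readahead_sectors_to_kib 16384 gives 8192, which is the figure read_ahead_kb should show once the setting is in effect.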

 

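3> Disk I/O scheduler
Edit /boot/grub/menu.lst, find the line starting with kernel /vmlinuz-xxx, and append elevator=deadline to the end of it.
To verify, once the system has booted normally, run cat /sys/block/*/queue/scheduler
You should see: noop anticipatory [deadline] cfq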

 

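4> Hugepage setting (transparent huge pages)
Edit /boot/grub/grub.conf and append transparent_hugepage=never to the end of the kernel line (the same line that carries elevator=deadline), then save and exit.
To verify: cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
It should show: always [never]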
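
Both the scheduler file and the hugepage file mark the active choice with square brackets. A sketch of pulling that value out in a script, with a hypothetical function name:

```shell
# Print the option shown in [brackets], i.e. the one currently active.
active_option() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

For example, active_option "$(cat /sys/block/sda/queue/scheduler)" should print deadline after the reboot; sda is an assumed device name, so substitute your own.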

Step 3. Security limits (all machines, as root)

$ vi /etc/security/limits.conf    (add the following)
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
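
A sketch for checking what a limits file actually configures, assuming a hypothetical helper name limits_conf_value. It reads only the file passed to it, so it can also be pointed at files under /etc/security/limits.d/, which on some RHEL 6 systems set a lower nproc that takes precedence.

```shell
# Print the value configured for a domain/type/item triple in a limits.conf-style file.
# If the triple appears more than once, the last occurrence is printed.
limits_conf_value() {
    local file=$1 domain=$2 kind=$3 item=$4
    awk -v d="$domain" -v t="$kind" -v i="$item" '
        /^[ \t]*#/ { next }
        $1 == d && $2 == t && $3 == i { value = $4 }
        END { if (value != "") print value }
    ' "$file"
}
```

After logging in again as gpadmin, ulimit -n and ulimit -u should match the values printed by limits_conf_value /etc/security/limits.conf '*' soft nofile and the corresponding nproc query.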

 

II. Install the software
3. Install Greenplum on the master node

$ unzip greenplum-db-4.3.5.2-build-1-RHEL5-x86_64.zip
$ ./greenplum-db-4.3.5.2-build-1-RHEL5-x86_64.bin

Answer the installer prompts as follows:

********************************************************************************
Do you accept the Pivotal Database license agreement? [yes|no]
********************************************************************************
yes

********************************************************************************
Provide the installation path for Greenplum Database or press ENTER to
accept the default installation path: /usr/local/greenplum-db-4.3.5.2
********************************************************************************
/home/gpadmin/greenplum-db-4.3.5.2

********************************************************************************
Install Greenplum Database into </home/gpadmin/greenplum-db-4.3.5.2>? [yes|no]
********************************************************************************
yes

********************************************************************************
/home/gpadmin/greenplum-db-4.3.5.2 does not exist.
Create /home/gpadmin/greenplum-db-4.3.5.2 ? [yes|no]
(Selecting no will exit the installer)
********************************************************************************
yes

********************************************************************************
Installation complete.
Greenplum Database is installed in /home/gpadmin/greenplum-db-4.3.5.2

Pivotal Greenplum documentation is available
for download at http://docs.gopivotal.com/gpdb
********************************************************************************

 

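4. Configure the environment variables with vi .bash_profile (strictly, only the Master and Standby need them; with root access you can edit /etc/profile instead). Reference settings, applied here on all machines: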

source /home/gpadmin/greenplum-db/greenplum_path.sh
export GP_HOME=/home/gpadmin/greenplum-db-4.3.5.2
export MASTER_DATA_DIRECTORY=/home/gpadmin/gpdata/gpmaster/gpseg-1
# libpq clients such as psql read PGPORT and PGDATABASE
export PGPORT=5432
export PGDATABASE=sordb
export PATH=${GP_HOME}/bin:$PATH
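
Before moving on, it helps to confirm that a fresh login shell really exports these variables. A minimal sketch with a hypothetical function name; it checks only the names you pass in.

```shell
# Return non-zero and name every listed variable that is not exported or is empty.
require_env() {
    local name missing=0
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, require_env MASTER_DATA_DIRECTORY GPHOME, on the assumption that greenplum_path.sh sets GPHOME as it normally does.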

 

5. Configure the cluster host lists
Step 1. vi /home/gpadmin/greenplum-db/etc/hostlist    (reference contents)

HMaster
HData01
HData02

Step 2. vi /home/gpadmin/greenplum-db/etc/seg_hosts    (reference contents; for convenience and performance I also use the Master as a data node, so remove it if you don't want that)

HMaster
HData01
HData02
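
The only difference between the two files is whether the master appears in seg_hosts. A sketch of generating the segment list from one host list, with a hypothetical function name and a yes/no switch for the master:

```shell
# Print the segment host list, one host per line.
# $1: master host name; $2: "yes" to keep the master as a segment host;
# remaining arguments: all hosts in the cluster.
segment_hosts() {
    local master=$1 include_master=$2 host
    shift 2
    for host in "$@"; do
        if [ "$host" = "$master" ] && [ "$include_master" != "yes" ]; then
            continue
        fi
        echo "$host"
    done
}
```

For the layout in this post: segment_hosts HMaster yes HMaster HData01 HData02 > /home/gpadmin/greenplum-db/etc/seg_hosts. Passing no instead of yes leaves the master out.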

 

6. Set up passwordless SSH with gpssh-exkeys

$ gpssh-exkeys -f /home/gpadmin/greenplum-db/etc/hostlist    # the output looks like this:

[gpadmin@HMASTER greenplum-db]$ gpssh-exkeys -f /home/gpadmin/greenplum-db/etc/hostlist
[STEP 1 of 5] create local ID and authorize on local host
... /home/gpadmin/.ssh/id_rsa file exists ... key generation skipped

[STEP 2 of 5] keyscan all hosts and update known_hosts file

[STEP 3 of 5] authorize current user on remote hosts
... send to HMaster
... send to HData01
***
*** Enter password for HData01:
... send to HData02

[STEP 4 of 5] determine common authentication file content

[STEP 5 of 5] copy authentication files to all remote hosts
... finished key exchange with HMaster
... finished key exchange with HData01
... finished key exchange with HData02

[INFO] completed successfully
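
After the key exchange, a quick loop can confirm that every host accepts a login without a password prompt. A sketch under stated assumptions: the function name and the SSH_CMD override are hypothetical (the override exists only so the loop can be dry-run), and BatchMode makes ssh fail instead of prompting.

```shell
# Try a non-interactive SSH login to every host listed in a file.
# Returns non-zero if any host cannot be reached without a password.
check_ssh_hosts() {
    local hostfile=$1 host failed=0
    while IFS= read -r host; do
        [ -z "$host" ] && continue
        # </dev/null keeps ssh from consuming the rest of the host list.
        if ! ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "no passwordless SSH to $host" >&2
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}
```

Run it as gpadmin on the master: check_ssh_hosts /home/gpadmin/greenplum-db/etc/hostlist.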

 

7. Install on the segment hosts

gpseginstall -f /home/gpadmin/greenplum-db/etc/hostlist -u gpadmin -p gpadmin

 

III. Configure the node directories
8. Create the following directories on each node:

Master directory:
mkdir -p /home/gpadmin/gpdata/gpmaster
Primary directories:
mkdir -p /home/gpadmin/gpdata/gpdatap1
mkdir -p /home/gpadmin/gpdata/gpdatap2
Mirror directories:
mkdir -p /home/gpadmin/gpdata/gpdatam1
mkdir -p /home/gpadmin/gpdata/gpdatam2
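
The five mkdir calls can be wrapped in one function so that every node gets an identical tree. A minimal sketch; the function name is hypothetical and the directory names are the ones listed above.

```shell
# Create the master, primary, and mirror directories under one base directory.
make_gp_dirs() {
    local base=$1 dir
    for dir in gpmaster gpdatap1 gpdatap2 gpdatam1 gpdatam2; do
        mkdir -p "$base/$dir" || return 1
    done
}
```

On each node, run make_gp_dirs /home/gpadmin/gpdata as gpadmin. Once SSH keys are exchanged, the same mkdir -p commands could instead be sent to all hosts with gpssh -f and the hostlist file.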

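Note: size this according to your actual hardware. For example, on machines with 12 cores, 192 GB of RAM, and 16 disks, I strongly recommend 4p+4m (that is, 8 segment instances per machine: 4 primaries and 4 mirrors).
Each machine mounts two data directories, /data1 and /data2, and each data directory holds 2 primaries and 2 mirrors.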
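
For that 4p+4m layout, gpinitsystem creates one primary per entry in DATA_DIRECTORY, so a directory listed twice receives two segments. A sketch of the corresponding gpinitsystem_config lines, assuming hypothetical subdirectories named primary and mirror under /data1 and /data2 (the note above fixes only the mount points):

```shell
# Hypothetical 4p+4m layout: two primaries and two mirrors per mount point.
# The primary/ and mirror/ subdirectory names are assumptions for illustration.
declare -a DATA_DIRECTORY=(/data1/primary /data1/primary /data2/primary /data2/primary)
declare -a MIRROR_DATA_DIRECTORY=(/data1/mirror /data1/mirror /data2/mirror /data2/mirror)
```

The mirror list must have the same number of entries as the primary list, and the directories must exist on every segment host before initialization.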

IV. Initialize GP
9. Configure /home/gpadmin/greenplum-db/etc/gpinitsystem_config

$ cp /home/gpadmin/greenplum-db-4.3.5.2/docs/cli_help/gpconfigs/gpinitsystem_config /home/gpadmin/greenplum-db/etc/
$ vi /home/gpadmin/greenplum-db/etc/gpinitsystem_config
Reference contents:
# FILE NAME: gpinitsystem_config

# Configuration file needed by the gpinitsystem

################################################
#### REQUIRED PARAMETERS
################################################

#### Name of this Greenplum system enclosed in quotes.
ARRAY_NAME="EMC Greenplum DW"

#### Naming convention for utility-generated data directories.
SEG_PREFIX=gpseg

#### Base number by which primary segment port numbers
#### are calculated.
PORT_BASE=40000

#### File system location(s) where primary segment data directories
#### will be created. The number of locations in the list dictate
#### the number of primary segments that will get created per
#### physical host (if multiple addresses for a host are listed in
#### the hostfile, the number of segments will be spread evenly across
#### the specified interface addresses).
declare -a DATA_DIRECTORY=(/home/gpadmin/gpdata/gpdatap1 /home/gpadmin/gpdata/gpdatap2)

#### OS-configured hostname or IP address of the master host.
MASTER_HOSTNAME=HMaster

#### File system location where the master data directory
#### will be created.
MASTER_DIRECTORY=/home/gpadmin/gpdata/gpmaster

#### Port number for the master instance.
MASTER_PORT=5432

#### Shell utility used to connect to remote hosts.
TRUSTED_SHELL=ssh

#### Maximum log file segments between automatic WAL checkpoints.
CHECK_POINT_SEGMENTS=8

#### Default server-side character set encoding.
ENCODING=UTF8

################################################
#### OPTIONAL MIRROR PARAMETERS
################################################

#### Base number by which mirror segment port numbers
#### are calculated.
#MIRROR_PORT_BASE=50000

#### Base number by which primary file replication port
#### numbers are calculated.
#REPLICATION_PORT_BASE=41000

#### Base number by which mirror file replication port
#### numbers are calculated.
#MIRROR_REPLICATION_PORT_BASE=51000

#### File system location(s) where mirror segment data directories
#### will be created. The number of mirror locations must equal the
#### number of primary locations as specified in the
#### DATA_DIRECTORY parameter.
#declare -a MIRROR_DATA_DIRECTORY=(/home/gpadmin/gpdata/gpdatam1 /home/gpadmin/gpdata/gpdatam2)
################################################
#### OTHER OPTIONAL PARAMETERS
################################################

#### Create a database of this name after initialization.
#DATABASE_NAME=name_of_database
DATABASE_NAME=sordb

#### Specify the location of the host address file here instead of
#### with the -h option of gpinitsystem.
#MACHINE_LIST_FILE=/home/gpadmin/gpconfigs/hostfile_gpinitsystem

 

10. Run the GP initialization

$ gpinitsystem -c /home/gpadmin/greenplum-db/etc/gpinitsystem_config -h /home/gpadmin/greenplum-db/etc/seg_hosts -s HData01
# -s HData01 makes the HData01 node the standby master
Key log output:
20170213:04:29:03:006400 gpinitsystem:HMASTER:gpadmin-[INFO]:-HData01 /home/gpadmin/gpdata/gpdatap1/gpseg0 40000 2 0
20170213:04:29:03:006400 gpinitsystem:HMASTER:gpadmin-[INFO]:-HData01 /home/gpadmin/gpdata/gpdatap2/gpseg1 40001 3 1
20170213:04:29:03:006400 gpinitsystem:HMASTER:gpadmin-[INFO]:-HData02 /home/gpadmin/gpdata/gpdatap1/gpseg2 40000 4 2
20170213:04:29:03:006400 gpinitsystem:HMASTER:gpadmin-[INFO]:-HData02 /home/gpadmin/gpdata/gpdatap2/gpseg3 40001 5 3
20170213:04:29:03:006400 gpinitsystem:HMASTER:gpadmin-[INFO]:-HMaster /home/gpadmin/gpdata/gpdatap1/gpseg4 40000 6 4
20170213:04:29:03:006400 gpinitsystem:HMASTER:gpadmin-[INFO]:-HMaster /home/gpadmin/gpdata/gpdatap2/gpseg5 40001 7 5
Continue with Greenplum creation Yy/Nn>
y
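
The log follows a simple pattern: each segment host gets one primary per DATA_DIRECTORY entry, and ports count up from PORT_BASE within a host. A sketch that reproduces the host, directory, port, and content-ID columns (the dbid column is left out). The function name is hypothetical, and the host order is copied from the log rather than from any documented guarantee.

```shell
# Print "host datadir port content_id" for every primary segment.
# $1: PORT_BASE; $2: space-separated data directories; remaining: hosts in order.
primary_layout() {
    local port_base=$1 dirs=$2 host dir offset content=0
    shift 2
    for host in "$@"; do
        offset=0
        for dir in $dirs; do
            echo "$host $dir/gpseg$content $(( port_base + offset )) $content"
            offset=$(( offset + 1 ))
            content=$(( content + 1 ))
        done
    done
}
```

Calling primary_layout 40000 "/home/gpadmin/gpdata/gpdatap1 /home/gpadmin/gpdata/gpdatap2" HData01 HData02 HMaster prints six lines, one per primary, which can be compared with the list gpinitsystem shows before you answer y.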

11. Start and stop

$ gpstart    # start the GP cluster
$ gpstop     # stop the GP cluster
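
Both utilities prompt for confirmation by default, which is awkward in scripts. A sketch of a small lookup for the non-interactive forms; the function name and the action names are hypothetical, while the flags (-a, -M fast, -r, -u, and gpstate -s) are standard options of the Greenplum utilities.

```shell
# Print the non-interactive Greenplum command for a cluster action.
gp_cluster_cmd() {
    case $1 in
        start)   echo "gpstart -a" ;;          # start without prompting
        stop)    echo "gpstop -a -M fast" ;;   # fast shutdown without prompting
        restart) echo "gpstop -a -r" ;;        # stop, then start again
        reload)  echo "gpstop -u" ;;           # reload configuration files only
        status)  echo "gpstate -s" ;;          # detailed cluster status
        *)       echo "usage: gp_cluster_cmd start|stop|restart|reload|status" >&2
                 return 2 ;;
    esac
}
```

The function only prints the command, so you can inspect it first and then run it as gpadmin, for example with eval "$(gp_cluster_cmd status)".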

 

12. Usage example: creating a table and reading from it

$ psql -d sordb
sordb=# create table students (id string,value string);
ERROR: type "string" does not exist
sordb=# create table students (id int,value varchar(20));
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
sordb=# insert into students values(1,'zacks');
INSERT 0 1
sordb=# select * from students;
 id | value
----+-------
  1 | zacks
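
The NOTICE in the session appears because the table was created without a distribution key. A sketch of declaring the key explicitly and then checking how rows spread across segments; students_demo is a hypothetical table name chosen so that it does not collide with the table created in the session, and the function merely prints SQL.

```shell
# Print SQL that creates a demo table with an explicit distribution key
# and then counts rows per segment to reveal skew.
distribution_demo_sql() {
    cat <<'SQL'
CREATE TABLE students_demo (id int, value varchar(20)) DISTRIBUTED BY (id);
INSERT INTO students_demo VALUES (1, 'zacks');
SELECT gp_segment_id, count(*) FROM students_demo GROUP BY gp_segment_id ORDER BY gp_segment_id;
SQL
}
```

To run it against the database from this post: distribution_demo_sql | psql -d sordb. With real data, strongly uneven counts per gp_segment_id suggest that a different distribution key would be a better fit.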