
Spark Setup

[Ported from my personal GitBook] Setting up Spark locally

Java Installation and Environment Setup
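A minimal sketch of this step (assuming the Oracle JDK 8u101 tarball, to match the jdk1.8.0_101 paths used throughout this post):

    sudo mkdir -p /usr/lib/java
    sudo tar -zxf jdk-8u101-linux-x64.tar.gz -C /usr/lib/java
    # then in /etc/profile (the full block appears in the Hadoop section below):
    export JAVA_HOME=/usr/lib/java/jdk1.8.0_101
    export PATH=$PATH:$JAVA_HOME/bin
    java -version   # should report java version "1.8.0_101"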

SSH Installation and Testing

Can be skipped for single-machine Spark.
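For the pseudo-distributed setup below, Hadoop's start/stop scripts ssh into localhost, so passwordless login should work first. A typical setup (assuming openssh-server is available via apt):

    sudo apt-get install openssh-server
    ssh-keygen -t rsa                                # accept the defaults, empty passphrase
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # authorize the key for localhost
    ssh localhost                                    # should log in without a password prompt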

Hadoop Installation and Configuration (can be skipped)

  • Download Apache Hadoop (the binary release)
  • Extract it under /home/zydar/software (putting it under /usr is not recommended)
  • Configure JAVA_HOME (in etc/hadoop/hadoop-env.sh)

    #export JAVA_HOME=${JAVA_HOME}
    export JAVA_HOME=/usr/lib/java/jdk1.8.0_101
  • Configure /etc/profile: export HADOOP_HOME, and append $HADOOP_HOME/bin:$HADOOP_HOME/sbin to PATH

    JAVA_HOME=/usr/lib/java/jdk1.8.0_101
    JRE_HOME=$JAVA_HOME/jre
    HADOOP_HOME=/home/zydar/software/hadoop-2.7.3
    PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export JAVA_HOME
    export JRE_HOME
    export HADOOP_HOME
    export PATH
  • Verify

    hadoop version
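    # the first line of output should read: Hadoop 2.7.3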
  • WordCount test

    cd $HADOOP_HOME
    sudo mkdir input
    cp etc/hadoop/* input
    # the output directory must not already exist when the job runs
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input /home/zydar/output
    cat /home/zydar/output/*
    rm -r /home/zydar/output
    rm -r input
  • Pseudo-distributed Hadoop configuration

  • Configure core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml

    【core-site.xml】
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:8082</value>
      </property>
    </configuration>
    【hdfs-site.xml】
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/zydar/software/hadoop-2.7.3/dfs/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/zydar/software/hadoop-2.7.3/dfs/data</value>
      </property>
    </configuration>
    【yarn-site.xml】
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>
    【mapred-site.xml】(first: cp mapred-site.xml.template mapred-site.xml)
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
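    As a quick sanity check that these files are being picked up, hdfs getconf can echo individual keys back:

    hdfs getconf -confKey fs.defaultFS      # expect hdfs://localhost:8082
    hdfs getconf -confKey dfs.replication   # expect 1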
  • Format the NameNode

    bin/hdfs namenode -format
  • Start/stop Hadoop

    start-all.sh
    stop-all.sh
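    start-all.sh is deprecated in Hadoop 2.x; the split form it recommends instead is:

    start-dfs.sh
    start-yarn.sh
    # and stop-yarn.sh / stop-dfs.sh to shut down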
  • Check the Java processes with jps

    jps
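    # a healthy pseudo-distributed setup lists NameNode, DataNode, SecondaryNameNode,
    # ResourceManager and NodeManager (plus Jps itself)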

  • Inspect Hadoop

    localhost:50070          # HDFS NameNode web UI
    localhost:8088/cluster   # YARN ResourceManager web UI
    hdfs dfsadmin -report    # HDFS usage report from the command line

  • Pseudo-distributed WordCount test

    hdfs dfs -mkdir -p input
    hdfs dfs -put etc/hadoop input   # uploads the directory, so it lands as input/hadoop
    hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input/hadoop output
    hdfs dfs -cat output/*
    hdfs dfs -rm -r input
    hdfs dfs -rm -r output

Spark Installation and Configuration

  • Download Spark
  • Extract it under /home/zydar/software (putting it under /usr is not recommended)
  • Configure /etc/profile: export SPARK_HOME, and append $SPARK_HOME/bin to PATH
  • Configure the environment

    cp ./conf/spark-env.sh.template ./conf/spark-env.sh
    vim ./conf/spark-env.sh

    Then add to spark-env.sh:

    export JAVA_HOME=/usr/lib/java/jdk1.8.0_101
    export SPARK_MASTER_IP=125.216.238.149
    export SPARK_WORKER_MEMORY=2g
    export HADOOP_CONF_DIR=/home/zydar/software/hadoop-2.7.3/etc/hadoop

  • Start Spark

    ./sbin/start-all.sh
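    # this is Spark's sbin/start-all.sh (run from $SPARK_HOME), not Hadoop's script of the same name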
  • View the Spark cluster at localhost:8080

  • Create a new job (watch it at localhost:4040)

    pyspark --master spark://zydar-HP:7077 --name czh --executor-memory 1G --total-executor-cores 2
    >>> textFile = sc.textFile("file:///home/zydar/software/spark-2.0.0/README.md")
    >>> textFile.count()
    >>> textFile.flatMap(lambda line: line.split(' ')).map(lambda word: (word,1)).reduceByKey(lambda a,b: a+b).map(lambda (a,b): (b,a)).sortByKey(False).map(lambda (a,b): (b,a)).collect()
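    The same word count unfolded step by step, with sortBy replacing the map/sortByKey/map round trip (a sketch to paste into the same pyspark shell):

    >>> words = textFile.flatMap(lambda line: line.split(' '))       # one record per word
    >>> pairs = words.map(lambda word: (word, 1))                    # (word, 1) pairs
    >>> counts = pairs.reduceByKey(lambda a, b: a + b)               # sum the 1s per word
    >>> counts.sortBy(lambda kv: kv[1], ascending=False).take(10)    # ten most frequent words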

Spark IDE Development Environment

  • Configure /etc/profile: export PYTHONPATH

    PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.1-src.zip
    export PYTHONPATH
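    A quick check that a plain interpreter now sees Spark's Python bindings:

    python -c "import pyspark"   # no ImportError means the paths are right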

Spark on PyCharm

  • Download PyCharm (registering with a .edu account gets the Professional edition for free)
  • Extract it under /home/zydar/software (putting it under /usr is not recommended)
  • Run it

    ./bin/pycharm.sh
  • Test code

    from pyspark import SparkContext, SparkConf
    #conf = SparkConf().setAppName("YOURNAME").setMaster("local[*]")  # local-mode alternative
    conf = SparkConf().setAppName("YOURNAME").setMaster("spark://zydar-HP:7077").set("spark.executor.memory", "1g").set("spark.cores.max", "2")
    sc = SparkContext(conf=conf)
    localFile = "file:///home/zydar/software/spark-2.0.0/README.md"
    hdfsFile = "README.md"                # HDFS path relative to /user/zydar
    hdfsFile1 = "/user/zydar/README.md"   # the same file as an absolute HDFS path
    textFile = sc.textFile(localFile)
    print textFile.count()
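    If PyCharm is not started from a shell that sourced /etc/profile, it will not see the PYTHONPATH set above. A common workaround is to wire the paths up at the top of the script (a sketch; the paths assume the layout used throughout this post):

    import os, sys
    # make Spark's Python bindings importable even when the IDE ignores the shell PYTHONPATH
    os.environ.setdefault("SPARK_HOME", "/home/zydar/software/spark-2.0.0")
    sys.path.insert(0, "/home/zydar/software/spark-2.0.0/python")
    sys.path.insert(0, "/home/zydar/software/spark-2.0.0/python/lib/py4j-0.10.1-src.zip")
    from pyspark import SparkContext, SparkConf   # now resolves without /etc/profile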

Official Spark Configuration Documentation

Spark on IPython Notebook

  • IPython Notebook installation and configuration

    apt-get install ipython            # install IPython
    apt-get install ipython-notebook   # install IPython Notebook
    ipython profile create spark       # create a config profile named "spark"

    Note the generated config path: /home/zydar/.ipython/profile_spark/ipython_notebook_config.py

  • Set a password inside IPython

    ipython
    In [1]: from IPython.lib import passwd
    In [2]: passwd()

    Note the returned sha1 hash.

  • Edit ipython_notebook_config.py

    c.NotebookApp.password = u'sha1:67c34dbbc0f8:a96f9c64adbf4c58f2e71026a4bffb747d777c5a'
    c.FileNotebookManager.notebook_dir = u'/home/zydar/software/data/ipythonNotebook'
    # c.NotebookApp.open_browser = False
  • Launch IPython Notebook

    ipython notebook --profile=spark
  • Test code (same as for PyCharm)
