Spark.txt

﻿Spark
官网
http://spark.incubator.apache.org/


1  jupyter中运行spark-submit命令
windows 2019+conda环境，代码单元格运行以下命令
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
!D:\ProgramData\spark-3.2.1-bin-hadoop3.2-scala2.13\bin\spark-submit --conf "spark.pyspark.python=D:\programdata\miniconda3\envs\jupyterpy38\python.exe" D:\ProgramData\spark-3.2.1-bin-hadoop3.2-scala2.13\examples\src\main\python\pi.py 10 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2 在spark上运行Hive
新的推荐用法是 builder.enableHiveSupport()
--------------------------------------------------------------------------------------------------
spark = SparkSession\
    .builder\
    .master('local[*]')\ # 主机地址
    .config("hive.metastore.uris","thrift://127.0.0.1:9083")\ # 元地址,可选
    .appName('AppName')\
    .enableHiveSupport()\ #启用Hive支持
    .getOrCreate()
result = spark.sql("SELECT * FROM xxxxx")
--------------------------------------------------------------------------------------------------

3 WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/D:/ProgramData/spark-3.2.1-bin-hadoop3.2-scala2.13/jars/spark-unsafe_2.13-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
不影响运行。经查询，一般认为与java版本有关，java11 出现这个，java8 正常。


4 初始化环境
已经逐渐被2 的方式代替
-------------------------------------------------------------------------------------------------------
import pyspark
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("App")
sc = SparkContext()
from pyspark.sql import SQLContext
sqlcontext = SQLContext(sc)
--------------------------------------------------------------------------------------------------------

5 github例子
含java/scala/python/R的代码demo
https://github.com/apache/spark/tree/v3.2.1/examples
在下载的安装包中也可以找到

6 若干文档库
总目录          http://spark.incubator.apache.org/docs/latest/       
SQL与DF      http://spark.incubator.apache.org/docs/latest/sql-programming-guide.html    
RDD编程      http://spark.incubator.apache.org/docs/latest/rdd-programming-guide.html    
流  	   http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html       
结构流         http://spark.incubator.apache.org/docs/latest/structured-streaming-programming-guide.html
ml机器学习  http://spark.incubator.apache.org/docs/latest/ml-guide.html
python        http://spark.incubator.apache.org/docs/latest/api/python/getting_started/index.html

7 问题：Py4JJavaError: An error occurred while calling o675.showString. 
解决：但是重新运行笔记本后无异常

8 分析IIS 等日志
https://blog.csdn.net/weixin_34408717/article/details/93452463

9 java应用中提交spark
https://www.cnblogs.com/javalinux/p/15094833.html

10 python 应用中提交pyspark 代码
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
import subprocess

if __name__=='__main__':
    cmd=rf'''"D:\ProgramData\spark-3.2.1-bin-hadoop3.2-scala2.13\bin\spark-submit2.cmd" --conf "spark.pyspark.python=D:\programdata\miniconda3\envs\reportpy38\python.exe" "D:\ProgramData\spark-3.2.1-bin-hadoop3.2-scala2.13\examples\src\main\python\pi.py" 10
'''
    print(cmd)
    result = subprocess.check_output(cmd)
    print(result)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

11 Spark内存资源分配
https://blog.csdn.net/qq_28658621/article/details/122579717

12 Spark JVM参数优化设置及Sparkstreaming优化和反压机制
http://t.zoukankan.com/jiashengmei-p-13746372.html

13 Spark Streaming动态资源分配
https://blog.csdn.net/wangpei1949/article/details/90734304
Spark Streaming资源动态分配和动态控制消费速率
http://t.zoukankan.com/sparkbigdata-p-5544609.html


14 Spark入Hbase的四种方式效率对比
https://www.cnblogs.com/runnerjack/p/10480468.html

15 .NET for Apache Spark
https://docs.microsoft.com/zh-cn/dotnet/spark/what-is-apache-spark-dotnet
https://github.com/dotnet/spark

16 本机启动
启动master
<spark——home>\bin\spark-class org.apache.spark.deploy.master.Master
启动worker
<spark——home>\bin\spark-class org.apache.spark.deploy.worker.Worker spark://{masterUI上显示的 IP}:7077 --cores 2 --memory 2G