Spark Core Classes: SQLContext and DataFrame


http://blog.csdn.net/pipisorry/article/details/53320669

pyspark.sql.SQLContext

Main entry point for DataFrame and SQL functionality.

[pyspark.sql.SQLContext]
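As a minimal sketch of that entry point (the table name people and the toy data are assumptions for illustration), a DataFrame can be registered as a temporary table and queried with plain SQL:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local[2]', appName='sqlcontext-demo')
sql_ctx = SQLContext(sc)

df = sql_ctx.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
df.registerTempTable('people')  # expose the DataFrame to SQL queries
sql_ctx.sql('SELECT name FROM people WHERE age > 3').show()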

皮皮blog



pyspark.sql.DataFrame

A distributed collection of data grouped into named columns.

Spark df vs. pandas df

Operations on a Spark df work basically the same way as on a pandas df [Pandas小記(6)]; for example:
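A quick side-by-side sketch (the toy data and names are assumptions; each pandas line is paired with its rough Spark counterpart):

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local[2]', appName='pandas-vs-spark')
sql_ctx = SQLContext(sc)
pandas_df = pd.DataFrame({'age': [2, 5], 'name': ['Alice', 'Bob']})
spark_df = sql_ctx.createDataFrame(pandas_df)

pandas_df[pandas_df.age > 3]               # pandas: boolean indexing
spark_df.filter(spark_df.age > 3).show()   # Spark: filter()

pandas_df[['name', 'age']]                 # pandas: column selection
spark_df.select('name', 'age').show()      # Spark: select()

pandas_df.groupby('name').size()           # pandas: grouped count
spark_df.groupBy('name').count().show()    # Spark: groupBy().count()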

Converting between the two

From pandas_df:

spark_df = sql_ctx.createDataFrame(pandas_df)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local[8]', appName='kmeans')
sql_ctx = SQLContext(sc)
lldf_df = sql_ctx.createDataFrame(lldf)  # lldf is an existing pandas DataFrame
In addition, createDataFrame can build a spark_df from a list whose elements are tuples or dicts, or directly from an RDD, as sketched below.
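A minimal sketch of those three input forms (the data and names are illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(master='local[2]', appName='createdf-demo')
sql_ctx = SQLContext(sc)

# list of tuples, with the column names passed as the schema
df1 = sql_ctx.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])

# list of dicts; the keys become the column names
df2 = sql_ctx.createDataFrame([{'age': 2, 'name': 'Alice'}, {'age': 5, 'name': 'Bob'}])

# an RDD of Row objects
rdd = sc.parallelize([Row(age=2, name='Alice'), Row(age=5, name='Bob')])
df3 = sql_ctx.createDataFrame(rdd)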

From spark_df:

pandas_df = spark_df.toPandas()

toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

Note that this method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory.

This is only available if Pandas is installed and available.

>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob

[Spark與Pandas中DataFrame對比(詳細)]

Spark df methods

rdd

Returns the content as a pyspark.RDD of Row.
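For example (a sketch assuming the same two-row age/name df the doctests here use):

ages = df.rdd.map(lambda row: row.age).collect()  # Row fields are accessed by name; gives [2, 5]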

rollup(*cols)

Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.

>>> df.rollup("name", df.age).count().orderBy("name", "age").show()
+-----+----+-----+
| name| age|count|
+-----+----+-----+
| null|null|    2|
|Alice|null|    1|
|Alice|   2|    1|
|  Bob|null|    1|
|  Bob|   5|    1|
+-----+----+-----+
select(*cols)

Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is '*', that column is expanded to include all columns in the current DataFrame.
>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
selectExpr(*expr)

Projects a set of SQL expressions and returns a new DataFrame.

This is a variant of select() that accepts SQL expressions.

>>> df.selectExpr("age * 2", "abs(age)").collect()
[Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]
toDF(*cols)

Returns a new DataFrame with the specified column names.

Parameters: cols – list of new column names (string)
>>> df.toDF('f1', 'f2').collect()
[Row(f1=2, f2=u'Alice'), Row(f1=5, f2=u'Bob')]
persist(storageLevel=StorageLevel(False, True, False, False, 1))

Sets the storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified defaults to (MEMORY_ONLY).
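A sketch of passing an explicit level instead (MEMORY_AND_DISK is chosen here purely for illustration, and df is assumed to be the example DataFrame above):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk if it does not fit
df.count()      # the first action materializes and caches the data
df.unpersist()  # release the cached storage when done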

[pyspark.sql.DataFrame]

from: http://blog.csdn.net/pipisorry/article/details/53320669


