spark2_dfanalysis

Dataframe analysis in PySpark

Analysis:

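# Spark's aggregate functions; the F alias avoids shadowing Python's built-in sum, min, and max
from pyspark.sql import functions as F

# price-based columns
price_cols = ['msrp', 'discount', 'tax', 'upgrades']

# one sum() aggregate expression per price column
price_sum_expr = [F.sum(x) for x in price_cols]

# top 5 divisions by total msrp (df is assumed to be an existing DataFrame)
df.groupBy("division")\
.agg(*price_sum_expr)\
.orderBy("sum(msrp)", ascending=False)\
.limit(5).toPandas()

# row count and net_worth summary statistics for up to 5 price ranges
df\
.groupBy("price_range")\
.agg(F.count("price_range"), F.min("net_worth"), F.max("net_worth"), F.mean("net_worth"), F.sum("net_worth"))\
.limit(5)\
.toPandas()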