All code is executed with the following command:
spark-submit filename.py
If this does not work, check the execution requirements below and, if necessary, adjust the path.

Climate Change Code

The following computations can be made with our code (all of them PySpark applications):
  • ClimateChangeContinents: calculates the average temperature change for every continent, for each decade
  • ClimateChangeDecades: calculates the average temperature change for the whole world (averaged over all countries in our dataset) for each decade
  • ClimateChangeEachCountyEachYear: calculates the temperature change for every year, for every country in our dataset
  • ClimateChangeErasmus: calculates the temperature change for every year for our home countries (Germany, Portugal and Spain). This can be adapted to any countries you want by creating a DataFrame per country you wish to include (see the sketch after this list).
  • ClimateChangeWorldYears: calculates the average temperature change of the whole world for each year
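
A minimal sketch of the per-country DataFrame idea behind ClimateChangeErasmus. The CSV file name and the column names ("Area", "Year", "Value") are assumptions for illustration, not the project's exact schema; adjust them to your dataset.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ClimateChangeErasmus").getOrCreate()

    # Assumed file name and schema; replace with the actual dataset.
    df = spark.read.csv("temperature_change.csv", header=True, inferSchema=True)

    # One DataFrame per country of interest; extend this list as needed.
    countries = ["Germany", "Portugal", "Spain"]
    per_country = {c: df.filter(df["Area"] == c) for c in countries}

    for country, country_df in per_country.items():
        country_df.select("Year", "Value").show(5)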
The code operates in the following manner:
  • locate the CSV file
  • preprocess the data if necessary
  • build the PySpark application
  • read the CSV into a DataFrame
  • filter the data for the relevant information
  • save the data (using the pandas library)
This workflow is similar for the scripts of both datasets; a condensed sketch follows below.
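
This sketch follows the steps above for a per-year world average, as in ClimateChangeWorldYears. The file name and the column names ("Year", "Value") are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Build the PySpark application.
    spark = SparkSession.builder.appName("ClimateChangeWorldYears").getOrCreate()

    # Read the CSV into a DataFrame (assumed file name and schema).
    df = spark.read.csv("temperature_change.csv", header=True, inferSchema=True)

    # Filter/aggregate down to the relevant information.
    result = (df.groupBy("Year")
                .agg(F.avg("Value").alias("avg_temp_change"))
                .orderBy("Year"))

    # Save the small aggregated result via the pandas library.
    result.toPandas().to_csv("world_avg_per_year.csv", index=False)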

Why Dataframes and not RDD:

An RDD is a distributed collection of data elements spread across the machines of a cluster, represented as a set of Java or Scala objects. A DataFrame, by contrast, is a distributed collection of data organized into named columns; conceptually it is equivalent to a table in a relational database. The DataFrame API is easy to use and is faster for exploratory analysis and for computing aggregated statistics on large datasets, whereas the RDD API is slower even for simple grouping and aggregation operations. Because DataFrames organize data into named columns, they also allow Spark to manage a schema and optimize queries.
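
To illustrate the difference, here is the same per-country average computed with both APIs, on made-up data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("RddVsDataFrame").getOrCreate()
    rows = [("Germany", 1.2), ("Germany", 0.8), ("Spain", 1.5)]

    # RDD API: manual key-value bookkeeping to average per country.
    rdd = spark.sparkContext.parallelize(rows)
    rdd_avg = (rdd.mapValues(lambda v: (v, 1))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .mapValues(lambda s: s[0] / s[1]))
    print(rdd_avg.collect())

    # DataFrame API: named columns and a declarative aggregation that
    # Spark can optimize, since it knows the schema.
    df = spark.createDataFrame(rows, ["country", "temp_change"])
    df.groupBy("country").agg(F.avg("temp_change")).show()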

Crop Production Code

The following computations can be made with our code:
  • averager_over_items_and_countries: averages over all the yields of all countries given in the corresponding list
  • derivator_window_countries_items: calculates the derivative of the yield of the selected items and countries. It averages over a 5-year window around the respective year (the year itself plus the two years before and after it); the year's value is then subtracted so that the output is the deviation of that year from the window's mean value (see the sketch after this list)
  • most_produced_item: outputs the name of the item with the largest production volume in the respective country
  • ranker_items_and_countries: gives a ranking of production volume based on the average since 2000
  • crop: computes Min, Max, Trend, Recent Decade Trend, Trend since 2000, and Recent Average
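
A hedged sketch of the 5-year window used by derivator_window_countries_items; the file name, the column names ("Area", "Item", "Year", "Value"), and the sign of the difference are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("CropDerivative").getOrCreate()
    df = spark.read.csv("crop_production.csv", header=True, inferSchema=True)

    # Two rows before and after each row, per country and item, ordered
    # by year (assumes one row per year for each country/item pair).
    w = (Window.partitionBy("Area", "Item")
               .orderBy("Year")
               .rowsBetween(-2, 2))

    result = (df.withColumn("window_mean", F.avg("Value").over(w))
                .withColumn("deviation", F.col("Value") - F.col("window_mean")))
    result.select("Area", "Item", "Year", "deviation").show(5)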

Execution requirements

  • Java must be installed
  • an Apache Spark distribution must be downloaded and extracted
  • the path must be updated accordingly

Tools/Software:

  • PySpark
  • pandas, json, matplotlib, re, os
  • Google Cloud
  • GitHub, GitHub Pages