All code is executed with the following command:
spark-submit filename.py
If this does not work, check the execution requirements below and, if necessary, adjust the path.

Climate Change Code

The following computations can be made with our code (all of them PySpark applications):
  • ClimateChangeContinents: calculates the average temperature change for every continent, for each decade
  • ClimateChangeDecades: calculates the average temperature change for the whole world (averaged over all countries in our dataset) for each decade
  • ClimateChangeEachCountyEachYear: calculates the temperature change for every year, for every country in our dataset
  • ClimateChangeErasmus: calculates the temperature change for every year for our home countries (Germany, Portugal and Spain). This can be adapted to any countries you want by creating a DataFrame per country you wish to include (see the sketch after this list).
  • ClimateChangeWorldYears: calculates the average temperature change of the whole world for each year
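
A minimal sketch of the per-country DataFrame idea behind ClimateChangeErasmus. The CSV file name and the column names ("Area", "Year", "Value") are assumptions for illustration, not the project's exact schema; adjust them to your dataset.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ClimateChangeErasmus").getOrCreate()

    # Assumed file name and schema; replace with the actual dataset.
    df = spark.read.csv("temperature_change.csv", header=True, inferSchema=True)

    # One DataFrame per country of interest; extend this list as needed.
    countries = ["Germany", "Portugal", "Spain"]
    per_country = {c: df.filter(df["Area"] == c) for c in countries}

    for country, country_df in per_country.items():
        country_df.select("Year", "Value").show(5)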
The code operates in the following manner:
  • locate the CSV file
  • preprocess the data if necessary
  • build the PySpark application
  • read the CSV into a DataFrame
  • filter the data for the relevant information
  • save the data (using the pandas library)
This workflow is similar for the scripts of both datasets; a condensed sketch follows below.
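
This sketch follows the steps above for a per-year world average, as in ClimateChangeWorldYears. The file name and the column names ("Year", "Value") are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Build the PySpark application.
    spark = SparkSession.builder.appName("ClimateChangeWorldYears").getOrCreate()

    # Read the CSV into a DataFrame (assumed file name and schema).
    df = spark.read.csv("temperature_change.csv", header=True, inferSchema=True)

    # Filter/aggregate down to the relevant information.
    result = (df.groupBy("Year")
                .agg(F.avg("Value").alias("avg_temp_change"))
                .orderBy("Year"))

    # Save the small aggregated result via the pandas library.
    result.toPandas().to_csv("world_avg_per_year.csv", index=False)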

Why Dataframes and not RDD:

An RDD is a distributed collection of data elements spread across the machines of a cluster, represented as a set of Java or Scala objects. A DataFrame, by contrast, is a distributed collection of data organized into named columns; conceptually it is equivalent to a table in a relational database. The DataFrame API is easy to use and is faster for exploratory analysis and for computing aggregated statistics on large datasets, whereas the RDD API is slower even for simple grouping and aggregation operations. Because DataFrames organize data into named columns, they also allow Spark to manage a schema and optimize queries.
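
To illustrate the difference, here is the same per-country average computed with both APIs, on made-up data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("RddVsDataFrame").getOrCreate()
    rows = [("Germany", 1.2), ("Germany", 0.8), ("Spain", 1.5)]

    # RDD API: manual key-value bookkeeping to average per country.
    rdd = spark.sparkContext.parallelize(rows)
    rdd_avg = (rdd.mapValues(lambda v: (v, 1))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .mapValues(lambda s: s[0] / s[1]))
    print(rdd_avg.collect())

    # DataFrame API: named columns and a declarative aggregation that
    # Spark can optimize, since it knows the schema.
    df = spark.createDataFrame(rows, ["country", "temp_change"])
    df.groupBy("country").agg(F.avg("temp_change")).show()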

Crop Production Code

The following computations can be made with our code:
  • averager_over_items_and_countries: averages over all the yields of all countries given in the corresponding list
  • derivator_window_countries_items: calculates the derivative of the yield of the selected items and countries. It averages over a 5-year window around the respective year (the year itself plus the two years before and after it); the year's value is then subtracted so that the output is the deviation of that year from the window's mean value (see the sketch after this list)
  • most_produced_item: outputs the name of the item with the largest production volume in the respective country
  • ranker_items_and_countries: gives a ranking of production volume based on the average since 2000
  • crop: computes Min, Max, Trend, Recent Decade Trend, Trend since 2000, and Recent Average
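
A hedged sketch of the 5-year window used by derivator_window_countries_items; the file name, the column names ("Area", "Item", "Year", "Value"), and the sign of the difference are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("CropDerivative").getOrCreate()
    df = spark.read.csv("crop_production.csv", header=True, inferSchema=True)

    # Two rows before and after each row, per country and item, ordered
    # by year (assumes one row per year for each country/item pair).
    w = (Window.partitionBy("Area", "Item")
               .orderBy("Year")
               .rowsBetween(-2, 2))

    result = (df.withColumn("window_mean", F.avg("Value").over(w))
                .withColumn("deviation", F.col("Value") - F.col("window_mean")))
    result.select("Area", "Item", "Year", "deviation").show(5)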

Execution requirements

  • Java must be installed
  • an Apache Spark distribution must be downloaded and extracted
  • the path must be updated accordingly

Tools/Software:

  • PySpark
  • pandas, json, matplotlib, re, os
  • Google Cloud
  • GitHub, GitHub Pages