Weather for Energy

Objective

To explore the relationship between weather patterns and energy consumption in different countries, identify potential weather drivers of energy consumption and separate into similar groups based on temperature and energy characteristics.

link for databases: Weather Database Energy

library(tidyverse)
library(mice)
library(patchwork)
library(DataExplorer)
library(rworldmap)
library(factoextra) 
library(cluster)
library(mclust)

Read all data

Data = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-CDD18dailybypop-from-2000-to-2022.csv") %>%
  mutate(Unit=NULL)

d2 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-CDD21dailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d3 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Clouddaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d4 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Daylightdaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d5 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-DNIdaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d6 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-GHIdaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d7 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-HDD16dailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d8 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-HDD18dailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d9 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-HeatIndexdailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d10 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Humidexdailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d11 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Precipitationdaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d12 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-RHdaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d13 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturedailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d14 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturedaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d15 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemaxdailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d16 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemaxdaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d17 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemindailybypop-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)
d18 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemindaily-from-2000-to-2022.csv")%>%
  mutate(Unit=NULL)

energy_consumption = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/owid-energy-data.csv") %>%
  filter(year >= 2000 & year <= 2022) %>%
  transmute(iso_code, year, population, gdp, energy_per_capita) %>%
  rename("Country" = "iso_code", "Date" = "year"
         )

Joinning Data

Data = Data %>% 
  left_join(d2)%>% 
  left_join(d3)%>% 
  left_join(d4)%>% 
  left_join(d5)%>% 
  left_join(d6)%>% 
  left_join(d7)%>% 
  left_join(d8)%>% 
  left_join(d9)%>% 
  left_join(d10)%>% 
  left_join(d11)%>% 
  left_join(d12)%>% 
  left_join(d13)%>% 
  left_join(d14)%>% 
  left_join(d15)%>% 
  left_join(d16)%>% 
  left_join(d17)%>% 
  left_join(d18) %>%
  left_join(energy_consumption)

summary(Data)
##    Country           Territory         CDD18dailybypop       Date     
##  Length:5457        Length:5457        Min.   :   0.0   Min.   :2000  
##  Class :character   Class :character   1st Qu.: 535.3   1st Qu.:2005  
##  Mode  :character   Mode  :character   Median :1819.4   Median :2011  
##                                        Mean   :1815.0   Mean   :2011  
##                                        3rd Qu.:3018.7   3rd Qu.:2017  
##                                        Max.   :4175.3   Max.   :2022  
##                                                                       
##  CDD21dailybypop    Clouddaily    Daylightdaily      DNIdaily     
##  Min.   :   0.0   Min.   :11.47   Min.   :726.4   Min.   :101265  
##  1st Qu.: 210.1   1st Qu.:45.49   1st Qu.:727.2   1st Qu.:420456  
##  Median : 978.9   Median :56.34   Median :728.1   Median :533992  
##  Mean   :1119.4   Mean   :55.08   Mean   :729.9   Mean   :514281  
##  3rd Qu.:1944.6   3rd Qu.:67.45   3rd Qu.:731.3   3rd Qu.:636470  
##  Max.   :3081.0   Max.   :90.50   Max.   :752.9   Max.   :800599  
##                                                                   
##     GHIdaily       HDD16dailybypop    HDD18dailybypop   HeatIndexdailybypop
##  Min.   : 282809   Min.   :   0.000   Min.   :   0.00   Min.   :-8.762     
##  1st Qu.: 658444   1st Qu.:   0.000   1st Qu.:   0.00   1st Qu.:13.147     
##  Median : 781075   Median :   8.672   Median :  54.78   Median :22.853     
##  Mean   : 741394   Mean   : 843.462   Mean   :1047.77   Mean   :20.370     
##  3rd Qu.: 860539   3rd Qu.:1557.031   3rd Qu.:2026.15   3rd Qu.:27.510     
##  Max.   :1018783   Max.   :9030.972   Max.   :9760.97   Max.   :32.437     
##                                                                            
##  Humidexdailybypop Precipitationdaily     RHdaily      Temperaturedailybypop
##  Min.   :-12.44    Min.   :0.0009588   Min.   :22.53   Min.   :-8.762       
##  1st Qu.: 13.81    1st Qu.:0.0768054   1st Qu.:65.94   1st Qu.:13.080       
##  Median : 28.05    Median :0.1191295   Median :76.51   Median :22.656       
##  Mean   : 24.56    Mean   :0.1386326   Mean   :70.90   Mean   :19.694       
##  3rd Qu.: 35.17    3rd Qu.:0.1856865   3rd Qu.:80.27   3rd Qu.:26.240       
##  Max.   : 39.52    Max.   :0.5655734   Max.   :87.54   Max.   :29.441       
##                                                                             
##  Temperaturedaily Temperaturemaxdailybypop Temperaturemaxdaily
##  Min.   :-18.51   Min.   :-7.073           Min.   :-15.96     
##  1st Qu.: 12.44   1st Qu.:17.703           1st Qu.: 16.91     
##  Median : 23.71   Median :26.280           Median : 26.84     
##  Mean   : 19.76   Mean   :23.114           Mean   : 23.21     
##  3rd Qu.: 26.30   3rd Qu.:28.456           3rd Qu.: 28.70     
##  Max.   : 29.80   Max.   :36.064           Max.   : 36.43     
##                                                               
##  Temperaturemindailybypop Temperaturemindaily   population       
##  Min.   :-10.579          Min.   :-21.221     Min.   :1.833e+03  
##  1st Qu.:  9.535          1st Qu.:  8.527     1st Qu.:8.496e+05  
##  Median : 19.091          Median : 20.166     Median :5.963e+06  
##  Mean   : 16.610          Mean   : 16.591     Mean   :3.306e+07  
##  3rd Qu.: 23.705          3rd Qu.: 23.909     3rd Qu.:2.133e+07  
##  Max.   : 27.131          Max.   : 27.131     Max.   :1.426e+09  
##                                               NA's   :797        
##       gdp            energy_per_capita
##  Min.   :3.129e+08   Min.   :     0   
##  1st Qu.:2.185e+10   1st Qu.:  3636   
##  Median :7.144e+10   Median : 14909   
##  Mean   :5.244e+11   Mean   : 27753   
##  3rd Qu.:3.306e+11   3rd Qu.: 35945   
##  Max.   :1.815e+13   Max.   :657539   
##  NA's   :2346        NA's   :1067

Variables:

  1. Country: Country name
  2. Date: Year( from 2000 to 2022)
  3. Territory: Country
  4. CDD18dailybypop: Cooling degree days from temperature corrected by humidity (reference temperature 18 °C). Heat Index is used as input temperature.
  5. CDD21dailybypop: Cooling degree days from temperature corrected by humidity (reference temperature 21 °C). Heat Index is used as input temperature.
  6. Clouddaily: Cloud coverage This parameter is the proportion of a grid box covered by cloud. Total cloud cover is a single level field calculated from the cloud occurring at different model levels through the at-mosphere.
  7. Daylightdaily: Minutes of sunlight.
  8. DNIdaily: Amount of direct solar radiation reaching the surface of the earth
  9. GHIdaily: Amount of solar radiation that reaches a horizontal plane at the surface of the earth.
  10. HDD16dailybypop: Heating degree days reference 16°
  11. HDD18dailybypop: Heating degree days reference 18°
  12. HeatIndexdailybypop: Humidity-corrected 2 meter temperature using heat index
  13. Humidexdailybypop: Humidity-corrected 2 meter temperature using humidex
  14. Precipitationdaily: Accumulated liquid and frozen water, comprising the rain and snow that falls to the earth’s surface
  15. RHdaily: Relative humidity based on 2 metres air and dew temperatures
  16. Temperaturedailybypop: 2 meters mean temperature by population
  17. Temperaturedaily: 2 meters mean temperature
  18. Temperaturemaxdailybypop: 2 meters max temperature by population
  19. Temperaturemaxdaily: 2 meters maximum temperature
  20. Temperaturemindailybypop: 2 meters minimun temperature by population
  21. Temperaturemindaily: 2-meters minimum temperature
  22. population: population
  23. gdp: gross domestic product
  24. energy_per_capita: Primary energy consumption per capita, measured in kilowatt-hours

Pre-process data and make descriptive analysis

I will pre-process the data by checking for missing values and scaling the continuous variables because they are in different units. Then, I will perform descriptive analysis to explore the distributions of the variables, and identify potential outliers and trends.

#All variables are continuous except Country and year
#some NAs in population, gdp and energy_per_capita
Data[!complete.cases(Data),]
## # A tibble: 2,353 × 24
##    Country Terri…¹ CDD18…²  Date CDD21…³ Cloud…⁴ Dayli…⁵ DNIda…⁶ GHIda…⁷ HDD16…⁸
##    <chr>   <chr>     <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 AFG     Afghan…   877.   2019 5.32e+2    29.9    730. 635645. 861714.   2467.
##  2 AFG     Afghan…   817.   2020 4.97e+2    28.2    730. 639755. 864693.   2610.
##  3 AFG     Afghan…   899.   2021 5.53e+2    20.5    730. 681513. 891258.   2118.
##  4 AFG     Afghan…   821.   2022 4.83e+2    25.8    730. 653421. 868719.   2194.
##  5 ALA     Aland …     0    2000 0          64.6    742. 262545. 425579.   3175.
##  6 ALA     Aland …    12.1  2001 0          67.1    742. 269857. 437126.   3476.
##  7 ALA     Aland …    88.1  2002 2.53e+0    60.8    742. 311881. 466501.   3451.
##  8 ALA     Aland …    55.0  2003 4.74e+0    63.3    742. 279331. 448222.   3690.
##  9 ALA     Aland …    19.9  2004 3.08e-1    62.8    742. 272947. 439375.   3490.
## 10 ALA     Aland …    13.8  2005 6.24e-2    63.2    742. 279120. 446895.   3404.
## # … with 2,343 more rows, 14 more variables: HDD18dailybypop <dbl>,
## #   HeatIndexdailybypop <dbl>, Humidexdailybypop <dbl>,
## #   Precipitationdaily <dbl>, RHdaily <dbl>, Temperaturedailybypop <dbl>,
## #   Temperaturedaily <dbl>, Temperaturemaxdailybypop <dbl>,
## #   Temperaturemaxdaily <dbl>, Temperaturemindailybypop <dbl>,
## #   Temperaturemindaily <dbl>, population <dbl>, gdp <dbl>,
## #   energy_per_capita <dbl>, and abbreviated variable names ¹​Territory, …
  • All variables are continuous except Country code and year
  • Some NAs in population, gdp and energy_per_capita
#Drop observations that have Nas for gdp
Data = Data %>% drop_na(gdp) 

#Now only have 7 rows that dont have energy consumption
Data[!complete.cases(Data),]
## # A tibble: 7 × 24
##   Country Territ…¹ CDD18…²  Date CDD21…³ Cloud…⁴ Dayli…⁵ DNIda…⁶ GHIda…⁷ HDD16…⁸
##   <chr>   <chr>      <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 MNE     Montene…    211.  2005    68.3    57.8    733. 368784. 589042.   2679.
## 2 SRB     Serbia      544.  2000   267.     48.1    733. 411276. 619510.   2014.
## 3 SRB     Serbia      390.  2001   165.     57.0    733. 346936. 558712.   2223.
## 4 SRB     Serbia      454.  2002   184.     60.0    733. 347985. 563165.   2052.
## 5 SRB     Serbia      605.  2003   283.     54.1    733. 378256. 589525.   2493.
## 6 SRB     Serbia      316.  2004   120.     59.0    733. 345978. 560569.   2259.
## 7 SRB     Serbia      322.  2005   106.     59.4    733. 331688. 551078.   2508.
## # … with 14 more variables: HDD18dailybypop <dbl>, HeatIndexdailybypop <dbl>,
## #   Humidexdailybypop <dbl>, Precipitationdaily <dbl>, RHdaily <dbl>,
## #   Temperaturedailybypop <dbl>, Temperaturedaily <dbl>,
## #   Temperaturemaxdailybypop <dbl>, Temperaturemaxdaily <dbl>,
## #   Temperaturemindailybypop <dbl>, Temperaturemindaily <dbl>,
## #   population <dbl>, gdp <dbl>, energy_per_capita <dbl>, and abbreviated
## #   variable names ¹​Territory, ²​CDD18dailybypop, ³​CDD21dailybypop, …
  • Drop observations that have NAs for gdp and check if it solves the probelm
  • Now only have 7 rows that dont have energy consumption
#Use mice function to fill energy_consumption nas
m = 7 
mice_mod <- mice(Data, m=m, method='rf')
## 
##  iter imp variable
##   1   1  energy_per_capita
##   1   2  energy_per_capita
##   1   3  energy_per_capita
##   1   4  energy_per_capita
##   1   5  energy_per_capita
##   1   6  energy_per_capita
##   1   7  energy_per_capita
##   2   1  energy_per_capita
##   2   2  energy_per_capita
##   2   3  energy_per_capita
##   2   4  energy_per_capita
##   2   5  energy_per_capita
##   2   6  energy_per_capita
##   2   7  energy_per_capita
##   3   1  energy_per_capita
##   3   2  energy_per_capita
##   3   3  energy_per_capita
##   3   4  energy_per_capita
##   3   5  energy_per_capita
##   3   6  energy_per_capita
##   3   7  energy_per_capita
##   4   1  energy_per_capita
##   4   2  energy_per_capita
##   4   3  energy_per_capita
##   4   4  energy_per_capita
##   4   5  energy_per_capita
##   4   6  energy_per_capita
##   4   7  energy_per_capita
##   5   1  energy_per_capita
##   5   2  energy_per_capita
##   5   3  energy_per_capita
##   5   4  energy_per_capita
##   5   5  energy_per_capita
##   5   6  energy_per_capita
##   5   7  energy_per_capita
## Warning: Number of logged events: 37
Data <- complete(mice_mod, action=m)

#Check there are no more Nas
Data[!complete.cases(Data),]
##  [1] Country                  Territory                CDD18dailybypop         
##  [4] Date                     CDD21dailybypop          Clouddaily              
##  [7] Daylightdaily            DNIdaily                 GHIdaily                
## [10] HDD16dailybypop          HDD18dailybypop          HeatIndexdailybypop     
## [13] Humidexdailybypop        Precipitationdaily       RHdaily                 
## [16] Temperaturedailybypop    Temperaturedaily         Temperaturemaxdailybypop
## [19] Temperaturemaxdaily      Temperaturemindailybypop Temperaturemindaily     
## [22] population               gdp                      energy_per_capita       
## <0 rows> (or 0-length row.names)
  • Use mice function to fill energy_consumption NAs
  • No more NAs

Descriptive

Average means per year of data

Descriptive = Data %>%
  group_by(Date) %>%
  summarize(gdp_mean = mean(gdp),
            population_mean = mean(population),
            energy_per_capita_mean = mean(energy_per_capita),
            Temperaturedaily_mean = mean(Temperaturedaily))

Descriptive
## # A tibble: 19 × 5
##     Date      gdp_mean population_mean energy_per_capita_mean Temperaturedaily…¹
##    <dbl>         <dbl>           <dbl>                  <dbl>              <dbl>
##  1  2000 364619842297.       37204756.                 24640.               18.5
##  2  2001 375406699283.       37695584.                 24920.               18.6
##  3  2002 388201610777.       38185328.                 25326.               18.7
##  4  2003 404161378020.       38673940.                 25887.               18.6
##  5  2004 426859578689.       39164515.                 26629.               18.6
##  6  2005 446330040693.       39420511.                 27068.               18.6
##  7  2006 471927778202.       39916124.                 26995.               18.7
##  8  2007 497808950039.       40418992.                 26891.               18.8
##  9  2008 512902197693.       40928364.                 27058.               18.6
## 10  2009 513373221147.       41444486.                 26082.               18.7
## 11  2010 543623972101.       41964309.                 26942.               18.9
## 12  2011 566491838953.       42486395.                 26793.               18.6
## 13  2012 582919117343.       43014605.                 27025.               18.7
## 14  2013 601079013868.       43543489.                 26916.               18.7
## 15  2014 620443517004.       44070149.                 26519.               18.9
## 16  2015 637729707367.       44591596.                 26690.               19.1
## 17  2016 649075621947.       45108692.                 26642.               19.2
## 18  2017 668199710669.       45623519.                 26826.               19.0
## 19  2018 688001628156.       46122774.                 26887.               19.1
## # … with abbreviated variable name ¹​Temperaturedaily_mean

Observe kind of an increasing trend

Plots: Average histograms, boxplots, summarize with means, variance, etc

bxp<-p <- ggplot(data=Descriptive, aes(x=Date, y=gdp_mean)) +
  geom_line()
dp<-p <- ggplot(data=Descriptive, aes(x=Date, y=population_mean)) +
  geom_line()
dp1<-p <- ggplot(data=Descriptive, aes(x=Date, y=energy_per_capita_mean)) +
  geom_line()
dp2<-p <- ggplot(data=Descriptive, aes(x=Date, y=Temperaturedaily_mean)) +
  geom_line()

bxp + dp + dp1 + dp2

We confirm that there is an increasing trend at least for gdp and population Going to apply log to population and Date because highly asymmetrical

Data$gdp = log(Data$gdp)
Data$population = log(Data$population)

More plots restrict to year 2018 and focus analysis on that year for simplicity

Data = Data %>% filter(Date == 2018)

t2 = ggplot(Data, aes(x=Temperaturedaily)) + geom_density(fill = "darkred")
t3 = ggplot(Data, aes(x=energy_per_capita)) + geom_density(fill = "lightblue")
t4 = ggplot(Data, aes(x=population)) + geom_density(fill = "purple")
t5 = ggplot(Data, aes(x=gdp)) + geom_density(fill = "orange")
  
t2 + t3 + t4 + t5

Density functions to observe behavoirs, poulation and gdp look more like a Normal Distribution now. Temperature and energy_per_capita look inversely skewed.

Some boxplots

r2 = ggplot(Data, aes(x=Temperaturedaily)) + geom_boxplot(fill = "darkred")
r3 = ggplot(Data, aes(x=energy_per_capita)) + geom_boxplot(fill = "lightblue")
r4 = ggplot(Data, aes(x=population)) + geom_boxplot(fill = "purple")
r5 = ggplot(Data, aes(x=gdp)) + geom_boxplot(fill = "orange")

r2 + r3 + r4 +r5

Energy per Capita Map for 2018

map = Data %>% 
  dplyr::select(Country, energy_per_capita) 

matched <- joinCountryData2Map(map, joinCode = "ISO3",
                               nameJoinColumn = "Country")
## 164 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 79 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="energy_per_capita",missingCountryCol = "white",
               borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "topo",
               mapTitle = c("Energy per Capita per Country"), lwd=1)

Kind of a similar group based on the location

Apply PCA and interpret:

We will apply PCA to identify the main components of weather variables that are most strongly associated with energy consumption. We will interpret the results by examining the loadings of each component and identifying which variables are most strongly associated with each component.

Want to scale variables because are in different units and going to remove non-numeric

Data2 = Data %>%
  select(-Country, -Territory, -Date)

pca = prcomp(Data2, scale = T)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5    PC6     PC7
## Standard deviation     3.6357 1.7576 1.34237 1.1611 0.76877 0.5555 0.46090
## Proportion of Variance 0.6294 0.1471 0.08581 0.0642 0.02814 0.0147 0.01012
## Cumulative Proportion  0.6294 0.7765 0.86234 0.9265 0.95468 0.9694 0.97949
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.35259 0.33404 0.30948 0.23961 0.14921 0.08920 0.06234
## Proportion of Variance 0.00592 0.00531 0.00456 0.00273 0.00106 0.00038 0.00019
## Cumulative Proportion  0.98541 0.99073 0.99529 0.99802 0.99908 0.99946 0.99964
##                           PC15    PC16    PC17    PC18    PC19     PC20
## Standard deviation     0.06025 0.04732 0.03160 0.02056 0.01197 0.004768
## Proportion of Variance 0.00017 0.00011 0.00005 0.00002 0.00001 0.000000
## Cumulative Proportion  0.99982 0.99992 0.99997 0.99999 1.00000 1.000000
##                            PC21
## Standard deviation     0.001581
## Proportion of Variance 0.000000
## Cumulative Proportion  1.000000

Graph to see it better

fviz_screeplot(pca, addlabels = TRUE)

  • The first component explains 63% of the differences in the world
  • The second component explains 14.7%
  • The third component explains 8.6%

First Component

par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(pca$rotation[,1], las = 2, col = "darkblue")

All columns are relevant except for -Temperaturemindaily

5 others are not highly relevant < |0.1| -Clouddaily -Precipitationdaily -RHdaily -gdp -energy_per_capita

The dataset is a combination of climate related variables and energy consumption so it makes sense that almost all the variables are highly significant.

Top countries in the world taking into account the variables from PCA1:

names=Data$Country
names[order(pca$x[,1])][1:10] 
##  [1] "ISL" "RUS" "CAN" "MNG" "FIN" "NOR" "SWE" "EST" "KGZ" "LVA"
#Top 10 countries taking into account the variables

The PCA1 is composed of temperature variables so kind of observe that these countries have similar cold climate.

Bottom countries in the world taking into account the 11 variables:

names[order(pca$x[,1], decreasing=T)][1:10] #Bottom 10 countries taking into account the variables
##  [1] "NER" "MLI" "TCD" "BFA" "DJI" "MRT" "ARE" "OMN" "QAT" "KHM"

These countries have similar climate with the average.

Second Component

par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(pca$rotation[,2], las=2, col="darkblue")

Now very relevant:

-Clouddaily -Precipitationdaily -RHdaily

Third Component

par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(pca$rotation[,3], las=2, col="darkblue")

Now very relevant:

-gdp -energy_per_capita

WRITE and Interpret map

data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=names)) + geom_point(size=0) +
  labs(title="First two principal components (scores)", x="PC1", y="PC2") + 
  theme_bw() +theme(legend.position="bottom") + geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE) 

We see how based on PC1 and PC2, the most coldest countries are located in the left spectrum and the hottest to the right.

Temperature Map of the World

map = data.frame(country=names, value=-pca$x[,1])
matched <- joinCountryData2Map(map, joinCode = "ISO3",
                               nameJoinColumn = "country")
## 164 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 79 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
               addLegend = FALSE, borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "terrain",
               mapTitle = c("PCA1 by Country"), lwd=1)

Just as we thought depending on where they are located verticial, they have similar temperatures so similar groups.

Clustering

Try 5 centers

set.seed(67365)

k = 5

name.model = kmeans(scale(Data2), centers=k, nstart = 1000 )
name.model
## K-means clustering with 5 clusters of sizes 13, 18, 41, 35, 57
## 
## Cluster means:
##   CDD18dailybypop CDD21dailybypop Clouddaily Daylightdaily   DNIdaily
## 1      -1.2205178      -1.0825838  0.7055554     2.4073103 -1.5459113
## 2       1.4275964       1.6289126 -1.4684171    -0.3711297  1.2851304
## 3      -0.9969246      -0.9256373  0.2666901     0.6328729 -0.8980301
## 4      -0.4006715      -0.4522690 -0.8112450    -0.4213924  0.7232492
## 5       0.7906563       0.6760300  0.6090977    -0.6283115  0.1485967
##     GHIdaily HDD16dailybypop HDD18dailybypop HeatIndexdailybypop
## 1 -1.7094025       2.3414552       2.2585954         -1.83803888
## 2  1.2375451      -0.7005270      -0.7285995          1.11278748
## 3 -0.9657311       0.8709624       0.9164369         -1.00332037
## 4  0.6239380      -0.3481576      -0.2885703         -0.07336208
## 5  0.3105889      -0.7254978      -0.7670333          0.83452876
##   Humidexdailybypop Precipitationdaily    RHdaily Temperaturedailybypop
## 1        -1.7646356         -0.5104081  0.3605698           -1.89539124
## 2         0.7754691         -1.1597577 -1.6713321            1.10124916
## 3        -0.9972097         -0.1403560  0.3215727           -1.00937851
## 4        -0.1585278         -0.4864990 -0.6671507           -0.02172662
## 5         0.9722086          0.8823334  0.6239011            0.82390442
##   Temperaturedaily Temperaturemaxdailybypop Temperaturemaxdaily
## 1       -1.9616070              -1.93481806          -1.9831597
## 2        1.0049511               1.20646624           1.1304714
## 3       -1.0153438              -1.00257733          -1.0140552
## 4        0.1157091               0.07614452           0.2248615
## 5        0.7893167               0.73468166           0.6866439
##   Temperaturemindailybypop Temperaturemindaily  population        gdp
## 1               -1.8204104        -1.893919961 -0.35241109  0.1825424
## 2                0.9147704         0.790495679  0.04208817 -0.0995993
## 3               -0.9811736        -0.978943746  0.12111258  0.4987118
## 4               -0.1103747         0.002585377  0.03191893 -0.1103294
## 5                0.8998369         0.884881450 -0.03963195 -0.3011565
##   energy_per_capita
## 1         0.9748623
## 2         0.5044535
## 3         0.1765114
## 4        -0.2879199
## 5        -0.3318096
## 
## Clustering vector:
##   [1] 4 3 4 5 4 3 4 3 3 2 5 5 1 3 5 5 3 4 5 3 2 5 5 5 1 4 5 2 3 5 5 5 5 3 5 4 3
##  [38] 3 5 3 2 5 5 5 2 5 5 1 4 1 3 5 2 3 3 5 4 5 5 5 5 5 5 3 1 5 5 2 3 4 4 3 5 3
##  [75] 4 1 4 3 2 1 5 1 4 4 5 4 1 3 4 4 5 2 4 2 5 4 1 3 4 5 5 4 4 3 3 5 2 5 1 2 2
## [112] 5 5 3 4 5 4 3 3 5 2 3 3 5 3 1 4 5 5 2 2 3 5 5 5 3 3 4 3 5 4 4 1 3 4 3 5 5
## [149] 5 4 3 4 5 3 2 3 5 3 4 3 5 2 4 4
## 
## Within cluster sum of squares by cluster:
## [1] 112.4115 131.3020 187.4086 212.4822 377.2773
##  (between_SS / total_SS =  70.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We see the means for each variable for each center and how they are selected into the group.

Plot the number of countries in each group

groups = name.model$cluster
barplot(table(groups), col="blue")

Not that unbalanced but less number of countries in 1

Interpretation of Centers 1-5

Center: 1

par(mar = c(10.1, 4.1, 4.1, 2.1))
centers=name.model$centers
barplot(centers[1,], las=2, col="darkblue")

Temperature in this group seems lower than average with a higher energy_per_capita so maybe cold countries with high energy consumption.

Center: 2

par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[2,], las=2, col="darkblue", ylim=c(-3,3))

We see countries that are a little hotter than usual and also with higher energy per capita.

Center: 3

par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[3,], las=2, col="darkblue")

This groupo is really similar to 1, meaning cold countries with a higher gdp than average.

Center: 4

par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[4,], las=2, col="darkblue")

Countries with similar temperature but differences in cloud variable and higher radiation than average.

Center: 5

par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[5,], las=2, col="darkblue")

Clearly hotter countries with less than average gdp and energy per capita so maybe poor countries with a hot climate.

Plotting the countries ina a 2D PCA graph adding colors for groups

fviz_cluster(name.model, data = Data2, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")

We can observe the combination of PCA analysis with the cluster and see the different groups of countries. In the left we see cold countries, In the middle rich with moderate climate countries and so forth.

Check number of groups

fviz_nbclust(scale(Data2), kmeans, method = 'wss', k.max = 20, nstart = 1000)

Looks like with 3 groups it can be captured more of the differences but 5 doesn’t seem that bad.

fviz_nbclust(scale(Data2), kmeans, method = 'silhouette', k.max = 20, nstart = 1000)

Less noise then it will look like it is 2 so maybe 5 is a lot

fviz_nbclust(scale(Data2), kmeans, method = 'gap_stat', k.max = 10, nstart = 100, nboot = 500)

Now the optimal is 8 depending on another method

Based on all these insights I believe the optimal number of clusters is 4. This is because there are two groups highly similar and in the graphs from before 4 looks like a turning point in the optimization analysis. For the final model k = 4 is chosen.

k = 4
fit.km = kmeans(scale(Data2), centers=k, nstart = 1000 )

Map the clustering

map = data.frame(country=names, value=fit.km$cluster)

matched <- joinCountryData2Map(map, joinCode = "ISO3",
                               nameJoinColumn = "country")
## 164 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 79 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
               borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "rainbow",
               mapTitle = c("Clusters"), lwd=1)

We observe that there are 4 groups and are a combination with temperature, GDP and energy consumption groups:

  • The first group of countries are located in the north and are considered coldest countries with kind of high GDPs and energy consumption.
  • The second one are countries located in the middle to lower geographical location and have medium to high GDP and energy consumption
  • The third group are highly humid and hot countries with low GDP and energy consumption
  • Finally, the fourth group are also hot countries located in Northern Africa and Middle east so probably have high energy consumption per capita compared to close less developed countries.

Conclusions about the main objective

Based on the analyses, I will draw conclusions about the main objective. In the first part, a PCA was done and saw how the differences were mainly based on temperature with the most coldest and northern countries having the biggest differences from the rest. After that, a K-means cluster analysis was done. First, 5 groups were chosen but as the analysis progressed and more insight was gained, the optimal number of groups changed to 4. With little groups it can be observed how the countries are separated based on their climate and energy consumption characteristics. The most cold with high energy consumption include northern countries and rich countries like the US, Canada and Northern Europe. The rest are separated in hotter and more humid climates and also taking into account energy consumption per capita. A world map was created at the end to see how the groups are located and related. Finally, I think that if more centers are generated then more differences regarding gdp and energy consumption will be observed and not so much focused on climate.