The goal of this project is to explore the relationship between weather patterns and energy consumption across countries, to identify potential weather drivers of energy consumption, and to separate countries into similar groups based on their temperature and energy characteristics.
Links to the databases: Weather Database (IEA-CMCC Weather for Energy) and Energy (Our World in Data).
library(tidyverse)
library(mice)
library(patchwork)
library(DataExplorer)
library(rworldmap)
library(factoextra)
library(cluster)
library(mclust)
Data = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-CDD18dailybypop-from-2000-to-2022.csv") %>%
mutate(Unit=NULL)
d2 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-CDD21dailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d3 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Clouddaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d4 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Daylightdaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d5 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-DNIdaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d6 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-GHIdaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d7 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-HDD16dailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d8 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-HDD18dailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d9 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-HeatIndexdailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d10 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Humidexdailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d11 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Precipitationdaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d12 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-RHdaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d13 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturedailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d14 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturedaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d15 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemaxdailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d16 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemaxdaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d17 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemindailybypop-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
d18 = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/IEA-CMCC-WeatherForEnergy-Temperaturemindaily-from-2000-to-2022.csv")%>%
mutate(Unit=NULL)
energy_consumption = read_csv("C:/Users/usuario/Desktop/Master/Advanced Modelling/Midterm_project/owid-energy-data.csv") %>%
filter(year >= 2000 & year <= 2022) %>%
transmute(iso_code, year, population, gdp, energy_per_capita) %>%
rename("Country" = "iso_code", "Date" = "year"
)
Data = Data %>%
left_join(d2)%>%
left_join(d3)%>%
left_join(d4)%>%
left_join(d5)%>%
left_join(d6)%>%
left_join(d7)%>%
left_join(d8)%>%
left_join(d9)%>%
left_join(d10)%>%
left_join(d11)%>%
left_join(d12)%>%
left_join(d13)%>%
left_join(d14)%>%
left_join(d15)%>%
left_join(d16)%>%
left_join(d17)%>%
left_join(d18) %>%
left_join(energy_consumption)
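The chain of joins above relies on dplyr's messages about which columns it used as keys. A quick way to confirm the implicit keys (a minimal sketch; the expected shared columns, judging from the summary below, are Country, Territory and Date for the weather tables and Country and Date for the energy table):
#Columns shared between tables, i.e. the implicit join keys (sketch)
intersect(names(d2), names(d3))
intersect(names(d2), names(energy_consumption))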
summary(Data)
## Country Territory CDD18dailybypop Date
## Length:5457 Length:5457 Min. : 0.0 Min. :2000
## Class :character Class :character 1st Qu.: 535.3 1st Qu.:2005
## Mode :character Mode :character Median :1819.4 Median :2011
## Mean :1815.0 Mean :2011
## 3rd Qu.:3018.7 3rd Qu.:2017
## Max. :4175.3 Max. :2022
##
## CDD21dailybypop Clouddaily Daylightdaily DNIdaily
## Min. : 0.0 Min. :11.47 Min. :726.4 Min. :101265
## 1st Qu.: 210.1 1st Qu.:45.49 1st Qu.:727.2 1st Qu.:420456
## Median : 978.9 Median :56.34 Median :728.1 Median :533992
## Mean :1119.4 Mean :55.08 Mean :729.9 Mean :514281
## 3rd Qu.:1944.6 3rd Qu.:67.45 3rd Qu.:731.3 3rd Qu.:636470
## Max. :3081.0 Max. :90.50 Max. :752.9 Max. :800599
##
## GHIdaily HDD16dailybypop HDD18dailybypop HeatIndexdailybypop
## Min. : 282809 Min. : 0.000 Min. : 0.00 Min. :-8.762
## 1st Qu.: 658444 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:13.147
## Median : 781075 Median : 8.672 Median : 54.78 Median :22.853
## Mean : 741394 Mean : 843.462 Mean :1047.77 Mean :20.370
## 3rd Qu.: 860539 3rd Qu.:1557.031 3rd Qu.:2026.15 3rd Qu.:27.510
## Max. :1018783 Max. :9030.972 Max. :9760.97 Max. :32.437
##
## Humidexdailybypop Precipitationdaily RHdaily Temperaturedailybypop
## Min. :-12.44 Min. :0.0009588 Min. :22.53 Min. :-8.762
## 1st Qu.: 13.81 1st Qu.:0.0768054 1st Qu.:65.94 1st Qu.:13.080
## Median : 28.05 Median :0.1191295 Median :76.51 Median :22.656
## Mean : 24.56 Mean :0.1386326 Mean :70.90 Mean :19.694
## 3rd Qu.: 35.17 3rd Qu.:0.1856865 3rd Qu.:80.27 3rd Qu.:26.240
## Max. : 39.52 Max. :0.5655734 Max. :87.54 Max. :29.441
##
## Temperaturedaily Temperaturemaxdailybypop Temperaturemaxdaily
## Min. :-18.51 Min. :-7.073 Min. :-15.96
## 1st Qu.: 12.44 1st Qu.:17.703 1st Qu.: 16.91
## Median : 23.71 Median :26.280 Median : 26.84
## Mean : 19.76 Mean :23.114 Mean : 23.21
## 3rd Qu.: 26.30 3rd Qu.:28.456 3rd Qu.: 28.70
## Max. : 29.80 Max. :36.064 Max. : 36.43
##
## Temperaturemindailybypop Temperaturemindaily population
## Min. :-10.579 Min. :-21.221 Min. :1.833e+03
## 1st Qu.: 9.535 1st Qu.: 8.527 1st Qu.:8.496e+05
## Median : 19.091 Median : 20.166 Median :5.963e+06
## Mean : 16.610 Mean : 16.591 Mean :3.306e+07
## 3rd Qu.: 23.705 3rd Qu.: 23.909 3rd Qu.:2.133e+07
## Max. : 27.131 Max. : 27.131 Max. :1.426e+09
## NA's :797
## gdp energy_per_capita
## Min. :3.129e+08 Min. : 0
## 1st Qu.:2.185e+10 1st Qu.: 3636
## Median :7.144e+10 Median : 14909
## Mean :5.244e+11 Mean : 27753
## 3rd Qu.:3.306e+11 3rd Qu.: 35945
## Max. :1.815e+13 Max. :657539
## NA's :2346 NA's :1067
I will pre-process the data by checking for missing values and scaling the continuous variables because they are in different units. Then, I will perform descriptive analysis to explore the distributions of the variables, and identify potential outliers and trends.
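Before dropping or imputing anything, a per-column count of missing values makes the situation concrete; a minimal sketch (plot_missing() comes from the already-loaded DataExplorer package):
#Count NAs per column and visualize the missingness pattern (sketch)
colSums(is.na(Data))
plot_missing(Data)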
#All variables are continuous except Country and Territory (Date is the year)
#Some NAs in population, gdp and energy_per_capita
Data[!complete.cases(Data),]
## # A tibble: 2,353 × 24
## Country Terri…¹ CDD18…² Date CDD21…³ Cloud…⁴ Dayli…⁵ DNIda…⁶ GHIda…⁷ HDD16…⁸
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AFG Afghan… 877. 2019 5.32e+2 29.9 730. 635645. 861714. 2467.
## 2 AFG Afghan… 817. 2020 4.97e+2 28.2 730. 639755. 864693. 2610.
## 3 AFG Afghan… 899. 2021 5.53e+2 20.5 730. 681513. 891258. 2118.
## 4 AFG Afghan… 821. 2022 4.83e+2 25.8 730. 653421. 868719. 2194.
## 5 ALA Aland … 0 2000 0 64.6 742. 262545. 425579. 3175.
## 6 ALA Aland … 12.1 2001 0 67.1 742. 269857. 437126. 3476.
## 7 ALA Aland … 88.1 2002 2.53e+0 60.8 742. 311881. 466501. 3451.
## 8 ALA Aland … 55.0 2003 4.74e+0 63.3 742. 279331. 448222. 3690.
## 9 ALA Aland … 19.9 2004 3.08e-1 62.8 742. 272947. 439375. 3490.
## 10 ALA Aland … 13.8 2005 6.24e-2 63.2 742. 279120. 446895. 3404.
## # … with 2,343 more rows, 14 more variables: HDD18dailybypop <dbl>,
## # HeatIndexdailybypop <dbl>, Humidexdailybypop <dbl>,
## # Precipitationdaily <dbl>, RHdaily <dbl>, Temperaturedailybypop <dbl>,
## # Temperaturedaily <dbl>, Temperaturemaxdailybypop <dbl>,
## # Temperaturemaxdaily <dbl>, Temperaturemindailybypop <dbl>,
## # Temperaturemindaily <dbl>, population <dbl>, gdp <dbl>,
## # energy_per_capita <dbl>, and abbreviated variable names ¹Territory, …
#Drop observations that have NAs for gdp
Data = Data %>% drop_na(gdp)
#Only 7 rows now lack energy_per_capita
Data[!complete.cases(Data),]
## # A tibble: 7 × 24
## Country Territ…¹ CDD18…² Date CDD21…³ Cloud…⁴ Dayli…⁵ DNIda…⁶ GHIda…⁷ HDD16…⁸
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MNE Montene… 211. 2005 68.3 57.8 733. 368784. 589042. 2679.
## 2 SRB Serbia 544. 2000 267. 48.1 733. 411276. 619510. 2014.
## 3 SRB Serbia 390. 2001 165. 57.0 733. 346936. 558712. 2223.
## 4 SRB Serbia 454. 2002 184. 60.0 733. 347985. 563165. 2052.
## 5 SRB Serbia 605. 2003 283. 54.1 733. 378256. 589525. 2493.
## 6 SRB Serbia 316. 2004 120. 59.0 733. 345978. 560569. 2259.
## 7 SRB Serbia 322. 2005 106. 59.4 733. 331688. 551078. 2508.
## # … with 14 more variables: HDD18dailybypop <dbl>, HeatIndexdailybypop <dbl>,
## # Humidexdailybypop <dbl>, Precipitationdaily <dbl>, RHdaily <dbl>,
## # Temperaturedailybypop <dbl>, Temperaturedaily <dbl>,
## # Temperaturemaxdailybypop <dbl>, Temperaturemaxdaily <dbl>,
## # Temperaturemindailybypop <dbl>, Temperaturemindaily <dbl>,
## # population <dbl>, gdp <dbl>, energy_per_capita <dbl>, and abbreviated
## # variable names ¹Territory, ²CDD18dailybypop, ³CDD21dailybypop, …
#Use mice to impute the remaining energy_per_capita NAs
m = 7
mice_mod <- mice(Data, m=m, method='rf')
##
## iter imp variable
## 1 1 energy_per_capita
## 1 2 energy_per_capita
## 1 3 energy_per_capita
## 1 4 energy_per_capita
## 1 5 energy_per_capita
## 1 6 energy_per_capita
## 1 7 energy_per_capita
## 2 1 energy_per_capita
## 2 2 energy_per_capita
## 2 3 energy_per_capita
## 2 4 energy_per_capita
## 2 5 energy_per_capita
## 2 6 energy_per_capita
## 2 7 energy_per_capita
## 3 1 energy_per_capita
## 3 2 energy_per_capita
## 3 3 energy_per_capita
## 3 4 energy_per_capita
## 3 5 energy_per_capita
## 3 6 energy_per_capita
## 3 7 energy_per_capita
## 4 1 energy_per_capita
## 4 2 energy_per_capita
## 4 3 energy_per_capita
## 4 4 energy_per_capita
## 4 5 energy_per_capita
## 4 6 energy_per_capita
## 4 7 energy_per_capita
## 5 1 energy_per_capita
## 5 2 energy_per_capita
## 5 3 energy_per_capita
## 5 4 energy_per_capita
## 5 5 energy_per_capita
## 5 6 energy_per_capita
## 5 7 energy_per_capita
## Warning: Number of logged events: 37
Data <- complete(mice_mod, action=m)
#Check that there are no more NAs
Data[!complete.cases(Data),]
## [1] Country Territory CDD18dailybypop
## [4] Date CDD21dailybypop Clouddaily
## [7] Daylightdaily DNIdaily GHIdaily
## [10] HDD16dailybypop HDD18dailybypop HeatIndexdailybypop
## [13] Humidexdailybypop Precipitationdaily RHdaily
## [16] Temperaturedailybypop Temperaturedaily Temperaturemaxdailybypop
## [19] Temperaturemaxdaily Temperaturemindailybypop Temperaturemindaily
## [22] population gdp energy_per_capita
## <0 rows> (or 0-length row.names)
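Since only seven values were imputed, it is worth checking that they are plausible relative to the observed distribution; a minimal sketch using mice's built-in diagnostics:
#Inspect the imputed energy_per_capita values and compare them to the observed density (sketch)
mice_mod$imp$energy_per_capita
densityplot(mice_mod, ~energy_per_capita) #red curves are based on only seven points per imputation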
Yearly means of the main variables, averaged across countries:
Descriptive = Data %>%
group_by(Date) %>%
summarize(gdp_mean = mean(gdp),
population_mean = mean(population),
energy_per_capita_mean = mean(energy_per_capita),
Temperaturedaily_mean = mean(Temperaturedaily))
Descriptive
## # A tibble: 19 × 5
## Date gdp_mean population_mean energy_per_capita_mean Temperaturedaily…¹
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000 364619842297. 37204756. 24640. 18.5
## 2 2001 375406699283. 37695584. 24920. 18.6
## 3 2002 388201610777. 38185328. 25326. 18.7
## 4 2003 404161378020. 38673940. 25887. 18.6
## 5 2004 426859578689. 39164515. 26629. 18.6
## 6 2005 446330040693. 39420511. 27068. 18.6
## 7 2006 471927778202. 39916124. 26995. 18.7
## 8 2007 497808950039. 40418992. 26891. 18.8
## 9 2008 512902197693. 40928364. 27058. 18.6
## 10 2009 513373221147. 41444486. 26082. 18.7
## 11 2010 543623972101. 41964309. 26942. 18.9
## 12 2011 566491838953. 42486395. 26793. 18.6
## 13 2012 582919117343. 43014605. 27025. 18.7
## 14 2013 601079013868. 43543489. 26916. 18.7
## 15 2014 620443517004. 44070149. 26519. 18.9
## 16 2015 637729707367. 44591596. 26690. 19.1
## 17 2016 649075621947. 45108692. 26642. 19.2
## 18 2017 668199710669. 45623519. 26826. 19.0
## 19 2018 688001628156. 46122774. 26887. 19.1
## # … with abbreviated variable name ¹Temperaturedaily_mean
We observe a roughly increasing trend in these yearly means.
Plots of the yearly averages over time:
bxp <- ggplot(data=Descriptive, aes(x=Date, y=gdp_mean)) +
  geom_line()
dp <- ggplot(data=Descriptive, aes(x=Date, y=population_mean)) +
  geom_line()
dp1 <- ggplot(data=Descriptive, aes(x=Date, y=energy_per_capita_mean)) +
  geom_line()
dp2 <- ggplot(data=Descriptive, aes(x=Date, y=Temperaturedaily_mean)) +
  geom_line()
bxp + dp + dp1 + dp2
We confirm that there is an increasing trend, at least for gdp and population. Next, a log transform is applied to gdp and population because their distributions are highly asymmetric.
Data$gdp = log(Data$gdp)
Data$population = log(Data$population)
For the remaining plots, the data are restricted to 2018 and the analysis focuses on that year for simplicity.
Data = Data %>% filter(Date == 2018)
t2 = ggplot(Data, aes(x=Temperaturedaily)) + geom_density(fill = "darkred")
t3 = ggplot(Data, aes(x=energy_per_capita)) + geom_density(fill = "lightblue")
t4 = ggplot(Data, aes(x=population)) + geom_density(fill = "purple")
t5 = ggplot(Data, aes(x=gdp)) + geom_density(fill = "orange")
t2 + t3 + t4 + t5
Density plots to observe the distributions: population and gdp now look closer to a normal distribution, while Temperaturedaily and energy_per_capita are skewed in opposite directions (temperature to the left, energy per capita to the right).
Some boxplots of the same variables:
r2 = ggplot(Data, aes(x=Temperaturedaily)) + geom_boxplot(fill = "darkred")
r3 = ggplot(Data, aes(x=energy_per_capita)) + geom_boxplot(fill = "lightblue")
r4 = ggplot(Data, aes(x=population)) + geom_boxplot(fill = "purple")
r5 = ggplot(Data, aes(x=gdp)) + geom_boxplot(fill = "orange")
r2 + r3 + r4 +r5
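The pre-processing plan above mentioned identifying potential outliers; the boxplot whiskers can be turned into an explicit list with base R's boxplot.stats() (a minimal sketch):
#Countries whose 2018 energy_per_capita lies outside the boxplot whiskers (sketch)
out_vals = boxplot.stats(Data$energy_per_capita)$out
Data %>% filter(energy_per_capita %in% out_vals) %>% select(Country, energy_per_capita)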
Energy per Capita Map for 2018
map = Data %>%
dplyr::select(Country, energy_per_capita)
matched <- joinCountryData2Map(map, joinCode = "ISO3",
nameJoinColumn = "Country")
## 164 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 79 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="energy_per_capita",missingCountryCol = "white",
borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "topo",
mapTitle = c("Energy per Capita per Country"), lwd=1)
Countries that are geographically close appear to form roughly similar groups in terms of energy per capita.
We will apply PCA to identify the main components of weather variables that are most strongly associated with energy consumption. We will interpret the results by examining the loadings of each component and identifying which variables are most strongly associated with each component.
The variables are scaled because they are in different units, and the non-numeric columns are removed.
Data2 = Data %>%
select(-Country, -Territory, -Date)
pca = prcomp(Data2, scale = T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.6357 1.7576 1.34237 1.1611 0.76877 0.5555 0.46090
## Proportion of Variance 0.6294 0.1471 0.08581 0.0642 0.02814 0.0147 0.01012
## Cumulative Proportion 0.6294 0.7765 0.86234 0.9265 0.95468 0.9694 0.97949
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.35259 0.33404 0.30948 0.23961 0.14921 0.08920 0.06234
## Proportion of Variance 0.00592 0.00531 0.00456 0.00273 0.00106 0.00038 0.00019
## Cumulative Proportion 0.98541 0.99073 0.99529 0.99802 0.99908 0.99946 0.99964
## PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 0.06025 0.04732 0.03160 0.02056 0.01197 0.004768
## Proportion of Variance 0.00017 0.00011 0.00005 0.00002 0.00001 0.000000
## Cumulative Proportion 0.99982 0.99992 0.99997 0.99999 1.00000 1.000000
## PC21
## Standard deviation 0.001581
## Proportion of Variance 0.000000
## Cumulative Proportion 1.000000
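As a quick numeric complement to the table above, the number of components needed to reach a given share of variance can be computed directly (a sketch); per the summary, the first four components already reach a cumulative proportion of about 0.93:
#Smallest number of components whose cumulative variance share reaches 90% (sketch)
var_share = pca$sdev^2 / sum(pca$sdev^2)
which(cumsum(var_share) >= 0.90)[1]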
A scree plot makes this easier to see:
fviz_screeplot(pca, addlabels = TRUE)
par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(pca$rotation[,1], las = 2, col = "darkblue")
On PC1, the only clearly irrelevant variable is Temperaturemindaily, and five others are not highly relevant (|loading| < 0.1): Clouddaily, Precipitationdaily, RHdaily, gdp and energy_per_capita. The dataset combines climate-related variables with energy consumption, so it makes sense that almost all the variables contribute strongly to the first component.
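To make the |loading| < 0.1 cut-off explicit instead of reading it off the barplot, a small check (sketch):
#Absolute PC1 loadings, sorted, and the variables below the 0.1 threshold (sketch)
round(sort(abs(pca$rotation[, 1])), 3)
names(which(abs(pca$rotation[, 1]) < 0.1))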
Top countries ranked by their scores on PC1:
names=Data$Country
names[order(pca$x[,1])][1:10]
## [1] "ISL" "RUS" "CAN" "MNG" "FIN" "NOR" "SWE" "EST" "KGZ" "LVA"
PC1 is dominated by temperature variables, and indeed these countries share a similarly cold climate.
Bottom countries ranked by their scores on PC1:
names[order(pca$x[,1], decreasing=T)][1:10] #Bottom 10 countries taking into account the variables
## [1] "NER" "MLI" "TCD" "BFA" "DJI" "MRT" "ARE" "OMN" "QAT" "KHM"
These countries also share a similar climate with one another, in this case a hot one.
par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(pca$rotation[,2], las=2, col="darkblue")
For PC2, the most relevant variables are now Clouddaily, Precipitationdaily and RHdaily.
par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(pca$rotation[,3], las=2, col="darkblue")
For PC3, the most relevant variables are gdp and energy_per_capita.
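The same information can be read off a single variable map from factoextra, which shows all loadings on two chosen components at once (a minimal sketch; switching axes to c(2, 3) shows the third component):
#Variable loadings on the first two components, coloured by contribution (sketch)
fviz_pca_var(pca, axes = c(1, 2), col.var = "contrib", repel = TRUE)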
Next, the scores on the first two principal components are plotted and then mapped by country to interpret them geographically.
data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=names)) + geom_point(size=0) +
labs(title="First two principal components (scores)", x="PC1", y="PC2") +
theme_bw() +theme(legend.position="bottom") + geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)
Based on PC1 and PC2, the coldest countries sit on the left of the plot and the hottest on the right.
map = data.frame(country=names, value=-pca$x[,1])
matched <- joinCountryData2Map(map, joinCode = "ISO3",
nameJoinColumn = "country")
## 164 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 79 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
addLegend = FALSE, borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "terrain",
mapTitle = c("PCA1 by Country"), lwd=1)
As expected, countries at similar latitudes have similar temperatures and therefore fall into similar groups.
We now apply k-means clustering, first trying 5 centers.
set.seed(67365)
k = 5
name.model = kmeans(scale(Data2), centers=k, nstart = 1000 )
name.model
## K-means clustering with 5 clusters of sizes 13, 18, 41, 35, 57
##
## Cluster means:
## CDD18dailybypop CDD21dailybypop Clouddaily Daylightdaily DNIdaily
## 1 -1.2205178 -1.0825838 0.7055554 2.4073103 -1.5459113
## 2 1.4275964 1.6289126 -1.4684171 -0.3711297 1.2851304
## 3 -0.9969246 -0.9256373 0.2666901 0.6328729 -0.8980301
## 4 -0.4006715 -0.4522690 -0.8112450 -0.4213924 0.7232492
## 5 0.7906563 0.6760300 0.6090977 -0.6283115 0.1485967
## GHIdaily HDD16dailybypop HDD18dailybypop HeatIndexdailybypop
## 1 -1.7094025 2.3414552 2.2585954 -1.83803888
## 2 1.2375451 -0.7005270 -0.7285995 1.11278748
## 3 -0.9657311 0.8709624 0.9164369 -1.00332037
## 4 0.6239380 -0.3481576 -0.2885703 -0.07336208
## 5 0.3105889 -0.7254978 -0.7670333 0.83452876
## Humidexdailybypop Precipitationdaily RHdaily Temperaturedailybypop
## 1 -1.7646356 -0.5104081 0.3605698 -1.89539124
## 2 0.7754691 -1.1597577 -1.6713321 1.10124916
## 3 -0.9972097 -0.1403560 0.3215727 -1.00937851
## 4 -0.1585278 -0.4864990 -0.6671507 -0.02172662
## 5 0.9722086 0.8823334 0.6239011 0.82390442
## Temperaturedaily Temperaturemaxdailybypop Temperaturemaxdaily
## 1 -1.9616070 -1.93481806 -1.9831597
## 2 1.0049511 1.20646624 1.1304714
## 3 -1.0153438 -1.00257733 -1.0140552
## 4 0.1157091 0.07614452 0.2248615
## 5 0.7893167 0.73468166 0.6866439
## Temperaturemindailybypop Temperaturemindaily population gdp
## 1 -1.8204104 -1.893919961 -0.35241109 0.1825424
## 2 0.9147704 0.790495679 0.04208817 -0.0995993
## 3 -0.9811736 -0.978943746 0.12111258 0.4987118
## 4 -0.1103747 0.002585377 0.03191893 -0.1103294
## 5 0.8998369 0.884881450 -0.03963195 -0.3011565
## energy_per_capita
## 1 0.9748623
## 2 0.5044535
## 3 0.1765114
## 4 -0.2879199
## 5 -0.3318096
##
## Clustering vector:
## [1] 4 3 4 5 4 3 4 3 3 2 5 5 1 3 5 5 3 4 5 3 2 5 5 5 1 4 5 2 3 5 5 5 5 3 5 4 3
## [38] 3 5 3 2 5 5 5 2 5 5 1 4 1 3 5 2 3 3 5 4 5 5 5 5 5 5 3 1 5 5 2 3 4 4 3 5 3
## [75] 4 1 4 3 2 1 5 1 4 4 5 4 1 3 4 4 5 2 4 2 5 4 1 3 4 5 5 4 4 3 3 5 2 5 1 2 2
## [112] 5 5 3 4 5 4 3 3 5 2 3 3 5 3 1 4 5 5 2 2 3 5 5 5 3 3 4 3 5 4 4 1 3 4 3 5 5
## [149] 5 4 3 4 5 3 2 3 5 3 4 3 5 2 4 4
##
## Within cluster sum of squares by cluster:
## [1] 112.4115 131.3020 187.4086 212.4822 377.2773
## (between_SS / total_SS = 70.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The output shows the mean of each (scaled) variable within each cluster and the cluster to which each country is assigned.
Plot the number of countries in each group
groups = name.model$cluster
barplot(table(groups), col="blue")
The groups are not too unbalanced, although cluster 1 contains the fewest countries.
Center: 1
par(mar = c(10.1, 4.1, 4.1, 2.1))
centers=name.model$centers
barplot(centers[1,], las=2, col="darkblue")
Temperatures in this group are well below average while energy_per_capita is above average, suggesting cold countries with high energy consumption.
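One way to check this reading is to list the members of the group directly (a quick sketch):
#Countries assigned to cluster 1 (sketch)
sort(names[groups == 1])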
Center: 2
par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[2,], las=2, col="darkblue", ylim=c(-3,3))
This group contains countries that are somewhat hotter than average and also have above-average energy per capita.
Center: 3
par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[3,], las=2, col="darkblue")
This group is quite similar to cluster 1: cold countries, here with an above-average gdp.
Center: 4
par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[4,], las=2, col="darkblue")
Countries with near-average temperatures but below-average cloud cover and above-average solar radiation.
Center: 5
par(mar = c(10.1, 4.1, 4.1, 2.1))
barplot(centers[5,], las=2, col="darkblue")
Clearly hotter countries with below-average gdp and energy per capita, so possibly poorer countries with a hot climate.
fviz_cluster(name.model, data = Data2, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")
This plot combines the PCA projection with the cluster assignments and shows the different groups of countries: cold countries on the left, richer countries with moderate climates in the middle, and so on.
fviz_nbclust(scale(Data2), kmeans, method = 'wss', k.max = 20, nstart = 1000)
The elbow (within-cluster sum of squares) plot suggests that 3 groups already capture most of the differences, although 5 does not look unreasonable.
fviz_nbclust(scale(Data2), kmeans, method = 'silhouette', k.max = 20, nstart = 1000)
The silhouette criterion points to a smaller number of clusters (around 2), so 5 may be too many.
fviz_nbclust(scale(Data2), kmeans, method = 'gap_stat', k.max = 10, nstart = 100, nboot = 500)
The gap statistic, by contrast, suggests an optimum around 8, so the answer depends on the method.
Based on all these insights, I believe a sensible number of clusters is 4: two of the five groups are very similar, and in the plots above k = 4 looks like a turning point in the optimization criteria. The final model therefore uses k = 4.
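As a cross-check on this choice, the average silhouette width can be computed directly for the candidate values of k (a minimal sketch using the already-loaded cluster package; nstart is reduced to keep it quick):
#Average silhouette width for candidate numbers of clusters (sketch)
set.seed(67365)
dist_scaled = dist(scale(Data2))
sapply(c(2, 3, 4, 5, 8), function(k) {
  km = kmeans(scale(Data2), centers = k, nstart = 100)
  mean(silhouette(km$cluster, dist_scaled)[, "sil_width"])
})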
k = 4
fit.km = kmeans(scale(Data2), centers=k, nstart = 1000 )
map = data.frame(country=names, value=fit.km$cluster)
matched <- joinCountryData2Map(map, joinCode = "ISO3",
nameJoinColumn = "country")
## 164 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 79 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
borderCol = "#C7D9FF",
catMethod = "pretty", colourPalette = "rainbow",
mapTitle = c("Clusters"), lwd=1)
We observe four groups that combine temperature, GDP and energy-consumption characteristics; their full membership is listed in the sketch below.
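For reference, the membership of each of the four final groups can be listed directly (sketch):
#Countries in each of the final four clusters (sketch)
split(names, fit.km$cluster)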
Based on these analyses, I can now draw conclusions about the main objective. In the first part, PCA showed that the differences between countries are driven mainly by temperature, with the coldest, most northern countries differing the most from the rest. A k-means cluster analysis followed: 5 groups were chosen initially, but as more insight was gained the number of groups was reduced to 4. With this small number of groups, countries separate clearly by their climate and energy-consumption characteristics. The cold, high-consumption group includes northern and rich countries such as the US, Canada and Northern Europe, while the remaining groups are distinguished by hotter, more humid climates together with energy consumption per capita. A world map was created at the end to show where the groups are located and how they relate geographically. Finally, I think that with more centers the clusters would start to differentiate more on gdp and energy consumption and less on climate alone.