Prepare data splits for calibration and validation
Author
Lars Caspersen
Aim
This notebook creates two versions of the calibration/validation split of the bloom observations: a “full” split that assigns 75% of the observations to calibration and 25% to validation, and a “scarcity” split that uses only ten observations per cultivar for calibration and the remaining data for validation.
We decided to use three cultivars per location. To keep the experimental design balanced, we included phenology data from a single location for each cultivar, even when observations from several locations were available.
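To make the two split versions concrete, the sketch below computes how many observations end up in calibration and validation for a cultivar with a given number of observations. The helper `split_sizes()` is hypothetical and not part of the notebook; it assumes the calibration share is rounded down, as in the sampling code of this notebook.

```r
# Hypothetical helper (not part of the notebook): sizes of both split versions
# for a cultivar with n_obs bloom observations.
split_sizes <- function(n_obs, share_full = 0.75, n_scarce = 10) {
  # Calibration size of the full split, rounded down
  n_cal_full <- floor(share_full * n_obs)
  data.frame(
    ncal = c("full", "scarce"),
    n_calibration = c(n_cal_full, n_scarce),
    n_validation = c(n_obs - n_cal_full, n_obs - n_scarce)
  )
}

# Example: a cultivar with 20 observations
split_sizes(20)
```

With few observations per cultivar the two versions are close to each other; the longer the record, the more the scarcity split shifts data from calibration to validation.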
Prepare the cherry data
In [1]:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)

# take three cultivars per location
cherry <- read.csv('data/combined_phenological_data_adamedor_clean.csv') %>%
  filter(species == 'Sweet Cherry') %>%
  select(species, cultivar, location, flowering_f50, year) %>%
  mutate(yday = lubridate::mdy(flowering_f50) %>% lubridate::yday()) %>%
  na.omit()

cherry_summary <- cherry %>%
  group_by(cultivar, location) %>%
  summarise(n = n(),
            mean = mean(yday)) %>%
  filter(n >= 20)
`summarise()` has grouped output by 'cultivar'. You can override using the
`.groups` argument.
# also take lapins from zaragoza
# need to take burlat schneiders and regina from klein-altendorf
# rainier, sam, van from zaragoza
cherry_sub <- cherry %>%
  filter(cultivar == 'Burlat' & location == 'Klein-Altendorf' |
           cultivar == 'Regina' & location == 'Klein-Altendorf' |
           cultivar == 'Schneiders' & location == 'Klein-Altendorf' |
           cultivar == 'Rainier' & location == 'Zaragoza' |
           cultivar == 'Van' & location == 'Zaragoza' |
           cultivar == 'Sam' & location == 'Zaragoza')

# sample for full and scarcity split
cherry_master <- data.frame()
share_full <- 0.75
n_scarce <- 10
set.seed(12345667)

for(cult in unique(cherry_sub$cultivar)){
  sub <- cherry_sub %>%
    filter(cultivar == cult)

  # indices of the full calibration set (75% of the observations, rounded down)
  i_cal_full <- sample(1:nrow(sub), size = floor(share_full * nrow(sub)))
  # the scarce calibration set is drawn from the full calibration set
  i_cal_scarce <- sample(i_cal_full, size = n_scarce)

  cherry_master <- cherry_master %>%
    rbind(data.frame(sub[i_cal_full,], split = 'Calibration', ncal = 'full')) %>%
    rbind(data.frame(sub[i_cal_scarce,], split = 'Calibration', ncal = 'scarce')) %>%
    rbind(data.frame(sub[-i_cal_full,], split = 'Validation', ncal = 'full')) %>%
    rbind(data.frame(sub[-i_cal_scarce,], split = 'Validation', ncal = 'scarce'))
}

write.csv(cherry_master, 'data/master_cherry.csv', row.names = FALSE)
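A quick sanity check of the resulting table can catch indexing mistakes early. The sketch below repeats the sampling logic on a small toy data frame (the cultivar name, years, and bloom dates are made up, and the seed is arbitrary), then verifies that each split version accounts for every observation exactly once. It uses base R only, so it runs without the data file.

```r
# Toy stand-in for one cultivar (hypothetical values, not the real observations)
set.seed(1)
sub <- data.frame(cultivar = "Toy",
                  year = 2001:2020,
                  yday = sample(90:120, 20, replace = TRUE))

share_full <- 0.75
n_scarce <- 10

# Same sampling logic as in the cherry cell
i_cal_full <- sample(seq_len(nrow(sub)), size = floor(share_full * nrow(sub)))
i_cal_scarce <- sample(i_cal_full, size = n_scarce)

toy_master <- rbind(
  data.frame(sub[i_cal_full, ], split = "Calibration", ncal = "full"),
  data.frame(sub[i_cal_scarce, ], split = "Calibration", ncal = "scarce"),
  data.frame(sub[-i_cal_full, ], split = "Validation", ncal = "full"),
  data.frame(sub[-i_cal_scarce, ], split = "Validation", ncal = "scarce")
)

# Rows per split version: every row of the table should sum to nrow(sub)
split_counts <- table(toy_master$ncal, toy_master$split)
print(split_counts)

# Within one version, no year may appear in both calibration and validation
overlap <- sapply(c("full", "scarce"), function(v) {
  cal <- toy_master$year[toy_master$ncal == v & toy_master$split == "Calibration"]
  val <- toy_master$year[toy_master$ncal == v & toy_master$split == "Validation"]
  length(intersect(cal, val))
})
print(overlap)
```

The same two checks could be run per cultivar on `cherry_master` after the loop has finished.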
Prepare the apricot data
In [2]:
# take three cultivars per location
apricot <- read.csv('data/combined_phenological_data_adamedor_clean.csv') %>%
  filter(species == 'Apricot') %>%
  select(species, cultivar, location, flowering_f50, year) %>%
  mutate(yday = lubridate::mdy(flowering_f50) %>% lubridate::yday()) %>%
  na.omit()

# R sometimes has trouble with accents, so remove the accent from Bulida
apricot$cultivar <- ifelse(apricot$cultivar == "B\xfalida",
                           yes = 'Bulida',
                           no = apricot$cultivar)

apricot_summary <- apricot %>%
  group_by(cultivar, location) %>%
  summarise(n = n(),
            mean = mean(yday)) %>%
  filter(n >= 20)
`summarise()` has grouped output by 'cultivar'. You can override using the
`.groups` argument.
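The comparison against `"B\xfalida"` matches the raw byte `0xFA`, which is the Latin-1 code for "ú". As an alternative sketch, and assuming the cultivar names really are Latin-1 encoded (the notebook does not state the file encoding), the name can first be converted to UTF-8 with `iconv()` and then compared against a Unicode escape:

```r
# Byte-encoded Latin-1 spelling, as matched in the ifelse() call of the notebook
raw_name <- "B\xfalida"

# Assumption: the bytes are Latin-1; convert them to UTF-8
utf8_name <- iconv(raw_name, from = "latin1", to = "UTF-8")

# Compare against the Unicode escape for "u with acute accent" (U+00FA)
is_bulida <- utf8_name == "B\u00falida"

# Replace the accented name with the plain ASCII spelling
clean_name <- ifelse(is_bulida, "Bulida", utf8_name)
clean_name
```

Whether this is preferable to the byte comparison depends on how the CSV file was written; the byte comparison has the advantage of working without any assumption about the session locale.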
Prepare the almond data
For the almond data I accidentally started with the scarcity split, but the result has the same structure: calibration data that is part of the scarcity split is also present in the calibration data of the “full” split. I decided to keep this order so that the splits remain reproducible.
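One way a scarcity-first draw can still produce nested calibration sets is sketched below. This is an illustration under assumptions, not the notebook's almond code: the number of observations and the seed are made up, and the top-up step is only one possible way to reach the full calibration size.

```r
set.seed(42)        # arbitrary seed for the illustration
n_obs <- 24         # hypothetical number of observations for one cultivar
share_full <- 0.75
n_scarce <- 10

# Step 1: draw the scarce calibration set first
i_cal_scarce <- sample(seq_len(n_obs), size = n_scarce)

# Step 2: top up from the remaining observations until the
# full calibration size is reached
n_full <- floor(share_full * n_obs)
remaining <- setdiff(seq_len(n_obs), i_cal_scarce)
i_cal_full <- c(i_cal_scarce, sample(remaining, size = n_full - n_scarce))

# The scarce calibration set is contained in the full one by construction
all(i_cal_scarce %in% i_cal_full)
```

Because the order of the `sample()` calls determines which random numbers are consumed, swapping the order of the two draws would change the resulting splits even with the same seed, which is why keeping the original order preserves reproducibility.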