Site selection

library(amerifluxr)
library(data.table)
library(pander)

amerifluxr is a programmatic interface to AmeriFlux. This vignette shows how to query a list of target sites based on the sites’ general information and on the availability of metadata and data. A companion vignette, Data import, is available as well.

Get a site list with general info

AmeriFlux data are organized by individual sites, so a data query typically begins with site search and selection. A full list of AmeriFlux sites with general info can be obtained using the amf_site_info() function.

Convert the site list to a data.table for easier manipulation. Also see link for variable definitions.

# get a full list of sites with general info
sites <- amf_site_info()
sites_dt <- data.table::as.data.table(sites)

pander::pandoc.table(sites_dt[c(1:3), ])
Table continues below
SITE_ID SITE_NAME COUNTRY STATE IGBP
AR-Bal Balcarce BA Argentina Buenos Aires CRO
AR-CCa Carlos Casares agriculture Argentina Buenos Aires CRO
AR-CCg Carlos Casares grassland Argentina Buenos Aires GRA
Table continues below
TOWER_BEGAN TOWER_END URL_AMERIFLUX
2012 2013 https://ameriflux.lbl.gov/sites/siteinfo/AR-Bal
2012 NA https://ameriflux.lbl.gov/sites/siteinfo/AR-CCa
2018 NA https://ameriflux.lbl.gov/sites/siteinfo/AR-CCg
Table continues below
LOCATION_LAT LOCATION_LONG LOCATION_ELEV CLIMATE_KOEPPEN MAT MAP
-37.76 -58.3 130 Cfb 14 926
-35.62 -61.32 83 Cfa 16.1 1060
-35.92 -61.19 84 Cfa 16.1 1060
DATA_POLICY DATA_START DATA_END
LEGACY 2012 2013
CCBY4.0 2012 2020
CCBY4.0 2018 2023

The site list provides a quick summary of all registered sites and sites with available data.

It’s often important to understand the data use policy under which the data are shared. In 2021, the AmeriFlux community moved to the AmeriFlux CC-BY-4.0 License. Most site PIs now share their sites’ data under the CC-BY-4.0 license. Data for some sites are shared under the historical AmeriFlux data-sharing policy, now called the AmeriFlux Legacy Data Policy.

Check link for data use policy and attribution guidelines.

# total number of registered sites
pander::pandoc.table(sites_dt[, .N])
677

# total number of sites with available data
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N])
499

# get number of sites with available data, grouped by data use policy
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(DATA_POLICY)])
DATA_POLICY N
LEGACY 70
CCBY4.0 429

Sites can be further grouped by IGBP (International Geosphere-Biosphere Programme) vegetation type.

# get a summary table of sites grouped by IGBP
pander::pandoc.table(sites_dt[, .N, by = "IGBP"])
IGBP N
CRO 139
GRA 91
WET 118
DNF 1
DBF 60
EBF 18
WSA 13
CVM 13
MF 17
ENF 103
OSH 41
BSV 10
WAT 13
CSH 13
URB 18
SAV 8
SNO 1

# get a summary table of sites with available data, & grouped by IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = "IGBP"])
IGBP N
CRO 94
GRA 68
WET 73
DNF 1
WSA 9
EBF 8
ENF 97
DBF 55
MF 13
OSH 32
CSH 12
BSV 5
WAT 9
URB 8
CVM 8
SAV 6
SNO 1

# get a summary table of sites with available data, 
#  & grouped by data use policy & IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(IGBP, DATA_POLICY)][order(IGBP)])
IGBP DATA_POLICY N
BSV CCBY4.0 4
BSV LEGACY 1
CRO LEGACY 12
CRO CCBY4.0 82
CSH LEGACY 5
CSH CCBY4.0 7
CVM CCBY4.0 8
DBF CCBY4.0 50
DBF LEGACY 5
DNF CCBY4.0 1
EBF CCBY4.0 6
EBF LEGACY 2
ENF CCBY4.0 88
ENF LEGACY 9
GRA CCBY4.0 56
GRA LEGACY 12
MF CCBY4.0 10
MF LEGACY 3
OSH CCBY4.0 27
OSH LEGACY 5
SAV CCBY4.0 6
SNO CCBY4.0 1
URB CCBY4.0 6
URB LEGACY 2
WAT CCBY4.0 9
WET LEGACY 12
WET CCBY4.0 61
WSA CCBY4.0 7
WSA LEGACY 2

Once decided, users can query a target site list based on the desired criteria, e.g., IGBP, data availability, data policy, geolocation.


# get a list of cropland and grassland sites with available data,
#  shared under CC-BY-4.0 data policy,
#  located within 30-50 degree N in latitude,
#  return a site list with site ID, name, data starting/ending year
crop_ls <- sites_dt[IGBP %in% c("CRO", "GRA") &
                      !is.na(DATA_START) &
                      LOCATION_LAT > 30 &
                      LOCATION_LAT < 50 &
                      DATA_POLICY == "CCBY4.0",
                    .(SITE_ID, SITE_NAME, DATA_START, DATA_END)]
pander::pandoc.table(crop_ls[c(1:10),])
SITE_ID SITE_NAME DATA_START DATA_END
CA-ER1 Elora Research Station 2015 2021
US-A32 ARM-SGP Medford hay pasture 2015 2017
US-A74 ARM SGP milo field 2015 2017
US-AR1 ARM USDA UNL OSU Woodward Switchgrass 1 2009 2012
US-AR2 ARM USDA UNL OSU Woodward Switchgrass 2 2009 2012
US-ARM ARM Southern Great Plains site- Lamont 2003 2024
US-ARb ARM Southern Great Plains burn site- Lamont 2005 2006
US-ARc ARM Southern Great Plains control site- Lamont 2005 2006
US-BMM Bangtail Mountain Meadow 2016 2019
US-BRG Bayles Road Grassland Tower 2016 2020
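
When several sets of criteria need to be compared, the filter above can be wrapped in a small function. The sketch below is only an illustration: select_sites() is a hypothetical helper, not an amerifluxr function, and it is written in base R against a three-row data frame copied from the site table printed earlier, so it runs without network access. The latitude range is chosen to suit those three Argentinian sites.

```r
# Three rows copied from the site table shown earlier in this vignette
toy_sites <- data.frame(
  SITE_ID      = c("AR-Bal", "AR-CCa", "AR-CCg"),
  IGBP         = c("CRO", "CRO", "GRA"),
  LOCATION_LAT = c(-37.76, -35.62, -35.92),
  DATA_POLICY  = c("LEGACY", "CCBY4.0", "CCBY4.0"),
  DATA_START   = c(2012, 2012, 2018),
  DATA_END     = c(2013, 2020, 2023),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption for illustration, not part of amerifluxr):
# keep sites with available data that match the IGBP classes, the open
# latitude interval, and the data policy
select_sites <- function(sites, igbp, lat_range, policy) {
  keep <- sites$IGBP %in% igbp &
    !is.na(sites$DATA_START) &
    sites$LOCATION_LAT > lat_range[1] &
    sites$LOCATION_LAT < lat_range[2] &
    sites$DATA_POLICY %in% policy
  # which() drops rows where a criterion evaluates to NA
  sites[which(keep), c("SITE_ID", "DATA_START", "DATA_END")]
}

toy_sel <- select_sites(toy_sites,
                        igbp = c("CRO", "GRA"),
                        lat_range = c(-37, -35),
                        policy = "CCBY4.0")
print(toy_sel)
```

Keeping the criteria as function arguments makes it easy to rerun the selection with a different latitude band or policy without retyping the filter.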

Get metadata availability

In some cases, users may want to know if certain types of metadata are available for the selected sites. The amf_list_metadata() function provides a quick summary of metadata availability before actually downloading the data and metadata.

By default, amf_list_metadata() returns a full site list with the available entries (i.e., counts) for all BADM groups. Check AmeriFlux webpage for definitions of all BADM groups.

# get metadata availability for all sites
metadata_aval <- data.table::as.data.table(amf_list_metadata())
pander::pandoc.table(metadata_aval[c(1:3), c(1:10)])
Table continues below
SITE_ID GRP_ACKNOWLEDGEMENT GRP_CLIM_AVG GRP_COUNTRY GRP_DOI
AR-Bal 1 1 1 1
AR-CCa 1 1 1 1
AR-CCg 1 1 1 1
Table continues below
GRP_DOI_CONTRIBUTOR GRP_DOI_ORGANIZATION GRP_DOM_DIST_MGMT
2 2 1
1 2 1
1 2 2
GRP_FLUX_MEASUREMENTS GRP_HEADER
6 1
2 1
2 1

The site_set parameter of the amf_list_metadata() can be used to subset the sites of interest.

metadata_aval_sub <- as.data.table(amf_list_metadata(site_set = crop_ls$SITE_ID))

# down-select cropland & grassland sites by a BADM group of interest,
#  e.g., canopy height (GRP_HEIGHTC)
crop_ls2 <- metadata_aval_sub[GRP_HEIGHTC > 0, .(SITE_ID, GRP_HEIGHTC)][order(-GRP_HEIGHTC)]
pander::pandoc.table(crop_ls2[c(1:10), ])
SITE_ID GRP_HEIGHTC
US-Ne2 196
US-Tw3 162
US-Twt 133
US-Ne3 128
US-Ne1 119
US-Bi1 112
US-IB2 105
US-Var 105
US-Snd 70
US-Bi2 54
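
If more than one BADM group matters, the same idea extends to requiring a minimum number of entries in every group of interest. The following is a minimal base-R sketch under stated assumptions: has_groups() is a hypothetical helper (not part of amerifluxr), and toy_meta copies three rows and two group columns from the metadata availability table printed above.

```r
# Rows and columns copied from the metadata availability output above
toy_meta <- data.frame(
  SITE_ID               = c("AR-Bal", "AR-CCa", "AR-CCg"),
  GRP_DOM_DIST_MGMT     = c(1, 1, 2),
  GRP_FLUX_MEASUREMENTS = c(6, 2, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the IDs of sites that have at least
# `min_count` entries in every one of the requested BADM groups
has_groups <- function(meta, groups, min_count = 1) {
  counts <- as.matrix(meta[, groups, drop = FALSE])
  enough <- rowSums(counts >= min_count, na.rm = TRUE) == length(groups)
  meta$SITE_ID[enough]
}

sites_both <- has_groups(toy_meta,
                         groups = c("GRP_DOM_DIST_MGMT", "GRP_FLUX_MEASUREMENTS"),
                         min_count = 2)
print(sites_both)
```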

Get data availability

Users can use amf_list_data() to query the availability of specific variables in the data (i.e., the flux/met data, known as the BASE data product). amf_list_data() provides a quick summary of variable availability (per site and year) before downloading the data.

By default, amf_list_data() returns a full site list of variable availability (data percentages per year) for all variables. The site_set parameter of amf_list_data() can be used to subset the sites of interest.

# get data availability for selected sites
data_aval <- data.table::as.data.table(amf_list_data(site_set = crop_ls2$SITE_ID))
pander::pandoc.table(data_aval[c(1:10), ])
Table continues below
SITE_ID VARIABLE BASENAME GAP_FILLED Y1990 Y1991 Y1992 Y1993
US-AR1 CO2 CO2 FALSE 0 0 0 0
US-AR1 FC FC FALSE 0 0 0 0
US-AR1 G G FALSE 0 0 0 0
US-AR1 H H FALSE 0 0 0 0
US-AR1 H2O H2O FALSE 0 0 0 0
US-AR1 LE LE FALSE 0 0 0 0
US-AR1 LW_IN LW_IN FALSE 0 0 0 0
US-AR1 LW_OUT LW_OUT FALSE 0 0 0 0
US-AR1 NETRAD NETRAD FALSE 0 0 0 0
US-AR1 P P FALSE 0 0 0 0
Table continues below
Y1994 Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Table continues below
Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012
0 0 0 0 0 0.5905 0.9866 0.9941 0.6654
0 0 0 0 0 0.6082 0.976 0.9886 0.6621
0 0 0 0 0 0.6421 0.9965 0.9999 0.9961
0 0 0 0 0 0.6123 0.9867 0.9938 0.6666
0 0 0 0 0 0.6092 0.971 0.9792 0.6633
0 0 0 0 0 0.6101 0.9816 0.9936 0.6647
0 0 0 0 0 0.6416 0.9965 0.9999 0.9961
0 0 0 0 0 0.6416 0.9965 0.9999 0.9961
0 0 0 0 0 0.5447 0.9964 0.9996 0.996
0 0 0 0 0 0.6422 0.9965 0.9999 0.9961
Table continues below
Y2013 Y2014 Y2015 Y2016 Y2017 Y2018 Y2019 Y2020 Y2021 Y2022
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Y2023 Y2024
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
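
Because the availability table is wide (one Y#### column per year), it can help to reduce it to the first and last year with any data for each variable. The sketch below is an illustration in base R, not an amerifluxr feature: toy_aval copies the 2008-2013 values of two US-AR1 rows from the output above, and the FIRST_YEAR and LAST_YEAR column names are made up for this example.

```r
# Two rows and six year columns copied from the availability output above
toy_aval <- data.frame(
  SITE_ID  = c("US-AR1", "US-AR1"),
  VARIABLE = c("CO2", "G"),
  Y2008    = c(0, 0),
  Y2009    = c(0.5905, 0.6421),
  Y2010    = c(0.9866, 0.9965),
  Y2011    = c(0.9941, 0.9999),
  Y2012    = c(0.6654, 0.9961),
  Y2013    = c(0, 0),
  stringsAsFactors = FALSE
)

# Identify the year columns by name and recover the numeric years
year_cols <- grep("^Y[0-9]{4}$", names(toy_aval), value = TRUE)
years <- as.integer(sub("^Y", "", year_cols))

# TRUE where a variable has any data in a given year
has_data <- as.matrix(toy_aval[, year_cols]) > 0

# First and last year with data per row (NA if a row has no data at all)
toy_aval$FIRST_YEAR <- apply(has_data, 1, function(z) {
  if (any(z)) min(years[z]) else NA_integer_
})
toy_aval$LAST_YEAR <- apply(has_data, 1, function(z) {
  if (any(z)) max(years[z]) else NA_integer_
})

print(toy_aval[, c("SITE_ID", "VARIABLE", "FIRST_YEAR", "LAST_YEAR")])
```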

The variable availability can be used to subset sites that have certain variables in specific years. The BASENAME column indicates the variable’s base name (i.e., ignoring position qualifier), and can be used to get a coarse-level variable availability.

See AmeriFlux website for definitions of base names and qualifiers.
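
To make the relation between VARIABLE and BASENAME concrete, the sketch below strips a trailing position qualifier from a few variable names that appear in this vignette. It assumes the position qualifier is a trailing group of three integer indices (as in WS_1_1_1) and ignores every other kind of qualifier, so it is a simplification; strip_position() is a hypothetical helper, and the BASENAME column returned by amf_list_data() remains the value to rely on.

```r
# Variable names taken from the outputs shown in this vignette
vars <- c("WS_1_1_1", "WS_1_2_1", "USTAR_1_1_1", "WS", "USTAR")

# Hypothetical helper: drop a trailing "_<int>_<int>_<int>" position qualifier.
# Assumption: other qualifiers are not handled here.
strip_position <- function(x) {
  sub("_[0-9]+_[0-9]+_[0-9]+$", "", x)
}

base <- strip_position(vars)

# Number of distinct variables sharing each base name
n_per_base <- vapply(split(vars, base), length, integer(1))
print(n_per_base)
```

Counting variables per base name in this way is the same idea as grouping by BASENAME for a coarse view and by VARIABLE for a fine view.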

# down-select cropland & grassland sites based on the available wind speed (WS) and 
# friction velocity (USTAR) data in 2015-2018, regardless of their qualifiers
data_aval_sub <- data_aval[BASENAME %in% c("WS", "USTAR"),
                           .(SITE_ID, BASENAME, Y2015, Y2016, Y2017, Y2018)]

# calculate mean availability of WS and USTAR in each site and each year
data_aval_sub <- data_aval_sub[, lapply(.SD, mean), 
                               by = .(SITE_ID),
                               .SDcols = c("Y2015", "Y2016", "Y2017", "Y2018")]

# sub-select sites whose mean WS and USTAR availability
#  over 2015-2018 exceeds 75%
crop_ls3 <- data_aval_sub[(Y2015 + Y2016 + Y2017 + Y2018) / 4 > 0.75]
pander::pandoc.table(crop_ls3)
SITE_ID Y2015 Y2016 Y2017 Y2018
US-ARM 0.5772 0.9871 0.9683 0.9826
US-IB2 0.9258 0.9971 0.9815 0.9805
US-Ne1 0.77 0.7861 0.756 0.7167
US-Ne2 0.7636 0.7878 0.7594 0.7442
US-SRG 0.9669 0.9851 0.9775 0.9997
US-Tw3 0.9689 0.9569 0.9763 0.4005
US-Var 0.9983 1 0.9455 1
US-Wkg 0.9973 0.9909 0.9965 0.9848
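
Note that a four-year mean can hide a poor individual year: US-ARM (2015) and US-Tw3 (2018) pass the mean criterion despite one low year. A stricter alternative, sketched below in base R as one possible variant rather than the vignette's method, requires every year to exceed the threshold. The values are copied from the table above; they are already averaged over WS and USTAR, so the check applies to that per-site mean.

```r
# Values copied from the crop_ls3 output above
crop_tbl <- data.frame(
  SITE_ID = c("US-ARM", "US-IB2", "US-Ne1", "US-Ne2",
              "US-SRG", "US-Tw3", "US-Var", "US-Wkg"),
  Y2015 = c(0.5772, 0.9258, 0.77, 0.7636, 0.9669, 0.9689, 0.9983, 0.9973),
  Y2016 = c(0.9871, 0.9971, 0.7861, 0.7878, 0.9851, 0.9569, 1, 0.9909),
  Y2017 = c(0.9683, 0.9815, 0.756, 0.7594, 0.9775, 0.9763, 0.9455, 0.9965),
  Y2018 = c(0.9826, 0.9805, 0.7167, 0.7442, 0.9997, 0.4005, 1, 0.9848),
  stringsAsFactors = FALSE
)

year_cols <- c("Y2015", "Y2016", "Y2017", "Y2018")

# Lowest yearly availability per site
min_aval <- apply(as.matrix(crop_tbl[, year_cols]), 1, min)

# Keep only sites where every year exceeds 75%
strict_ls <- crop_tbl$SITE_ID[min_aval > 0.75]
print(strict_ls)
```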

Finally, users sometimes look for sites with multiple measurements of similar variables (e.g., multilevel wind speed, soil temperature). The VARIABLE column in the variable availability can be used to get a fine-level variable availability.


# down-select cropland & grassland sites by available wind speed (WS) data,
#  mean availability of WS during 2015-2018
data_aval_sub2 <- data_aval[BASENAME %in% c("WS"),
                            .(SITE_ID, VARIABLE, Y2015_2018 = (Y2015 + Y2016 + Y2017 + Y2018) / 4)]

# calculate number of WS variables per site, for sites that 
#  have any WS data during 2015-2018
data_aval_sub2 <- data_aval_sub2[Y2015_2018 > 0, .(.N, Y2015_2018 = mean(Y2015_2018)), .(SITE_ID)]
# keep sites with more than one WS variable
crop_ls4 <- data_aval_sub2[N > 1, ]
pander::pandoc.table(crop_ls4)
SITE_ID N Y2015_2018
US-ARM 3 0.8766
US-Ne1 4 0.7027
US-Ne2 4 0.709
US-Ne3 4 0.7287
US-Wkg 2 0.9942

A companion function amf_plot_datayear() can be used for visualizing the data availability in an interactive figure. However, it is strongly advised to subset the sites, variables, and/or years for faster processing and better visualization.

#### not evaluated, to reduce vignette size
# plot data availability for WS & USTAR
#  for selected sites in 2015-2018
amf_plot_datayear(
  site_set = crop_ls4$SITE_ID,
  var_set = c("WS", "USTAR"),
  nonfilled_only = TRUE,
  year_set = c(2015:2018)
)

Get data summary

In addition, users can use amf_summarize_data() to query the summary statistics of specific variables in the BASE data. amf_summarize_data() provides summary statistics for each variable (e.g., percentiles) before downloading the data.

By default, amf_summarize_data() returns variable summary (selected percentiles) for all variables and sites. The site_set and var_set parameters can be used to subset the sites or variables of interest.

## get data summary for selected sites & variables
data_sum <- amf_summarize_data(site_set = crop_ls3$SITE_ID,
                     var_set = c("WS", "USTAR"))
pander::pandoc.table(data_sum[c(1:10), ])
Table continues below
  SITE_ID VARIABLE BASENAME GAP_FILLED DATA_RECORD
4492 US-ARM WS_1_1_1 WS FALSE 372564
4495 US-ARM USTAR_1_1_1 USTAR FALSE 372564
4548 US-ARM WS_1_2_1 WS FALSE 372564
4551 US-ARM USTAR_1_2_1 USTAR FALSE 372564
4575 US-ARM WS_1_3_1 WS FALSE 372564
4578 US-ARM USTAR_1_3_1 USTAR FALSE 372564
8740 US-IB2 WS WS FALSE 262992
8742 US-IB2 USTAR USTAR FALSE 262992
11626 US-Ne1 USTAR_1_1_1 USTAR FALSE 204504
11698 US-Ne1 WS_1_1_1 WS FALSE 204504
Table continues below
  DATA_MISSING Q01 Q05 Q10 Q15 Q20
4492 25229 0.5072 1.056 1.437 1.741 2.012
4495 24753 0.02914 0.05341 0.07682 0.101 0.1266
4548 33980 0.8224 1.718 2.379 2.882 3.318
4551 32688 0.03162 0.05626 0.07904 0.1014 0.1257
4575 54069 0.9857 2.114 2.986 3.668 4.254
4578 47451 0.0334 0.05851 0.08054 0.1017 0.1243
8740 16813 0.26 0.57 0.86 1.1 1.33
8742 29794 0 0.01 0.04 0.07 0.1
11626 15743 0.024 0.05 0.071 0.092 0.114
11698 141860 0.5 0.91 1.17 1.35 1.51
Table continues below
  Q25 Q30 Q35 Q40 Q45 Q50 Q55
4492 2.265 2.522 2.786 3.063 3.347 3.655 3.981
4495 0.1527 0.1788 0.2045 0.2298 0.2548 0.28 0.3055
4548 3.715 4.096 4.458 4.818 5.185 5.553 5.93
4551 0.1513 0.1783 0.2051 0.2319 0.2591 0.2861 0.3133
4575 4.773 5.259 5.729 6.193 6.648 7.104 7.571
4578 0.1494 0.1772 0.2059 0.2355 0.265 0.2944 0.3246
8740 1.53 1.72 1.91 2.11 2.32 2.54 2.77
8742 0.14 0.17 0.2 0.23 0.26 0.28 0.31
11626 0.137 0.161 0.186 0.21 0.236 0.261 0.287
11698 1.66 1.79 1.93 2.08 2.23 2.41 2.6
Table continues below
  Q60 Q65 Q70 Q75 Q80 Q85 Q90
4492 4.335 4.715 5.132 5.604 6.147 6.812 7.679
4495 0.332 0.3598 0.3901 0.4234 0.4615 0.507 0.5647
4548 6.317 6.731 7.188 7.708 8.329 9.087 10.12
4551 0.3411 0.3702 0.4002 0.4329 0.4702 0.5136 0.5684
4575 8.05 8.538 9.045 9.59 10.19 10.92 11.9
4578 0.3558 0.3878 0.4218 0.4586 0.4997 0.5483 0.6105
8740 3 3.25 3.52 3.81 4.15 4.56 5.1
8742 0.33 0.36 0.39 0.42 0.45 0.49 0.55
11626 0.314 0.341 0.372 0.405 0.443 0.49 0.552
11698 2.82 3.07 3.34 3.65 4 4.44 5.01
  Q95 Q99
4492 8.939 11.27
4495 0.6508 0.8258
4548 11.68 14.45
4551 0.6509 0.8209
4575 13.39 16.29
4578 0.7056 0.9182
8740 5.95 7.74
8742 0.63 0.79
11626 0.645 0.8564
11698 5.89 7.59
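
The summary columns can be combined into simple screening metrics. The sketch below derives the fraction of missing records and the interquartile range (Q75 - Q25) for three wind speed variables, using values copied from the output above. It assumes that DATA_MISSING counts the missing records among the DATA_RECORD records; the MISSING_FRAC and IQR column names are made up for this illustration.

```r
# Three WS rows copied from the data summary output above
toy_sum <- data.frame(
  SITE_ID      = c("US-ARM", "US-IB2", "US-Ne1"),
  VARIABLE     = c("WS_1_1_1", "WS", "WS_1_1_1"),
  DATA_RECORD  = c(372564, 262992, 204504),
  DATA_MISSING = c(25229, 16813, 141860),
  Q25          = c(2.265, 1.53, 1.66),
  Q75          = c(5.604, 3.81, 3.65),
  stringsAsFactors = FALSE
)

# Fraction of records that are missing
# (assumption: DATA_MISSING is a subset count of DATA_RECORD)
toy_sum$MISSING_FRAC <- toy_sum$DATA_MISSING / toy_sum$DATA_RECORD

# Interquartile range from the reported percentiles
toy_sum$IQR <- toy_sum$Q75 - toy_sum$Q25

print(toy_sum[, c("SITE_ID", "VARIABLE", "MISSING_FRAC", "IQR")])
```

A high missing fraction, as for US-Ne1 here, is a cue to check the per-year availability before relying on that variable.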

Alternatively, the companion function amf_plot_datasummary() provides an interactive visualization of the data summary.

#### not evaluated, to reduce vignette size
## plot data summary of USTAR for selected sites
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("USTAR")
)
#### not evaluated, to reduce vignette size
## plot data summary of WS for selected sites, 
#  including clustering information
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("WS"),
  show_cluster = TRUE
)

With a target site list in hand, users can download these sites’ data and metadata using the site IDs. See Data import for data download and import examples.