---
title: "Data Import"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Import}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, warning = FALSE}
library(amerifluxr)
library(pander)
```

amerifluxr is a programmatic interface to the 
[AmeriFlux](https://ameriflux.lbl.gov/). This vignette 
demonstrates examples to import data and metadata downloaded
from AmeriFlux, and to parse and clean data for further use.
A companion vignette for [site selection](site_selection.html)
is available as well.

## Download data

AmeriFlux data and metadata can be downloaded using 
amf_download_base() and amf_download_bif(). Users
will need to create a personal AmeriFlux account 
[here](https://ameriflux-data.lbl.gov/Pages/RequestAccount.aspx)
before download. 

The following downloads AmeriFlux flux/met data (aka BASE 
data product) from a single site: US-CRT.

```{r eval = FALSE}
## When running, replace user_id and user_email with a real AmeriFlux account
floc2 <- amf_download_base(
  user_id = "my_user",
  user_email = "my_email@mail.com",
  site_id = "US-CRT",
  data_product = "BASE-BADM",
  data_policy = "CCBY4.0",
  agree_policy = TRUE,
  intended_use = "other",
  intended_use_text = "amerifluxr package demonstration",
  verbose = TRUE,
  out_dir = tempdir()
)

```

The downloaded file is a zipped file saved in tempdir()
(e.g., AMF\_\{SITE_ID\}\_BASE-BADM\_\{VERSION\}\.zip), 
which contains a BASE data file (e.g., 
AMF\_\{SITE_ID\}\_BASE\_\{RESOLUTION\}\_\{VERSION\}\.csv, 
RESOLUTION = HH (half-hourly) or HR (hourly)) and a 
metadata file (aka BADM data product, e.g., 
AMF\_\{SITE_ID\}\_BIF\_\{VERSION\}\.xlsx). The 
amf_download_base() also returns the file path to the 
downloaded file, which can be used later to read the 
file into R.

The following downloads a single file containing all AmeriFlux sites'
metadata (i.e., BADM data product) for sites under the CC-BY-4.0 data 
use policy. 

```{r eval = FALSE}
## When running, replace user_id and user_email with a real AmeriFlux account
floc1 <- amf_download_bif(
  user_id = "my_user",
  user_email = "my_email@mail.com",
  data_policy = "CCBY4.0",
  agree_policy = TRUE,
  intended_use = "other",
  intended_use_text = "amerifluxr package demonstration",
  out_dir = tempdir(),
  verbose = TRUE,
  site_w_data = TRUE
)

```

The downloaded file is a Excel file saved to tempdir()
(e.g., AMF\_\{SITES\}\_BIF\_\{POLICY\}\_\{VERSION\}\.xlsx, SITES
= AA-Net (all registered sites) or AA-Flx (all sites with flux/met 
data available); POLICY = CCBY4 (shared under AmeriFlux CC-BY-4.0 
data use policy) or LEGACY (shared under AmeriFlux Legacy data use
policy)). Similarly, the amf_download_bif() also returns the file 
path to the downloaded file, which can be used later to read the 
file into R.

For this vignette, we will use following example data files [files
are truncated to limit package size, for demonstration purposes only].

```{r results = 'hide'}
# An example of BASE zipped files downloaded for US-CRT site
floc2 <- system.file("extdata", "AMF_US-CRT_BASE-BADM_2-5.zip", package = "amerifluxr")

# An example of unzipped BASE files from the above zipped file
floc3 <- system.file("extdata", "AMF_US-CRT_BASE_HH_2-5.csv", package = "amerifluxr")

# An example of all sites' BADM data
floc1 <- system.file("extdata", "AMF_AA-Flx_BIF_CCBY4_20201218.xlsx", package = "amerifluxr")

```

# BASE data product
## Import data

The amd_read_base() imports a BASE file, either from a zipped file
or an unzipped comma-separated file (.csv). The parse_timestamp 
parameter can be used if additional time-keeping columns (e.g., 
year, month, day, hour) are desired. 

```{r results = "asis"}
# read the BASE from a zip file, without additional parsed time-keeping columns
base1 <- amf_read_base(
  file = floc2,
  unzip = TRUE,
  parse_timestamp = FALSE
)
pander::pandoc.table(base1[c(1:3),])

# read the BASE from a csv file, with additional parsed time-keeping columns
base2 <- amf_read_base(
  file = floc3,
  unzip = FALSE,
  parse_timestamp = TRUE
)
pander::pandoc.table(base2[c(1:3), c(1:10)])
```

## Parse and interpret data

The details of the BASE data product's format and variable
definitions can be found on 
[AmeriFlux website](https://ameriflux.lbl.gov/data/aboutdata/data-variables/).
In short, the BASE data product contains flux, 
meteorological, and soil observations that are reported 
at regular intervals of time, generally half-hourly or 
hourly, for a certain time period. **TIMESTAMP_START**
and **TIMESTAMP_END** columns (i.e., YYYYMMDDHHMM 12 digits)
denote the starting and ending time of each reporting interval 
(i.e., row). 

All other variables use the format of \{base name\}\_\{qualifier\},
e.g., FC_1, CO2_1_1_1. Base names indicate fundamental quantities
that are either measured or calculated / derived. Qualifiers are
suffixes appended to variable base names that provide additional
information (e.g., gap-filling, position) about the variable. 
In some cases, qualifiers are omitted if only one variable is 
provided for a site.

The amf_variable() retrieves the latest list of base names and
default units. For sites that have relatively fewer variables
and less complicated qualifiers, the users could easily interpret
variables and qualifiers. The amf_variable() also returns the
expected maximal and minimal values based on physically plausible
ranges or network reported values.   

```{r results = "asis"}
# get a list of latest base names and units. 
FP_ls <- amf_variables()
pander::pandoc.table(FP_ls[c(11:20), ])
```

Alternatively, the amf_parse_basename() can programmatically
parse the the variable names into base names and qualifiers.
This function can be helpful for sites with many variables
and relatively complicated qualifiers, as a prerequisite for
handling data from many sites. The function returns a data
frame with information about each variable's base name,
qualifier, and whether a variable is gap-filled, 
layer-aggregated, or replicate aggregated.

```{r results = "asis"}
# parse the variable name
basename_decode <- amf_parse_basename(var_name = colnames(base1))
pander::pandoc.table(basename_decode[c(1, 2, 3, 4, 6, 11, 12),])
```

## Data filtering

While BASE data products are quality-checked before release,
the data may not be filtered for all outliers. The amf_filter_base()
can be use to filter the data based on the expected physically 
ranges (i.e., obtained through amf_variables()). By default, a ±5%
buffer is applied to account for possible edge values near the lower
and upper bounds, which are commonly observed for certain variables
like radiation, relative humidity, and snow depth. 

```{r }
# filter data, using default physical range +/- 5% buffer
base_f <- amf_filter_base(data_in = base1)

```

## Measurement height information

Measurement height information contains height/depth and instrument
model information of the BASE data products. The info can be 
downloaded directly using the amf_var_info() function. The function
returns a data frame for all available sites, and can be subset
using the "Site_ID" column. The "Height" column refers to the
distance from the ground surface in meters. Positive values are
heights, and negative values are depths. See the 
[web page](https://ameriflux.lbl.gov/data/measurement-height/)
for explanation. 

```{r results = "asis"}
# obtain the latest measurement height information
var_info <- amf_var_info()

# subset the variable by target Site ID
var_info <- var_info[var_info$Site_ID == "US-CRT", ]
pander::pandoc.table(var_info[c(1:10), ])
```

# BADM data product
## Import BADM data

Biological, Ancillary, Disturbance, and Metadata (BADM)
are non-continuous information that describe and complement
continuous flux and meteorological data (e.g., BASE data
product). BADM include general site description, metadata
about the sensors and their setup, maintenance and disturbance
events, and biological and ecological data that characterize
a site’s ecosystem. See 
[link](https://ameriflux.lbl.gov/data/badm/badm-basics/)
for details.

The amf_read_bif() can be used to import the BADM data file.
The function returns a data frame for all available sites, 
and can subset using the "SITE_ID" column. 

```{r results = "asis"}
# read the BADM BIF file, using an example data file
bif <- amf_read_bif(file = floc1)

# subset by target Site ID
bif <- bif[bif$SITE_ID == "US-CRT", ]
pander::pandoc.table(bif[c(1:15), ])

# get a list of all BADM variable groups and variables
unique(bif$VARIABLE_GROUP)
length(unique(bif$VARIABLE))

```

As shown above, BADM data contain information from
a variety of variable groups (i.e., GRP\_\{BADM_GROUPS\}).
Browse the definitions of all available variable groups
[here](https://ameriflux.lbl.gov/data/badm/badm-standards/).  

To get the BADM data for a certain variable group, 
use amf_extract_badm() function. The function also renders
the data format (i.e., display all variables by columns)
for human readability.

```{r results = "asis"}
# extract the FLUX_MEASUREMENTS group
bif_flux <- amf_extract_badm(bif_data = bif, select_group = "GRP_FLUX_MEASUREMENTS")
pander::pandoc.table(bif_flux)

# extract the HEIGHTC (canopy height) group
bif_hc <- amf_extract_badm(bif_data = bif, select_group = "GRP_HEIGHTC")
pander::pandoc.table(bif_hc)
```

Note: amf_extract_badm() returns all columns in characters.
Certain groups of BADM variables contain columns of time
stamps (i.e., ISO format) and data values, and need to be
converted before further use.     

```{r fig.width = 7}
# convert HEIGHTC_DATE to POSIXlt
bif_hc$TIMESTAMP <- strptime(bif_hc$HEIGHTC_DATE, format = "%Y%m%d", tz = "GMT")

# convert HEIGHTC column to numeric
bif_hc$HEIGHTC <- as.numeric(bif_hc$HEIGHTC)

# plot time series of canopy height
plot(bif_hc$TIMESTAMP, bif_hc$HEIGHTC, xlab = "TIMESTAMP", ylab = "canopy height (m)")

```

Last, the contacts of the site members and data
DOI can be obtained from the BADM data. The AmeriFlux 
[data policy](https://ameriflux.lbl.gov/data/data-policy/) 
requires proper attribution (e.g., data DOI). 

In some case, for example, using data shared under Legacy
Data Policy for publication, data users are required to
contact data contributors directly, so that they have
the opportunity to contribute substantively and become a
co-author. 

```{r results = "asis"}
# get a list of contacts
bif_contact <- amf_extract_badm(bif_data = bif, select_group = "GRP_TEAM_MEMBER")
pander::pandoc.table(bif_contact)

# get data DOI
bif_doi <- amf_extract_badm(bif_data = bif, select_group = "GRP_DOI")
pander::pandoc.table(bif_doi)
```