3. Working with Data

Working with Data in anusolar

One of the primary reasons this R package is formulated in a non-traditional format is the way it allows you to work with your own data (or even data provided by the ANU).  The "/data/" folder is the place to add any data relevant to your project.

The underlying structure of the data within this code package relies on macro and micro locations.  A macro location groups data at a regional level, most commonly a greater metropolitan area or the city limits.  Each macro is assigned a three-letter identifier made up of the first three consonants of the location name (e.g. Melbourne - MLB, Sydney - SYD, Freiburg - FRB).  Note that CBR for Canberra is an exception, as it predates this data structure.  Micro locations are sub-regions within a macro; they are also given a three-letter identifier and are organised under the macro location in the data directory.
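The naming rule can be sketched in a few lines of R. The helper name macro_id is hypothetical and not part of anusolar. It treats every letter other than a, e, i, o, u as a consonant (so "y" counts), which is an assumption that happens to reproduce the examples given.

```r
# Hypothetical helper (not part of anusolar): build a three-letter
# identifier from the first three consonants of a place name.
macro_id <- function(name) {
  letters_only <- gsub("[^A-Za-z]", "", name)            # drop spaces, punctuation
  consonants   <- gsub("[AEIOUaeiou]", "", letters_only) # drop vowels
  toupper(substr(consonants, 1, 3))
}

macro_id(c("Melbourne", "Sydney", "Freiburg"))
macro_id("Canberra")  # the rule gives "CNB"; the package keeps the older "CBR"
```

Identifiers for real projects are still best assigned by hand, since two places can share the same first three consonants.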

Data are placed in the data/ directory under the root folder (see Initialising), inside a directory named after the macro ID.  Under the macro ID, data are then organised by primary type, in the directory PVO/ or RAD/.

The data required for this package to “do things” include primary data (PV or radiation data), supplementary data (such as meteorological data or atmospheric data), and metadata (data about the location of your observations and properties of the reporting equipment).

Let's start off discussing the metadata...

1. Metadata

The primary data are assumed to be broken down by individual sites, where each stream of data has an individual location with its own unique meta properties.  There should be metadata for each macro (see next section, Data Structure).

The required data (the minimum for functional code) are $lat, $lon and $alt.  For the radiation or PV data we'll need some extra information.


Radiation Site Metadata

Radiation measurement site data is placed into the /data/locations.csv file.  Several of the Bureau of Meteorology radiation stations are already in this file. If you want to add a location, you'll need to add it to this file.

You can create a metadata list object for a radiation site by using the rad.meta function. You can see this done in the Engerer2 separation modelling example for Melbourne, done as follows:

mlb = rad.meta("MLB")

In the mlb metadata list object, you'll find the variables below.  Of particular interest are the $wxid and $radid variables.  These correspond to the Australian Bureau of Meteorology weather and radiation data site numbers, with leading zeros removed.  If you are working with data outside of Australia, use the World Meteorological Site Number. The $radius value is a distance in kilometres that specifies which zone is included in the macro region, by drawing a circle of that radius around the lat/lon values.
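To make the $radius idea concrete, here is a minimal sketch of testing whether a point falls inside the macro circle. It assumes a great-circle (haversine) distance on a spherical Earth of radius 6371 km; whether anusolar computes the distance this way is not stated here. The helper names and the coordinates are illustrative and are not taken from locations.csv.

```r
# Great-circle distance in km between two lat/lon points (haversine formula).
haversine_km <- function(lat1, lon1, lat2, lon2, r_earth = 6371) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r_earth * asin(pmin(1, sqrt(a)))
}

# Hypothetical check: is a site inside the circle of meta$radius km
# drawn around meta$lat / meta$lon?
in_macro <- function(meta, site_lat, site_lon) {
  haversine_km(meta$lat, meta$lon, site_lat, site_lon) <= meta$radius
}

# Illustrative values only
mlb_demo <- list(lat = -37.81, lon = 144.96, radius = 50)
in_macro(mlb_demo, -37.90, 145.10)  # a nearby point
in_macro(mlb_demo, -33.87, 151.21)  # a point several hundred km away
```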

Timezones are also an important thing to list here, including the local time offset from GMT ($lst, $dst).  For more information on $cutz, see this webpage. 

PV Metadata

PV metadata refers to all of the important information required for simulating PV system power output through this R package. The database is a very simple .csv file, which is stored in either:

/data/$macro_id/meta/sites_info.csv [for raw PV data sites]

/data/$macro_id/meta/qc.sites_info.csv [for quality controlled data sites]

I'll explain the $macro_id more below!

These PV metadata objects ("lists" in R) can be loaded using pre-made functions in the R package, for example to load raw data PV sites:

cbr = ar.in("CBR") # here $macro_id = "CBR"

Or for the quality controlled data:

cbr.qc = ar.in("CBR", qc = TRUE) # TRUE rather than T, which can be reassigned

Try the above code out! In these list objects, you'll see the following variables:

[Image: PV metadata variables]

In order to simulate a PV system's power output, you'll require at least the following:

$sid a unique numerical identifier for each site
$nm, $ms, $mp, $mr information about the modules and their place in the array layout
$mm, see module matching below
$ni, $ir inverter details
$im, see inverter matching
$int the reporting interval of the PV data in seconds
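
A quick way to confirm that a site record carries every field in the list above is to compare its names against the required set. This is only a sketch: the helper name is hypothetical, and the values in the example record are placeholders with no physical meaning intended.

```r
# Fields listed above as the minimum for simulating PV output
required_pv_fields <- c("sid", "nm", "ms", "mp", "mr", "mm", "ni", "ir", "im", "int")

# Hypothetical helper: names of any required fields that are absent
missing_pv_fields <- function(meta) {
  setdiff(required_pv_fields, names(meta))
}

# Placeholder record with two fields deliberately left out
site_demo <- list(sid = 1797, nm = 10, ms = 10, mp = 1, mr = 190,
                  mm = 123, ni = 1, ir = 2000)
missing_pv_fields(site_demo)
```

The same check works on a data frame read from sites_info.csv, since names() returns its column names.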

Module Matching:

You'll need to pick the best match (the value of the line number in the database) for your module from within the Sandia Module Database, located in: /code/rpkg/support/SPM/SPMDb.csv.

Inverter Matching:

You'll need to pick the best match (the value of the line number in the database) for your inverter from within the Sandia Inverter Database, located in: /code/rpkg/support/SPM/inv_lib.csv.
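
Finding a candidate match is easier with a text search than by scrolling through the file. The sketch below uses a toy data frame in place of the real databases: the column name "Name" and the rows are invented, so check the actual column headers in SPMDb.csv and inv_lib.csv first. It is also worth confirming whether the number the package expects counts the header line of the file or not.

```r
# Toy stand-in for a component database; the column name "Name" and the
# rows are invented for illustration and may not match the real files.
db_demo <- data.frame(
  Name = c("Acme Solar AS-180", "Acme Solar AS-190", "Example Power EP-250"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row indices whose name matches a search pattern
find_matches <- function(db, pattern, column = "Name") {
  which(grepl(pattern, db[[column]], ignore.case = TRUE))
}

idx <- find_matches(db_demo, "as-19")
db_demo[idx, , drop = FALSE]  # inspect the candidates before picking one
```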

Note: this package does not yet support split array layouts directly. That feature is planned for a future release.

2. Supplementary Data

Two types of supplementary data are required for analyses of the primary data: meteorological data and atmospheric data.  Meteorological data, in the case of Australia, are sourced from the Bureau of Meteorology, but may be provided by you for other locations.  Atmospheric data, which include values like ozone concentration or atmospheric turbidity, should be downloaded from the SoDa webpage. Let's dig into these, starting with...

Supplementary Data: Atmospheric Data

Many of the clear-sky radiation modelling routines require input data about the state of the atmosphere. These variables are:

  1.     Ozone
  2.     Water Vapour
  3.     Aerosol Optical Depth
  4.     Turbidity

These are all sourced from the SoDa database using the URL:

The SoDa database at the link above.  You can extract the data you need for each site using the lat, lon and altitude for your site.

You'll use this portal to extract the atmospheric data by entering the latitude, longitude and altitude of the location you are setting up. Use “Execute SoDa Service” for each of the variables. This only needs to be done once for each macro region (and does not need to be done if it is already set up for you).

These are then saved directly into the macro directory, e.g. data/MLB/lt.csv, data/MLB/wv.csv, etc. The best way to do this is to copy the example files from /data/MLB/ to your new directory, and fill in the values from the portal service.

Supplementary Data: Weather Data

Basic meteorological observations are required for computing several of the included radiation models and for simulating PV system power output. For this purpose, we'll use the /data/WX/$wxid/ directory, where $wxid matches the entry in the locations.csv file (see above) but is padded with leading zeroes to six digits.  In this directory, weather data should be broken into daily files named in the YYYY-MM-DD.rda format. Once they are there, we can load weather data via:

# set the local region ('macro')
mlb = rad.meta("MLB")
# load in some weather data
wx = read.wx(mlb,2013,1,15) #YYYY,MM,DD
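
For reference, the directory and file naming convention described above can be reproduced with sprintf. This is a sketch only: wx_file is a hypothetical helper, not an anusolar function, and the station number is used purely as an illustration.

```r
# Hypothetical helper: path to a daily weather file under data/WX/
wx_file <- function(wxid, year, month, day, root = "data") {
  station <- sprintf("%06d", as.integer(wxid))   # restore leading zeros
  fname   <- sprintf("%04d-%02d-%02d.rda",
                     as.integer(year), as.integer(month), as.integer(day))
  file.path(root, "WX", station, fname)
}

wx_file(86282, 2013, 1, 15)
```

Building the path this way is handy for checking that a file exists (file.exists) before calling the loader.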

Weather variables in the above wx data.frame should obey the following naming format (e.g. $stn, $ltms, etc.):

[Image: weather variable names]

Lastly, you can see all of the weather stations available in Australia in the /data/WX/stations.csv file.

3. Primary Data

There are two primary types of input data to this R package: PV power output data (most commonly from PVOutput.org) and solar radiation data.  These will be referred to as PVO and RAD data respectively.  

Radiation Observation Data [RAD]

The radiation data within this environment is stored within the macro region directory, under /data/$macro_id/RAD/ in either the /qc/ or /raw/ directories.  The files are broken into days, according to YYYY-MM-DD.rda format.  These can be read using the function read.rad:

# load in example radiation data, 15 Jan 2013 in Melbourne
indf = read.rad(mlb,2013,1,15) # macro, year, month, day, resolution (sec)

The above example is from the Engerer2 Melbourne example.  If you look inside the data.frame indf, you'd find:

[Image: radiation variable names]

If you want to add radiation data to your copy of anusolar, place it in the /RAD/raw/ directory.  The 'qc' directory will be for the QCRad function (more on that in future posts), which will quality control the raw radiation data.
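
As a rough sketch of what writing one daily file into RAD/raw/ could look like, the example below saves a small data frame as YYYY-MM-DD.rda and reads it back. The column "ghi", the values and the object name rad_demo are made up. The object name and columns that read.rad expects inside the file are not specified here, so inspect an existing example file first. A temporary directory stands in for the real data/ folder.

```r
# Illustrative only: the columns and the object name "rad_demo" are made up.
rad_demo <- data.frame(
  ltms = as.POSIXct("2013-01-15 12:00:00", tz = "UTC") + 60 * 0:2,
  ghi  = c(950, 955, 948)
)

# Write to <root>/MLB/RAD/raw/2013-01-15.rda (a temp dir stands in for data/)
raw_dir <- file.path(tempdir(), "data", "MLB", "RAD", "raw")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(raw_dir, format(as.Date("2013-01-15"), "%Y-%m-%d.rda"))
save(rad_demo, file = out_file)

# Read it back into a fresh environment to see which objects the file holds
env <- new.env()
loaded <- load(out_file, envir = env)
loaded
```

Loading into a separate environment, as above, is also a safe way to peek inside one of the shipped .rda files without overwriting objects in your workspace.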

PV power output data [PVO]

PV power output data should be stored under the PVO/ directory. The data are organised into six subdirectories, which will be created the first time the setup.new.location() function is run:



The meta/ directory contains the metadata information organised into a csv file named sites_info.csv (see PV metadata above!).  It will also contain the qc.sites_info.csv file for quality controlled data (e.g. the data available in this post).

all_sites/ contains a single .rda file for each day, named in the format YYYY-MM-DD.rda.  All of the sites within the macro location are collated into this file. Each file contains a data frame named all_sites containing the variables $gtms and $ltms, followed by one column per site named after that site's $sid.  For example, for the site with $sid = 1797, $1797 is the variable name for the power output at that site.
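
Because the site columns have purely numeric names, they need slightly different handling in R than ordinary columns. The sketch below builds a made-up data frame in the all_sites layout (the values, the second site ID and the timestamps are invented) and shows two ways to pull out one site's series.

```r
# Made-up daily frame in the all_sites layout: two time columns, then one
# column per site ID. check.names = FALSE keeps the numeric names as-is.
all_sites <- data.frame(
  gtms   = as.POSIXct("2013-01-15 01:00:00", tz = "UTC") + 300 * 0:2,
  ltms   = as.POSIXct("2013-01-15 12:00:00", tz = "UTC") + 300 * 0:2,
  `1797` = c(1.21, 1.25, 1.19),
  `2044` = c(0.80, NA, 0.83),
  check.names = FALSE
)

all_sites$`1797`     # backticks are needed for a numeric column name
all_sites[["1797"]]  # equivalent, and easier when the ID is held in a variable

# IDs of all sites present in the file
site_ids <- setdiff(names(all_sites), c("gtms", "ltms"))
site_ids
```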

qc_sites/ is arranged in the same format as all_sites/, but the data have been pre-processed by the quality control algorithm. The data in these files has a bit more complexity, so see the post on working with the quality controlled data.

assembled/ contains quality controlled, collated data, broken down into specified intervals and combined with multiple other data sources or post-processing fields.  Each interval of assembled data is organised by the number of seconds the output is averaged to, e.g. 3600 is hourly data and 600 is ten-minute data.  Files are output at intervals specified within the function assemble(), but are most commonly produced at daily, monthly and yearly intervals, with file names formatted as YYYY-MM-DD.rda, YYYY-MM.rda and YYYY.rda, respectively.  Data are stored in the parent data frame "adata".
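
To illustrate what averaging to an interval given in seconds means, here is a minimal base R sketch. It is not the assemble() implementation. The helper average_to is hypothetical, and it labels each bin by its start time, which is an assumption about the convention.

```r
# Hypothetical helper: average a series onto fixed intervals of `secs` seconds.
# Each bin is labelled by its start time (an assumption, see lead-in).
average_to <- function(time, value, secs) {
  bin <- as.POSIXct(floor(as.numeric(time) / secs) * secs,
                    origin = "1970-01-01", tz = "UTC")
  out <- aggregate(value, by = list(time = bin), FUN = mean, na.rm = TRUE)
  names(out)[2] <- "value"
  out
}

# Six made-up 5-minute readings starting at 12:00 UTC
t0 <- as.POSIXct("2013-01-15 12:00:00", tz = "UTC")
demo_time  <- t0 + 300 * 0:5
demo_value <- c(1, 2, 3, 4, 5, 6)

average_to(demo_time, demo_value, 600)   # ten-minute means
average_to(demo_time, demo_value, 3600)  # hourly mean
```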

Within this package, the PV data variable names used are:

This concludes the introduction to data section of this online R manual for the anusolar package.  I hope it was helpful. Please use the comments section below to ask questions!

Check out the rest of the R manual here: