Downloading data using ESGF Wget scripts

One of the most powerful features of the Earth System Grid Federation (ESGF) is its ability to generate download scripts for arbitrary query parameters. A single script can download many files from one data node, and the script generator creates several scripts in one request if the desired data are spread over several data nodes. Currently, these scripts are based on the wget command, which is installed by default on most Linux systems and is readily available for other UNIX-like systems. Before downloading the data, the script will prompt the user for their OpenID and password, which will be used to retrieve a short-lifetime digital certificate from the ESGF site where the user registered. This certificate (which is valid for only 72 hours) is passed by Wget to the server holding the data as proof of the user's identity.

ESGF Wget scripts recognize files that have already been downloaded and skip them. If a download is interrupted, simply run the script again in the same directory. It will resume where it stopped, and even a partially downloaded file will be continued.

ESGF Wget scripts can also help you find out whether a new version of the downloaded data is available in ESGF. After the download, keep the script and run it again with the option -u to search for new versions. In this mode the download itself is not repeated. Instead, the download script is generated again and compared with the old one.

Pre-requisites

Before being able to execute a Wget download script, the following pre-requisites must be satisfied:

  • The user needs the following software:
    • A UNIX-like operating system (Linux or Mac OS). Under Windows, Linux may be installed in a virtual machine (recommended). Alternatively, many users run a UNIX emulation under Windows, e.g. Cygwin (not recommended, but possibly easier than a Linux installation).
    • The Wget application (version 1.12 or later), compiled with the OpenSSL libraries. Under Linux, it is usually already installed with one of the base packages. Mac users may have to install Wget first (for details see the ESGF Wget FAQ). Cygwin users have to install the package Web-Wget (run the Cygwin setup executable again to install it).
    • Tools for calculating SHA256 and MD5 checksums. Under Linux and Cygwin, these are usually already installed. Mac users may have to install them first.
    • For using Wget scripts in the default mode, additionally Oracle Java, version 1.7 or newer. OpenJRE is not sufficient. Java is not needed if Wget scripts are run with the option -H (for details see below).
  • The user must be registered with one of the ESGF sites (portals). To register with an ESGF node, simply use a browser to visit the portal's home page and follow the Create Account link.
  • The user must be authorized to access the desired data, see the tutorial "Authorization for ESGF data access".
  • Network port 7512 (TCP) has to be open.
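
The software items in the list above can be checked from a terminal before the first download. The following is a minimal sketch and not part of the ESGF scripts: the helper names check_tool and version_at_least are made up for this example, and the command names sha256sum, md5sum and java are assumptions about how the checksum tools and Java are installed on your system (on a Mac they may be named differently). Finding java on the PATH does not tell you whether it is Oracle Java.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
    fi
}

# Hypothetical helper: succeed if version $1 is at least version $2.
# Only major.minor is compared, which is enough for the 1.12 threshold.
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}
    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -lt "$want_major" ] && return 1
    [ "$have_minor" -ge "$want_minor" ]
}

# Tool names below are assumptions; adjust them to your system.
for tool in wget sha256sum md5sum java; do
    check_tool "$tool"
done

if command -v wget >/dev/null 2>&1; then
    # The first line of "wget --version" contains the version number.
    wget_version=$(wget --version | head -n 1 | grep -Eo '[0-9]+\.[0-9]+' | head -n 1)
    if version_at_least "$wget_version" 1.12; then
        echo "wget $wget_version is recent enough"
    else
        echo "wget $wget_version is older than 1.12"
    fi
fi
```

The sketch only reports what it finds. It does not check the registration, the data authorization or the open network port, which have to be verified separately.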

Step 1: Generate a Wget script

Log in to an ESGF portal, perform a search and add all datasets you want to your DataCart. Then go to your DataCart.

Many CMIP5 datasets contain several hundred files, some even more than a thousand. If you want to download CMIP5 data, narrow your search with the text field (arrow "N"): enter the names of the variables you need, separated by blanks, and press the Apply button. This filter also applies to the files inside a dataset and usually reduces the number of files to download considerably. In contrast, the categories "Variable", "Variable Long Name" and "CF Standard Name" in the search form only influence the selection of datasets, not the selection of files inside a dataset.

Screenshot of DataCart. Arrows point to parts important for script creation.

Figure 1: Utilization of the DataCart for Wget script creation

The DataCart shows several "WGET Script" links: one for every dataset (e.g. arrow "1") and one for all selected datasets (arrow "all"). To select a dataset, click the small square to the left of it. The link next to a dataset creates a separate Wget script for that dataset only. After you click one of these links, your browser's download manager opens a dialog for the script download. In Chrome, the downloaded script appears at the bottom of the browser window. Download the script to your local machine now.

Step 2: Edit the script (optional)

The file name of the downloaded script, wget-##############.sh, begins with wget-, followed by a time stamp (a number) and the extension .sh. The script is a UNIX shell script and may be edited with a text editor. In this way, you may shorten the list of download files, e.g. if you do not need data for all available periods. Do not change other parts of the script.

Step 3: Run the script

Open a terminal window. Mac users can find a terminal icon in the Launchpad. If Oracle Java is available, run the script in the default mode:

bash wget-##############.sh

Otherwise, run the script with option -H to avoid the use of Java and locally downloaded certificates:

bash wget-##############.sh -H

The bash command in front of the script name ensures that the script runs in the correct shell. The script will ask you for your OpenID and password. This prompt is skipped only in default mode when you run several downloads from the same ESGF data node. In that case, a locally stored credential is used for authentication instead.

Alternative for step 1: Create a wget script using a special URL

Wget scripts can also be generated with the ESGF Search RESTful API. The API can be called from a script or by simply typing a URL with appended commands, which are interpreted by an ESGF index node (portal). For example, the following URL generates a Wget script that matches all CMIP5 files in ESGF, across all sites:

 http://esgf-data.dkrz.de/esg-search/wget?project=CMIP5

However, this script will contain download links for only the first 1000 files, which is the current default limit for the number of download files. CMIP5 has many more. To generate a useful script, more selection commands are needed. For example,

 http://esgf-data.dkrz.de/esg-search/wget?project=CMIP5&experiment=decadal2000&variable=tas

will generate a script for download of all surface temperature files for experiment decadal2000 across all CMIP5 models.

Blanks in category names (facet names) as you may know them from the CoG user interface, for example in "Time Frequency", have to be replaced by underscores, and the name is written in lower case:

 http://esgf-data.dkrz.de/esg-search/wget?project=CMIP5&experiment=decadal2000&variable=tas&time_frequency=day

Selection commands are separated by ampersands and combined with a logical AND, except for commands that specify the same category. For example, in

 http://esgf-data.dkrz.de/esg-search/wget?experiment=decadal2000&variable=tas&variable=tasmax

the category variable is used twice. These two selection commands are combined with a logical OR, that is:

experiment=decadal2000 AND (variable=tas OR variable=tasmax)

A script will be generated for download of all decadal2000 files containing the variables tas or tasmax, i.e. both variables will be downloaded in one script run.
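
The way selection commands are combined into a URL can be sketched in a few lines of shell. This is an illustration only: the function name build_wget_url is made up for this example, and the base URL is the one used in the examples above. Each argument is one selection command. Repeating a category, as with variable here, yields the logical OR described above.

```shell
#!/bin/sh
# Base URL taken from the examples above.
ESGF_WGET_BASE="http://esgf-data.dkrz.de/esg-search/wget"

# Hypothetical helper: join selection commands (facet=value) into one URL.
# The first command is attached with "?", all following ones with "&".
build_wget_url() {
    url="$ESGF_WGET_BASE"
    sep="?"
    for constraint in "$@"; do
        url="${url}${sep}${constraint}"
        sep="&"
    done
    printf '%s\n' "$url"
}

wget_url=$(build_wget_url experiment=decadal2000 variable=tas variable=tasmax)
echo "$wget_url"

# To fetch the generated script from a terminal (needs network access;
# the output file name is an arbitrary choice for this example):
#   wget -O wget-decadal2000.sh "$wget_url"
```

Keep the URL in double quotes when you pass it to a command in a terminal. An unquoted ampersand is interpreted by the shell and would cut the URL off after the first selection command.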

Use as many selection commands as are useful in your case to reduce the number of download files. For some power users, a thousand files in one script run may not be sufficient. They can use the limit command to raise the limit for the number of download files, e.g.:

 http://esgf-data.dkrz.de/esg-search/wget?experiment=decadal2000&variable=tas&limit=2000

This additional command would enable the example URLs above (except the first) to create a script with a complete file list. Please note that a limit of more than 10000 files will generally not be accepted.

Another useful feature for users who need many data files is the command download_structure, which preserves a directory structure. It defines the directory tree that is created on the user's local machine. To copy the files into the directory tree that ESGF itself uses for CMIP5 data, use the following command:

 download_structure=project,product,institute,model,experiment,time_frequency,realm,cmor_table,ensemble,variable

The corresponding command for CORDEX is:

 download_structure=project,product,domain,institute,driving_model,experiment,ensemble,rcm_name,rcm_version,time_frequency,variable

Finally, an example of a complete URL that preserves the CMIP5 directory tree:

 http://esgf-data.dkrz.de/esg-search/wget?experiment=decadal2000&variable=tas&limit=2000&download_structure=project,product,institute,model,experiment,time_frequency,realm,cmor_table,ensemble,variable
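
Long URLs like the one above are easy to mistype. As a minimal sketch, the download_structure value can be assembled from a readable list of facet names. The variable names and the output file name in the comments are arbitrary choices for this example, while the facet order, the selection commands and the limit are copied from the URL above.

```shell
#!/bin/sh
# Facet order for the CMIP5 directory tree, copied from the command above.
facets="project product institute model experiment time_frequency realm cmor_table ensemble variable"

# Join the facet names with commas to form the download_structure value.
structure=$(printf '%s' "$facets" | tr ' ' ',')

# Selection commands and limit are the ones used in the example URL above.
url="http://esgf-data.dkrz.de/esg-search/wget?experiment=decadal2000&variable=tas&limit=2000&download_structure=${structure}"
echo "$url"

# To fetch and run the generated script (needs network access):
#   wget -O wget-decadal2000-tas.sh "$url"
#   bash wget-decadal2000-tas.sh -H
```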

Wget script options

ESGF Wget scripts can be run with options. For an overview of the available options, type

bash wget-##############.sh -h

(-h for help). Different options can be combined. The following options are important:

-d, the debug option

This option causes the script to send more than the usual response to standard output. Use

bash wget-##############.sh -H -d

if you have problems with option -H, since scripts run with option -H are nearly silent and do not even print useful error messages.

Caution: Do not send your standard output to the user support mailing list esgf-user@lists.llnl.gov, because option -d may cause the script to print your password! Anyone can subscribe to esgf-user@lists.llnl.gov, and your post will be distributed to every subscriber.

-H, the certificate-less option

Since many users have problems with Java and certificates on their local machines, the option -H was developed to avoid use of Java and locally stored certificates. Instead, your OpenID and password are sent with help of a Wget command. Your password is encrypted with SSL (or TLS if you have additionally switched to TLS with option -T). Without option -H, a local credential is created and sent to ESGF servers for the user's authentication but Oracle Java 1.7+ is needed for this purpose.

-i, the "insecure" option

This option disables the check of server certificates. It is unrelated to locally stored certificates and option -H. In a grid such as ESGF, authentication is needed in both directions: the user has to authenticate to the server, and the server has to authenticate to the user's local machine. You may use

bash wget-##############.sh -i

to switch off the check of the server certificate by your local machine. This sometimes helps when a server certificate has expired. Before using this option, ask your system administrator whether you are allowed to do so.

-p, the preserve option

After download, the Wget script calculates a checksum for the freshly downloaded file. If -p is not set, a downloaded file is usually deleted if its checksum does not match the value in the script's file list, and the download is repeated until it succeeds. This feature is intended to correct corrupted downloads automatically. Use the -p option to suppress file deletion:

bash wget-##############.sh -p

The downloaded file will then be preserved despite a checksum mismatch. This option does not suppress the checksum comparison: if the calculated checksum of a downloaded file does not match the checksum in the download file list, a warning is printed. This option may be useful if the checksum stored in the data node's metadata is outdated (this is rare, but it has happened).
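
If you kept a file with -p and want to re-check it by hand, a checksum comparison of the kind described above can be sketched as follows. This is not the code used inside the Wget script. The helper name verify_sha256 is made up for this example, and the availability of sha256sum (or shasum as a fallback) is an assumption about your system. The demonstration uses a small temporary file with a known SHA256 digest instead of a real data file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the SHA256 digest of file $1 equals $2.
verify_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum "$1" | cut -d ' ' -f 1)
    else
        # Fallback for systems that ship shasum instead (an assumption).
        actual=$(shasum -a 256 "$1" | cut -d ' ' -f 1)
    fi
    [ "$actual" = "$2" ]
}

# Demonstration with a small temporary file instead of a real data file.
demo_file=$(mktemp)
printf 'hello' > "$demo_file"
expected=2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

if verify_sha256 "$demo_file" "$expected"; then
    echo "checksum ok"
else
    echo "checksum mismatch"
fi
rm -f "$demo_file"
```

For a real data file, replace the demonstration file and digest with the file name and the checksum given for that file in the script's file list.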

-T, the TLS option

Network traffic between ESGF servers and the user's local machine is usually encrypted using SSL (Secure Sockets Layer). The option -T switches to TLS v1 (Transport Layer Security) instead of SSL.

Find changes with -u

The option -u repeats the search and finds changes in the download file list. In more detail, the Wget script is generated again and compared with the old, locally stored Wget script. Newly available files are listed, as are new versions of previously downloaded files, since the checksum of a replaced file differs from that of the old version. Other changes in the script are also shown. If a modification is detected, the Wget script is updated and the previous version is stored as my_wget_script.old.#, where # is a running index. This option needs the UNIX diff program. Data files are not downloaded.
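
The idea behind this comparison can be illustrated with diff on two small file lists. This is a sketch only and does not reproduce what the Wget script does internally: the helper name changed_entries is made up, and the two lists with their "file name, checksum" layout and values are invented for the example. A new file and a file with a changed checksum both show up as lines that exist only in the newer list.

```shell
#!/bin/sh
# Hypothetical helper: print entries that appear only in the newer list,
# i.e. new files and files whose checksum has changed.
# diff exits with status 1 when the files differ and grep exits with
# status 1 when nothing matches; neither is an error here.
changed_entries() {
    { diff "$1" "$2" || true; } | { grep '^>' || true; } | sed 's/^> //'
}

# Two made-up file lists standing in for the file lists of an old and a
# regenerated Wget script.
workdir=$(mktemp -d)
printf '%s\n' 'tas_2000.nc abc123' 'tas_2001.nc def456' > "$workdir/old.txt"
printf '%s\n' 'tas_2000.nc abc123' 'tas_2001.nc 999999' 'tas_2002.nc 777777' > "$workdir/new.txt"

changed_entries "$workdir/old.txt" "$workdir/new.txt"
rm -rf "$workdir"
```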

Last Update: May 22, 2017, 9:10 a.m. by Hydra Administrator