Modified 25 Sep 2025; Added by Paul van Genuchten
Guide
A Practice on Cataloguing Soil Information and Data
ISRIC has developed a practice for cataloguing soil information and data across various projects. The practice has been implemented in multiple projects, each with different key functionalities. This article gives an overview of the approach, with pointers to relevant documentation.
Catalogues vs Repositories
Some of the confusion around metadata originates from the fact that the boundary between catalogues and repositories is not as sharp as it should be. Following the principle of maintaining metadata close to where the data is generated, metadata should be maintained as part of a data deposit (in a repository). Catalogues can then ingest that information to provide an optimised search experience. Catalogues themselves should therefore not offer functionality to add or update records manually.
Many projects that include a data or knowledge collection task start their research by defining a metadata template to capture that information. Instead, it is better to focus on the existing metadata of publications that are deposited in repositories. Projects would then only need to identify a list of relevant DOIs.
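As a minimal sketch of how such a list of DOIs could be turned into metadata requests, the snippet below prepares one HTTP request per DOI using DOI content negotiation. The DOIs and the helper names `build_doi_request` and `fetch_metadata` are hypothetical, and the sketch assumes that the registration agency behind each DOI supports the CSL JSON media type.

```python
import json
import urllib.request

# Media type for citation metadata in CSL JSON. Assumption: the registration
# agency of the DOI supports it (Crossref and DataCite do).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_doi_request(doi: str) -> urllib.request.Request:
    """Prepare a content-negotiation request for one DOI (no network access yet)."""
    doi = doi.strip()
    # Accept DOIs written as a bare identifier or as a resolver URL.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return urllib.request.Request(
        f"https://doi.org/{doi}", headers={"Accept": CSL_JSON}
    )


def fetch_metadata(doi: str) -> dict:
    """Resolve the DOI over the network and return its metadata as a dictionary."""
    with urllib.request.urlopen(build_doi_request(doi), timeout=30) as response:
        return json.load(response)


# Hypothetical project list; replace with the DOIs relevant to the project.
relevant_dois = ["10.1234/example-dataset", "https://doi.org/10.1234/example-report"]
prepared = [build_doi_request(doi) for doi in relevant_dois]
```

Calling `fetch_metadata` on each entry would perform the actual lookups; the list of prepared requests is kept separate here so the DOI handling can be checked without network access.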
In many cases not all relevant information is stored in a repository (yet). In such a scenario, use Dublin Core to capture temporary metadata, until the work is published.
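A temporary Dublin Core record can be as small as a handful of elements. The sketch below serialises a few fields to XML using the Dublin Core element namespace. The wrapper element `metadata`, the helper name `dublin_core_record` and all field values are hypothetical; a catalogue may expect a different wrapper.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 core Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> str:
    """Serialise a mapping of Dublin Core element names to values as XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in fields.items():
        # Allow a single string or a list of strings (repeatable elements).
        if isinstance(values, str):
            values = [values]
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record for work that has not been deposited yet.
record = dublin_core_record({
    "title": "Soil organic carbon measurements (unpublished draft)",
    "creator": ["A. Researcher"],
    "date": "2025-01-01",
    "subject": ["soil organic carbon", "field survey"],
    "description": "Temporary record until the dataset is deposited in a repository.",
})
print(record)
```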
A repository is not something you set up in the scope of a project. Setting up a persistent repository requires a persistence strategy beyond a single project. A catalogue, however, is something you can easily add to your project to facilitate information and data discovery within the project. Records can be ingested at intervals from relevant repositories or catalogues. The pycsw software provides an efficient, standards-aware catalogue implementation. The origin of pycsw is in the OGC spatial domain, but it also supports general cataloguing standards, such as Dublin Core and OAI-PMH.
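To illustrate ingestion at intervals, the sketch below builds OAI-PMH `ListRecords` requests that ask a repository for Dublin Core records changed since the previous run. The endpoint URL, the date and the helper name `list_records_url` are hypothetical, and the sketch assumes the source repository exposes an OAI-PMH endpoint; it only constructs the URL and does not parse the response.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode


def list_records_url(endpoint: str, since: Optional[date] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords URL for Dublin Core (oai_dc) records."""
    if resumption_token:
        # A follow-up request for the next page carries only verb and token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if since:
            # 'from' limits the harvest to records changed since the last run.
            params["from"] = since.isoformat()
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical repository endpoint and date of the previous harvest.
url = list_records_url("https://repository.example.org/oai", since=date(2025, 9, 1))
print(url)
```

A scheduled job could store the date of each successful run and pass it as `since` on the next one, so that only new or changed records are fetched.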
Authoring catalogue records
pycsw includes record harvesting options via the CSW Harvest operation (see chapter 10.12 in the CSW specification). Because the functionality of this operation is limited, some alternatives are commonly applied:
- The geodatacrawler tool can harvest metadata from a variety of endpoints, or extract metadata for a set of DOI references.
- The Soilwise project implemented a set of harvesting scripts, which can run in an automated workflow (GitLab CI/CD, GitHub Actions, Argo Workflows).
The ingested records are either stored directly in the pycsw database or first deposited in a Git repository. The pygeometa library offers a YAML-based format modeled on ISO 19115, which is well suited to versioning in a Git environment. Maintaining records in Git provides an edit history for each record and enables users to suggest improvements to records via Git issues or pull requests.
Keyword matching
Keywords play an important role in resource discovery. After an initial search, users are offered functionality to further limit the search results via facet filtering in the sidebar (similar to filtering shoes by size and color in a web store). In order to provide efficient facet categories, it is important to cluster similar keywords and categorize them. The Soilwise project developed a keyword matching module for this scenario. It clusters keywords based on translations and Agrovoc synonyms.
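The idea of clustering can be sketched as a lookup of each keyword in a table of known translations and synonyms. This is not the Soilwise module itself: the table `PREFERRED_LABEL`, its entries and the helper name `cluster_keywords` are hypothetical, and a real implementation would derive the table from a thesaurus such as Agrovoc.

```python
from collections import defaultdict

# Hypothetical table mapping keyword variants (lower case) to a preferred label.
PREFERRED_LABEL = {
    "soil organic carbon": "soil organic carbon",
    "soc": "soil organic carbon",
    "organische koolstof": "soil organic carbon",  # Dutch translation
    "soil ph": "soil pH",
    "zuurgraad": "soil pH",  # Dutch translation
}


def cluster_keywords(keywords):
    """Group raw keywords under their preferred label, one cluster per facet value."""
    clusters = defaultdict(set)
    for keyword in keywords:
        # Normalise case and whitespace before the lookup.
        key = " ".join(keyword.lower().split())
        # Unknown keywords form their own cluster under their original spelling.
        label = PREFERRED_LABEL.get(key, keyword.strip())
        clusters[label].add(keyword)
    return dict(clusters)


clusters = cluster_keywords(["SOC", "Soil organic carbon", "zuurgraad", "Erosion"])
for label, members in clusters.items():
    print(label, sorted(members))
```

Each cluster label can then serve as one facet value, so that records tagged "SOC" and "Soil organic carbon" are counted under the same filter.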
Map services on spatial data
Providing map services enables users to access the data without the need to do a full download. A usage example is a website which visualises the dataset on a topographic background. Typical software products that facilitate this scenario include GeoServer, MapServer and pygeoapi. The geodatacrawler tool is able to generate a MapServer configuration file, based on the datasets in a folder (with their metadata). In this scenario the data, the metadata and the provided map services are kept in sync.
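To give an impression of what generating such a configuration involves, the sketch below writes a skeleton MapServer mapfile with one raster layer per GeoTIFF in a list of file names. It is not the geodatacrawler implementation: the helper names `layer_block` and `build_mapfile` and the file names are hypothetical, and the skeleton omits the projection, extent and service metadata that a working map service needs.

```python
from pathlib import Path

RASTER_SUFFIXES = {".tif", ".tiff"}


def layer_block(path: Path) -> str:
    """Return a minimal LAYER block for one raster file."""
    return "\n".join([
        "  LAYER",
        f'    NAME "{path.stem}"',   # layer name taken from the file name
        "    TYPE RASTER",
        f'    DATA "{path.name}"',   # path relative to the mapfile
        "    STATUS ON",
        "  END",
    ])


def build_mapfile(name: str, files) -> str:
    """Build a skeleton mapfile with one layer per raster file in 'files'."""
    layers = [
        layer_block(Path(f))
        for f in files
        if Path(f).suffix.lower() in RASTER_SUFFIXES  # skip non-raster files
    ]
    return "\n".join(["MAP", f'  NAME "{name}"', *layers, "END"]) + "\n"


# Hypothetical folder listing.
mapfile = build_mapfile("soil_maps", ["clay_content.tif", "README.md", "ph_water.TIF"])
print(mapfile)
```

Because the mapfile is derived from the folder contents, regenerating it after every change to the data keeps the service definition aligned with the datasets.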
A spatial data viewer
In order to visualise map services interactively on a topographic background, an application layer is required. The TerriaJS library offers such a framework. The contents of a TerriaJS application are configured via a config file. It is even possible to link a TerriaJS application to a catalogue endpoint, so users can query the catalogue for relevant data and add it to the current map view.
Read more
- An article highlighting a subset of the components is in the Soil data Assimilation Guidance Cookbook
- A Docker-based training is available on setting up a full environment with the tools above
- pygeometa
- pycsw
- geodatacrawler
- TerriaJS
- MapServer