Overview

Throughput is designed to facilitate the transmission of scientific cultural knowledge around data resources to assist interdisciplinary research. We want to move anecdotal information about records from the lab to the cloud. The goal of this document is to highlight several key case studies that inform the development of the data resource by capturing key use patterns.

Technology Background

At present the Throughput DB stack is at the proof-of-concept stage, but much of the technological infrastructure has been chosen with the data model and architecture at the forefront of the design decisions.

API First Development

The Throughput Database is intended as a back-end service to link data resources. As such, investment in UI/UX design is less important at this stage than investment in a well-developed API. The API will have POST and GET methods, roughly mapping to:

GET  resource
     body
     target
POST annotation

The GET methods will return data objects representing the resources associated with the annotation engine (the data repositories), the annotation bodies (supporting, for example, keyword searches), and the annotation targets (generally queried by DOIs or other UIDs).
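As a sketch, these methods might map to endpoints like the following; the paths and query parameters here are illustrative assumptions, not a finalized API:

```
GET  /resources                      list the data repositories known to the engine
GET  /bodies?search=pollen           keyword search across annotation bodies
GET  /targets?uid=10.17616/R3PD38    annotations targeting a DOI or other UID
POST /annotations                    submit a new annotation
```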

neo4j

The linked nature of the data suggests that a graph database would provide the greatest support for the data model. Here we envision several node and link types, including person, resource and thing data types (using the W3C Annotation and schema.org data models). Targets and bodies would be identified through graph relationships, for example:
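As an illustrative sketch, such relationships might be expressed in Cypher; the node labels and relationship types here anticipate the classes described below, and are assumptions rather than a finalized schema:

```cypher
// A body and a target share the same node type (object),
// linked through a central annotation node.
CREATE (b:object {type: 'TextualBody', value: 'A note about this record.'}),
       (t:object {type: 'DOI', value: '10.17616/R3PD38'}),
       (a:annotation {uid: '0000001'}),
       (a)-[:hasBody]->(b),
       (a)-[:hasTarget]->(t)
```

Because bodies and targets share the object label, a body created here can later be matched as the target of a new annotation.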

By ensuring that bodies and targets share a common node type, we ensure that annotation bodies can themselves be annotated. neo4j has a number of packages to support Java, JavaScript, R and other programming languages. This means that development of the API itself can be platform agnostic; replication of the database and development of a new interface should make it straightforward to fork the project.

Docker

Docker containers act to isolate a set of system components from the underlying architecture. Given this setup it is possible to run a system with the same custom components on a local system, a remote server, or a collaborator's system. We use a container configured for neo4j (the neo4j:3.0 image), run using a YAML configuration file that points the neo4j browser to port 17474, so that it does not interfere with any local instances of neo4j.
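A minimal docker-compose configuration along these lines might look as follows; only the neo4j:3.0 image and the 17474 browser port are taken from the setup described above, and the remaining values are illustrative assumptions:

```yaml
version: '2'
services:
  neo4j:
    image: neo4j:3.0
    ports:
      - "17474:7474"   # neo4j browser, remapped from the default 7474
      - "17687:7687"   # bolt protocol, remapped to match (assumed)
    volumes:
      - ./neo4j/data:/data   # persist the graph outside the container (assumed path)
```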

W3C Standards

In general this model contains only four key elements. An annotation is effectively an empty node used to link elements; a body and a target are both the same node type, which allows an annotation body to be a target in and of itself.

Understanding the Core Classes

An annotation is considered to be a set of connected resources, typically including a body and target, and conveys that the body is related to the target. The exact nature of this relationship changes according to the intention of the annotation, but the body is most frequently somehow “about” the target.

In general, this model works for the kind of annotations we want to discuss. For example:

  • An individual makes a note about a data record [note - annotates - record]
  • An individual tags a data record with some keywords [keywords - annotate - record]
  • A dataset is related to an existing grant [grant # - annotates - record]

In each of these cases, the annotation body (the note, the keywords or the grant) is used to supplement the information of the underlying dataset. We consider this a “data-focused” annotation system, so the data record is most commonly the object being annotated. It is also possible for an annotation, or the body of an annotation, to itself be annotated.

An annotation assumes that each body applies to each target equally. This may not always be the case: for example, we may want to generate an annotation for a dataset that both links it to a record from another database and explains why the records are being linked. In this case neither of the body elements can be considered independent of the other with respect to the target, since the text must reference the URI, and the URI relationship is not clear until addressed by the text annotation.

The OpenAnnotation standards provide the concept of a Composite. In this case the two body elements can be bound together, indicating that they must be considered in connection with one another, and then linked to the annotation.

[body:(composite:((body1)(body2)))]-->[annotation]-->[target:()]

The annotation requires some secondary information if we are going to manage annotations across the data lifecycle. This includes the creator; the created date/time stamp; the generator (the agent responsible for generating the reference); the date on which the serialization was generated; and, potentially, a modified date.
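Serialized, this secondary information might look like the following sketch; the field names follow the W3C Web Annotation model, and the values are placeholders:

```json
{
  "creator": "0000-0002-2700-4605",
  "created": "2017-11-24T09:45:18-08:00",
  "generator": "throughputdbr",
  "generated": "2017-11-24T09:45:20-08:00",
  "modified": null
}
```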

Objects: Body and Target Properties

Both the body and the target are of a class called object in the throughput R package. An object contains the information relevant to the annotation. It may link to a target singly, or as part of a composite (see above). Similarly, the target links to the annotation, but may itself be either an annotation or a body (see below); for example, a user may annotate a previous annotation.

At present there is a 'misc' element in the JSON representation. This element is intended to contain extra information as we modify the data model, trying to determine which elements are critical for understanding, searching and retrieving data from the annotation graph.

The current implementation is a class with character elements:

{'object': {'uid': 'character',
            'type': 'character',
            'value': 'character',
            'createTime': 'character',
            'misc': 'character'}}
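Filled in, a single object might look like the following sketch; the uid and createTime values are placeholders in the server's format:

```json
{"object": {"uid": "9052325",
            "type": "DOI",
            "value": "10.17616/R3PD38",
            "createTime": "2017-11-24 09:45:18-0800",
            "misc": ""}}
```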

The throughputdbr package implements only three classes of type: DOI, URL and annotationText. This will expand as we move implementation to a Java-based system.

The values for uid and createTime are generated by the server as the annotation is processed. The user will not generate these.

Annotations: Empty Objects

The annotation object in this data model is a node with minimal information that serves to link annotation elements to a central node with a unique identifier and creation date. As such it is a relatively lightweight element that is primarily used as a target (or source) for relationships within the graph.

{'annotation':{'created':'character',
                   'uid':'character'}}

The user does not create or modify properties for the annotation. These are generated automatically within the system.

Creator: Who makes the annotation?

The creator is intended to be a lightweight container to link to external data resources that manage personal information for researchers. Our current implementation assumes the use of ORCiD identifiers, but is flexible enough to manage data from other services, provided they use similar data structures.

{'creator':{'identifier': 'character',
            'PropertyID': 'character',
             'firstName': 'character',
              'lastName': 'character',
                  'name': 'character',
            'createTime': 'character',
                   'uid': 'character'}}

We expect the identifier (e.g. '0000-0002-2700-4605') to be associated with a PropertyID (a term borrowed from schema.org) that identifies the source of the identifier (making this system flexible enough to accept other unique identifiers, such as OpenIDs).
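As a concrete sketch, a creator record using this ORCiD might look like the following (the name and identifier values are taken from the examples later in this document):

```json
{"creator": {"identifier": "0000-0002-2700-4605",
             "PropertyID": "orcid",
             "firstName": "Simon",
             "lastName": "Goring"}}
```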

We borrow the terms firstName and lastName from schema.org’s person schema, allowing us to implement JSON-LD solutions more easily in the future, without having to cross-walk our internal vocabulary.

Class Summary

With these three fundamental classes (object, creator and annotation) we can link all elements of the annotation database. The capacity of neo4j to manage flexible data models and relationships gives us an opportunity to begin development around a core competency: linking datasets conceptually or through publications, and annotating them individually for clarification.

Given these classes, we might expect a response to look something like the JSON response object shown in the case studies below.

Case Studies

Cross-Database Annotation

Linking Projects to Presentations - AGU 2017

This is empty right now. ToDo

Individual Focused Annotation

An Individual Annotates One or Several Records.

In this instance, a researcher working on a project has discovered an issue with one or several datasets within a resource. The issue may not be an error per se, or a deficiency with the dataset, but may simply be an artifact of methodological or disciplinary procedures. Thus, the annotation helps indicate that special steps may be required to process the data, or may point to reasons why the data appear as an outlier in analysis.

Duplicate analytic layers in Neotoma

In sedimentary pollen analysis a researcher may count multiple slides at a single depth. These counts are often summed and reported as a single value, but they may also be reported independently. For a researcher unfamiliar with pollen analysis this can cause confusion, particularly when counts differ. Thus, an individual encountering this issue for the first time may wish to annotate a single record.

In this case, the annotator, Simon Goring (orcid: 0000-0002-2700-4605) creates a text annotation that targets the dataset object from Neotoma. Given this structure we generate a graph to indicate this relationship. Once generated as part of the larger graph and API system, the annotation can be served as part of a query relating to a number of factors.

# Remove all nodes and relationships from the graph.
clear_gph <- function(x) {
  cypher(x, 'MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r;')
}

clear_gph(con)

user <- creator(identifier = '0000-0002-2700-4605',
                PropertyID = 'orcid',
                 firstName = 'Simon',
                  lastName = 'Goring')

body <- object(type = "TextualBody",
               value = paste0("Two samples counted at the same depth (x=94cm) with the same ages. ",
                              "It seems like these samples are two separate counts at the same depth ",
                              "and may be summed during analysis."))

target <- object(type = "URL",
                 value = "http://api.neotomadb.org/v1/data/datasets/13047")

source <- object(type = "DOI",
                 value = "10.17616/R3PD38")

link_record(      con = con, 
            generator = user, 
                 body = body, 
               target = target,
               source = source)

We might expect a response object in JSON to look something like this:


{"data":
  {
    "nodes":
      [
        {
          "id": 24629,
          "labels": ["annotation"],
          "properties":
            {
              "uid": "9052324",
              "created": "2017-11-24 09:45:18-0800"
            }
        },
        {
          "id": 24630,
          "labels": ["creator"],
          "properties":
            {
              "lastName": "Goring",
              "firstName": "Simon",
              "id": "0000-0002-2700-4605",
              "PropertyID": "orcid"
            }
        },
        {
          "id": 24631,
          "labels": ["object"],
          "properties":
            {
              "type": "TextualBody",
              "value": "Two samples counted at the same depth (x=94cm) with the same ages. It seems like these samples are two separate counts at the same depth and may be summed during analysis."
            }
        },
        {
          "id": 24632,
          "labels": ["object"],
          "properties":
            {
              "type": "URL",
              "value": "http://api.neotomadb.org/v1/data/datasets/13047"
            }
        },
        {
          "id": 24633,
          "labels": ["object", "resource"],
          "properties":
            {
              "type": "DOI",
              "value": "10.17616/R3PD38"
            }
        }
      ],
    "relationships":
      [
        {
          "id": 12324,
          "labels": ["created"],
          "relates": [24630, 24629],
          "origin": 24630
        },
        {
          "id": 12325,
          "labels": ["hasBody"],
          "relates": [24629, 24631],
          "origin": 24629
        },
        {
          "id": 12326,
          "labels": ["hasTarget"],
          "relates": [24629, 24632],
          "origin": 24629
        },
        {
          "id": 12327,
          "labels": ["hasResource"],
          "relates": [24632, 24633],
          "origin": 24632
        }
      ]
  }
}

Data resource linking

Here we have two sets of records: one, a set of pollen records from Neotoma; the other, a set of records within the Arctic Data Center. In this first code block, a researcher (Jack Williams, University of Wisconsin) notes that a set of records added to the Arctic Data Center is linked to a publication in PNAS:

clear_gph(con)

creator <- creator(identifier = '0000-0001-6046-9634',
                   PropertyID = 'orcid',
                    firstName = 'John',
                     lastName = 'Williams')

body <- list(object(type = 'annotationText',
                   value = 'All resources linked by publication and project as part of the St Paul\'s project.'),
             object(type = 'DOI',
                   value = '10.1073/pnas.1604903113'))

target <- list(object(type = 'DOI',
                     value = '10.18739/A27N5S'),
               object(type = 'DOI',
                     value = '10.18739/A2X68N'),
               object(type = 'DOI',
                     value = '10.18739/A2MH0X'))

source <- object(type = 'DOI',
                value = '10.17616/R37P98')

link_record(con = con, 
            generator = creator, 
            body = body, 
            target = target,
            source = source)

Some of the records linked by Dr. Williams are also in Neotoma, and, as such, these records can be linked in any number of ways. In this case, Simon Goring links the resources by noting that one of the Arctic Data Center records represents the same physical object (it’s the same core) as the pollen object within Neotoma:

creator <- creator(identifier = '0000-0002-2700-4605',
                   PropertyID = 'orcid',
                    firstName = 'Simon',
                     lastName = 'Goring')

body <- list(object(type = 'annotationText',
                   value = 'The Neotoma record has a set of geophysical data recorded in the Arctic Data Center.  It\'s all part of the same analysis.'),
             object(type = 'DOI',
                   value = '10.18739/A27N5S'))

target <- object(type = 'URL',
                value = 'http://apps.neotomadb.org/explorer/?datasetid=20188')

source <- object(type = "DOI",
                value = "10.17616/R3PD38")

link_record(con = con, 
            generator = creator, 
            body = body, 
            target = target,
            source = source)

This then produces a graph in which the two annotations are linked through the shared Arctic Data Center record (10.18739/A27N5S).

In this way, the dataset, through this graph, is now connected to the publication indirectly. There is no requirement that the database add this information; it can be discovered through the annotation engine API.
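For example, a traversal along the following lines could recover that indirect connection; the node labels and relationship types here follow the response sketch above and are assumptions, not a finalized schema:

```cypher
// Walk from the Neotoma dataset, through the shared Arctic Data Center
// record (10.18739/A27N5S), to any DOI body on a further annotation,
// which here includes the PNAS publication.
MATCH path = (d:object {value: 'http://apps.neotomadb.org/explorer/?datasetid=20188'})
             <-[:hasTarget]-(:annotation)-[:hasBody]->(shared:object)
             <-[:hasTarget]-(:annotation)-[:hasBody]->(pub:object {type: 'DOI'})
RETURN pub.value
```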

Data resource modification

Neotoma contains a large number of paleoecological records obtained from lake sediment. The lakes are identified by name, and locations are reported using latitude and longitude coordinates. Data publication for records in Neotoma spans nearly 60 years (check), and, as such, the earliest reported locations often had low precision, either through rounding when locations were reported in degrees-minutes-seconds, or as a result of uncertainty when locations were read from topographic maps.

A recent project by Neotoma resulted in the conversion of locations within the database. Some lake sites had coordinates changed; some sites were unchanged, but additional information was added to the resources (for example, noting that original publications were checked and it was not possible to correct the location).

For this exercise we rely on a single annotation that can connect the Neotoma data resources to a script, a CSV table and a text annotation.
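Using the same R helpers as in the case studies above (and the con and user objects created there), such an annotation might be sketched as follows; the script and table URLs, and the choice of target dataset, are placeholders rather than actual project resources:

```r
# One annotation connects a Neotoma dataset to the conversion script,
# the table of revised coordinates, and an explanatory note.
body <- list(object(type = 'annotationText',
                    value = 'Site coordinates revised; see linked script and table.'),
             object(type = 'URL',
                    value = 'https://example.org/scripts/update_locations.R'),
             object(type = 'URL',
                    value = 'https://example.org/tables/revised_locations.csv'))

target <- object(type = 'URL',
                 value = 'http://api.neotomadb.org/v1/data/datasets/13047')

source <- object(type = 'DOI',
                 value = '10.17616/R3PD38')

link_record(con = con, generator = user,
            body = body, target = target, source = source)
```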

A resource annotates records

There may be occasions where a script, written by an individual but associated with a resource, annotates records. For example, Neotoma and the Paleobiology Database share certain records, but this may not be entirely apparent; or one database may use a script or workflow to improve geolocation of data resources from historical papers. In such cases it may be useful to annotate records either to identify links across resources, or to identify any post-processing that has occurred within the resource.