WhatEvery1Says Manifest Schema Documentation

Draft v0.6
March 29, 2015

Introduction

A WhatEvery1Says manifest is a valid YAML document saved with the suffix .yml. Manifests can be used to store information about the following:

  1. Primary or processed data files.
  2. Workflow sequences.
  3. Supporting materials (processing scripts, documentation, visualisations, and so on).

A typical manifest will resemble the following template. Terminal nodes marked with [] can take a list or array as their value; otherwise, the value is assumed to be text. Note that the template does not yet reflect the changes introduced in v0.6.

  ---
  manifestId:
  namespace:
  version:
  resourceId:
  resourceType:
  title:
  altTitle:
  label:
  publicationInfo:
    publication:
    publisher:
    publicationDate:
    contentType:
    documentType:
    authors: []
    language: []
    country: []
    OCR:
    resourceMetadata:
    rights:
  collectionInfo:
    collectors: []
    collectionDate:
    accessMethod: []
    resourceLocation:
  relationships: []
  processes:
    ref:
    seq: []
  notes: []

Certain items will not appear in all manifests. For example, a manifest recording a workflow process will normally not contain collectionInfo. This documentation explains each item in detail in the manifest field definitions below.


Basic YAML Format

The example below demonstrates the basics of the YAML format.

  ---
  manifestId: 1
  namespace: WE1S
  version: 1.0
  resourceId: 1
  title: Understanding Vintage Cartoons
  publicationInfo:
    publication: The New York Times
    publicationDate: 2015
    authors:
    - seq: 1
      author: Fred Flintstone
    - seq: 2
      author: George Jetson

The above document defines an associative array with six top-level keys. The value of a key may itself be an array containing further key-value pairs. YAML keys are easily mapped onto the keys of data objects in most programming languages or fields in a database.

Valid YAML documents must observe the following formatting guidelines:

YAML auto-detects data type, so it is generally not necessary to enclose strings like "George Jetson" in quotation marks in order to distinguish them from integers or other data types. However, quotation marks can be used for clarity. See the YAML guidelines for ways to specify other data types.

Multiline values will be parsed as a single string if the key is followed by the modifier | or >. The former will preserve line breaks and white space; the latter will fold multiple lines into a single paragraph.

  # Preserve line breaks and other white space
  poem: |
    There once was a short man from Ealing
      Who got on a bus to Darjeeling
            
  # Fold line breaks into a single paragraph
  prose: >
    There once was a short man from Ealing
    Who got on a bus to Darjeeling

YAML parsers return list items in document order, but the position of an item is not stored as part of the item itself. Therefore, if the order of items needs to be recorded explicitly, a key identifying the sequence number should be supplied:

  authors:
  - seq: 1
    author: Fred Flintstone
  - seq: 2
    author: George Jetson

YAML has an inline, or "flow" style similar to JSON. It can express multiple values as a list, enclosed in square brackets, or as an array of key-value pairs, enclosed in curly brackets. Both list and array items should be separated by a comma followed by a space. Hence the sample manifest above can also be represented as follows:

  {manifestId: 1, namespace: WE1S, version: 1.0, resourceId: 1, title: Understanding Vintage Cartoons,
    publicationInfo: {publication: The New York Times, publicationDate: 2015,
    authors: [{seq: 1, author: Fred Flintstone}, {seq: 2, author: George Jetson}]}}

The structure is perhaps easier to see when laid out in the style of pretty-printed JSON:

  {
    manifestId: 1,
    namespace: WE1S,
    version: 1.0,
    resourceId: 1,
    title: Understanding Vintage Cartoons,
    publicationInfo: {
      publication: The New York Times,
      publicationDate: 2015,
      authors: [
        {seq: 1,
         author: Fred Flintstone
        },
        {seq: 2,
         author: George Jetson
        }
      ]
    }
  }

Flow notation can be useful for compact data transfer, but it may also be required for filling out forms. For instance, Bolt does not possess a "repeatable" content type for items in a list. Therefore, a user may have to enter multiple items as a list in flow notation.


WhatEvery1Says Manifest Field Definitions

Borrowing from database terminology, each YAML key node will generally be referred to as a "field". Some fields, such as relationships and notes, are not required. Depending on the type of information stored, collectionInfo and processes may be optional.

Unlike XML, YAML does not have to declare a root node explicitly, although top-level fields can be said to be "contained by" the root field. If an explicit naming of the root is required (such as for conversion to XML), this can be done by concatenating the namespace and version values. For example:

WE1S1.0:
  namespace: WE1S
  version: 1.0

Detailed descriptions of the individual fields are given below.


accessMethod

Description A description of the URL, API query, or other method used to access the resource.
Required Required
Contained by collectionInfo
Notes This field can contain an array of arbitrary keys to classify types of access as in the example below. Multiple methods can be given as a list.
Examples
accessMethod:
  url: http://somewebsite.com
  api: http://somewebresource.com/?search=somequery
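
Where a resource was reached by more than one method, the methods can be given as a list. The following is only a sketch: the url and api keys and the placeholder addresses are taken from the example above, and the use of seq to order the methods is an assumption rather than a requirement of the schema.

```yaml
accessMethod:
  # Each list item describes one access method (seq is optional here)
  - seq: 1
    url: http://somewebsite.com
  - seq: 2
    api: http://somewebresource.com/?search=somequery
```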

action

Description A description of a step in a processing workflow.
Required Optional
Contained by processes

altTitle

Description An alternative title for the resource. This may be a descriptive title provided by the collector(s) or some other useful designator.
Required Optional
Contained by root

author

Description Optional field for listing multiple authors.
Required Required if the author is part of a list.
Contained by authors
Notes This field can contain an array of arbitrary keys to classify parts of names as in the example below. It is possible that these subfields should be required for search purposes.
Examples
authors:
- author:
    firstname: Fred
    lastname: Flintstone
- author:
    firstname: George
    lastname: Jetson

authors

Description A list of the author(s) of the file or workflow described in the manifest.
Required Required for some types of resources.
Contained by publicationInfo
Notes Multiple authors can be expressed as a simple list of names (e.g. [Fred Flintstone, George Jetson]) or as a list of key-value pairs:
  authors:
  - Fred Flintstone
  - George Jetson
  
  authors:
  - seq: 1
    author: Fred Flintstone
  - seq: 2
    author: George Jetson
The use of the author field is only required if authors contains other fields like seq.

collectionDate

Description The date when the resource was acquired in the format YYYY-MM-DD.
Required Required
Contained by collectionInfo

collectionInfo

Description A set of fields providing information about how the data or resource was acquired.
Required Required only for resources not authored by WhatEvery1Says staff.
Contained by root
Must contain collectors, collectionDate, accessMethod
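
As a rough illustration, a collectionInfo block holding its three required fields might look like the sketch below. All values are placeholders reused from other examples in this document, and the optional resourceLocation path is hypothetical.

```yaml
collectionInfo:
  collectors:
    - Fred Flintstone
    - George Jetson
  # YYYY-MM-DD; left unquoted, some YAML parsers read this as a date rather than text
  collectionDate: 2015-01-03
  accessMethod:
    url: http://somewebsite.com
  # Optional; hypothetical path to the archived copy
  resourceLocation: /path/to/archived/file
```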

collectionQueryTerms

Description A description of the query terms used in harvesting the collection.
Required Optional
Contained by collectionInfo
May contain notes, queryTerm, seq
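
No example is given for this field, so the following is only one possible layout, assuming the query terms are recorded as a list of queryTerm items ordered by seq. The query terms themselves are invented for illustration.

```yaml
collectionInfo:
  collectionQueryTerms:
    # Hypothetical search terms used when harvesting the collection
    - seq: 1
      queryTerm: humanities
    - seq: 2
      queryTerm: liberal arts
```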

collector

Description Optional field for listing multiple collectors.
Required Required if the collector is part of a list.
Contained by collectors
Notes This field can contain an array of arbitrary keys to classify parts of names as in the example below. It is possible that these subfields should be required for search purposes.
Examples
collectors:
- collector:
    firstname: Fred
    lastname: Flintstone
- collector:
    firstname: George
    lastname: Jetson

collectors

Description A list of the collector(s) of the file or workflow described in the manifest.
Required Required
Contained by collectionInfo
Notes Multiple collectors can be expressed as a simple list of names (e.g. [Fred Flintstone, George Jetson]) or as a list of key-value pairs.
  collectors:
  - Fred Flintstone
  - George Jetson
  
  collectors:
  - seq: 1
    collector: Fred Flintstone
  - seq: 2
    collector: George Jetson
The use of the collector field is only required if collectors contains other fields like seq.

contentType

Description A taxonomic description from a controlled vocabulary (e.g. newspaper article) created by WhatEvery1Says. Can contain multiple items.
Required Optional
Contained by publicationInfo

country

Description The country or countries from which the resource originated.
Required Optional
Contained by publicationInfo

date

Description The date an action was performed or a note was recorded, given in the format YYYY-MM-DD.
Required Optional
Contained by notes, processes

description

Description Contains a prose description of the parent field's content supplied by WhatEvery1Says staff. Equivalent to notes but does not take subfields.
Required Optional
Contained by Global

documentType

Description A taxonomic description from a controlled vocabulary (e.g. genre) created by WhatEvery1Says. Can contain multiple items.
Required Optional
Contained by publicationInfo

endDate

Description The end date of a date range in the format YYYY-MM-DD.
Required Required if a startDate is given.
Contained by publicationDates

group

Description An arbitrary field to indicate group responsibility.
Required Optional
Contained by resp
Notes This field name is included for completeness since it is mentioned in the resp field definition. It can also be applied to other fields like authors and collectors.

label

Description A short designator for the resource to be used for generating CSVs, visualisations, etc.
Required Optional
Contained by root

language

Description The language or languages contained in the resource.
Required Optional
Contained by publicationInfo
Notes Language codes should be taken from the ISO 639-2 list.
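
For instance, a resource containing both English and Spanish text could be recorded with the three-letter ISO 639-2 codes eng and spa. The choice of languages here is purely illustrative.

```yaml
publicationInfo:
  # ISO 639-2 codes: eng = English, spa = Spanish
  language: [eng, spa]
```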

manifestId

Description A unique identifier for the manifest. This may be a sequential number or the file name.
Required Required
Contained by root

namespace

Description A URI for a namespace used for describing a metadata schema.
Required The WhatEvery1Says namespace is required within the root. namespace is optional elsewhere.
Contained by resourceMetadata, root
Notes For now, we can use the label "WE1S" for the namespace.

note

Description Contains an item in a list of notes supplied by WhatEvery1Says staff.
Required Required
Contained by notes

notes

Description Contains a note or a list of notes supplied by WhatEvery1Says staff. In lists, each item should use the note field.
Required Optional
Contained by Global
Examples
notes:
  - note: This is a note.
  - note: This is a second note.

OCR

Description Has the value "true" if the resource was generated by Optical Character Recognition.
Required Optional
Contained by publicationInfo

processedText

Description Path to processed text files.
Required Optional
Contained by processes

processes

Description A description of a processing workflow. This can be a sequence of processing steps or a manifestId reference to a separate manifest containing the sequence.
Required Optional
Contained by root
May contain action, date, processedText, ref, resp, seq
Examples
processes:
  - seq: 1
    action: Removed stop words.
    date: 2015-01-03
    resp: Scott Kleinman
    processedText: /path/to/file/without/stopwords
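
The alternative form, in which the workflow is stored in a separate manifest, might be written as in the sketch below. The value 2 is a hypothetical manifestId, not one defined anywhere in this documentation.

```yaml
processes:
  # Points to the manifest (manifestId: 2, hypothetical) that holds the processing steps
  ref: 2
```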

publication

Description The title of the publication from which the resource derives.
Required Required for some types of resources.
Contained by publicationInfo
Notes Some fields may be required for certain resource types and not for others. Manifests for processing workflows and other types of materials produced by WhatEvery1Says staff are considered to be authored by that staff and their names must be listed in the authors field.

publicationDates

Description The date or dates of publication in the format YYYY-MM-DD.
Required Required for some types of resources.
Contained by publicationInfo
May contain endDate, startDate
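
A date range could be recorded along the following lines. This is a sketch based on the startDate and endDate definitions; the dates are invented, and how a single publication date should be written is not shown because the schema does not specify it here.

```yaml
publicationInfo:
  publicationDates:
    # Both ends of the range are needed: each field is required if the other is given
    startDate: 2015-01-01
    endDate: 2015-03-29
```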

publicationInfo

Description A set of fields providing information about the production and dissemination of the resource.
Required Required
Contained by root
Must contain authors
May contain contentType, country, documentType, language, OCR, publication, publicationDates, publisher, resourceMetadata, rights
Notes Some fields may be required for certain resource types and not for others. Manifests for processing workflows and other types of materials produced by WhatEvery1Says staff are considered to be authored by that staff and their names must be listed in the authors field.

publisher

Description The publisher's name.
Required Required for some types of resources.
Contained by publicationInfo
Notes For resources authored by WhatEvery1Says staff, "4Humanities" can be given as the value.
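
As an illustration, the publicationInfo of a manifest for staff-authored material might be filled in as follows. The author name is borrowed from the processes example elsewhere in this document and is used here only as a placeholder.

```yaml
publicationInfo:
  # Staff who produced the material are listed as its authors
  authors:
    - Scott Kleinman
  publisher: 4Humanities
```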

queryTerm

Description Description of one of a number of query terms used in harvesting the collection.
Required Optional
Contained by collectionQueryTerms

ref

Description Contains a reference to another manifestId.
Required Optional
Contained by processes

relationships

Description A list of resources related to the current document. Related files may be documentation or visualization files, or they may be subsets or supersets of a collection to which the resource belongs.
Required Optional
Contained by root
Notes This field can contain an array of arbitrary fields to classify the type and location of the related resource. Some suggestions are type (with values like hasPart and isPartOf), ref (a reference to a manifestId), and location (a path to an archived file).
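
Using the suggested keys, a relationships list might be sketched as follows. The manifestId 12 and the file path are hypothetical, and because the keys are arbitrary, other layouts are equally possible.

```yaml
relationships:
  # This resource belongs to the collection described in manifest 12 (hypothetical)
  - type: isPartOf
    ref: 12
  # A related file, identified by its archive path (hypothetical)
  - type: hasPart
    location: /path/to/archived/file
```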

resourceId

Description A unique ID for a resource file. This may be a sequential integer or an alphanumeric value.
Required Required if the manifest refers to an outside resource (i.e. not a description of workflow).
Contained by root

resourceLocation

Description The path to the WhatEvery1Says archived copy of a resource.
Required Required only if there is no WhatEvery1Says archive file.
Contained by collectionInfo

resourceMetadata

Description A description of metadata accompanying the data from its source publisher.
Required Optional
Contained by publicationInfo
Notes Contains an array with the keys namespace (the URI of the metadata schema), prefix (a namespace label to attach to the metadata field), and field (the name of the metadata field). The namespace is probably only necessary if the schema is not a standard like Dublin Core.
Examples
resourceMetadata:
  namespace: http://dublincore.org/documents/dcmi-namespace/
  prefix: DC
  field: type

resourceType

Description A description of the resource, for example, primary data, processing information, scripts, documentation, etc.
Required Optional
Contained by root
Notes
  • Currently, values may be drawn from an open-ended list of common options, but it is possible to specify a controlled vocabulary at a later date.
  • resourceType should be inferred from the other manifest fields, but recording this might make sorting of particular types of manifests easier. We should consider making it required.

One possible use is to specify collection for the resourceType. This means that all processes are inherently applied to the collection, not the publication (or any field at the same level).

resp

Description The name of the person or persons responsible for an action or note.
Required Optional
Contained by notes, processes
Notes In some cases, it may be necessary to indicate group responsibility. This can be done using an arbitrary group key: e.g. resp: {group: UCSB}.
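
In full YAML notation, resp might be combined with note and date as in the sketch below. The note texts are invented; the name and the group value are taken from examples elsewhere in this document.

```yaml
notes:
  # A note attributed to one person
  - note: Checked the OCR output against the original.
    date: 2015-01-03
    resp: Scott Kleinman
  # A note attributed to a group
  - note: Reviewed by the project team.
    resp:
      group: UCSB
```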

rights

Description A statement of licensing rights or intellectual property restrictions. Free culture is assumed by default.
Required Optional
Contained by publicationInfo

seq

Description Contains an integer used for keeping track of the sequential order of items in a list. This should be indicated with the hyphen plus space flag in full YAML notation.
Required Optional
Contained by Global—can be used in any field containing a list.

startDate

Description The start date of a date range in the format YYYY-MM-DD.
Required Required if an endDate is given.
Contained by publicationDates

title

Description The title of the resource. This may be the original title of a publication or a descriptive title provided by the collector(s).
Required Required
Contained by root

type

Description Contains a taxonomic description of the parent field's content supplied by WhatEvery1Says staff.
Required Optional
Contained by Global
Notes Equivalent to XML @type, intended for general use. The vocabulary may be controlled in certain contexts.
Examples
process:
  type: collection

Specifies a collection process. Other process types may be "wrangling", "outputs", etc.

version

Description Contains a version number.
Required The version number of the WhatEvery1Says manifest schema is required in the document root.
Contained by root

Index of Fields

a

accessMethod action altTitle author authors

c

collectionDate collectionInfo collectionQueryTerms collector collectors contentType country

d

date description documentType

e

endDate

g

group

l

label language

m

manifestId

n

namespace note notes

o

OCR OS

p

platforms processedText processes publication publicationDates publicationInfo publisher

q

queryTerm

r

ref relationships resourceId resourceLocation resourceMetadata resourceType resp rights

s

seq startDate

t

title type

v

version