WhatEvery1Says Manifest Schema Documentation
Draft v0.5
March 27, 2015
Introduction
A WhatEvery1Says manifest is a valid YAML document saved with the suffix .yml. Manifests
can be used to store information about the following:
- Primary or processed data files.
- Workflow sequences.
- Supporting materials (processing scripts, documentation, visualisations, and so
on).
A typical manifest will resemble the following template (terminal nodes with [] can have
a list or array as their value; otherwise, the value is assumed to be text):
---
manifestId:
namespace:
version:
resourceId:
resourceType:
title:
altTitle:
label:
publicationInfo:
  publication:
  publisher:
  publicationDate:
  contentType:
  documentType:
  authors: []
  language: []
  country: []
  OCR:
  resourceMetadata:
  rights:
collectionInfo:
  collectors: []
  collectionDate:
  accessMethod: []
  resourceLocation:
relationships: []
processes:
  ref:
  seq: []
notes: []
Certain items will not appear in all manifests. For example, a manifest recording a workflow
process will normally not contain collectionInfo. This documentation explains each item in
detail in the manifest field definitions below.
Basic YAML Format
The example below demonstrates the basics of the YAML format.
---
manifestId: 1
namespace: WE1S
version: 1.0
resourceId: 1
title: Understanding Vintage Cartoons
publicationInfo:
  publication: The New York Times
  publicationDate: 2015
  authors:
    - seq: 1
      author: Fred Flintstone
    - seq: 2
      author: George Jetson
---
The above document defines an associative array with six top-level keys. The value of a key
may itself be an array containing further key-value pairs. YAML keys are easily mapped
onto the keys of data objects in most programming languages or fields in a database.
Valid YAML documents must observe the following formatting guidelines:
- YAML data structure hierarchy is maintained by outline indentation. The specific
number of spaces in the indentation is unimportant as long as parallel elements have
the same left justification and the hierarchically nested elements are indented
further. Tab characters are never allowed for indentation.
- Key-value pairs are separated by a colon followed by a space.
- A hyphen followed by a space before a key indicates that it is an item in a list.
- A YAML stream can contain multiple documents separated by ---.
- The character # is used to precede comments that are ignored by a YAML parser.
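The guidelines above can be seen together in a short sketch. The values are illustrative only, and the second document is a hypothetical manifest added purely to show the document separator.

```yaml
# First document in the stream.
---
manifestId: 1
namespace: WE1S
publicationInfo:
  # Nested keys are indented with spaces, never tabs.
  publication: The New York Times
  authors:
    - Fred Flintstone   # a hyphen plus a space marks a list item
    - George Jetson
# A new "---" line starts a second document in the same stream.
---
manifestId: 2
namespace: WE1S
```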
YAML auto-detects data type, so it is generally not necessary to enclose strings like
"George Jetson" in quotation marks in order to distinguish them from integers or other
data types. However, quotation marks can be used for clarity. See the YAML guidelines
for ways to specify other data types.
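As a rough illustration of type auto-detection, the sketch below contrasts unquoted and quoted values. The keys versionLabel and resourceCode are hypothetical and are not part of the manifest schema; they appear only so that both forms can be shown side by side.

```yaml
version: 1.0          # unquoted: resolved as a floating-point number
versionLabel: "1.0"   # quoted: kept as the string "1.0"
resourceId: 1         # unquoted: resolved as an integer
resourceCode: "001"   # quoted: kept as a string, so the leading zeros survive
title: Understanding Vintage Cartoons   # plain text needs no quotation marks
```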
Multiline values will be parsed as a single string if the key is followed by the modifier |
or >. The former will preserve line breaks and white space; the latter will fold multiple
lines into a single paragraph.
# Preserve line breaks and other white space
poem: |
  There once was a short man from Ealing
  Who got on a bus to Darjeeling
# Fold line breaks into a single paragraph
prose: >
  There once was a short man from Ealing
  Who got on a bus to Darjeeling
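Although a YAML parser reads list items in document order, that order may not survive
conversion to other data structures or storage in a database. Therefore, if the order of
items needs to be preserved, a key identifying the sequence number should be supplied: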
authors:
  - seq: 1
    author: Fred Flintstone
  - seq: 2
    author: George Jetson
YAML has an inline, or "flow" style similar to JSON. It can express multiple values as a
list, enclosed in square brackets, or as an array of key-value pairs, enclosed in curly
brackets. Both list and array items should be separated by a comma followed by a space.
Hence the sample manifest above can also be represented as follows:
[{manifestId: 1, namespace: WE1S, version: 1.0, resourceId: 1, title: Understanding Vintage Cartoons,
publicationInfo: {publication: The New York Times, publicationDate: 2015, authors: [{seq: 1, author:
Fred Flintstone}, {seq: 2, author: George Jetson}]}}]
The structure is perhaps easier to see in a JSON-style pretty-printed layout:
[
  {manifestId: 1,
   namespace: WE1S,
   version: 1.0,
   resourceId: 1,
   title: Understanding Vintage Cartoons,
   publicationInfo: {
     publication: The New York Times,
     publicationDate: 2015,
     authors: [
       {seq: 1,
        author: Fred Flintstone
       },
       {seq: 2,
        author: George Jetson
       }
     ]
   }
  }
]
Flow notation can be useful for compact data transfer, but it may also be required for
filling out forms. For instance, Bolt does not possess a "repeatable" content type for
items in a list. Therefore, a user may have to enter multiple items as a flow-style list.
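A sketch of what such single-line entries might look like, assuming the form field accepts raw YAML as its value (the language codes are illustrative):

```yaml
# Each list fits on one line, so it can be typed into a single form field.
authors: [Fred Flintstone, George Jetson]
language: [eng, spa]
```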
WhatEvery1Says Manifest Field Definitions
Borrowing from database terminology, YAML key nodes will generally be referred to as "fields".
Some fields, such as relationships and notes, are not required. Depending on the type of
information stored, collectionInfo and processes may be optional.
Unlike XML, YAML does not have to declare a root node explicitly, although top-level fields
can be said to be "contained by" the root field. If an explicit naming of the root is
required (such as for conversion to XML), this can be done by concatenating the namespace
and version values. For example:
root: WE1S1.0
namespace: WE1S
version: 1.0
Detailed descriptions of the individual fields are given below.
accessMethod
Description: A description of the URL, API query, or other method used to access the resource.
Required: Required
Contained by: collectionInfo
Notes: This field can contain an array of arbitrary keys to classify types of access, as in the example below. Multiple methods can be given as a list.
Examples:
accessMethod:
  url: http://somewebsite.com
  api: http://somewebresource.com/?search=somequery
action
Description: A description of a step in a processing workflow.
Required: Optional
Contained by: processes
altTitle
Description: An alternative title for the resource. This may be a descriptive title provided by the collector(s) or some other useful designator.
Required: Optional
Contained by: root
author
Description: Optional field for listing multiple authors.
Required: Required if the author is part of a list.
Contained by: authors
Notes: This field can contain an array of arbitrary keys to classify parts of names, as in the example below. It is possible that these subfields should be required for search purposes.
Examples:
authors:
  - author:
      firstname: Fred
      lastname: Flintstone
  - author:
      firstname: George
      lastname: Jetson
authors
Description: A list of the author(s) of the file or workflow described in the manifest.
Required: Required for some types of resources.
Contained by: publicationInfo
Notes: Multiple authors can be expressed as a simple list of names (e.g. [Fred Flintstone, George Jetson]) or as a list of key-value pairs:
# Simple list of names
authors:
  - Fred Flintstone
  - George Jetson
# List of key-value pairs
authors:
  - seq: 1
    author: Fred Flintstone
  - seq: 2
    author: George Jetson
The use of the author field is only required if authors contains other fields like seq.
collectionDate
Description: The date when the resource was acquired, in the format YYYY-MM-DD.
Required: Required
Contained by: collectionInfo
collectionInfo
Description: A set of fields providing information about how the data or resource was acquired.
Required: Required only for resources not authored by WhatEvery1Says staff.
Contained by: root
Must contain: collectors, collectionDate, accessMethod
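As a minimal sketch, a complete collectionInfo block containing the three mandatory fields might look like the following. All values, including the date and the archive path, are hypothetical placeholders.

```yaml
collectionInfo:
  collectors:
    - Fred Flintstone
  collectionDate: 2015-03-01          # YYYY-MM-DD
  accessMethod:
    url: http://somewebsite.com
  resourceLocation: /path/to/archived/file   # placeholder path
```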
collector
Description: Optional field for listing multiple collectors.
Required: Required if the collector is part of a list.
Contained by: collectors
Notes: This field can contain an array of arbitrary keys to classify parts of names, as in the example below. It is possible that these subfields should be required for search purposes.
Examples:
collectors:
  - collector:
      firstname: Fred
      lastname: Flintstone
  - collector:
      firstname: George
      lastname: Jetson
collectors
Description: A list of the collector(s) of the file or workflow described in the manifest.
Required: Required
Contained by: collectionInfo
Notes: Multiple collectors can be expressed as a simple list of names (e.g. [Fred Flintstone, George Jetson]) or as a list of key-value pairs:
# Simple list of names
collectors:
  - Fred Flintstone
  - George Jetson
# List of key-value pairs
collectors:
  - seq: 1
    collector: Fred Flintstone
  - seq: 2
    collector: George Jetson
The use of the collector field is only required if collectors contains other fields like seq.
contentType
Description: A taxonomic description from a controlled vocabulary (e.g. newspaper article) created by WhatEvery1Says. Can contain multiple items.
Required: Optional
Contained by: publicationInfo
country
Description: The country or countries from which the resource originated.
Required: Optional
Contained by: publicationInfo
date
Description: The date an action was performed or a note was recorded, given in the format YYYY-MM-DD.
Required: Optional
Contained by: notes, processes
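For illustration, a note carrying the date field might look like the sketch below. The note text and date are invented, and placing date and resp alongside note is an assumption based on the "Contained by" entries rather than on a confirmed example.

```yaml
notes:
  - note: Checked the OCR output against the page images.
    date: 2015-01-05        # YYYY-MM-DD
    resp: Scott Kleinman
```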
documentType
Description: A taxonomic description from a controlled vocabulary (e.g. genre) created by WhatEvery1Says. Can contain multiple items.
Required: Optional
Contained by: publicationInfo
group
Description: An arbitrary field to indicate group responsibility.
Required: Optional
Contained by: resp
Notes: This field name is included for completeness since it is mentioned in the resp field definition. It can also be applied to other fields like authors and collectors.
label
Description: A short designator for the resource to be used for generating CSVs, visualisations, etc.
Required: Optional
Contained by: root
language
Description: The language or languages contained in the resource.
Required: Optional
Contained by: publicationInfo
Notes: Language codes should be taken from the ISO 639-2 list.
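A brief sketch of a language list using three-letter ISO 639-2 codes; the choice of languages is illustrative only.

```yaml
publicationInfo:
  language:
    - eng   # English
    - spa   # Spanish
```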
manifestId
Description: A unique identifier for the manifest. This may be a sequential number or the file name.
Required: Required
Contained by: root
namespace
Description: A URI for a namespace used for describing a metadata schema.
Required: The WhatEvery1Says namespace is required within the root. namespace is optional elsewhere.
Contained by: resourceMetadata, root
Notes: For now, we can use the label "WE1S" for the namespace.
note
Description: Contains an item in a list of notes supplied by WhatEvery1Says staff.
Required: Required
Contained by: notes
notes
Description: Contains a note or a list of notes supplied by WhatEvery1Says staff. In lists, each item should use the note field.
Required: Optional
Contained by: Global
Examples:
notes:
  - note: This is a note.
  - note: This is a second note.
OCR
Description: Has the value "true" if the resource was generated by Optical Character Recognition.
Required: Optional
Contained by: publicationInfo
processedText
Description: Path to processed text files.
Required: Optional
Contained by: processes
processes
Description: A description of a processing workflow. This can be a sequence of processing steps or a manifestId reference to a separate manifest containing the sequence.
Required: Optional
Contained by: root
May contain: action, date, processedText, ref, resp, seq
Examples:
processes:
  - seq: 1
    action: Removed stop words.
    date: 2015-01-03
    resp: Scott Kleinman
    processedText: /path/to/file/without/stopwords
publication
Description: The title of the publication from which the resource derives.
Required: Required for some types of resources.
Contained by: publicationInfo
Notes: Some fields may be required for certain resource types and not for others. Manifests for processing workflows and other types of materials produced by WhatEvery1Says staff are considered to be authored by that staff, and their names must be listed in the authors field.
publicationDate
Description: The date of publication in the format YYYY-MM-DD.
Required: Required for some types of resources.
Contained by: publicationInfo
publicationInfo
Description: A set of fields providing information about the production and dissemination of the resource.
Required: Required
Contained by: root
Must contain: authors
May contain: contentType, country, documentType, language, OCR, publication, publicationDate, publisher, resourceMetadata, rights
Notes: Some fields may be required for certain resource types and not for others. Manifests for processing workflows and other types of materials produced by WhatEvery1Says staff are considered to be authored by that staff, and their names must be listed in the authors field.
publisher
Description: The publisher's name.
Required: Required for some types of resources.
Contained by: publicationInfo
Notes: For resources authored by WhatEvery1Says staff, "4Humanities" can be given as the value.
ref
Description: Contains a reference to another manifestId.
Required: Optional
Contained by: processes
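As a sketch of how ref could be used inside processes to point at a separate workflow manifest instead of listing the steps inline (the manifestId value 27 is hypothetical):

```yaml
processes:
  ref: 27   # manifestId of the manifest that records the workflow sequence
```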
relationships
Description: A list of resources related to the current document. Related files may be documentation or visualization files, or they may be subsets or supersets of a collection to which the resource belongs.
Required: Optional
Contained by: root
Notes: This field can contain an array of arbitrary fields to classify the type and location of the related resource. Some suggestions are type (with values like hasPart and isPartOf), ref (a reference to a manifestId), and location (a path to an archived file).
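A minimal sketch of a relationships list using the suggested type, ref, and location keys. The manifestId value and the path are hypothetical placeholders, and since these keys are only suggestions, other arbitrary keys would be equally acceptable.

```yaml
relationships:
  - type: isPartOf
    ref: 12                            # manifestId of the parent collection
  - type: hasPart
    location: /path/to/archived/file   # placeholder path to a related file
```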
resourceId
Description: A unique ID for a resource file. This may be a sequential integer or an alphanumeric value.
Required: Required if the manifest refers to an outside resource (i.e. not a description of workflow).
Contained by: root
resourceLocation
Description: The path to the WhatEvery1Says archived copy of a resource.
Required: Required only if there is no WhatEvery1Says archive file.
Contained by: collectionInfo
resourceMetadata
Description: A description of metadata accompanying the data from its source publisher.
Required: Optional
Contained by: publicationInfo
Notes: Contains an array with the keys namespace (the URI of the metadata schema), prefix (a namespace label to attach to the metadata field), and field (the name of the metadata field). The namespace is probably only necessary if the schema is not a standard like Dublin Core.
Examples:
resourceMetadata:
  namespace: http://dublincore.org/documents/dcmi-namespace/
  prefix: DC
  field: type
resourceType
Description: A description of the resource, for example, primary data, processing information, scripts, documentation, etc.
Required: Optional
Contained by: root
Notes:
- Currently, values may be drawn from an open-ended list of common options, but it is possible to specify a controlled vocabulary at a later date.
- resourceType should be inferable from the other manifest fields, but recording it might make sorting of particular types of manifests easier. We should consider making it required.
resp
Description: The name of the person or persons responsible for an action or note.
Required: Optional
Contained by: notes, processes
Notes: In some cases, it may be necessary to indicate group responsibility. This can be done using an arbitrary group key: e.g. resp: {group: UCSB}.
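The flow-style resp: {group: UCSB} can equally be written in block style. The sketch below places it in a processing step; the step itself is illustrative.

```yaml
processes:
  - seq: 1
    action: Removed stop words.
    resp:
      group: UCSB   # group responsibility instead of a personal name
```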
rights
Description: A statement of licensing rights or intellectual property restrictions. Free culture is assumed by default.
Required: Optional
Contained by: publicationInfo
seq
Description: Contains an integer used for keeping track of the sequential order of items in a list. This should be indicated with the hyphen plus space flag in full YAML notation.
Required: Optional
Contained by: Global; can be used in any field containing a list.
title
Description: The title of the resource. This may be the original title of a publication or a descriptive title provided by the collector(s).
Required: Required
Contained by: root
version
Description: Contains a version number.
Required: The version number of the WhatEvery1Says manifest schema is required in the document root.
Contained by: Global