A Three-Pound Monkey Brain: XML


14 February 2013

Mathematical expressions as JSON (and phyloreferencing)

For Names on Nodes I did a lot of work with MathML (specifically MathML-Content), an application of XML for representing mathematical concepts. But now, as XML wanes and JSON waxes, I've started to look at ideas for porting Names on Nodes concepts over to JSON.

I've been drawing up a very basic and extensible way to interpret JSON mathematically. Each of the core JSON values translates like so:

Null, Boolean, and Number values are interpreted as themselves.
Strings are interpreted as qualified identifiers (if they include ":") or local identifiers (otherwise).
Arrays are interpreted as the application of an operation, where the first element is a string identifying the operation and the remaining elements are arguments.
Objects are interpreted either as:

a set of declarations, where each key is a [local] identifier and each value is an evaluable JSON expression (see above), or
a namespace, where each key is a URI and each value is a series of declarations (see previous).
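The interpretation rules above can be written down as a TypeScript type plus a small classifier. This is only a sketch: the names `Expr`, `Declarations`, and `classify` are invented here for illustration and do not come from any Names on Nodes code.

```typescript
// A JSON value read as a mathematical expression (illustrative names only).
type Expr = null | boolean | number | string | Expr[];

// A set of declarations: local identifier -> evaluable expression.
interface Declarations {
    [identifier: string]: Expr;
}

// Report how a value would be interpreted under the rules listed above.
function classify(expr: Expr): string {
    if (expr === null || typeof expr === "boolean" || typeof expr === "number") {
        return "literal";
    }
    if (typeof expr === "string") {
        // A string containing ":" is a qualified identifier.
        return expr.indexOf(":") >= 0 ? "qualified identifier" : "local identifier";
    }
    // Arrays: the first element names the operation, the rest are arguments.
    const head = expr[0];
    if (typeof head !== "string") {
        throw new Error("An application must start with an operation name.");
    }
    return "application of " + head + " to " + (expr.length - 1) + " argument(s)";
}

const sample: Declarations = {
    "half": ["http://www.w3.org/1998/Math/MathML:divide", 1, 2]
};
console.log(classify(sample["half"]));
```

Namespaces (objects keyed by URI) are left out to keep the sketch short.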

Examples

Here's a simple object declaring some mathematical constants (approximately):

{
    "e": 2.718281828459045,
    "pi": 3.141592653589793
}

Supposing we had declared some operations (only possible in JavaScript, since JSON doesn't have functions) equivalent to those of MathML (whose namespace URI is "http://www.w3.org/1998/Math/MathML"), we could do this:

{
    "x":
        ["http://www.w3.org/1998/Math/MathML:plus",
            1,
            2
        ],
    "y":
        ["http://www.w3.org/1998/Math/MathML:sin",
            ["http://www.w3.org/1998/Math/MathML:divide",
                "http://www.w3.org/1998/Math/MathML:pi",
                2
            ]
        ]
}

Once evaluated, x would be 3 and y would be 1 (or close to it, given that this is floating-point math).
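A minimal evaluator along these lines might look like the following. It is a sketch under stated assumptions: the operation table, the `evaluate` function, and the `MATHML` prefix constant are all hypothetical, only the three operations needed here are defined, and the second operands are taken to be 2, which is what the stated results of 3 and 1 imply.

```typescript
type Expr = null | boolean | number | string | Expr[];
type Operation = (...args: number[]) => number;

const MATHML = "http://www.w3.org/1998/Math/MathML:";

// Hypothetical operation table; a real one would cover far more of MathML.
const operations: Record<string, Operation> = {};
operations[MATHML + "plus"] = (...args) => args.reduce((sum, n) => sum + n, 0);
operations[MATHML + "divide"] = (a, b) => a / b;
operations[MATHML + "sin"] = (a) => Math.sin(a);

// Qualified constants.
const constants: Record<string, number> = {};
constants[MATHML + "pi"] = Math.PI;

function evaluate(expr: Expr): number {
    if (typeof expr === "number") return expr;
    if (typeof expr === "string") {
        const value = constants[expr];
        if (value === undefined) throw new Error("Unknown identifier: " + expr);
        return value;
    }
    if (Array.isArray(expr)) {
        // First element names the operation; the rest are evaluated as arguments.
        const head = expr[0];
        const operation = typeof head === "string" ? operations[head] : undefined;
        if (operation === undefined) throw new Error("Unknown operation: " + String(head));
        const args = expr.slice(1).map((arg) => evaluate(arg));
        return operation(...args);
    }
    throw new Error("Not a numeric expression: " + JSON.stringify(expr));
}

const x = evaluate([MATHML + "plus", 1, 2]);
const y = evaluate([MATHML + "sin", [MATHML + "divide", MATHML + "pi", 2]]);
console.log(x, y);
```

Local identifiers and non-numeric results are not handled; the point is only that an array maps directly onto a function call.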

Now for the interesting stuff. Suppose we had declared Names on Nodes operations and some taxa using LSIDs:

{
    "Homo sapiens": "urn:lsid:ubio.org:namebank:109086",
    "Ornithorhynchus anatinus": "urn:lsid:ubio.org:namebank:7094675",
    "Mammalia":
        ["http://namesonnodes.org/ns/math/2013:clade",
            ["http://www.w3.org/1998/Math/MathML:union",
                "Homo sapiens",
                "Ornithorhynchus anatinus"
            ]
        ]
}

Voilà, a phylogenetic definition of Mammalia in JSON!

I think this could be pretty useful. My one issue is the repetition of long URIs. It would be nice to have a mechanism to import them using shorter handles. Maybe something like this?

{
    "mathml":   "http://www.w3.org/1998/Math/MathML:*",
    "namebank": "urn:lsid:ubio.org:namebank:*",
    "NoN":      "http://namesonnodes.org/ns/math/2013:*",

    "Mammalia":
        ["NoN:clade",
            ["mathml:union",
                "namebank:109086",
                "namebank:7094675"
            ]
        ]
}
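Resolving such handles could be a simple string rewrite before evaluation. The sketch below assumes the convention shown above (a declaration whose value ends in ":*" is an import); `expandIdentifier` and `expandAll` are hypothetical names, not an existing API.

```typescript
type Expr = null | boolean | number | string | Expr[];

// Rewrite "handle:name" to a full identifier when "handle" is declared as "<uri>:*".
function expandIdentifier(declarations: Record<string, unknown>, identifier: string): string {
    const colon = identifier.indexOf(":");
    if (colon < 0) return identifier; // local identifier, nothing to expand
    const handle = identifier.slice(0, colon);
    const pattern = declarations[handle];
    // Only a string ending in ":*" counts as an import (assumed convention).
    if (typeof pattern !== "string" || pattern.slice(-2) !== ":*") return identifier;
    return pattern.slice(0, -1) + identifier.slice(colon + 1);
}

// Apply the rewrite to every string in an expression tree.
function expandAll(declarations: Record<string, unknown>, expr: Expr): Expr {
    if (typeof expr === "string") return expandIdentifier(declarations, expr);
    if (Array.isArray(expr)) return expr.map((item) => expandAll(declarations, item));
    return expr;
}

const doc: Record<string, Expr> = {
    "mathml": "http://www.w3.org/1998/Math/MathML:*",
    "namebank": "urn:lsid:ubio.org:namebank:*",
    "NoN": "http://namesonnodes.org/ns/math/2013:*",
    "Mammalia": ["NoN:clade", ["mathml:union", "namebank:109086", "namebank:7094675"]]
};
const mammalia = expandAll(doc, doc["Mammalia"]);
console.log(JSON.stringify(mammalia));
```

One wrinkle this exposes: a handle named "http" or "urn" would collide with ordinary URI schemes, so the choice of handles would need some care.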

Something to ponder. Another thing to ponder: what should I call this? MathON? MaSON?

28 January 2013

Using TypeScript to Define JSON Data

JSON has gradually been wearing away at XML's position as the primary format for data communication on the Web. In some ways, that's a good thing: JSON is much more compact and readable. In other ways, it's not so great: JSON lacks some of XML's features.

One of these features is document type definitions. For XML, there are a variety of formats (DTD, XML Schema, RELAX NG, etc.) for specifying exactly what your XML data looks like: what are the tag names, possible attributes, etc. JSON is a lot more loosey-goosey here.

Okay, that's not entirely true: there is JSON Schema. I've never known anyone to use it, but it's there. It's awfully verbose, though. (So are the definitional formats for XML, but it's XML — you expect it!)

I was thinking about this the other day, and I realized that there is actually a great definitional format for JSON already in existence: TypeScript! If you haven't heard of it, TypeScript is a superset of JavaScript which introduces optional static typing. And since JSON is a subset of JavaScript, TypeScript is applicable to JSON as well.

One of the great features of TypeScript is that interface implementation is implicit. In Java or ActionScript, you have to specifically say that a type "implements MyInterface". In TypeScript, if it fits, it fits. For example:

interface List
{
    length: number;
}

// The return type was spelled "bool" in TypeScript 0.8.x; later versions use "boolean".
function isEmpty(list: List): boolean
{
    return list.length === 0;
}

console.log(isEmpty("")); // true
console.log(isEmpty("foo")); // false
console.log(isEmpty({ length: 0 })); // true
console.log(isEmpty({ length: 3 })); // false
console.log(isEmpty({ size: 1})); // Compiler error!

(Note: for some reason that I can't fathom, isEmpty() doesn't work on arrays. Well, TypeScript is still in development — version 0.8.2 right now. Update: I filed this as a bug.)

Note that you can use interfaces even on plain objects. So of course you can use them to describe a JSON format. Here's an example from a project I hope to release before too long:

interface Model
{
    uid: string;
}

interface Name extends Model
{
    citationStart?: number;
    html?: string;
    namebankID?: string;
    root?: boolean; // "bool" in TypeScript 0.8.x
    string?: string;
    type?: string;
    uri?: string;
    votes?: number;
}

interface Taxon
{
    canonicalName?: Name;
    illustrated?: boolean; // "bool" in TypeScript 0.8.x
    names?: Name[];
}

Now, for example, I can declare that an API search method will return data as an array of Taxon objects (Taxon[]). And look how compact and readable it is!

Note that there is one drawback here: there is no way to enforce this at run-time. JSON Schema might be a better choice if that's what you need. But for compile-time checking and documentation, it's a pretty great tool.
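One way to cover the run-time gap without leaving TypeScript is a hand-written type guard that checks parsed JSON before it is trusted. The sketch below uses trimmed copies of the interfaces (written with the current `boolean` spelling); `isName`, `isTaxon`, and `parseTaxa` are illustrative names and not part of the project mentioned above.

```typescript
// Trimmed copies of the interfaces, just enough to be self-contained.
interface Name {
    uid: string;
    html?: string;
    votes?: number;
}

interface Taxon {
    canonicalName?: Name;
    illustrated?: boolean;
    names?: Name[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
    return typeof value === "object" && value !== null && !Array.isArray(value);
}

// True when an optional property is absent or has the expected primitive type.
function isOptional(value: unknown, type: string): boolean {
    return value === undefined || typeof value === type;
}

function isName(value: unknown): value is Name {
    return isRecord(value)
        && typeof value.uid === "string"
        && isOptional(value.html, "string")
        && isOptional(value.votes, "number");
}

function isTaxon(value: unknown): value is Taxon {
    if (!isRecord(value)) return false;
    const names = value.names;
    return (value.canonicalName === undefined || isName(value.canonicalName))
        && isOptional(value.illustrated, "boolean")
        && (names === undefined || (Array.isArray(names) && names.every((n) => isName(n))));
}

// Parse a response body and verify its shape before returning it as Taxon[].
function parseTaxa(json: string): Taxon[] {
    const data: unknown = JSON.parse(json);
    if (!Array.isArray(data) || !data.every((item) => isTaxon(item))) {
        throw new Error("Response is not an array of Taxon objects.");
    }
    return data;
}
```

The cost is that the guard has to be kept in sync with the interface by hand, which is exactly the duplication a schema language avoids.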

02 April 2010

Names on Nodes: MathML Definitions (Version 1.2)

After the epiphany that Names on Nodes did not have to be associated with a database, I set to work creating a "standalone" version of the application. Progress has been pretty good, and if you are interested in the details (or collaborating), you can check the project out at its new home on Bitbucket (which also houses the related project, ASMathema).

I've just updated the Names on Nodes website based on these revisions to the project, most notably the MathML Definitions document. Most of the changes have actually been removals: no more mentions of rank-based taxonomy (which may be covered in future versions but not in this one), qualified names as taxonomic identifiers (no longer a necessary feature), etc. So if you didn't read it before because it was too long and dense ... well, it's still pretty long and dense, actually. But less so!

I've also added an example MathML document as a supplement. This document:

Defines a phylogenetic context (the same one used in the MathML Definitions examples), arranging taxonomic units as vertices in a directed, acyclic graph.
Defines sets based on characters ("wings used for powered flight" and "extant")
Refers a specimen (YPM-VP 1450) to a taxonomic unit (Ichthyornis).
Equates several species names as synonyms.
Defines some hybrid formulas as referring to specific taxonomic units.
Defines a number of clade names.

This file can be opened with Names on Nodes: Standalone Version, which I am currently developing and hope to release this year.

25 February 2010

Names on Nodes: MathML Definitions (Version 1.1)

After posting Version 1.0 earlier this week, I had a revelation: the cladogen functions are completely unnecessary, and everything would work a lot nicer if I just tossed them. I also realized that there really was no reason I couldn't include the various relations (precedence, immediate precedence, proper precedence, etc.), just in case anyone wanted to do some seriously non-standard definitions. After some significant revisions, I present Version 1.1.

Some examples of the updated notation, using humans (Homo sapiens), platypuses (Ornithorhynchus anatinus), and Dimetrodon grandis, a stem-mammal:

Union. Homo sapiens ∪ Ornithorhynchus anatinus = all humans and all platypuses (polyphyletic taxon, also monothetic)

Exclusive Predecessors. Homo sapiens ← Ornithorhynchus anatinus = humans and all of their ancestors, except for the ancestors shared with platypuses (lineage)

Synapomorphic Predecessors. "milk glands" @ Homo sapiens = humans and all human ancestors to possess milk glands synapomorphic with those in humans (lineage)

Node-Based Clade. Clade(Homo sapiens ∪ Ornithorhynchus anatinus) = Mammalia

Branch-Based Clade (simple). Clade(Homo sapiens ← Ornithorhynchus anatinus) = "Pan-Theria"

Branch-Based Clade (multiple external specifiers). Clade(Homo sapiens ← Ornithorhynchus anatinus ∪ Dimetrodon grandis) = "Pan-Theria"

Branch-Based Clade (multiple internal specifiers). Clade(Homo sapiens ∪ Ornithorhynchus anatinus ← Dimetrodon grandis) = (unnamed clade composed mostly of Therapsida)

Null Branch-Based Definition (multiple internal specifiers). Clade(Homo sapiens ∪ Dimetrodon grandis ← Ornithorhynchus anatinus) = ∅

Apomorphy-Based Clade. Clade("milk glands" @ Homo sapiens) = "Apo-Mammalia"

Node-Modified Crown Clade. Crown(Homo sapiens ∪ Dimetrodon grandis, "extant as of or after 2010") = Mammalia

Branch-Modified Crown Clade. Crown(Homo sapiens ← Ornithorhynchus anatinus, "extant as of or after 2010") = Theria

Apomorphy-Modified Crown Clade. Crown("milk glands" @ Homo sapiens, "extant as of or after 2010") = Mammalia

Total Clade. Total(Mammalia, "extant as of or after 2010") = Synapsida (or "Pan-Mammalia")

[Image: a node-based clade (Mammalia) under a given phylogenetic hypothesis.]
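To make the first two operations in the list above concrete, here is a sketch over a toy graph of taxonomic units. Everything in it is assumed for illustration: the graph, the unit names other than the three species, and the function names. The handling of several internal or external specifiers is inferred from the examples, not taken from the Definitions document.

```typescript
// Toy hypothesis: each key lists its immediate successors.
const successors: Record<string, string[]> = {
    "synapsid ancestor": ["Dimetrodon grandis", "mammal ancestor"],
    "Dimetrodon grandis": [],
    "mammal ancestor": ["Ornithorhynchus anatinus", "therian stem"],
    "Ornithorhynchus anatinus": [],
    "therian stem": ["Homo sapiens"],
    "Homo sapiens": []
};
const units = Object.keys(successors);

// True when x equals y or y is reachable from x.
function precedes(x: string, y: string): boolean {
    if (x === y) return true;
    return (successors[x] || []).some((next) => precedes(next, y));
}

// A ∪ B, without duplicates.
function union(a: string[], b: string[]): string[] {
    return a.concat(b.filter((x) => a.indexOf(x) < 0));
}

// internal ← external: units preceding every internal specifier
// but not preceding any external specifier.
function exclusivePredecessors(internal: string[], external: string[]): string[] {
    return units.filter((x) =>
        internal.every((y) => precedes(x, y)) && !external.some((y) => precedes(x, y)));
}

const lineage = exclusivePredecessors(["Homo sapiens"], ["Ornithorhynchus anatinus"]);
const empty = exclusivePredecessors(
    union(["Homo sapiens"], ["Dimetrodon grandis"]), ["Ornithorhynchus anatinus"]);
console.log(lineage, empty);
```

In this toy graph the first call yields the human lineage after the split from platypuses, and the second yields nothing, matching the null definition in the list.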

23 July 2009

Two "Names on Nodes"-Related Launches

I'm still a long way from launching the beta application, but I've just made a couple of launches related to my long-time work-in-progress, Names on Nodes.

First up, and probably of more interest to most people, I've begun the documentation for the MathML definitions used by Names on Nodes. The document includes general reviews of relevant mathematical and biological concepts, a quick review of MathML and the technologies it's based on, some comments on correlating mathematical and biological concepts, and definitions for all entities (including operations) used by Names on Nodes. Note that this covers a lot of the same ground as in my 2007 paper, with a few minor changes in the symbols and terminology (e.g., I now call the ancestor of a clade a "cladogen" rather than a "cladogenetic set").

Secondly, I've made the project open-source, by moving it to Google Code. If you are a developer interested in checking this out, go here. It's incomplete, so I don't know if anyone will have any real interest in looking at it yet. (Honestly, I mostly posted so that, on the off chance that I unexpectedly kick the bucket, my magnum opus won't be lost forever.)

This information is also on the new Names on Nodes home page.

21 October 2008

Six Ways to Say the Same Thing

Prose

"'Aves' refers to the crown clade stemming from the most recent common ancestor of Ratitae (Struthio camelus Linnaeus 1758), Tinamidae (Tetrao [Tinamus] major Gmelin 1789), and Neognathae (Vultur gryphus Linnaeus 1758)."

—Jacques Gauthier & Kevin de Queiroz 2001 December

Simple Mathematical Formula

Aves := Clade(Struthio camelus + Tetrao major + Vultur gryphus)

Complex Mathematical Formula

Aves Linnaeus 1758 [Gauthier & de Queiroz 2001] := (AD o max o CA)(Struthio camelus ∪ Tetrao major ∪ Vultur gryphus)

Ridiculously Complex Mathematical Formula

C := {x : (∀y ∈ (Struthio camelus ∪ Tetrao major ∪ Vultur gryphus))[x ≼ y]}
A := {x ∈ C : (∀y ∈ C)[x ⊀ y]}
Aves := {x : (∃y ∈ A)[x ≽ y]}
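The three set-builder lines translate almost word for word into array filters. The sketch below runs them on a toy hypothesis whose nodes are taxonomic units; the graph, the internal node names, and the function names are assumptions for illustration only.

```typescript
// Toy hypothesis: each key lists its immediate successors.
const children: Record<string, string[]> = {
    "stem": ["crown ancestor", "Ichthyornis"],
    "crown ancestor": ["palaeognath ancestor", "Vultur gryphus"],
    "palaeognath ancestor": ["Struthio camelus", "Tetrao major"],
    "Ichthyornis": [],
    "Vultur gryphus": [],
    "Struthio camelus": [],
    "Tetrao major": []
};
const nodes = Object.keys(children);

// x ≼ y: x equals y or y is reachable from x.
function precedesOrEquals(x: string, y: string): boolean {
    if (x === y) return true;
    return (children[x] || []).some((child) => precedesOrEquals(child, y));
}

function nodeClade(specifiers: string[]): string[] {
    // C: everything that precedes (or equals) every specifier.
    const c = nodes.filter((x) => specifiers.every((y) => precedesOrEquals(x, y)));
    // A: members of C that do not properly precede any member of C.
    const a = c.filter((x) => c.every((y) => x === y || !precedesOrEquals(x, y)));
    // Result: everything preceded by (or equal to) some member of A.
    return nodes.filter((x) => a.some((y) => precedesOrEquals(y, x)));
}

const aves = nodeClade(["Struthio camelus", "Tetrao major", "Vultur gryphus"]);
console.log(aves);
```

Under this toy hypothesis the result starts at "crown ancestor" and leaves out both "stem" and "Ichthyornis".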

Simple MathML-Content

<apply
    xmlns="http://www.w3.org/1998/Math/MathML">
  <csymbol
    definitionURL="http://namesonnodes.org/2008/phylo/math/nodeClade"/>
  <csymbol
    definitionURL="urn:isbn:0-85301-006-4/Struthio+camelus"/>
  <csymbol
    definitionURL="urn:isbn:0-85301-006-4/Tetrao+major"/>
  <csymbol
    definitionURL="urn:isbn:0-85301-006-4/Vultur+gryphus"/>
</apply>

Complex MathML within Custom Markup

<pn:definition
    xmlns="http://www.w3.org/1998/Math/MathML"
    xmlns:pn="http://namesonnodes.org/2008/phylo/names">
  <apply>
    <csymbol
        definitionURL="http://namesonnodes.org/2008/phylo/math/clade">
      <mi form="prefix">Clade</mi>
    </csymbol>
    <apply>
      <csymbol
          definitionURL="http://namesonnodes.org/2008/phylo/math/nodeAncestors">
        <mo form="infix">+</mo>
      </csymbol>
      <csymbol
          definitionURL="urn:isbn:0-85301-006-4/Struthio+camelus">
        <![CDATA[<i>Ratitae</i> (<i>Struthio camelus</i> Linnaeus 1758)]]>
      </csymbol>
      <csymbol
          definitionURL="urn:isbn:0-85301-006-4/Tetrao+major">
        <![CDATA[<i>Tinamidae</i> (<i>Tetrao</i> [<i>Tinamus</i>] <i>major</i> Gmelin 1789)]]>
      </csymbol>
      <csymbol
          definitionURL="urn:isbn:0-85301-006-4/Vultur+gryphus">
        <![CDATA[<i>Neognathae</i> (<i>Vultur gryphus</i> Linnaeus 1758)]]>
      </csymbol>
    </apply>
  </apply>
</pn:definition>

References

urn:isbn:0-912532-57-2/1 (Jacques Gauthier & Kevin de Queiroz 2001 December)
urn:doi:10.1111/j.1463-6409.2007.00302.x (T. Michael Keesey 2007 November)

31 January 2008

The Nouns of Names on NEXUS

Programming is a mystery to most folks. They see a bunch of overpunctuated gobbledygook with words strewn about here and there and it's completely opaque. They know that it somehow translates into the functionality of the applications, games, websites, etc. that they use. But they have no inroads to understanding how on Earth that works.

I will now attempt a (very) partial explanation for the phylogenetics-literate crowd.

One thing people don't understand is that object-oriented computer languages (which is what I primarily use) are actually designed to be compatible with how humans think. Or at least, they're a sort of compromise between how computers think and how humans think. Natural languages, of course, are totally biased toward how humans think, while machine codes (and their slightly dressed-up cousins, assembly languages) are totally biased toward how computers think. (There are also functional languages which are slightly more computer-biased than object-oriented languages.)

Like natural languages, object-oriented languages have nouns, except they're called objects. They also have verbs, except they're called methods. Methods are usually (but not always) attached to objects. Objects can have attributes which are themselves other objects—these are called fields, and they can work a bit like adjectives (although that's not a perfect analogy).

One of the first tasks I do as a programmer when approaching a new project is to figure out what the nouns of the project are. These will be used as the basis for classes, which are the templates which objects (and their methods and fields) are created from.

So let's use Names on NEXUS as an example. This is my project, hinted at in my paper, to relate the data in NEXUS files (Maddison et al. 1997) to definitions of names as governed by the PhyloCode. So my first step is to come up with lists of nouns (i.e., class candidates) for each side of the equation:

PhyloCode (nomenclature): scientific name (or nomen), uninomen, binomen, prenomen, genus name, clade name, phylonym, definition
PhyloCode (specification): specifier, species, specimen, specimen collection, specimen accession, apomorphy, definition
NEXUS: NEXUS file, tree, tree element, tree node, tree terminus, character state
shared: phylogeny, citation, piece of literature, calendar date, URI

The goal of this project is to translate a PhyloCode definition (associated with a phylonym) into a list of NEXUS taxa (i.e., operational taxonomic units) using a NEXUS tree. For that to happen, there need to be some additional nouns that help relate NEXUS entities to PhyloCode entities:

Names on NEXUS: character state specifier, taxon specifier, character state link, taxon link

The next step is to figure out how these nouns—these classes—relate to each other. Typically, this involves statements of the form "X is a Y" (which has to do with class hierarchy) and the forms "X has a Z", "X has one or more Zs", "X has zero or more Zs", etc. (which have to do with fields). I'll also translate these nouns into capitalized "camel-humped" format, the standard format for class names in the languages I use. Lower-case "camel-humped" nouns are of primitive types (numbers, strings, Booleans) which I don't need to make a class for.

Literature

A LiteraturePiece has a CalendarDate, one or more authorNames, and zero or more URIs.
A Citation has a LiteraturePiece and zero or more authorNames.

PhyloCode: Nomenclature

A Nomen has a Citation, an orthography, and zero or more URIs.
A Uninomen is a Nomen.
A Binomen is a Nomen and a Phylonym, and has a Prenomen and a Uninomen.
A GenusName is a Uninomen and a Prenomen.
A CladeName is a Uninomen, a Phylonym, and a Prenomen.
A PhyloDefinition has a Citation, a Phylonym, one or more Specifiers, a prose statement, and a mathML statement (see my paper for details on the last one).

PhyloCode: Specification

A Specifier has zero or more URIs.
An Apomorphy is a CharStateSpecifier, and has a description and a Citation.
A Specimen is a TaxonSpecifier, and has one or more SpecimenAccessions.
A SpecimenAccession has a code and a SpecimenCollection.
A SpecimenCollection has a code, a name, and zero or more URIs.
A Species is a TaxonSpecifier, and has one or more Binomens (binomina) and one or more Specimens (name-bearing types).

NEXUS

A NexusFile has textData, zero or one Citations, zero or more URIs, a numTaxa amount, a numChars amount, two or more CharStates, zero or more Trees, zero or more CharStateLinks, and zero or more TaxonLinks.
A CharState has a character index and a character scoring.
A Tree has a TreeNode.
A TreeNode is a TreeElement and has two or more TreeElements.
A TreeTerminus is a TreeElement and has a taxonIndex.

Names on NEXUS

A CharStateSpecifier is a Specifier.
A TaxonSpecifier is a Specifier.
A CharStateLink has a CharState and a CharStateSpecifier.
A TaxonLink has a taxonIndex and a TaxonSpecifier.
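A few of these statements, written as TypeScript interfaces, show how the two kinds of relation map onto code: "X is a Y" becomes `extends` (or membership in a union type), and "X has a Z" becomes a field. This is a sketch of a subset only; the field names and the `kind` discriminant are assumptions, and `taxonIndices` is a hypothetical helper.

```typescript
interface Specifier {
    uris: string[];                  // zero or more URIs
}

interface TaxonSpecifier extends Specifier {
    kind: "taxon";                   // assumed discriminant, not in the lists above
}

interface SpecimenCollection {
    code: string;
    name: string;
    uris: string[];
}

interface SpecimenAccession {
    code: string;
    collection: SpecimenCollection;
}

interface Specimen extends TaxonSpecifier {
    accessions: SpecimenAccession[]; // one or more
}

interface TreeTerminus {
    taxonIndex: number;
}

interface TreeNode {
    children: TreeElement[];         // two or more
}

// "A TreeNode is a TreeElement" and "a TreeTerminus is a TreeElement".
type TreeElement = TreeNode | TreeTerminus;

interface Tree {
    root: TreeNode;
}

interface TaxonLink {
    taxonIndex: number;
    specifier: TaxonSpecifier;
}

// Collect the taxon indices of all termini below an element, left to right.
function taxonIndices(element: TreeElement): number[] {
    if ("taxonIndex" in element) return [element.taxonIndex];
    let result: number[] = [];
    for (const child of element.children) {
        result = result.concat(taxonIndices(child));
    }
    return result;
}

// The tree ((0,1),2).
const tree: Tree = {
    root: { children: [{ children: [{ taxonIndex: 0 }, { taxonIndex: 1 }] }, { taxonIndex: 2 }] }
};
console.log(taxonIndices(tree.root));
```

Cardinalities such as "one or more" cannot be expressed by a plain array type and would need a run-time check.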

Now I can describe the core functionality of Names on NEXUS. Taking a NexusFile, the user selects one of its Trees. Next, the application finds all PhyloDefinitions whose Specifiers are each referred to by one of the NexusFile's CharStateLinks or TaxonLinks. Using the Tree and each PhyloDefinition's mathML statement, it correlates the PhyloDefinition's Phylonym to a set of taxon indices in the NexusFile.
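The selection step in that description (finding the definitions whose Specifiers are all linked) could be sketched as a filter. The `id` field standing in for object identity, the simplified shapes, and the name `applicableDefinitions` are assumptions for illustration, not the application's actual code.

```typescript
interface Specifier {
    id: string;                      // assumed stand-in for object identity
}

interface Link {
    specifier: Specifier;            // a CharStateLink or TaxonLink, simplified
}

interface PhyloDefinition {
    phylonym: string;
    specifiers: Specifier[];
}

// Keep only the definitions whose specifiers are all referred to by some link.
function applicableDefinitions(definitions: PhyloDefinition[], links: Link[]): PhyloDefinition[] {
    const linkedIds = links.map((link) => link.specifier.id);
    return definitions.filter((definition) =>
        definition.specifiers.every((specifier) => linkedIds.indexOf(specifier.id) >= 0));
}

const human = { id: "Homo sapiens" };
const platypus = { id: "Ornithorhynchus anatinus" };
const condor = { id: "Vultur gryphus" };

const definitions: PhyloDefinition[] = [
    { phylonym: "Mammalia", specifiers: [human, platypus] },
    { phylonym: "Avialae", specifiers: [condor] }
];
// A file that only links the two mammal specifiers.
const links: Link[] = [{ specifier: human }, { specifier: platypus }];

const usable = applicableDefinitions(definitions, links).map((d) => d.phylonym);
console.log(usable);
```

Evaluating each surviving definition against the chosen Tree is the separate, harder step.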

Of course, this is not all the application will do. (In fact, I've been done with that part of the programming for a while now.) There will also need to be a lot of programming for saving these data permanently in a database, presenting the data to the user, and making it easier for the user to enter data (for example, by creating methods for coming up with specifier suggestions based on definition statements). This may take a while....

05 November 2007

My First Paper

The inauguration of this blog was just barely in time for me to report my first paper as primary (and sole) author:

KEESEY, T. M. 2007. A mathematical approach to defining clade names, with potential applications to computer storage and processing. Zoologica Scripta 36 (6): 607–621. doi:10.1111/j.1463-6409.2007.00302.x

Here's the abstract, also available here:

Clade names may be objectively defined based on conditions of phylogeny. Definitions usually take one of three forms — node-, branch- or apomorphy-based — but other forms and complex permutations of these forms are also possible. Some database projects have attempted to store definitions of clade names in a manner accessible to computer applications, but, so far, they have only provided ways of storing the most common types of definition. To create a more extensible system, I have taken a mathematical approach to defining clade names. To render definitions accessible to computer storage and analysis, I propose using Mathematical Markup Language (MATHML) with extensions. Since the mathematical approach is granular to the level of the organism, not to fuzzy higher levels such as population or species, it sheds light on some theoretical difficulties with defining clade names. For example, some definitions do not resolve to a single organism as the ancestor, but to sets of organisms which are not ancestral to each other and share common descendants. I term such sets ‘cladogenetic sets’.

If you made it through that, congratulations. Now you may have some questions.

What is a "clade"?

An ancestor and all of its descendants. As an example, mammals form a clade. Fish do not form a clade, since they exclude some descendants (tetrapods). Hoofed mammals ("ungulates") do not form a clade, since their common ancestors were not hoofed (instead, hooves have evolved several times among placental mammals).

What is "branch-based", again?

The PhyloCode is a set of rules being put together to deal with the naming of clades. It recommends certain forms of definition. The main ones (but certainly not the only ones), with examples, are:

node-based. "Mammalia is the final common ancestor of platypuses and humans, and all descendants of that ancestor."
branch-based. "Synapsida is the initial ancestor of humans which is not also ancestral to sand lizards, and all descendants of that ancestor." (The image below represents two branch-based clades, one in red and one in yellow. White dots represent organisms in both clades.)
apomorphy-based. "Avialae is the first ancestor of Andean condors to possess powered flight homologous with that in Andean condors, and all descendants of that ancestor."

(Actual definitions would use proper scientific names instead of "platypuses", "humans", etc. but you get the idea.)

This stands in contrast to the current taxonomic codes, which are rank-based. Definitions under rank-based codes look more like, "Homo is the genus that includes Homo sapiens." There is a very important difference between these two styles of definition. Rank-based definitions are based (at least partly) on subjective opinions, since the ranks (with the possible, but contentious, exception of species) do not have any objective meaning. We all probably learned about kingdoms, classes, orders, families, and genera in biology class, but these ranks don't have any intrinsic meaning. A family of birds might include a few closely related species, while a family of insects might include thousands, with more distant common ancestry.

Phylogenetic definitions, on the other hand, proceed directly from our knowledge of phylogeny. When two researchers disagree on the content of a rank-based taxon, they might be arguing about aesthetics, actual relationships, or both. When they disagree about the content of a phylogenetic taxon, they can only be arguing about actual relationships.

So, what did you do?

Since phylogenetic definitions are based directly on phylogeny, without need for opinions, this means they can be expressed in completely unambiguous language. This includes:

Mathematical formulas.
Computer languages.

As I discuss in the paper, some people have created unambiguous shorthand formulas and unambiguous database schemas for representing phylogenetic definitions. But the previous efforts have all focused on simple definitional formats, ignoring other formats and complex permutations.

Well, la-ti-da. So what?

This means more of the taxonomic process can be automated. With rank-based definitions, there has to be an expert to "feel out" how expansive a genus, family, order, etc. should be. But with phylogenetic definitions, you can feed a computer application the phylogeny encoded in a popular file format (e.g., NEXUS) and taxonomic definitions encoded in a popular file format (MathML), and it can figure out the content referred to by a taxonomic name in fractions of a second.

Okay, so where's the application?

I'm still working on one, called Names on NEXUS. So far it's going well; I just need to refactor and complete the server-side application and touch up the client-side application. Should have some time for that next year.