Data models

Definition of collection, document and annotation

Lynx aims to provide better compliance services. The added value of the Lynx services revolves around better processing of heterogeneous, multilingual documents in the legal domain. Hence, the most important entity is the Lynx Document. Lynx Documents may be grouped in Collections and may be enriched with Annotations. Thus, there are three main entities:

  • Lynx Documents are the basic information units in Lynx: identified pieces of text.
  • Collections are groups of Lynx Documents with any logical relation. There may be one collection per use case, per jurisdiction, etc.
  • Annotations are enrichments of Lynx Documents, such as summaries, translation, recognized entities, etc.

Because most AI algorithms dealing with documents focus on text (manipulation of images, videos or tables is less developed), the essence of a Lynx Document is its text version. Thus, the key element in a Lynx Document is an identified piece of text. This document can be annotated with an arbitrary number of metadata elements (creation date, author, etc.) and optionally structured to allow a minimally attractive visual representation.

Original documents are transformed as represented in the following figure: first, harvesters acquire them from their heterogeneous sources and formats, and they are structured and represented in a uniform manner. Then, they are enriched with annotations (such as named entities: persons, organisations, etc.).

 

 

The elements in a complete Lynx Document, with annotations, are depicted in the following figure. Metadata is defined as a list of attribute-value pairs. Parts are defined as text fragments delimited by two offsets, possibly with a title and a parent, so that they can be nested. Annotations also refer to text fragments delimited by two offsets, and describe such a fragment in different ways (e.g. ‘it refers to a Location which is Madrid, Spain’).
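The structure just described can be sketched as a few Python classes. This is only an illustrative model, not the project's implementation: the class layout, the `description` payload of an annotation and the `fragment` helper are assumptions, and offsets are assumed to be zero-based with an exclusive end.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Part:
    """A text fragment delimited by two offsets, optionally titled and nested."""
    offset_ini: int
    offset_end: int
    title: Optional[str] = None
    parent: Optional[str] = None  # identifier of the enclosing part, if any


@dataclass
class Annotation:
    """A description of a text fragment delimited by two offsets."""
    offset_ini: int
    offset_end: int
    description: dict = field(default_factory=dict)  # hypothetical free-form payload


@dataclass
class LynxDocument:
    """An identified piece of text with metadata, parts and annotations."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)  # attribute -> value or list of values
    parts: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def fragment(self, item) -> str:
        """Return the text covered by a part or an annotation."""
        return self.text[item.offset_ini:item.offset_end]


# Hypothetical document with one annotation on the word "Madrid".
doc = LynxDocument(id="doc-example", text="The office is in Madrid, Spain.")
doc.annotations.append(Annotation(17, 23, {"class": "Location"}))
print(doc.fragment(doc.annotations[0]))  # Madrid
```

Keeping the text only once in the document and letting parts and annotations point into it by offsets avoids duplicating the text in every fragment.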

 

 

Lynx Documents with metadata

The simplest possible Lynx Document as a JSON file is shown in the listing below.

{
 "@context": "http://lynx-project.eu/doc/jsonld/lynxdocument.json",
 "@id": "doc001",
 "@type": "http://lynx-project.eu/def/lkg/LynxDocument",
 "text" : "This is the first Lynx document, a piece of identified text."
}

The first line declares the context (@context), which describes how to interpret the rest of the JSON-LD document; it references an external file. The second line (@id) declares the identifier of the element. The complete URI identifying the document is built from this string and the @base declared in the context. The @type element declares the type of the document and, finally, the text element holds the text of the document. The text is not repeated in the fragments, in order to save space. Alternative transformations of this JSON structure are possible and recommended for specific implementation needs (e.g. OLS in Pilot 1). The JSON-LD version can also be automatically converted into other RDF syntaxes. For example, the Turtle version of the same document follows.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://lkg.lynx-project.eu/res/doc001>
 a <http://lynx-project.eu/def/lkg/LynxDocument> ;
 rdf:value "This is the first Lynx document, a piece of identified text." .
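How the document URI results from @id and @base can be illustrated with a short Python sketch. The base value below is an assumption inferred from the Turtle listing; the actual value is whatever the external context file declares. The helper name `absolute_id` is hypothetical.

```python
from urllib.parse import urljoin

# Assumed value of @base, inferred from the subject URI in the Turtle listing.
BASE = "http://lkg.lynx-project.eu/res/"


def absolute_id(local_id: str, base: str = BASE) -> str:
    """Resolve a (possibly relative) @id against the base URI."""
    return urljoin(base, local_id)


print(absolute_id("doc001"))
```

An @id that is already an absolute URI is returned unchanged by this resolution step.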

Metadata is a collection of pairs: either a property with a list of values (such as subject, with the values "testing" and "documents") or a property with a single value (such as title, with the value "Second Document"). This is better illustrated with the example below.

{
 "@context": "http://lynx-project.eu/doc/jsonld/lynxdocument.json",
 "@id": "doc002",
 "@type": "http://lynx-project.eu/def/lkg/LynxDocument",
 "text" : "This is the second Lynx document.",
 "metadata" : {
 "title": "Second Document",
 "subject": ["testing", "documents"]
 }
}

This is rendered as RDF Turtle in the next listing.

@prefix lkg: <http://lkg.lynx-project.eu/def/lkg/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://lkg.lynx-project.eu/res/doc002>
 a <http://lynx-project.eu/def/lkg/LynxDocument> ;
 lkg:metadata [
 dc:subject "testing", "documents";
 dc:title "Second Document"
 ] ;
 rdf:value "This is the second Lynx document." .
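The correspondence between the JSON metadata block and the property-value pairs of the Turtle rendering can be sketched as follows. This is a simplified illustration, not the actual JSON-LD processing: the mapping table covers only the two keys of the example, and the function name `metadata_pairs` is hypothetical.

```python
# Mapping from JSON metadata keys to RDF properties. Only the two keys used in
# the example are listed; "dc" is bound to http://purl.org/dc/terms/ as in the
# Turtle listing.
PROPERTY_FOR_KEY = {"title": "dc:title", "subject": "dc:subject"}


def metadata_pairs(metadata: dict) -> list:
    """Flatten a metadata block into (property, value) pairs.

    A single value yields one pair; a list of values yields one pair per value.
    """
    pairs = []
    for key, value in metadata.items():
        prop = PROPERTY_FOR_KEY[key]
        values = value if isinstance(value, list) else [value]
        for v in values:
            pairs.append((prop, v))
    return pairs


example = {"title": "Second Document", "subject": ["testing", "documents"]}
for prop, value in metadata_pairs(example):
    print(prop, repr(value))
```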

The language can be declared with the @language JSON-LD keyword, as an additional context element. In the example below, this sets the language tag of strings (RDF literals) to Spanish.

{
 "@context": ["http://lynx-project.eu/doc/jsonld/lynxdocument.json", {"@language": "es"}],
 "@id": "doc003",
 "@type": "http://lynx-project.eu/def/lkg/LynxDocument",
 "text" : "Un documento en español."
}
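The effect of the @language entry can be sketched in Python by rendering the text as a language-tagged Turtle literal. This sketch only inspects inline context entries and ignores whatever the external context file may declare; both function names are hypothetical, and `json.dumps` is used merely as a convenient approximation of Turtle string escaping.

```python
import json


def default_language(context) -> str:
    """Return the @language declared inline in a JSON-LD context, or ""."""
    entries = context if isinstance(context, list) else [context]
    language = ""
    for entry in entries:
        # Remote contexts are plain strings (URLs) and are skipped here.
        if isinstance(entry, dict) and "@language" in entry:
            language = entry["@language"]
    return language


def turtle_literal(value: str, language: str = "") -> str:
    """Render a string as a Turtle literal, with a language tag if given."""
    quoted = json.dumps(value, ensure_ascii=False)
    return f"{quoted}@{language}" if language else quoted


doc = {
    "@context": ["http://lynx-project.eu/doc/jsonld/lynxdocument.json",
                 {"@language": "es"}],
    "@id": "doc003",
    "text": "Un documento en español.",
}
print(turtle_literal(doc["text"], default_language(doc["@context"])))
```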

Lynx Documents with structuring information

Parts and structuring information can be included as shown in the next example. Parts are defined by their offsets (the first and last characters of the excerpt). They can be nested because they have a parent property, and they can optionally be identified. Fragment identifiers can be built as described in the NIF specification. The example below shows nested fragments: Art. 2.1 is a part of Art. 2.

{
 "@context": "http://lynx-project.eu/doc/jsonld/lynxdocument.json",
 "@id": "doc004",
 "@type": "http://lynx-project.eu/doc/lkg/LynxDocument",
 "text": "Art.1 This is the fourth Lynx document. Art.2 This is the fourth Lynx document. Art 2.1. Empty.",
 "metadata": {
 "title": ["A document with parts."]
 },
 "parts": [
 {
 "offset_ini": 0,
 "offset_end": 39,
 "title": "Art.1"
 },
 {
 "@id": "http://lkg.lynx-project.eu/res/doc004/#offset_41_94",
 "offset_ini": 41,
 "offset_end": 94,
 "title": "Art.2"
 },
 {
 "offset_ini": 80,
 "offset_end": 94,
 "title": "Art.2.1",
 "parent": {
 "@id": "http://lkg.lynx-project.eu/res/doc004/#offset_41_94"
 }
 }
 ]
}

In the following example, the Turtle RDF version is shown.

@prefix eli: <http://data.europa.eu/eli/ontology#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix lkg: <http://lkg.lynx-project.eu/def/lkg/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://lkg.lynx-project.eu/res/doc004>
 a <http://lynx-project.eu/doc/lkg/LynxDocument> ;
 eli:has_part [
 nif:beginIndex 0 ;
 nif:endIndex 39 ;
 dc:title "Art.1"
 ], <http://lkg.lynx-project.eu/res/doc004/#offset_41_94>, [
 lkg:parent <http://lkg.lynx-project.eu/res/doc004/#offset_41_94> ;
 nif:beginIndex 80 ;
 nif:endIndex 94 ;
 dc:title "Art.2.1"
 ] ;
 lkg:metadata [ dc:title "A document with parts." ] ;
 rdf:value "Art.1 This is the fourth Lynx document. Art.2 This is the fourth Lynx document. Art 2.1. Empty." .
<http://lkg.lynx-project.eu/res/doc004/#offset_41_94>
 nif:beginIndex 41 ;
 nif:endIndex 94 ;
 dc:title "Art.2" .
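How offsets select an excerpt, and how a fragment identifier of the form used above can be built, is sketched below. The sketch assumes zero-based offsets with an exclusive end (as in Python slicing); the function names are hypothetical, and the identifier pattern simply mirrors the one in the example.

```python
TEXT = ("Art.1 This is the fourth Lynx document. "
        "Art.2 This is the fourth Lynx document. Art 2.1. Empty.")
DOC_URI = "http://lkg.lynx-project.eu/res/doc004"


def excerpt(text: str, offset_ini: int, offset_end: int) -> str:
    """Return the text of a part; the end offset is treated as exclusive."""
    return text[offset_ini:offset_end]


def fragment_uri(doc_uri: str, offset_ini: int, offset_end: int) -> str:
    """Build a fragment identifier following the pattern of the example."""
    return f"{doc_uri}/#offset_{offset_ini}_{offset_end}"


print(excerpt(TEXT, 0, 39))           # the part titled "Art.1"
print(fragment_uri(DOC_URI, 41, 94))  # identifier of the part titled "Art.2"
```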

In terms of UML, a couple of classes may be enough to represent documents and their parts as objects.

An example with parts, in a simplified JSON form, follows:

{
  "id": "0001",
  "text": "Complete Text Example. 1. Introduction. This is a text 2. Objectives. We foresee several objectives, in particular: 2.1 Objective 1. To make it simple. 2.2. Objective 2. To make it fast. 3. Is this a third objective or a third section?  .",
  "parts": [
    {
      "id": "part01",
      "offset_ini": "17",
      "offset_end": "1200",
      "title": "1. Introduction",
      "parent": "0001"
    },
    {
      "id": "part02",
      "offset_ini": "1201",
      "offset_end": "2200",
      "title": "2. Objectives",
      "parent": "0001"
    },
    {
      "id": "part021",
      "offset_ini": "1210",
      "offset_end": "1300",
      "title": "2.1 Objective 1",
      "parent": "part02"
    }
  ]
}

In each part, offset_ini is the position of the first character of the block (e.g. just before the "1" of "1. Introduction") and offset_end is the position of its last character. The offset_ini of a part should be the offset_end of the preceding sibling part plus one. The offset values shown are illustrative and do not correspond to positions in the short sample text.
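The constraints implied by this kind of nesting (every parent must exist, and a nested part must lie within its parent) can be checked with a small routine. This is a sketch under assumptions: the function name `check_parts` and the exact rules are illustrative, not part of the Lynx specification.

```python
def check_parts(doc_id: str, parts: list) -> list:
    """Return a list of human-readable problems found in a parts list."""
    by_id = {p["id"]: p for p in parts}
    problems = []
    for p in parts:
        ini, end = int(p["offset_ini"]), int(p["offset_end"])
        if ini > end:
            problems.append(f"{p['id']}: offset_ini is greater than offset_end")
        parent = p.get("parent")
        if parent == doc_id:
            continue  # top-level part, attached directly to the document
        if parent not in by_id:
            problems.append(f"{p['id']}: unknown parent {parent!r}")
            continue
        outer = by_id[parent]
        if ini < int(outer["offset_ini"]) or end > int(outer["offset_end"]):
            problems.append(f"{p['id']}: not contained in {parent}")
    return problems


PARTS = [
    {"id": "part01", "offset_ini": "17", "offset_end": "1200", "parent": "0001"},
    {"id": "part02", "offset_ini": "1201", "offset_end": "2200", "parent": "0001"},
    {"id": "part021", "offset_ini": "1210", "offset_end": "1300", "parent": "part02"},
]
print(check_parts("0001", PARTS))  # an empty list means no problems were found
```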

List of recommended metadata fields and their representation

| Group | Property | Usage | RDF property |
|---|---|---|---|
| basic elements | id | Lynx identifier of the document | dct:identifier |
| | text | Text of the document | rdf:value |
| | parts | Parts of the document | eli:has_part |
| general | type | Type of document (legislation, caselaw, etc.) | dct:type |
| | rank | Sub-type of document (constitution, law, etc.) | eli:type_document |
| | language | Language of the document | dct:language |
| | jurisdiction | Jurisdiction using ISO | eli:jurisdiction |
| | wasDerivedFrom | Original URL if the document was extracted from the web | prov-o:wasDerivedFrom |
| | title | Title of the document | dct:title |
| | hasAuthority | Authority issuing the document | lkg:hasAuthority |
| | nick | Alternative names of the document | foaf:nick |
| | version | Consolidated, draft or bulletin | eli:version |
| | subject | Subjects or keywords of the document | dct:subject |
| identifiers | id_local | Local identifier (e.g. BOE-A-2019-1234) | eli:id_local |
| | identifier | Official identifier (e.g. ELI etc.) | dct:identifier |
| dates | first_date_entry_in_force | Date when the document enters into force | eli:first_date_entry_in_force |
| | date_no_longer_in_force | Date when repealed / expired | eli:date_no_longer_in_force |
| | version_date | Date of publication of the document | eli:version_date |
| mappings | hasEli | Official identifier (ELI, ECLI or equivalent) | lkg:hasEli |
| | hasPDF | Link to the PDF version | lkg:hasPDF |
| | hasDbpedia | Link to the equivalent DBpedia version | lkg:hasDbpedia |
| | hasWikipedia | Link to the equivalent Wikipedia version | lkg:hasWikipedia |
| | sameAs | Equivalent document | owl:sameAs |
| | seeAlso | Related documents | rdfs:seeAlso |
| internal | creator | Creators of the documents in Lynx (person or software) | dct:creator |
| | created | Date when created in Lynx (internal) | dct:created |

The following is a list of some NIF-related properties and their values. 

| Element | Meaning | Values / example |
|---|---|---|
| itsrdf:taClassRef | Class of the annotated context | dbo:Person, dbo:Location, dbo:Organization, dbo:TemporalExpression |
| itsrdf:taIdentRef | URL from an external resource, such as DBpedia, Wikidata, Geonames, etc. | http://dbpedia.org/resource/London |
| itsrdf:taConfidence | Confidence | [0..1] |
| nif:summary | Summary | text |
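As an illustration of how these properties might be consumed, the sketch below filters annotations by confidence. The dictionary layout of an annotation, the sample values and the function name `confident` are assumptions made for this example; only the property names and the [0..1] range come from the table.

```python
def confident(annotations: list, threshold: float = 0.5) -> list:
    """Keep annotations whose itsrdf:taConfidence is at least `threshold`."""
    kept = []
    for ann in annotations:
        score = ann["itsrdf:taConfidence"]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"confidence out of range [0, 1]: {score}")
        if score >= threshold:
            kept.append(ann)
    return kept


# Hypothetical annotations; the confidence values are made up.
ANNOTATIONS = [
    {"itsrdf:taClassRef": "dbo:Location",
     "itsrdf:taIdentRef": "http://dbpedia.org/resource/London",
     "itsrdf:taConfidence": 0.9},
    {"itsrdf:taClassRef": "dbo:Person",
     "itsrdf:taConfidence": 0.3},
]
print([a["itsrdf:taClassRef"] for a in confident(ANNOTATIONS)])
```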


The following table lists the prefixes used in this section.

| Vocabulary | Prefix | URL |
|---|---|---|
| LKG Ontology | lkg | http://lkg.lynx-project.eu/def/ |
| Dublin Core | dct | http://purl.org/dc/terms/ |
| RDF | rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
| European Legislation Ontology | eli | http://data.europa.eu/eli/ontology# |
| W3C Provenance Ontology | prov-o | https://www.w3.org/TR/prov-o/ |
| Friend of a Friend Ontology | foaf | http://xmlns.com/foaf/spec/ |
| NLP Interchange Format | nif | http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core# |
| ITS 2.0 / RDF Ontology | itsrdf | https://www.w3.org/2005/11/its/rdf# |
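Prefixed names such as dct:title are shorthand for full URIs. A minimal sketch of the expansion follows, using four of the entries listed in this section; the function name `expand` is hypothetical, and the remaining prefixes are omitted for brevity.

```python
# Subset of the prefixes listed in this section.
PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "eli": "http://data.europa.eu/eli/ontology#",
    "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'dct:title' into a full URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


print(expand("dct:title"))
print(expand("eli:has_part"))
```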

The elements of the LKG vocabulary are described in more detail in the draft ontology.