The SEEK platform is open-source and built using Ruby on Rails. The source code can be downloaded from a BitBucket code repository (https://bitbucket.org/seek4science/seek/wiki/Home), but it is also available as a virtual machine image, in order to allow easy deployment of the whole system (http://www.sysmo-db.org/seek-vm). To date, 30 different instances of the SEEK virtual machine have been downloaded and instantiated for data management support in systems biology projects. These instances support different groups of scientists, ranging from large consortia, like SysMO, to SEEKs that support particular institutes, or individual laboratories.
Following an Agile methodology, new versions and updates of SEEK are released frequently and incrementally. On average, minor upgrades containing bug fixes are released monthly, and major releases appear two or three times a year. Virtual machine packages are also periodically released for key versions that contain significant new features, after a longer period of stabilisation.
The SEEK is a collection of components designed to work together, but it is offered as a set of configurable plug-ins to suit individual user requirements (see section Modularisation for further details). These components can be combined, customised and included as required for any particular instance. Figure 1 illustrates the different SEEK components and their relationships to each other. The core components include the Assets Catalogue (for storing or linking to data, models and protocols), the Yellow Pages directory of consortium members and their expertise, the access control and versioning framework, the Investigations, Studies and Assays (ISA) infrastructure, the harvesting and indexing framework, and the APIs, interfaces and links to external systems biology resources.
The SEEK can be used as either a metadata catalogue or a centralised repository. Data, models and other research assets can be uploaded centrally, or they can be stored remotely and referenced in SEEK with specified minimal metadata. Remote storage includes local content management systems, or other public databases, such as GEO [12] or BioModels [13]. In practice, most SEEK instances operate exclusively as a repository, although the SysMO-SEEK instance (described in Section SysMO-SEEK) originally contained both deposited data and catalogue links. As projects began to reach completion and the emphasis changed from daily collaboration to long-term stewardship and dissemination, all users switched to submitting directly to the central repository. The only external links were to data that had already been deposited in public repositories. The SysMO-SEEK instance provides a 10-year storage guarantee, which enables consortium members to conform to funding agency requirements for data management and provides a persistent URL for published work. Figure 2 shows a screenshot of a model that has been shared and published on SysMO-SEEK.
Since the initial description of the SEEK platform [9], its functionality and utility have been extended. In addition to data management, the SEEK is a platform for exploring and analysing data and models, and an environment for researchers to collaborate with one another. Major additions to the platform include:
The following sections describe the new developments in SEEK and the differences in use and configuration between two of the largest adopting consortia.
Publishing framework
All items uploaded to SEEK have a persistent URL and are linked to the individual scientist who uploaded them. This allows direct references to the items and their creators. These links help to promote credit through data citation as well as rewarding contributions by individuals in a consortium. It is also possible to generate and register a DOI for any SEEK asset that is public and visible (for example https://dx.doi.org/10.15490/seek.1.datafile.1152.2). SEEK instance administrators can add this feature by registering for a DataCite username and password.
In SEEK, individuals are ultimately responsible for registering and sharing their own assets, allowing them to control who has access and when. SEEK assets can be shared publicly, shared with the whole consortium, or shared with named individuals and groups. By providing a fine-grained sharing model, the SEEK ensures that scientists remain in control. To further encourage the dissemination of assets after publication, one-click public release is available for related assets. For example, if a scientist makes a model publicly accessible, a report of all data that was used for its construction, simulation, or validation will be displayed. She can select all or a subset of these data to publish along with the model.
For large consortia, whilst it is desirable to incentivise individuals, it is also prudent to release assets only when it benefits the consortium. If further safeguards and administration are required, the roles of Project Manager, Asset Manager and Gatekeeper can be configured. The Project Manager is responsible for assigning project membership and administering project-level information, the Asset Manager can assume responsibility for managing assets if people leave the consortium, and the Gatekeeper can act as a final checkpoint before assets are shared publicly.
Semantic web framework
The majority of data in SEEK are uploaded as Excel spreadsheets and the majority of models are uploaded as SBML (Systems Biology Markup Language) files [14]. Uploaded content is indexed using Lucene (http://lucene.apache.org/) and metadata is extracted and stored in RDF (Resource Description Framework, http://www.w3.org/RDF/), which is the W3C standard for data interchange on the web.
Extracting and storing data in RDF allows more complex queries to be formulated either across an individual SEEK instance, or potentially across multiple federated resources via the Linked Data cloud (http://lod-cloud.net/).
Individual SEEK instances can be configured with or without RDF support. For those that extract and store data in RDF, a SPARQL endpoint (http://www.w3.org/TR/sparql11-overview/) for querying the RDF can be exposed. An example of a SEEK SPARQL endpoint populated from SysMO-SEEK metadata, and a collection of example queries, can be found at https://wiki.sysmo-db.org/seek/sparql-examples. Documentation for setting up the RDF triplestore and configuring the RDF framework is available at:
http://seek4science.org/sites/default/files/seekdocs-0.22.0/doc/SETTING-UP-VIRTUOSO.html
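As an illustration of how a client application might address such an endpoint, the following minimal Python sketch builds a SPARQL 1.1 Protocol GET request using only the standard library. The endpoint address is a hypothetical placeholder, not a real SEEK URL, and the query is deliberately generic: it uses no JERM-specific terms, since those depend on the ontology version in use.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address (an assumption for illustration only);
# substitute the SPARQL endpoint exposed by your own SEEK instance.
ENDPOINT = "https://seek.example.org/sparql"

# A generic query that lists ten arbitrary triples from the store.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    # The protocol passes the query text in the 'query' URL parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the query. The request is only constructed here, so the sketch runs without network access.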
Currently, this is an advanced user feature. We do not anticipate that many SEEK users will search SEEK contents directly from the SPARQL endpoint. Instead, the SPARQL endpoint will be used in visualisation and analysis applications, in order to make complex querying more accessible. A detailed description of the semantic web framework and the types of complex queries requested and designed by SEEK users is available in Wolstencroft et al. 2013 [15]. This paper also compares and evaluates Lucene/Solr querying against RDF/SPARQL.
A major advantage of serving SEEK metadata as RDF is that it enables greater interoperation with other related resources, such as ArrayExpress [2], ChEBI [16], the collective content of the EBI RDF platform [17], or Bio2RDF [18]. A major finding in Wolstencroft et al. 2013 was that Lucene/Solr and RDF/SPARQL queries performed equally well on many user questions, with the exception of those questions that involved the inclusion of external data or ontology sources. The RDF/SPARQL interface can therefore extend the capabilities of SEEK. The recent production of an RDF representation of ISA-TAB also provides further possibilities for interoperability between SEEK and other ISA-structured resources [19].
The JERM (Just Enough Results Model) is the underlying data model in SEEK [15]. Like MIAME or MIRIAM, it is designed to describe the minimum information, which means it describes the basic set of metadata elements required in order to find and interpret SEEK data. The JERM ontology (available from the BioPortal [20], http://bioportal.bioontology.org/ontologies/1488) formalises these relationships. RDF extracted from SEEK conforms to this ontology model, allowing complex queries and inferences over the data.
The JERM describes the relationships between the SEEK assets and the content of those assets. For example, for each dataset uploaded to SEEK, the JERM describes what type of experiment it was, what was measured, and what the values in the dataset mean. The JERM captures the core elements of metadata shared by existing minimum information guidelines, allowing users to comply with these standards as well as capturing the information required for linking in SEEK. Where different types of experiments require the same metadata elements, datasets can be aggregated. There is no requirement to homogenise content that is unique to any one experiment type.
This flexibility is a major advantage of the RDF approach and contrasts with relational database approaches that would require changes to the underlying data model in order to accommodate new experimental data types. SEEK users are therefore able to add new data types as their experimental approaches expand. To assist users in producing data in JERM-compliant formats, spreadsheet templates are provided that encapsulate the JERM and other minimum metadata standards. These templates have been augmented with ontology term selection functions, using the RightField [10] semantic annotation tool (also developed in this project). RightField enables lists of ontology terms (from web locations such as BioPortal, or from local files) to be embedded into specific spreadsheet cells. As scientists annotate their data, they can select appropriate ontology terms from simple drop-down lists, without requiring any knowledge of the ontology or its structure.
SEEK users can select from a range of JERM-compliant spreadsheet templates from the help section in the SEEK platform (e.g. the SysMO template collection is available from https://seek.sysmo-db.org/help/templates). The main advantage of this approach is that the majority of data is already collected via spreadsheets. To comply with the standards recommended by SEEK, users do not need to make large modifications to their current working practices and they do not need to use new applications.
For models, the annotation tool that is integrated in the JWS simulator follows the MIRIAM guidelines in annotating SBML models, using the Semantic SBML Web Services [21]. The MIRIAM specification requires species and reactions in models to be annotated with official identifiers and recommends the use of terms from community ontologies to describe model elements. MIRIAM annotation ultimately improves interlinking with experimental datasets because components in models are annotated with the same biological identifiers as the datasets.
Data and model exploration
Data and models in systems biology investigations are inherently interlinked. The SEEK provides tools to assist users in understanding and exploring those links and visualising data and models.
The ISA structure (Investigations, Studies and Assays) is central to the organisation and visualisation of all assets in the SEEK. An ISA tree-view describes which Assays belong to which Studies and which Studies belong to which Investigations. Assays in the ISA-TAB specification refer only to experimental assays. In SEEK, however, we extend this description to encapsulate modelling analyses and bioinformatics analyses. Conceptually, they are the same, but they have different sets of metadata descriptions. For example, a modelling assay should be defined by the biological problem being addressed by the model and the modelling framework being used. This extension allows researchers to associate modelling and experimental activities with one another, giving a complete overview of all work associated with a particular investigation. An example can be seen in Fig. 3.
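The nesting described above can be sketched as a small data structure. The Python below is illustrative only: the class names, fields and example titles are assumptions made for this sketch and do not reproduce SEEK's own Rails models. It shows an Investigation containing Studies, each containing Assays that are tagged as experimental or modelling, together with a simple indented tree-view.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    kind: str  # "experimental" or "modelling" (SEEK's extension of ISA)


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def tree_view(investigation: Investigation) -> List[str]:
    """Return an indented listing: Investigation > Studies > Assays."""
    lines = [investigation.title]
    for study in investigation.studies:
        lines.append("  " + study.title)
        for assay in study.assays:
            lines.append(f"    {assay.title} [{assay.kind}]")
    return lines


# Hypothetical example content.
example = Investigation(
    "Glucose metabolism",
    [Study("Steady-state analysis", [
        Assay("Enzyme kinetics measurements", "experimental"),
        Assay("Kinetic model of glycolysis", "modelling"),
    ])],
)
print("\n".join(tree_view(example)))
```

Keeping both assay kinds in the same list under a Study mirrors the idea that modelling and experimental activities sit side by side within one investigation.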
There are a number of different options for visualising and analysing the content of the data and models in SEEK. For example, the Explore Data application allows users to view the contents of spreadsheets without downloading them. This view can also be used to add further annotation to the data, or to select data values from the spreadsheets and plot them. Plots can be saved as annotations to the data sheet, or they can be exported and saved separately.
For visualising and exploring models, the JWS Online simulation environment is embedded in SEEK. Through JWS Online, users can view their models in SBGN (Systems Biology Graphical Notation) [22], simulate them with the data and parameters provided, or simulate them with alternative values, which could be from other SEEK files, or from elsewhere. Models that are uploaded with a Cytoscape Web compatible file (XGMML) can also be visualised using a Cytoscape plugin [23].
The current MIRIAM standard for model annotation enables the identification of model species and parameters, but recording the source of parameter values is not a minimum requirement and may therefore be omitted in many cases. Experimental data containing these values may be included in a table in a publication, but is not readily accessible from model repositories and not amenable to computational processing. The SEEK provides a common interface to display and preserve links between data and models, allowing modellers to record which data was used for model construction and which for validation. The data itself is linked to all the contextual information required for others to interpret the results and determine the validity of using those values in the model. For example, data can be linked to the standard operating procedures and protocols that were followed during its creation. Figure 3 shows an example of intra-experimental connections. Data that is associated with a model can be annotated as construction, validation, or simulation data.
By sharing and linking data and models, and allowing model simulations, the SEEK promotes the reuse of existing resources. Users can simulate models with the original data, or run new simulations with other data in SEEK. Users can also search and access external modelling repositories, such as BioModels, in order to further promote reuse.
On-going work with the SED-ML model simulation format [24] enables SEEK users to record these in silico simulation results alongside experimental values, or directly compare simulation data with experimental results. The JWS simulator in SEEK allows the export of any model simulation as a SED-ML archive. Through a collaboration with the University of Rostock, BiVeS (the BioModel Version Control System, https://sems.uni-rostock.de/bives/) has recently been integrated with SEEK. This supports the comparison of SBML models, detecting differences at the XML level and providing a summary of the differences along with a graphical representation. To use this feature, users must have an account in SEEK. For a demonstration, guest users can log into a demo version of SEEK (https://demo.sysmo-db.org/models/33).
Modularisation
SEEK is developed using a modular approach, making it easy to add and remove given features and behaviours. There are configuration points for turning certain features on and off, supporting customisation when setting up a new installation of SEEK for certain purposes. These are defined in the document:
https://github.com/seek4science/seek/blob/master/lib/seek/config_setting_attributes.yml
An example of the configuration options used in the SysMO SEEK can be found here:
https://github.com/seek4science/seek/blob/master/config/initializers/seek_configuration.rb-openseek
Developers wanting to adapt SEEK can use its modular nature to integrate new features or modify existing ones more easily. We leverage the Rails built-in Plugin and Gem system. All the plugins and gems we use are listed in our Gemfile: https://github.com/seek4science/seek/blob/master/Gemfile (not all were created by and for SEEK).
SEEK is also being updated to use jQuery and the Bootstrap framework to make it easier to theme and customise the user interface following modern conventions.
Programmatic access
SEEK provides a RESTful API, which is currently read-only. Any resource in SEEK can also be represented as XML, by requesting it in this format instead of HTML through content negotiation. This is also possible by appending .xml to the URL, for example https://demo.sysmo-db.org/investigations/2.xml. This XML is described by, and validated against, an XSD schema, available at https://github.com/seek4science/seek/blob/master/public/2010/xml/rest/schema-v1.xsd.
As well as XML, SEEK also provides an RDF representation, for example https://demo.sysmo-db.org/investigations/2.rdf.
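The two ways of selecting a representation, a format extension on the URL or HTTP content negotiation, can be sketched in Python as follows. The helper names are hypothetical, and the media types in the `Accept` header are assumptions about what a given instance serves rather than documented SEEK behaviour. Requests are built but not sent.

```python
from urllib.request import Request

# Assumed media types for content negotiation (not confirmed by SEEK docs).
MEDIA_TYPES = {"xml": "application/xml", "rdf": "application/rdf+xml"}


def resource_url(base: str, resource_type: str, resource_id: int, fmt: str) -> str:
    """Return the URL of a resource with a format extension appended."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{base.rstrip('/')}/{resource_type}/{resource_id}.{fmt}"


def negotiated_request(base: str, resource_type: str, resource_id: int, fmt: str) -> Request:
    """Build (but do not send) a request that selects the format via Accept."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: {fmt}")
    url = f"{base.rstrip('/')}/{resource_type}/{resource_id}"
    return Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


print(resource_url("https://demo.sysmo-db.org", "investigations", 2, "xml"))
print(resource_url("https://demo.sysmo-db.org", "investigations", 2, "rdf"))
```

The extension form is convenient for links and quick inspection in a browser, whereas the `Accept` header keeps a single canonical URL per resource.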
Future plans include updating the RESTful API to support JSON and adding write access for some key actions, such as adding a data file or defining an assay.