This guide is intended for maintainers and developers of InvenioRDM itself.
The guide provides a high-level overview of the core software architecture of InvenioRDM.
InvenioRDM has a layered architecture that consists of three layers:
- Presentation layer
- Service layer
- Data access layer
There is a strict data flow between the layers, and each layer has very specific responsibilities. It's highly important that you as a developer know the basic principles of the data flow and each layer's responsibilities. Failing to understand the basic data flow leads to using the wrong objects for the wrong things, which eventually turns into messy, unmaintainable code.
Data flow basics
The diagram below shows a simplified view of the data flow in the architecture.
The presentation layer parses incoming requests and routes them to the service layer. This involves sending and receiving data in multiple formats and translating these into an internal representation, as well as parsing arguments from an HTTP request (e.g. the query string parameters).
The service layer is completely independent of the presentation layer and can be used by many different presentation interfaces such as REST APIs, CLIs and Celery tasks. The service layer contains the overall control flow and is responsible for e.g. checking permissions and performing semantic data validation.
The data access layer is responsible for ensuring data integrity, harmonizing data access to different storages as well as fetching and storing the data in the underlying systems.
The data flow between the layers is strictly limited to a few well-defined objects to ensure a clean separation of concerns. The presentation layer communicates with the service layer via e.g. a record projection (i.e. a view of a record localised to a specific identity). The service layer communicates with the data access layer via e.g. a record entity that provides data abstraction, syntactic data validation, and a strong programmatic API.
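The flow described above can be sketched in a few lines of plain Python. This is only an illustration of the layering, not InvenioRDM code: `Record`, `RecordService`, `handle_get`, the in-memory `_STORE` and the owner-based permission rule are all hypothetical names and assumptions made for the example.

```python
from dataclasses import dataclass


# --- Data access layer: a record entity with a programmatic API ---
@dataclass
class Record:
    """Hypothetical record entity; a real one would be backed by the database."""

    id: str
    data: dict
    owner: str


# In-memory stand-in for the primary storage (assumption for this sketch).
_STORE = {"abc": Record(id="abc", data={"title": "Demo"}, owner="alice")}


# --- Service layer: permissions and control flow, no knowledge of HTTP ---
class PermissionDenied(Exception):
    pass


class RecordService:
    def read(self, identity: str, record_id: str) -> dict:
        record = _STORE[record_id]
        if identity != record.owner:
            raise PermissionDenied(record_id)
        # Return a projection: a view of the record for this identity.
        return {"id": record.id, "title": record.data["title"], "can_edit": True}


# --- Presentation layer: parse input, call the service, serialize output ---
def handle_get(path: str, user: str) -> tuple[int, dict]:
    record_id = path.rstrip("/").split("/")[-1]
    try:
        return 200, RecordService().read(user, record_id)
    except PermissionDenied:
        return 403, {"message": "Permission denied"}
```

Note that only the presentation function knows about paths and status codes, and only the service decides who may read the record.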
Tip: Where do you belong?
A key question you should always ask yourself when designing or writing code is where your code belongs in the architecture:
- Is it a presentation, service, or data access layer object?
- Is the object crossing boundaries between layers?
Answering where your code belongs helps identify and disentangle responsibilities.
Data access layer
The data access layer is responsible for:
- Fetching and storing data on primary (the database) and secondary storage (Elasticsearch/OpenSearch, cache, files, ...).
- Harmonizing data access to the same object on primary and secondary storages (e.g. a record in the database vs in the search index).
- Ensuring data integrity and managing relations among data objects.
The data access layer usually lives inside an Invenio module in a package named records. It may consist of:
- Record APIs
- JSONSchemas
- Elasticsearch mappings
- SQLAlchemy models
- System fields
- Dumpers
The data access layer serves two purposes:
- Provide a strong programmatic API that produces a clean, simple and reliable control flow in the service layer.
- Persist our business objects on data storage in a reliable and performant way.
Tip: Messy service layer?
If your service layer code looks messy, you likely need to work on your data access layer.
A typical example is the service layer doing data-wrangling with dictionaries, for instance a conditional get on a dictionary key (e.g. data.get('...')), or having to convert back and forth between data types (e.g. UUIDs to/from strings).
The data layer is built around the following guiding principles:
- One data representation: The service layer should work with one and only one data representation of an entity, regardless of whether the entity was retrieved from primary or secondary storage.
- One primary storage, many secondary storages: The primary version of a record exists in one and only one copy on the primary storage (the database). Multiple secondary copies may, however, exist in the search index.
- Idempotence of dumping/loading: Dumping and loading to/from secondary storage (such as the search index) must produce the same record.
- Denormalization over normalization: If we have to choose, we usually prefer fast read speed over fast write speed.
- Data versioning: We version data and rely heavily on optimistic concurrency control for detecting conflicts and determining stale secondary copies.
The record API is the primary programmatic API that the service layer uses to work with the data access layer. The record API ensures data integrity and manages the life-cycle of the record itself and related objects such as persistent identifiers and files.
The record is in charge of:
- defining the structural schema that data is validated against (using JSONSchemas).
- defining search index routing and indexing behaviour.
- managing the life-cycle of an associated persistent identifier.
- data versioning.
- state management.
A record is usually defined using a declarative API named system fields based on Python data descriptors.
The JSONSchema defines the structure of a JSON document we store in the database. Its main responsibility is structural validation of the JSON document. The best analogy is a database table schema. Most importantly, it is NOT responsible for business-level validation of the JSON document.
A good example of this is making a field a required property. It's correct to require a property if you would have defined the corresponding database table column as NOT NULL. It's wrong to require a property if the requirement is that the user must enter a value in a certain field (because this is business-level validation, and you may want to store partially valid documents).
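The distinction can be illustrated with a minimal sketch. It does not use a real JSONSchema validator: `validate_structure` is a tiny hand-written stand-in that only checks required keys and types, and the schema, the `validate_for_publish` rule and the field names are assumptions made for the example.

```python
# Structural schema: only what the storage format itself requires.
# Comparable to NOT NULL and column types in a table definition.
STRUCTURAL_SCHEMA = {
    "required": ["id"],
    "types": {"id": str, "title": str},
}


def validate_structure(doc: dict) -> list[str]:
    """Stand-in for a JSONSchema validator (required keys and types only)."""
    errors = [
        f"missing: {key}"
        for key in STRUCTURAL_SCHEMA["required"]
        if key not in doc
    ]
    for key, expected in STRUCTURAL_SCHEMA["types"].items():
        if key in doc and not isinstance(doc[key], expected):
            errors.append(f"wrong type: {key}")
    return errors


def validate_for_publish(doc: dict) -> list[str]:
    """Business-level rule owned by the service layer: a title must be entered."""
    return [] if doc.get("title") else ["title is required to publish"]


# A partially valid draft: it can be stored, but it cannot be published yet.
draft = {"id": "abc"}
```

Here the draft passes `validate_structure` and fails `validate_for_publish`, which is exactly the situation that would be impossible if the title were marked as required in the structural schema.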
Modules:
- Invenio-Records: Defines the high-level APIs for the Record API, SQLAlchemy models, system fields and dumpers.
- Invenio-JSONSchemas: Provides a registry for JSONSchemas available to the application.
The search mappings define how records are indexed and made searchable. Records are denormalised when indexed to provide high performance for searches over the records. The mapping MAY therefore define additional fields compared to the JSONSchema.
Dumpers are responsible for dumping and loading records prior to storing/fetching them on secondary storage (e.g. the search index), and play a key role in harmonizing data access to records from primary and secondary storages.
Dumpers are specific to a secondary storage system (e.g. a search dumper, a file dumper, ...).
The dump and load of a dumper MUST be idempotent, i.e. record == Record.load(record.dump()). This ensures that, regardless of whether a record was retrieved from primary or secondary storage, it has the same data and works in the same manner.
For instance, the Extended Date Time Format dumper works in the following manner:
- The dump adds a start and end date range so that the EDTF can be queried by Elasticsearch.
- The load removes the start and end date fields from the search document.
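The dump/load round trip can be sketched as follows. This is a simplified illustration and not the actual EDTF dumper: the `DateRangeDumper` class, the `date` and `date_range` field names, and the restriction to full dates or `START/END` intervals are assumptions made for the example (a real EDTF dumper would also have to expand partial dates such as a bare year).

```python
import copy


class DateRangeDumper:
    """Hypothetical search dumper for a 'date' field ('START/END' or one date)."""

    def dump(self, record: dict) -> dict:
        doc = copy.deepcopy(record)
        start, _, end = doc["date"].partition("/")
        # Add a range the search engine can query; the original value is kept.
        doc["date_range"] = {"gte": start, "lte": end or start}
        return doc

    def load(self, doc: dict) -> dict:
        record = copy.deepcopy(doc)
        # Remove what dump() added so the loaded record equals the original.
        record.pop("date_range", None)
        return record


dumper = DateRangeDumper()
record = {"title": "Demo", "date": "2020-01-01/2020-03-31"}
search_doc = dumper.dump(record)
```

Because `load` removes exactly what `dump` added, `dumper.load(search_doc) == record` holds, which is the idempotence requirement stated above.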
System fields are responsible for:
- providing managed access to a top-level property in a record
- managing relations with other objects
- hooking into the record life-cycle
System fields provide a declarative programmatic API that makes it easier to work with records and related objects. Under the hood, system fields are Python data descriptors.
A key design principle for system fields is that an instance of a system field manages a single namespace of a record, so that system fields do not conflict. For instance, an access system field manages the top-level access key in a record.
System fields participate in the dumping/loading of records from secondary storage by hooking into the record life-cycle. The difference between system fields and dumpers is that a dumper produces a dump for a specific secondary storage system, while system fields produce the same dump for all secondary storage systems.
System fields may be used to manage relations to other objects, and can work similarly to a foreign key.
Applications of system fields are vast, but some examples include:
- Add $schema to the record to ensure JSON schema validation.
- Create, update and delete persistent identifiers for records and serialize them into the record.
- Ensure a certain property on the JSON document is operated as a set.
System fields to a large degree avoid building inheritance among record APIs and instead provide a declarative way of composing a record API class.
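To make the descriptor idea concrete, here is a minimal sketch of a field that manages one top-level key and exposes it as a set. It is not the Invenio-Records system field API: `SetField`, the dict-based `Record` class and the `keywords` key are hypothetical and only illustrate the pattern with plain Python data descriptors.

```python
class SetField:
    """Hypothetical system field: exposes one top-level key of a record as a set."""

    def __init__(self, key: str):
        self.key = key

    def __get__(self, record, owner=None):
        if record is None:  # accessed on the class, not on an instance
            return self
        return set(record.get(self.key, []))

    def __set__(self, record, values):
        # Store a sorted list so the underlying document stays JSON-compatible.
        record[self.key] = sorted(set(values))


class Record(dict):
    """Record API composed declaratively; each field owns exactly one key."""

    keywords = SetField("keywords")


record = Record({"title": "Demo"})
record.keywords = ["b", "a", "b"]
```

After the assignment, the underlying document holds `["a", "b"]` under `keywords`, while `record.keywords` returns the set `{"a", "b"}`. Further fields can be added to `Record` as class attributes without any inheritance between record classes.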
SQLAlchemy record models are responsible for storing the master version of a record (i.e. the primary storage) and provide database independence. All record models share a few common properties:
- A JSON column for storing the JSON-encoded document of a record.
- An internal UUID identifier.
- Creation and modification timestamps.
- Version counter for optimistic concurrency control.
UUIDs are used because they are storage efficient (128 bits) and random so that an application server can generate an id with low chance of collision.
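How a version counter supports optimistic concurrency control can be sketched without a database. The `RecordRow` class, the `update` helper and the `StaleDataError` exception below are hypothetical names for the example. In a real SQLAlchemy model the version check happens as part of the UPDATE statement rather than in Python.

```python
import uuid


class StaleDataError(Exception):
    """Raised when an update is based on an outdated version of the row."""


class RecordRow:
    """Hypothetical stand-in for a record model row."""

    def __init__(self, json: dict):
        self.id = uuid.uuid4()  # generated by the application, no DB round trip
        self.json = json
        self.version_id = 1


def update(row: RecordRow, new_json: dict, expected_version: int) -> None:
    # The caller states which version its changes are based on.
    if row.version_id != expected_version:
        raise StaleDataError(
            f"expected version {expected_version}, found {row.version_id}"
        )
    row.json = new_json
    row.version_id += 1


row = RecordRow({"title": "First"})
seen_version = row.version_id          # two clients read version 1
update(row, {"title": "Second"}, seen_version)  # first writer wins
```

A second writer that still holds `seen_version` now gets a `StaleDataError` instead of silently overwriting the first change. The same counter lets us tell that an indexed copy with a lower version is stale.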
It's important to understand that there are two distinct representations of a record:
- Python dictionary
- JSON document
These two representations of a record are often very similar, but the JSON document is constrained to the JSON object model, while the Python dictionary can hold richer data types as long as they can be serialized to JSON (e.g. a datetime object).
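The difference is easy to see with the standard library alone. The record content and the `encode` helper below are made up for the example. The point is only that the dictionary holds a datetime object while the JSON document holds a string.

```python
import json
from datetime import datetime, timezone

# Python dictionary representation: may hold rich types such as datetime.
record = {
    "title": "Demo",
    "created": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
}


def encode(value):
    """Serialize types that the JSON object model does not know about."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# JSON document representation: constrained to the JSON object model.
document = json.dumps(record, default=encode)
loaded = json.loads(document)
```

After the round trip, `loaded["created"]` is a plain ISO 8601 string. Turning it back into a datetime (e.g. with `datetime.fromisoformat`) is the kind of conversion that belongs in the data access layer rather than in service code.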
The service layer contains the domain and business logic of the application and is responsible for:
- Authorization (i.e. checking permissions)
- Business-level validation
- Control flow
The service layer usually lives inside an Invenio module in a package named service. It may consist of:
- Service components
- Service config
- Service schema
- Service results
- Domain errors
- Background tasks (tasks.py)
The main purpose of the service layer is to have an interface independent entry point into the application.
The service layer is built around the following guiding principles:
- Mimic the end-user interface: There is usually a one-to-one correspondence between, say, a button in the user interface and a method in a service.
- Clean control flow: The control flow of a service method should be reasonably easy to follow.
- Interface independent: The service must be independent of the interface it's being called from. This means, among other things, that a service knows nothing about the HTTP request.
A service itself is the high-level entry point into the application. A service provides methods that usually map directly to some sort of user interface action, like pressing a button or performing a search.
A service often also provides the transactional boundaries within InvenioRDM.
The service config is a container for
Responsibility:
- Inject dependencies via a single object.
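Injecting dependencies via a single object can be sketched as follows. This is not the InvenioRDM service config API: the `ServiceConfig` dataclass, its `can_create` and `validate` attributes and the `Service` class are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical config: everything the service depends on, in one object."""

    can_create: Callable[[str], bool]   # permission rule
    validate: Callable[[dict], list]    # business-level validation


class Service:
    def __init__(self, config: ServiceConfig):
        self.config = config

    def create(self, identity: str, data: dict) -> dict:
        # The control flow is fixed; the behaviour comes from the config.
        if not self.config.can_create(identity):
            raise PermissionError(identity)
        errors = self.config.validate(data)
        if errors:
            raise ValueError(errors)
        return dict(data)


config = ServiceConfig(
    can_create=lambda identity: identity == "admin",
    validate=lambda data: [] if data.get("title") else ["title is required"],
)
service = Service(config)
```

Swapping the permission rule or the validation for another instance of the application then only requires a different config object. The service class itself stays untouched.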
Unit of work (UoW)
We use a design pattern called unit of work to ensure that we can group multiple state-changing service methods into a single atomic operation. A state-changing service method is essentially any method that commits a database transaction, such as create/update/delete.
In a single service method we must, for instance, always ensure that we commit the database transaction before indexing and sending off Celery tasks. Otherwise we risk that the transaction commit fails and documents end up out of sync between our database and search index.
When we group multiple service method calls, we have to delay the database transaction commit until all service methods have done their work. We therefore also need to coordinate the indexing and task operations from each service method that have to run after the commit.
The unit of work context manager takes care of this job for us. It coordinates transaction operations between multiple service calls.
Warning: Do not use db.session.commit() in a service method
You should use the unit of work instead of running an explicit db.session.commit() or db.session.rollback() in the code. Example:

    from invenio_records_resources.services.uow import RecordCommitOp, unit_of_work

    @unit_of_work()
    def create(self, ..., uow=None):
        record = ...
        # No db.session.commit(), no self.indexer.index(...)
        # Register the operation and let the unit of work run it at the right time.
        uow.register(RecordCommitOp(record, indexer=self.indexer))
The unit_of_work() decorator ensures that, if a UoW is not provided, one is automatically created and committed once the function returns.
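The coordination described above can be illustrated with a minimal, self-contained sketch. It is not the invenio_records_resources implementation: the `UnitOfWork` class below commits on leaving the `with` block, takes plain callables instead of operation objects, and uses a `FakeSession` that only logs calls. All of this is assumed for the example.

```python
class FakeSession:
    """Stand-in for a database session that only records what happened."""

    def __init__(self):
        self.log = []

    def commit(self):
        self.log.append("commit")

    def rollback(self):
        self.log.append("rollback")


class UnitOfWork:
    """Minimal sketch: commit the database once, then run post-commit work."""

    def __init__(self, session):
        self.session = session
        self._post_commit = []

    def register(self, on_post_commit):
        # Work such as indexing is only queued here, not executed.
        self._post_commit.append(on_post_commit)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.session.rollback()
            return False  # re-raise; queued work is never run
        self.session.commit()
        for callback in self._post_commit:
            callback()
        return False


session = FakeSession()
with UnitOfWork(session) as uow:
    # Two "service calls" sharing one transaction.
    uow.register(lambda: session.log.append("index record 1"))
    uow.register(lambda: session.log.append("index record 2"))
```

The log ends up as one commit followed by the two indexing steps. If an exception is raised inside the block, the log only contains a rollback and nothing is indexed.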
Responsibility:
- data validation
- field-level permission checking
- dumping and loading record projections
Responsibility:
- Faceting, search query parsing, etc.
Responsible for defining a declarative permission model.
Responsible for providing a specific feature in the service and making the service customizable.
The main purpose of the presentation layer is to parse user requests and call the different services.
Explicitly parse and validate all input parameters.
Serialize to/from a single internal representation.
Conflict detection through optimistic concurrency control.
Presentation must not contain business logic (e.g. permission checks).
Celery tasks are considered part of the presentation layer and thus normally just call a service method. However, because a service may want to use background jobs to perform its work, we often define the Celery tasks in the service.
Resources define the REST API and are responsible for RESTful routing, parameter parsing, content negotiation, etc.
Resources request context
The resource request context is a Flask context object on which only validated input data is stored. Thus, accessing data on resource_requestctx instead of request means that at least basic validation has been performed.
The resource config is used for dependency injection to
Performance is of very high importance for InvenioRDM. There are, however, often trade-offs to be made.
Query vs indexing speed
For InvenioRDM, query speed is more important than indexing speed. This means we'll sometimes denormalize data to get high enough query speed. Once we denormalize data, we must also deal with stale data and cache invalidation.
The version counter on all records is instrumental in managing this trade-off, because it lets us detect stale secondary copies.
Database vs search engine
The database is the primary data storage for InvenioRDM. However, the database does not perform well when searching large numbers of records. Thus, in general, we only perform primary key lookups in the database and try to move all other queries to the search index.
As the search index can be slightly outdated from the primary copy, we focus on updating the index immediately in the cases where it's important for the user experience (e.g. a user deletes a draft and is immediately returned to a list of their drafts - this list should not include the just deleted draft).