Skip to content

Usage statistics

Introduced in InvenioRDM v12

The record usage statistics in InvenioRDM are implemented with the Invenio-Stats module and they are designed to be compatible with the COUNTER Code of Practice Release 4.

Persisted in the search indices

All information related to statistics (i.e. the raw events and aggregations) are stored exclusively in search indices and not in the database, which makes search engine backups much more relevant. Some recommendations are given in the how-to section.

Inner workings

The following sections aim to give you some insights into how usage statistics are collected under the hood.

Raw events

Usage events are generated in the resources (for the REST API) and view functions (for the web interface). The basic events are filtered and enriched via event builders that usually capture information only present in the request context at the time of the event (e.g. IP address, user agent, etc.). After their build process is finalized, they are sent off to a message queue.

Deduplication of events

Events that "look" the same (except for the timestamp) and are less than a second apart from each other will be deduplicated and counted as only one single event. So when somebody hits refresh on a landing page very quickly, not every page load will be counted.

A periodic background task will pick up pending events from the message queues, process them further with the configured preprocessors (e.g. user anonymization) and index them into the events indices. This task can be called on demand via the CLI command invenio stats events process.

An indexed record-view statistics event looks like the following:

{
  "_index": "my-site-events-stats-record-view-2023-04-04",
  "_id": "2023-04-04T09:26:30-951a582a144b51479477fc89a1ca96ab8891a10d",
  "_score": 1,
  "_source": {
    "timestamp": "2023-04-04T08:26:30",
    "recid": "fq14q-7ja92",
    "parent_recid": "n5qej-kaz30",
    "referrer": "https://127.0.0.1:5000/",
    "via_api": false,
    "is_robot": false,
    "country": null,
    "visitor_id": "10ac3a4737efabc81e12e8fbaeb2aab0d25f23c7d5731f6387461528",
    "unique_session_id": "f75b509f9aad420811b965d867940e1675296a7b5e95eb72fe0733be",
    "unique_id": "ui_fq14q-7ja92"
  }
}

The mappings for newly created event indices are automatically registered as defined in the configured index templates.

Event aggregations

While using all the raw usage events to calculate the statistics is possible, it can be very expensive – especially when this is a frequent operation. So to save some calculations, the raw events are periodically consolidated into intermediate aggregations that can be used for querying statistics rather than the raw events.

A periodic background task will check if there are any new events since the last run and if there are, it will aggregate them into intermediate results ready for querying. A bookmark mechanism is used to keep track of the periods for which events have already been aggregated and which may contain new events to aggregate. This task can be called on demand via the CLI command invenio stats aggregations process.

An indexed events aggregation over record-view events looks like the following:

{
  "_index": "my-site-stats-record-view-2023-04",
  "_id": "ui_fq14q-7ja92-2023-04-04",
  "_score": 1,
  "_source": {
    "timestamp": "2023-04-04T00:00:00",
    "unique_id": "ui_fq14q-7ja92",
    "count": 2,
    "unique_count": 1,
    "recid": "fq14q-7ja92",
    "parent_recid": "n5qej-kaz30",
    "via_api": false
  }
}

Querying statistics

Invenio-Stats provides query classes that can be used to calculate the finalized statistics, e.g. by fetching the relevant intermediate aggregations from the search indices and summing them up.

A query result for the record-view statistics of a record in Python looks like the following:

{
    "start_date": None,
    "end_date": None,
    "recid": "fq14q-7ja92",
    "parent_recid": "n5qej-kaz30",
    "views": 13.0,
    "unique_views": 6.0
}

Final record usage statistics

The final usage statistics for a record include the record views and file downloads for both the selected record version as well as across all of its versions. InvenioRDM turns them into the following shape:

{
    "this_version": {
        "views": 10,
        "unique_views": 6,
        "downloads": 7,
        "unique_downloads": 7,
        "data_volume": 123.456,
    },
    "all_versions": {
        "views": 30,
        "unique_views": 16,
        "downloads": 23,
        "unique_downloads": 21,
        "data_volume": 345.678,
    }
}

Putting the stats into the records

Every record has usage statistics available via a transient stats property that's lazy-loaded only when it is accessed.

For consistency between the search results and the landing pages (and a bit of caching), the primary source for the collected record statistics is the records search index. As a fallback, the statistics are fetched directly via several queries from the aggregations' search indices.

Outdated statistics?

A special search dumper extension for records will take care of updating the statistics before indexing the record in the search engine. The upshot here is that when the statistics seem to be outdated, you should try to reindex the record.

REST API endpoint

Invenio-Stats provides a REST API endpoint for querying the statistics. The required permissions to access this endpoint are determined by the query_stats entry in the permissions policy.

Disabled per default

Per default, access to this specific API endpoint is disabled to prevent attackers from overloading the system with too many or heavy queries. When enabling access to the system, it should be limited to certain groups of users (e.g. authenticated users, administrators, etc.) and/or rate-limiting should be put in place.

However, the system will still include a record's usage statistics in the web interface as well as the API endpoints for records.

Additional information

Unique views and downloads

If it seems like the view and download counts on the landing page are a bit low, that's probably because the landing page shows the unique views/downloads per default. They deduplicate events for each record that are coming from the same source. This is simply the more honest metric, even if it can be a little bit disappointing.

Only records have stats

Out of the box, InvenioRDM only collects statistics for records but not drafts. As a consequence, only the record search supports the display of and sorting by views and downloads.

Only UI visits are counted as "view" events

Currently, InvenioRDM will generate but immediately discard record-view events generated via REST API accesses. Thus, only landing page visits will count as record views.