Infrastructure linkages – Overview

Infrastructure linkages are linkages of datasets that are regularly updated and are generally available for a range of approved projects (as opposed to linkages of datasets for a single approved project only). The most accessed infrastructure linkages are the DLB’s ‘core’ linkages, such as the Hospital Morbidity Data Collection, Emergency Department Data Collection and Death Registrations. However, the DLB also maintains a wide range of other non-core infrastructure linkages. For more information on the infrastructure datasets available, please see the Dataset Menu.

Infrastructure linkages are established according to ten principles:

  • the linkage is technically feasible;
  • data for linkage updates is provided regularly (e.g. annually);
  • data can be accessed by multiple projects and/or groups according to the relevant application and approval process;
  • there is ongoing value for the linkage to be performed as infrastructure work rather than project-specific work;
  • the data flow, privacy considerations and associated data governance arrangements are fully documented in a Data Agreement between the Data Provider and the DLB;
  • the new linkage is approved by the Department of Health WA Human Research Ethics Committee;
  • the Data Provider is ‘linkage ready’ insofar as having the staff, skills and resources needed to supply the data for linkage, answer data queries, and fulfil project-specific data extraction requirements once all of the necessary approvals are in place;
  • there is sufficient funding available for initial development and ongoing operational costs in line with DLB’s charging model;
  • the Data Provider and/or a Research Group is responsible for meeting the aforementioned costs; and
  • the DLB schedules work (including linkages) to optimise the balance between capacity and demand.

All linkage activities involve logistical risks, and these risks increase with complexity. Generally, infrastructure linkages are associated with the greatest complexity (and risk) due to a variety of inter-related factors, including legal, ethical and legislative considerations, the stakeholders involved, the underlying data quality, and the effort, duration, skills and budget required. It is often difficult to estimate complexity with certainty before an infrastructure linkage commences. For this reason, it can take the DLB several months to link data before it is available for approved requests. Sometimes this takes longer due to unforeseen circumstances (e.g. changes in staff, data collection systems, dataset specifications or available resources).
If you have any queries about infrastructure linkages please contact Tom Eitelhuber at tom.eitelhuber@health.wa.gov.au

 

Standard linkage processes

Data can be transferred to the DLB via secure online file transfer systems such as MyFT. New datasets provided to the DLB for linkage go through multiple stages before they can be linked, including:

  • liaison with the Data Provider;
  • evaluation;
  • process preparation;
  • importing;
  • cleaning;
  • assignment of linkage specific record IDs; and
  • linkage strategy development.

In principle, data updates (‘refreshes’) are more straightforward to complete, because processes and linkage strategies can be reused. Data Providers should be aware that changes to the data, or poor consolidation against what has been received before, can lead to problems, particularly additional time and cost.

 

General advice for making data as ‘linkage ready’ as possible

Data linkage uses information about ‘who’ a person is more than data about ‘what has happened’ to them.
The most useful fields for linkage are:

  • Given name;
  • Surname;
  • Date of birth;
  • Street address;
  • Suburb; and
  • Postcode.

Multi-component fields such as name and address are best split into their subcomponents and provided as separate fields:

  • First name;
  • Middle name;
  • Surname;
  • Street address;
  • Suburb;
  • Postcode.
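
As an illustration of splitting a multi-component field, the sketch below breaks a free-text name into first name, middle name and surname. The function name `split_full_name` and the rule that the last word is the surname are assumptions made for this example only, not a DLB specification; multi-word surnames would be split incorrectly and need manual review.

```python
def split_full_name(full_name: str) -> dict:
    """Split a free-text name into first name, middle name and surname.

    Assumption for illustration: the last whitespace-separated word is the
    surname and the first word is the first name. Everything in between is
    treated as the middle name.
    """
    parts = full_name.split()
    if not parts:
        return {"first_name": "", "middle_name": "", "surname": ""}
    if len(parts) == 1:
        # A single word cannot be classified reliably; keep it as first name.
        return {"first_name": parts[0], "middle_name": "", "surname": ""}
    return {
        "first_name": parts[0],
        "middle_name": " ".join(parts[1:-1]),
        "surname": parts[-1],
    }


example = split_full_name("Mary Anne Smith")
```

Here `example` holds the three components as separate values, which can then be written out as separate fields.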

Information that ‘deterministically connects’ (matches exactly) the new data to another dataset is very valuable. Examples include Unit Medical Record Number (UMRN) and Elector Number.

The data formats the DLB accepts are character-delimited text files (e.g. delimited by commas or tabs), fixed-width text files and Excel spreadsheets. If a Data Provider cannot provide data in one of these formats, they should contact the DLB to discuss.

Every record must have a unique ID that maps back to the original system/data collection. If the data does not have a unique record ID, Data Providers should consider creating one, or providing the DLB with a list of fields which, when taken together, are unique (e.g. person ID and event date).
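
A Data Provider could test whether a candidate set of fields is unique with a check along the following lines. This is a minimal sketch: the function name `find_duplicate_keys` and the field names `person_id` and `event_date` are hypothetical and should be replaced with the fields in the actual data.

```python
from collections import Counter


def find_duplicate_keys(records, key_fields):
    """Return the key values that occur on more than one record.

    `records` is a list of dicts; `key_fields` lists the fields that are
    expected to be unique when taken together.
    """
    counts = Counter(tuple(rec[field] for field in key_fields) for rec in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical sample data: one person with two events.
records = [
    {"person_id": "P1", "event_date": "2020-01-05"},
    {"person_id": "P1", "event_date": "2020-03-17"},
    {"person_id": "P2", "event_date": "2020-01-05"},
]

duplicates_by_person = find_duplicate_keys(records, ["person_id"])
duplicates_by_person_and_date = find_duplicate_keys(records, ["person_id", "event_date"])
```

An empty result for a given field combination suggests that combination can serve as a unique record ID for the supplied extract.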

If the new data is ‘person based’ rather than ‘event based’ (i.e. each person has a single record containing all of their information), this could be problematic, particularly if the Data Provider overwrites old values with new ones (e.g. updating addresses). This should be discussed with the DLB.

Non-human records need to be removed before linkage, noting that some datasets include animals, vehicles, etc. The DLB requests that Data Providers endeavour to remove these records before supplying the data.

Text equivalents of NULL, such as ‘no fixed permanent address’ and ‘N/A’ can be problematic, as the DLB wants to avoid mistaking them for a match to one another. Data will be easier to link if these values are left empty instead.
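
One way to blank out such values before export is sketched below. The set `NULL_EQUIVALENTS` is an assumed starting list for illustration (only ‘no fixed permanent address’ and ‘N/A’ come from the text above), and each Data Provider would need to review the placeholder values that actually occur in their own data.

```python
# Assumed placeholder values, compared case-insensitively.
NULL_EQUIVALENTS = {
    "n/a",
    "na",
    "null",
    "none",
    "unknown",
    "no fixed permanent address",
}


def blank_null_equivalents(value):
    """Return an empty string for missing values and text equivalents of NULL.

    Other values are returned with surrounding whitespace removed.
    """
    if value is None:
        return ""
    cleaned = value.strip()
    if cleaned.casefold() in NULL_EQUIVALENTS:
        return ""
    return cleaned


cleaned_addresses = [
    blank_null_equivalents(v)
    for v in ["12 Example St", " N/A ", "No Fixed Permanent Address", None]
]
```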

Metadata is very useful when determining how to use fields for linkage, particularly when they are understood by only a select group of people with specialised knowledge. Data Providers must ensure they provide the DLB with data dictionaries and code lists, if they exist.

Data should be checked closely for errors after exporting it from the original system.

Some of the most common errors that the DLB encounters are:

  • single records wrapping onto two lines;
  • inconsistent number of delimiters or fields;
  • values being stored in the wrong field (e.g. dates of birth in the given name field);
  • inclusion of the separating character in the field value (e.g. addresses with commas in them, within a comma-delimited file);
  • the same record ID being assigned to more than one record; and
  • inadvertently reformatting fields (e.g. expressing a number in truncated scientific notation in Excel).
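
Several of the errors listed above can be detected automatically before the file is supplied. The following is a minimal sketch using Python's standard `csv` module; the function name `check_delimited_text` and the ID field name `record_id` are assumptions for illustration. It reports rows whose field count differs from the header (which catches wrapped records and unquoted delimiters inside values) and record IDs that appear more than once.

```python
import csv
import io


def check_delimited_text(text, delimiter=",", id_field="record_id"):
    """Return a list of problems found in delimited text with a header row."""
    problems = []
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if id_field not in header:
        return [f"missing ID field {id_field!r}"]
    id_index = header.index(id_field)
    seen_ids = set()
    for row in reader:
        line = reader.line_num  # physical line number of the row just read
        if len(row) != len(header):
            problems.append(
                f"line {line}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        record_id = row[id_index]
        if record_id in seen_ids:
            problems.append(f"line {line}: duplicate record ID {record_id!r}")
        seen_ids.add(record_id)
    return problems


# An address containing a comma is safe when the value is quoted...
quoted = 'record_id,address\n1,"1 Main St, Unit 2"\n2,5 High St\n'
# ...but adds an extra field when it is not.
unquoted = "record_id,address\n1,1 Main St, Unit 2\n"
repeated_id = "record_id,address\n1,1 Main St\n1,5 High St\n"

quoted_problems = check_delimited_text(quoted)
unquoted_problems = check_delimited_text(unquoted)
repeated_id_problems = check_delimited_text(repeated_id)
```

Writing the export with `csv.writer`, which quotes values containing the delimiter by default, avoids the unquoted-comma problem at the source.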

A Data Provider should cross-check the field list in the relevant Data Agreement with the DLB to ensure they have supplied all approved fields.

If ongoing data updates will be provided to the DLB, these will be easiest to process if the Data Provider can supply only the records that are new or have been modified since the previous provision. If this is not possible, the Data Provider can in principle supply a full refresh of all records. However, this could be problematic if the data has undergone any changes since the last update.
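
Where the source system does not track modification dates, new and modified records can be identified by comparing the current extract with the previously supplied one. The sketch below assumes both extracts are available as lists of dicts keyed by a hypothetical `record_id` field; the function name `select_changed_records` is also illustrative.

```python
def select_changed_records(current, previous, id_field="record_id"):
    """Return records in `current` that are new or differ from `previous`."""
    previous_by_id = {rec[id_field]: rec for rec in previous}
    return [rec for rec in current if previous_by_id.get(rec[id_field]) != rec]


# Hypothetical extracts: record 2 has a changed suburb, record 3 is new.
previous_extract = [
    {"record_id": "1", "suburb": "Perth"},
    {"record_id": "2", "suburb": "Fremantle"},
]
current_extract = [
    {"record_id": "1", "suburb": "Perth"},
    {"record_id": "2", "suburb": "Subiaco"},
    {"record_id": "3", "suburb": "Albany"},
]

update = select_changed_records(current_extract, previous_extract)
```

This comparison does not detect records that have been deleted from the source system since the previous provision; those would need to be identified separately.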

Data changes, such as the addition of new fields, moving the data to a new storage system, or reallocation of record IDs, will impact processing requirements for updates. It is preferable that the DLB is provided with as much information as possible to bridge the gap between the old version of the data and the new one (e.g. new-to-old ID mapping files, change of field name information, etc.). It is helpful if a Data Provider can format the data to look as similar as possible to the previous version (e.g. same format, same field order, same naming conventions, with newly added fields appended to the end of each record).

Data Providers should ensure old records do not disappear from their system. It is also vital that Data Providers thoroughly document any related data processes to prevent knowledge loss at the source. This is especially important during periods of staff changeover or system redevelopment/replacement.