Research data management
Research data management relates to the organisation, documentation, storage and preservation of data resulting from the research process.
LJMU actively promotes responsible research data management practices and encourages researchers to make their valuable data openly available via OpenAIRE for reuse in future research projects worldwide.
The LJMU Research and Knowledge Data Management Policy (PDF, 221KB) recognises that research data are an essential component of any research project involving LJMU staff and postgraduate students, regardless of whether the project is supported by external or internal funds. We want to help you establish best practice in research data management throughout the data lifecycle.
We are committed to providing access to services and facilities for the storage, backup, deposit and retention of research data and records, allowing researchers to meet both their own requirements and those of their research funders.
We also provide researchers with access to training, support and advice in research data management planning and records management.
Open research data case studies
We are excited to share the first four case studies from researchers across the university who are embracing open data in their work. Whether using publicly available datasets or generating and sharing their own, these stories highlight the diverse ways open data is driving innovation, transparency, and collaboration.
360-degree gait capture: a diverse and multi-modal gait dataset of indoor and outdoor walks acquired using multiple video cameras and sensors
Researchers
Luke Topham ORCID: 0000-0002-6689-7944
Luke Topham is a Senior Lecturer in Software Engineering at the School of Computer Science and Mathematics at Liverpool John Moores University. He is actively involved in cutting-edge research and scholarly activities and has diverse research interests in Artificial Intelligence (AI).
Dr Wasiq Khan ORCID: 0000-0002-7511-3873
Dr Wasiq Khan is an expert in Artificial Intelligence and Data Sciences within the School of Computer Science and Mathematics at Liverpool John Moores University.
Project overview
In response to the lack of diversity and real-world variables in existing public datasets, the project collected a novel gait dataset from 65 diverse participants, covering eight viewing angles, indoor and outdoor real-world environments, changes in participant appearance (clothes), and digital goniometer sensor data. The dataset comprises 3,120 videos totalling approximately 748,800 image frames with detailed annotations. These annotations include approximately 56,160,000 bodily keypoint annotations, identifying 75 keypoints per video frame, as well as approximately 1,026,480 motion data points captured from a digital goniometer for three limb segments (thigh, upper arm, and head).
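Those figures are internally consistent, as a quick sanity check shows (a minimal sketch in Python; the per-video frame count is inferred from the published totals rather than stated in the case study):

```python
# Published totals from the dataset description.
videos = 3_120
total_frames = 748_800
keypoints_per_frame = 75

frames_per_video = total_frames // videos                  # 240 frames per video
keypoint_annotations = total_frames * keypoints_per_frame  # 56,160,000 annotations
print(frames_per_video, keypoint_annotations)
```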
Creating new data
By creating this dataset, Luke enabled further investigations into real-world factors that currently limit the applications of gait-based person identification, such as occlusion, appearance changes, and viewing angles. As a result of this work, he successfully completed his PhD and published several high-quality Q1 journal papers.
It was clear that, despite several high-quality datasets being available, no public dataset provided the diversity and variables needed to investigate problems such as occlusion and appearance change. In addition to achieving our own research aims, we therefore wanted to stimulate the related research community, encourage innovation, and increase collaboration, so we chose a CC BY 4.0 licence to ensure that our dataset could be freely shared and adapted for research needs.
We made our dataset openly available in the LJMU Data Repository as it helped us meet the open data requirements for our publications. A related publication can be read via Nature.com.
Challenges
The greatest challenge in generating the dataset was finding and recruiting participants. Luke utilised internal calls for participants, as well as professional and personal networks. Recruitment, recording, and processing also took a significant amount of time to complete. However, the results were worth the effort, and we are grateful to everyone who participated.
Outcomes and successes
Luke notes several benefits to making the data open. For example, it supports scientific rigour, as other researchers are able to replicate our studies as well as those of others, and it establishes a benchmark for various applications. Moreover, the repository eliminated many manual steps, such as processing data-sharing requests, which made sharing the data convenient and imposed no additional workload. Seeing the usage statistics is also highly motivating, as it shows our work is having a positive impact and reaching others.
Lessons learned and advice
Luke encourages others to openly share their datasets, as doing so promotes transparency, reproducibility, and collaboration across disciplines: it not only allows others to validate and build upon our work but also increases the visibility and impact of our research. By making data accessible, we contribute to a more efficient and trustworthy scientific ecosystem.
Additionally, preparing, processing, and documenting datasets can be time-consuming; therefore, it's essential to plan ahead and allocate sufficient time before publication or conference deadlines!
Leveraging Open Data in Alzheimer’s Disease research
Researcher
Dr Davide Bruno ORCID: 0000-0003-1943-9905
Originally from Italy, Dr Bruno earned a degree in Psychology from the University of Parma and went on to complete a PhD at Keele University in 2007. His academic journey has included roles at the University of Southampton, the University of Massachusetts – Amherst, the Nathan Kline Institute, New York University, and Liverpool Hope University. He later joined LJMU where he is now a Reader in the School of Psychology.
Project overview
Dr Davide Bruno's research focuses on Alzheimer’s disease using openly available research data. As a solo researcher not embedded in a large research group or hospital centre, he has found that access to open research datasets provides a significant opportunity to conduct high-impact research at a lower cost. This strategy significantly reduced barriers to entry and enhanced research efficiency.
Use of Open Data
Dr Bruno accessed several open databases to download research data related to Alzheimer’s disease. These sources were available upon request and were essential to conducting the research. By submitting a formal project proposal, Dr Bruno was able to obtain access to these valuable datasets.
Challenges
One of the primary challenges encountered during the research was the occasional lack of information in data dictionaries.
Outcomes and successes
Highlights include:
- The publication of 15+ papers utilising open databases
- Securing $1.4 million in research grant funding
Lessons learned and advice
Dr Bruno emphasises the importance of looking for available data before initiating data collection efforts, as many valuable datasets already exist, and of using the associated data dictionary or metadata to maintain consistency and data accuracy.
Future considerations
While open data has provided significant opportunities, Dr Bruno notes the importance of improved data documentation and standardisation across repositories.
Using Open Data to Explore Careless Responding and Harmful Alcohol Use in Online Surveys
Researcher
Andrew Jones ORCID: 0000-0001-5951-889X
Andrew Jones is a Reader in the LJMU School of Psychology, specialising in psychology and statistics. With over a decade of research experience in substance use and obesity, he has authored more than 60 peer-reviewed journal articles and contributed to four book chapters. He has received two early career awards from research societies and LJMU’s Open Research Award, and has secured combined grant income of approximately £3 million.
Project overview
This project set out to investigate whether individuals who respond carelessly in online surveys are more likely to report drinking alcohol at harmful levels. Rather than collecting new data, existing open datasets were used as additional data collection would be redundant and potentially wasteful when suitable data already existed.
Use of Open Data
Open data was sourced by emailing authors of relevant studies identified in the literature and by scouring public databases for datasets that contained both alcohol use measures and indicators of inattentive or careless responding. Data were accessed directly from authors, often accompanied by a codebook to help interpret the variables and understand the context of the data. By combining data from different studies, Andrew was able to conduct a large-scale secondary analysis to examine patterns of alcohol use and data quality.
Challenges
Working with secondary data presented several challenges. Firstly, there were inconsistencies in variable labelling—for example, gender might be coded as 1 = male in one dataset and 1 = female in another, requiring careful interpretation. Secondly, measures of education and other demographic variables varied widely across studies, making it necessary to develop a standardised approach for cross-study comparisons. Finally, cross-country differences introduced additional complexity, as cultural and methodological variations across international datasets meant that some variables were not directly comparable.
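As an illustration of that harmonisation step, the sketch below recodes conflicting gender codings onto one shared labelling before pooling (a minimal sketch with hypothetical column names and values, not the project's actual pipeline):

```python
import pandas as pd

# Hypothetical excerpts from two source datasets with conflicting codings.
study_a = pd.DataFrame({"gender": [1, 2, 1], "audit_total": [6, 14, 9]})  # 1 = male
study_b = pd.DataFrame({"sex": [2, 1, 2], "audit_total": [11, 4, 16]})    # 1 = female

# Map each study's coding onto one shared labelling before pooling.
study_a["gender"] = study_a["gender"].map({1: "male", 2: "female"})
study_b["gender"] = study_b.pop("sex").map({1: "female", 2: "male"})

# Tag provenance so cross-study differences can be modelled later.
pooled = pd.concat(
    [study_a.assign(study="A"), study_b.assign(study="B")],
    ignore_index=True,
)
print(pooled)
```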
Outcomes and successes
Highlights include:
- Published a well-cited paper, available via PubMed, contributing to academic impact.
- A catalyst for collaboration, leading to new research partnerships.
Lessons learned and advice
Andrew learned that data collected for one specific purpose can be reused responsibly for other purposes. Key lessons were that it is very important to discuss with the data owners what is or isn’t possible, and that codebooks, which decode the structure and meaning of data entries, were crucial to being able to use and understand the data.
Future considerations
Don’t underestimate open data – it's a powerful resource that can reduce duplication and open up new avenues for discovery. Communicate with data providers early to clarify permissions, limitations, and variable meanings. Invest time in standardising data, especially when combining multiple datasets.
Using Open ECG Data to Enable AI-Driven Cardiology in the EU TARGET Project
Researchers
Professor Sandra Ortega-Martorell ORCID: 0000-0001-9927-3209
Professor of Data Science in the LJMU School of Computer Science and Mathematics. Her research interests and expertise are in Artificial Intelligence (AI) and Machine Learning (ML), especially AI/ML translation to healthcare.
Professor Ivan Olier ORCID: 0000-0002-5679-7501
Professor of Artificial Intelligence and Data Science in the LJMU School of Computer Science and Mathematics. His research interests lie in algorithms for Artificial Intelligence, with a specific focus on Causal AI, Digital Twins, and the modelling of large-scale, highly structured and/or relational data (including multivariate time series, graphs, and networks).
Zainab Mahmood is a postgraduate research student at Liverpool John Moores University, working on AI augmentation of whole-body MRI scans.
Project focus
Atrial Fibrillation Diagnosis, Treatment, and Rehabilitation through Digital Twins
Project overview
As part of the innovative EU TARGET Project, the research team are developing AI-powered digital twins and decision support tools to revolutionise how atrial fibrillation (AF) and AF-related strokes are diagnosed, treated, and rehabilitated. A key challenge the team faced was the scarcity of AF electrocardiogram (ECG) data, with only 9% of cases in available datasets indicating AF - a significant limitation for training machine learning models.
Use of Open Data
As noted above, AF electrocardiogram (ECG) recordings are scarce: only approximately 9% of recordings in available datasets such as PhysioNet indicate AF. To get around this, the team used the PTB-XL ECG dataset to train generative models on the available data and produce synthetic recordings. They opted for open data because of its huge scope, openness, and potential to promote innovation free of proprietary hurdles.
With 21,837 clinical 12-lead ECGs collected from 18,885 patients, the PTB-XL dataset is the largest publicly available ECG dataset. Sourced from PhysioNet, it provides formats suitable for machine learning. After accessing the dataset through the PhysioNet website, the team used the Python code provided by PhysioNet to load the raw waveform data and metadata onto their local server.
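PhysioNet’s example code for PTB-XL is built on the open-source wfdb Python package. The sketch below shows the gist of the loading step, assuming the dataset has been downloaded to a local ptbxl/ directory (file and column names follow the published PTB-XL metadata layout; treat the exact paths as placeholders):

```python
import ast

import pandas as pd
import wfdb  # pip install wfdb

PTBXL_DIR = "ptbxl/"  # placeholder for the local download location

# Per-record metadata; scp_codes is stored as a stringified dict of SCP statements.
meta = pd.read_csv(PTBXL_DIR + "ptbxl_database.csv", index_col="ecg_id")
meta["scp_codes"] = meta["scp_codes"].apply(ast.literal_eval)

# Read the raw 12-lead waveform for one record at the 100 Hz sampling rate.
record = meta.iloc[0]
signal, fields = wfdb.rdsamp(PTBXL_DIR + record["filename_lr"])
print(signal.shape, fields["sig_name"])  # (1000, 12) plus the twelve lead names
```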
Challenges
The PTB-XL dataset is remarkably easy to use, but the low prevalence of AF, similar to the 9% seen in other datasets, posed a challenge for training robust models, unlike the abundant ECGs for other classes such as normal and MI, which are well represented. We also encountered minor data consistency issues, such as potential duplicates and varying annotation quality: some records had single-cardiologist validation while others had dual validation. These were managed with rigorous data cleaning pipelines, cross-referencing metadata and SCP codes to eliminate duplicates and ensure accurate AF labelling. The dataset’s open-source nature, free of licensing restrictions, allowed us to focus on technical solutions without hassle, and PhysioNet’s clear documentation and reference papers made data extraction and preprocessing a smooth process, delivering high-quality inputs for our models with minimal effort.
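As a concrete illustration of that SCP-code cross-referencing, a hedged sketch continuing from the loading example above might flag AF records and apply simple quality filters (the team’s actual cleaning rules are more involved and are not reproduced here; column names follow the PTB-XL metadata file):

```python
# Flag records whose SCP statements include atrial fibrillation (code "AFIB").
meta["is_af"] = meta["scp_codes"].apply(lambda codes: "AFIB" in codes)

# Prefer records whose diagnostic labels were validated by a human cardiologist.
validated = meta[meta["validated_by_human"] == True]

# Drop straightforward duplicates on patient and recording date before training.
deduped = validated.drop_duplicates(subset=["patient_id", "recording_date"])
print(f"AF prevalence after filtering: {deduped['is_af'].mean():.1%}")
```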
Outcomes and successes
The team have been able to use this dataset to make good progress with AF, despite the more limited data, and with normal and MI cases, where data prevalence is strong. They are excited to work with the synthetic datasets to train the digital twin models, aiming to create realistic heart simulations for individualised treatment plans. The dataset’s user-friendly design and open-access nature ensure reproducibility and stimulate collaboration, magnifying its potential influence on AI-driven cardiology.
Although the final AI models are still in development, the project has already delivered:
- A simplified synthetic data pipeline
- Cleaned and enriched datasets for AF, MI, and normal heart conditions
- Readiness for future clinical simulations and personalised treatment plans
Lessons learned and advice
The project team recommends:
- Embracing open data for its flexibility and ease of access
- Prioritising data cleaning to enhance accuracy and consistency
- Leveraging provided tools (such as Python scripts, metadata, SCP codes) to deepen understanding
Even for newcomers, the user-friendly format and documentation of PTB-XL made integration straightforward. The team plans to expand synthetic data generation using advanced generative models and additional open datasets to further address the AF data gap.
Future considerations
The TARGET Project continues to demonstrate the transformative potential of open data in AI-driven cardiology. By integrating synthetic ECGs into digital twin models, the team is paving the way for more personalised, scalable, and equitable healthcare. Their ongoing work serves as a blueprint for future projects seeking to maximise impact through open science and collaboration.
The team are more dedicated than ever to open data because of the PTB-XL dataset and how easy it is to use. Thanks to its code availability, extensive documentation, and reference materials, even a newcomer could start using it quickly. The abundance of non-AF ECGs allowed the team to tackle difficult problems, and they hope to overcome the low AF prevalence using modern synthetic data approaches.
The dataset’s user-friendly formats and documentation make it suitable for all skill levels. Prioritise comprehensive cleaning to resolve small discrepancies, and make use of the accompanying metadata to gain a deeper understanding. The next steps are to scale up AF synthetic data production using more sophisticated generative models and additional open datasets; this will help alleviate the scarcity of AF data and pave the way for more reliable digital twin training.
This ongoing journey highlights the potential of open data to drive smarter, faster, and more equitable healthcare solutions, and the team encourage others to explore its possibilities.
