It is international Love Your Data Week. The aim of the week is to draw attention to research data and to help researchers improve their data management. This blog post is about data quality and about the documentation and description of research data. Data management is often thought to concern only quantitative data, but qualitative data must also be documented and described to ensure data quality. Researchers working with qualitative data, however, tend to be more concerned with ethical issues such as anonymization, confidentiality, and the risk of someone using qualitative data to answer research questions the data was not collected for.
Data quality is about the quality of the content, the values of a dataset. It is commonly described in terms of completeness (all the data must be there), validity, consistency, timeliness, and accuracy. Good data quality also ensures that data is useful, documented, and reproducible/verifiable.
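To make these dimensions concrete, here is a minimal sketch of automated data-quality checks on a small tabular dataset. The records, variable names, and validation rules are invented for illustration; accuracy usually cannot be checked automatically and is therefore not covered here.

```python
from datetime import date

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "country": "FI", "collected": date(2017, 2, 13)},
    {"id": 2, "age": None, "country": "SE", "collected": date(2017, 2, 14)},
    {"id": 3, "age": 240, "country": "FI", "collected": date(2017, 2, 15)},
]

def check_quality(rows):
    issues = []
    # Completeness: every field must have a value.
    for r in rows:
        for key, value in r.items():
            if value is None:
                issues.append(f"record {r['id']}: missing {key}")
    # Validity: values must fall within plausible ranges.
    for r in rows:
        if r["age"] is not None and not 0 <= r["age"] <= 120:
            issues.append(f"record {r['id']}: implausible age {r['age']}")
    # Consistency: identifiers must be unique.
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        issues.append("duplicate ids found")
    return issues

print(check_quality(records))
# → ['record 2: missing age', 'record 3: implausible age 240']
```

Even simple checks like these, run at collection time, catch problems before the faulty values are inherited by later stages of the research process.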
Collecting data, storage/access, and formatting are all activities that affect data quality. Responsibility for these activities lies with both the data provider and the data curator. Data curation is often handled by archivists and librarians: archivists make sure datasets are preserved, and librarians add metadata to datasets. It is also often librarians who make data available to others. Remember, though, that a dataset is not automatically available just because it is archived and preserved.
Documentation of data increases transparency, and trust, in the research process. It has to do with validation, reproducibility, and reusability. Documenting data contributes to data quality and makes the data more usable for the researchers themselves, their colleagues, students, and others. The write-up process becomes more efficient and less stressful when the data is well described and has a well-thought-out structure. The work of a research group also becomes easier when the data is thoroughly described and structured, and questions that arise during peer review are easier to answer.
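One common form of such documentation is a codebook stored alongside the data. Below is a minimal sketch of one; the filename, variable names, and descriptions are hypothetical examples, not taken from any actual dataset.

```python
import json

# A hypothetical codebook describing each variable in a dataset.
codebook = {
    "dataset": "survey_2017.csv",
    "collected": "2017-02-13 to 2017-02-15",
    "variables": {
        "id": "Unique respondent identifier (integer)",
        "age": "Respondent age in years; missing values left empty",
        "country": "ISO 3166-1 alpha-2 country code of residence",
    },
}

# Store the codebook next to the data file so colleagues, students,
# and reviewers can interpret it without asking the original researcher.
with open("codebook.json", "w") as f:
    json.dump(codebook, f, indent=2)
```

A machine-readable format like JSON is one design choice among several; a plain-text README describing the same variables serves the same purpose.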
Research today is measured with various indicators, one of which is the number of citations. If datasets are citable, this may give a researcher an advantage when applying for research funding and in the promotion process. Documentation also increases the integrity of research, because the process becomes more transparent. Public trust in research may increase as well; even where it does not, the added transparency at least strengthens trust among research colleagues.
Harvard Business Review (HBR) has published an article in which IBM estimated the yearly cost of bad data at up to $3.1 trillion in the US alone in 2016. So there is a lot to be done when it comes to data quality. The estimate is based on the time and money that decision makers, managers, knowledge workers, and data scientists spend correcting bad data before they can do their work. It concerns costs in organizations where, for example, a sales department gets an order wrong and the faulty data is then inherited by another department, not data produced in a research context. Nevertheless, it is important to consider costs (not necessarily monetary) in research as well.
Retraction Watch, a blog tracking retracted publications, reports a case in which a researcher noticed problems in the database he was using to investigate trends in extinction patterns. The problems affected two of his publications, one of which is now retracted. In this case, flaws in the data collection and in the database impacted the conclusions drawn: when the problems were corrected, the results of the study changed.
Here you can find examples of bad data. Click on an image to see an explanation of what is wrong with that dataset.