General Guidelines and Understandings for Handing Off Data to Another Team Precisely, Every Time
When working with other teams, it is critical to have a smooth, seamless transition of data from one team to the next team processing it downstream in the analysis pipeline. In the realm of data science, a number of de facto good data management practices are already established; some of these are as follows.
- Establish a single source of truth of data origin
- Catalog data with strong file names
- Include dates in a standard format (e.g., ISO 8601) and the analyst's initials (see the sketch after this list)
- Document the data
- Include comments in the code in the scripts included in the data bundle
- Data security: encrypt all data when it is appropriate
- Safely archive unused data
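As a small illustration of the file naming bullets above, here is a minimal sketch of a helper that embeds an ISO 8601 date and the analyst's initials into a file name; the function name and exact layout are illustrative, not a fixed standard.

```python
from datetime import date

def build_file_name(description: str, initials: str, ext: str) -> str:
    """Illustrative name builder: <ISO-8601 date>_<INITIALS>_<description>.<ext>."""
    today = date.today().isoformat()                       # e.g. 2025-01-15
    safe_desc = description.strip().lower().replace(" ", "-")
    return f"{today}_{initials.upper()}_{safe_desc}.{ext}"

print(build_file_name("patient cohort counts", "jd", "tsv"))
# e.g. 2025-01-15_JD_patient-cohort-counts.tsv
```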
Without establishing and following correct guidelines for handing off data, the consequences are serious: misunderstandings and misinterpretations that cost the company money, loss of data and data integrity, and loss of critical work time. The following builds upon the good data management practices already established and lists best practices for handing off data to another team, whether within the same company or external to it. It is important to note that, although the guidance below grew out of handing off genomic and other bioinformatics data, it is data agnostic and could be applied to many areas of data science, such as handing off data generated by machine learning and artificial intelligence teams.
Missing or incomplete data must be acknowledged and documented
It is industry practice to "never leave blanks" in data: replace every blank with "NA" (or another established standard for missing values, such as "null" or "NaN") so the data can later be interpreted and analyzed with pandas or SQL. Using inconsistent or undocumented (e.g., unmapped) notations for NAs is not appropriate and causes data and time loss later on. All data extrapolations and inferences made by the analyst(s) must be documented along with their data sources, and all formulas or scripts used to generate the missing data must be documented as well. It is incorrect to write "0" as an "NA" because "0" is still a value.
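As a sketch of the point above, the snippet below normalizes inconsistent missing-value notations at import time with pandas; the file and column names are hypothetical, and the exact list of placeholder strings would come from the data manifest.

```python
import pandas as pd

# Hypothetical file: inconsistent placeholders ("-", "missing", "") would
# otherwise survive as ordinary strings and corrupt downstream analysis.
df = pd.read_csv(
    "measurements.csv",
    na_values=["NA", "N/A", "null", "NaN", "-", "missing", ""],
    keep_default_na=True,
)

# A genuine 0 stays 0; only true blanks and placeholders become NA.
print(df["tumor_size_mm"].isna().sum(), "missing values to document")
```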
Data manifest or data dictionary
A manifest that maps each variable to its metadata or description must be included in the data bundle, with variables or headers named so the analyst can understand them. The downstream team can use this data manifest or data dictionary to write a script that checks the validity of each field, for example whether the data format is correct or whether the patient IDs are of the correct length, before importing and ingesting the data into a database. For this reason, it is crucial to use a data manifest or dictionary format that is a modern industry standard, such as Excel, TSV/CSV, or JSON/XML.
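A minimal validation sketch along these lines is shown below, assuming a TSV data dictionary with hypothetical columns `variable` and `pattern` (a regular expression per field); a real dictionary would carry descriptions, units, and allowed values as well.

```python
import pandas as pd

# Hypothetical data dictionary (TSV): one row per variable with a regex "pattern".
dictionary = pd.read_csv("data_dictionary.tsv", sep="\t")
data = pd.read_csv("cohort.tsv", sep="\t", dtype=str)

# Check every documented variable before ingesting into a database,
# e.g. patient_id against ^PT\d{6}$ (a fixed-length ID format).
for _, row in dictionary.iterrows():
    var, pattern = row["variable"], row["pattern"]
    if var not in data.columns:
        print(f"MISSING COLUMN: {var}")
        continue
    valid = data[var].fillna("NA").str.fullmatch(pattern)
    if not valid.all():
        print(f"{var}: {(~valid).sum()} rows fail pattern {pattern}")
```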
File naming conventions
Files in the final data bundle must be appropriately named and, ideally, mapped in the data manifest (e.g., in an Excel sheet). For example, if the data bundle includes *.fastq.gz files, these must be named after patient IDs that are mapped to other identifiers the downstream team can use.
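One way to enforce this, sketched below under the assumption of a manifest CSV with a hypothetical `file_name` column, is to cross-check the bundle's *.fastq.gz files against the manifest in both directions.

```python
from pathlib import Path
import pandas as pd

# Hypothetical manifest column "file_name" lists every expected *.fastq.gz file.
manifest = pd.read_csv("manifest.csv")
expected = set(manifest["file_name"])
actual = {p.name for p in Path("bundle").glob("*.fastq.gz")}

print("in manifest but missing from bundle:", sorted(expected - actual))
print("in bundle but missing from manifest:", sorted(actual - expected))
```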
Data collection, data cleaning and data processing steps must be reproducibly documented
All data collected in bioinformatics originates in a lab, which must be identified in the data manifest. Every data processing step used to generate the final data bundle must be outlined and documented so the downstream team can distinguish raw data from filtered, cleaned, and processed data. In industry, data analyses are established through a DTA, a Data Transfer Agreement signed by both parties that outlines how the data will be generated, processed, and transferred to the other party. This ensures that, once transferred, the receiving party can process the data quickly because it arrives exactly as they expect. All analysis pipelines, including data cleaning steps, are outlined in a DTA, down to headers and file naming conventions. DTAs are an industry standard and should find their way into academia when handing off data to another lab, especially in bioinformatics.
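Short of a full DTA, even a small machine-readable processing log helps the downstream team tell raw from processed files. The sketch below is illustrative only; the tools, paths, and field names are placeholders, and a real DTA would fix the exact schema.

```python
import json
from datetime import datetime, timezone

# Placeholder provenance record; every field here is an assumption.
log = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "lab": "EXAMPLE-LAB-ID",  # identify the originating lab
    "steps": [
        {"step": 1, "tool": "fastqc v0.12.1", "action": "QC on raw reads",
         "input": "raw/", "output": "qc/"},
        {"step": 2, "tool": "trimmomatic v0.39", "action": "adapter trimming",
         "input": "raw/", "output": "trimmed/"},
    ],
}
with open("processing_log.json", "w") as f:
    json.dump(log, f, indent=2)
```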
Generate data integrity checks
The industry standard for data integrity is a checksum generated for each file in the data bundle. These can take some time to generate depending on the data (in bioinformatics, file sizes are very large), so sometimes the receiving team verifies a subset rather than all of the checksums by generating their own. The checksum file must be placed in the data bundle, with the file name in one column so it can be mapped back to the data manifest where appropriate.
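A minimal sketch of per-file checksum generation follows; it streams each file so large bioinformatics files never have to fit in memory, and writes one checksum and file name per line in the common sha256sum layout so the names map back to the manifest. The `bundle` directory and output file name are assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# One "<checksum>  <file_name>" line per file, matching the sha256sum layout.
with open("checksums.sha256", "w") as out:
    for p in sorted(Path("bundle").glob("*")):
        if p.is_file():
            out.write(f"{sha256_of(p)}  {p.name}\n")
```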
Transfer the data and document that as well
It is always more difficult than it sounds. The main issues that arise in data transfer are incorrect or outdated file transfer credentials, server issues, slow transfer speeds, and incorrect account permissions; resolving them can take days, weeks, or even months. Once the data is transferred, it is important to document the successful transfer.
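On the receiving side, a spot-check like the sketch below can confirm the transfer without re-hashing the whole bundle; the sample size, paths, and checksum file name are assumptions, and file names are assumed to contain no spaces.

```python
import hashlib
import random
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Parse "<checksum>  <file_name>" lines from the shipped checksum file.
entries = [line.split() for line in open("checksums.sha256") if line.strip()]

# Verify a random subset (here up to 5 files) instead of the whole bundle.
for digest, name in random.sample(entries, k=min(5, len(entries))):
    ok = sha256_of(Path("bundle") / name) == digest
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```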
Continuous communication and feedback
After the data is transferred, it is important to contact the receiving party to confirm the successful transfer. In industry, contact with an external team generally goes through a project manager, another practice that should find its way into academia, for many reasons including protecting the employees involved. When issues arise in the data transfer, it is important to understand that the team handing off the data, not the receiving team, is responsible for fixing data-related issues.
Following these best practices for data hand off can smooth the transition, foster trust and collaboration between teams, and, most importantly, save time.