Research Data Management Guidelines for NGS
This section provides guidelines for effective research data management within our lab. By adopting these guidelines, we aim to improve data organization and naming conventions, leading to enhanced data governance and research efficiency. The guidelines include the following steps:
- Adhere to the folder structure and naming conventions for the `Assays` and `Projects` folders.
- Add relevant metadata to a `metadata.yml` file in each folder.
- Create a database from the metadata files in the `Assays` and `Projects` folders and browse it with a Panel Python app.
- Version control `Projects` folders with GitHub under the Brickman organization.
- Display `Projects` reports under the Brickman organization GitHub Pages.
- Synchronize and archive `Projects` folders in Zenodo, which will give a DOI that can be used in a publication.
- Upload NGS `Assays` folders to GEO, with the information provided in the metadata file.
- Create a Data Management Plan template prefilled with repetitive information, using DMPonline.
1. Folder structure and organization
To ensure efficient data management, it is important to establish a consistent approach to organizing research data. We consider the following practices:
- Folder structure: we aim for a logical and intuitive folder structure that reflects the organization of research projects and experimental data. We use descriptive folder names to make it easy to locate and access specific data files.
- Subfolders: Use subfolders to further categorize data based on their contents, such as code notebooks, results, reports, etc. This helps to keep data organized and facilitates quick retrieval.
- File naming conventions: implement a standardized file naming convention to ensure consistency and clarity. Use descriptive names that include relevant information, such as type of plots, results tables, etc.
1.1 Template engine
We are currently using a cookiecutter template to generate the folder structure. Use cruft when generating assay and project folders (`cruft create` in place of `cookiecutter`), which allows us to validate and sync old folders with the latest version of the template (`cruft check` and `cruft update`).
See this section to get started with a new project/assay.
1.2 Assay folder
For each NGS experiment there should be an `Assay` folder that will contain all experimental datasets (raw files and pipeline-processed files). Inside `Assay` there will be subfolders named after a unique NGS ID and the date they were created:
Assay ID code names:

- `CHIP`: ChIP-seq
- `RNA`: RNA-seq
- `ATAC`: ATAC-seq
- `SCR`: scRNA-seq
- `PROT`: Mass Spectrometry Assay
- `CAT`: Cut&Tag
- `CAR`: Cut&Run
- `RIME`: Rapid Immunoprecipitation Mass spectrometry of Endogenous proteins
For example, `CHIP_20230101` is a ChIP-seq assay made on 1st January 2023.
Folder structure:

```
CHIP_20230424
├── description.yaml
├── metadata.yaml
├── pipeline.md
├── processed
└── raw
    ├── *.fastq.gz
    └── samplesheet.csv
```
- `description.yaml`: short and long descriptions of the assay in yaml format.
- `metadata.yaml`: metadata file for the assay describing different keys (see below).
- `pipeline.md`: description of the pipeline used to process the raw data.
- `processed`: folder with the results of the preprocessing pipeline. Contents depend on the pipeline used.
- `raw`: folder with the raw data.
  - `*.fastq.gz`: in the case of NGS assays, the fastq files.
  - `samplesheet.csv`: file that contains metadata information for the samples. This file is used to run the nf-core pipelines. Ideally, it will also contain a column with info regarding the experimental variables and batches so it can be used for downstream analysis as well (see the example below).
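A minimal sketch of a `samplesheet.csv`, assuming an nf-core/rnaseq-style layout for single-end reads (so `fastq_2` is left empty); the extra `condition` column is the experimental-variable column mentioned above, and all sample names and paths are placeholders:

```csv
sample,fastq_1,fastq_2,strandedness,condition
control_rep1,control_rep1.fastq.gz,,auto,untreated
treated_rep1,treated_rep1.fastq.gz,,auto,treated
```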
1.3 Project folder
There should be another folder called `Projects` that will contain project information and data analysis.
A project may use one or more assays to answer a scientific question. This should be, for example, all the data analysis related to a publication.
The project folder should be named after a unique identifier, such as `<Project-ID>_YYYYMMDD`. `<Project-ID>` should be the initials of the owner of the project folder and the publication year, e.g. `JARH_et_al_20230101`.
Folder structure:

```
<Project-ID>_20230424
├── data
│   ├── assays
│   ├── external
│   └── processed
├── documents
│   └── Non-sensitive_NGS_research_project_template.docx
├── notebooks
│   └── 01_data_analysis.rmd
├── README.md
├── reports
│   ├── figures
│   │   └── 01_data_analysis
│   └── 01_data_analysis.html
├── requirements.txt
├── results
│   └── 01_data_analysis/
├── scripts
├── description.yml
└── metadata.yml
```
- `data`: folder that contains symlinks or shortcuts to where the data is, avoiding copying and modification of original files.
- `documents`: folder containing Word documents, slides or PDFs related to the project, such as explanations of the data or project, papers, etc. It also contains your Data Management Plan.
  - `Non-sensitive_NGS_research_project_template.docx`: a prefilled Data Management Plan based on the Horizon Europe guidelines.
- `notebooks`: folder containing Jupyter, R Markdown or Quarto notebooks with the actual data analysis. Using annotated notebooks is ideal for reproducibility and readability purposes. Notebooks should be labeled numerically in the order they were created, e.g. `00_preprocessing`.
- `README.md`: detailed description of the project in markdown format.
- `reports`: notebooks rendered as html/docx/pdf versions, ideal for sharing with colleagues and also as a formal report of the data analysis procedure.
  - `figures`: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.
- `results`: results from the data analysis, such as tables with differentially expressed genes, enrichment results, etc. These results should be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which results.
- `scripts`: folder containing helper scripts needed to run the data analysis or reproduce the work of the folder.
- `description.yml`: short description of the project.
- `metadata.yml`: metadata file for the project describing different keys (see below).
1.4 Synchronization with DanGPU server
We will set up a cron job to perform a one-way sync between the `/projects` folder and the `NGS_data` folder (see the sketch below). All the analysis will be done on the danGPU server, **with no exceptions!**
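A minimal sketch of such a cron entry, assuming `rsync` performs the one-way sync; the schedule, sync direction, and paths are placeholders, not the actual setup:

```
# hypothetical crontab entry on danGPU: nightly one-way sync at 02:00
0 2 * * * rsync -a --delete /projects/ /path/to/NGS_data/
```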
After a project is done and published, it will be moved to `NGS_data`.
1.5 General naming conventions and more info
- Date format: `YYYYMMDD`
- Authors: initials
- File and folder names: no use of spaces. Field sections are separated by underscores (`_`). Words in each section are written in camelCase. For example: `field1_word1Word2.txt`.
Transcriptomics metadata standards and fields
More info on naming conventions for different types of files and analyses is in development.

name | description | naming convention | file format | example |
---|---|---|---|---|
.fastq | raw sequencing reads | - | fastq | `sampleID_run_read1.fastq` |
.fastqc | quality control from fastqc | - | fastqc | `sampleID_run_read1.fastqc` |
.bam | aligned reads | - | bam | `sampleID_run_read1.bam` |
GTF | sequence annotation | - | gtf | one of https://www.gencodegenes.org/ |
GFF | sequence annotation | - | gff | one of https://www.gencodegenes.org/ |
.bed | genome locations | - | bed | - |
.bigwig | genome coverage | - | bigwig | - |
.fasta | sequence data (nucleotide/amino acid) | - | fasta | one of https://www.gencodegenes.org/ |
MultiQC report | QC aggregated report | `<assayID>_YYYYMMDD.multiqc` | multiqc | `RNA_20200101.multiqc` |
Count matrix | final count matrix | `<assayID>_cm_<aligner>_YYYYMMDD.tsv` | tsv | `RNA_cm_salmon_20200101.tsv` |
DEA | differential expression analysis results | `DEA_<condition1-condition2>_LFC<absolute threshold>_p<pvalue decimals>_YYYYMMDD.tsv` | tsv | `DEA_treat-untreat_LFC1_p01_20200101.tsv` |
DBA | differential binding analysis results | `DBA_<condition1-condition2>_LFC<absolute threshold>_p<pvalue decimals>_YYYYMMDD.tsv` | tsv | `DBA_treat-untreat_LFC1_p01_20200101.tsv` |
MA plot | MA plot | `MAplot_<condition1-condition2>_YYYYMMDD.jpeg` | jpeg | `MAplot_treat-untreat_20200101.jpeg` |
Heatmap plot | heatmap plot of anything | `heatmap_<type>_YYYYMMDD.jpeg` | jpeg | `heatmap_sampleCor_20200101.jpeg` |
Volcano plot | volcano plot | `volcano_<condition1-condition2>_YYYYMMDD.jpeg` | jpeg | `volcano_treat-untreat_20200101.jpeg` |
Venn diagram | Venn diagram | `venn_<type>_YYYYMMDD.jpeg` | jpeg | `venn_consensus_20200101.jpeg` |
Enrichment table | enrichment results | - | tsv | - |
2. Metadata and documentation
Accurate documentation and metadata play a crucial role in facilitating data discovery and interpretation. Consider the following guidelines:
- Metadata capture: Record essential metadata for each dataset, including type of experiment, date, organisms, etc. This information provides context and helps others understand and reuse the data effectively.
- Readme files: Create readme files for each project or dataset. These files should provide a brief overview of the project, list the files and their descriptions, and explain any specific instructions or dependencies required for data analysis.
2.1 Assay metadata fields
Metadata field | Definition | Format | Example |
---|---|---|---|
assay_id | Identifier for the assay | `<assay>_<codename>_YYYYMMDD` | CHIP_Oct4_20200101 |
assay | What kind of NGS assay was used in your experiment? | one of `["CHIP", "RNA", "ATAC", "SCR", "PROT", "CAT", "CAR", "RIME", "TAP"]` | CHIP |
owner | Who performed the experiment? | `<First Name> <Last Name>` | Jose Romero |
date | Date of sequencing; should be the same as defined by the Genomics Platform, in YYYYMMDD format! | YYYYMMDD | 20200101 |
codename | Your name initials or a short keyword for the experiment [Example: JB for Josh Brickman] | `<Initials OR keyword>` | JR |
eln_id | Optional: electronic lab notebook ID | Free text | 12345 |
technology | What technology was used? [Example: 10X Genomics if you used SCR] | Free text | 10X Genomics |
sequencer | What sequencing machine was used? [Example: NovaSeq 6000/NextSeq 2000/NextSeq 500] | Free text | NextSeq 2000 |
seq_kit | What sequencing kit did you use? Please provide the product number if available | Free text | - |
n_samples | How many samples have been sequenced? | `<integer>` | 9 |
is_paired | Paired-end or single-end fastq files | `<single-end OR paired-end>` | single-end |
pipeline | Pipeline name and version [Example: nf-core/rnaseq 3.12.0 or custom] | Free text | nf-core/chipseq -r 1.0 |
processed_by | Person responsible for pre-processing (pipeline execution) | `<First Name> <Last Name>` | Sarah Lundregan |
organism | What organism is this? | `<mouse OR human OR other>` | mouse |
organism_version | Which version of the genome was used? [Example: mm10, hg38] | Free text | mm10 |
organism_subgroup | In vitro or in vivo? | `<in vivo OR in vitro>` | in vitro |
origin | Is this an internal experiment or an external one (collaborator/publication)? | `<internal OR external>` | internal |
note | Optional: was there something worth knowing? | Free text | Low quality experiment/Indexes are swapped ... |
genomics_path | Path to where the data is | `</path/to/file>` | smb:/path/to/file |
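Below is an illustrative `metadata.yml` for an assay, built from the example values in the table above; optional and free-text fields (`eln_id`, `technology`, `seq_kit`, `note`) are omitted, and the codename `Oct4` is a keyword chosen to match the `assay_id` example:

```yaml
assay_id: CHIP_Oct4_20200101
assay: CHIP
owner: Jose Romero
date: "20200101"  # quoted so YAML keeps it as a string
codename: Oct4
sequencer: NextSeq 2000
n_samples: 9
is_paired: single-end
pipeline: nf-core/chipseq -r 1.0
processed_by: Sarah Lundregan
organism: mouse
organism_version: mm10
organism_subgroup: in vitro
origin: internal
genomics_path: smb:/path/to/file
```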
2.2 Project metadata fields
In development.
Metadata field | Definition | Format | Example |
---|---|---|---|
project | Project name | `<name>_<keyword>_YYYY` | lundregan_oct4_2023 |
author | Owner of the project | `<First name> <Surname>` | Sarah Lundregan |
date | Date of creation | YYYYMMDD | 20230101 |
description | Short description of the project | Plain text | This is a project describing the effect of Oct4 perturbation after pERK activation |
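And an illustrative project `metadata.yml`, using the example values above:

```yaml
project: lundregan_oct4_2023
author: Sarah Lundregan
date: "20230101"  # quoted so YAML keeps it as a string
description: This is a project describing the effect of Oct4 perturbation after pERK activation
```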
3. Data catalogue and browser
@SLundregan is in the process of building a prototype data catalogue for the `Assays` folder, using the metadata contained in all `description.yml` and `metadata.yml` files in the assay folders. This will be in the form of an SQLite database that is easily updatable by running a helper script.
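A minimal sketch of what such a helper script could look like, assuming PyYAML is installed; the paths, table layout, and use of the folder name as a fallback ID are illustrative assumptions, not the actual implementation:

```python
import json
import sqlite3
from pathlib import Path

import yaml  # PyYAML

ASSAYS_DIR = Path("/path/to/Assays")  # placeholder path

con = sqlite3.connect("assays.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS assays (assay_id TEXT PRIMARY KEY, metadata TEXT)"
)

# Upsert the metadata of every assay subfolder as a JSON blob,
# matching both metadata.yml and metadata.yaml spellings
for meta_file in sorted(ASSAYS_DIR.glob("*/metadata.y*ml")):
    meta = yaml.safe_load(meta_file.read_text())
    assay_id = meta.get("assay_id", meta_file.parent.name)
    con.execute(
        "INSERT OR REPLACE INTO assays VALUES (?, ?)",
        (assay_id, json.dumps(meta)),
    )

con.commit()
con.close()
```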
@SLundregan is also working on a browsable database using a Panel Python app. The app will display the latest version of the SQLite database, and clicking on an item from the database will open a tab containing all available metadata for the assay.
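A minimal sketch of such a Panel app, assuming the `assays.db` file built by the helper script above; the widget choice and layout are illustrative only (serve it with `panel serve app.py`):

```python
import sqlite3

import pandas as pd
import panel as pn

pn.extension("tabulator")

# Load the latest version of the SQLite database into a dataframe
con = sqlite3.connect("assays.db")
df = pd.read_sql("SELECT * FROM assays", con)
con.close()

# Browsable, paginated table of all assays
table = pn.widgets.Tabulator(df, pagination="remote", page_size=20)
pn.Column("# Assay catalogue", table).servable()
```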
Also, it would be nice to be able to create an `Assay` folder directly from the app, making it easy to fill in the metadata and the information needed for the GEO submission (see below). In the future, you could ideally visualize an analysed single-cell RNA-seq dataset by opening a Cirrocumulus session.
4. `Projects` version control
All projects should be version controlled using GitHub under the Brickman organization. After creating a project folder from the cookiecutter template, initialize a git repository in it (`git init`). The repository can stay private until it is ready for publication.
5. `Projects` GitHub Pages
Using GitHub Pages, it is possible to display your data analyses (or anything related to the project) inside the `Projects` folder so that they are open to the public in html format. This is great for transparency and reproducibility purposes. This can be done once the paper has been made public (GitHub Pages cannot be served from a private repository without a paid plan).
Info on how this is done should be put here
6. `Projects` archiving in Zenodo
Before submitting the manuscript, link the repository to Zenodo and then create a GitHub release. The release will be caught by Zenodo, which will issue a DOI that you can submit along with the manuscript.
7. Data upload to GEO
The raw data from NGS experiments will be uploaded to the Gene Expression Omnibus (GEO). Whenever a new `Assay` folder is created, the data owner must fill in the required documentation and information needed to make the GEO submission as smooth as possible.
8. Create a Data Management Plan
From the University of Copenhagen RDM team:
A Data Management Plan (DMP) is a planning tool that helps researchers to establish good practices for working with physical material and data in a research project. A DMP covers all relevant aspects of research data management throughout the project. Writing a DMP early on in a project helps:
- identify potential issues with the management of research data.
- comply with relevant legislation, policies, and funder requirements.
- document agreements related to the collection, usage, and dissemination of research data between project partners or between student and supervisor.
We have written a DMP template that is prefilled with repetitive information, using DMPonline and the Horizon Europe guidelines. This template contains all the necessary information regarding the common practices we will use, the repositories we use for NGS data, etc. The template is part of the `Projects` folder template, under `documents`. You can check the file here.
The Horizon Europe template is mostly focused on digital data, so it may not be the best fit for the needs of the Brickman Lab, which is mostly a wet lab with some bioinformatics. We will start working on another DMP based on the KU template, which is designed for both physical and digital data.