01 - NGS collection [human]
import pandas as pd
0.1 Datasets
| Dataset | Stages | Technology |
|---|---|---|
| Meistermann2021 | E0,E2,E2.5,E3.25,E3.5,E3.75,E4.5 | SMART-seq2 |
| Petropoulos2016 | E3,E4,E5,E6,E7 | SMART-Seq2 |
| Yan2013 | MII-Oocyte, Zygote, 2C,4C,8C,morula,blastocyst | SMART-seq |
| Yanagida | E6,E7 | SMART-seq |
| Nakamura | Cynomolgus Monkey E6 onwards | SC3-seq |
| Blakeley | E6/7 | SMARTer-seq |
| Xiang | E6,7,8,9,10,12,14 | SMART-seq2 |
| Xue | | Tang et al. method |

| Root Dataset | Dataset | Technology | Download | Notes |
|---|---|---|---|---|
| Radley et al., 2022 | | | | |
| X | Meistermann et al., 2021 | SMART-SEQ2 | PRJEB30442 | |
| X | Petropoulos et al., 2016 | SMART-SEQ2 | E-MTAB-3929 | SCPORTAL |
| X | Yan et al., 2013 | SMART-SEQ | GSE36552 | SCPORTAL |
| X | Yanagida et al., 2021 | SMART-SEQ2 | GSE171820 | |
| X | Nakamura et al., 2017 | SMART-SEQ2 | | |
| | Blakeley et al., 2015 | SMARTer Ultra Low RNA Kit | GSE66507 | |
| | Tyser et al., 2021 | SMART-SEQ2 | Portal | GASTRULATION (CS7) |
| | Xiang et al., 2019 | SMART-SEQ2 | GSE136447 | |
| | Xue | Tang et al. method | GSE44183 | |
- Meistermann 2021
- Petropoulos 2016
- Xiang 2020
- Yan 2013
- Yanagida 2021
- Xue 2013
0.2 Meistermann et al., 2021 PRJEB30442
MEISTERMANN_ENA_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB30442&result=read_run&fields=study_accession,sample_accession,experiment_accession,run_accession,tax_id,scientific_name,fastq_ftp,submitted_ftp,sra_ftp,sample_alias&format=tsv&limit=0"
meistermann_metadata = pd.read_table(MEISTERMANN_ENA_URL)
meistermann_sample_annotation = pd.read_csv("../data/external/human/Meistermann_et_al_2021/sampleAnnot.tsv", index_col=0, sep="\t")
meistermann_sample_annotation = meistermann_sample_annotation[meistermann_sample_annotation['Dataset'] == 'ThisPaper']
meistermann_sample_annotation = meistermann_sample_annotation.merge(meistermann_metadata, left_on = 'Name', right_on = 'sample_alias')
meistermann_sample_annotation.run_accession.to_csv("../pipeline/fetchngs/human_PRJEB30442.txt", index=None, header=None)
nf-core_tower.sh \
Meistermann_2021 \
nextflow run nf-core/fetchngs \
-r 1.10.0 \
--input /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/fetchngs/human_PRJEB30442.txt
nf-core_tower.sh Meistermann_2021 nextflow run brickmanlab/scrnaseq \
-r feature/smartseq \
-c /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/smartseq.human.config \
--input /scratch/Brickman/pipelines/Meistermann_2021/results/samplesheet/samplesheet.csv
0.3 Petropoulos et al., 2016 E-MTAB-3929
PETROPOULOS_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB11202&result=read_run&fields=study_accession,sample_accession,experiment_accession,run_accession,tax_id,scientific_name,fastq_ftp,submitted_ftp,sra_ftp,sample_alias&format=tsv&limit=0"
petropoulos_metadata = pd.read_table(PETROPOULOS_URL)
petropoulos_metadata.sample_alias = petropoulos_metadata.sample_alias.str.extract("(E[0-9].*$)")
petropoulos_metadata_short = petropoulos_metadata[['run_accession', 'sample_alias']]
petropoulos_sample_annotation = pd.read_csv("../data/external/human/Meistermann_et_al_2021/sampleAnnot.tsv", index_col=0, sep="\t")
petropoulos_sample_annotation = petropoulos_sample_annotation.loc[petropoulos_sample_annotation.Dataset == 'Petropoulos2016']
petropoulos_sample_annotation = petropoulos_sample_annotation.merge(petropoulos_metadata_short, left_on = 'Name', right_on = 'sample_alias')
petropoulos_sample_annotation.run_accession.to_csv("../pipeline/fetchngs/human_E-MTAB-3929.txt", index=None, header=None)
nf-core_tower.sh \
Petropoulos_2016 \
nextflow run nf-core/fetchngs \
-r 1.10.0 \
--input /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/fetchngs/human_E-MTAB-3929.txt
nf-core_tower.sh Petropoulos_2016 nextflow run brickmanlab/scrnaseq \
-r feature/smartseq \
-c /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/smartseq.human.config \
--input /scratch/Brickman/pipelines/Petropoulos_2016/results/samplesheet/samplesheet.csv
0.4 Xiang et al., 2020 [GSE136447]
xiang_metadata_1 = pd.read_table("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136447/matrix/GSE136447-GPL20795_series_matrix.txt.gz",
    skiprows=29, nrows=1, index_col = 0).T
xiang_metadata_2 = pd.read_table("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136447/matrix/GSE136447-GPL23227_series_matrix.txt.gz",
    skiprows=29, nrows=1, index_col = 0).T
xiang_metadata = pd.concat([xiang_metadata_1, xiang_metadata_2])
xiang_metadata['Sample_name'] = xiang_metadata.index.to_list()
xiang_metadata['Sample_name'] = xiang_metadata['Sample_name'].str.extract("_(.*$)")
xiang_metadata
| !Sample_title | !Sample_geo_accession | Sample_name |
|---|---|---|
| Embryo_D6A1S1 | GSM4050122 | D6A1S1 |
| Embryo_D6A1S2 | GSM4050123 | D6A1S2 |
| Embryo_D6A1S3 | GSM4050124 | D6A1S3 |
| Embryo_D6A1S4 | GSM4050125 | D6A1S4 |
| Embryo_D6A1B1 | GSM4050126 | D6A1B1 |
| ... | ... | ... |
| Embryo_D14A1S5 | GSM4050628 | D14A1S5 |
| Embryo_D14A1S6 | GSM4050634 | D14A1S6 |
| Embryo_D14A1S7 | GSM4050635 | D14A1S7 |
| Embryo_D14A1S8 | GSM4050636 | D14A1S8 |
| Embryo_D14A1S9 | GSM4050637 | D14A1S9 |
555 rows × 2 columns
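The `skiprows=29` used above depends on how many header lines the series matrix file happens to contain. As a hedged alternative, the sketch below picks rows by their `!Sample_` key instead of by position. The helper `read_sample_rows` and the toy input are hypothetical and only mimic the tab-separated layout of a GEO series matrix; keys that occur on several rows (for example `!Sample_characteristics_ch1`) would overwrite each other in this simple version.

```python
import io


def read_sample_rows(handle, keys):
    """Collect the requested '!Sample_*' rows from a series-matrix-like file.

    Hypothetical helper: returns {key: [value per sample]} for each key found.
    """
    rows = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in keys:
            # Values in the series matrix are wrapped in double quotes.
            rows[fields[0]] = [value.strip('"') for value in fields[1:]]
    return rows


# Toy stand-in for the header of a series matrix file (not real GEO content).
demo = io.StringIO(
    '!Series_title\t"toy series"\n'
    '!Sample_title\t"Embryo_D6A1S1"\t"Embryo_D6A1S2"\n'
    '!Sample_geo_accession\t"GSM4050122"\t"GSM4050123"\n'
)

rows = read_sample_rows(demo, {"!Sample_title", "!Sample_geo_accession"})
# Same effect as the "_(.*$)" pattern: keep everything after the first underscore.
sample_names = [title.split("_", 1)[1] for title in rows["!Sample_title"]]
print(sample_names)
```

For a real download, the same function could be fed a handle from `gzip.open(path, "rt")`.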
xiang_sample_annotation = pd.read_excel("../data/external/human/Xiang_et_al_2019/41586_2019_1875_MOESM10_ESM.xlsx", skiprows=2, index_col=0)
xiang_sample_annotation = xiang_sample_annotation.merge(xiang_metadata, left_on='Sample ID', right_on = 'Sample_name')
xiang_sample_annotation.columns = ['Day', 'Embryo ID', 'Group', 'GEO_accession', 'Sample_name']
xiang_sample_annotation.GEO_accession.to_csv("../pipeline/fetchngs/human_GSE136447.txt", index=None, header=None)
nf-core_tower.sh \
Xiang_2020 \
nextflow run nf-core/fetchngs \
-r 1.10.0 \
--input /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/fetchngs/human_GSE136447.txt
nf-core_tower.sh Xiang_2020_human nextflow run brickmanlab/scrnaseq \
-r feature/smartseq \
-c /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/smartseq.human.config \
--input /scratch/Brickman/pipelines/Xiang_2020_human/results/samplesheet/samplesheet.csv
0.5 Yan et al., 2013
YAN_MATRIX_URL = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE36nnn/GSE36552/matrix/GSE36552_series_matrix.txt.gz"
yan_metadata = pd.read_table(YAN_MATRIX_URL, skiprows=52, index_col = 0).T
yan_annotations = pd.read_csv("../data/external/human/Meistermann_et_al_2021/sampleAnnot.tsv", index_col=0, sep="\t")
yan_annotations = yan_annotations[yan_annotations.Dataset == 'Yan2013'].copy()
yan_metadata = yan_metadata[~yan_metadata.index.str.contains('hESC')].copy()
# Rewrite the GEO sample titles so they match the names used in sampleAnnot.tsv
yan_metadata['SampleNames'] = yan_metadata.index.values
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("#",".")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace(" -Cell", "")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Late blastocyst ", "lateBlasto")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Morulae ", "Morula")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Oocyte ", "Oocyte")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Zygote ", "Zygote")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("2-cell embryo", "e2C")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("4-cell embryo", "e4C")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("8-cell embryo", "e8C")
yan_metadata = yan_metadata.loc[:,['!Sample_geo_accession','SampleNames']].copy()
yan_metadata.columns = ['Geo_accession', 'SampleNames']
yan_annotations = yan_annotations.merge(yan_metadata, left_index = True, right_on='SampleNames', how = 'right')
yan_annotations.Geo_accession.to_csv("../pipeline/fetchngs/human_GSE36552.txt", index=None, header=None)
nf-core_tower.sh \
Yan_2013_human \
nextflow run nf-core/fetchngs \
-r 1.10.0 \
--input /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/fetchngs/human_GSE36552.txt
nf-core_tower.sh Yan_2013_human nextflow run brickmanlab/scrnaseq \
-r feature/smartseq \
-c /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/smartseq.human.config \
--input /scratch/Brickman/pipelines/Yan_2013_human/results/samplesheet/samplesheet.csv
0.6 Yanagida et al., 2021 [GSE171820]
YANAGIDA_URL = 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171820/matrix/GSE171820_series_matrix.txt.gz'
yanagida_metadata = pd.read_table(YANAGIDA_URL, skiprows=30, index_col = 0).T
yanagida_metadata = yanagida_metadata[yanagida_metadata['!Sample_source_name_ch1'] != 'Blastoid'].copy()
yanagida_metadata['lineage'] = yanagida_metadata[['!Sample_characteristics_ch1']].agg(' '.join, axis=1).str.extract("lineage: (.*) polar_mural")
yanagida_metadata['day'] = yanagida_metadata[['!Sample_characteristics_ch1']].agg(' '.join, axis=1).str.extract("time point: Embryonic day ([0-9]{1})")
yanagida_metadata['side'] = yanagida_metadata[['!Sample_characteristics_ch1']].agg(' '.join, axis=1).str.extract("polar_mural: ([a-z]*)")
yanagida_metadata = yanagida_metadata[['lineage','day','side']].copy()
yanagida_metadata['Geo_accession'] = yanagida_metadata.index.values
yanagida_metadata.Geo_accession.to_csv("../pipeline/fetchngs/human_GSE171820.txt", index=None, header=None)
nf-core_tower.sh \
Yanagida_2021_human \
nextflow run nf-core/fetchngs \
-r 1.10.0 \
--input /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/fetchngs/human_GSE171820.txt
nf-core_tower.sh Yanagida_2021_human nextflow run brickmanlab/scrnaseq \
-r feature/smartseq \
-c /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/smartseq.human.config \
--input /scratch/Brickman/pipelines/Yanagida_2021_human/results/samplesheet/samplesheet.csv
0.7 Xue et al., 2013 [GSE44183]
XUE_URL = 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE44nnn/GSE44183/matrix/GSE44183-GPL11154_series_matrix.txt.gz'
xue_metadata = pd.read_table(XUE_URL, skiprows=36, index_col = 0).T
xue_metadata = xue_metadata[xue_metadata['!Sample_source_name_ch1'].isin(['oocyte','pronucleus','zygote','2-cell blastomere','4-cell blastomere','8-cell blastomere', 'morula'])].copy()
xue_metadata = xue_metadata[['!Sample_geo_accession','!Sample_source_name_ch1']].copy()
reannotate_dict = {
    'oocyte': 'Oocyte',
    'pronucleus': 'Pronucleus',
    'zygote': 'Zygote',
    '2-cell blastomere': '2C',
    '4-cell blastomere': '4C',
    '8-cell blastomere': '8C'
}
xue_metadata.replace(reannotate_dict, inplace=True)
xue_metadata['!Sample_geo_accession'].to_csv("../pipeline/fetchngs/human_GSE44183.txt", index=None, header=None)
nf-core_tower.sh \
Xue_2013_human \
nextflow run nf-core/fetchngs \
-r 1.10.0 \
--input /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/fetchngs/human_GSE44183.txt
nf-core_tower.sh Xue_2013_human nextflow run brickmanlab/scrnaseq \
-r feature/smartseq \
-c /projects/dan1/data/Brickman/projects/proks-salehin-et-al-2023/pipeline/smartseq.human.config \
--input /scratch/Brickman/pipelines/Xue_2013_human/results/samplesheet/samplesheet.csv
1 Import data after nf-core single cell RNA-seq pipeline
import scanpy as sc
import anndata as ad
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
1.0.1 Annotations
Annotations (adata.obs)
- EmbryonicDay
- Lineage
- Dataset
- Technology
1.1 Datasets
| Dataset | Stages | Technology |
|---|---|---|
| Meistermann2021 | E0,E2,E2.5,E3.25,E3.5,E3.75,E4.5 | SMART-seq2 |
| Petropoulos2016 | E3,E4,E5,E6,E7 | SMART-Seq2 |
| Yan2013 | MII-Oocyte, Zygote, 2C,4C,8C,morula,blastocyst | SMART-seq |
| Yanagida | E6,E7 | SMART-seq |
| Nakamura | Cynomolgus Monkey E6 onwards | SC3-seq |
| Blakeley | E6/7 | SMARTer-seq |
| Xiang | E6,7,8,9,10,12,14 | SMART-seq2 |
| Xue | | Tang et al. method |

| Root Dataset | Dataset | Technology | Download | Notes |
|---|---|---|---|---|
| Radley et al., 2022 | | | | |
| X | Meistermann et al., 2021 | SMART-SEQ2 | PRJEB30442 | |
| X | Petropoulos et al., 2016 | SMART-SEQ2 | E-MTAB-3929 | SCPORTAL |
| X | Yan et al., 2013 | SMART-SEQ | GSE36552 | SCPORTAL |
| X | Yanagida et al., 2021 | SMART-SEQ2 | GSE171820 | |
| X | Nakamura et al., 2017 | SMART-SEQ2 | | |
| | Blakeley et al., 2015 | SMARTer Ultra Low RNA Kit | GSE66507 | |
| | Tyser et al., 2021 | SMART-SEQ2 | Portal | GASTRULATION (CS7) |
| | Xiang et al., 2019 | SMART-SEQ2 | GSE136447 | |
| | Xue | Tang et al. method | GSE44183 | |
1.2 Initial setup
The pipeline output needs to be normalised to TPM, which requires the average length of each gene. The original iteration of these notebooks linked gene symbols to mean gene length; this iteration instead links Ensembl gene IDs to mean gene length.
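As a reference point for the normalisation described above, here is a minimal sketch of the textbook TPM calculation on made-up numbers. The `tpm` helper, the counts and the lengths are all hypothetical and are not taken from the pipeline.

```python
def tpm(counts, lengths):
    """Transcripts per million for one cell.

    counts  -- read counts per gene
    lengths -- gene lengths in bases, same order as counts
    """
    # Step 1: divide each count by its gene length (reads per base).
    rates = [count / length for count, length in zip(counts, lengths)]
    # Step 2: rescale so that the rates of one cell sum to one million.
    total = sum(rates)
    return [rate / total * 1_000_000 for rate in rates]


# Hypothetical cell with three genes.
counts = [100, 200, 300]
lengths = [1000, 2000, 1000]
values = tpm(counts, lengths)
print(values)
```

The first two genes end up with the same TPM even though the second has twice the reads, because it is also twice as long.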
python3.10 ../data/external/human/gtftools.py -l ../data/external/human/Homo_sapiens.GRCh38.110.gene_length.tsv /scratch/Brickman/references/homo_sapiens/ensembl/GRCh38_110/Homo_sapiens.GRCh38.110.gtf
gtf = pd.read_table("../data/external/human/Homo_sapiens.GRCh38.110.gene_length.tsv", index_col=0)
gene_lengths = gtf[['mean']].copy()
gene_lengths.columns = ['length']
def normalize_smartseq(adata: sc.AnnData, gene_len: pd.DataFrame) -> sc.AnnData:
    # Scale counts by gene length, relative to the median gene length, for the
    # genes found in both adata and gene_len.
    # Returns a new AnnData object; the input adata is not modified.
    print("SMART-SEQ: Normalization")
    common_genes = adata.var_names.intersection(gene_len.index)
    print(f"SMART-SEQ: Common genes {common_genes.shape[0]}")
    lengths = gene_len.loc[common_genes, "length"].values
    normalized = sc.AnnData(adata[:, common_genes].X, obs=adata.obs, dtype=np.float32)
    normalized.var_names = common_genes
    normalized.X = normalized.X / lengths * np.median(lengths)
    normalized.X = np.rint(normalized.X)
    return normalized
1.3 Meistermann et al., 2021
For annotation, I wipe the published annotations of the Meistermann dataset and set every cell to 'Unknown'. The published annotations do not contain an ICM.
meistermann_h5ad = sc.read_h5ad("../data/external/human/meistermann_2021_reprocessed.h5ad")
MEISTERMANN_ENA_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB30442&result=read_run&fields=study_accession,sample_accession,experiment_accession,run_accession,tax_id,scientific_name,fastq_ftp,submitted_ftp,sra_ftp,sample_alias&format=tsv&limit=0"
meistermann_metadata = pd.read_table(MEISTERMANN_ENA_URL)
meistermann_sample_annotation = pd.read_csv("../data/external/human/Meistermann_et_al_2021/sampleAnnot.tsv", index_col=0, sep="\t")
meistermann_sample_annotation = meistermann_sample_annotation[meistermann_sample_annotation['Dataset'] == 'ThisPaper']
meistermann_sample_annotation = meistermann_sample_annotation.merge(meistermann_metadata, left_on = 'Name', right_on = 'sample_alias')
meistermann_sample_annotation.columns
Index(['Embryo', 'Branches', 'EmbryoDay', 'Stage', 'BlastoDissectionSide',
'Dataset', 'Treatment', 'Stirparo.lineage', 'Author.lineage',
'Pseudotime', 'totalCounts', 'totalGenesExpr', 'clusterUmap',
'study_accession', 'sample_accession', 'experiment_accession',
'run_accession', 'tax_id', 'scientific_name', 'fastq_ftp',
'submitted_ftp', 'sra_ftp', 'sample_alias'],
dtype='object')
meistermann_sample_annotation.clusterUmap.unique()
array(['B1.EPI', 'B1_B2', 'early_TE', 'EPI', 'medium_TE', 'late_TE',
'EPI.PrE', 'Morula', 'EightCells', 'EPI.PrE.TE'], dtype=object)
meistermann_sample_annotation.Branches.unique()
array(['6.Epiblast', '3.Early blastocyst', '5.Early trophectoderm',
'4.Inner cell mass', '8.TE.NR2F2-', '9.TE.NR2F2+', '2.Morula',
'1.Pre-morula', '7.Primitive endoderm'], dtype=object)
meistermann_h5ad.obs.loc[:,['sample','run_accession']].merge(meistermann_sample_annotation, left_on='run_accession', right_on='run_accession').loc[:,['Stage','Dataset', 'clusterUmap','EmbryoDay']]
| | Stage | Dataset | clusterUmap | EmbryoDay |
|---|---|---|---|---|
| 0 | B2+ | ThisPaper | B1.EPI | 5.0 |
| 1 | B2+ | ThisPaper | B1_B2 | 5.0 |
| 2 | B2+ | ThisPaper | B1.EPI | 5.0 |
| 3 | B2+ | ThisPaper | B1.EPI | 5.0 |
| 4 | B2+ | ThisPaper | B1_B2 | 5.0 |
| ... | ... | ... | ... | ... |
| 145 | B5 | ThisPaper | medium_TE | 6.0 |
| 146 | B5 | ThisPaper | medium_TE | 6.0 |
| 147 | B5 | ThisPaper | medium_TE | 6.0 |
| 148 | B5 | ThisPaper | medium_TE | 6.0 |
| 149 | B4 | ThisPaper | early_TE | 6.0 |
150 rows × 4 columns
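The merge above discards the cell index of `obs`, which is why the notebook wraps later merges in `reset_index()` and `set_index('index')`. Below is a minimal sketch of that pattern on toy data. The frames and accessions are invented, and `how="left"` with `validate="one_to_one"` is my own addition (the notebook uses the default inner merge) so that unmatched cells stay visible instead of being dropped.

```python
import pandas as pd

# Toy stand-in for adata.obs: one row per cell, indexed by cell name.
obs = pd.DataFrame(
    {"run_accession": ["ERR1", "ERR2", "ERR3"]},
    index=["cell_a", "cell_b", "cell_c"],
)
# Toy stand-in for the sample annotation; ERR2 is deliberately missing.
annotation = pd.DataFrame(
    {"run_accession": ["ERR3", "ERR1"], "EmbryoDay": [6.0, 5.0]}
)

merged = (
    obs.reset_index()  # moves the unnamed index into a column called "index"
    .merge(annotation, on="run_accession", how="left", validate="one_to_one")
    .set_index("index")  # restores the cell names as the index
)

# Cells without an annotation show up as NaN rather than disappearing.
unmatched = merged.index[merged["EmbryoDay"].isna()].tolist()
print(merged)
print(unmatched)
```

Checking `unmatched` before assigning the result to `.obs` is one way to notice when the annotation table does not cover every cell.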
meistermann = meistermann_h5ad.copy()
meistermann.obs = meistermann_h5ad.obs.loc[:,['sample','run_accession']].reset_index().merge(meistermann_sample_annotation, left_on='run_accession', right_on='run_accession').set_index('index')
meistermann_reannotation = meistermann.obs[['EmbryoDay','clusterUmap']]
meistermann_reannotation.head()
| index | EmbryoDay | clusterUmap |
|---|---|---|
| ERX3015937_ERX3015937 | 5.0 | B1.EPI |
| ERX3015939_ERX3015939 | 5.0 | B1_B2 |
| ERX3015940_ERX3015940 | 5.0 | B1.EPI |
| ERX3015941_ERX3015941 | 5.0 | B1.EPI |
| ERX3015936_ERX3015936 | 5.0 | B1_B2 |
lineage_renaming = {
'early_TE': 'Trophectoderm',
'late_TE': 'Trophectoderm',
'medium_TE':'Trophectoderm',
'EPI':'Epiblast',
'PrE':'Primitive Endoderm',
'PrE.TE':'Unknown',
'B1.EPI':'Unknown',
'EPI.PrE': 'Unknown',
'EPI.PrE.TE':'Unknown',
'EPI.early_TE':'Unknown',
'B1_B2':'Blastocyst',
'EightCells': '8C',
'Morula': 'Morula',
}
meistermann_reannotation = meistermann_reannotation.replace({
    'clusterUmap':lineage_renaming
})
meistermann_reannotation.columns = ['day', 'ct']
meistermann_reannotation['ct'] = 'Unknown'
meistermann_reannotation['experiment'] = 'Meistermann_2021'
meistermann_reannotation['technology'] = 'SMARTSeq2'
meistermann_reannotation.head()
| index | day | ct | experiment | technology |
|---|---|---|---|---|
| ERX3015937_ERX3015937 | 5.0 | Unknown | Meistermann_2021 | SMARTSeq2 |
| ERX3015939_ERX3015939 | 5.0 | Unknown | Meistermann_2021 | SMARTSeq2 |
| ERX3015940_ERX3015940 | 5.0 | Unknown | Meistermann_2021 | SMARTSeq2 |
| ERX3015941_ERX3015941 | 5.0 | Unknown | Meistermann_2021 | SMARTSeq2 |
| ERX3015936_ERX3015936 | 5.0 | Unknown | Meistermann_2021 | SMARTSeq2 |
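The renaming step uses the nested-dictionary form of `DataFrame.replace`, which limits the substitution to one column and matches whole cell values only. A small sketch on invented rows illustrates both properties; the frame and the shortened mapping are hypothetical.

```python
import pandas as pd

# Invented rows; "EPI" also appears in the Stage column on purpose.
obs = pd.DataFrame(
    {
        "clusterUmap": ["early_TE", "EPI", "B1.EPI"],
        "Stage": ["B4", "EPI", "B2+"],
    }
)
renaming = {"early_TE": "Trophectoderm", "EPI": "Epiblast", "B1.EPI": "Unknown"}

# {column: {old: new}} restricts the replacement to the named column.
renamed = obs.replace({"clusterUmap": renaming})
print(renamed)
```

"B1.EPI" maps to "Unknown" rather than being partially rewritten by the "EPI" entry, and the "EPI" value in `Stage` is left alone.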
meistermann.obs = meistermann_reannotation
# normalize_smartseq returns a new object, so keep the result
meistermann = normalize_smartseq(meistermann, gene_lengths)
meistermann
SMART-SEQ: Normalization
SMART-SEQ: Common genes 62663
AnnData object with n_obs × n_vars = 150 × 62663
obs: 'day', 'ct', 'experiment', 'technology'
sc.pp.filter_cells(meistermann, min_counts=10)
sc.pp.filter_cells(meistermann, min_genes=10)
meistermann.layers["counts"] = meistermann.X.copy()
sc.pp.normalize_total(meistermann, target_sum=10_000)
sc.pp.log1p(meistermann)
meistermann.raw = meistermann
1.4 Petropoulos et al., 2016
petropoulos_h5ad = sc.read_h5ad("../data/external/human/petropoulos_2016_reprocesses.h5ad")
petropoulos_h5ad
AnnData object with n_obs × n_vars = 1496 × 62754
obs: 'sample', 'fastq_1', 'run_accession', 'experiment_accession', 'sample_accession', 'secondary_sample_accession', 'study_accession', 'secondary_study_accession', 'submission_accession', 'run_alias', 'experiment_alias', 'sample_alias', 'study_alias', 'library_layout', 'library_selection', 'library_source', 'library_strategy', 'library_name', 'instrument_model', 'instrument_platform', 'scientific_name', 'sample_title', 'experiment_title', 'study_title', 'sample_description', 'fastq_md5', 'fastq_ftp', 'fastq_galaxy', 'fastq_aspera'
var: 'gene_symbol'
PETROPOULOS_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB11202&result=read_run&fields=study_accession,sample_accession,experiment_accession,run_accession,tax_id,scientific_name,fastq_ftp,submitted_ftp,sra_ftp,sample_alias&format=tsv&limit=0"
petropoulos_metadata = pd.read_table(PETROPOULOS_URL)
petropoulos_metadata.sample_alias = petropoulos_metadata.sample_alias.str.extract("(E[0-9].*$)")
petropoulos_metadata_short = petropoulos_metadata[['run_accession', 'sample_alias']]
petropoulos_sample_annotation = pd.read_csv("../data/external/human/Meistermann_et_al_2021/sampleAnnot.tsv", index_col=0, sep="\t")
petropoulos_sample_annotation = petropoulos_sample_annotation.loc[petropoulos_sample_annotation.Dataset == 'Petropoulos2016']
petropoulos_sample_annotation = petropoulos_sample_annotation.merge(petropoulos_metadata_short, left_on = 'Name', right_on = 'sample_alias')
petropoulos = petropoulos_h5ad.copy()
petropoulos.obs = petropoulos_h5ad.obs.loc[:,['sample','run_accession']].reset_index().merge(petropoulos_sample_annotation, left_on='run_accession', right_on='run_accession').set_index('index')
petropoulos
AnnData object with n_obs × n_vars = 1496 × 62754
obs: 'sample', 'run_accession', 'Embryo', 'Branches', 'EmbryoDay', 'Stage', 'BlastoDissectionSide', 'Dataset', 'Treatment', 'Stirparo.lineage', 'Author.lineage', 'Pseudotime', 'totalCounts', 'totalGenesExpr', 'clusterUmap', 'sample_alias'
var: 'gene_symbol'
petropoulos_reannotation = petropoulos.obs
pd.crosstab(petropoulos_reannotation['Stirparo.lineage'], petropoulos_reannotation.Stage)
| Stirparo.lineage / Stage | 8C | B | B2+ | M | MC |
|---|---|---|---|---|---|
| EPI | 0 | 44 | 0 | 0 | 0 |
| ICM | 0 | 65 | 0 | 0 | 0 |
| TE | 0 | 927 | 0 | 0 | 0 |
| intermediate | 0 | 66 | 0 | 0 | 0 |
| prE | 0 | 28 | 0 | 0 | 0 |
| undefined | 78 | 43 | 24 | 120 | 47 |
pd.crosstab(petropoulos_reannotation.Branches, petropoulos_reannotation.EmbryoDay)
| Branches / EmbryoDay | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 |
|---|---|---|---|---|---|
| 1.Pre-morula | 80 | 8 | 0 | 0 | 0 |
| 2.Morula | 0 | 150 | 16 | 0 | 0 |
| 3.Early blastocyst | 0 | 29 | 116 | 1 | 0 |
| 4.Inner cell mass | 0 | 0 | 13 | 6 | 0 |
| 5.Early trophectoderm | 0 | 0 | 97 | 27 | 2 |
| 6.Epiblast | 0 | 0 | 114 | 33 | 16 |
| 7.Primitive endoderm | 0 | 0 | 7 | 45 | 42 |
| 8.TE.NR2F2- | 0 | 0 | 3 | 226 | 141 |
| 9.TE.NR2F2+ | 0 | 0 | 0 | 71 | 253 |
pd.crosstab(petropoulos_reannotation.clusterUmap, petropoulos_reannotation.Branches)
| clusterUmap / Branches | 1.Pre-morula | 2.Morula | 3.Early blastocyst | 4.Inner cell mass | 5.Early trophectoderm | 6.Epiblast | 7.Primitive endoderm | 8.TE.NR2F2- | 9.TE.NR2F2+ |
|---|---|---|---|---|---|---|---|---|---|
| B1.EPI | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 |
| B1_B2 | 0 | 25 | 136 | 2 | 7 | 1 | 0 | 0 | 0 |
| EPI | 0 | 0 | 0 | 0 | 0 | 104 | 29 | 0 | 0 |
| EPI.PrE | 0 | 0 | 0 | 0 | 0 | 19 | 9 | 0 | 0 |
| EPI.PrE.TE | 0 | 0 | 0 | 1 | 2 | 8 | 19 | 11 | 3 |
| EPI.early_TE | 0 | 0 | 0 | 0 | 1 | 14 | 0 | 0 | 0 |
| EightCells | 83 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Morula | 5 | 141 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| PrE | 0 | 0 | 0 | 0 | 0 | 9 | 32 | 4 | 3 |
| PrE.TE | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 15 |
| early_TE | 0 | 0 | 3 | 15 | 98 | 5 | 0 | 0 | 0 |
| late_TE | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 39 | 290 |
| medium_TE | 0 | 0 | 0 | 0 | 18 | 0 | 1 | 316 | 13 |
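The contingency tables above come from `pd.crosstab`, which counts how often each pair of labels occurs together. A minimal sketch on four invented cells shows how to read one; the labels are hypothetical.

```python
import pandas as pd

# Four invented cells with a lineage label and a stage label each.
cells = pd.DataFrame(
    {
        "lineage": ["TE", "TE", "EPI", "ICM"],
        "stage": ["B", "B", "B", "M"],
    }
)

# Rows are lineages, columns are stages, values are cell counts.
table = pd.crosstab(cells["lineage"], cells["stage"])
print(table)
```

Label pairs that never occur together are filled with 0, which is why the tables above contain so many zeros.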
clusterUmap_renaming = {
'early_TE': 'Trophectoderm',
'late_TE': 'Trophectoderm',
'medium_TE':'Trophectoderm',
'EPI':'Epiblast',
'PrE':'Primitive Endoderm',
'PrE.TE':'Unknown',
'B1.EPI':'Unknown',
'EPI.PrE': 'Unknown',
'EPI.PrE.TE':'Unknown',
'EPI.early_TE':'Unknown',
'B1_B2':'Blastocyst',
'EightCells': '8C',
'Morula': 'Morula',
}
petropoulos_reannotation = petropoulos_reannotation.replace({
'clusterUmap':clusterUmap_renaming
})
stirparoLineage_renaming = {
'EPI':'Epiblast',
'prE':'Primitive Endoderm',
'ICM':'Inner Cell Mass',
'TE': 'Trophectoderm',
'intermediate': 'Unknown',
'undefined': 'Unknown'
}
petropoulos_reannotation = petropoulos_reannotation.replace({
'Stirparo.lineage':stirparoLineage_renaming
})
np.sum(petropoulos_reannotation['Stirparo.lineage'].isna())
54
petropoulos_reannotation.loc[petropoulos_reannotation['Stirparo.lineage'].isna(),['Stirparo.lineage']] = 'Unknown'
petropoulos_reannotation = petropoulos_reannotation[['EmbryoDay','Dataset','Stirparo.lineage']].copy()
petropoulos_reannotation.columns = ['day','experiment','ct']
petropoulos_reannotation['experiment'] = 'Petropoulos_2016'
petropoulos_reannotation['technology'] = 'SMARTSeq2'
petropoulos_reannotation.head()
| index | day | experiment | ct | technology |
|---|---|---|---|---|
| ERX1120888_ERX1120888 | 3.0 | Petropoulos_2016 | Unknown | SMARTSeq2 |
| ERX1120887_ERX1120887 | 3.0 | Petropoulos_2016 | Unknown | SMARTSeq2 |
| ERX1120886_ERX1120886 | 3.0 | Petropoulos_2016 | Unknown | SMARTSeq2 |
| ERX1120885_ERX1120885 | 3.0 | Petropoulos_2016 | Unknown | SMARTSeq2 |
| ERX1120890_ERX1120890 | 3.0 | Petropoulos_2016 | Unknown | SMARTSeq2 |
petropoulos.obs = petropoulos_reannotation
# normalize_smartseq returns a new object, so keep the result
petropoulos = normalize_smartseq(petropoulos, gene_lengths)
petropoulos
SMART-SEQ: Normalization
SMART-SEQ: Common genes 62663
AnnData object with n_obs × n_vars = 1496 × 62663
obs: 'day', 'experiment', 'ct', 'technology'
sc.pp.filter_cells(petropoulos, min_counts=10)
sc.pp.filter_cells(petropoulos, min_genes=10)
petropoulos.layers["counts"] = petropoulos.X.copy()
sc.pp.normalize_total(petropoulos, target_sum=10_000)
sc.pp.log1p(petropoulos)
petropoulos.raw = petropoulos
1.5 Xiang 2020
xiang_h5ad = sc.read_h5ad("../data/external/human/xiang_2020_reprocessed.h5ad")
xiang_metadata_1 = pd.read_table("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136447/matrix/GSE136447-GPL20795_series_matrix.txt.gz",
skiprows=29, nrows=1, index_col = 0).T
xiang_metadata_2 = pd.read_table("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136447/matrix/GSE136447-GPL23227_series_matrix.txt.gz",
skiprows=29, nrows=1, index_col = 0).T
xiang_metadata = pd.concat([xiang_metadata_1, xiang_metadata_2])
xiang_metadata['Sample_name'] = xiang_metadata.index.to_list()
xiang_metadata['Sample_name'] = xiang_metadata['Sample_name'].str.extract("_(.*$)")
xiang_sample_annotation = pd.read_excel("../data/external/human/Xiang_et_al_2019/41586_2019_1875_MOESM10_ESM.xlsx", skiprows=2, index_col=0)
xiang_sample_annotation = xiang_sample_annotation.merge(xiang_metadata, left_on='Sample ID', right_on = 'Sample_name')
xiang_sample_annotation.columns = ['Day', 'Embryo ID', 'Group', 'GEO_accession', 'Sample_name']
xiang_sample_annotation
| | Day | Embryo ID | Group | GEO_accession | Sample_name |
|---|---|---|---|---|---|
| 0 | D6 | D6A1 | ICM | GSM4050122 | D6A1S1 |
| 1 | D6 | D6A1 | EPI | GSM4050123 | D6A1S2 |
| 2 | D6 | D6A1 | ICM | GSM4050124 | D6A1S3 |
| 3 | D6 | D6A1 | ICM | GSM4050125 | D6A1S4 |
| 4 | D6 | D6A1 | ICM | GSM4050126 | D6A1B1 |
| ... | ... | ... | ... | ... | ... |
| 550 | D14 | D14A3 | EPI | GSM4050672 | D14A3S29 |
| 551 | D14 | D14A3 | EVT | GSM4050673 | D14A3S30 |
| 552 | D14 | D14A3 | CTB | GSM4050674 | D14A3S5 |
| 553 | D14 | D14A3 | EVT | GSM4050675 | D14A3S7 |
| 554 | D14 | D14A3 | EVT | GSM4050676 | D14A3S8 |
555 rows × 5 columns
xiang = xiang_h5ad.copy()
xiang.obs = xiang_h5ad.obs.loc[:,['sample','sample_alias']].reset_index().merge(xiang_sample_annotation, left_on='sample_alias', right_on='GEO_accession').set_index('index')
xiang
AnnData object with n_obs × n_vars = 555 × 62754
obs: 'sample', 'sample_alias', 'Day', 'Embryo ID', 'Group', 'GEO_accession', 'Sample_name'
var: 'gene_symbol'
xiang_reannotation = xiang.obs
day_renaming = {
'D10':10,
'D12':12,
'D14':14,
'D6':6,
'D7':7,
'D8':8,
'D9':9,
}
group_renaming = {
'CTB':'Trophectoderm',
'EPI':'Epiblast',
'EVT':'Trophectoderm',
'ICM':'Inner Cell Mass',
'PSA-EPI':'PostImplantation-Epiblast',
'PrE':'Primitive Endoderm',
'STB':'Trophectoderm'
}
xiang_reannotation = xiang_reannotation.replace({'Day':day_renaming, 'Group': group_renaming})
xiang_reannotation = xiang_reannotation[['Day', 'Group']].copy()
xiang_reannotation.columns = ['day', 'ct']
xiang_reannotation['experiment'] = 'Xiang_2020'
xiang_reannotation['technology'] = 'SMARTSeq2'
xiang_reannotation.head()
| index | day | ct | experiment | technology |
|---|---|---|---|---|
| SRX6774526_SRX6774526 | 12 | Trophectoderm | Xiang_2020 | SMARTSeq2 |
| SRX6774449_SRX6774449 | 6 | Epiblast | Xiang_2020 | SMARTSeq2 |
| SRX6774468_SRX6774468 | 6 | Epiblast | Xiang_2020 | SMARTSeq2 |
| SRX6774508_SRX6774508 | 12 | Trophectoderm | Xiang_2020 | SMARTSeq2 |
| SRX6774478_SRX6774478 | 10 | Epiblast | Xiang_2020 | SMARTSeq2 |
xiang.obs = xiang_reannotation
# normalize_smartseq returns a new object, so keep the result
xiang = normalize_smartseq(xiang, gene_lengths)
xiang
SMART-SEQ: Normalization
SMART-SEQ: Common genes 62663
AnnData object with n_obs × n_vars = 555 × 62663
obs: 'day', 'ct', 'experiment', 'technology'
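Each dataset in this notebook goes through the same `sc.pp.normalize_total(..., target_sum=10_000)` and `sc.pp.log1p(...)` calls. As a rough illustration of what those two steps do to a single cell, the sketch below reproduces the arithmetic in plain Python on invented counts. The helper `normalize_and_log` is hypothetical and ignores scanpy's optional arguments.

```python
import math


def normalize_and_log(cell_counts, target_sum=10_000):
    """Scale one cell to target_sum total counts, then apply ln(1 + x)."""
    total = sum(cell_counts)
    scaled = [count / total * target_sum for count in cell_counts]
    return [math.log1p(value) for value in scaled]


# Invented cell with three genes and 20 counts in total.
cell = [0, 5, 15]
result = normalize_and_log(cell)
print(result)
```

Zero counts stay at zero, and undoing the log with `math.expm1` recovers values that sum to the target of 10,000.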
sc.pp.filter_cells(xiang, min_counts=10)
sc.pp.filter_cells(xiang, min_genes=10)
xiang.layers["counts"] = xiang.X.copy()
sc.pp.normalize_total(xiang, target_sum=10_000)
sc.pp.log1p(xiang)
xiang.raw = xiang
1.6 Yan 2013
yan_h5ad = sc.read_h5ad("../data/external/human/yan_2013_reprocessed.h5ad")
YAN_MATRIX_URL = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE36nnn/GSE36552/matrix/GSE36552_series_matrix.txt.gz"
yan_metadata = pd.read_table(YAN_MATRIX_URL, skiprows=52, index_col = 0).T
yan_annotations = pd.read_csv("../data/external/human/Meistermann_et_al_2021/sampleAnnot.tsv", index_col=0, sep="\t")
yan_annotations = yan_annotations[yan_annotations.Dataset == 'Yan2013'].copy()
yan_metadata = yan_metadata[~yan_metadata.index.str.contains('hESC')].copy()
yan_metadata['SampleNames'] = yan_metadata.index.values
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("#",".")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace(" -Cell", "")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Late blastocyst ", "lateBlasto")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Morulae ", "Morula")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Oocyte ", "Oocyte")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("Zygote ", "Zygote")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("2-cell embryo", "e2C")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("4-cell embryo", "e4C")
yan_metadata['SampleNames'] = yan_metadata['SampleNames'].str.replace("8-cell embryo", "e8C")
yan_metadata = yan_metadata.loc[:,['!Sample_geo_accession','SampleNames']].copy()
yan_metadata.columns = ['Geo_accession', 'SampleNames']
yan_annotations = yan_annotations.merge(yan_metadata, left_index = True, right_on='SampleNames', how = 'right')
yan = yan_h5ad.copy()
yan.obs = yan_h5ad.obs.loc[:,['sample','sample_alias']].reset_index().merge(yan_annotations, left_on='sample_alias', right_on='Geo_accession').set_index('index')
The Yan 2013 data are encoded by stage (Oocyte, Zygote, etc.). To convert to embryonic day, the samples were encoded as follows:
- Zygote -> E0.75; collected 17h post-IVF
- e2C -> E1.25; collected 27h post-IVF
- e4C -> E2.0; collected 48h post-IVF
- e8C -> E3.0
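The stage-to-day encoding listed above can be sketched as a small lookup table. The day values are the ones given in this notebook (the oocyte value of 0 comes from the reannotation code), while the helper `embryonic_day` and its prefix matching are assumptions made for illustration; the notebook itself matches with `str.contains`.

```python
# Stage prefix -> embryonic day, as listed in the notebook.
STAGE_TO_DAY = {
    "Oocyte": 0.0,
    "Zygote": 0.75,
    "e2C": 1.25,
    "e4C": 2.0,
    "e8C": 3.0,
}


def embryonic_day(sample_name):
    """Return the embryonic day for a sample name, or None if no stage matches."""
    for prefix, day in STAGE_TO_DAY.items():
        if sample_name.startswith(prefix):
            return day
    return None


print(embryonic_day("e4C.1.4"))
print(embryonic_day("lateBlasto.1.11"))
```

Samples that return `None`, such as morula and late blastocyst cells, keep the `EmbryoDay` they already have from the sample annotation.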
yan_reannotation = yan.obs
yan_reannotation.loc[yan_reannotation['SampleNames'].str.contains('Oocyte'),['EmbryoDay','clusterUmap', 'Stirparo.lineage']] = [0,'Oocyte','Oocyte']
yan_reannotation.loc[yan_reannotation['SampleNames'].str.contains('Zygote'),['EmbryoDay','clusterUmap', 'Stirparo.lineage']] = [0.75, 'Zygote', 'Zygote']
yan_reannotation.loc[yan_reannotation['SampleNames'].str.contains('e2C'),['EmbryoDay','clusterUmap', 'Stirparo.lineage']] = [1.25, '2C', '2C']
yan_reannotation.loc[yan_reannotation['SampleNames'].str.contains('e4C'),['EmbryoDay','clusterUmap', 'Stirparo.lineage']] = [2.0, '4C', '4C']
yan_reannotation.loc[yan_reannotation['SampleNames'].str.contains('e8C'),['EmbryoDay','clusterUmap', 'Stirparo.lineage']] = [3.0, '8C', '8C']
yan_reannotation.loc[yan_reannotation['SampleNames'].str.contains('Morula'),['clusterUmap', 'Stirparo.lineage']] = ['Morula', 'Morula']
clusterUmap_renaming = {
'early_TE': 'Trophectoderm',
'late_TE': 'Trophectoderm',
'medium_TE':'Trophectoderm',
'EPI':'Epiblast',
'PrE':'Primitive Endoderm',
'PrE.TE':'Unknown',
'B1.EPI':'Unknown',
'EPI.PrE': 'Unknown',
'EPI.PrE.TE':'Unknown',
'EPI.early_TE':'Unknown',
'B1_B2':'Blastocyst',
'EightCells': '8C',
'Morula': 'Morula',
}
yan_reannotation = yan_reannotation.replace({
'clusterUmap':clusterUmap_renaming
})
stirparoLineage_renaming = {
'EPI':'Epiblast',
'prE':'Primitive Endoderm',
'ICM':'Inner Cell Mass',
'TE': 'Trophectoderm',
'intermediate': 'Unknown',
'undefined': 'Unknown'
}
yan_reannotation = yan_reannotation.replace({
'Stirparo.lineage':stirparoLineage_renaming
})
yan_reannotation.loc[yan_reannotation['Stirparo.lineage'].isna(),['Stirparo.lineage']] = 'Unknown'
yan_reannotation.head()
| index | sample | sample_alias | Embryo | Branches | EmbryoDay | Stage | BlastoDissectionSide | Dataset | Treatment | Stirparo.lineage | Author.lineage | Pseudotime | totalCounts | totalGenesExpr | clusterUmap | Geo_accession | SampleNames |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRX144398_SRX144398 | SRX144398 | GSM922204 | lateBlasto.1 | 4.Inner cell mass | 6.0 | B | NaN | Yan2013 | NO | Trophectoderm | TE | 32.314831 | 11403434.0 | 12724.0 | Trophectoderm | GSM922204 | lateBlasto.1.11 |
| SRX144343_SRX144343 | SRX144343 | GSM922149 | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | 4C | NaN | NaN | NaN | NaN | 4C | GSM922149 | e4C.1.4 |
| SRX144359_SRX144359 | SRX144359 | GSM922165 | 8C.2 | 1.Pre-morula | 3.0 | 8C | NaN | Yan2013 | NO | 8C | NaN | 2.626271 | 18039226.0 | 16018.0 | 8C | GSM922165 | e8C.2.4 |
| SRX144408_SRX144408 | SRX144408 | GSM922214 | lateBlasto.2 | 4.Inner cell mass | 6.0 | B | NaN | Yan2013 | NO | Trophectoderm | TE | 35.919631 | 21209246.0 | 12577.0 | Unknown | GSM922214 | lateBlasto.2.9 |
| SRX144361_SRX144361 | SRX144361 | GSM922167 | 8C.2 | 1.Pre-morula | 3.0 | 8C | NaN | Yan2013 | NO | 8C | NaN | 0.329333 | 17166536.0 | 17739.0 | 8C | GSM922167 | e8C.2.6 |
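The re-annotation pattern used above (a boolean mask from `str.contains`, a multi-column `.loc` assignment, then a per-column `replace`) can be tried on a toy table. This is a minimal sketch: the rows and the `stage_rules` dict are invented for illustration, and only the column names mirror the ones used in this section.

```python
import numpy as np
import pandas as pd

# Toy stand-in for yan.obs; the rows are invented for illustration.
obs = pd.DataFrame(
    {
        "SampleNames": ["Oocyte.1", "e4C.1.4", "lateBlasto.1.11"],
        "EmbryoDay": [np.nan, np.nan, 6.0],
        "clusterUmap": [None, None, "late_TE"],
    }
)

# Assign several columns at once for every row whose name matches a stage.
stage_rules = {"Oocyte": (0.0, "Oocyte"), "e4C": (2.0, "4C")}
for pattern, (day, label) in stage_rules.items():
    mask = obs["SampleNames"].str.contains(pattern)
    obs.loc[mask, ["EmbryoDay", "clusterUmap"]] = [day, label]

# A nested dict restricts the renaming to a single column.
obs = obs.replace({"clusterUmap": {"late_TE": "Trophectoderm"}})
print(obs)
```

Keeping the renaming dict nested under the column name matters because labels such as `EPI` appear in more than one column with different target spellings.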
yan_reannotation = yan_reannotation[['EmbryoDay', 'Stirparo.lineage']].copy()
yan_reannotation.columns = ['day','ct']
yan_reannotation['experiment'] = 'Yan_2013'
yan_reannotation['technology'] = 'SMARTSeq'
yan_reannotation.head()
| index | day | ct | experiment | technology |
|---|---|---|---|---|
| SRX144398_SRX144398 | 6.0 | Trophectoderm | Yan_2013 | SMARTSeq |
| SRX144343_SRX144343 | 2.0 | 4C | Yan_2013 | SMARTSeq |
| SRX144359_SRX144359 | 3.0 | 8C | Yan_2013 | SMARTSeq |
| SRX144408_SRX144408 | 6.0 | Trophectoderm | Yan_2013 | SMARTSeq |
| SRX144361_SRX144361 | 3.0 | 8C | Yan_2013 | SMARTSeq |
yan.obs = yan_reannotation
normalize_smartseq(yan, gene_lengths)
SMART-SEQ: Normalization
SMART-SEQ: Common genes 62663
AnnData object with n_obs × n_vars = 90 × 62663
obs: 'day', 'ct', 'experiment', 'technology'
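The preprocessing that follows (library-size scaling with `sc.pp.normalize_total`, then `sc.pp.log1p`) amounts to the arithmetic below. This is a minimal NumPy sketch on an invented 2 × 3 count matrix; it illustrates the arithmetic only and is not scanpy's implementation, and the function name `normalize_and_log` is hypothetical.

```python
import numpy as np

def normalize_and_log(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

# Invented counts: two cells, three genes.
toy_counts = np.array([[10, 30, 60], [1, 1, 2]])
log_norm = normalize_and_log(toy_counts)

# Undoing the log recovers rows that sum to target_sum.
print(np.expm1(log_norm).sum(axis=1))
```

This sketch would divide by zero for a cell with no counts; in the notebook the `min_counts` filter removes such cells before normalisation.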
sc.pp.filter_cells(yan, min_counts=10)
sc.pp.filter_cells(yan, min_genes=10)
yan.layers["counts"] = yan.X.copy()
sc.pp.normalize_total(yan, target_sum=10_000)
sc.pp.log1p(yan)
yan.raw = yan
1.7 Yanagida 2021
yanagida_h5ad = sc.read_h5ad("../data/external/human/yanagida_2021_reprocessed.h5ad")
YANAGIDA_URL = 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171820/matrix/GSE171820_series_matrix.txt.gz'
yanagida_metadata = pd.read_table(YANAGIDA_URL, skiprows=30, index_col = 0).T
yanagida_metadata = yanagida_metadata[yanagida_metadata['!Sample_source_name_ch1'] != 'Blastoid'].copy()
yanagida_metadata['lineage'] = yanagida_metadata[['!Sample_characteristics_ch1']].agg(' '.join, axis=1).str.extract("lineage: (.*) polar_mural")
yanagida_metadata['day'] = yanagida_metadata[['!Sample_characteristics_ch1']].agg(' '.join, axis=1).str.extract("time point: Embryonic day ([0-9]{1})")
yanagida_metadata['side'] = yanagida_metadata[['!Sample_characteristics_ch1']].agg(' '.join, axis=1).str.extract("polar_mural: ([a-z]*)")
yanagida_metadata = yanagida_metadata[['lineage','day','side']].copy()
yanagida_metadata['Geo_accession'] = yanagida_metadata.index.values
yanagida = yanagida_h5ad.copy()
yanagida_metadata
| !Sample_geo_accession | lineage | day | side | Geo_accession |
|---|---|---|---|---|
| GSM5234744 | Trophectoderm | 7 | polar | GSM5234744 |
| GSM5234745 | Trophectoderm | 7 | polar | GSM5234745 |
| GSM5234746 | Trophectoderm | 7 | polar | GSM5234746 |
| GSM5234747 | Epiblast | 6 | polar | GSM5234747 |
| GSM5234748 | Epiblast | 6 | polar | GSM5234748 |
| ... | ... | ... | ... | ... |
| GSM5235116 | Trophectoderm | 6 | polar | GSM5235116 |
| GSM5235117 | Trophectoderm | 6 | polar | GSM5235117 |
| GSM5235118 | Trophectoderm | 6 | polar | GSM5235118 |
| GSM5235119 | Trophectoderm | 6 | mural | GSM5235119 |
| GSM5235128 | Trophectoderm | 6 | mural | GSM5235128 |
228 rows × 4 columns
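The metadata parsing above relies on the series-matrix layout, where `!Sample_characteristics_ch1` occurs on several rows and therefore becomes several identically named columns after the transpose. The sketch below reproduces that on an invented two-sample excerpt; the accessions and characteristic strings are made up, and only the field names and regular expressions follow the cell above.

```python
import io
import pandas as pd

# Invented two-sample excerpt in series-matrix layout (tab-separated, one field per row).
toy_matrix = "\n".join(
    [
        "!Sample_geo_accession\tGSM_A\tGSM_B",
        "!Sample_source_name_ch1\tEmbryo\tEmbryo",
        "!Sample_characteristics_ch1\ttime point: Embryonic day 6\ttime point: Embryonic day 7",
        "!Sample_characteristics_ch1\tlineage: Epiblast\tlineage: Early Trophectoderm",
        "!Sample_characteristics_ch1\tpolar_mural: polar\tpolar_mural: mural",
    ]
)

# Transposing turns samples into rows; the repeated field becomes repeated columns.
meta = pd.read_table(io.StringIO(toy_matrix), index_col=0).T

# Selecting with a list returns every column of that name; join them per sample.
joined = meta[["!Sample_characteristics_ch1"]].agg(" ".join, axis=1)

parsed = pd.DataFrame(
    {
        "lineage": joined.str.extract(r"lineage: (.*) polar_mural", expand=False),
        "day": joined.str.extract(r"time point: Embryonic day ([0-9])", expand=False),
        "side": joined.str.extract(r"polar_mural: ([a-z]*)", expand=False),
    }
)
print(parsed)
```

`str.extract` returns strings, so `day` stays textual until it is converted with `pd.to_numeric` after the datasets are merged. The single-digit pattern is sufficient here because this dataset only covers E6 and E7.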
yanagida.obs = yanagida_h5ad.obs.loc[:,['sample','sample_alias']].reset_index().merge(yanagida_metadata, left_on='sample_alias', right_on='Geo_accession').set_index('index')
yanagida_reannotation = yanagida.obs[['lineage','day']]
yanagida_reannotation
| index | lineage | day |
|---|---|---|
| SRX10567995_SRX10567995 | Trophectoderm | 6 |
| SRX10567984_SRX10567984 | Trophectoderm | 6 |
| SRX10568025_SRX10568025 | Trophectoderm | 6 |
| SRX10567983_SRX10567983 | Trophectoderm | 6 |
| SRX10567987_SRX10567987 | Trophectoderm | 7 |
| ... | ... | ... |
| SRX10568348_SRX10568348 | Trophectoderm | 6 |
| SRX10568337_SRX10568337 | Trophectoderm | 6 |
| SRX10568339_SRX10568339 | Trophectoderm | 6 |
| SRX10568338_SRX10568338 | Trophectoderm | 6 |
| SRX10568336_SRX10568336 | Trophectoderm | 6 |
228 rows × 2 columns
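`DataFrame.merge` discards the index of both inputs when joining on columns, which is why the cell above moves the index into a column with `reset_index()` and restores it with `set_index('index')`. A minimal sketch on invented accessions; the explicit `how` and `validate` arguments are additions for illustration and are not part of the original cell.

```python
import pandas as pd

# Toy obs table indexed by run name, and toy metadata keyed by GEO accession.
obs = pd.DataFrame(
    {"sample_alias": ["GSM_B", "GSM_A"]},
    index=pd.Index(["SRX2_SRX2", "SRX1_SRX1"], name="index"),
)
metadata = pd.DataFrame(
    {"Geo_accession": ["GSM_A", "GSM_B"], "lineage": ["Epiblast", "Trophectoderm"]}
)

merged = (
    obs.reset_index()  # keep the cell index as a regular column
    .merge(
        metadata,
        left_on="sample_alias",
        right_on="Geo_accession",
        how="inner",
        validate="one_to_one",  # raise if an accession is duplicated on either side
    )
    .set_index("index")  # restore the cell index after the merge
)
print(merged)
```

An inner join silently drops cells that have no metadata row, so comparing `len(merged)` with `len(obs)` is a cheap sanity check before overwriting `.obs`.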
yanagida_reannotation.lineage.unique()
array(['Trophectoderm', 'Epiblast', 'Unknown', 'Early Trophectoderm',
'Inner Cell Mass', 'Inner Cell Mass-Trophectoderm Transition',
'Primitive Endoderm'], dtype=object)
lineage_renaming = {
'Early Trophectoderm': 'Trophectoderm',
'Inner Cell Mass-Trophectoderm Transition': 'Unknown',
}
yanagida_reannotation = yanagida_reannotation.replace({'lineage':lineage_renaming})
yanagida_reannotation = yanagida_reannotation[['day', 'lineage']]
yanagida_reannotation.columns = ['day','ct']
yanagida_reannotation['experiment'] = 'Yanagida_2021'
yanagida_reannotation['technology'] = 'SMARTSeq2'
yanagida_reannotation.head()
| index | day | ct | experiment | technology |
|---|---|---|---|---|
| SRX10567995_SRX10567995 | 6 | Trophectoderm | Yanagida_2021 | SMARTSeq2 |
| SRX10567984_SRX10567984 | 6 | Trophectoderm | Yanagida_2021 | SMARTSeq2 |
| SRX10568025_SRX10568025 | 6 | Trophectoderm | Yanagida_2021 | SMARTSeq2 |
| SRX10567983_SRX10567983 | 6 | Trophectoderm | Yanagida_2021 | SMARTSeq2 |
| SRX10567987_SRX10567987 | 7 | Trophectoderm | Yanagida_2021 | SMARTSeq2 |
yanagida.obs = yanagida_reannotation
normalize_smartseq(yanagida, gene_lengths)
SMART-SEQ: Normalization
SMART-SEQ: Common genes 62663
AnnData object with n_obs × n_vars = 228 × 62663
obs: 'day', 'ct', 'experiment', 'technology'
sc.pp.filter_cells(yanagida, min_counts=10)
sc.pp.filter_cells(yanagida, min_genes=10)
yanagida.layers["counts"] = yanagida.X.copy()
sc.pp.normalize_total(yanagida, target_sum=10_000)
sc.pp.log1p(yanagida)
yanagida.raw = yanagida
1.8 Xue 2013
xue_h5ad = sc.read_h5ad("../data/external/human/xue_2013_reprocessed.h5ad")
XUE_URL = 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE44nnn/GSE44183/matrix/GSE44183-GPL11154_series_matrix.txt.gz'
xue_metadata = pd.read_table(XUE_URL, skiprows=36, index_col = 0).T
xue_metadata = xue_metadata[xue_metadata['!Sample_source_name_ch1'].isin(['oocyte','pronucleus','zygote','2-cell blastomere','4-cell blastomere','8-cell blastomere', 'morula'])].copy()
xue_metadata = xue_metadata[['!Sample_geo_accession','!Sample_source_name_ch1']].copy()
reannotate_dict = {
'oocyte': 'Oocyte',
'pronucleus': 'Pronucleus',
'zygote': 'Zygote',
'2-cell blastomere': '2C',
'4-cell blastomere': '4C',
'8-cell blastomere': '8C',
'morula': 'Morula',
}
xue_metadata.replace(reannotate_dict, inplace=True)
xue = xue_h5ad.copy()
xue.obs = xue.obs.loc[:,['sample','sample_alias']].reset_index().merge(xue_metadata, left_on='sample_alias', right_on='!Sample_geo_accession').set_index('index')
This dataset contains an additional pronucleus stage. According to Capmany et al. (1996), pronuclei form on average 8 h post-IVF, i.e. 8/24 ≈ 0.33 days. We therefore annotate these cells as {'EmbryonicDay': 0.33, 'Lineage': 'Pronucleus'}.
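The conversion behind the 0.33 value is simply hours divided by 24. A small sketch of that arithmetic; the helper name `hours_to_embryonic_day` is hypothetical and does not exist in this notebook.

```python
def hours_to_embryonic_day(hours_post_ivf: float) -> float:
    """Convert hours post-IVF to embryonic days, rounded to two decimals."""
    return round(hours_post_ivf / 24, 2)

pronucleus_day = hours_to_embryonic_day(8)   # pronucleus formation, 8 h post-IVF
four_cell_day = hours_to_embryonic_day(48)   # 4C cells collected 48 h post-IVF
print(pronucleus_day, four_cell_day)
```

The second call matches the E2.0 annotation used for the 4C cells collected 48 h post-IVF.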
xue.obs
| index | sample | sample_alias | !Sample_geo_accession | !Sample_source_name_ch1 |
|---|---|---|---|---|
| SRX300891_SRX300891 | SRX300891 | GSM1160130 | GSM1160130 | 8C |
| SRX300889_SRX300889 | SRX300889 | GSM1160128 | GSM1160128 | 8C |
| SRX300873_SRX300873 | SRX300873 | GSM1160112 | GSM1160112 | Oocyte |
| SRX300899_SRX300899 | SRX300899 | GSM1160138 | GSM1160138 | Morula |
| SRX300883_SRX300883 | SRX300883 | GSM1160122 | GSM1160122 | 2C |
| SRX300895_SRX300895 | SRX300895 | GSM1160134 | GSM1160134 | 8C |
| SRX300892_SRX300892 | SRX300892 | GSM1160131 | GSM1160131 | 8C |
| SRX300901_SRX300901 | SRX300901 | GSM1160140 | GSM1160140 | Morula |
| SRX300900_SRX300900 | SRX300900 | GSM1160139 | GSM1160139 | Morula |
| SRX300885_SRX300885 | SRX300885 | GSM1160124 | GSM1160124 | 4C |
| SRX300875_SRX300875 | SRX300875 | GSM1160114 | GSM1160114 | Oocyte |
| SRX300897_SRX300897 | SRX300897 | GSM1160136 | GSM1160136 | 8C |
| SRX300879_SRX300879 | SRX300879 | GSM1160118 | GSM1160118 | Zygote |
| SRX300896_SRX300896 | SRX300896 | GSM1160135 | GSM1160135 | 8C |
| SRX300890_SRX300890 | SRX300890 | GSM1160129 | GSM1160129 | 8C |
| SRX300894_SRX300894 | SRX300894 | GSM1160133 | GSM1160133 | 8C |
| SRX300881_SRX300881 | SRX300881 | GSM1160120 | GSM1160120 | 2C |
| SRX300874_SRX300874 | SRX300874 | GSM1160113 | GSM1160113 | Oocyte |
| SRX300887_SRX300887 | SRX300887 | GSM1160126 | GSM1160126 | 4C |
| SRX300880_SRX300880 | SRX300880 | GSM1160119 | GSM1160119 | Zygote |
| SRX300893_SRX300893 | SRX300893 | GSM1160132 | GSM1160132 | 8C |
| SRX300878_SRX300878 | SRX300878 | GSM1160117 | GSM1160117 | Pronucleus |
| SRX300888_SRX300888 | SRX300888 | GSM1160127 | GSM1160127 | 8C |
| SRX300884_SRX300884 | SRX300884 | GSM1160123 | GSM1160123 | 4C |
| SRX300876_SRX300876 | SRX300876 | GSM1160115 | GSM1160115 | Pronucleus |
| SRX300882_SRX300882 | SRX300882 | GSM1160121 | GSM1160121 | 2C |
| SRX300886_SRX300886 | SRX300886 | GSM1160125 | GSM1160125 | 4C |
| SRX300877_SRX300877 | SRX300877 | GSM1160116 | GSM1160116 | Pronucleus |
xue_reannotation = xue.obs[['!Sample_source_name_ch1', 'sample_alias']].copy()
xue_reannotation.columns = ['Lineage', 'alias']
xue_reannotation
| index | Lineage | alias |
|---|---|---|
| SRX300891_SRX300891 | 8C | GSM1160130 |
| SRX300889_SRX300889 | 8C | GSM1160128 |
| SRX300873_SRX300873 | Oocyte | GSM1160112 |
| SRX300899_SRX300899 | Morula | GSM1160138 |
| SRX300883_SRX300883 | 2C | GSM1160122 |
| SRX300895_SRX300895 | 8C | GSM1160134 |
| SRX300892_SRX300892 | 8C | GSM1160131 |
| SRX300901_SRX300901 | Morula | GSM1160140 |
| SRX300900_SRX300900 | Morula | GSM1160139 |
| SRX300885_SRX300885 | 4C | GSM1160124 |
| SRX300875_SRX300875 | Oocyte | GSM1160114 |
| SRX300897_SRX300897 | 8C | GSM1160136 |
| SRX300879_SRX300879 | Zygote | GSM1160118 |
| SRX300896_SRX300896 | 8C | GSM1160135 |
| SRX300890_SRX300890 | 8C | GSM1160129 |
| SRX300894_SRX300894 | 8C | GSM1160133 |
| SRX300881_SRX300881 | 2C | GSM1160120 |
| SRX300874_SRX300874 | Oocyte | GSM1160113 |
| SRX300887_SRX300887 | 4C | GSM1160126 |
| SRX300880_SRX300880 | Zygote | GSM1160119 |
| SRX300893_SRX300893 | 8C | GSM1160132 |
| SRX300878_SRX300878 | Pronucleus | GSM1160117 |
| SRX300888_SRX300888 | 8C | GSM1160127 |
| SRX300884_SRX300884 | 4C | GSM1160123 |
| SRX300876_SRX300876 | Pronucleus | GSM1160115 |
| SRX300882_SRX300882 | 2C | GSM1160121 |
| SRX300886_SRX300886 | 4C | GSM1160125 |
| SRX300877_SRX300877 | Pronucleus | GSM1160116 |
embryonictime_annotation = {
'Oocyte': 0,
'Pronucleus': 0.33,
'Zygote': 0.75,
'2C': 1.25,
'4C': 2,
'8C':3,
'Morula':4,
}
xue_reannotation['EmbryonicDay'] = xue_reannotation['Lineage'].map(embryonictime_annotation)
xue_reannotation = xue_reannotation[['EmbryonicDay', 'Lineage']]
xue_reannotation.columns = ['day','ct']
xue_reannotation['experiment'] = 'Xue_2013'
xue_reannotation['technology'] = 'Tang2009'
xue_reannotation.head()
| index | day | ct | experiment | technology |
|---|---|---|---|---|
| SRX300891_SRX300891 | 3.00 | 8C | Xue_2013 | Tang2009 |
| SRX300889_SRX300889 | 3.00 | 8C | Xue_2013 | Tang2009 |
| SRX300873_SRX300873 | 0.00 | Oocyte | Xue_2013 | Tang2009 |
| SRX300899_SRX300899 | 4.00 | Morula | Xue_2013 | Tang2009 |
| SRX300883_SRX300883 | 1.25 | 2C | Xue_2013 | Tang2009 |
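`Series.map` with a dict returns NaN for any label that is missing from the dict, so a misspelt stage name would silently yield a cell without a day. A minimal sketch of a check for this, reusing the stage-to-day values from this section; the `Blastocyst` entry is a deliberately unmapped label invented for the example.

```python
import pandas as pd

stage_to_day = {"Oocyte": 0, "Pronucleus": 0.33, "Zygote": 0.75, "2C": 1.25,
                "4C": 2, "8C": 3, "Morula": 4}

# "Blastocyst" has no entry in the mapping and is only here to trigger the check.
stages = pd.Series(["8C", "Pronucleus", "Blastocyst"])
days = stages.map(stage_to_day)

# Collect the labels that did not receive a day.
unmapped = stages[days.isna()].unique().tolist()
print(days.tolist())
print(unmapped)
```

An empty `unmapped` list confirms that every stage label received an embryonic day.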
xue.obs = xue_reannotation
normalize_smartseq(xue, gene_lengths)
SMART-SEQ: Normalization
SMART-SEQ: Common genes 62663
AnnData object with n_obs × n_vars = 28 × 62663
obs: 'day', 'ct', 'experiment', 'technology'
sc.pp.filter_cells(xue, min_counts=10)
sc.pp.filter_cells(xue, min_genes=10)
xue.layers["counts"] = xue.X.copy()
sc.pp.normalize_total(xue, target_sum=10_000)
sc.pp.log1p(xue)
xue.raw = xue
2 Merge Datasets
list_of_datasets = [
meistermann,
petropoulos,
xiang,
yan,
yanagida,
xue,
]
human_adata = ad.concat(list_of_datasets)
human_adata.obs.day = pd.to_numeric(human_adata.obs.day)
human_adata
AnnData object with n_obs × n_vars = 2547 × 62754
obs: 'day', 'ct', 'experiment', 'technology', 'n_counts', 'n_genes'
layers: 'counts'
2.1 Reannotation
2.1.0.1 Concatenated cell type and embryonic day
human_adata.obs['ct_fine'] = human_adata.obs.ct.astype(str) + '_' + human_adata.obs.day.astype(str)
human_adata.obs.loc[human_adata.obs.ct == 'Unknown','ct_fine'] = 'Unknown'
human_adata.obs
| index | day | ct | experiment | technology | n_counts | n_genes | ct_fine |
|---|---|---|---|---|---|---|---|
| ERX3015937_ERX3015937 | 5.00 | Unknown | Meistermann_2021 | SMARTSeq2 | 708313.0 | 5761 | Unknown |
| ERX3015939_ERX3015939 | 5.00 | Unknown | Meistermann_2021 | SMARTSeq2 | 402557.0 | 5689 | Unknown |
| ERX3015940_ERX3015940 | 5.00 | Unknown | Meistermann_2021 | SMARTSeq2 | 511338.0 | 6039 | Unknown |
| ERX3015941_ERX3015941 | 5.00 | Unknown | Meistermann_2021 | SMARTSeq2 | 994383.0 | 8383 | Unknown |
| ERX3015936_ERX3015936 | 5.00 | Unknown | Meistermann_2021 | SMARTSeq2 | 1389486.0 | 7762 | Unknown |
| ... | ... | ... | ... | ... | ... | ... | ... |
| SRX300884_SRX300884 | 2.00 | 4C | Xue_2013 | Tang2009 | 13308292.0 | 14096 | 4C_2.0 |
| SRX300876_SRX300876 | 0.33 | Pronucleus | Xue_2013 | Tang2009 | 16438437.0 | 16542 | Pronucleus_0.33 |
| SRX300882_SRX300882 | 1.25 | 2C | Xue_2013 | Tang2009 | 11549318.0 | 12071 | 2C_1.25 |
| SRX300886_SRX300886 | 2.00 | 4C | Xue_2013 | Tang2009 | 10497600.0 | 7149 | 4C_2.0 |
| SRX300877_SRX300877 | 0.33 | Pronucleus | Xue_2013 | Tang2009 | 13025184.0 | 17779 | Pronucleus_0.33 |
2547 rows × 7 columns
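The `ct_fine` labels in the table above come from plain string concatenation, which is why whole days appear as `2.0` rather than `2`. A minimal sketch on three invented cells:

```python
import pandas as pd

# Three invented cells with a lineage call and a numeric day.
obs = pd.DataFrame(
    {"ct": ["4C", "Epiblast", "Unknown"], "day": [2.0, 6.0, 5.0]}
)

# String concatenation keeps the float formatting of `day`, hence "4C_2.0".
obs["ct_fine"] = obs["ct"].astype(str) + "_" + obs["day"].astype(str)

# Cells without a lineage call are pooled regardless of day.
obs.loc[obs["ct"] == "Unknown", "ct_fine"] = "Unknown"
print(obs["ct_fine"].tolist())
```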
2.1.0.2 Remove Day 12 and Day 14 cells
human_adata = human_adata[human_adata.obs.day < 12].copy()
2.1.0.3 Set 4C and earlier stages as ‘Prelineage’
human_adata.obs.loc[human_adata.obs.day <= 2, 'ct_fine'] = 'Prelineage'
human_adata.obs.ct_fine.value_counts()
Unknown 609
Trophectoderm_7.0 462
Trophectoderm_6.0 403
Trophectoderm_5.0 246
Inner Cell Mass_5.0 87
Epiblast_6.0 76
Trophectoderm_10.0 60
Trophectoderm_9.0 53
Epiblast_7.0 46
Trophectoderm_8.0 46
Prelineage 39
Inner Cell Mass_6.0 33
Primitive Endoderm_7.0 32
8C_3.0 30
Morula_4.0 19
Primitive Endoderm_6.0 19
Inner Cell Mass_7.0 18
Epiblast_10.0 14
Epiblast_8.0 11
Epiblast_9.0 10
Primitive Endoderm_10.0 3
Primitive Endoderm_9.0 3
Primitive Endoderm_8.0 2
Inner Cell Mass_9.0 2
Name: ct_fine, dtype: int64
2.1.0.4 Combine Epiblast E8, E9 and E10 into Late Epiblast
human_adata.obs.loc[(human_adata.obs.day >= 8) & (human_adata.obs.ct == 'Epiblast'),'ct_fine'] = 'Late epiblast'
2.1.0.5 Combine PrE from all days into PrE
human_adata.obs.loc[human_adata.obs.ct == 'Primitive Endoderm','ct_fine'] = 'Primitive Endoderm'
2.1.0.6 Combine all ICM into one category
human_adata.obs.loc[(human_adata.obs.ct == 'Inner Cell Mass'),'ct_fine'] = 'Inner Cell Mass'
2.2 Write out human data
human_adata.obs.ct_fine.value_counts()
Unknown 609
Trophectoderm_7.0 462
Trophectoderm_6.0 403
Trophectoderm_5.0 246
Inner Cell Mass 140
Epiblast_6.0 76
Trophectoderm_10.0 60
Primitive Endoderm 59
Trophectoderm_9.0 53
Epiblast_7.0 46
Trophectoderm_8.0 46
Prelineage 39
Late epiblast 35
8C_3.0 30
Morula_4.0 19
Name: ct_fine, dtype: int64
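The relabelling rules of section 2.1 can be collected into one function and exercised on a toy table. This is a sketch under the assumption that the rules are applied in the order shown in the notebook; the function name `collapse_labels` and the toy rows are invented for illustration.

```python
import pandas as pd

def collapse_labels(obs: pd.DataFrame) -> pd.Series:
    """Apply the relabelling rules of section 2.1, in order, to a copy of ct_fine."""
    fine = obs["ct_fine"].copy()
    fine[obs["day"] <= 2] = "Prelineage"
    fine[(obs["day"] >= 8) & (obs["ct"] == "Epiblast")] = "Late epiblast"
    fine[obs["ct"] == "Primitive Endoderm"] = "Primitive Endoderm"
    fine[obs["ct"] == "Inner Cell Mass"] = "Inner Cell Mass"
    return fine

# Invented cells, one per rule plus one that no rule touches.
toy = pd.DataFrame(
    {
        "ct": ["4C", "Epiblast", "Epiblast", "Primitive Endoderm", "Inner Cell Mass"],
        "day": [2.0, 7.0, 9.0, 8.0, 5.0],
    }
)
toy["ct_fine"] = toy["ct"] + "_" + toy["day"].astype(str)
toy["ct_fine"] = collapse_labels(toy)
print(toy["ct_fine"].tolist())
```

The pooled counts in the output above agree with the per-day counts listed earlier: 87 + 33 + 18 + 2 = 140 Inner Cell Mass, 32 + 19 + 3 + 3 + 2 = 59 Primitive Endoderm, and 14 + 11 + 10 = 35 Late epiblast.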
human_adata.write_h5ad('../data/processed/32_human_adata.h5ad')