Sharing data from the Human Tumor Atlas Network through standards, infrastructure and community engagement

The HTAN data-submission process

The DCC has developed a standardized data-submission process (Fig. 3a). The process begins when a data curator or scientist at an HTAN Center uploads their data to cloud buckets connected to Synapse. Once the data are uploaded, the submitter provides metadata for each file, including how the file was processed and which research participant and biospecimen it derives from. These metadata are critical for data access and reuse. Metadata are submitted through the Data Curator App (DCA) (Fig. 3b), which creates a metadata template on the basis of the data model, validates the provided metadata against the data model, and uploads the result to Synapse. Centers also have the option of submitting a filled metadata template describing individual publications and all data associated with a publication.
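
As a concrete illustration of the first step, the following sketch uploads a single assay file to a center's Synapse project with the Synapse Python client; the folder ID and file name are hypothetical placeholders.

```python
# Minimal sketch of the upload step, assuming a hypothetical folder ID.
import synapseclient

syn = synapseclient.Synapse()
syn.login()  # uses credentials from ~/.synapseConfig or a personal access token

# Upload a FASTQ file into the center's top-level scRNA-seq folder;
# "syn00000001" is a placeholder for the real folder ID.
entity = synapseclient.File(
    path="sample_001_R1.fastq.gz",
    parent="syn00000001",
)
entity = syn.store(entity)
print(f"Uploaded as {entity.id}; file-level metadata is supplied separately via the DCA.")
```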

Fig. 3: HTAN data submission and release process.

a, An HTAN data curator or scientist uploads data to AWS, Google Cloud or Synapse, provides metadata about each file, and confirms metadata validation. The DCC performs additional quality-control checks and releases data to the public. b, The DCA performs metadata validation. c, The HTAN Dashboard performs additional quality-control data checks and checks for overall data completeness. d, The DCC releases the data to the public.

After metadata are submitted, a second set of validation checks runs automatically. These checks examine the HTAN Center’s dataset as a whole, verify that all assay data can be linked to parent biospecimens and research participants, and assess the data for overall completeness. The results of these checks are made available through the HTAN Dashboard (Fig. 3c).
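
A minimal sketch of the kind of link checking described here, assuming simplified tables and column names rather than the actual HTAN metadata fields:

```python
# Illustrative dataset-level link checks: every assay file should reference a
# known biospecimen, and every biospecimen a known research participant.
# Column names are simplified assumptions, not the exact HTAN fields.
import pandas as pd

files = pd.read_csv("assay_files.csv")          # has a "biospecimen_id" column
biospecimens = pd.read_csv("biospecimens.csv")  # has "biospecimen_id", "participant_id"
participants = pd.read_csv("participants.csv")  # has a "participant_id" column

orphan_files = files[~files["biospecimen_id"].isin(biospecimens["biospecimen_id"])]
orphan_specimens = biospecimens[
    ~biospecimens["participant_id"].isin(participants["participant_id"])
]

for df, label in [(orphan_files, "files without a parent biospecimen"),
                  (orphan_specimens, "biospecimens without a parent participant")]:
    print(f"{len(df)} {label}")
```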

After a new data submission, HTAN DCC members review the HTAN Dashboard and relay validation issues to the data submitters at the respective HTAN Center. This feedback cycle continues until all validation errors are resolved. Once both the DCC and the center sign off, all files intended for release are queued, and an HTAN Portal preview instance is generated containing all data for the next release. After a final manual check, all release data are deployed to the public HTAN Portal. Higher-level processed data are made publicly available on Synapse. Lower-level access-controlled data are submitted to the CRDC6,7, where they are made available in subsequent CRDC releases. In parallel, data are also submitted to other platforms, including CellxGene14, cBioPortal11,12,13 and ISB-CGC8, each with its own release cycle. A future goal is to automate this broader dissemination process.

Setting deadlines for major data releases helps incentivize centers to submit data in a timely manner. Major releases are completed twice per year, with minor releases on an as-needed basis. A complete log of data releases is maintained on the HTAN Portal. Although HTAN aims to release data upon generation, in practice we have found that most centers submit data closer to manuscript submission, driven by publishers’ data-access requirements and the desire to ensure high data quality before release.

Synapse

Sage Bionetworks uses its data-management platform, Synapse (RRID SCR_006307), as the central repository for the HTAN DCC. Each HTAN Center has a dedicated Synapse project, providing a secure environment for uploading, organizing and annotating data and metadata before public release. Synapse streamlines this process through multiple features, including wikis, entity annotations, tabular annotation views for file exploration, and finely tuned access control settings, creating a user- and machine-friendly data-management ecosystem.
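
For example, entity annotations can be set programmatically with the Synapse Python client; the entity ID and annotation keys below are illustrative placeholders.

```python
# Sketch of annotating an uploaded file with key/value metadata, one of the
# Synapse features mentioned above. Entity ID and keys are placeholders.
import synapseclient

syn = synapseclient.Synapse()
syn.login()

annotations = syn.get_annotations("syn00000002")  # placeholder entity ID
annotations["assay"] = "scRNA-seq"
annotations["dataLevel"] = 1
syn.set_annotations(annotations)
```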

Project access on Synapse is regulated through team membership, with adjustable permission levels to ensure appropriate access for both data contributors and DCC staff. Moreover, HTAN’s Synapse projects integrate with external storage solutions, such as AWS S3 and Google Cloud Storage, allowing centers to choose their preferred storage provider, which can minimize egress costs. This is particularly advantageous for contributors who already have data stored with these providers. Objects added directly to these buckets can be synchronized into Synapse using serverless architectures, such as AWS Lambda and Google Cloud Functions (see the sketch below). This integration enables efficient data uploads through cloud-provider clients while preserving the user-friendly experience of Synapse’s web UI, command-line interface and language-specific clients in Python and R. For HTAN, the only folder-structure requirement for each center is that all submissions are grouped into top-level folders categorized by data type, such as scRNA-seq FASTQ files, imaging OME-TIFFs or demographic information. File naming is minimally restrictive because essential information is captured in the metadata rather than the file names themselves.
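
A simplified sketch of the serverless synchronization pattern, assuming an AWS Lambda function triggered by S3 object-creation events; register_in_synapse is a hypothetical stand-in for the actual Synapse registration logic.

```python
# Simplified sketch: a Lambda function triggered by an S3 "ObjectCreated"
# event registers the new object with Synapse. The real integration is more
# involved (credentials, storage-location setup, error handling).
import urllib.parse


def register_in_synapse(bucket: str, key: str) -> None:
    """Hypothetical helper standing in for Synapse file-handle registration."""
    print(f"Registering s3://{bucket}/{key} in Synapse")


def lambda_handler(event, context):
    # Each record in the event describes one newly created S3 object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        register_in_synapse(bucket, key)
    return {"statusCode": 200}
```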

Data Curator App

The DCA (Fig. 3b), hosted on AWS Fargate, enables data submitters to associate metadata with their assay data files through a wizard-style interface in the browser. The application backend leverages a Python tool, Schematic, to validate the metadata files against the HTAN data standards and submit data to Synapse. Both DCA and Schematic were developed to support multiple data-coordination projects at Sage Bionetworks. The separation of UI (DCA) and programmatic schema validation logic (Schematic) simplifies the reuse of these tools across different projects.

In the metadata submission wizard, data contributors select a template (for example, metadata for clinical demographics or level 1 scRNA-seq). A Google Sheets link is generated, allowing users to fill out the metadata template directly online using Google Sheets’ functionality. The Google Sheets template includes checks for the correctness of particular columns. If preferred, the sheet can also be exported as a delimited text file or Excel spreadsheet. Should a specific template be unavailable, a minimal metadata template is used, and submitters can contact a DCC liaison for further guidance. After completing the template, users submit it; the DCA then uses Schematic to perform an additional schema-correctness check and submits the metadata to Synapse. The DCA also allows existing metadata to be updated, accommodating corrections, compliance adjustments or additions for new files.
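
The following sketch illustrates the kind of column-level checks a filled template undergoes; it is a simplified stand-in, not Schematic's actual validation API, and the column names and allowed values are assumptions.

```python
# Illustrative column-level manifest checks (not Schematic's actual API).
import pandas as pd

REQUIRED_COLUMNS = {"HTAN_Participant_ID", "Age_at_Diagnosis", "Gender"}
ALLOWED_GENDER = {"female", "male", "unknown"}  # assumed controlled vocabulary

manifest = pd.read_csv("demographics_manifest.csv")

errors = []
missing = REQUIRED_COLUMNS - set(manifest.columns)
if missing:
    errors.append(f"missing required columns: {sorted(missing)}")
else:
    bad = manifest.loc[~manifest["Gender"].isin(ALLOWED_GENDER), "HTAN_Participant_ID"]
    errors.extend(f"invalid Gender value for participant {pid}" for pid in bad)

print("valid" if not errors else "\n".join(errors))
```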

HTAN Dashboard

The HTAN Dashboard (Fig. 3c) is a web application developed to help data submitters across the HTAN Centers and the DCC track submitted data and associated metadata. For each HTAN Center, the dashboard performs various checks, including tracing and validating all links from files to samples to research participants and ensuring that HTAN ID numbers adhere to specified guidelines. It also calculates a metadata completeness score: the proportion of metadata fields with supplied values relative to fields left empty. The dashboard provides summary statistics, including file counts and sizes per atlas and the number of remaining data-submission errors. The HTAN Dashboard is written in Python and uses the Synapse client to programmatically retrieve each center’s metadata and file counts.
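
A minimal sketch of one plausible completeness calculation, the fraction of non-empty cells in a metadata table; the actual HTAN scoring rules may differ.

```python
# Sketch of a metadata completeness score: supplied values over total fields.
import pandas as pd

metadata = pd.read_csv("center_metadata.csv")  # placeholder file name

supplied = int(metadata.notna().to_numpy().sum())  # non-empty cells
total = metadata.size                              # all cells
completeness = supplied / total if total else 0.0
print(f"Metadata completeness: {completeness:.1%} ({supplied}/{total} fields)")
```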

Image visualization on the HTAN Portal

HTAN Centers generate imaging data using a broad array of multiplex imaging assays. To enable initial visualization and exploration of these data directly on the HTAN Portal, we deployed narrative guides using Minerva, a lightweight tool suite for interactive viewing and fast sharing of large image data9. Minerva supports extensively curated interactive guides with manual channel thresholds, waypoints and regions of interest; to produce good first defaults at scale, we implemented automatic channel thresholding and grouping, enabling rapid generation of prerendered Minerva stories that can later be enhanced with interactive channel selection and embedded metadata. To aid recognition and recall of images and their tissue features, we developed Miniature, a new approach for creating informative and visually appealing thumbnails from multiplexed tissue images.
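
As an illustration of automatic channel thresholding, the sketch below derives per-channel rendering ranges from robust intensity percentiles; Minerva's actual defaults algorithm may differ.

```python
# Illustrative percentile-based channel defaults for a multiplexed image.
# Assumes a channels-first OME-TIFF readable with tifffile.
import numpy as np
import tifffile

image = tifffile.imread("multiplex.ome.tif")  # shape: (channels, height, width)

defaults = []
for c in range(image.shape[0]):
    channel = image[c]
    # Clip the rendering range to robust percentiles to suppress hot pixels.
    lo, hi = np.percentile(channel, [1.0, 99.5])
    defaults.append({"channel": c, "min": float(lo), "max": float(hi)})

for d in defaults:
    print(d)
```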

HTAN data in CZ CellxGene Discover

Single-cell sequencing data are submitted to CZ CellxGene Discover. The platform enables users to find, explore, visualize and analyze published datasets. To ensure integration with other single-cell datasets, HTAN data are harmonized to adhere to the CellxGene schema and data-format requirements. The HTAN data-ingestion workflow already captures much of the required information, including raw counts, normalized counts, demographics (for example, age, sex and ethnicity), assay type, tissue site, disease type and embeddings (for example, uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE)). A key additional requirement is the annotation of cell types with terms from the Cell Ontology (CL), which is currently performed by manually mapping the annotations (cell phenotypes) provided by data contributors to the closest CL terms. For example, there was no term for lymphomyeloid primed progenitor-like blasts22; instead, hematopoietic multipotent progenitor cell (CL_0000837) was selected. Mapping precancer and cancer cells posed a challenge because CL is largely based on classifications of normal cells; cancer cells are therefore annotated according to the presumed healthy cell type from which they originated. For cases in which no appropriate cell-type term is available, the most relevant parent ontology term is used. The CL version is 2024-04-05, per CellxGene’s v1.0.5 schema requirements. We have curated 17 HTAN datasets for CellxGene. In general, we found that data submitters are willing to do this additional work to facilitate the reuse of their data. We plan to provide cell-type annotations for all HTAN single-cell data submissions in the future, manually or through automated pipelines, and to reannotate them as CL’s coverage and quality improve.
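
For illustration, the sketch below assembles a small AnnData object of the general shape CellxGene expects, with raw counts, an ontology-coded cell-type column and a placeholder embedding; consult the CellxGene schema for the authoritative field requirements.

```python
# Minimal sketch of packaging a harmonized dataset as an AnnData object.
# Field values are illustrative, not a complete CellxGene-valid dataset.
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

counts = csr_matrix(np.random.poisson(1.0, size=(5, 3)).astype(np.float32))

obs = pd.DataFrame(
    {
        # Cell Ontology term chosen by manual mapping, as described above.
        "cell_type_ontology_term_id": ["CL:0000837"] * 5,
        "sex": ["female"] * 5,
    },
    index=[f"cell_{i}" for i in range(5)],
)

adata = ad.AnnData(X=counts, obs=obs)
adata.layers["counts"] = counts              # raw counts preserved
adata.obsm["X_umap"] = np.random.rand(5, 2)  # placeholder UMAP embedding
adata.write_h5ad("htan_dataset.h5ad")
```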

Integration with the Cancer Research Data Commons

HTAN data ingress and standardization processes are integrated with the Cancer Research Data Commons (CRDC) ecosystem, with multiple services supporting HTAN data download, queries and processing. Specifically, the Cancer Data Service (CDS) provides access to HTAN controlled-access sequence and imaging files; the Seven Bridges Cancer Genomics Cloud (SB-CGC) provides mechanisms to run a variety of processing workflows on HTAN data held at CDS; and the Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC) contains HTAN tabular metadata and assay data for flexible queries.

HTAN imaging data are available through CDS in original contributed formats, including OME-TIFF and SVS files. Preserving contributor-provided formats facilitates both reproducibility of published studies and interoperability with common processing and visualization tools, including processing suites, such as MCMICRO23, and analysis tools, such as Napari24 and QuPath25. A subset of HTAN imaging data has been deposited in the NCI’s Imaging Data Commons26, where data have been converted to DICOM27 to provide interoperability with other medical imaging datasets and tools.
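
For example, a contributed OME-TIFF can be inspected directly with the tifffile Python library; the file name below is a placeholder.

```python
# Brief sketch of inspecting a contributed OME-TIFF with tifffile.
import tifffile

with tifffile.TiffFile("contributed_image.ome.tif") as tif:
    print(tif.series[0].shape)              # pixel-array dimensions
    print((tif.ome_metadata or "")[:200])   # OME-XML header (channels, units, ...)
```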

The NCI’s cloud resources allow processing of HTAN data on the cloud. For example, SB-CGC28 facilitates selection and processing of HTAN scRNA-seq read-level files, image data files and read-level spatial transcriptomic data. Within ISB-CGC8, HTAN data are available as Google BigQuery tables, allowing flexible SQL query access. More than 850 assay files are queryable through Google BigQuery, comprising data from level 4 imaging and level 4 scRNA-seq assays and collectively spanning more than 200 million cells across spatial and single-cell datasets. Computational notebooks are provided to illustrate cloud-based querying and processing of HTAN data.
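
A hedged example of such a query using the BigQuery Python client; the table name is an illustrative placeholder rather than an actual HTAN table.

```python
# Sketch of querying HTAN tables in ISB-CGC via BigQuery from Python.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

query = """
    SELECT HTAN_Participant_ID, COUNT(*) AS n_cells
    FROM `isb-cgc-bq.HTAN.some_scRNAseq_level4_table`  -- placeholder table name
    GROUP BY HTAN_Participant_ID
    ORDER BY n_cells DESC
    LIMIT 10
"""
df = client.query(query).to_dataframe()
print(df)
```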
