
3 Submitting Data in Standard Format

3.1 Standard Metadata for Dataset Contents and Attributes

The SDTMIG provides standard descriptions of some of the most commonly used data domains, with metadata attributes. The descriptive metadata attributes that should be included in a define.xml as applied in the domain models are:

  • The SDTMIG-standard variable name (standardized for all submissions, even though sponsors may be using other variable names internally in their operational database)
  • The SDTMIG-standard variable label
  • Expected data types (the SDTMIG uses character or numeric to conform to the data types consistent with SAS V5 transport file format, but define.xml allows for more descriptive data types, such as integer or float)
  • The actual controlled terms and formats used by the sponsor (do not include the asterisk (*) included in the CDISC domain models to indicate when controlled terminology applies)
  • The origin or source of the data (e.g., CRF, derived; see definitions in Section 4: 4.1.1.8, Origin Metadata)
  • The role of the variable in the dataset, corresponding to its role in the SDTM, if desired. Since these roles are predefined for all standard domains that follow the general observation classes, sponsors do not need to specify them in their define.xml for these domains.
  • Any Comments provided by the sponsor that may be useful to the Reviewer in understanding the variable or the data in it.
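
As a rough illustration, the attributes listed above could be held in a small record type before being written to define.xml. The sketch below is Python; the class name `VariableMetadata`, its field names, and the `check_variable_metadata` helper are hypothetical and are not part of the SDTMIG or the Define-XML specification. The only rules it checks are that a name and label are present and that the asterisk from the CDISC domain models has not been carried into the sponsor's controlled terms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariableMetadata:
    """One variable-level metadata entry (hypothetical field names)."""
    name: str                               # SDTMIG-standard variable name
    label: str                              # SDTMIG-standard variable label
    data_type: str                          # e.g. "text", "integer", "float"
    controlled_terms: Optional[str] = None  # sponsor's actual terms or format
    origin: Optional[str] = None            # e.g. "CRF", "Derived"
    role: Optional[str] = None              # optional for standard domains
    comment: Optional[str] = None           # free-text note for the reviewer


def check_variable_metadata(meta: VariableMetadata) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not meta.name or not meta.label:
        problems.append("name and label must both be provided")
    if meta.controlled_terms and meta.controlled_terms.strip().startswith("*"):
        problems.append("controlled terms must not include the model's asterisk")
    return problems


# Illustrative entry only; the origin shown is an assumption for this example.
example = VariableMetadata(
    name="AETERM",
    label="Reported Term for the Adverse Event",
    data_type="text",
    origin="CRF",
)
print(check_variable_metadata(example))
```

Role is left optional here because, as noted above, it does not need to be specified for standard domains.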

In addition to these metadata attributes, the CDISC domain models include three other shaded columns that are not sent to the FDA. These columns assist sponsors in preparing their datasets:

  • "CDISC Notes" is for notes to the sponsor regarding the use of each variable
  • "Core" indicates how a variable is classified as a CDISC Core Variable [see Section 4: 4.1.1.5, CDISC Core Variables]
  • "References" provides references to the relevant section of the SDTM or the SDTMIG.

The domain models in Section 6.1 – Interventions, Section 6.2 – Events, Section 6.3 – Findings, and Section 6.4 – FA Domain illustrate how to apply the SDTM when creating a specific domain dataset. In particular, these models illustrate the selection of a subset of the variables offered in one of the general observation classes along with applicable timing variables. The models also show how a standard variable from a general observation class should be adjusted to meet the specific content needs of a particular domain, including making the label more meaningful, specifying controlled terminology, and creating domain-specific notes and examples. Thus the domain models not only demonstrate how to apply the model for the most common domains, but also give insight into how to apply general model concepts to other domains not yet defined by CDISC.

3.2 Using the CDISC Domain Models in Regulatory Submissions — Dataset Metadata

The define.xml that accompanies a submission should also describe each dataset that is included in the submission and describe the natural key structure of each dataset. While most studies will include DM and a set of safety domains based on the three general observation classes (typically including EX, CM, AE, DS, MH, IE, LB, and VS), the actual choice of which data to submit will depend on the protocol and the needs of the regulatory reviewer. Dataset definition metadata should include dataset filenames, descriptions, locations, structures, class, purpose, keys, and comments as described below in Table 3.2.1.

In the event that no records are present in a dataset (e.g., a small PK study where no subjects took concomitant medications), the empty dataset should not be submitted and should not be described in the define.xml document. The annotated CRF will show the data that would have been submitted had data been received; it need not be re-annotated to indicate that no records exist.
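
The rule on empty datasets amounts to a simple filter applied before the submission package and its define.xml are assembled. A minimal Python sketch follows; `datasets_to_submit` is a hypothetical helper, and representing each dataset as a list of record dictionaries is an assumption made only for this example.

```python
from typing import Dict, List


def datasets_to_submit(datasets: Dict[str, List[dict]]) -> List[str]:
    """Return the names of datasets that contain at least one record.

    Datasets with no records are left out, so they are neither
    submitted nor described in the define.xml.
    """
    return sorted(name for name, records in datasets.items() if records)


# Illustrative study in which no subject took concomitant medications.
study = {
    "DM": [{"USUBJID": "S1-001"}],
    "PC": [{"USUBJID": "S1-001", "PCTESTCD": "DRUGX"}],
    "CM": [],
}
print(datasets_to_submit(study))
```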

3.2.1 Table 3.2.1 SDTM Submission Dataset-Definition Metadata Example

Dataset | Description | Class | Structure | Purpose | Keys* | Location
--- | --- | --- | --- | --- | --- | ---
DM | Demographics | Special Purpose Domains | One record per subject | Tabulation | STUDYID, USUBJID | dm.xpt
CO | Comments | Special Purpose Domains | One record per comment per subject | Tabulation | STUDYID, USUBJID, COSEQ | co.xpt
SE | Subject Elements | Special Purpose Domains | One record per actual Element per subject | Tabulation | STUDYID, USUBJID, ETCD, SESTDTC | se.xpt
SV | Subject Visits | Special Purpose Domains | One record per actual visit per subject | Tabulation | STUDYID, USUBJID, VISITNUM | sv.xpt
CM | Concomitant Medications | Interventions | One record per recorded medication occurrence or constant-dosing interval per subject | Tabulation | STUDYID, USUBJID, CMTRT, CMSTDTC | cm.xpt
EX | Exposure | Interventions | One record per constant dosing interval per subject | Tabulation | STUDYID, USUBJID, EXTRT, EXSTDTC | ex.xpt
SU | Substance Use | Interventions | One record per substance type per reported occurrence per subject | Tabulation | STUDYID, USUBJID, SUTRT, SUSTDTC | su.xpt
AE | Adverse Events | Events | One record per adverse event per subject | Tabulation | STUDYID, USUBJID, AEDECOD, AESTDTC | ae.xpt
DS | Disposition | Events | One record per disposition status or protocol milestone per subject | Tabulation | STUDYID, USUBJID, DSDECOD, DSSTDTC | ds.xpt
MH | Medical History | Events | One record per medical history event per subject | Tabulation | STUDYID, USUBJID, MHDECOD | mh.xpt
DV | Protocol Deviations | Events | One record per protocol deviation per subject | Tabulation | STUDYID, USUBJID, DVTERM, DVSTDTC | dv.xpt
CE | Clinical Events | Events | One record per event per subject | Tabulation | STUDYID, USUBJID, CETERM, CESTDTC | ce.xpt
EG | ECG Test Results | Findings | One record per ECG observation per time point per visit per subject | Tabulation | STUDYID, USUBJID, EGTESTCD, VISITNUM, EGTPTREF, EGTPTNUM | eg.xpt
IE | Inclusion/Exclusion Criteria Not Met | Findings | One record per inclusion/exclusion criterion not met per subject | Tabulation | STUDYID, USUBJID, IETESTCD | ie.xpt
LB | Laboratory Test Results | Findings | One record per analyte per planned time point number per time point reference per visit per subject | Tabulation | STUDYID, USUBJID, LBTESTCD, LBSPEC, VISITNUM, LBTPTREF, LBTPTNUM | lb.xpt
PE | Physical Examination | Findings | One record per body system or abnormality per visit per subject | Tabulation | STUDYID, USUBJID, PETESTCD, VISITNUM | pe.xpt
QS | Questionnaire | Findings | One record per questionnaire per question per time point per visit per subject | Tabulation | STUDYID, USUBJID, QSCAT, QSTESTCD, VISITNUM, QSTPTREF, QSTPTNUM | qs.xpt
SC | Subject Characteristics | Findings | One record per characteristic per subject | Tabulation | STUDYID, USUBJID, SCTESTCD | sc.xpt
VS | Vital Signs | Findings | One record per vital sign measurement per time point per visit per subject | Tabulation | STUDYID, USUBJID, VSTESTCD, VISITNUM, VSTPTREF, VSTPTNUM | vs.xpt
DA | Drug Accountability | Findings | One record per drug accountability finding per subject | Tabulation | STUDYID, USUBJID, DATESTCD, DADTC | da.xpt
MB | Microbiology Specimen | Findings | One record per microbiology specimen finding per time point per visit per subject | Tabulation | STUDYID, USUBJID, MBTESTCD, VISITNUM, MBTPTREF, MBTPTNUM | mb.xpt
MS | Microbiology Susceptibility | Findings | One record per microbiology susceptibility test (or other organism-related finding) per organism found in MB | Tabulation | STUDYID, USUBJID, MSTESTCD, VISITNUM, MSTPTREF, MSTPTNUM | ms.xpt
PC | Pharmacokinetic Concentrations | Findings | One record per analyte per planned time point number per time point reference per visit per subject | Tabulation | STUDYID, USUBJID, PCTESTCD, VISITNUM, PCTPTREF, PCTPTNUM | pc.xpt
PP | Pharmacokinetic Parameters | Findings | One record per PK parameter per time-concentration profile per modeling method per subject | Tabulation | STUDYID, USUBJID, PPTESTCD, PPCAT, VISITNUM, PPTPTREF | pp.xpt
FA | Findings About Events or Interventions | Findings | One record per finding per object per time point per time point reference per visit per subject | Tabulation | STUDYID, USUBJID, FATESTCD, FAOBJ, VISITNUM, FATPTREF, FATPTNUM | fa.xpt
TA | Trial Arms | Trial Design | One record per planned Element per Arm | Tabulation | STUDYID, ARMCD, TAETORD | ta.xpt
TE | Trial Elements | Trial Design | One record per planned Element | Tabulation | STUDYID, ETCD | te.xpt
TV | Trial Visits | Trial Design | One record per planned Visit per Arm | Tabulation | STUDYID, VISITNUM, ARMCD | tv.xpt
TI | Trial Inclusion/Exclusion Criteria | Trial Design | One record per I/E criterion | Tabulation | STUDYID, IETESTCD | ti.xpt
TS | Trial Summary | Trial Design | One record per trial summary parameter value | Tabulation | STUDYID, TSPARMCD, TSSEQ | ts.xpt
RELREC | Related Records | Special Purpose Datasets | One record per related record, group of records or datasets | Tabulation | STUDYID, RDOMAIN, USUBJID, IDVAR, IDVARVAL, RELID | relrec.xpt
SUPP--** | Supplemental Qualifiers for [domain name] | Special-Purpose Datasets | One record per IDVAR, IDVARVAL, and QNAM value per subject | Tabulation | STUDYID, RDOMAIN, USUBJID, IDVAR, IDVARVAL, QNAM | supp--.xpt or suppqual.xpt

* Note that the key variables shown in this table are examples only. A sponsor's actual key structure may be different.

** Separate Supplemental Qualifier datasets of the form supp--.xpt are recommended. See Section 8: 8.4, Relating Non-Standard Variables Values To A Parent Domain.

3.2.1.1 Primary Keys

Table 3.2.1, SDTM Submission Dataset-Definition Metadata Example, above shows examples of the variables a sponsor might submit as the primary key for SDTM datasets. Since the purpose of the Keys column is to aid reviewers in understanding the structure of a dataset, sponsors should list all of the natural keys (see definition below) for the dataset. These keys should define uniqueness for records within a dataset, and may define a record sort order. The naming of these keys should be consistent with the description of the structure in the Structure column. For all the general-observation-class domains (and for some special-purpose domains), the --SEQ variable was created so that a unique record could be identified consistently across all of these domains via its use, along with STUDYID, USUBJID, and DOMAIN. In most domains, --SEQ will be a surrogate key (see definition below) for a set of variables that comprise the natural key. In certain instances, a Supplemental Qualifier (SUPP--) variable might also contribute to the natural key of a record for a particular domain. See Section 4: 4.1.1.9, Assigning Natural Keys In The Metadata, for how this should be represented, and for additional information on keys.

A natural key is a piece of data (one or more columns of an entity) that uniquely identifies that entity and distinguishes it from any other row in the table. The advantage of natural keys is that they exist already, and one does not need to introduce a new "unnatural" value to the data schema. One of the difficulties in choosing a natural key is that just about any natural key one can think of has the potential to change. Because they have business meaning, natural keys are effectively coupled to the business, and they may need to be reworked when business requirements change. An example of such a change in clinical trials data would be the addition of a position or location that becomes a key in a new study, but wasn't collected in previous studies.

A surrogate key is a single-part, artificially established identifier for a record. Surrogate key assignment is a special case of derived data, one where a portion of the primary key is derived. A surrogate key is immune to changes in business needs. In addition, the key depends on only one field, so it's compact. A common way of deriving surrogate key values is to assign integer values sequentially. The --SEQ variable in the SDTM datasets is an example of a surrogate key for most datasets; in some instances, however, --SEQ might be a part of a natural key as a replacement for what might have been a key (e.g., a repeat sequence number) in the sponsor's database.
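
The two kinds of key can be illustrated with a short Python sketch. It is not taken from the SDTMIG: `duplicate_keys` and `assign_seq` are hypothetical helpers, records are assumed to be plain dictionaries with populated key values, and numbering --SEQ from 1 within each USUBJID is an assumption chosen to match the statement that --SEQ identifies a record together with STUDYID, USUBJID, and DOMAIN.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def duplicate_keys(records: List[Dict], key_vars: Sequence[str]) -> List[Tuple]:
    """Return natural-key values shared by more than one record.

    An empty result means the listed variables define uniqueness.
    """
    counts = Counter(tuple(rec.get(v) for v in key_vars) for rec in records)
    return [key for key, n in counts.items() if n > 1]


def assign_seq(records: List[Dict], seq_var: str,
               sort_vars: Sequence[str]) -> List[Dict]:
    """Derive a surrogate key by numbering records sequentially per subject.

    Records are sorted by USUBJID and then by sort_vars; the input
    list and its dictionaries are left unmodified.
    """
    ordered = sorted(
        records,
        key=lambda r: (r["USUBJID"],) + tuple(r[v] for v in sort_vars),
    )
    last_seq: Dict[str, int] = {}
    numbered = []
    for rec in ordered:
        n = last_seq.get(rec["USUBJID"], 0) + 1
        last_seq[rec["USUBJID"]] = n
        numbered.append({**rec, seq_var: n})
    return numbered
```

For an AE dataset keyed as in Table 3.2.1, `duplicate_keys(ae, ["STUDYID", "USUBJID", "AEDECOD", "AESTDTC"])` would be expected to return an empty list, and `assign_seq(ae, "AESEQ", ["AEDECOD", "AESTDTC"])` would then supply the surrogate key.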

3.2.1.2 CDISC Submission Value-Level Metadata

In general, the CDISC V3.x Findings data models are closely related to normalized, relational data models in a vertical structure of one record per observation. Since the V3.x data structures are fixed, sometimes information that might have appeared as columns in a more horizontal (denormalized) structure in presentations and reports will instead be represented as rows in an SDTM Findings structure. Because many different types of observations are all presented in the same structure, there is a need to provide additional metadata to describe the expected differences between, for example, hematology lab results and serum chemistry lab results in terms of data type, standard units, and other attributes.

For example, the Vital Signs data domain could contain subject records related to diastolic and systolic blood pressure, height, weight, and body mass index (BMI). These data are all submitted in the normalized SDTM Findings structure of one row per vital signs measurement. This means that there could be five records per subject (one for each test or measurement) for a single visit or time point, with the parameter names stored in the Test Code/Name variables, and the parameter values stored in result variables. Since the unique Test Code/Names could have different attributes (i.e., different origins, roles, and definitions) there would be a need to provide value-level metadata for this information.
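
The Vital Signs example can be sketched in Python as follows. The sponsor column names (`sysbp`, `diabp`, and so on), the units, and the origins in `VS_TESTS` are illustrative assumptions, and `to_vertical` and `value_level_metadata` are hypothetical helpers; only the one-row-per-measurement shape and the idea of per-test attributes come from the text.

```python
from typing import Dict, List

# Hypothetical mapping from sponsor column names to
# (VSTESTCD, VSTEST, unit, origin). Values are illustrative only.
VS_TESTS = {
    "sysbp": ("SYSBP", "Systolic Blood Pressure", "mmHg", "CRF"),
    "diabp": ("DIABP", "Diastolic Blood Pressure", "mmHg", "CRF"),
    "height": ("HEIGHT", "Height", "cm", "CRF"),
    "weight": ("WEIGHT", "Weight", "kg", "CRF"),
    "bmi": ("BMI", "Body Mass Index", "kg/m2", "Derived"),
}


def to_vertical(row: Dict[str, object]) -> List[Dict[str, object]]:
    """Turn one horizontal row into one record per vital signs measurement."""
    records = []
    for column, (testcd, test, unit, _origin) in VS_TESTS.items():
        if row.get(column) is None:
            continue  # nothing collected for this test at this visit
        records.append({
            "USUBJID": row["USUBJID"],
            "VISITNUM": row["VISITNUM"],
            "VSTESTCD": testcd,
            "VSTEST": test,
            "VSORRES": str(row[column]),
            "VSORRESU": unit,
        })
    return records


def value_level_metadata() -> Dict[str, Dict[str, str]]:
    """Describe the attributes that differ from one test code to the next."""
    return {
        testcd: {"label": test, "unit": unit, "origin": origin}
        for testcd, test, unit, origin in VS_TESTS.values()
    }
```

A single visit with all five measurements yields five records, while the per-test differences, such as BMI being derived rather than collected, are carried by the value-level metadata.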

The value-level metadata should be provided as a separate section of the Case Report Tabulation Data Definition Specification (Define-XML). This information, which historically has been submitted as a PDF document named "define.pdf", should henceforth be submitted in an XML format. For details on the CDISC specification for submitting define.xml, see http://www.cdisc.org/define-xml

3.2.2 Conformance

Conformance with the SDTMIG Domain Models is minimally indicated by:

  • Following the complete metadata structure for data domains
  • Following SDTMIG domain models wherever applicable
  • Using SDTM-specified standard domain names and prefixes where applicable
  • Using SDTM-specified standard variable names
  • Using SDTM-specified variable labels for all standard domains
  • Using SDTM-specified data types for all variables
  • Following SDTM-specified controlled terminology and format guidelines for variables, when provided
  • Including all collected and relevant derived data in one of the standard domains, special-purpose datasets, or general-observation-class structures
  • Including all Required and Expected variables as columns in standard domains, and ensuring that all Required variables are populated
  • Ensuring that each record in a dataset includes the appropriate Identifier and Timing variables, as well as a Topic variable
  • Conforming to all business rules described in the CDISC Notes column and general and domain-specific assumptions.
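
Some of these criteria lend themselves to an automated check. The Python sketch below covers only one of them, namely that Required and Expected variables are present as columns and that Required variables are populated. `check_core` is a hypothetical helper, the Required and Expected lists must be supplied by the caller from the relevant domain model, and treating `None` or an empty string as "not populated" is an assumption.

```python
from typing import Dict, List, Sequence


def check_core(records: List[Dict], columns: Sequence[str],
               required: Sequence[str], expected: Sequence[str]) -> List[str]:
    """Return findings for missing columns and unpopulated Required values."""
    findings = []
    for var in required:
        if var not in columns:
            findings.append(f"{var}: Required variable is not a column")
    for var in expected:
        if var not in columns:
            findings.append(f"{var}: Expected variable is not a column")
    for var in required:
        if var not in columns:
            continue  # already reported above
        for i, rec in enumerate(records, start=1):
            if rec.get(var) in (None, ""):
                findings.append(
                    f"{var}: Required variable not populated in record {i}"
                )
    return findings
```

An empty result from such a check says nothing about the remaining criteria, such as controlled terminology or the business rules in the CDISC Notes column.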