3 Submitting Data in Standard Format

3.1 Standard Metadata for Dataset Contents and Attributes

The SDTMIG provides standard descriptions of some of the most commonly used data domains, with metadata attributes. The descriptive metadata attributes that should be included in a define.xml as applied in the domain models are:

The SDTMIG-standard variable name (standardized for all submissions, even though sponsors may be using other variable names internally in their operational database)
The SDTMIG-standard variable label
Expected data types (the SDTMIG uses character or numeric to conform to the data types consistent with SAS V5 transport file format, but define.xml allows for more descriptive data types, such as integer or float)
The actual controlled terms and formats used by the sponsor (do not include the asterisk (*) included in the CDISC domain models to indicate when controlled terminology applies)
The origin or source of the data (e.g., CRF, derived; see definitions in Section 4: 4.1.1.8, Origin Metadata)
The role of the variable in the dataset, corresponding to its role in the SDTM, if desired. Because these roles are predefined for all standard domains that follow the general observation classes, sponsors do not need to specify them in their define.xml for these domains.
Any Comments provided by the sponsor that may be useful to the Reviewer in understanding the variable or the data in it.
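
As a rough illustration, the attributes listed above could be held in a simple record before they are written to define.xml. The sketch below is an assumption, not part of the SDTMIG or the Define-XML specification: the class name VariableMetadata, its field names, the helper strip_ct_marker, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariableMetadata:
    """Hypothetical container for one variable's descriptive metadata."""
    name: str                               # SDTMIG-standard variable name
    label: str                              # SDTMIG-standard variable label
    data_type: str                          # e.g. "text", "integer", "float"
    controlled_terms: Optional[str] = None  # actual terms or formats used by the sponsor
    origin: Optional[str] = None            # e.g. "CRF", "Derived"
    role: Optional[str] = None              # optional for standard domains
    comment: Optional[str] = None           # free-text note for the reviewer


def strip_ct_marker(value: str) -> str:
    """Drop the asterisk that the domain models use to flag controlled terminology."""
    return value.replace("*", "").strip()


# Illustrative values only (assumed, not taken from a real study)
vstestcd = VariableMetadata(
    name="VSTESTCD",
    label="Vital Signs Test Short Name",
    data_type="text",
    controlled_terms=strip_ct_marker("* SYSBP, DIABP"),
    origin="CRF",
)
print(vstestcd)
```

The role field is left empty here because roles are predefined for standard domains and need not be specified by the sponsor.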

In addition to these metadata attributes, the CDISC domain models include three other shaded columns that are not sent to the FDA. These columns assist sponsors in preparing their datasets:

"CDISC Notes" is for notes to the sponsor relevant to the use of each variable
"Core" indicates how a variable is classified as a CDISC Core Variable [see Section 4: 4.1.1.5, CDISC Core Variables]
"References" provides references to the relevant section of the SDTM or the SDTMIG

The domain models in Section 6.1 – Interventions, Section 6.2 – Events, Section 6.3 – Findings, and Section 6.4 – FA Domain illustrate how to apply the SDTM when creating a specific domain dataset. In particular, these models illustrate the selection of a subset of the variables offered in one of the general observation classes along with applicable timing variables. The models also show how a standard variable from a general observation class should be adjusted to meet the specific content needs of a particular domain, including making the label more meaningful, specifying controlled terminology, and creating domain-specific notes and examples. Thus the domain models not only demonstrate how to apply the model for the most common domains, but also give insight into how to apply general model concepts to other domains not yet defined by CDISC.

3.2 Using the CDISC Domain Models in Regulatory Submissions – Dataset Metadata

The define.xml that accompanies a submission should also describe each dataset that is included in the submission and describe the natural key structure of each dataset. While most studies will include DM and a set of safety domains based on the three general observation classes (typically including EX, CM, AE, DS, MH, IE, LB, and VS), the actual choice of which data to submit will depend on the protocol and the needs of the regulatory reviewer. Dataset definition metadata should include dataset filenames, descriptions, locations, structures, class, purpose, keys, and comments as described below in Table 3.2.1.

In the event that no records are present in a dataset (e.g., a small PK study where no subjects took concomitant medications), the empty dataset should not be submitted and should not be described in the define.xml document. The annotated CRF will show the data that would have been submitted had data been received; it need not be re-annotated to indicate that no records exist.
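
The rule on empty datasets can be sketched as a simple filter applied before the dataset list is written to define.xml. This is a minimal sketch under assumptions: the function name datasets_to_submit and the record counts are hypothetical.

```python
def datasets_to_submit(record_counts):
    """Return the names of datasets that contain at least one record."""
    return [name for name, count in record_counts.items() if count > 0]


# Hypothetical record counts for a small PK study with no concomitant medications
counts = {"DM": 12, "EX": 12, "CM": 0, "PC": 144}
print(datasets_to_submit(counts))  # ['DM', 'EX', 'PC']
```

Here CM is neither submitted nor described, which matches the guidance for a dataset with no records.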

3.2.1 Table 3.2.1 SDTM Submission Dataset-Definition Metadata Example

Dataset	Description	Class	Structure	Purpose	Keys*	Location
DM	Demographics	Special Purpose Domains	One record per subject	Tabulation	STUDYID, USUBJID	dm.xpt
CO	Comments	Special Purpose Domains	One record per comment per subject	Tabulation	STUDYID, USUBJID, COSEQ	co.xpt
SE	Subject Elements	Special Purpose Domains	One record per actual Element per subject	Tabulation	STUDYID, USUBJID, ETCD, SESTDTC	se.xpt
SV	Subject Visits	Special Purpose Domains	One record per actual visit per subject	Tabulation	STUDYID, USUBJID, VISITNUM	sv.xpt
CM	Concomitant Medications	Interventions	One record per recorded medication occurrence or constant-dosing interval per subject	Tabulation	STUDYID, USUBJID, CMTRT, CMSTDTC	cm.xpt
EX	Exposure	Interventions	One record per constant dosing interval per subject	Tabulation	STUDYID, USUBJID, EXTRT, EXSTDTC	ex.xpt
SU	Substance Use	Interventions	One record per substance type per reported occurrence per subject	Tabulation	STUDYID, USUBJID, SUTRT, SUSTDTC	su.xpt
AE	Adverse Events	Events	One record per adverse event per subject	Tabulation	STUDYID, USUBJID, AEDECOD, AESTDTC	ae.xpt
DS	Disposition	Events	One record per disposition status or protocol milestone per subject	Tabulation	STUDYID, USUBJID, DSDECOD, DSSTDTC	ds.xpt
MH	Medical History	Events	One record per medical history event per subject	Tabulation	STUDYID, USUBJID, MHDECOD	mh.xpt
DV	Protocol Deviations	Events	One record per protocol deviation per subject	Tabulation	STUDYID, USUBJID, DVTERM, DVSTDTC	dv.xpt
CE	Clinical Events	Events	One record per event per subject	Tabulation	STUDYID, USUBJID, CETERM, CESTDTC	ce.xpt
EG	ECG Test Results	Findings	One record per ECG observation per time point per visit per subject	Tabulation	STUDYID, USUBJID, EGTESTCD, VISITNUM, EGTPTREF, EGTPTNUM	eg.xpt
IE	Inclusion/ Exclusion Criteria Not Met	Findings	One record per inclusion/exclusion criterion not met per subject	Tabulation	STUDYID, USUBJID, IETESTCD	ie.xpt
LB	Laboratory Test Results	Findings	One record per analyte per planned time point number per time point reference per visit per subject	Tabulation	STUDYID, USUBJID, LBTESTCD, LBSPEC, VISITNUM, LBTPTREF, LBTPTNUM	lb.xpt
PE	Physical Examination	Findings	One record per body system or abnormality per visit per subject	Tabulation	STUDYID, USUBJID, PETESTCD, VISITNUM	pe.xpt
QS	Questionnaire	Findings	One record per questionnaire per question per time point per visit per subject	Tabulation	STUDYID, USUBJID, QSCAT, QSTESTCD, VISITNUM, QSTPTREF, QSTPTNUM	qs.xpt
SC	Subject Characteristics	Findings	One record per characteristic per subject	Tabulation	STUDYID, USUBJID, SCTESTCD	sc.xpt
VS	Vital Signs	Findings	One record per vital sign measurement per time point per visit per subject	Tabulation	STUDYID, USUBJID, VSTESTCD, VISITNUM, VSTPTREF, VSTPTNUM	vs.xpt
DA	Drug Accountability	Findings	One record per drug accountability finding per subject	Tabulation	STUDYID, USUBJID, DATESTCD, DADTC	da.xpt
MB	Microbiology Specimen	Findings	One record per microbiology specimen finding per time point per visit per subject	Tabulation	STUDYID, USUBJID, MBTESTCD, VISITNUM, MBTPTREF, MBTPTNUM	mb.xpt
MS	Microbiology Susceptibility	Findings	One record per microbiology susceptibility test (or other organism-related finding) per organism found in MB	Tabulation	STUDYID, USUBJID, MSTESTCD, VISITNUM, MSTPTREF, MSTPTNUM	ms.xpt
PC	Pharmacokinetic Concentrations	Findings	One record per analyte per planned time point number per time point reference per visit per subject	Tabulation	STUDYID, USUBJID, PCTESTCD, VISITNUM, PCTPTREF, PCTPTNUM	pc.xpt
PP	Pharmacokinetic Parameters	Findings	One record per PK parameter per time-concentration profile per modeling method per subject	Tabulation	STUDYID, USUBJID, PPTESTCD, PPCAT, VISITNUM, PPTPTREF	pp.xpt
FA	Findings About Events or Interventions	Findings	One record per finding per object per time point per time point reference per visit per subject	Tabulation	STUDYID, USUBJID, FATESTCD, FAOBJ, VISITNUM, FATPTREF, FATPTNUM	fa.xpt
TA	Trial Arms	Trial Design	One record per planned Element per Arm	Tabulation	STUDYID, ARMCD, TAETORD	ta.xpt
TE	Trial Elements	Trial Design	One record per planned Element	Tabulation	STUDYID, ETCD	te.xpt
TV	Trial Visits	Trial Design	One record per planned Visit per Arm	Tabulation	STUDYID, VISITNUM, ARMCD	tv.xpt
TI	Trial Inclusion/ Exclusion Criteria	Trial Design	One record per I/E criterion	Tabulation	STUDYID, IETESTCD	ti.xpt
TS	Trial Summary	Trial Design	One record per trial summary parameter value	Tabulation	STUDYID, TSPARMCD, TSSEQ	ts.xpt
RELREC	Related Records	Special Purpose Datasets	One record per related record, group of records or datasets	Tabulation	STUDYID, RDOMAIN, USUBJID, IDVAR, IDVARVAL, RELID	relrec.xpt
SUPP--**	Supplemental Qualifiers for [domain name]	Special Purpose Datasets	One record per IDVAR, IDVARVAL, and QNAM value per subject	Tabulation	STUDYID, RDOMAIN, USUBJID, IDVAR, IDVARVAL, QNAM	supp--.xpt or suppqual.xpt

* Note that the key variables shown in this table are examples only. A sponsor's actual key structure may be different.

** Separate Supplemental Qualifier datasets of the form supp--.xpt are recommended. See Section 8: 8.4, Relating Non-Standard Variables Values To A Parent Domain.

3.2.1.1 Primary Keys

Table 3.2.1, SDTM Submission Dataset-Definition Metadata Example, above shows examples of what a sponsor might submit as the variables that comprise the primary key for SDTM datasets. Since the purpose of this column is to aid reviewers in understanding the structure of a dataset, sponsors should list all of the natural keys (see definition below) for the dataset. These keys should define uniqueness for records within a dataset, and may define a record sort order. The naming of these keys should be consistent with the description of the structure in the Structure column. For all the general-observation-class domains (and for some special-purpose domains), the --SEQ variable was created so that a unique record could be identified consistently across all of these domains via its use, along with STUDYID, USUBJID, and DOMAIN. In most domains, --SEQ will be a surrogate key (see definition below) for a set of variables that comprise the natural key. In certain instances, a Supplemental Qualifier (SUPP--) variable might also contribute to the natural key of a record for a particular domain. See Section 4: 4.1.1.9, Assigning Natural Keys In The Metadata, for how this should be represented and for additional information on keys.

A natural key is a piece of data (one or more columns of an entity) that uniquely identifies that entity and distinguishes it from any other row in the table. The advantage of natural keys is that they exist already, and one does not need to introduce a new "unnatural" value to the data schema. One of the difficulties in choosing a natural key is that just about any natural key one can think of has the potential to change. Because they have business meaning, natural keys are effectively coupled to the business, and they may need to be reworked when business requirements change. An example of such a change in clinical trials data would be the addition of a position or location that becomes a key in a new study, but wasn't collected in previous studies.

A surrogate key is a single-part, artificially established identifier for a record. Surrogate key assignment is a special case of derived data, one where a portion of the primary key is derived. A surrogate key is immune to changes in business needs. In addition, the key depends on only one field, so it's compact. A common way of deriving surrogate key values is to assign integer values sequentially. The --SEQ variable in the SDTM datasets is an example of a surrogate key for most datasets; in some instances, however, --SEQ might be a part of a natural key as a replacement for what might have been a key (e.g., a repeat sequence number) in the sponsor's database.
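
The relationship between the two kinds of key can be illustrated with a short sketch: first confirm that a candidate natural key is unique, then derive a sequential surrogate key from it. The function names duplicate_keys and assign_seq and the sample AE records are hypothetical, and numbering --SEQ from 1 within each subject is an assumption made for this example.

```python
from collections import Counter


def duplicate_keys(records, key_vars):
    """Return key tuples that occur more than once, i.e. where the key is not unique."""
    counts = Counter(tuple(rec[v] for v in key_vars) for rec in records)
    return [key for key, n in counts.items() if n > 1]


def assign_seq(records, seq_var, key_vars):
    """Sort by the natural key, then number records sequentially within each subject."""
    ordered = sorted(records, key=lambda rec: tuple(rec[v] for v in key_vars))
    last_seq = {}
    result = []
    for rec in ordered:
        subject = rec["USUBJID"]
        last_seq[subject] = last_seq.get(subject, 0) + 1
        result.append({**rec, seq_var: last_seq[subject]})
    return result


# Made-up adverse event records
ae = [
    {"STUDYID": "S1", "USUBJID": "S1-002", "AEDECOD": "NAUSEA", "AESTDTC": "2020-01-05"},
    {"STUDYID": "S1", "USUBJID": "S1-001", "AEDECOD": "HEADACHE", "AESTDTC": "2020-01-03"},
    {"STUDYID": "S1", "USUBJID": "S1-001", "AEDECOD": "HEADACHE", "AESTDTC": "2020-01-10"},
]
natural_key = ["STUDYID", "USUBJID", "AEDECOD", "AESTDTC"]

print(duplicate_keys(ae, natural_key))  # [] means the natural key is unique
ae_with_seq = assign_seq(ae, "AESEQ", natural_key)
for rec in ae_with_seq:
    print(rec["USUBJID"], rec["AEDECOD"], rec["AESTDTC"], rec["AESEQ"])
```

Dropping AESTDTC from the key in this sample would make the two headache records collide, which is the kind of finding that tells a sponsor the listed natural key is incomplete.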

3.2.1.2 CDISC Submission Value-Level Metadata

In general, the CDISC V3.x Findings data models are closely related to normalized, relational data models in a vertical structure of one record per observation. Since the V3.x data structures are fixed, sometimes information that might have appeared as columns in a more horizontal (denormalized) structure in presentations and reports will instead be represented as rows in an SDTM Findings structure. Because many different types of observations are all presented in the same structure, there is a need to provide additional metadata to describe the expected differences that differentiate, for example, hematology lab results from serum chemistry lab results in terms of data type, standard units and other attributes.

For example, the Vital Signs data domain could contain subject records related to diastolic and systolic blood pressure, height, weight, and body mass index (BMI). These data are all submitted in the normalized SDTM Findings structure of one row per vital signs measurement. This means that there could be five records per subject (one for each test or measurement) for a single visit or time point, with the parameter names stored in the Test Code/Name variables, and the parameter values stored in result variables. Since the unique Test Code/Names could have different attributes (i.e., different origins, roles, and definitions) there would be a need to provide value-level metadata for this information.
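
The vertical structure described above can be sketched by converting one horizontal record into one row per measurement. This is an illustration under assumptions: the function name to_findings_rows, the test codes, names, units, and sample values are hypothetical, and the result variables VSORRES and VSORRESU are taken from general SDTM naming conventions rather than from this section.

```python
def to_findings_rows(wide_record, tests):
    """Turn one horizontal vital signs record into one row per measurement."""
    rows = []
    for testcd, (test_name, unit) in tests.items():
        if wide_record.get(testcd) is None:
            continue  # no measurement collected for this test
        rows.append({
            "USUBJID": wide_record["USUBJID"],
            "VISITNUM": wide_record["VISITNUM"],
            "VSTESTCD": testcd,      # parameter short name
            "VSTEST": test_name,     # parameter name
            "VSORRES": str(wide_record[testcd]),  # result as collected
            "VSORRESU": unit,        # unit of the result
        })
    return rows


# Hypothetical test codes, names, and units
VS_TESTS = {
    "SYSBP": ("Systolic Blood Pressure", "mmHg"),
    "DIABP": ("Diastolic Blood Pressure", "mmHg"),
    "HEIGHT": ("Height", "cm"),
    "WEIGHT": ("Weight", "kg"),
    "BMI": ("Body Mass Index", "kg/m2"),
}

wide = {"USUBJID": "S1-001", "VISITNUM": 1, "SYSBP": 120, "DIABP": 80,
        "HEIGHT": 175, "WEIGHT": 70, "BMI": 22.9}
rows = to_findings_rows(wide, VS_TESTS)
print(len(rows))  # 5
```

Because each row carries a different test, attributes such as units or origin can differ from row to row within the same result column, which is why value-level metadata is needed.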

The value-level metadata should be provided as a separate section of the Case Report Tabulation Data Definition Specification (Define-XML). This information, which historically has been submitted as a PDF document named "define.pdf", should henceforth be submitted in an XML format. For details on the CDISC specification for submitting define.xml, see http://www.cdisc.org/define-xml

3.2.2 Conformance

Conformance with the SDTMIG Domain Models is minimally indicated by:

Following the complete metadata structure for data domains
Following SDTMIG domain models wherever applicable
Using SDTM-specified standard domain names and prefixes where applicable
Using SDTM-specified standard variable names
Using SDTM-specified variable labels for all standard domains
Using SDTM-specified data types for all variables
Following SDTM-specified controlled terminology and format guidelines for variables, when provided
Including all collected and relevant derived data in one of the standard domains, special-purpose datasets, or general-observation-class structures
Including all Required and Expected variables as columns in standard domains, and ensuring that all Required variables are populated
Ensuring that each record in a dataset includes the appropriate Identifier and Timing variables, as well as a Topic variable
Conforming to all business rules described in the CDISC Notes column and general and domain-specific assumptions.
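
One of the criteria above, that Required and Expected variables appear as columns and that Required variables are populated, lends itself to a simple automated check. The sketch below is hypothetical: the function name check_core_variables and the variable lists are assumptions for illustration, and the actual Core designations should be taken from the relevant domain model. It covers only this one criterion, not conformance as a whole.

```python
def check_core_variables(records, columns, required, expected):
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    # Required and Expected variables must both be present as columns.
    for var in list(required) + list(expected):
        if var not in columns:
            findings.append(f"{var}: column missing")
    # Required variables must additionally be populated in every record.
    for var in required:
        if var not in columns:
            continue  # already reported as a missing column
        for i, rec in enumerate(records, start=1):
            if rec.get(var) in (None, ""):
                findings.append(f"{var}: not populated in record {i}")
    return findings


# Assumed variable lists and made-up records, for illustration only
required = ["STUDYID", "DOMAIN", "USUBJID", "VSSEQ", "VSTESTCD"]
expected = ["VSORRES", "VSORRESU"]
columns = ["STUDYID", "DOMAIN", "USUBJID", "VSSEQ", "VSTESTCD", "VSORRES"]
records = [
    {"STUDYID": "S1", "DOMAIN": "VS", "USUBJID": "S1-001", "VSSEQ": 1,
     "VSTESTCD": "SYSBP", "VSORRES": "120"},
    {"STUDYID": "S1", "DOMAIN": "VS", "USUBJID": "S1-001", "VSSEQ": 2,
     "VSTESTCD": "", "VSORRES": "80"},
]
issues = check_core_variables(records, columns, required, expected)
for issue in issues:
    print(issue)
```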