2 Fundamentals of the SDTM

2.1 Observations and Variables

The V3.x Submission Data Standards are based on the SDTM's general framework for organizing clinical trials information that is to be submitted to the FDA. The SDTM is built around the concept of observations collected about subjects who participated in a clinical study. Each observation can be described by a series of variables, corresponding to a row in a dataset or table. Each variable can be classified according to its Role. A Role determines the type of information conveyed by the variable about each distinct observation and how it can be used. Variables can be classified into five major roles:

Identifier variables, such as those that identify the study, subject, domain, and sequence number of the record
Topic variables, which specify the focus of the observation (such as the name of a lab test)
Timing variables, which describe the timing of the observation (such as start date and end date)
Qualifier variables, which include additional illustrative text or numeric values that describe the results or additional traits of the observation (such as units or descriptive adjectives)
Rule variables, which express an algorithm or executable method to define start, end, and branching or looping conditions in the Trial Design model

The set of Qualifier variables can be further categorized into five sub-classes:

Grouping Qualifiers are used to group together a collection of observations within the same domain. Examples include --CAT and --SCAT.
Result Qualifiers describe the specific results associated with the topic variable in a Findings dataset. They answer the question raised by the topic variable. Result Qualifiers are --ORRES, --STRESC, and --STRESN.
Synonym Qualifiers specify an alternative name for a particular variable in an observation. Examples include --MODIFY and --DECOD, which are equivalent terms for a --TRT or --TERM topic variable, and --TEST and --LOINC, which are equivalent terms for a --TESTCD.
Record Qualifiers define additional attributes of the observation record as a whole (rather than describing a particular variable within a record). Examples include --REASND, AESLIFE, and all other SAE flag variables in the AE domain; AGE, SEX, and RACE in the DM domain; and --BLFL, --POS, --LOC, --SPEC, and --NAM in a Findings domain.
Variable Qualifiers are used to further modify or describe a specific variable within an observation and are only meaningful in the context of the variable they qualify. Examples include --ORRESU, --ORNRHI, and --ORNRLO, all of which are Variable Qualifiers of --ORRES; and --DOSU, which is a Variable Qualifier of --DOSE.

For example, in the observation, "Subject 101 had mild nausea starting on Study Day 6," the Topic variable value is the term for the adverse event, "NAUSEA". The Identifier variable is the subject identifier, "101". The Timing variable is the study day of the start of the event, which captures the information "starting on Study Day 6", while an example of a Record Qualifier is the severity, the value for which is "MILD". Additional Timing and Qualifier variables could be included to provide the necessary detail to adequately describe an observation.
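The nausea example can be written out as a single record whose variables are grouped by Role. This is only an illustrative sketch: the variable names USUBJID, AETERM, AESTDY, and AESEV are assumed here as the AE-domain names for the four values, and the helper `by_role` is hypothetical.

```python
# One observation (one dataset row), keyed by variable name.
# The variable names are assumptions chosen for illustration.
observation = {
    "USUBJID": "101",     # subject identifier
    "AETERM": "NAUSEA",   # term for the adverse event
    "AESTDY": 6,          # study day of the start of the event
    "AESEV": "MILD",      # severity
}

# Role assigned to each variable, following the example in the text.
roles = {
    "USUBJID": "Identifier",
    "AETERM": "Topic",
    "AESTDY": "Timing",
    "AESEV": "Record Qualifier",
}


def by_role(record, role_map):
    """Group the variables of one observation by their Role."""
    grouped = {}
    for name, value in record.items():
        grouped.setdefault(role_map[name], {})[name] = value
    return grouped


grouped = by_role(observation, roles)
```

Grouping by Role makes it easy to see that the record has exactly one Topic variable, with the other variables identifying, timing, or qualifying it.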

2.2 Datasets and Domains

Observations about study subjects are normally collected for all subjects in a series of domains. A domain is defined as a collection of logically related observations with a common topic. The logic of the relationship may pertain to the scientific subject matter of the data or to its role in the trial. Each domain is represented by a single dataset.

Each domain dataset is distinguished by a unique, two-character code that should be used consistently throughout the submission. This code, which is stored in the SDTM variable named DOMAIN, is used in four ways: as the dataset name, the value of the DOMAIN variable in that dataset, as a prefix for most variable names in that dataset, and as a value in the RDOMAIN variable in relationship tables [Section 8 – Representing Relationships and Data].
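As a rough illustration of these four uses, the sketch below derives each of them from one domain code. The function name `domain_code_uses` and the list of variable-name stems are hypothetical, not part of the standard.

```python
def domain_code_uses(code, variable_stems):
    """Return the four places where a two-character domain code appears."""
    if len(code) != 2:
        raise ValueError("domain codes are two characters long")
    return {
        "dataset_name": code,                                   # 1. dataset name
        "DOMAIN": code,                                         # 2. value of DOMAIN
        "variable_names": [code + s for s in variable_stems],   # 3. variable prefix
        "RDOMAIN": code,                                        # 4. value in RDOMAIN
    }


# Example with the AE domain and three illustrative stems.
uses = domain_code_uses("AE", ["SEQ", "TERM", "DECOD"])
```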

All datasets are structured as flat files with rows representing observations and columns representing variables. Each dataset is described by metadata definitions that provide information about the variables used in the dataset. The metadata are described in a data definition document named "define" that is submitted with the data to regulatory authorities. (See the Case Report Tabulation Data Definition Specification [Define-XML], available at www.CDISC.org.) Define-XML specifies seven distinct metadata attributes to describe SDTM data:

The Variable Name (limited to 8 characters for compatibility with the SAS Transport format)
A descriptive Variable Label, using up to 40 characters, which should be unique for each variable in the dataset
The data Type (e.g., whether the variable value is a character or numeric)
The set of controlled terminology for the value or the presentation format of the variable (Controlled Terms, Codelist, or Format)
The Origin of each variable [see Section 4: 4.1.1.8, Origin Metadata]
The Role of the variable, which determines how the variable is used in the dataset. For the V3.x domain models, Roles are used to represent the categories of variables such as Identifier, Topic, Timing, or the five types of Qualifiers.
Comments or other relevant information about the variable or its data included by the sponsor as necessary to communicate information about the variable or its contents to a regulatory agency.
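The seven attributes can be pictured as one metadata record per variable. The sketch below is an assumption-laden illustration, not the Define-XML schema: the class `VariableMetadata`, its field names, the function `check_metadata`, and the sample Origin values are all hypothetical. It checks only the constraints stated in the list above (8-character names, 40-character labels, labels unique within the dataset).

```python
from dataclasses import dataclass


@dataclass
class VariableMetadata:
    """One variable described by the seven attributes listed above."""
    name: str               # Variable Name, at most 8 characters
    label: str              # Variable Label, at most 40 characters
    data_type: str          # Type, e.g. character or numeric
    controlled_terms: str   # Controlled Terms, Codelist, or Format
    origin: str             # Origin of the variable
    role: str               # Role of the variable
    comments: str = ""      # Comments from the sponsor


def check_metadata(variables):
    """Return a list of problems found in one dataset's variable metadata."""
    problems = []
    seen_labels = set()
    for var in variables:
        if len(var.name) > 8:
            problems.append(f"{var.name}: name longer than 8 characters")
        if len(var.label) > 40:
            problems.append(f"{var.name}: label longer than 40 characters")
        if var.label in seen_labels:
            problems.append(f"{var.name}: label is not unique in the dataset")
        seen_labels.add(var.label)
    return problems


# Illustrative metadata; the Origin values are placeholders.
example = [
    VariableMetadata("STUDYID", "Study Identifier", "Char", "", "Assigned", "Identifier"),
    VariableMetadata("USUBJID", "Unique Subject Identifier", "Char", "", "Assigned", "Identifier"),
]
problems = check_metadata(example)
```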

Data stored in SDTM datasets include both raw (as originally collected) and derived values (e.g., converted into standard units, or computed on the basis of multiple values, such as an average). The SDTM lists only the name, label, and type, with a set of brief CDISC guidelines that provide a general description for each variable used for a general observation class.

The domain dataset models included in Section 5 – Models For Special-Purpose Domains and Section 6 – Domain Models Based On The General Observation Classes of this document provide additional information about Controlled Terms or Format, notes on proper usage, and examples. Controlled terminology (CT) is now represented in one of four ways:

A single asterisk when there is no specific CT available at the current time, but the SDS Team expects that sponsors may have their own CT and/or the CDISC Controlled Terminology Team may be developing CT.
A list of controlled terms for the variable when values are not yet maintained externally.
The name of an external codelist whose values can be found either via the hyperlinks in the domain or by accessing the CDISC Controlled Terminology as outlined in Appendix C – Controlled Terminology.
A common format such as ISO 8601.

The CDISC Controlled Terminology team will be publishing additional guidance on use of controlled terminology separately.

2.3 Special-Purpose Datasets

The SDTM includes three types of special-purpose datasets:

Domain datasets, consisting of Demographics (DM), Comments (CO), Subject Elements (SE), and Subject Visits (SV) (see footnote 1), all of which include subject-level data that do not conform to one of the three general observation classes. These are described in Section 5 – Models For Special-Purpose Domains.
Trial Design Model (TDM) datasets, such as Trial Arms (TA) and Trial Elements (TE), which represent information about the study design but do not contain subject data. These are described in Section 7 – Trial Design Datasets.
Relationship datasets, which include the RELREC and SUPP-- datasets described in Section 8 – Representing Relationships and Data.

2.4 The General Observation Classes

Most subject-level observations collected during the study should be represented according to one of the three SDTM general observation classes: Interventions, Events, or Findings. The lists of variables allowed to be used in each of these can be found in the SDTM.

The Interventions class captures investigational, therapeutic and other treatments that are administered to the subject (with some actual or expected physiological effect) either as specified by the study protocol (e.g., exposure to study drug), coincident with the study assessment period (e.g., concomitant medications), or self-administered by the subject (such as use of alcohol, tobacco, or caffeine).
The Events class captures planned protocol milestones such as randomization and study completion, and occurrences, conditions, or incidents independent of planned study evaluations occurring during the trial (e.g., adverse events) or prior to the trial (e.g., medical history).
The Findings class captures the observations resulting from planned evaluations to address specific tests or questions such as laboratory tests, ECG testing, and questions listed on questionnaires.

In most cases, the choice of observation class appropriate to a specific collection of data can be easily determined according to the descriptions provided above. The majority of data, which typically consists of measurements or responses to questions usually at specific visits or time points, will fit the Findings general observation class. Additional guidance on choosing the appropriate general observation class is provided in Section 8: 8.6.1, Guidelines For Determining The General Observation Class.

General assumptions for use with all domain models and custom domains based on the general observation classes are described in Section 4 – Assumptions For Domain Models of this document; specific assumptions for individual domains are included with the domain models.

Footnote 1: SE and SV were included as part of the Trial Design Model in earlier versions of the SDTMIG.

2.5 The SDTM Standard Domain Models

The following standard domains, listed with their respective domain codes in alphabetical order by domain code, have been defined or referenced by the CDISC SDS Team in this document. Note that other domain models may be posted separately for comment after this publication.

Special-Purpose Domains (defined in Section 5 – Models For Special-Purpose Domains):

Comments (CO) • Demographics (DM)
Subject Elements (SE) • Subject Visits (SV)

Interventions General Observation Class (defined in Section 6.1 – Interventions):

Concomitant Medications (CM) • Exposure as Collected (EC)
Exposure (EX) • Substance Use (SU)
Procedures (PR)

Events General Observation Class (defined in Section 6.2 – Events):

Adverse Events (AE) • Clinical Events (CE)
Disposition (DS) • Protocol Deviations (DV)
Healthcare Encounters (HO) • Medical History (MH)

Findings General Observation Class (defined in Section 6.3 – Findings):

Drug Accountability (DA) • Death Details (DD)
ECG Test Results (EG) • Inclusion/Exclusion Criterion Not Met (IE)
Immunogenicity Specimen Assessments (IS) • Laboratory Test Results (LB)
Microbiology Specimen (MB) • Microscopic Findings (MI)
Morphology (MO) • Microbiology Susceptibility Test (MS)
PK Concentrations (PC) • PK Parameters (PP)
Physical Examination (PE) • Questionnaires (QS)
Reproductive System Findings (RP) • Disease Response (RS)
Subject Characteristics (SC) • Subject Status (SS)
Tumor Identification (TU) • Tumor Results (TR)
Vital Signs (VS)

Findings About (defined in Section 6.4 – FA Domain):

Findings About (FA) • Skin Response (SR)

Trial Design Domains (defined in Section 7 – Trial Design Datasets):

Trial Arms (TA) • Trial Disease Assessment (TD)
Trial Elements (TE) • Trial Visits (TV)
Trial Inclusion/Exclusion Criteria (TI) • Trial Summary (TS)

Relationship Datasets (defined in Section 8 – Representing Relationships and Data):

Supplemental Qualifiers (SUPP-- datasets) • Related Records (RELREC)
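For quick reference, the listing above can be turned into a lookup from domain code to its group. This is a sketch built only from the codes named in this section; the names `DOMAIN_GROUPS`, `CODE_TO_GROUP`, and `group_of` are hypothetical, and the relationship datasets (SUPP-- and RELREC) are left out because they are not two-character domain codes.

```python
# Domain codes grouped under the headings used in the listing above.
DOMAIN_GROUPS = {
    "Special-Purpose": ["CO", "DM", "SE", "SV"],
    "Interventions": ["CM", "EC", "EX", "SU", "PR"],
    "Events": ["AE", "CE", "DS", "DV", "HO", "MH"],
    "Findings": [
        "DA", "DD", "EG", "IE", "IS", "LB", "MB", "MI", "MO", "MS", "PC",
        "PP", "PE", "QS", "RP", "RS", "SC", "SS", "TU", "TR", "VS",
    ],
    "Findings About": ["FA", "SR"],
    "Trial Design": ["TA", "TD", "TE", "TV", "TI", "TS"],
}

# Invert the grouping so a code can be looked up directly.
CODE_TO_GROUP = {
    code: group for group, codes in DOMAIN_GROUPS.items() for code in codes
}


def group_of(code):
    """Return the group a domain code is listed under, if any."""
    return CODE_TO_GROUP.get(code.upper(), "custom or unlisted")
```

A code that is not in the listing, such as one chosen by a sponsor for a custom domain, falls through to the "custom or unlisted" label.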

A sponsor should only submit domain datasets that were actually collected (or directly derived from the collected data) for a given study. Decisions on what data to collect should be based on the scientific objectives of the study, rather than the SDTM. Note that any data that was collected and will be submitted in an analysis dataset must also appear in a tabulation dataset.

The collected data for a given study may use some or all of the SDS standard domains as well as additional custom domains based on the three general observation classes. A list of standard domain codes for many commonly used domains is provided in . Additional standard domain models will be published by CDISC as they are developed, and sponsors are encouraged to check the CDISC website for updates.

These general rules apply when determining which variables to include in a domain:

The Identifier variables, STUDYID, USUBJID, DOMAIN, and --SEQ are required in all domains based on the general observation classes. Other Identifiers may be added as needed.
Any Timing variables are permissible for use in any submission dataset based on a general observation class except where restricted by specific domain assumptions.
Any additional Qualifier variables from the same general observation class may be added to a domain model except where restricted by specific domain assumptions.
Sponsors may not add any other variables than those described in the preceding three bullets. The addition of non-standard variables will compromise the FDA's abilities to populate the data repository and to use standard tools. The SDTM allows for the inclusion of the sponsor's non-SDTM variables using the Supplemental Qualifiers special-purpose dataset structure, described in Section 8: 8.4, Relating Non-Standard Variables Values To A Parent Domain. As the SDTM continues to evolve over time, certain additional standard variables may be added to the general observation classes. Sponsors wishing to nominate such variables for future consideration should therefore provide a rationale and description of the proposed variable(s) along with representative examples to the CDISC Public Discussion Forum.
Standard variables must not be renamed or modified for novel usage. Their metadata should not be changed.
As long as no data was collected for Permissible variables, a sponsor is free to drop them and the corresponding descriptions from the Define-XML.
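The first rule lends itself to a simple automated check. The sketch below assumes a dataset's variables are available as a list of column names; the function `missing_identifiers` is hypothetical and covers only the four required Identifier variables, not the other rules.

```python
def missing_identifiers(domain_code, columns):
    """Return the required Identifier variables absent from a dataset."""
    # The sequence variable carries the domain code as its prefix.
    required = ["STUDYID", "USUBJID", "DOMAIN", domain_code + "SEQ"]
    return [name for name in required if name not in columns]


# A Vital Signs dataset that lacks its sequence number.
missing = missing_identifiers("VS", ["STUDYID", "DOMAIN", "USUBJID", "VSTESTCD"])
```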

2.6 Creating a New Domain

This section describes the overall process for creating a custom domain, which must be based on one of the three SDTM general observation classes. The number of domains submitted should be based on the specific requirements of the study. Follow the process below to create a custom domain:

Confirm that none of the existing published domains will fit the need. A custom domain may only be created if the data are different in nature and do not fit into an existing published domain.
Establish a domain of a common topic (i.e., where the nature of the data is the same), rather than by a specific method of collection (e.g., electrocardiogram – EG). Group and separate data within the domain using --CAT, --SCAT, --METHOD, --SPEC, --LOC, etc. as appropriate. Examples of different topics are: microbiology, tumor measurements, pathology/histology, vital signs, and physical exam results.
Do not create separate domains based on time; rather, represent both prior and current observations in a domain (e.g., CM for all non-study medications). Note that AE and MH are an exception to this best practice because of regulatory reporting needs.
How collected data are used (e.g., to support analyses and/or efficacy endpoints) must not result in the creation of a custom domain. For example, if blood pressure measurements are endpoints in a hypertension study, they must still be represented in the VS (Vital Signs) domain as opposed to a custom "efficacy" domain. Similarly, if liver function test results are of special interest, they must still be represented in the LB (Laboratory Tests) domain.
Data that were collected on separate CRF modules or pages may fit into an existing domain (such as separate questionnaires into the QS domain, or prior and concomitant medications in the CM domain).
If it is necessary to represent relationships between data that are hierarchical in nature (e.g., a parent record must be observed before child records), then establish a domain pair (e.g., MB/MS, PC/PP). Note, domain pairs have been modeled for microbiology data (MB/MS domains) and PK data (PC/PP domains) to enable dataset-level relationships to be described using RELREC. The domain pair uses DOMAIN as an Identifier to group parent records (e.g., MB) from child records (e.g., MS) and enables a dataset-level relationship to be described in RELREC. Without using DOMAIN to facilitate description of the data relationships, RELREC, as currently defined could not be used without introducing a variable that would group data like DOMAIN.
Check the Submission Data Standards area of the CDISC website (http://www.cdisc.org/) for models added after the last publication of the SDTMIG.
Look for an existing, relevant domain model to serve as a prototype. If no existing model seems appropriate, choose the general observation class (Interventions, Events, or Findings) that best fits the data by considering the topic of the observation.

The general approach for selecting variables for a custom domain is as follows (also see Figure 2.6, Creating A New Domain, below):

a. Select and include the required Identifier variables (e.g., STUDYID, DOMAIN, USUBJID, --SEQ) and any permissible Identifier variables from SDTM: Table 2.2.4.

b. Include the Topic variable from the identified general observation class (e.g., --TESTCD for Findings) [SDTM: Tables 2.2.1, 2.2.2, or 2.2.3].

c. Select and include the relevant Qualifier variables from the identified general observation class [SDTM: Tables 2.2.1, 2.2.2, or 2.2.3]. Variables belonging to other general observation classes must not be added.

d. Select and include the applicable Timing variables [see SDTM: Table 2.2.5]. Determine the domain code. Check the CDISC Controlled Terminology [see Appendix C – Controlled Terminology] for reserved two-character domain identifiers or abbreviations. If one has not been assigned by CDISC, then the sponsor may select the unique two-character domain code to be used consistently throughout the submission.

e. Apply the two-character domain code to the appropriate variables in the domain. Replace all variable prefixes (shown in the models as two hyphens "--") with the domain code. If no domain code exists in the CDISC Controlled Terminology [see Appendix C – Controlled Terminology] for this data and if it is desired to have this domain code as part of CDISC controlled terminology, then submit a request to add the new domain via the CDISC website. Requests for new domain codes must include:

1) Two-letter domain code and description
2) Rationale for domain code
3) Domain model with assumptions
4) Examples

Upon receipt, the SDS Domain Code Subteam will review the package. If accepted, then the proposal will be submitted to the SDS Team for review. Upon approval, a response will be sent to the requestor and package processing will begin (i.e., prepare for inclusion in a next release of the SDTM and SDTMIG, mapping concepts to BRIDG, and posting an update to the CDISC website). If declined, then the Domain Code Subteam will draft a response for SDS Team review. Upon agreement, the response will be sent to the requestor and also posted to the CDISC website.

f. Set the order of variables consistent with the order defined in SDTM: Tables 2.2.1, 2.2.2, or 2.2.3, depending upon the general observation class the custom domain is based on.

g. Adjust the labels of the variables only as appropriate to properly convey the meaning in the context of the data being submitted in the newly created domain. Use title case for all labels (title case means to capitalize the first letter of every word except for articles, prepositions, and conjunctions).

h. Ensure that appropriate standard variables are being properly applied by comparing the use of variables in standard domains.

i. Describe the dataset within the define.xml document [see Section 3: 3.2, Using The CDISC Domain Models In Regulatory Submissions – Dataset Metadata].

j. Place any non-standard (SDTM) variables in a Supplemental Qualifier dataset. Mechanisms for representing additional non-standard Qualifier variables not described in the general observation classes and for defining relationships between separate datasets or records are described in Section 8: 8.4, Relating Non-Standard Variables Values To A Parent Domain of this document.
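Two of the mechanical steps above, replacing the two-hyphen prefix with the domain code and putting labels in title case, can be sketched as follows. Everything here is illustrative: "XY" is a made-up domain code, the helper names are hypothetical, and `MINOR_WORDS` is an assumed, incomplete list of articles, prepositions, and conjunctions. The sketch also assumes the first word of a label is always capitalized, which the text does not state.

```python
# Assumed (incomplete) list of articles, prepositions, and conjunctions.
MINOR_WORDS = {
    "a", "an", "the", "and", "or", "but", "nor",
    "of", "in", "on", "at", "to", "for", "by", "with", "from",
}


def apply_domain_code(code, template_names):
    """Replace the two-hyphen prefix placeholder with the domain code."""
    return [
        code + name[2:] if name.startswith("--") else name
        for name in template_names
    ]


def title_case(label):
    """Capitalize each word except minor words (first word always capitalized)."""
    words = []
    for index, word in enumerate(label.split()):
        if index > 0 and word.lower() in MINOR_WORDS:
            words.append(word.lower())
        else:
            words.append(word[:1].upper() + word[1:])
    return " ".join(words)


# Hypothetical custom Findings domain with code "XY".
names = apply_domain_code("XY", ["STUDYID", "DOMAIN", "USUBJID", "--SEQ", "--TESTCD"])
label = title_case("category for lab test")
```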

2.7 SDTM Variables Allowed in SDTMIG

This section identifies those SDTM variables that either 1) should not be used in SDTM-compliant data tabulations of clinical trials data or 2) have not yet been evaluated for use in human clinical trials.

The following SDTM variables, defined for use in non-clinical studies (SEND), must NEVER be used in the submission of SDTM-based data for human clinical trials:

--DTHREL (Findings)
--EXCLFL (Findings)
--REASEX (Findings)
--DETECT (Findings)

The following variables can be used for non-clinical studies (SEND) but must NEVER be used in the Demographics domain for human clinical trials. However, the use of these variables is currently being evaluated in Findings general observation class domains being developed for use in the tabulations of virology data:

SPECIES (Demographics)
STRAIN (Demographics)
SBSTRAIN (Demographics)

The following variables have not been evaluated for use in human clinical trials and must therefore be used with extreme caution:

--ANTREG (Findings)
SETCD (Demographics) [Note: The use of SETCD additionally requires the use of the Trial Sets domain]

The following identifier variable can be used for non-clinical studies (SEND), and may be used in human clinical trials when appropriate:

POOLID [Note: The use of POOLID additionally requires the use of the Pool Definition dataset]

Other variables defined in the SDTM are allowed for use as defined in this SDTMIG except when explicitly stated. Custom domains, created following the guidance in Section 2.6, Creating A New Domain, may utilize any appropriate Qualifier variables from the selected general observation class.
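The restrictions in this section could be screened for with a small check like the one below. It is a sketch under stated assumptions: the prefix placeholder is written as two hyphens, the function `review_variables` and the set names are hypothetical, and the check does not verify that a dataset belongs to the Findings class before applying the Findings-specific restrictions.

```python
# Variables named in this section, with "--" standing for the domain code.
NOT_FOR_HUMAN_TRIALS = {"--DTHREL", "--EXCLFL", "--REASEX", "--DETECT"}
NOT_FOR_DEMOGRAPHICS = {"SPECIES", "STRAIN", "SBSTRAIN"}
USE_WITH_CAUTION = {"--ANTREG", "SETCD"}


def review_variables(domain_code, columns):
    """Flag variables that this section restricts for human clinical trials."""
    findings = {}
    for name in columns:
        # Generic form of the name, e.g. LBEXCLFL -> --EXCLFL.
        generic = None
        if name.startswith(domain_code):
            generic = "--" + name[len(domain_code):]
        if generic in NOT_FOR_HUMAN_TRIALS:
            findings[name] = "not allowed"
        elif domain_code == "DM" and name in NOT_FOR_DEMOGRAPHICS:
            findings[name] = "not allowed in DM"
        elif generic in USE_WITH_CAUTION or name in USE_WITH_CAUTION:
            findings[name] = "use with extreme caution"
    return findings


lb_review = review_variables("LB", ["LBTESTCD", "LBEXCLFL", "LBANTREG"])
dm_review = review_variables("DM", ["USUBJID", "SPECIES", "SETCD"])
```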