Comparison of hypothesis- and data-driven asthma phenotypes in NHANES 2007–2012: the importance of comprehensive data availability
Clinical and Translational Allergy volume 9, Article number: 17 (2019)
Half of the adults with current asthma among the US National Health and Nutrition Examination Survey (NHANES) participants could be classified in more than one hypothesis-driven phenotype. A data-driven approach applied to the same subjects may allow a more useful classification compared to the hypothesis-driven one.
To compare previously defined hypothesis-driven with newly derived data-driven asthma phenotypes, identified by latent class analysis (LCA), in adults with current asthma from NHANES 2007–2012.
Adults (≥ 18 years) with current asthma from the NHANES were included (n = 1059). LCA included variables commonly used to subdivide asthma. LCA models were derived independently according to age groups: < 40 and ≥ 40 years old.
Two data-driven phenotypes were identified among adults with current asthma, for both age groups. The proportions of the hypothesis-driven phenotypes were similar among the two data-driven phenotypes (p > 0.05). Class A < 40 years (n = 285; 75%) and Class A ≥ 40 years (n = 462; 73%), respectively, were characterized by a predominance of highly symptomatic asthma subjects with poor lung function, compared to Class B < 40 years (n = 94; 25%) and Class B ≥ 40 years (n = 170; 27%). Inflammatory biomarkers, smoking status, presence of obesity and hay fever did not markedly differ between the phenotypes.
Both data- and hypothesis-driven approaches using clinical and physiological variables commonly used to characterize asthma are suboptimal to identify asthma phenotypes among adults from the general population. Further studies based on more comprehensive disease features are required to identify asthma phenotypes in population-based studies.
Airways diseases, such as asthma and chronic obstructive pulmonary disease (COPD), comprise a heterogeneous set of subtypes with different underlying pathophysiological mechanisms [1,2,3]. Both hypothesis-driven and data-driven methods can be used to classify patients into sub-groups of airways diseases [4,5,6].
The hypothesis-driven approach classifies airways diseases based on pre-defined criteria following immunopathology concepts and asthma literature, while in data-driven methods no prior disease classification is required [7, 8]. Data-driven approaches have provided insights into “novel” phenotypes of complex disease pathogenesis, suggesting disease stratification depending on the individual pathophysiologic characteristics [8,9,10,11].
Most studies on asthma phenotyping using data-driven methods emphasize patients with moderate to severe asthma and/or clinically-based settings [12,13,14,15]. Therefore, the generalization to the general asthma population may be limited.
Different types of data-driven methods have been widely used in airway diseases, such as hierarchical clustering, partitioning methods, and latent class analysis (LCA). Notably, LCA appeared to account better for the heterogeneity of airways symptoms, compared to other commonly used data-driven approaches (e.g. partitioning around medoids). Moreover, the application of the latent class assignments developed from a national data source has previously demonstrated higher degrees of generalizability.
Recently, we reported a significant overlap between five distinct hypothesis-driven asthma phenotypes in adults from the general population included in the US National Health and Nutrition Examination Survey (NHANES). We emphasized that a combination of clinical information and biomarkers, using a more comprehensive data analysis approach, such as data-driven methods, could provide a better taxonomy of non-severe asthma.
In this study, we aimed to compare previously defined hypothesis-driven asthma phenotypes with data-driven asthma phenotypes derived by applying LCA to a sample of adults representative of the US general population.
Study setting and participants
We included subjects who participated in the NHANES study, a nationally representative survey of the civilian, non-institutionalized US population performed with the aim of gathering data regarding health and nutritional status. Protocols were approved by the National Center for Health Statistics Research Ethics Review Board and all participants gave written informed consent. Detailed information can be found in the NHANES documentation (www.cdc.gov/nchs/nhanes.htm).
Data from three NHANES survey cycles (2007–2012) were used (n = 30,442). We included adults (≥ 18 years old) with current asthma (n = 1059), defined by a positive answer to the questions: “Has a doctor ever told you that you have asthma?” together with “Do you still have asthma?”, and either “wheezing/whistling in the chest in the past 12 months” or “asthma attack in the past 12 months.”
Anthropometric and demographic characteristics, such as age, gender, body mass index (BMI), and smoking status were analysed, as well as blood eosinophils (B-Eos) count, fraction of exhaled nitric oxide (FeNO) and spirometric parameters. FeNO and spirometry were performed following ATS/ERS recommendations [19, 20]. Basal predicted values of forced expiratory volume during the first second (FEV1) and forced vital capacity (FVC) were calculated [21, 22] and abnormal values were defined as being below the lower limit of normal (LLN).
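The inclusion rule above can be written as a short predicate. This is a minimal sketch, not the authors' code: the function and argument names are hypothetical, and it assumes each questionnaire item has already been recoded to a boolean.

```python
def has_current_asthma(age_years, ever_diagnosed, still_has_asthma,
                       wheeze_past_12m, attack_past_12m):
    """Return True when an adult meets the current-asthma definition.

    All argument names are illustrative; they do not correspond to
    NHANES variable codes.
    """
    is_adult = age_years >= 18
    # Both diagnosis questions must be positive ...
    diagnosed_and_persistent = ever_diagnosed and still_has_asthma
    # ... plus at least one of the two 12-month symptom items.
    recent_symptoms = wheeze_past_12m or attack_past_12m
    return is_adult and diagnosed_and_persistent and recent_symptoms


print(has_current_asthma(35, True, True, False, True))   # meets the definition
print(has_current_asthma(35, True, False, True, True))   # no longer has asthma
```

Missing or refused answers are not handled here; a real analysis would need an explicit rule for them.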
Hypothesis-driven asthma phenotypes
The analysis based on the report of smoking status, presence of obesity and inflammatory markers enabled the definition of five asthma phenotypes: B-Eos-high asthma, if B-Eos ≥ 300/mm3; FeNO-high asthma, if FeNO ≥ 35 ppb; B-Eos&FeNO-low asthma, if B-Eos < 150/mm3 and FeNO < 20 ppb; asthma with obesity (AwObesity), if BMI ≥ 30 kg/m2; and asthma with concurrent COPD (AwCOPD), if subjects had self-reported chronic bronchitis/emphysema with age of diagnosis ≥ 40 years and were either current or ex-smokers (ever smoked). Subjects were considered as “non-classified” if they did not meet the criteria for any of the defined asthma phenotypes. Additionally, to account for individuals with probable co-existence of asthma and COPD and minimize age as a confounding variable, we conducted the analysis considering two age groups: < 40 and ≥ 40 years old.
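Because the five definitions are not mutually exclusive, one subject can carry several labels at once. The sketch below applies the cut-offs listed above and returns every matching label; the function and argument names are hypothetical, and complete (non-missing) measurements are assumed.

```python
def hypothesis_driven_phenotypes(b_eos_per_mm3, feno_ppb, bmi_kg_m2,
                                 chronic_bronchitis_or_emphysema,
                                 age_at_diagnosis, ever_smoker):
    """Return all hypothesis-driven phenotype labels a subject satisfies."""
    labels = []
    if b_eos_per_mm3 >= 300:
        labels.append("B-Eos-high")
    if feno_ppb >= 35:
        labels.append("FeNO-high")
    if b_eos_per_mm3 < 150 and feno_ppb < 20:
        labels.append("B-Eos&FeNO-low")
    if bmi_kg_m2 >= 30:
        labels.append("AwObesity")
    # AwCOPD needs the self-report, a diagnosis at age 40 or later,
    # and a history of smoking.
    if (chronic_bronchitis_or_emphysema and age_at_diagnosis is not None
            and age_at_diagnosis >= 40 and ever_smoker):
        labels.append("AwCOPD")
    return labels or ["non-classified"]


# A subject with high eosinophils, high FeNO and obesity gets three labels.
print(hypothesis_driven_phenotypes(320, 40, 31, False, None, False))
# Intermediate biomarkers and normal BMI leave the subject non-classified.
print(hypothesis_driven_phenotypes(200, 25, 24, False, None, False))
```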
Data-driven asthma phenotypes
LCA was used to identify asthma phenotypes in an unsupervised manner (data-driven approach). Two models for “current asthma” were developed (Additional file 1: Table S1): Model 1 was based on the 4 variables previously used to define the hypothesis-driven asthma phenotypes (BMI ≥ 30 kg/m2, ever-smoking status, FeNO ≥ 35 ppb, B-Eos ≥ 300/mm3); in Model 2, we added to these 4 variables sex, early asthma onset (< 16 years old), wheezing-related questions (presence/absence of at least one wheezing attack, wheezing with exercise, sleep disturbance by wheezing, activity limitation by wheezing, absenteeism by wheezing), asthma-related emergency department (ED) visit in the previous 12 months, FEV1/FVC < LLN, FEV1 < LLN, and self-reported hay fever.
Additionally, to explore the results in different “asthma populations”, we developed two other models using similar variables. For the “ever asthma” subgroup (Model 3) we included subjects with a positive answer to “Has a doctor ever told you that you have asthma?” (n = 2611); and for the “difficult asthma” subgroup (Model 4) we included subjects with poor asthma-related outcomes, defined as current asthma plus at least one of the following: asthma-related ED visit, FEV1 < LLN, or oral corticosteroid use in the past 30 days (n = 673) (Additional file 1: Table S1).
Latent class models were derived independently for each age group, using the same variables, and a secondary analysis without stratifying by age was done on the three asthma subgroups. The most appropriate number of clusters was determined by examining commonly used criteria. Further methodological details are provided in Additional file 1.
All analyses considered the complex multistage sampling and 6-year sampling weights provided by the NHANES documentation. LCA was performed with Mplus (version 6.12), which accounts for the complex survey design of NHANES in LCA modelling. All other analyses were performed in Stata/IC 15.1 (Stata Corp, College Station, TX, USA). A p-value < 0.05 was considered statistically significant.
We included 1059 adults with current asthma. The weighted proportions of the previously defined hypothesis-driven asthma phenotypes, according to age groups (< 40 and ≥ 40 years old) were, respectively: 42% and 53% with AwObesity; 34% and 37% with B-Eos-high asthma; 26% and 21% for B-Eos&FeNO-low; 18% and 19% with FeNO-high asthma; and 19% AwCOPD, in the older group. In addition, 17% and 12% of the individuals in the < 40 and ≥ 40 years old groups, respectively, were categorized as “non-classified”.
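To make the modelling step concrete, the following is a minimal sketch of latent class analysis for dichotomous indicators, fitted by expectation-maximisation in plain Python. It is not the Mplus procedure used in the study: it ignores sampling weights and the survey design, uses a single ordered start instead of multiple random starts, and the name `fit_lca` and the toy data are hypothetical.

```python
import math
import random


def fit_lca(rows, n_classes, n_iter=200, seed=0):
    """Fit an unweighted latent class model to 0/1 indicators with EM.

    Returns class weights, per-class item probabilities, posterior class
    probabilities per row, and the log-likelihood from the last E-step.
    """
    rng = random.Random(seed)
    n_obs, n_items = len(rows), len(rows[0])
    class_weights = [1.0 / n_classes] * n_classes
    # Assumption: ordered starting values with a small jitter, so that
    # classes start out distinct; real analyses use many random starts.
    item_probs = [[(k + 1) / (n_classes + 1) + rng.uniform(-0.05, 0.05)
                   for _ in range(n_items)] for k in range(n_classes)]
    posteriors, log_lik = [], float("-inf")
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each row.
        posteriors, log_lik = [], 0.0
        for row in rows:
            log_joint = []
            for k in range(n_classes):
                value = math.log(class_weights[k])
                for j, x in enumerate(row):
                    p = item_probs[k][j]
                    value += math.log(p if x else 1.0 - p)
                log_joint.append(value)
            peak = max(log_joint)  # log-sum-exp for numerical stability
            total = sum(math.exp(v - peak) for v in log_joint)
            log_lik += peak + math.log(total)
            posteriors.append([math.exp(v - peak) / total for v in log_joint])
        # M-step: update class sizes and item-response probabilities.
        for k in range(n_classes):
            mass = max(sum(post[k] for post in posteriors), 1e-12)
            class_weights[k] = mass / n_obs
            for j in range(n_items):
                hits = sum(post[k] for post, row in zip(posteriors, rows)
                           if row[j])
                # Keep probabilities off 0 and 1 so logarithms stay finite.
                item_probs[k][j] = min(max(hits / mass, 1e-6), 1.0 - 1e-6)
    return class_weights, item_probs, posteriors, log_lik


# Toy data: four binary indicators, mostly all-positive or all-negative rows.
rows = ([[1, 1, 1, 1]] * 60 + [[1, 1, 0, 1]] * 10
        + [[0, 0, 0, 0]] * 25 + [[0, 1, 0, 0]] * 5)
weights, probs, posteriors, log_lik = fit_lca(rows, n_classes=2)
print([round(w, 2) for w in weights])
```

Each subject can then be assigned to the class with the highest posterior probability, which is how discrete, mutually exclusive classes are obtained from the fitted model.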
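The three nested asthma populations can be expressed as one helper that returns every subgroup a subject belongs to. This is an illustrative sketch with hypothetical names; it assumes boolean inputs and that the two 12-month symptom items have already been combined.

```python
def asthma_subgroups(ever_diagnosed, still_has_asthma,
                     wheeze_or_attack_past_12m, ed_visit_past_12m,
                     fev1_below_lln, oral_steroids_past_30d):
    """Return the set of asthma populations a subject falls into."""
    groups = set()
    if ever_diagnosed:
        groups.add("ever asthma")
    current = (ever_diagnosed and still_has_asthma
               and wheeze_or_attack_past_12m)
    if current:
        groups.add("current asthma")
        # Difficult asthma: current asthma plus at least one poor outcome.
        if ed_visit_past_12m or fev1_below_lln or oral_steroids_past_30d:
            groups.add("difficult asthma")
    return groups


print(sorted(asthma_subgroups(True, True, True, False, True, False)))
```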
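The text does not name the criteria used here. As an assumption for illustration only, the sketch below computes two that are widely used for choosing the number of latent classes, AIC and BIC, from a model's log-likelihood. For K classes and J dichotomous indicators, the model has (K − 1) class-size parameters plus K × J item-response probabilities. The function names are hypothetical.

```python
import math


def lca_free_parameters(n_classes, n_binary_items):
    """Count free parameters of a latent class model with binary items."""
    return (n_classes - 1) + n_classes * n_binary_items


def information_criteria(log_lik, n_classes, n_binary_items, n_obs):
    """Return AIC and BIC; lower values indicate a better trade-off."""
    k = lca_free_parameters(n_classes, n_binary_items)
    return {
        "AIC": -2.0 * log_lik + 2.0 * k,
        "BIC": -2.0 * log_lik + k * math.log(n_obs),
    }


# A two-class model with four indicators has 1 + 2 * 4 = 9 free parameters.
print(lca_free_parameters(2, 4))
print(information_criteria(-100.0, 1, 4, 100))
```

In practice the candidate models (one class, two classes, and so on) are fitted to the same data and the criteria are compared across them, usually alongside the interpretability of the resulting classes.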
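As an illustration of how weights enter the descriptive results, the sketch below rescales a 2-year NHANES weight for an analysis pooling three 2-year cycles (dividing by the number of cycles, as NHANES guidance describes) and computes a weighted proportion. The function names are hypothetical, and the sketch covers point estimates only: standard errors and p-values additionally require the strata and sampling-unit information, which is not shown.

```python
def six_year_weight(two_year_weight, n_cycles=3):
    """Rescale a 2-year weight when pooling consecutive 2-year cycles."""
    return two_year_weight / n_cycles


def weighted_proportion(flags, weights):
    """Share of the weighted population for which the flag is True."""
    total = sum(weights)
    positive = sum(w for flag, w in zip(flags, weights) if flag)
    return positive / total


# Three respondents; the second represents twice as many people.
obese = [True, False, True]
pooled = [six_year_weight(w) for w in (3000.0, 6000.0, 3000.0)]
print(weighted_proportion(obese, pooled))
```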
In Model 1, LCA was not able to differentiate any asthma subgroup among subjects with current asthma (Additional file 1: Table S1). On the other hand, by adding more asthma-related variables (Model 2), LCA identified a two-class model as the best solution for both age groups (Table 1, Additional file 1: Table S1). Classes A < 40 years (n = 290; 75%) and A ≥ 40 years (n = 494; 73%) had marked predominance of highly symptomatic asthma subjects, with poorer lung function, compared to classes B < 40 years (n = 96; 25%) and B ≥ 40 years (n = 179; 27%), respectively (Table 1). Regarding inflammatory markers, the proportion of patients with high levels of B-Eos and FeNO was not significantly different between classes, both in the younger group (p = 0.99 and p = 0.82, respectively) and in the older group (p = 0.57 and p = 0.53).
Figure 1 shows that the distribution of the hypothesis-driven phenotypes is similar (p > 0.05) in both classes identified by LCA, regardless of age group.
Additionally, LCA identified 2 classes in the models for “ever asthma” and “current asthma” without stratifying by age, but not in the “difficult asthma” sub-analysis, where no subgroup was identified (Additional file 1: Table S1).
This was the first study comparing previously defined hypothesis-driven asthma phenotypes with data-driven ones in a sample representative of the US general population. The proportions of the hypothesis-driven phenotypes were similar between the two data-driven phenotypes obtained by LCA using clinical and physiological variables commonly used to characterize asthma.
Previous studies using data-driven approaches contributed to the definition of clusters/phenotypes based on similarities in clinical and inflammatory biomarkers [9, 12,13,14]. However, these approaches have been scarcely applied to adults with asthma from population-based studies. The studies from Siroux et al. and Mäkikyrö et al. provided further evidence for identifying subgroups of asthma based on clinical markers and questionnaire data commonly available in primary health care or large epidemiological studies and found a larger range of asthma phenotypes.
Our study showed that performing LCA with the variables used to define some of the most common hypothesis-driven asthma phenotypes could not identify subgroups within adults with current asthma from the general population. By including additional clinical and physiological variables commonly used to classify asthma, LCA identified two data-driven phenotypes in the same subjects. Overall, these phenotypes only differed in symptom frequency and lung function parameters. Inflammatory biomarkers, presence of obesity, smoking status, age of asthma onset and self-reported hay fever were not different between classes.
Moreover, using a less stringent asthma definition (ever asthma) and in subjects with poor clinical outcomes (difficult asthma), these variables were also suboptimal to differentiate asthma subgroups.
In contrast to studies with severe asthma patients, our results suggest that, for the general asthma population, the clinical and physiological variables available to classify asthma and commonly used predefined cut-offs seem to be insufficient to identify specific phenotypes. The inclusion in data-driven models of additional easily measurable biomarkers that have already been shown to be helpful in discriminating asthma phenotypes in this population (e.g. serum IgE and/or periostin) [28, 29], combined with comprehensive clinical, physiologic, and/or disease features, might result in the identification of more precise phenotypes. The identification of new, more accurate biomarkers could also improve phenotyping. Furthermore, the use of fixed cut-off values, although common and more intuitive for daily clinical practice, may potentially miss more complex and as yet unidentified phenotypes. The use of absolute values (as seen in other studies [13, 31, 32]), or appropriate reference equations for predicted values [33, 34] could be more adequate.
Similarly, research efforts are being made to integrate clinical characteristics with available biomarkers to identify data-driven asthma phenotypes in children [35, 36]. However, the obtained phenotypes vary on key features that are more pronounced during childhood, including natural history of wheeze over time, suggesting that further work is required to compare data- and hypothesis-driven approaches to identify asthma phenotypes in children.
Limitations inherent to a survey study design must be acknowledged and the self-reported variables may lead to misclassifications and information biases; to limit these biases, we used previously validated definitions [38, 39]. Also, despite including the most commonly used variables for respiratory disease assessment available in the NHANES study, when using the less stringent asthma definition, the differentiation of asthma subgroups was not improved in this population. However, to reduce the risk of poor LCA-class differentiation, we did not include any of the variables used in the asthma groups definition into the LCA models. Finally, LCA modelling should encompass all the domains relevant to the understanding of the disease to classify observations into discrete and mutually exclusive classes, suggesting that the use of predefined cut-offs and the lack of data regarding, for example, objective assessment of atopy, nasal and ocular symptoms (which have proved to be useful in the stratification of allergic respiratory diseases [10, 41]), may have limited the ability to differentiate specific asthma phenotypes using unsupervised analysis.
In conclusion, this brief communication extends our previous work on the need for a broader data analysis combining different asthma-related domains for differentiating phenotypes in the general asthma population. The clinical and physiological variables commonly used to subdivide asthma seem to be insufficient to differentiate specific asthma phenotypes among adults from the general population, irrespective of using data-driven or hypothesis-driven approaches. Further studies based on more comprehensive disease features are required to identify asthma phenotypes with the potential to be useful for clinicians and for population-based research.
- ATS/ERS: American Thoracic Society/European Respiratory Society
- AwCOPD: asthma with concurrent COPD
- AwObesity: asthma with obesity
- BMI: body mass index
- COPD: chronic obstructive pulmonary disease
- FeNO: fraction of exhaled nitric oxide
- FEV1: forced expiratory volume during the first second
- FVC: forced vital capacity
- LCA: latent class analysis
- LLN: lower limit of normal
- NHANES: National Health and Nutrition Examination Survey
Pavord ID, Beasley R, Agusti A, Anderson GP, Bel E, Brusselle G, et al. After asthma: redefining airways diseases. Lancet. 2018;391(10118):350–400.
Pavord ID, Shaw DE, Gibson PG, Taylor DR. Inflammometry to assess airway diseases. Lancet. 2008;372(9643):1017–9.
Wurst KE, Kelly-Reif K, Bushnell GA, Pascoe S, Barnes N. Understanding asthma-chronic obstructive pulmonary disease overlap syndrome. Respir Med. 2016;110:1–11.
Wenzel S. Asthma: defining of the persistent adult phenotypes. Lancet. 2006;368:804–13.
Bousquet J, Anto JM, Sterk PJ, Adcock IM, Chung KF, Roca J, et al. Systems medicine and integrated care to combat chronic noncommunicable diseases. Genome Med. 2011;3(7):43.
Prosperi MCF, Sahiner UM, Belgrave D, Sackesen C, Buchan IE, Simpson A, et al. Challenges in identifying asthma subgroups using unsupervised statistical learning techniques. Am J Respir Crit Care Med. 2013;188(11):1303–12.
Han J, Kamber M, Pei J. Data mining: concepts and techniques. 3rd ed. Waltham: Morgan Kaufmann Publishers; 2012.
Yii ACA, Tay T-R, Choo XN, Koh MSY, Tee AKH, Wang D-Y. Precision medicine in united airways disease: a “treatable traits” approach. Allergy. 2018;73(10):1964–78.
Haldar P, Pavord ID, Shaw DE, Berry MA, Thomas M, Brightling CE, et al. Cluster analysis and clinical asthma phenotypes. Am J Respir Crit Care Med. 2008;178(3):218–24.
Amaral R, Bousquet J, Pereira AM, Araújo LM, Sá-Sousa A, Jacinto T, et al. Disentangling the heterogeneity of allergic respiratory diseases by latent class analysis reveals novel phenotypes. Allergy. 2018. https://doi.org/10.1111/all.13670.
Anto JM, Bousquet J, Akdis M, Auffray C, Keil T, Momas I, et al. Mechanisms of the Development of Allergy (MeDALL): introducing novel concepts in allergy phenotypes. J Allergy Clin Immunol. 2017;139(2):388–99.
Moore WC, Meyers DA, Wenzel SE, Teague WG, Li H, Li X, et al. Identification of asthma phenotypes using cluster analysis in the Severe Asthma Research Program. Am J Respir Crit Care Med. 2010;181(4):315–23.
Wu W, Bleecker E, Moore W, Busse WW, Castro M, Chung KF, et al. Unsupervised phenotyping of Severe Asthma Research Program participants using expanded lung data. J Allergy Clin Immunol. 2014;133(5):1280–8.
Lefaudeux D, De Meulder B, Loza MJ, Peffer N, Rowe A, Baribaud F, et al. U-BIOPRED clinical adult asthma clusters linked to a subset of sputum omics. J Allergy Clin Immunol. 2017;139(6):1797–807.
Amelink M, de Nijs SB, de Groot JC, van Tilburg PMB, van Spiegel PI, Krouwels FH, et al. Three phenotypes of adult-onset asthma. Allergy. 2013;68(5):674–80.
Amaral R, Jacinto T, Pereira A, Almeida R, Fonseca J. A comparison of unsupervised methods based on dichotomous data to identify clusters of airways symptoms: latent class analysis and partitioning around medoids methods. Eur Respir J. 2018. https://doi.org/10.1183/13993003.congress-2018.PA4429.
Evenson KR, Wen F, Howard AG, Herring AH. Applying latent class assignments for accelerometry data to external populations: data from the National Health and Nutrition Examination Survey 2003–2006. Data Br. 2016;9:926–30.
Amaral R, Fonseca JA, Jacinto T, Pereira AM, Malinovschi A, Janson C, et al. Having concomitant asthma phenotypes is common and independently relates to poor lung function in NHANES 2007–2012. Clin Transl Allergy. 2018;8(1):13.
Silkoff PE. ATS/ERS recommendations for standardized procedures for the online and offline measurement of exhaled lower respiratory nitric oxide and nasal nitric oxide, 2005. Am J Respir Crit Care Med. 2005;171(8):912–30.
Miller MR. Standardisation of spirometry. Eur Respir J. 2005;26(2):319–38.
Hankinson JL, Odencrantz JR, Fedan KB. Spirometric reference values from a sample of the general U.S. population. Am J Respir Crit Care Med. 1999;159(1):179–87.
Hankinson JL, Kawut SM, Shahar E, Smith LJ, Stukovsky KH, Barr RG. Performance of American thoracic society-recommended spirometry reference values in a multiethnic sample of adults. Chest. 2010;137(1):138–45.
Stanojevic S, Wade A, Stocks J, Hankinson J, Coates AL, Pan H, et al. Reference ranges for spirometry across all ages. Am J Respir Crit Care Med. 2008;177(3):253–60.
Muthén LK, Muthén BO. Mplus user’s guide. 7th ed. Los Angeles: Muthén & Muthén; 2012.
Specifying weighting parameters. https://www.cdc.gov/nchs/tutorials/nhanes/SurveyDesign/Weighting/intro.htm. Accessed 9 Dec 2018.
Siroux V, Basagaña X, Boudier A, Pin I, Garcia-Aymerich J, Vesin A, et al. Identifying adult asthma phenotypes using a clustering approach. Eur Respir J. 2011;38(2):310–7.
Mäkikyrö EMS, Jaakkola MS, Jaakkola JJK. Subtypes of asthma based on asthma control and severity: a latent class analysis. Respir Res. 2017;18(1):24.
Patelis A, Gunnbjörnsdottir M, Malinovschi A, Matsson P, Önell A, Högman M, et al. Population-based study of multiplexed IgE sensitization in relation to asthma, exhaled nitric oxide, and bronchial responsiveness. J Allergy Clin Immunol. 2012;130(2):397–402.e2.
James A, Janson C, Malinovschi A, Holweg C, Alving K, Ono J, et al. Serum periostin relates to type-2 inflammation and lung function in asthma: data from the large population-based cohort Swedish GA(2)LEN. Allergy. 2017;72(11):1753–60.
Carr TF, Kraft M. Use of biomarkers to identify phenotypes and endotypes of severe asthma. Ann Allergy Asthma Immunol. 2018;121(4):414–20.
Hsiao H-P, Lin M-C, Wu C-C, Wang C-C, Wang T-N. Sex-specific asthma phenotypes, inflammatory patterns, and asthma control in a cluster analysis. J Allergy Clin Immunol Pract. 2019;7(2):556–567.e15.
Sendín-Hernández MP, Ávila-Zarza C, Sanz C, García-Sánchez A, Marcos-Vadillo E, Muñoz-Bellido FJ, et al. Cluster analysis identifies 3 phenotypes within allergic asthma. J Allergy Clin Immunol Pract. 2018;6(3):955–961.e1.
Quanjer P, Stanojevic S. Multi-ethnic reference values for spirometry for the 3–95-yr age range: the global lung function 2012 equations. Eur Respir J. 2012;40:1324–43.
Jacinto T, Amaral R, Malinovschi A, Janson C, Fonseca J, Alving K. Exhaled NO reference limits in a large population-based sample using the Lambda-Mu-Sigma method. J Appl Physiol. 2018;125(5):1620–6.
Depner M, Fuchs O, Genuneit J, Karvonen AM, Hyvärinen A, Kaulek V, et al. Clinical and epidemiologic phenotypes of childhood asthma. Am J Respir Crit Care Med. 2014;189(2):129–38.
Collins SA, Pike KC, Inskip HM, Godfrey KM, Roberts G, Holloway JW, et al. Validation of novel wheeze phenotypes using longitudinal airway function and atopic sensitization data in the first 6 years of life: evidence from the Southampton Women’s survey. Pediatr Pulmonol. 2013;48(7):683–92.
Henderson J, Granell R, Heron J, Sherriff A, Simpson A, Woodcock A, et al. Associations of wheezing phenotypes in the first 6 years of life with atopy, lung function and airway responsiveness in mid-childhood. Thorax. 2008;63(11):974–80.
Sá-Sousa A, Jacinto T, Azevedo LF, Morais-Almeida M, Robalo-Cordeiro C, Bugalho-Almeida A, et al. Operational definitions of asthma in recent epidemiological studies are inconsistent. Clin Transl Allergy. 2014;4:24.
Halldin CN, Doney BC, Hnizdo E. Changes in prevalence of chronic obstructive pulmonary disease and asthma in the US population and associated risk factors. Chron Respir Dis. 2015;12(1):47–60.
Wang J, Wang X. Structural equation modeling: applications using Mplus. West Sussex: Wiley; 2012.
Bousquet J, Devillier P, Anto JM, Bewick M, Haahtela T, Arnavielhe S, et al. Daily allergic multimorbidity in rhinitis using mobile technology: a novel concept of the MASK study. Allergy. 2018;73(8):1622–31.
RA, AMP and JAF contributed to study conception and design, analysis and interpretation of data, writing and revising the article. TJ, AM, CJ and KA contributed to data interpretation, writing and revising the article. All authors read and approved the final manuscript.
This article was supported by FEDER through the operation POCI-01-0145-FEDER-007746 funded by the Programa Operacional Competitividade e Internacionalização – COMPETE2020 and by National Funds through FCT - Fundação para a Ciência e a Tecnologia within CINTESIS, R&D Unit (reference UID/IC/4255/2013).
The authors declare that they have no competing interests.
Availability of data
Data and respective datasets are displayed at the NHANES website: https://www.cdc.gov/nchs/nhanes/Index.htm.
Consent of publication
Ethics approval and consent to participate
The NHANES survey operates under the approval of the National Center for Health Statistics Research Ethics Review Board (Protocols #2005-06 and #2011-17), available at www.cdc.gov/nchs/nhanes/irba98.htm. All the NHANES data meet the conditions described in Research Using Publicly Available Datasets (Secondary Analysis) - Policy #39 - for use without application to an Institutional Review Board. All study participants provided written informed consent.
RA is supported by a Ph.D. grant (grant no. PD/BD/113659/2015), financed by the Fundação para a Ciência e Tecnologia, I.P., PhD program (reference no. PD/0003/2013: Doctoral Program in Clinical and Health Services Research).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Cite this article
Amaral, R., Pereira, A.M., Jacinto, T. et al. Comparison of hypothesis- and data-driven asthma phenotypes in NHANES 2007–2012: the importance of comprehensive data availability. Clin Transl Allergy 9, 17 (2019). https://doi.org/10.1186/s13601-019-0258-7
- Population-based study
- Unsupervised analysis