| Study | Design | Instruments | Population(s) | Salient findings |
| --- | --- | --- | --- | --- |
| Bialocerkowski et al. (2007) | Systematic review | Multiple instruments including AHA, Arm and Hand Function, Brachial Plexus Outcome Measure, PEDI, PODCI, MUUL | Children and adolescents with BPBP | PODCI, PEDI, and AHA have the strongest psychometric properties for BPBP. Additional psychometric studies are needed for more robust measures. |
| Buffart et al. (a) | Repeated measures, psychometric study | AHA, PUFI, ABILHAND | Children, age 4–12, with radial deficiency (RD), types I–IV; unilateral and bilateral | Each of the measures has good psychometric properties for children with RD types I–IV. AHA and PUFI have the strongest correlation with type of RD, hand grip, and global assessment of hand function. No instrument is optimal. |
| Burger et al. (2004) | Cross-sectional, psychometric study | UNB, CAPP-FSIP, CAPP-FSI | Children with limb deficiency | No instrument is optimal. |
| Chang et al. (2013) | Systematic review | Multiple instruments including AHA, ABILHAND, PEDI, PODCI | Children with BPBP | Serious void in validated instruments for population with BPBP. Disparity in use of classification systems and instruments limiting ability to evaluate outcomes. |
| Christakou and Laiou (2013) | Critical review | PODCI and ASK | Children with orthopedic impairment | ASK has the stronger psychometric properties when compared to PODCI. Further psychometric work is needed on both instruments. |
| Gilmore et al. (2010) | Systematic review | Multiple instruments including MUUL, AHA, ABILHAND, QUEST, SHUEE | Children, age 5–16, with hemiplegic cerebral palsy | The best performance measure with strong psychometric properties for bimanual function is AHA. The best performance measure with strong psychometric properties for unilateral hand function is MUUL. ABILHAND has strong psychometric properties and is both a measure of capability and parent report. Multiple measures may be needed based on purpose of use. |
| Harvey et al. (2008)a | Systematic review | Multiple instruments including PODCI, PEDI, ASK | Children with cerebral palsy | ASK and GMFM have the strongest psychometric properties for measuring activity limitation. No one instrument addresses all aspects of the ICF; thus multiple instruments should be selected based on psychometric properties and purpose of measurement. |
| Kleeper (2011) | Critical review | CHA-Q, ASK, PODCI, JAFAS | Children with rheumatic disease | Each of the measures contains activities relevant to children across broad age range. ASK is the only instrument that is child report. PODCI is the most comprehensive. Using two or more of these measures is recommended to obtain the best understanding of child’s functioning. |
| Klingels et al. (2010) | Systematic review | Multiple instruments including QUEST, SHUEE, AHA, MUUL, PEDI, ABILHAND | Children, age 2–18 years, with hemiplegic cerebral palsy | MUUL is recommended for capacity measure. AHA is recommended for performance measure. ABILHAND is recommended for patient-reported outcome instrument. |
| Lindner et al. (2010) | Critical review | Multiple pediatric and adult instruments including PUFI, UBET, CAPP-FSI, CAPP-PSI | Children and adults using upper limb prostheses | Multiple instruments are needed to assess the constructs across the ICF. The majority of pediatric instruments measure activity and participation domains of ICF. Assessment with children requires measures that evaluate social interactions. |
| Mazzone et al. (2012) | Critical review | Multiple instruments including ABILHAND, Self-reported Scale of Activity Limitation, Muscular Dystrophy Functional Rating Scale, Jebsen Test of Hand Function, Upper Limb Functional Ability Test | Children with muscular dystrophy (MD) | Many of the instruments reviewed are psychometrically sound, although not with samples of children with MD. None covers the full ability range of children with MD (ceiling and floor potential). |
| Sakzewski et al. (2007)a | Systematic review | Multiple instruments including COPM, GAS | Children with CP | All instruments measure aspects of childhood participation. No one instrument covers all aspects of participation. Responsiveness of all instruments is unknown. |
| Wagner and Davids (2012) | Systematic review | Multiple instruments including AHA, BB, MUUL, QUEST, SHUEE, PEDI-PODCI, ASK, COPM, GAS | Children with CP | The understanding of psychometric properties will assist clinicians with selecting the most useful instruments based on purpose of measurement. |
| Wright (2009) | Systematic review | Multiple instruments including AHA, PUFI, ABILHAND, CAPP-FSI, PODCI | Adult and pediatric prosthetic users | Comprehensive summary of instruments used with adults and children with limb deficiency. Measures have varying degree of psychometric properties. Further research is needed. |
This chapter provides a description of criterion and norm-referenced instruments, describes psychometric properties associated with outcome instruments, provides an overview of research literature on outcome instruments that have particular relevance to the pediatric UE and are reported on repeatedly in research literature, and discusses computer adaptive testing (CAT) as an emerging measurement technology. Although they may be used to evaluate outcomes, classification systems such as the Manual Ability Classification System (MACS) (Eliasson et al. 2006) for children with CP, the Mallet Classification and Active Movement Scale (AMS) (Bae et al. 2003) for children with BPBP, and the International Standards for Neurological Classification of Spinal Cord Injury (ISNCSCI) (Kirshblum et al. 2011) and Classification of the Upper Extremity in Tetraplegia (Mulcahey and Weiss 2008) for children with spinal cord injury (SCI) are not outcome instruments per se and thus will not be discussed in this chapter. Methods such as imaging and electrodiagnostics as well as physical examination measures for muscle strength, sensibility, joint range of motion, pain, and spasticity are beyond the scope of this chapter but can be found in chapters within this text and in other excellent resources (Platz et al. 2008; Van den Beld et al. 2011; Gajdosik 2005; Mulcahey et al. 2007; Koman et al. 2008).
Considerations in Selection of Outcome Instruments
Generally, there are two types of instruments that differ in how scores are interpreted. Criterion-referenced tests are those for which the test score is interpreted relative to a continuum of possible scores that represents some level of performance (Hinderer and Hinderer 2005; Portney and Watkins 2009). In contrast, norm-referenced tests measure performance that is interpretable in terms of the individual relative to performance of some known group (Hinderer and Hinderer 2005). Developmental motor scales, in which scores are interpreted against “normal” development, are an excellent example of norm-referenced tests. With norm-referenced instruments, the mean of the distribution of the scores from the reference sample is used as the standard and the variability is used to determine how an individual performs relative to the reference sample (Portney and Watkins 2009). Norm-referenced tests are usually used for diagnosis, while criterion-referenced tests are used to examine proficiency of performance along a continuum, for example, a continuum of “cannot do to can do,” and are considered more useful for developing and evaluating rehabilitation outcomes (Portney and Watkins 2009).
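The two interpretations can be illustrated with a short sketch. The reference sample, the child's score, and the score range below are hypothetical, and the function names are illustrative only; published instruments supply their own norms and scoring rules.

```python
from statistics import mean, stdev

def norm_referenced_z(score, reference_scores):
    """Express a score in standard deviation units relative to a reference sample."""
    return (score - mean(reference_scores)) / stdev(reference_scores)

def criterion_referenced_position(score, min_score, max_score):
    """Locate a score on the instrument's own continuum (0 = cannot do, 1 = can do)."""
    return (score - min_score) / (max_score - min_score)

# Hypothetical reference sample and raw score (not taken from any real instrument).
reference_sample = [38, 42, 45, 47, 50, 52, 55, 58, 61]
child_score = 40

z = norm_referenced_z(child_score, reference_sample)
position = criterion_referenced_position(child_score, min_score=0, max_score=80)
print(f"Norm-referenced: z = {z:.2f} relative to the reference sample")
print(f"Criterion-referenced: {position:.0%} of the way along the continuum")
```

The same raw score therefore supports two different statements: how far the child is from the reference group, and how far the child is along the performance continuum.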
In addition to understanding the distinction between criterion- and norm-referenced tests, informed use of an outcome instrument requires an understanding of the psychometric properties of the instrument, particularly when the scores are used for decisions related to treatment and reimbursement. These properties include reliability, validity, and responsiveness. Reliability is concerned with the degree to which an instrument can distinguish differences among persons, despite measurement error. The reliability of scores on individual test items can be determined using the kappa and weighted kappa coefficients for nominal and ordinal level data, respectively (Portney and Watkins 2009). Reliability of total scores is usually determined by the Intraclass Correlation Coefficient (ICC) (Streiner and Norman 1995). Internal consistency reflects the degree to which items within a scale measure a single construct and is usually assessed using Cronbach’s alpha (Portney and Watkins 2007). Agreement, which is a form of reliability, is concerned with how close the scores are on repeated testing in stable conditions (De Vet et al. 2006) and is a fundamental property if an instrument is used to detect change or determine treatment effectiveness. The ICC is recommended for studies of agreement and reproducibility. Interpretation of reliability estimates is not standardized but rather based on the context of the study and instrument (Portney and Watkins 2007; Streiner and Norman 1995). While reliability estimates of 0.7 and 0.9 are recommended for outcome instruments (Fitzpatrick et al. 1998), reliability estimates higher than 0.9 are preferred (Portney and Watkins 2007).
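The reliability statistics named above can be sketched in a few lines. This is a minimal illustration rather than a validated implementation: the function names and all ratings are hypothetical, only the unweighted kappa is shown, and the ICC form used here (two-way random effects, absolute agreement, single measure) is an assumption, since the appropriate form depends on the study design.

```python
from statistics import mean, variance

def cohen_kappa(rater_a, rater_b):
    """Unweighted kappa for two raters scoring the same items on a nominal scale.
    Undefined (division by zero) if both raters use a single category throughout."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

def cronbach_alpha(responses):
    """Internal consistency; responses has one row per respondent, one column per item."""
    k = len(responses[0])
    item_variances = [variance(column) for column in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def icc_2_1(scores):
    """Two-way random-effects, absolute-agreement, single-measure ICC.
    scores has one row per child and one column per rater or test occasion."""
    n, k = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in scores)
    ss_cols = n * sum((mean(col) - grand) ** 2 for col in zip(*scores))
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical data: pass/fail item scores from two raters, a three-item scale
# completed by five children, and total scores from a test and a retest.
print(cohen_kappa([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 0]))
print(cronbach_alpha([[3, 4, 3], [2, 2, 3], [5, 4, 5], [1, 2, 1], [4, 4, 5]]))
print(icc_2_1([[42, 44], [30, 29], [55, 57], [38, 41], [47, 46]]))
```

Weighted kappa, which the text recommends for ordinal items, additionally gives partial credit for near agreement and is not covered by this sketch.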
Measurement validity concerns the extent to which an instrument measures what it is intended to measure. Validity places an emphasis on the objectives of the instrument and the ability to make inferences from the test scores (Portney and Watkins 2007). Face and content validities are qualitative characteristics that indicate the instrument appears to measure what it is intended to measure (face) and that the instrument adequately covers the domain of interest (content). Further evidence of face and content validities can be obtained from an understanding about how the test items were developed and field-tested (Guyatt and Cook 1994); development of test items should include “content experts,” including people who represent the intended responder (i.e., children with upper extremity impairments), and the items and response scale should undergo iterative cognitive testing, as conducted by Dumas et al. (2008) and Mulcahey et al. (2009, 2011). Validity can be quantitatively evaluated by comparing scores of a new instrument to scores of a similar, traditional, or gold standard instrument (criterion validity); by evaluating differences in scores among known groups (discriminant validity); or by evaluating if there are expected associations of scores with scores from instruments measuring similar attributes (convergent validity) or different attributes (divergent validity).
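Convergent and divergent validity are typically examined through correlations between instruments. The sketch below uses a Pearson correlation on hypothetical scores; the variable names and data are illustrative only, and a rank-based coefficient may be more appropriate for ordinal scores.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between paired scores from two instruments."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical scores for six children on a new instrument and two comparators.
new_instrument = [12, 18, 25, 31, 36, 44]
similar_attribute = [15, 20, 24, 35, 38, 47]    # expected to correlate strongly
different_attribute = [7, 3, 9, 4, 8, 5]        # expected to correlate weakly

print(f"Convergent: r = {pearson_r(new_instrument, similar_attribute):.2f}")
print(f"Divergent:  r = {pearson_r(new_instrument, different_attribute):.2f}")
```

A strong correlation with the similar-attribute instrument together with a weak correlation with the different-attribute instrument would be the pattern that supports validity.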
Responsiveness of an instrument addresses the degree to which an instrument is capable of detecting important changes in health status. While there is no consensus on the “best” method to establish responsiveness, common approaches include calculating the effect size, standardized response mean (SRM), and minimally important difference (MID) (Guyatt et al. 2002). Most researchers use Cohen’s interpretation of effect size whereby values of 0.5 reflect a moderate effect and values of 0.8 reflect a large effect (Portney and Watkins 2007). There is ongoing dialogue about how best to interpret meaningful change measured by PRO instruments (McLeod et al. 2011; Wyrwich et al. 2013).
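Two of these indices can be sketched as follows, assuming the common definitions: the effect size divides the mean change by the standard deviation of baseline scores, and the standardized response mean divides it by the standard deviation of the change scores. The scores are hypothetical.

```python
from statistics import mean, stdev

def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)

def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)

# Hypothetical pre- and post-treatment scores for six children.
pre = [20, 25, 31, 28, 35, 22]
post = [26, 27, 38, 30, 44, 25]

print(f"Effect size: {effect_size(pre, post):.2f}")
print(f"SRM:         {standardized_response_mean(pre, post):.2f}")
```

Because the two indices use different denominators, they can lead to different conclusions for the same data, which is one reason no single method is accepted as best.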
Functional Performance Measures
Functional performance measures refer to upper extremity assessments that require actual performance of arm and hand tasks. Usually, these measures are administered by a trained therapist and have procedural guidelines for scoring and interpretation; they can be criterion or norm referenced. There are many upper extremity performance measures (http://www.rehabmeasures.org; http://www.scireproject.com), and while some may be used with children, most have been developed and field-tested using adult clinical samples.
The Jebsen Test of Hand Function (Jebsen et al. 1969) and the Box and Block Test (Mathiowetz et al. 1985) are non-categorical or generic (not disease specific) upper extremity performance measures. The Jebsen Test of Hand Function is a timed test of hand dexterity that was originally established for adults (Jebsen et al. 1969) and subsequently field-tested in children (Taylor et al. 1973). It requires manipulation of objects that reflect everyday tasks and one writing task. Despite its use with children with varying diagnoses (Noronha et al. 1989; Mulcahey et al. 1995; Aliu et al. 2008; Klingels et al. 2013; Netscher et al. 2013; Lee et al. 2013a; Shingade et al. 2014), sound psychometric studies in samples of pediatric populations with upper extremity impairments are lacking (Gilmore et al. 2010). One study (Hiller and Wade 1992) established the discriminative validity of the Jebsen Test of Hand Function in children with Duchenne muscular dystrophy. In studies by Brandao et al. (2013), Shingade et al. (2014), and Lee et al. (2013b), the scores on the Jebsen Test of Hand Function were responsive to pediatric treatment, but others (Mulcahey et al. 1995; Staines et al. 2008; Aliu et al. 2008; Netscher et al. 2013; Noronha et al. 1989) reported limitations to the Jebsen Test of Hand Function when used with children. Bovend’Eerdt et al. (2004) described a modified Jebsen Test of Hand Function in which the number of items was reduced from seven to three items; a review of the literature did not reveal widespread use of the modified Jebsen Test of Hand Function.
The Box and Block Test (Mathiowetz et al. 1985a) is another generic performance measure that evaluates unilateral hand function as assessed by the number of blocks acquired, carried, and released in 1 minute. Although the majority of psychometric studies have been conducted with adults with neurologic and orthopedic impairments (Chen et al. 2009; Desrosiers et al. 1994; Lin et al. 2010; Platz et al. 2008), studies have also been done with children. Jongbloed-Pereboom (2013) established norms for children between 3 and 10 years of age; Mulcahey (2012a) showed that the Box and Block Test had strong discriminant validity in children with BPBP, noting that the scores discriminated among the three primary categories of brachial plexus injuries and were predictive of classification of neurological deficits; and Ekblom et al. (2013) used the instrument with children with limb deficiencies.
The Assisting Hand Assessment (AHA) (Krumlinde-Sundholm and Eliasson 2003) is an upper extremity performance measure that evaluates the use of the assisting hand while performing bimanual play in usual environments. Based on the work by Gordon (2007) and supported by the International Classification of Functioning, Disability and Health (ICF) code assignment to the AHA items (Hoare et al. 2011), the AHA reflects what the child typically does in usual environments and thus may be more responsive to change and better at detecting the effectiveness of treatment on typical activities in daily life. The AHA was developed using the Rasch model of measurement (Krumlinde-Sundholm et al. 2003) and has strong psychometric properties for children with spastic hemiplegia, cerebral palsy, and other orthopedic conditions (Krumlinde-Sundholm et al. 2007; Gordon 2007; Holmefur et al. 2007; Chang et al. 2013; Bialocerkowski et al. 2013). The Mini-AHA (Greaves et al. 2013) has been established for babies with CP between 8 and 18 months of age but has not been subjected to rigorous psychometric testing.
The Melbourne Assessment of Unilateral Upper Limb Function (MUUL) (Randall et al. 1999, 2008), the Quality of Upper Extremity Skills Test (QUEST) (DeMatteo et al. 1993, http://www.canchild.ca/en/measures/resources/1992_quest_manual.pdf), and the Shriners Hospitals Upper Extremity Evaluation (SHUEE) (http://www.greenvilleshrinershospital.org/2012/01/what-is-a-shuee) are performance measures that were developed to evaluate upper extremity function of children, primarily those with cerebral palsy. While they differ in administration and scoring, unlike the AHA, the MUUL, QUEST, and SHUEE are impairment or body structure-level measures (Hoare et al. 2011). All three instruments have strong psychometric properties when used with children with CP, provide important information about upper limb function, and have been used in treatment effectiveness studies (Klingels et al. 2008; Sakzewski et al. 2007; Bard et al. 2009; Randall et al. 2008; Klingels et al. 2010; Lee et al. 2013a; Thorley et al. 2012a; Thorley et al. 2012b; Davidson et al. 2006; Gilmore et al. 2010). Based on a systematic review of psychometric studies (Gilmore et al. 2010), for children with CP and upper limb involvement, the MUUL is recommended for assessment of unilateral performance and, when used with the AHA, is most effective at measuring change in unilateral and bimanual hand function over time or following treatment.
Patient-Reported Outcome Instruments
The use of patient-reported outcome (PRO) instruments has increased over the last several decades. They are now an integral element in outcomes research, including studies under the auspices of the US Food and Drug Administration (FDA) (2009), and longitudinal monitoring of usual care. Similar to the resources that are available on performance measures for the upper limb, there are notable resources on patient-reported outcome instruments (http://www.proqolid.org/about_proqolid; Lai et al. 2012; http://www.nihpromis.org; McPhail et al. 2012; Pencharz et al. 2001; http://www.neuroqol.org). When PRO instruments are used in pediatrics, a particular consideration in their selection involves the use of proxy reports (Magaziner et al. 1988) by parents. There is clear evidence that children as young as 6 years of age can report on their own health (Riley 2004) and that the information provided by children and parents is equally important, although the two often differ in their perspectives on health outcomes (Eiser and Morse 2001; Majnemer et al. 2008; Forrest et al. 2004; Sheffler et al. 2009). Despite the lack of instruments developed and validated for child report, there is overwhelming agreement that both child and parent outcomes should be obtained (Varni et al. 2005; Erhart et al. 2009; Tluczek et al. 2013).
Many PRO instruments that are designed for children evaluate global health-related outcomes and quality of life as opposed to outcomes specific to the upper extremity. As examples, the Pediatric Quality of Life Inventory™ (PedsQL) (Varni et al. 1999), the Child Health Questionnaire (CHQ) (Landgraf et al. 1996), and the Pediatric Evaluation of Disability Inventory (PEDI) (Haley et al. 1992) are child/parent PRO instruments, administered through self-report or interview (PEDI), that capture functional activity associated with fine motor, self-care, school, play, and global health but do not focus on the upper extremity. The Pediatric Outcomes Data Collection Instrument (PODCI) and Activities Scale for Kids (ASK) are two other PRO instruments that also evaluate global health but have subscales or specific items that address the upper limb.
The PODCI (Daltroy et al. 1998) is a 114-item instrument with items focused on upper extremity (UE) function as well as physical function, activity and sports, mobility, pain, and happiness; it also has a satisfaction (with treatment) domain and normative values for comparison. The UE items focus on the difficulty encountered in completing self-care and school activities. As an example, the adolescent self-report version of the PODCI contains items such as “during the last week, was it easy or hard for you to comb your hair, use spoon or fork, and lift books.” The PODCI has undergone psychometric testing with a variety of clinical samples including children with chronic upper extremity conditions (Amor et al. 2011; Matsumoto et al. 2011; Lee et al. 2010; Nath et al. 2011; Dedini et al. 2008; Huffman et al. 2005) and in healthy children with isolated orthopedic injuries (Kunkel et al. 2011).
The Activities Scale for Kids (ASK) (Young et al. 2000; Plint et al. 2003), like the PODCI, was developed for children with musculoskeletal conditions and has items that address multiple domains of physical functioning. There are far fewer items on the ASK (n = 30) compared to the PODCI (n = 114), suggesting less burden for the child responder. Like the PODCI, the ASK has been used in treatment effectiveness studies involving children with chronic conditions as well as children without chronic conditions who are being treated for isolated orthopedic impairments (von Keyserlingk et al.; Wai et al. 2005; Rabinovich et al. 2005; Wright et al. 2008; Boutis et al. 2010). The UE items of the ASK address the ability to perform self-care and play. Both the PODCI and ASK have strengths and limitations (Christakou and Laiou 2013) that should be considered in the context of evaluating outcomes associated with upper extremity function.
Perhaps the most widely used UE PRO is the Disabilities of the Arm, Shoulder and Hand (DASH) Outcome Measure (www.dash.iwh.on.ca/home). The DASH is a 30-item questionnaire designed to measure physical function and symptoms in patients with any or several musculoskeletal disorders of the upper limb. The DASH Outcome Measure contains two optional, four-item modules intended to measure symptoms and function in athletes, performing artists, and other workers whose jobs require a high degree of physical performance; these modules likely have little relevance to younger children. Because these patients may have difficulties only at high performance levels (which are beyond the scope of the 30-item DASH Outcome Measure), clinicians may find the modules, which are scored separately from the DASH, useful in assessing them. The QuickDASH is a shortened version of the DASH (11 items) and, despite questions about its dimensionality (Gabel et al. 2009), it has good reliability and internal consistency in older children and adolescents (Quatman-Yates et al. 2013). Further pediatric psychometric testing of the DASH and QuickDASH is needed to establish validity, reliability, and children’s ability to read and understand the items for self-report.
There are also disease-specific PRO instruments for children. The ABILHAND-Kids questionnaire is a 21-item measure of manual ability developed for and field-tested in children between 6 and 16 years of age with CP (Arnould et al. 2004). The strengths of the ABILHAND include its construction using the Rasch measurement model (Penta et al. 1998), its linkage with the adult ABILHAND that allows for assessment of hand function across the pediatric-adult continuum using a common instrument (Vandervelde et al. 2012), and its use in randomized clinical trials (Aarts et al. 2010; Klingels et al. 2013; Sgandurra et al. 2013). Although the ABILHAND-Kids was developed for children with CP and despite the lack of psychometric studies with other clinical populations, the ABILHAND-Kids questionnaire has been used with children with arthrogryposis (Foy et al. 2013), brachial plexus birth palsy (Spaargaren et al. 2011), limb deficiencies (Buffart et al. 2007b), and muscular dystrophy (Kumar and Phillips 2013).
The Prosthetic Upper Extremity Functional Index (PUFI) (Wright et al. 2001, 2003) and the Child Amputee Prosthetics Project-Functional Status Inventory (CAPP-FSI) are PRO instruments developed for children with limb deficiency. The PUFI has a version for children between 3 and 6 years old (26 items) and a version for children older than 6 years (38 items); there are 14 common items (Buffart et al. 2006; van Dijk-Koot et al. 2009), presumably for linking. The scoring method for the PUFI is somewhat complex. Responses are scored on three scales: method of performance, using a 6-point scale, and ease of performance and usefulness of the prosthesis, both using 3-point scales (Buffart et al. 2006). Concurrent validity has been established with other measures and with parent response anchors (Buffart et al. 2006), but discriminative validity is poor (Buffart et al. 2006) and the validity of summed scores for non-prosthetic users has been questioned (van Dijk-Koot et al. 2009). Buffart et al. (2008) used the PUFI and AHA in combination to capture outcomes of hand/arm function (PUFI) and activity performance (AHA) and found a relationship between outcomes of the PUFI and functional performance as defined by the AHA.
Individual Patient-Reported Outcomes
The Canadian Occupational Performance Measure (COPM) (Law et al. 1990) and Goal Attainment Scaling (Bovend’Eerdt et al. 2009) differ from other standardized PRO instruments in the individualized approach they use to establish goals. The COPM capitalizes on the semi-structured interviews that reflect usual interaction between an occupational therapist, who approaches practice from a client-centered framework, and his/her clients. Through semi-structured interviews, parents and children identify performance activities that are perceived as important by the parent, child, and/or society (e.g., activities that a child is expected to perform); performance is rated on a scale between 0 (cannot do) and 10 (can do very well) and the activities are used to establish goals for treatment. Change scores between baseline and reassessment are calculated to evaluate outcomes of treatment; although individuals may differ in their idea of what constitutes meaningful change, research suggests that a change of two or more points reflects meaningful change (http://www.rehabmeasures.org). Psychometric properties of the COPM have been well established in adult and pediatric clinical samples (Cup et al. 2003; Dedding et al. 2004; Eyssen et al. 2011; Cusick et al. 2006). Although it was adapted for very young children (Cusick et al. 2007), the COPM focuses on assessment of performance in self-care, productivity, and leisure (Law et al. 1990). The COPM has been used in pediatric studies (Carswell et al. 2004; McColl et al. 2005), several of which demonstrated its responsiveness to change (Mulcahey et al. 1995; Davis et al. 1999; Pollock et al. 2013; Brandao et al. 2013).
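The change-score calculation described above can be sketched as follows. The activities and ratings are hypothetical, and averaging the ratings across activities before taking the difference is an assumption of this sketch; the two-point threshold is the one cited in the text.

```python
from statistics import mean

MEANINGFUL_CHANGE = 2  # points; threshold cited in the text

def copm_change(baseline, reassessment):
    """Mean performance rating at reassessment minus mean rating at baseline.
    Both arguments map each identified activity to its rating."""
    return mean(reassessment.values()) - mean(baseline.values())

def is_meaningful(change):
    """True when the change score meets the two-point threshold."""
    return change >= MEANINGFUL_CHANGE

# Hypothetical performance ratings for three activities chosen by a child and parent.
baseline = {"button shirt": 3, "open lunch box": 4, "tie shoes": 2}
reassessment = {"button shirt": 6, "open lunch box": 6, "tie shoes": 5}

change = copm_change(baseline, reassessment)
print(f"Change score: {change:.1f}; meaningful: {is_meaningful(change)}")
```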
Goal Attainment Scaling (GAS) is a technique for evaluating individual progress toward patient-defined goals that involves a sequential process that sets goals of treatment, assigns a weight for each goal based on the importance or priority of the patient, establishes a continuum of possible outcomes, assesses baseline function, provides intervention for a specified period of time, evaluates performance on each goal using specified possible outcomes, and evaluates the extent of goal attainment (Kiresuk et al. 1994). Similar to the COPM, one of the strengths of GAS is the ability to evaluate individualized change (Mailloux et al. 2007). However, unlike the COPM, Kiresuk et al. (1994) have demonstrated that GAS scores from multiple patients can be aggregated and compared. Whereas the COPM focuses on occupational performance within the domains of productivity, self-care, and leisure, an advantage of GAS is that goals can be established across the International Classification of Functioning, Disability and Health (ICF) domains. Multiple upper extremity studies have used the GAS to evaluate treatment outcomes (Ten Berge et al. 2012; Wesdock et al. 2008; Lowe et al. 2007; Steenbeek et al. 2007) and showed that it was more responsive than two widely used standardized pediatric instruments (Steenbeek et al. 2010).
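Aggregation of weighted goals is commonly done with the GAS T-score. The chapter does not spell out the formula; the sketch below assumes the widely used form, in which attainment on each goal is scored from −2 to +2 and the expected correlation among goals is set to the conventional value of 0.3. The goals and weights are hypothetical.

```python
from math import sqrt

def gas_t_score(goals, rho=0.3):
    """Summary T-score for a set of goals.
    goals: list of (weight, attainment) pairs with attainment from -2 to +2.
    rho: assumed correlation among goal scores (0.3 by convention)."""
    weighted_sum = sum(weight * attainment for weight, attainment in goals)
    sum_of_squared_weights = sum(weight ** 2 for weight, _ in goals)
    squared_sum_of_weights = sum(weight for weight, _ in goals) ** 2
    denominator = sqrt((1 - rho) * sum_of_squared_weights + rho * squared_sum_of_weights)
    return 50 + 10 * weighted_sum / denominator

# Hypothetical goals: (weight, attainment) before and after treatment.
before = [(3, -2), (2, -1), (1, -2)]
after = [(3, 0), (2, 1), (1, -1)]

print(f"Baseline T-score:     {gas_t_score(before):.1f}")
print(f"Reassessment T-score: {gas_t_score(after):.1f}")
```

A T-score of 50 corresponds to goals being met exactly as expected, which is what allows scores from patients with entirely different goals to be placed on a common scale.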
Cusick and colleagues (2006) examined the utility of the COPM and GAS as outcome measures for pediatric rehabilitation; children with CP received occupational therapy alone or occupational therapy and botulinum toxin A injection. Cusick et al. (2006) found that both the COPM and GAS had utility as outcome measures, with the GAS associated with more sensitivity to treatment and the COPM associated with less burden. While Steenbeek et al. (2010) showed good inter-rater reliability of GAS, Bovend’Eerdt et al. (2011) found it to be poor and suggested further work on improving reproducibility of GAS scoring prior to use in research and clinical trials. Table 2 provides an example of GAS for a child with a cervical spinal cord injury who had upper extremity reconstructive surgery for restoring grasp and pinch.
Table 2
Example item on GAS with a 16-year-old boy with C6 spinal cord injury who had the brachioradialis transferred to the flexor pollicis longus for pinch and the radial wrist extensor to the finger flexors for grasp. The “concern” (first column) drives the development of the goal (second column). The uniqueness of GAS is the description of the scale (columns 3–6) that is developed collaboratively among the physician, therapist, child, and parent. The score of “0” indicates that the goal was achieved; scores of −1 and −2 indicate that the goal was not achieved but provide a mechanism to evaluate progress (score of −1). The scores of +1 and +2 indicate that the child exceeded the expected and agreed-upon goal (score 0). Pre- and post-tendon transfer scores are shown in columns 8 and 9, respectively. TT = tendon transfer