
Intraday reliability, sensitivity, and minimum detectable change of national physical fitness measurement for preschool children in China.

Hua Fang, Indy Man Kit Ho.

Abstract

The China General Administration of Sport has published and administered the National Physical Fitness Measurement (NPFM, preschool children version) since 2000. However, studies on its intraday reliability, sensitivity, and minimum detectable change (MDC) are lacking. This study aimed to investigate and compare the reliability, sensitivity, and MDC values of NPFM in preschool children between the ages of 3.5 and 6 years. Six NPFM items, namely the 10-m shuttle run, standing long jump, balance beam walking, sit-and-reach, tennis throwing, and double-leg timed hop, were tested in the morning on 209 Chinese kindergarten children in Beijing. Intraday relative reliability was assessed using the intraclass correlation coefficient (ICC3,1) with a 95% confidence interval, while absolute reliability was expressed as the standard error of measurement (SEM) and the percentage coefficient of variation (CV%). Test sensitivity was assessed by comparing the smallest worthwhile change (SWC) with the SEM, and MDC values with a 95% confidence interval (MDC95) were established. Most measurements demonstrated good to excellent relative reliability (ICC3,1: 0.77 to 0.97), except the 10-m shuttle run test (ICC3,1: 0.56 to 0.74 [moderate]) in the 3.5- to 5.5-year-old groups, the balance beam walking test in the 4- and 5-year-old (ICC3,1: 0.33 to 0.35 [poor]) and 5.5-year-old (ICC3,1 = 0.68 [moderate]) groups, and the double-leg timed hop test (ICC3,1 = 0.67 [moderate]) in the 4.5-year-old group. The balance beam walking test showed poor absolute reliability in all groups (SEM%: 11.76 to 22.28 and CV%: 15.40 to 24.78). Both the standing long jump and sit-and-reach tests demonstrated good sensitivity (SWC > SEM) in the all-subjects, boys, and girls groups. Pairwise comparison revealed systematic bias, with significantly better performance in the second trial (p < 0.01) of all the tests and moderate to large effect sizes.


Year: 2020 | PMID: 33216780 | PMCID: PMC7678998 | DOI: 10.1371/journal.pone.0242369

Source DB: PubMed | Journal: PLoS One | ISSN: 1932-6203 | Impact factor: 3.240


Introduction

Evaluation of physical fitness level is vital for recognizing health conditions and predicting the risk of chronic diseases in populations [1-3]. Therefore, many countries have developed and adopted batteries of national fitness tests covering health-related fitness components, such as muscular strength, flexibility, cardiorespiratory endurance, and body composition [4-6]. Similarly, for preschool children, a comprehensive battery or protocol of physical fitness tests is essential to monitor trends and the severity of obesity and to determine the adequacy of physical activity [7]. Therefore, the China General Administration of Sport has published and administered the National Physical Fitness Measurement (NPFM) since 2000, and its preschool children version was developed concurrently with six assessment items, namely, the 10-m shuttle run (SRT), standing long jump (SLJ), balance beam walking, sit-and-reach, tennis throwing (TT), and double-leg timed hop (DTH) tests [8]. NPFM is a longitudinal study promoted by the Chinese government to observe health and fitness conditions in large population samples. The test results can be compared with findings from preschoolers of similar ages in other countries. Likewise, the government uses children's test results to understand the variation of physical fitness competence among cities, evaluate the outcomes and performance of the “national fitness program” being promoted to Chinese citizens, and provide scientific evidence for updating the program with justifications and rationales. Apart from determining fitness level, NPFM for preschoolers can also be used to identify motor performance, screen underdeveloped children for further evaluation, and enhance exercise motivation. In this regard, a battery of reliable and useful testing items is crucial to provide meaningful results for further analyses. Therefore, reliable and valid measurements with sufficient sensitivity are vital.
Previous studies showed excellent reliability of the FITness testing in PREschool children (PREFIT) battery in Spain using the Bland–Altman method, the intraclass correlation coefficient (ICC), and the comparison of mean differences [6, 9]. Meanwhile, the systematic review by Ortega et al. [4] reported that the 4 × 10 m shuttle-run test provides reliable measures of speed- and agility-related fitness for preschoolers aged 4 to 5 years (ICC: 0.52 to 0.92) and that the one-leg-stance test is a popular and reliable test for assessing the balance of preschool children (ICC: 0.73 to 0.99). In addition, the standing long jump test used in 4- and 5-year-old preschool children showed acceptable relative reliability (ICC: 0.65 to 0.89). Regarding studies using the Chinese NPFM, the levels of physical fitness and activity of preschool children in Shanghai were reported recently [7, 10]. However, investigations on the reliability, sensitivity, and minimum detectable change (MDC) values of the NPFM testing items are lacking. Preschool children undergo rapid development in motor skills and physical fitness [11, 12], and Latorre Román et al. [5] demonstrated remarkable variation in the physical fitness of preschool children of different ages and large within-group variance of performance. Given such immature motor development and unstable motor performance, the reliability and sensitivity of the NPFM test battery are likewise expected to vary across Chinese preschool children of different ages. This study therefore aimed to investigate and compare the reliability, sensitivity, and MDC values of NPFM in preschool children between the ages of 3.5 and 6 years.

Materials and methods

Subjects

This study was approved by the institutional review board of Beijing Sports University and conducted according to the Declaration of Helsinki, strictly following the protocol of NPFM (preschool children version) published by the government of China [8]. Two hundred and nine Chinese kindergarten children (111 boys and 98 girls) were recruited on a voluntary basis. Anthropometric data, such as age, body height, and body mass, of the different genders and age groups are listed in Table 1. Subjects were further divided into the following subgroups according to their chronological age: 3.5 ≤ age < 4 (n = 31), 4 ≤ age < 4.5 (n = 22), 4.5 ≤ age < 5 (n = 43), 5 ≤ age < 5.5 (n = 24), 5.5 ≤ age < 6 (n = 45), and age ≥ 6 (n = 44) years. The classification system was based on the principles and instructions of NPFM [8]. Three-year-old preschool children were not included because the tests were conducted in the second semester of their academic year. Therefore, the youngest group in this study was composed of 3.5-year-old children, while the oldest group comprised students above 6 years old. Informed written consent covering the experimental procedures, potential benefits, and explained risks was obtained from each child’s parents. Any subject with a diagnosed illness or identified deformity that could potentially limit the completion of NPFM was excluded to enhance testing accuracy and minimize the risk of injuries.
Table 1

Anthropometric data of different genders and age groups.

Group | Age (year), mean±SD | Height (cm), mean±SD | Weight (kg), mean±SD
All subjects (n = 209) | 5.14±0.88 | 112.61±7.63 | 20.53±4.18
Boys (n = 111) | 5.13±0.91 | 113.09±7.99 | 20.81±4.45
Girls (n = 98) | 5.16±0.86 | 112.06±7.21 | 20.21±3.84
3.5-year-old (n = 31) | 3.76±0.12 | 102.30±3.19 | 16.47±1.32
4-year-old (n = 22) | 4.22±0.12 | 106.17±4.04 | 18.66±3.66
4.5-year-old (n = 43) | 4.76±0.16 | 110.03±4.19 | 19.25±2.76
5-year-old (n = 24) | 5.24±0.14 | 112.00±4.01 | 20.12±3.39
5.5-year-old (n = 45) | 5.76±0.14 | 118.51±4.88 | 23.32±4.51
6-year-old (n = 44) | 6.28±0.19 | 119.89±4.62 | 22.94±3.57

Procedures

NPFM was conducted by trained research assistants on a synthetic rubber surface at the outdoor playground of a kindergarten school in Beijing in the morning. Subjects performed six mandatory testing items in randomized order. According to the current NPFM guidelines [8], no previous familiarization session was given. After providing verbal instructions and demonstrations, each subject performed two trials for each measurement item with at least one minute of rest in between while all the tests were conducted by the same rater.

Double-leg timed hop test

Ten rectangular soft blocks (10 cm [length] × 5 cm [width] × 5 cm [height]) were placed in a straight line, 50 cm apart from each other, and used as barriers. Prior to the start of the DTH, the subjects' posture and position were standardized: standing with their feet together 20 cm behind the first block. Subjects were required to jump over all the barriers as fast as possible after the start signal was given. The time to complete jumping over all the blocks was recorded, and any trial in which a foot stepped or kicked on a barrier was regarded as a failure. Subjects had to redo failed trials. The test results were measured in seconds [8].

Standing long jump test

Subjects stood behind the starting line in the ready position and were instructed to jump as far as possible, swinging their arms and landing with both feet, for the SLJ test. The distance was recorded in centimeters using a tape measure from the starting line to the heel of the rear landing foot [8].

Tennis throwing test

Subjects stood behind the starting line and threw a tennis ball forward as far as possible for the TT test. Any trial in which a foot stepped on or over the starting line during or after throwing was regarded as a failure, and the failed attempt had to be redone. The results were measured in meters from the starting line to the first landing point of the ball [8].

10-m shuttle run test

An object with similar height to the majority of subjects was set at a distance of 10 m from the starting line to ensure minimum change of running posture. Each subject was instructed to reach out an arm and touch the object before turning. Subjects were required to run at full speed after the “action” signal was given, touch the target object, and run back to the starting line as fast as possible, with the results recorded in seconds [8].

Balance beam walking test

Subjects were required to walk along a 3 m-long, 10 cm-wide, and 30 cm-high balance beam as fast as possible with their arms kept at a 90° abduction position. The completion time was recorded in seconds. If a subject fell off the beam during walking, the trial was regarded as a failure and a make-up trial was required [8].

Sit-and-reach test

Subjects sat on the ground with bare feet together and knees straight. Before starting, the soles of their feet pressed against the edge of the sit-and-reach box, and this contact position was regarded as the zero point. Subjects were required to bend their trunks forward and push the movable marker of the scale plate with their fingertips as far as possible without bending their knees. The distance from the zero point to the place where the marker stopped was recorded in centimeters. Trials in which the marker stopped before passing the zero point were recorded as negative values [8].

Statistical analyses

The results were presented as mean and standard deviation (SD), while the intraday relative reliability was tested using the intraclass correlation coefficient with a two-way mixed-effects model and single measurement (ICC3,1) with a 95% confidence interval (95% CI) using SPSS 24.0 for Windows (SPSS Inc.; Chicago, IL). ICC values of less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and larger than 0.90 are regarded as poor, moderate, good, and excellent relative reliability, respectively [13]. Meanwhile, the standard error of measurement (SEM), or typical error according to Hopkins (2000) [14], and the MDC with 95% CI (MDC95) were obtained using the formulas SEM = SDdiff/√2 and MDC95 = 1.96 × √2 × SEM, where SDdiff is the standard deviation of the difference between trials [15]. SEM% is the SEM expressed as a percentage of the mean of the cumulative test–retest scores [16]. The coefficient of variation expressed as a percentage of the mean score of individuals (CV%), together with SEM%, was calculated to indicate absolute reliability [17]; SEM% and CV% below 10% were deemed acceptable [16, 17]. The smallest worthwhile change (SWC) was calculated as 0.2 × SD, where SD represents the between-subject standard deviation of the best trial, to further verify the usefulness of each test. Test sensitivity was assessed by comparing the SWC and SEM: SEM below SWC indicates “good” sensitivity, SEM similar to SWC is rated “satisfactory,” and SEM higher than SWC is deemed “marginal” sensitivity [18, 19]. Paired sample t-tests were used to determine significant differences between trials and confirm the existence of systematic bias. Effect size (Cohen’s d) further provided the magnitude of the difference; the significance level for all statistical tests was set at p < 0.05, and heteroscedasticity was examined.
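To make the computations above concrete, the absolute-reliability and sensitivity indices can be sketched in a few lines of Python (a hypothetical illustration, not the authors' code; the study used SPSS):

```python
import math

def reliability_metrics(trial1, trial2):
    """SEM, SEM%, SWC, and MDC95 for a two-trial test, following the formulas
    in this section: SEM = SDdiff / sqrt(2), MDC95 = 1.96 * sqrt(2) * SEM,
    SWC = 0.2 * between-subject SD of the best trial."""
    n = len(trial1)
    diffs = [b - a for a, b in zip(trial1, trial2)]
    mean_diff = sum(diffs) / n
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    sem = sd_diff / math.sqrt(2)                        # typical error (Hopkins, 2000)
    mdc95 = 1.96 * math.sqrt(2) * sem                   # minimum detectable change (95% CI)
    grand_mean = (sum(trial1) + sum(trial2)) / (2 * n)
    sem_pct = 100 * sem / grand_mean                    # SEM as % of mean test-retest score
    best = [max(a, b) for a, b in zip(trial1, trial2)]  # assumes higher = better (e.g., SLJ)
    mean_best = sum(best) / n
    sd_best = math.sqrt(sum((x - mean_best) ** 2 for x in best) / (n - 1))
    swc = 0.2 * sd_best                                 # smallest worthwhile change
    return sem, sem_pct, swc, mdc95

def sensitivity_rating(sem, swc, tol=0.05):
    """'good' if SEM < SWC, 'marginal' if SEM > SWC, 'satisfactory' if similar."""
    if sem < swc * (1 - tol):
        return "good"
    if sem > swc * (1 + tol):
        return "marginal"
    return "satisfactory"
```

For timed items, where lower is better, the "best trial" would use `min(a, b)` instead; the 5% tolerance used to call SEM and SWC "similar" is an assumption, as the paper does not quantify "similar."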

Results

Reliability and sensitivity analysis for all subjects, boys, and girls

Heteroscedasticity was nonsignificant in all the groups (p: 0.11 to 0.98). Table 2 shows good to excellent ICCs (0.77 to 0.97) for all the measurements in the all-subjects, boys, and girls groups. However, the balance beam walking test demonstrated poor absolute reliability for all subjects (SEM% = 18.05 and CV% = 20.43), boys (SEM% = 17.96 and CV% = 20.47), and girls (SEM% = 18.10 and CV% = 20.38). The MDC95 values of the balance beam walking test for the all-subjects, boys, and girls groups were 4.09, 3.99, and 4.18 s, respectively, representing the minimum changes that exceed random measurement error at a 95% confidence level.
Table 2

ICC, CV%, SEM, SWC, and MDC95 and classification of sensitivity of all subjects, boys, and girls.

Group | Testing Item | Trial 1 (mean±SD) | Trial 2 (mean±SD) | ICC (95% CI) | CV% | SEM (SEM%) | SWC | MDC95 | Sensitivity
All subjects (n = 209) | 10-m SRT (s) | 8.29±1.38 | 7.77±1.26 | 0.84 (0.44–0.93) | 6.27 | 0.42 (5.19) | 0.25 | 1.16 | Marginal
 | SLJ (cm) | 89.26±22.82 | 93.11±22.85 | 0.96 (0.87–0.98) | 5.04 | 3.81 (4.18) | 4.54 | 10.56 | Good
 | TT (m) | 4.64±1.93 | 5.07±1.89 | 0.94 (0.75–0.98) | 10.51 | 0.36 (7.47) | 0.38 | 1.01 | Satisfactory
 | DTH (s) | 7.32±2.38 | 6.62±2.15 | 0.90 (0.56–0.96) | 8.79 | 0.53 (7.63) | 0.42 | 1.47 | Marginal
 | Sit-and-reach (cm) | 8.04±4.47 | 9.43±4.50 | 0.94 (0.19–0.98) | 18.26 | 0.63 (7.17) | 0.90 | 1.74 | Good
 | Balance beam walking (s) | 8.88±4.39 | 7.45±4.26 | 0.84 (0.60–0.92) | 20.43 | 1.47 (18.05) | 0.80 | 4.09 | Marginal
Boys (n = 111) | 10-m SRT (s) | 8.14±1.18 | 7.64±1.16 | 0.80 (0.40–0.91) | 6.40 | 0.42 (5.32) | 0.22 | 1.16 | Marginal
 | SLJ (cm) | 91.43±23.53 | 95.38±23.63 | 0.96 (0.87–0.98) | 5.06 | 3.94 (4.22) | 4.68 | 10.93 | Good
 | TT (m) | 5.19±2.00 | 5.58±1.96 | 0.96 (0.78–0.99) | 7.78 | 0.30 (5.56) | 0.40 | 0.83 | Good
 | DTH (s) | 7.30±2.43 | 6.67±2.24 | 0.92 (0.63–0.97) | 8.20 | 0.50 (7.13) | 0.44 | 1.38 | Marginal
 | Sit-and-reach (cm) | 6.01±3.56 | 7.75±3.84 | 0.87 (−0.01–0.97) | 28.32 | 0.68 (9.89) | 0.77 | 1.89 | Good
 | Balance beam walking (s) | 8.82±3.73 | 7.23±3.54 | 0.77 (0.40–0.89) | 20.47 | 1.44 (17.96) | 0.65 | 3.99 | Marginal
Girls (n = 98) | 10-m SRT (s) | 8.46±1.57 | 7.93±1.36 | 0.86 (0.45–0.95) | 6.11 | 0.41 (5.06) | 0.27 | 1.15 | Marginal
 | SLJ (cm) | 86.80±21.86 | 90.53±21.76 | 0.96 (0.86–0.98) | 5.02 | 3.67 (4.14) | 4.33 | 10.18 | Good
 | TT (m) | 4.03±1.64 | 4.51±1.63 | 0.90 (0.63–0.96) | 13.60 | 0.42 (9.90) | 0.33 | 1.71 | Marginal
 | DTH (s) | 7.33±2.32 | 6.57±2.06 | 0.88 (0.47–0.95) | 9.45 | 0.57 (8.16) | 0.40 | 1.57 | Marginal
 | Sit-and-reach (cm) | 10.35±4.29 | 11.33±4.46 | 0.97 (0.28–0.99) | 6.86 | 0.41 (3.75) | 0.89 | 1.13 | Good
 | Balance beam walking (s) | 8.95±5.06 | 7.70±4.96 | 0.88 (0.74–0.94) | 20.38 | 1.51 (18.10) | 0.95 | 4.18 | Marginal

Abbreviations: SRT, Shuttle Run Test; SLJ, Standing Long Jump; TT, Tennis Throwing; DTH, Double-leg Timed Hop; s, second; cm, centimeter; m, meter; SD, Standard Deviation; ICC, Intraclass Correlation Coefficient; CV%, Percentage of Coefficient of Variation; CI, Confidence Interval; SEM, Standard Error of Measurement; SWC, Smallest Worthwhile Change; MDC95, Minimum Detectable Change in 95% CI.

SLJ demonstrated good sensitivity in the groups of all subjects (SWC = 4.54 > SEM = 3.81), boys (SWC = 4.68 > SEM = 3.94), and girls (SWC = 4.33 > SEM = 3.67). Similarly, the sit-and-reach test showed good sensitivity in the groups of all subjects (SWC = 0.90 > SEM = 0.63), boys (SWC = 0.77 > SEM = 0.68), and girls (SWC = 0.89 > SEM = 0.41). Only the boys group (SWC = 0.40 > SEM = 0.30) exhibited good sensitivity in the TT test, while satisfactory sensitivity was observed in all subjects (SWC = 0.38 ≈ SEM = 0.36).

Reliability and sensitivity analysis for different age groups

Intraday reliability in ICC, CV%, SEM, SWC, and MDC95 data and classification of sensitivity in 3.5-, 4-, 4.5-, 5-, 5.5-, and 6-year-old subjects are presented in Table 3. The majority of measurements showed good to excellent relative reliability (ICC: 0.79 to 0.95), except the 10-m SRT (ICC: 0.67 to 0.73 [moderate]) in three groups (3.5-, 4-, and 5-year-old subjects), balance beam test (ICC: 0.33 to 0.68 [poor to moderate]) in 4-, 5-, and 5.5-year-old subjects, and DTH (ICC = 0.67 [moderate]) in 4.5-year-old subjects. However, according to SEM% and CV% values, the balance beam walking test demonstrated poor absolute reliability (SEM%: 11.25 to 22.28 and CV%: 15.40 to 24.78) for all the age groups.
Table 3

ICC, CV%, SEM, SWC, and MDC95 and classification of sensitivity in 3.5-, 4-, 4.5-, 5-, 5.5-, and 6-year-old subjects.

Group | Testing Item | Trial 1 (mean±SD) | Trial 2 (mean±SD) | ICC (95% CI) | CV% | SEM (SEM%) | SWC | MDC95 | Sensitivity
3.5-year-old | 10-m SRT (s) | 9.85±1.35 | 9.09±1.26 | 0.67 (0.15–0.86) | 7.79 | 0.61 (6.47) | 0.24 | 1.70 | Marginal
 | SLJ (cm) | 65.26±18.76 | 68.45±18.01 | 0.93 (0.83–0.97) | 7.04 | 4.39 (6.57) | 3.61 | 12.18 | Marginal
 | TT (m) | 2.79±1.18 | 3.32±1.12 | 0.80 (0.25–0.93) | 17.81 | 0.39 (12.63) | 0.23 | 1.07 | Marginal
 | DTH (s) | 9.73±2.71 | 8.54±2.43 | 0.85 (0.12–0.96) | 10.73 | 0.66 (7.22) | 0.49 | 1.83 | Marginal
 | Sit-and-reach (cm) | 8.50±3.35 | 9.74±3.52 | 0.88 (0.39–0.96) | 9.83 | 0.87 (9.59) | 0.70 | 2.42 | Marginal
 | Balance beam walking (s) | 13.23±5.90 | 11.32±6.46 | 0.83 (0.59–0.92) | 23.87 | 2.30 (18.78) | 1.23 | 6.39 | Marginal
4-year-old | 10-m SRT (s) | 8.93±1.02 | 8.13±0.97 | 0.68 (−0.08–0.91) | 7.18 | 0.33 (3.84) | 0.19 | 0.91 | Marginal
 | SLJ (cm) | 65.95±18.97 | 71.14±17.83 | 0.91 (0.60–0.97) | 8.06 | 4.43 (6.47) | 3.70 | 12.29 | Marginal
 | TT (m) | 3.61±1.41 | 4.11±1.45 | 0.91 (0.26–0.98) | 11.90 | 0.28 (7.35) | 0.29 | 0.79 | Satisfactory
 | DTH (s) | 9.64±2.66 | 8.93±2.58 | 0.88 (0.65–0.95) | 10.05 | 0.80 (8.65) | 0.50 | 2.22 | Marginal
 | Sit-and-reach (cm) | 6.16±2.41 | 7.25±2.81 | 0.87 (0.15–0.97) | 11.63 | 0.61 (9.09) | 0.56 | 1.69 | Marginal
 | Balance beam walking (s) | 10.86±2.90 | 9.60±2.78 | 0.33 (−0.06–0.65) | 22.73 | 2.28 (22.28) | 0.46 | 6.32 | Marginal
4.5-year-old | 10-m SRT (s) | 8.87±0.97 | 8.39±0.95 | 0.73 (0.29–0.88) | 5.88 | 0.42 (4.81) | 0.19 | 1.15 | Marginal
 | SLJ (cm) | 83.16±12.62 | 86.49±12.14 | 0.88 (0.69–0.95) | 4.62 | 3.63 (4.28) | 2.40 | 10.06 | Marginal
 | TT (m) | 4.58±1.70 | 5.04±1.78 | 0.94 (0.53–0.98) | 9.87 | 0.30 (6.20) | 0.35 | 0.83 | Good
 | DTH (s) | 6.89±1.31 | 5.96±1.21 | 0.67 (−0.05–0.89) | 11.87 | 0.48 (7.52) | 0.23 | 1.34 | Marginal
 | Sit-and-reach (cm) | 9.46±4.27 | 10.70±4.32 | 0.95 (0.08–0.99) | 11.06 | 0.43 (4.24) | 0.86 | 1.18 | Good
 | Balance beam walking (s) | 10.78±4.68 | 8.28±4.75 | 0.83 (−0.02–0.95) | 24.08 | 1.07 (11.25) | 0.94 | 2.97 | Marginal
5-year-old | 10-m SRT (s) | 8.53±0.75 | 8.02±0.53 | 0.56 (−0.02–0.82) | 5.40 | 0.34 (4.16) | 0.11 | 0.95 | Marginal
 | SLJ (cm) | 84.96±14.60 | 89.71±16.58 | 0.90 (0.53–0.97) | 5.24 | 3.70 (4.24) | 3.19 | 10.26 | Marginal
 | TT (m) | 3.88±1.31 | 4.28±1.30 | 0.91 (0.50–0.97) | 9.76 | 0.30 (7.31) | 0.26 | 0.83 | Marginal
 | DTH (s) | 6.95±3.23 | 6.15±2.90 | 0.94 (0.55–0.98) | 11.47 | 0.55 (8.46) | 0.58 | 1.54 | Satisfactory
 | Sit-and-reach (cm) | 8.72±5.05 | 10.36±5.27 | 0.94 (0.06–0.99) | 18.20 | 0.58 (6.10) | 1.05 | 1.61 | Good
 | Balance beam walking (s) | 7.16±1.85 | 5.70±1.56 | 0.35 (−0.05–0.66) | 24.78 | 1.24 (19.35) | 0.30 | 3.45 | Marginal
5.5-year-old | 10-m SRT (s) | 7.52±0.84 | 7.12±0.86 | 0.74 (0.34–0.89) | 6.11 | 0.37 (4.99) | 0.17 | 1.01 | Marginal
 | SLJ (cm) | 103.51±12.92 | 107.0±13.74 | 0.90 (0.70–0.96) | 3.77 | 3.57 (3.39) | 2.68 | 9.89 | Marginal
 | TT (m) | 5.72±1.78 | 6.07±1.73 | 0.92 (0.81–0.97) | 8.40 | 0.43 (7.22) | 0.35 | 1.18 | Marginal
 | DTH (s) | 6.25±0.87 | 5.80±0.88 | 0.79 (0.12–0.93) | 6.29 | 0.28 (4.68) | 0.17 | 0.78 | Marginal
 | Sit-and-reach (cm) | 8.37±4.69 | 9.83±4.52 | 0.94 (0.09–0.99) | 26.05 | 0.56 (6.19) | 0.90 | 1.56 | Good
 | Balance beam walking (s) | 6.48±1.78 | 5.68±1.59 | 0.68 (0.33–0.84) | 16.06 | 0.84 (13.86) | 0.31 | 2.34 | Marginal
6-year-old | 10-m SRT (s) | 6.97±0.88 | 6.60±0.87 | 0.80 (0.37–0.92) | 5.74 | 0.31 (4.59) | 0.17 | 0.86 | Marginal
 | SLJ (cm) | 111.55±14.03 | 115.57±14.34 | 0.90 (0.64–0.96) | 3.70 | 3.59 (3.17) | 2.83 | 9.96 | Marginal
 | TT (m) | 5.85±1.72 | 6.23±1.67 | 0.92 (0.76–0.97) | 7.87 | 0.40 (6.70) | 0.33 | 1.12 | Marginal
 | DTH (s) | 6.16±0.99 | 5.87±1.05 | 0.91 (0.56–0.97) | 4.87 | 0.23 (3.77) | 0.21 | 0.63 | Satisfactory
 | Sit-and-reach (cm) | 6.57±5.05 | 8.14±5.01 | 0.94 (0.13–0.99) | 26.60 | 0.65 (8.82) | 1.00 | 1.80 | Good
 | Balance beam walking (s) | 6.38±2.27 | 5.59±2.56 | 0.87 (0.57–0.95) | 15.40 | 0.70 (11.76) | 0.47 | 1.95 | Marginal

Abbreviations: SRT, Shuttle Run Test; SLJ, Standing Long Jump; TT, Tennis Throwing; DTH, Double-leg Timed Hop; SD, Standard Deviation; ICC, Intraclass Correlation Coefficient; CI, Confidence Interval; CV%, Percentage of Coefficient of Variation; SEM, Standard Error of Measurement; SWC, Smallest Worthwhile Change; MDC95, Minimum Detectable Change in 95% CI.

The comparison of SWC and SEM values showed that most measurements demonstrated only marginal sensitivity, except the TT test of 4.5-year-old subjects (SWC = 0.35 > SEM = 0.30) and the sit-and-reach test of 4.5- (SWC = 0.86 > SEM = 0.43), 5- (SWC = 1.05 > SEM = 0.58), 5.5- (SWC = 0.90 > SEM = 0.56), and 6-year-old (SWC = 1.00 > SEM = 0.65) subjects. Meanwhile, satisfactory sensitivity was observed in the TT test of 4-year-old subjects (SWC = 0.29 ≈ SEM = 0.28) and DTH in 5- (SWC = 0.58 ≈ SEM = 0.55) and 6-year-old (SWC = 0.21 ≈ SEM = 0.23) subjects.

Differences and effect size between trials of all measurements

The results of the paired sample t-test (Table 4) showed a significant difference between trials for all the measurements: the 10-m SRT (p<0.01 and d = 0.87 [large]), SLJ (p<0.01 and d = 0.71 [moderate]), TT (p<0.01 and d = 0.84 [large]), DTH (p<0.01 and d = 0.92 [large]), sit-and-reach (p<0.01 and d = 1.57 [large]), and balance beam walking (p<0.01 and d = 0.69 [moderate]) tests.
Table 4

Differences in mean values between trials of measurements.

Group | Testing Item | Trial 2 − Trial 1 difference (SD) | p | Effect Size (Cohen’s d)
All subjects | 10-m SRT (s) | −0.52 (0.60) | < 0.01 | −0.87 (Large)
 | SLJ (cm) | 3.85 (5.39) | < 0.01 | 0.71 (Moderate)
 | TT (m) | 0.43 (0.51) | < 0.01 | 0.84 (Large)
 | DTH (s) | −0.70 (0.75) | < 0.01 | −0.92 (Large)
 | Sit-and-reach (cm) | 1.39 (0.89) | < 0.01 | 1.57 (Large)
 | Balance beam (s) | −1.43 (2.08) | < 0.01 | −0.69 (Moderate)

Abbreviations: SRT, Shuttle Run Test; SLJ, Standing Long Jump; TT, Tennis Throwing; DTH, Double-leg Timed Hop; SD, Standard Deviation

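The trial-to-trial comparison in Table 4 can be reproduced with a short sketch (hypothetical code; the paper does not state which Cohen's d variant was used, so the difference-score form d = mean difference / SD of differences, which matches the Table 4 layout, is assumed):

```python
import math
from statistics import mean, stdev

def paired_comparison(trial1, trial2):
    """Paired-sample t statistic and difference-score Cohen's d
    for Trial 2 - Trial 1 differences."""
    diffs = [b - a for a, b in zip(trial1, trial2)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))  # compare against t(n-1) for the p-value
    d = mean_diff / sd_diff                   # assumed Cohen's d formula
    return t, d
```

A negative d for the timed items (SRT, DTH, balance beam) reflects a faster, i.e., better, second trial.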

Discussion

This study primarily aimed to establish the intraday reliability, MDC, and sensitivity of the six key testing items of NPFM by comparing between trials. Systematic bias underlying observed differences, such as a learning effect leading to greater familiarity with the selected measurement, insufficient recovery from the previous trial inducing a fatigue effect in subsequent attempts, and different emotional states or motivation levels, can be detected when intertrial reliability is determined [17]. The findings shown in Table 2 indicate that all the testing items generally demonstrate good to excellent relative reliability in preschool children. The ICC is commonly used to assess the reliability of a measurement or testing method, wherein values over 0.90 are regarded as excellent relative test–retest reliability. Tests with excellent ICCs exhibit good stability and consistency of measurement over time and low measurement error [20]. However, previous studies reported limitations in the use of the ICC alone, such as inter-subject variability that can potentially affect the result and overestimated ICC values in a typically heterogeneous population [21]. Therefore, measurements with excellent relative reliability do not necessarily ensure consistent intertrial performance. Calculating SEM and CV% was further recommended to obtain within-subject variation in addition to measuring ICCs and to confirm absolute reliability [18, 22]. Analyses of absolute reliability during performance-related tests in nonathletic settings regard a CV% below 10% as acceptable agreement [17], while Fox et al. [16] specified the threshold of acceptable reliability as an SEM% of not more than 10%. In this regard, the balance beam walking test showed poor absolute reliability in boys, girls, and all subjects.
Further evaluation of the results in different age subgroups (Table 3) demonstrated that several measurements, including 10-m SRT (3.5 to 5.5-year-old subjects), DTH (4.5-year-old subjects), and balance beam walking test (4-, 5-, and 5.5-year-old subjects), failed to reach a satisfactory relative reliability level. Notably, SLJ, TT, and sit-and-reach tests that primarily measured the distance rather than the time can produce better intertrial relative reliability results in preschoolers. This finding may be related to the nature and complexity of required motor skills in measurements. Furthermore, the balance beam walking test for all the subdivided age groups and the TT test for 3.5-year-old subjects showed an unacceptable level of absolute reliability. Recent studies have reported that the complexity of tests directly alters the consistency of their testing results [23-25]. Only a limited or short distance of locomotion was required in the sit-and-reach, TT, and SLJ tests of our study. Conversely, 10-m SRT, DTH, and balance beam walking test required preschoolers to walk, run, or jump over remarkably longer distances and testing durations. Therefore, these measurements included additional repeated movements and potentially high demands on movement consistency. Moreover, subjects can start their test with preplanned or preprogrammed motor skills (open-loop control-oriented items) without the stress of time limits during sit-and-reach, TT, and SLJ tests. Conversely, 10-m SRT, DTH, and balance beam walking test required subjects to execute motor skills using closed-loop control and integrate sensory feedback for movement or postural corrections during processes [26]. Therefore, subsequent repeated movements must be completed continuously without pause or other preparation time once these tests have started. 
Testing items that use closed-loop motor control can potentially lead to increased inconsistency in testing results and hence relatively poor test–retest reliability in preschool children. Apart from test characteristics, previous studies showed that older preschool children demonstrate superior motor performance in both locomotion and object control [11, 12]. The comparison of relative and absolute reliability in our study clearly demonstrated that the oldest group (6-year-old subjects) generally showed a higher degree of relative and absolute reliability than the youngest group (3.5-year-old subjects). Gabbard [27] recently reported that refinement and maturation of fundamental motor skills only occur during late childhood (ages 6–12 years). Latorre Román et al. [5] presented high consistency of motor performance in the same testing items among older preschool children; therefore, the maturity of preschool children can be a key factor affecting intraday reliability [5]. Furthermore, the 10-m SRT, sit-and-reach, and balance beam walking tests are more reliable when preschool girls are tested, while TT and DTH are more reliable when preschool boys are examined. Although preschool boys and girls showed similar object control and locomotor skills in some studies [28, 29], Hardy et al. [30] found that girls performed better than boys in locomotor skills. Regarding balance performance, girls demonstrated better postural control and hence superior performance in balance tasks than boys [31-33]. Previous studies also showed that girls outperformed boys in flexibility throughout childhood until adolescence [3, 32]. The comparison of TT ability exhibited results consistent with a previous study in which superior performance in male children was observed [34]. Although investigations on DTH are lacking, recent studies indicated that boys performed significantly better than girls in leap, SLJ, and sideway jump tests [35, 36].
These findings are consistent with our study, wherein boys showed better relative and absolute reliability in DTH. The improved intertrial relative reliability of certain testing items for one gender may be explained in two ways. (1) The superior motor skills and development demonstrated by boys or girls in certain testing items can lead to both high and consistent motor performance. (2) The learning effect available for skills that are already well performed is subject to diminishing gains or decreased margins. Apart from relative and absolute reliability, estimating the MDC with a 95% confidence interval (MDC95) was recommended in recent studies [20]. Without prior knowledge of the MDC value, determining whether an observed change is due to a real intervention effect or to measurement error is unclear, even when a high degree of test–retest reliability is established. Our results demonstrated a very large MDC95 value for all subjects in the balance beam walking test, 4.09 s, which is 54.9% of the performance of the better trial (7.45 s). Hence, preschool children must achieve a reduction of at least 55% in their balance beam walking time to show meaningful or real improvement with 95% confidence, excluding errors induced during the measurement. In this regard, further investigations on the sources of measurement error or the reasons for such unreliable performance during the balance beam walking test in preschool children are necessary. Otherwise, the government should consider devising another test to replace the balance beam walking assessment and produce improved reliability, usefulness, and validity for testing dynamic balance. Apart from reliability data and MDC95 values, practitioners also wish to determine threshold values beyond zero that represent the minimum change required for practically meaningful results, using the SWC. SWC and SEM values are commonly compared to express and understand test sensitivity [17].
Briefly, Liow and Hopkins [37] established thresholds: a test has “good sensitivity” to detect changes if the SEM is smaller than the SWC, “satisfactory sensitivity” if the SEM is equal to the SWC, and only “marginal sensitivity” if the SEM is larger than the SWC. The analysis of NPFM sensitivity verified the effectiveness of each testing item in detecting real and practically meaningful changes in individual performance. The sit-and-reach test in our study showed good sensitivity in all the groups except the 3.5- and 4-year-old subjects. Irrespective of gender and age, the SWC of the sit-and-reach test for all the preschool children was 0.90 cm, while the SEM and MDC95 were 0.63 and 1.74 cm, respectively. Therefore, any observed change beyond 0.90 cm can be regarded as practically meaningful and exceeding the typical error of measurement. Practitioners can have 95% confidence that a change is real rather than a measurement error when the observed change exceeds 1.74 cm. By comparison, SLJ showed good sensitivity in the all-subjects, boys, and girls groups but only marginal sensitivity in all the subdivided age groups. Similarly, the TT test showed good sensitivity only in boys and 4.5-year-old subjects and satisfactory sensitivity in the overall and 4-year-old subjects. Moreover, the 10-m SRT, DTH, and balance beam walking tests showed marginal sensitivity in most groups. Among the NPFM testing items, only the SLJ, TT, and sit-and-reach tests were considered simple tests using open-loop control, and they showed good or satisfactory sensitivity in several subject groups. Therefore, typical errors in these three testing items, with relatively low SEM and high SWC values, are unlikely to mask detectable and meaningful improvements when used in particular preschool groups [38].
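The decision rule described above for the sit-and-reach example can be summarized as a small helper (an illustrative sketch; the threshold values in the usage example are the ones reported in this study):

```python
def interpret_change(observed_change, swc, mdc95):
    """Classify an observed change against the SWC (practically meaningful)
    and MDC95 (real with 95% confidence) thresholds."""
    magnitude = abs(observed_change)
    if magnitude > mdc95:
        return "real and worthwhile (exceeds MDC95)"
    if magnitude > swc:
        return "worthwhile, but within possible measurement error"
    return "trivial (below SWC)"
```

For instance, a 2.0 cm sit-and-reach improvement evaluated against SWC = 0.90 cm and MDC95 = 1.74 cm would be classified as real and worthwhile, whereas a 1.2 cm improvement would be worthwhile but not distinguishable from measurement error with 95% confidence.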
The paired sample t-test revealed significant differences between trials, with clear moderate-to-large improvements on the second trial of every test, indicating considerable systematic bias. The original NPFM guidelines only require preschool children to rest and avoid unnecessary vigorous activity before the testing items; they provide no information on warm-up or familiarization sessions. To reflect the actual reliability and sensitivity performance of NPFM and conform with the current guidelines, our study provided instructions and demonstrations only. In this regard, previous studies reported that the residual learning effect induced by testing can persist for up to 60 days [39, 40], and a recent study showed that motor test performance in preschool children peaked at the fourth or fifth session [41]. The clear improvement in our second trial may therefore reflect carryover learning or a warm-up effect from the first trial, especially when preschoolers were not yet familiar with the motor tasks. Tomac and Hraski [41] recommended five trials of each testing item for preschool children to remove the potential learning effect of the first few attempts without provoking transformational effects. Practitioners and researchers in future studies should therefore provide at least four, and optimally five, familiarization sessions before using NPFM for fitness testing of preschool children, with five trials of each test to maximize consistency. Although our study did not compare tests with and without warm-up sessions, a standardized pretest warm-up protocol should be added to the NPFM guidelines and implemented in the future for both safety and performance reasons.
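The systematic-bias check described above (a paired t-test plus an effect size) uses the standard formulas t = d̄ / (s_d/√n) and Cohen's d_z = d̄ / s_d, where d̄ and s_d are the mean and SD of the trial-to-trial differences. A sketch with hypothetical trial data, not the study's measurements:

```python
from statistics import mean, stdev

def paired_bias(trial1, trial2):
    """Paired-samples t statistic and Cohen's d_z for trial-to-trial systematic bias."""
    diffs = [b - a for a, b in zip(trial1, trial2)]
    n = len(diffs)
    d_mean, d_sd = mean(diffs), stdev(diffs)
    t = d_mean / (d_sd / n ** 0.5)   # compare against the t distribution with n-1 df
    dz = d_mean / d_sd               # |dz| >= 0.8 is conventionally a large effect
    return t, dz

# Hypothetical 10-m shuttle run times (s); trial 2 faster, as the study observed
trial1 = [8.2, 7.9, 8.5, 8.0, 7.7, 8.3]
trial2 = [7.8, 7.5, 8.1, 7.7, 7.4, 7.9]
t_stat, dz = paired_bias(trial1, trial2)  # both negative: times fell on trial 2
```

A large negative t with a large |d_z| on these data mirrors the pattern reported in the study: a consistent, practically meaningful improvement on the second trial.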
A simple pretest warm-up protocol for preschoolers adopted in a recent study [5] can be referenced directly or used with proper modification: five minutes of low-intensity running followed by five minutes of general exercises, such as skipping, leg lifts, lateral running, and front-to-back arm rotations, covering all body regions and simulating the movements of the NPFM testing items.

The results of this study provide researchers and preschool teachers with empirical evidence on the test–retest reliability of NPFM measurements, and the reported SWC and MDC95 values give practitioners concrete minimum differences required to reflect true performance changes. However, this study has limitations. First, because NPFM is conducted on preschool children yearly, older preschoolers had relatively more experience with the testing items than the younger groups, who had insufficient pretest familiarization. Second, learning or practice effects were very likely induced during the initial trial of most testing items because of our strict adherence to the original NPFM protocol, which requires no warm-up or familiarization period. Third, the learning or practice effects induced in each group may vary with gender, age, and maturity differences among subjects. Finally, the 3-year-old group was not investigated because the timing of our study mismatched the academic year.

In conclusion, all six measurement items in NPFM provided good relative reliability when conducted on the same day with repeated measures. The balance beam walking test showed low absolute reliability, with both SEM% and CV% exceeding 10%. Systematic bias was observed, with significantly improved performance during the second trial of all the tests.
7 Sep 2020

PONE-D-20-20616
The intraday reliability, sensitivity and minimum detectable change of National Physical Fitness Measurement for Preschool Children in China
PLOS ONE

Dear Dr. Ho,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Oct 22 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org.
We look forward to receiving your revised manuscript.

Kind regards,
Subas Neupane
Academic Editor
PLOS ONE

Journal Requirements: When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

3. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager.

Additional Editor Comments (if provided):

Two reviewers have provided comments on your manuscript. Both reviewers make good points; please consider revising the manuscript to address each of the comments raised.
Besides that, the English language of the manuscript should be checked. Another minor issue: in the results section the sub-headings are too long; please shorten them using only the relevant text.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data, e.g. participant privacy or use of data from a third party, those must be specified.

Reviewer #1: Yes
Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for inviting me to review the manuscript on "The intraday reliability, sensitivity and minimum detectable change of National Physical Fitness Measurement for Preschool Children in China". This research investigated the reliability, sensitivity and minimum detectable change values of NPFM in preschool children aged between 3.5 to 6 years. Overall the manuscript is well-written. I have a few minor comments that I would like the authors to consider.

1. In Abstract, please provide more detail on the reliability and sensitivity analysis method.

2. Line 32, the keywords of "muscle strength; balance test" should be changed, i.e. test-retest reliability...

3. In the Procedures, please clarify no previous familiarization was given for any test, although it was mentioned in line 342. How many studies were identified in the review and how many provided data? Are there any differences in design or populations of those who provide data versus those who do not?

4. I feel confused that there were two age groups (3.5-year, 4-year and 4.5-year vs. 5-year, 5.5-year and 6-year) in analyzing the intraday reliability, SEM, SWC, MDC95 and classification of sensitivity, on page 12 lines 208-237; why show the results in Table 3 & Table 4?

5. I'm also confused by the sentence, "To further improve the test-retest reliability of NPFM in preschoolers of different age groups or genders, researchers and practitioners should provide sufficient warm-up and practice opportunity to minimize learning effects". What is meant by it? Which results could be deduced for such a conclusion or advice in this manuscript?
Reviewer #2:

Line 17: please change "National Physical Fitness Measurement (preschool children version)" to "National Physical Fitness Measurement (NPFM - preschool children version)"

Line 23: please mention the model of ICC that you used

Line 23: Change "(ICC = 0.77 to 0.97)" to "(ICC…: 0.77 to 0.97)"

Line 24: Change "(moderate: ICC = 0.56 to 0.74)" to "(ICC: 0.56 to 0.74 [moderate])"

Line 25-26: Change "subject (poor: ICC 0.33 to 0.35), 5.5-year subject (moderate: ICCs=0.68) and double-leg timed hop test (moderate: ICC = 0.67) in 4.5-year." to "subject (ICC: 0.33 to 0.35 [poor]), 5.5-year subject (ICC=0.68 [moderate]) and double-leg timed hop test (ICC = 0.67 [moderate]) in 4.5-year."

Line 26-27: based on which results/statistical index? What about the absolute reliability results?

Line 28-31: try to generalize your conclusion, not simply repeat the results.

Line 58-72: It is a classic description of reliability and sensitivity statistical tools, so please move this paragraph to the discussion section or remove it. It would be better to highlight the meaning and importance of absolute and relative reliability and internal and external sensitivity.

Line 99-100: please edit the form to "3.5≤ (n= 31)<4 years-old, 4≤ (n = 22) <4.5-years-old……."

Table 1: please insert the size of each group, for example change "All ages" to "All ages (n=209)"

Line 113: is it a simple randomization or counterbalanced?

Line 114: without a familiarization session?

Line 113-116: indoor or outdoor? At the same time of day?

Line 115: As a general testing instruction for young children, more than one trial for each test should be done.

Line 140-148: for the 10-meter shuttle run and balance beam walking tests, are you sure that subjects had a complete recovery after only 1 min of rest?

Line 181-189: Have you checked the normality of the data distribution?
I think that you do not need to apply a log transformation to normally distributed data, and with a medium sample size (greater than 20) it is recommended to combine the Student's t-test with the Cohen's d effect size rather than use non-clinical magnitude-based inference statistics.

Lines 195-198, 213-216 and 231-236: you focus only on interpreting the ICC results; what about the SEM and MDC values? For example, if the MDC95 of the 10-m shuttle run (s) equals 1.01, how is this result interpreted? Same for the SEM values.

Table 3 and Table 4: please combine table 3 with table 4

Tables 2-4: You mentioned P significance values, but you did not mention which statistical tool you used.

Line 255-257: Based only on ICC results you cannot conclude that the tests have good reliability.

Line 320-324: General interpretation, with a lack of explanation of the meaning and exact utility of SEM and SWC.

Line 323: Which type of "detect true changes"?

Line 331-334: there is a lack of warm-up protocol description.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: Yes: Wissem Dhahbi
Submitted filename: Commends for the manuscript.docx

2 Oct 2020

We thank the two reviewers for their time and valuable suggestions. In this revised version we have answered all the questions raised by the reviewers and edited the manuscript with substantial revision of the English syntax. We hope that this revised manuscript meets the standard for publication in PLOS ONE. Below please find our point-by-point responses to the reviewers.

Reviewer 1

1. In Abstract, please provide more detail on the reliability and sensitivity analysis method.

Authors' reply: According to the reviewer's suggestion, the abstract was revised as "Intraday relative reliability was tested using intraclass correlation coefficient (ICC3,1) with a 95% confidence interval while absolute reliability was expressed in standard error of measurement and percentage of coefficient of variation (CV%). Test sensitivity was assessed by comparing the smallest worthwhile change (SWC) with standard error of measurement (SEM), while MDC values with 95% confidence interval (MDC95) were established." (Line 22-27). In addition, more content regarding SEM and SWC was added to show the sensitivity result: "The balance beam walking test showed poor absolute reliability in all the groups (SEM%: 11.76 to 22.28 and CV%: 15.40 to 24.78). Both standing long jump and sit-and-reach tests demonstrated good sensitivity (SWC > SEM) in all subjects group, boys, and girls." (Line 32-35).

2. Line 32, the keywords of "muscle strength; balance test" should be changed, i.e.
test-retest reliability...

Authors' reply: According to the reviewer's suggestion, the keywords "muscle strength; balance test" have been changed to "test-retest reliability" (Line 42).

3. In the Procedures, please clarify no previous familiarization was given for any test, although it was mentioned in line 342.

Authors' reply: According to the reviewer's suggestion, the relevant content was added: "According to the current NPFM guidelines [8], no previous familiarization session was given." (Line 115-116).

How many studies were identified in the review and how many provided data? Are there any differences in design or populations of those who provide data versus those who do not?

Authors' reply: In our literature review, about 12 papers regarding fitness batteries/protocols for children were identified. References [1-3, 11, 32] were more about youth, adolescents, or children, but not preschoolers. [4] by Ortega et al. (2014) provided a systematic review of field-based physical fitness test batteries for preschool children using the PREFIT battery. [5-7] were cross-sectional studies testing physical fitness in preschool children in Spain, Colombia, and China. [9] was a reliability and feasibility study using the PREFIT battery in Spain. [12] was about motor skill performance (not purely a physical fitness test battery) in 3- and 4-year-old preschoolers in America. [33] provided preliminary performance results of the PREFIT battery conducted in preschool children (aged 3.00 to 6.25) in Spain, with percentiles classified. The systematic review by Ortega et al. (2014) provided the most comprehensive information regarding reliability, reporting 21 relevant articles examining reliability. For example, some studies cited in their paper showed reliable results using the 1/2-mile walk/run test to assess 5-year-old preschoolers (r > 0.73), while another study by Niederer et al.
(cited in Ortega's article) showed good reliability of the 20-m shuttle-run test in Swiss preschool children aged 4-6 years (r = 0.84). However, all these studies only covered preschoolers aged 4-6 years, not those below 4. Apart from the r value reflecting the correlation coefficient, the other most commonly used reliability measure was the ICC. Ortega et al. also reported standing long jump results from other articles: Krombholz (2011) observed r = 0.68 using the standing long jump for 3- to 7-year-olds, but the results were obtained 8 months apart. Meanwhile, Ortega et al. also reported another standing long jump result using ICC (0.65-0.89) in 4- to 5-year-old children, showing acceptable ICC. In addition, Ortega et al. reported reliable results for the one-leg stance (ICC: 0.73 to 0.99; r = 0.84-0.97) in preschool children of different ages from several other papers. They also cited a paper from Oja and Jurimae (1997) showing acceptable reliability of the 4 x 10 m shuttle run using ICC and Cronbach's alpha for boys and girls aged 4 to 5 years. Apart from Ortega et al., Amado-Pacheco et al. (2019) performed the Fuprecol Kids study with 90 preschool children between 3 and 5 years old using an inter-day comparison approach (two testing sessions two weeks apart). They performed the well-known PREFIT 20 m shuttle run test for cardiorespiratory performance, the standing long jump and handgrip tests for strength and musculoskeletal performance, the 4 x 10 m shuttle run for speed and agility, and the sit-and-reach for flexibility. This research group assessed reliability using mean difference comparisons, ICC values, Bland-Altman plots, and the technical error of measurement (TEM). They showed a -0.27 cm difference between trials for boys in the sit-and-reach and a 0.59 cm (p<0.01) performance increase for girls. In general, they reported excellent ICC values for the standing broad jump (0.99), 4 x 10 m shuttle run (0.95), and sit-and-reach (0.96).
In addition, another study, conducted by Cadenas-Sanchez et al. (2016), also used an inter-day comparison (2 weeks apart) to assess 161 Spanish preschoolers aged 3 to 5 years with the PREFIT 20 m shuttle run, handgrip strength, standing long jump, 4 x 10 m shuttle run, and one-leg stance tests. They used the Bland-Altman method and paired sample t-tests to check whether the error was significantly different from the reference point. They showed significantly shorter standing long jump distances but longer one-leg stance times between days. Therefore, the methods used for assessing reliability were mixed, mainly including the correlation coefficient, ICC, and Bland-Altman plots, while the most widely studied fitness test battery was PREFIT, in European regions. Most studies focused on 3- to 5-year-olds or 4- to 6-year-olds; therefore, data on the entire preschool spectrum (3 to 6 years) might not be complete in each study. We have mildly revised the introduction section by adding a brief summary of the current review and to make the transitions between sentences/paragraphs smoother: "Previous studies showed excellent reliability of FITness testing in PREschool children (PREFIT) in Spain using Bland-Altman method, intra-class correlation coefficient (ICC) and the comparison of mean differences [6, 9]. Meanwhile, the systematic review from Ortega et al. [4] reported that 4 x 10 m shuttle-run test has provides reliable measures in speed and agility related fitness for preschoolers aged 4 to 5 years (ICC: 0.52 to 0.92) and one-leg-stance test is a popular and reliable test for assessing the balance of preschool children (ICC: 0.73 to 0.99). In addition, the standing long jump test used in testing 4- and 5-year-old preschool children showed acceptable relative reliability (ICC: 0.65 to 0.89). Regarding the studies using Chinese NPFM, the level of physical fitness and activity of preschool children in Shanghai was reported recently [7, 10]." (Line 69-79)

4.
I feel confused that there were two age groups (3.5-year, 4-year and 4.5-year vs. 5-year, 5.5-year and 6-year) in analyzing the intraday reliability, SEM, SWC, MDC95 and classification of sensitivity, on page 12 lines 208-237; why show the results in Table 3 & Table 4?

Authors' reply: We agreed with the suggestions from both reviewers, as the key aim of this paper was not to compare differences between younger and older preschoolers. Table 3 and Table 4 are now combined into a single Table 3 (Line 240-243).

5. I'm also confused by the sentence, "To further improve the test-retest reliability of NPFM in preschoolers of different age groups or genders, researchers and practitioners should provide sufficient warm-up and practice opportunity to minimize learning effects". What is meant by it? Which results could be deduced for such a conclusion or advice in this manuscript?

Authors' reply: Yes, we agree that the current study did not directly investigate the warm-up effect. However, per the request of another reviewer, a paired sample t-test was used to better show the existence of systematic bias. In the discussion, although we proposed that the observed significant improvement in the 2nd trial could be induced by a learning or warm-up effect (Line 397), to avoid the potential confusion you mentioned we added content clarifying that a standardized warm-up protocol is warranted for both performance and safety reasons, as "Although our study did not compare differences between tests with or without warm-up sessions, a standardized pretest warm-up protocol should be added in NPFM guidelines and implemented in the future for both safety and performance reasons.
A simple pretest warm-up protocol for preschoolers adopted in a recent study can be directly referenced or used with proper modification, including five minutes of low-intensity running, followed by another five minutes of general exercises, such as skipping, leg lifts, lateral running, and front-to-behind arm rotations, to cover all body regions and simulate movements of testing items in NPFM [5].” Line (409-417) Reviewer: 2 Line 17 : please change "National Physical Fitness Measurement (preschool children version)” to “National Physical Fitness Measurement (NPFM - preschool children version)” Authors’ reply: We have amended the abstract according to the suggestion. The revised content is “China General Administration of Sport has published and adopted the National Physical Fitness Measurement (NPFM - preschool children version) since 2000.” shown (Line 16-17) Line 23: please mention the model of ICC that you used Authors’ reply: We have added the adopted model of ICC back to the content of abstract as “Intraday relative reliability was tested using intraclass correlation coefficient (ICC3,1) with a 95% confidence interval while absolute reliability was expressed in standard error of measurement and percentage of coefficient of variation (CV%).” (Line 22-24). 
It was also shown clearly in the abstract throughout as “Measurements in most groups, except 10-m shuttle run test (ICC3,1: 0.56 to 0.74 [moderate]) in the 3.5 to 5.5-year-old groups, balance beam test in 4- and 5-year-old (ICC3,1: 0.33 to 0.35 [poor]) and 5.5-year-old (ICC3,1=0.68 [moderate]) groups, and double-leg timed hop test (ICC3,1=0.67 [moderate]) in the 4.5-year-old group, demonstrated good to excellent relative reliability (ICC3,1: 0.77 to 0.97).” (Line 27-32) Line 23: Change “(ICC = 0.77 to 0.97)" to “(ICC…: 0.77 to 0.97)" Authors’ reply: According to the reviewers’ suggestion “(ICC = 0.77 to 0.97)” was changed to “(ICC3.1: 0.77 to 0.97)” (Line 32) Line 24: Change “(moderate: ICC = 0.56 to 0.74)” to “(ICC: 0.56 to 0.74 [moderate])” Authors’ reply: According to the reviewers’ suggestion “(moderate: ICC = 0.56 to 0.74)” was changed to “(ICC3.1: 0.56 to 0.74 [moderate])” (Line 28 ) Line 25-26: Change “subject (poor: ICC 0.33 to 0.35), 5.5-year subject (moderate: ICCs=0.68) and double-leg timed hop test (moderate: ICC = 0.67) in 4.5-year.” to “subject (ICC: 0.33 to 0.35 [poor]), 5.5-year subject (ICC=0.68 [moderate]) and double-leg timed hop test (ICC = 0.67 [moderate]) in 4.5-year.” Authors’ reply: According to the reviewers’ suggestion “subject (poor: ICC 0.33 to 0.35), 5.5-year subject (moderate: ICCs=0.68) and double-leg timed hop test (moderate: ICC = 0.67) in 4.5-year.” was changed to “Measurements in most groups, except 10-m shuttle run test (ICC3,1: 0.56 to 0.74 [moderate]) in the 3.5 to 5.5-year-old groups, balance beam test in 4- and 5-year-old (ICC3,1: 0.33 to 0.35 [poor]) and 5.5-year-old (ICC3,1=0.68 [moderate]) groups, and double-leg timed hop test (ICC3,1=0.67 [moderate]) in the 4.5-year-old group, demonstrated good to excellent relative reliability (ICC3,1: 0.77 to 0.97).” (Line 27-32 ) Line 26-27: based on which results/statistical index?? What about the absolute reliability results? 
Authors’ reply: To better clarify, the sentence is now revised as “Both standing long jump and sit-and-reach tests demonstrated good sensitivity (SWC > SEM) in all subjects group, boys, and girls.” Line (33-35). Meanwhile, the absolute reliability in terms of SEM% and CV% of the worst testing item was also highlighted as “The balance beam walking test showed poor absolute reliability in all the groups (SEM%: 11.76 to 22.28 and CV%: 15.40 to 24.78).” Line (32-33). Line 28-31: try to generalize your conclusion not a simple repetition of results. Authors’ reply: We based on the result of the use of pairwise comparison showed systematic bias between trials and also the newly added discussion part concerning the recommended number of familiarization sessions and testing trials to revise the conclusion in a more concrete and specific approach as “Pairwise comparison revealed systematic bias with significantly better performance in the second trial (p<0.01) of all the tests with moderate to large effect size. Hence, NPFM guidelines should be revised by adding adequate familiarization sessions and standardized warm-up protocols as well as increasing the number of testing trials. SWC and MDC95 values of NPFM tests should be considered to realize true performance changes.” Line (35-40) Line 58-72: It is a classic description of reliability and sensitivity statistic tools, so please move this paragraph to discussion section or remove it. It should be better to highlight the meaning and the importance of the absolute and relative reliability and the internal, external sensitivity. Authors’ reply: According to the suggestion of reviewer, we have moved the contents of those sentences back to discussion. In the initial part of the discussion, we have explained the possible sources leading to systematic bias as “This study primarily aimed to set up the intraday reliability, MDC, and sensitivity of six key testing items of NPFM by comparing between trials. 
The systematic bias of observed differences, such as potential of the learning effect to lead to a higher degree of familiarity of the selected measurement, insufficient recovery from the previous trial that induces the fatigue effect to subsequent attempts, and different emotional statuses or motivation levels, can be detected when intertrial reliability is determined [17].” (Line 263-268). Meanwhile, the original contents existed in the introduction part was further revised by explaining the limitation of ICC and introducing the use of absolute reliability as “ICC is commonly used to assess the reliability of a measurement or testing method, wherein values over 0.90 are regarded as excellent relative test–retest reliability. Tests with excellent ICCs exhibit good stability and consistency of measurement over time and low measurement error [20]. However, previous studies reported limitations, such as inter subject variability that can potentially affect the result and overestimated ICC values in a typically heterogeneous population, in the use of ICC alone [21]. Therefore, measurements with excellent relative reliability do not necessarily ensure consistent intertrial performance. Calculations of SEM and CV% were further recommended to obtain within-subject variation in addition to measuring ICCs and confirm the absolute reliability [18, 22]. Analysis of the absolute reliability during performance-related tests in nonathletic settings demonstrated that CV% below 10% are regarded as acceptable agreement [17], while Fox et al. [16] specified the threshold of acceptable reliability as not more than 10% of SEM.” Line (272-284). To better elaborate the importance of MDC95 and SWC, the discussion part was revised by using our findings to explain the interpretation of MDC and SWC as “Apart from the relative and absolute reliability, estimating the MDC with 95% confidence interval (MDC95) was recommended in recent studies [20]. 
Determining whether the observed change is due to the real effect from intervention or measurement error is unclear without prior knowledge of the MDC value although a high degree of test–retest reliability is provided. Our results demonstrated very large MDC95 values for all subjects in the balance beam walking test at 4.09 s, which is 54.9% of the performance of the better trial (7.45 s). Hence, preschool children must achieve a reduction of at least 55% in their balance beam walking time to show meaningful or real improvement with 95% confidence for excluding errors induced during the measurement.” Line (349-358). In addition, a paragraph was added to elaborate the use of SWC and SEM values for acquiring the sensitivity as “Apart from reliability data and MDC95 values, practitioners also intend to determine threshold values beyond zero that can represent the minimum change required for practically meaningful results using SWC. SWC and SEM values are commonly compared to express and understand test sensitivity [17]. Briefly, Liow and Hopkins [37] established thresholds to determine whether a test has “good sensitivity” and detect changes if SEM is smaller than SWC; the test has “satisfactory sensitivity” if SEM is equal to SWC, while the test only has “marginal sensitivity” if SEM is larger than SWC. The analysis of NPFM sensitivity exhibited that the effectiveness of each testing item in NPFM to detect real and practically meaningful change in the performance of individuals can be verified.” Line (364-373). Through the elaboration from our findings, reader can explicitly know the importance of a sensitive test through “Despite the gender and age consideration, SWC of the sit-and-reach test for all the preschool children was 0.90, while SEM and MDC95 were 0.63 and 1.74 cm, respectively. Therefore, any observed change beyond 0.90 cm can be regarded as practically meaningful and exceeds the typical error of measurement. 
Practitioners can have 95% confidence that a change is real rather than a measurement error when the observed change exceeds 1.74 cm. By comparison, SLJ showed good sensitivity only in the groups of all subjects, boys, and girls, with only marginal sensitivity in all the subdivided age groups. Similarly, the TT test showed good sensitivity only in boys and 4.5-year-old subjects and satisfactory sensitivity in overall and 4-year-old subjects. Moreover, the 10-m SRT, DTH, and balance beam walking tests showed marginal sensitivity in most groups.” (Line 374-385). Line 99-100: please edit the form to “3.5 ≤ (n=31) < 4 years old, 4 ≤ (n=22) < 4.5 years old…”. Authors’ reply: Amended per suggestion as “Subjects were further divided into the following subgroups according to their chronological ages: 3.5 ≤ (n=31) < 4, 4 ≤ (n=22) < 4.5, 4.5 ≤ (n=43) < 5, 5 ≤ (n=24) < 5.5, 5.5 ≤ (n=45) < 6, and 6 ≤ (n=44) years old.” (Line 98-100). Table 1: please insert the size of each group, for example change “All ages” to “All ages (n=209)”. Authors’ reply: We have amended per suggestion, as shown in Table 1 (Line 110). To avoid readers confusing “all ages” with all six age subgroups, we have changed “all ages” to “all subjects”. Line 113: is it a simple randomization or counterbalanced? Authors’ reply: Since we had six different testing items, there are 720 possible sequences. A complete/ideal counterbalanced design is not possible, and we therefore adopted randomization (mentioned in line 115). Line 114: without a familiarization session? Authors’ reply: The authors fully understand the importance of a familiarization session for enhancing test–retest reliability. However, to truly reflect the current practice of NPFM adopted in China, we strictly followed the protocol so that we could base the suggestions in our discussion section on any observed systematic bias.
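As an aside on the randomization just described, the 720 figure is the number of permutations of the six testing items (6! = 720), which is why full counterbalancing was impractical. A minimal sketch of simple randomization of the test order (illustrative only, not the authors' actual procedure):

```python
import itertools
import random

ITEMS = ["10-m shuttle run", "standing long jump", "balance beam walking",
         "sit-and-reach", "tennis throwing", "double-leg timed hop"]

# Complete counterbalancing would need every ordering to appear: 6! = 720
n_orders = len(list(itertools.permutations(ITEMS)))  # 720

def randomized_order(items, seed=None):
    """Simple randomization of the testing sequence for one subject."""
    order = list(items)
    random.Random(seed).shuffle(order)
    return order
```

With 209 subjects and 720 possible orders, simple randomization spreads order effects across the sample without requiring each permutation to appear equally often.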
Per the request of another reviewer, the relevant sentence was added as “According to the current NPFM guidelines [8], no previous familiarization session was given.” (Line 115-116). Meanwhile, in our discussion we added a paragraph drawing on the recent paper by Tomac and Hraski [41] to propose the use of multiple familiarization sessions as “Given that the original NPFM guidelines require preschool children to remain at rest and avoid unnecessary vigorous activities before the testing items, relevant information regarding warm-up or familiarization sessions is unavailable. Our study provided only instructions and demonstrations to reflect the actual reliability and sensitivity of NPFM and to conform with the current NPFM guidelines. In this regard, previous studies reported that an induced residual learning effect can last up to 60 days [39, 40]. A recent study showed that motor test performance in preschool children peaked at the fourth or fifth session [41]. Therefore, the clear improvement in our second trial may be related to a carryover learning or warm-up effect induced by the first trial, especially when preschoolers were not fully familiar with performing the motor tasks. Tomac and Hraski [41] recommended using five trials for each testing item for preschool children to remove the potential learning effect of the first few attempts without provoking transformational effects. Therefore, practitioners and researchers in future studies should provide at least four and optimally five relevant familiarization sessions before using NPFM for fitness tests on preschool children, with each test having five trials to maximize consistency.” (Line 393-409). Line 113-116: indoor or outdoor? At the same time of day?
Authors’ reply: We have supplemented the relevant information in the procedures section as “NPFM was conducted by trained research assistants on a synthetic rubber surface at the outdoor playground of a kindergarten school in Beijing in the morning.” (Line 113-114). Line 115: As a general testing instruction for young children, more than one trial must be performed for each test. Authors’ reply: Similar to the concern about not using a familiarization session, the authors understand the importance of multiple trials for obtaining steadier and more reliable results. However, the current NPFM stipulates the use of two trials for each testing item. Our study, by strictly following the current practice of NPFM, is therefore a good opportunity to reveal the potential weaknesses of the current testing protocols. Accordingly, our discussion contains a paragraph proposing the use of optimally five trials instead of two to enhance reliability: “Tomac and Hraski [41] recommended using five trials for each testing item for preschool children to remove the potential learning effect of the first few attempts without provoking transformational effects. Therefore, practitioners and researchers in future studies should provide at least four and optimally five relevant familiarization sessions before using NPFM for fitness tests on preschool children, with each test having five trials to maximize consistency.” (Line 403-409). We hope this clarifies that it was never our intention to conduct a sub-optimal fitness test; the primary aim of this paper was to reflect what the current NPFM looks like so that we can make recommendations to the relevant organizations. Line 140-148: for the 10-meter shuttle run and balance beam walking tests, are you sure that subjects had a complete recovery after only 1 min of rest?
Authors’ reply: For explosive strength or power-related tests, at least a 1:10 or even 1:15 work–rest ratio is required to warrant complete recovery. For balance beam walking, the walking speed was much slower than normal walking, so there should be no issue with complete recovery. For the 10-meter shuttle run test, we searched the literature for similar motor or fitness tests assessing agility in preschool children. Interestingly, only one study, “Martinez-Tellez et al. (2015). Health-related physical fitness is associated with total and central body fat in preschool children aged 3 to 5 years. Pediatric Obesity, 11(6), 468-474,” mentioned the use of 1-2 min of rest between trials in a 4x10 m shuttle run, whereas all the later studies (including those we cited in our manuscript) referenced this paper but did not mention the inter-trial rest period. Since Martinez-Tellez et al. used 4x10 m while our study using NPFM used 2x10 m, we used at least 1 minute of rest (as the duration and distance were about half of those in Martinez-Tellez et al.) to remain in line with this existing rest standard for a similar test in a similar population. The authors understand that the current recovery duration may not allow 100% recovery; however, our pairwise comparison between the 1st and 2nd trials showed improvement rather than the worse performance expected from incomplete recovery. Therefore, we believe the issue of incomplete recovery was minimal or negligible in the current study. Line 181-189: Did you check the normality of the data distribution? I think you do not need to apply a log transformation with normally distributed data, and also, with a medium sample size (greater than 20), it is recommended to combine Student's t-test with the Cohen's d effect size rather than use non-clinical magnitude-based inference statistics. Authors’ reply: We used the Shapiro-Wilk test and Q-Q plots to assess normality.
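A minimal sketch of this kind of screening and paired comparison (synthetic data and scipy for illustration, not the authors' actual analysis script; the paired Cohen's d shown uses one common convention, the mean difference divided by the SD of the differences):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
trial1 = rng.normal(8.0, 1.5, size=40)           # hypothetical trial-1 times (s)
trial2 = trial1 - rng.normal(0.8, 0.4, size=40)  # trial 2, systematically faster

diff = trial1 - trial2

# Normality screen on the paired differences (Shapiro-Wilk)
sw_stat, sw_p = stats.shapiro(diff)

# Paired-sample t-test between trials
t_stat, p_value = stats.ttest_rel(trial1, trial2)

# Cohen's d for paired data: mean difference / SD of differences
cohens_d = diff.mean() / diff.std(ddof=1)
```

Q-Q inspection would be done graphically on `diff`; the t-test and effect size together mirror the reporting style adopted in the revision (p value plus a magnitude label for d).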
Although quite a number of groups violated the Shapiro-Wilk test, most groups in the Q-Q plots were on or very close to the reference line, with little or some deviation at the tails. Meanwhile, as all our groups have a sample size of more than 20, we believe we do not require non-parametric methods or additional log transformation to yield proper results. Per the reviewer's suggestion, we changed the pairwise comparison from non-clinical MBI to a paired-sample t-test with Cohen’s d as the effect size, as shown in “The results of the paired-sample t-test (Table 4) showed a significant difference between trials for all the measurements of the 10-m SRT (p<0.01 and d=0.87 [large]), SLJ (p<0.01 and d=0.71 [moderate]), TT (p<0.01 and d=0.84 [large]), DTH (p<0.01 and d=0.92 [large]), sit-and-reach (p<0.01 and d=1.57 [large]), and balance beam walking (p<0.01 and d=0.69 [moderate]) tests.” (Line 251-258, Table 4). Lines 195-198, 213-216 and 231-236: you focus only on interpreting the ICC results; what about the SEM and MDC values? For example, if the MDC95 of the 10-m shuttle run (s) equals 1.01, how should this result be interpreted? The same applies to the SEM values. Tables 3 and 4: please combine Table 3 with Table 4. Authors’ reply: Per the request of both reviewers, Tables 3 and 4 are combined into a single Table 3, as shown in lines 240-243. We agree that reporting only ICC (relative reliability) may lead to biased or limited interpretation; the results section was therefore revised with added content, and CV% was also expressed to further strengthen the absolute reliability. “Table 2 shows good to excellent ICCs (0.77 to 0.97) for all the measurements in the groups of all subjects, boys, and girls. However, the balance beam walking test demonstrated poor absolute reliability for the groups of all subjects (SEM%=18.05 and CV%=20.43), boys (SEM%=17.96 and CV%=20.47), and girls (SEM%=18.10 and CV%=20.38).
MDC95 values in the balance beam walking test for the groups of all subjects, boys, and girls showed minimum thresholds of 4.09, 3.99, and 4.18 s, respectively, beyond which changes exceed the random measurement error with a 95% confidence level. SLJ demonstrated good sensitivity in the groups of all subjects (SWC=4.54 > SEM=3.81), boys (SWC=4.68 > SEM=3.94), and girls (SWC=4.33 > SEM=3.67). Similarly, the sit-and-reach test showed good sensitivity in the groups of all subjects (SWC=0.90 > SEM=0.63), boys (SWC=0.77 > SEM=0.68), and girls (SWC=0.89 > SEM=0.41). Only the boys group (SWC=0.40 > SEM=0.30) exhibited good sensitivity in the TT test, while satisfactory sensitivity was observed in all subjects (SWC=0.38 ≈ SEM=0.36).” (Line 197-211). Similarly, SEM, SWC, and CV% were supplemented in “Intraday reliability in ICC, CV%, SEM, SWC, and MDC95 data and the classification of sensitivity in 3.5-, 4-, 4.5-, 5-, 5.5-, and 6-year-old subjects are presented in Table 3. The majority of measurements showed good to excellent relative reliability (ICC: 0.79 to 0.95), except the 10-m SRT (ICC: 0.67 to 0.73 [moderate]) in three groups (3.5-, 4-, and 5-year-old subjects), the balance beam test (ICC: 0.33 to 0.68 [poor to moderate]) in 4-, 5-, and 5.5-year-old subjects, and the DTH (ICC=0.67 [moderate]) in 4.5-year-old subjects. However, according to SEM% and CV% values, the balance beam walking test demonstrated poor absolute reliability (SEM%: 11.25 to 22.28 and CV%: 15.40 to 24.78) for all the age groups. The comparison of SWC and SEM values showed that most measurements demonstrated only marginal sensitivity, except the TT test of 4.5-year-old subjects (SWC=0.35 > SEM=0.30) and the sit-and-reach test of 4.5- (SWC=0.86 > SEM=0.43), 5- (SWC=1.05 > SEM=0.58), 5.5- (SWC=0.90 > SEM=0.56), and 6-year-old (SWC=1.00 > SEM=0.65) subjects.
Meanwhile, satisfactory sensitivity was observed in the TT test of 4-year-old subjects (SWC=0.29 ≈ SEM=0.28) and the DTH in 5- (SWC=0.58 ≈ SEM=0.55) and 6-year-old (SWC=0.21 ≈ SEM=0.23) subjects.” (Line 224-239). To further interpret and elaborate the SEM, MDC, and SWC results, lines 278-284 provide the importance and thresholds of absolute reliability, as quoted above. Furthermore, the interpretation of large SEM and CV values as poor absolute reliability for the balance beam walking test is discussed in lines 286-287: “In this regard, the balance beam walking test showed poor absolute reliability (SEM%: 17.96 to 18.10 and CV%: 20.38 to 20.47) in boys, girls, and all the subjects.” Similarly, the unacceptable absolute reliability of the TT test for 3.5-year-old subjects is shown in lines 295-298: “Furthermore, the balance beam walking test for all the subdivided age groups (SEM%: 11.25 to 22.28 and CV%: 15.40 to 24.78) and the TT test for 3.5-year-old subjects (SEM%=12.63 and CV%=17.81) showed an unacceptable level of absolute reliability.” The interpretation of MDC and SWC, as well as the sensitivity levels, is further elaborated in lines 349-385, repeating the MDC95 and SWC passages quoted above and adding: “In this regard, further investigations on the source of measurement errors or reasons for such unreliable performance during the balance beam walking test for preschool children are necessary. Otherwise, the government should consider devising another test to replace the balance beam walking assessment and produce improved reliability, usefulness, and valid results for testing dynamic balance. … The sit-and-reach test in our study showed good sensitivity in all the groups, except for 3.5- and 4-year-old subjects.
Despite the gender and age consideration, the SWC of the sit-and-reach test for all the preschool children was 0.90 cm, while SEM and MDC95 were 0.63 and 1.74 cm, respectively; as elaborated above, any observed change beyond 0.90 cm can be regarded as practically meaningful, and changes over 1.74 cm can be considered real rather than measurement error with 95% confidence.” Tables 2-4: You mentioned P significance values but did not mention which statistical tool you used. Authors’ reply: We have removed those p values, as they were the significance levels of the ICC values, which are not the key interest of our study. Line 255-257: Based only on ICC results, you cannot conclude that the tests have good reliability. Authors’ reply: We have revised the sentence to “The findings shown in Table 2 indicated that all the testing items generally demonstrate good to excellent relative reliability in preschool children (ICC: 0.77 to 0.97).” by adding “relative” (Line 270-272) to better clarify. In addition, the importance, thresholds, and values of absolute reliability using SEM and CV% were emphasized in lines 275-287 as “However, previous studies reported limitations in the use of ICC alone, such as inter-subject variability that can potentially affect the result and overestimated ICC values in a typically heterogeneous population [21]. Therefore, measurements with excellent relative reliability do not necessarily ensure consistent intertrial performance.
Calculations of SEM and CV% were further recommended to obtain within-subject variation in addition to measuring ICCs and to confirm the absolute reliability [18, 22]. Analysis of absolute reliability during performance-related tests in nonathletic settings demonstrated that a CV% below 10% is regarded as acceptable agreement [17], while Fox et al. [16] specified the threshold of acceptable reliability as not more than 10% SEM. In this regard, the balance beam walking test showed poor absolute reliability (SEM%: 17.96 to 18.10 and CV%: 20.38 to 20.47) in boys, girls, and all the subjects.” We believe these contents help to avoid any misleading or dogmatic conclusions. Line 320-324: General interpretation with a lack of explanation of the meaning and exact utility of SEM and SWC. Authors’ reply: We have revised and added content (Line 364-389), repeating the SWC and sensitivity-threshold passages quoted above and concluding with “Among the testing items of NPFM, only the SLJ, TT, and sit-and-reach tests were considered simple tests using open-loop control and showed good or satisfactory sensitivity in several subject groups. Therefore, typical errors with relatively low SEM and high SWC values in these three testing items are unlikely to mask detectable and meaningful improvement when used in particular preschool groups [38].” Line 323: Which type of “detect true changes”? Authors’ reply: We have revised the whole paragraph as shown in the response to the previous question, so the vague wording “detect true changes” no longer exists. Instead, the paragraph at lines 364-389 explicitly explains why the sit-and-reach test is sensitive and useful using the SEM, SWC, and MDC values. Line 331-334: there is a lack of a warm-up protocol description.
Authors’ reply: Per the reviewer's request, we have referenced a recent study on fitness/motor testing for preschool children and suggested a structured and complete warm-up protocol as “Although our study did not compare differences between tests with or without warm-up sessions, a standardized pretest warm-up protocol should be added to the NPFM guidelines and implemented in the future for both safety and performance reasons. A simple pretest warm-up protocol for preschoolers adopted in a recent study can be directly referenced or used with proper modification, including five minutes of low-intensity running, followed by another five minutes of general exercises, such as skipping, leg lifts, lateral running, and front-to-behind arm rotations, to cover all body regions and simulate movements of the testing items in NPFM [5].” (Line 409-417).

28 Oct 2020. PONE-D-20-20616R1. Intraday reliability, sensitivity and minimum detectable change of National Physical Fitness Measurement for Preschool Children in China. PLOS ONE.

Dear Dr. Ho, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

ACADEMIC EDITOR: In the methods part of the abstract, please also mention when the study was conducted. Also, please state the six items of NPFM in the methods. In the main text, please state the meaning of (ICC3,1), as readers may not be familiar with it. The headings under results (lines 193-195 and 220-222) are too long; please make them short and clear. In the discussion, please do not repeat the statistical results. The conclusions presented in the abstract and in the main text should be aligned.
Please submit your revised manuscript by Dec 12 2020 11:59PM. If you will need more time to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. Please include a rebuttal letter that responds to each point raised by the academic editor and reviewer(s), a marked-up copy of your manuscript that highlights the changes made, and an unmarked version of your revised paper without tracked changes. We look forward to receiving your revised manuscript. Kind regards, Subas Neupane, Academic Editor, PLOS ONE.

Reviewers' comments:

Reviewer's Responses to Questions (Comments to the Author)

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here.
Reviewer #1: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. If there are restrictions on publicly sharing data—e.g.
participant privacy or use of data from a third party—those must be specified.
Reviewer #1: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: Yes

6. Review Comments to the Author
Reviewer #1: I agree with Hraski and the authors' suggestion in lines 405-406, which recommended using five trials for each testing item for preschool children to remove the potential learning effect of the first few attempts. However, I find the authors' statement in the abstract (lines 37 to 40) difficult to understand and accept: “Hence, NPFM guidelines should be revised by adding adequate familiarization sessions and standardized warm-up protocols as well as increasing the number of testing trials. SWC and MDC95 values of NPFM tests should be considered to realize true performance changes.” The main purpose of this manuscript was to investigate the reliability, sensitivity, and minimum detectable change values of NPFM in preschool children, and few results supported this point, so it should be deleted from the abstract. In the conclusion (lines 437 to 444), the sentences that duplicate the discussion section should also be deleted.

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?
Reviewer #1: No

Submitted filename: Commends for the manuscript _10-09-20.docx

29 Oct 2020. We thank the reviewers and the academic editor for their time and valuable suggestions. In this revised version we have answered all the questions raised by the reviewers and the editor. We hope that this revised manuscript meets the standard for publication in PLOS ONE. Below please find our point-by-point responses.

Reviewer 1: I agree with Hraski and the authors' suggestion in lines 405-406, which recommended using five trials for each testing item for preschool children to remove the potential learning effect of the first few attempts.
However, I find the authors' statement in the abstract (lines 37 to 40) difficult to understand and accept: “Hence, NPFM guidelines should be revised by adding adequate familiarization sessions and standardized warm-up protocols as well as increasing the number of testing trials. SWC and MDC95 values of NPFM tests should be considered to realize true performance changes.” The main purpose of this manuscript was to investigate the reliability, sensitivity, and minimum detectable change values of NPFM in preschool children, and few results supported this point, so it should be deleted from the abstract. In the conclusion (lines 437 to 444), the sentences that duplicate the discussion section should also be deleted.
Response from authors: Deleted lines 37 to 40 of the abstract and lines 437 to 444 of the conclusion accordingly.
Academic Editor: In the methods part of the abstract, please also mention when the study was conducted.
Response from authors: Added the time of the test back to the abstract.
Also, please state the six items of NPFM in the methods.
Response from authors: Added.
In the main text, please state the meaning of (ICC3,1), as readers may not be familiar with it.
Response from authors: The full form of ICC3,1 was added in line 168 as “intraclass correlation coefficient with two-way mixed-effects model and single measurement (ICC3,1)”.
The headings under results (lines 193-195 and 220-222) are too long; please make them short and clear.
Response from authors: The headings were trimmed as requested.
In the discussion, please do not repeat the statistical results.
Response from authors: Most statistical results originally included in brackets were deleted per request; only those essential to the discussion in the main text are retained.
The conclusion presented in the abstract and in the main text should be aligned.
Response from authors: per the request of reviewer, we have deleted the last sentence to avoid redundant/duplicated sentences. Submitted filename: Response to Reviewer 2.docx Click here for additional data file. 2 Nov 2020 Intraday reliability, sensitivity, and minimum detectable change of National Physical Fitness Measurement for Preschool Children in China PONE-D-20-20616R2 Dear Dr. Ho, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Subas Neupane Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: 10 Nov 2020 PONE-D-20-20616R2 Intraday reliability, sensitivity, and minimum detectable change of National Physical Fitness Measurement for Preschool Children in China Dear Dr. 
Ho: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Subas Neupane Guest Editor PLOS ONE
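The reliability indices named throughout the review history and abstract (ICC3,1, SEM, SWC, MDC95) follow standard test-retest formulas of the kind described by Weir (reference 2 below). As a minimal Python sketch, not the authors' actual analysis code, the two-trial case could be computed like this; the example data are hypothetical:

```python
import math

def icc_3_1(trial1, trial2):
    """ICC(3,1): two-way mixed effects, single measurement, consistency.

    Computed from the two-way ANOVA decomposition of an
    n-subjects x 2-trials test-retest table.
    """
    n, k = len(trial1), 2
    grand = (sum(trial1) + sum(trial2)) / (n * k)
    subj_means = [(a + b) / k for a, b in zip(trial1, trial2)]
    trial_means = [sum(trial1) / n, sum(trial2) / n]
    ss_total = sum((x - grand) ** 2 for x in trial1 + trial2)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_trial = n * sum((m - grand) ** 2 for m in trial_means)
    ss_error = ss_total - ss_subj - ss_trial
    ms_subj = ss_subj / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subj - ms_error) / (ms_subj + (k - 1) * ms_error)

def reliability_stats(trial1, trial2):
    """SEM, MDC95, and SWC from a test-retest pair.

    Uses the SD pooled over both trials as the between-subject SD,
    a common approximation.
    """
    icc = icc_3_1(trial1, trial2)
    pooled = trial1 + trial2
    mean = sum(pooled) / len(pooled)
    sd = math.sqrt(sum((x - mean) ** 2 for x in pooled) / (len(pooled) - 1))
    sem = sd * math.sqrt(1 - icc)        # standard error of measurement
    mdc95 = sem * 1.96 * math.sqrt(2)    # minimum detectable change, 95% CI
    swc = 0.2 * sd                       # smallest worthwhile change (0.2 x SD)
    return {"ICC3,1": icc, "SEM": sem, "MDC95": mdc95, "SWC": swc,
            "sensitive": swc > sem}      # the paper's sensitivity criterion

# Hypothetical standing-long-jump distances (cm), trial 1 vs trial 2
stats = reliability_stats([100.0, 110.0, 120.0, 95.0],
                          [102.0, 112.0, 121.0, 97.0])
```

A test is rated "sensitive" here when SWC exceeds SEM, mirroring the comparison the abstract reports for the standing long jump and sit-and-reach items.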
References (10 of 33 shown):

1.  Velocity specificity of weight training for kayak sprint performance.

Authors:  David K Liow; William G Hopkins
Journal:  Med Sci Sports Exerc       Date:  2003-07       Impact factor: 5.411

Review 2.  Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM.

Authors:  Joseph P Weir
Journal:  J Strength Cond Res       Date:  2005-02       Impact factor: 3.775

3.  Sport specific fitness status in junior water polo players--Playing position approach.

Authors:  K Idrizovic; O Uljevic; M Spasic; D Sekulic; M Kondric
Journal:  J Sports Med Phys Fitness       Date:  2014-11-04       Impact factor: 1.637

4.  Analyzing the relationship between anthropometric and motor indices with basketball specific pre-planned and non-planned agility performances.

Authors:  Miran Pehar; Nedim Sisic; Damir Sekulic; Milan Coh; Ognjen Uljevic; Miodrag Spasic; Ante Krolo; Kemal Idrizovic
Journal:  J Sports Med Phys Fitness       Date:  2017-05-09       Impact factor: 1.637

Review 5.  Systematic review and proposal of a field-based physical fitness-test battery in preschool children: the PREFIT battery.

Authors:  Francisco B Ortega; Cristina Cadenas-Sánchez; Guillermo Sánchez-Delgado; José Mora-González; Borja Martínez-Téllez; Enrique G Artero; Jose Castro-Piñero; Idoia Labayen; Palma Chillón; Marie Löf; Jonatan R Ruiz
Journal:  Sports Med       Date:  2015-04       Impact factor: 11.136

6.  Intraday reliability and sensitivity of four functional ability tests in older women.

Authors:  Susan Dewhurst; Theodoros M Bampouras
Journal:  Am J Phys Med Rehabil       Date:  2014-08       Impact factor: 2.159

Review 7.  European normative values for physical fitness in children and adolescents aged 9-17 years: results from 2 779 165 Eurofit performances representing 30 countries.

Authors:  Grant R Tomkinson; Kevin D Carver; Frazer Atkinson; Nathan D Daniell; Lucy K Lewis; John S Fitzgerald; Justin J Lang; Francisco B Ortega
Journal:  Br J Sports Med       Date:  2017-11-30       Impact factor: 13.800

Review 8.  Correlates of Gross Motor Competence in Children and Adolescents: A Systematic Review and Meta-Analysis.

Authors:  Lisa M Barnett; Samuel K Lai; Sanne L C Veldman; Louise L Hardy; Dylan P Cliff; Philip J Morgan; Avigdor Zask; David R Lubans; Sarah P Shultz; Nicola D Ridgers; Elaine Rush; Helen L Brown; Anthony D Okely
Journal:  Sports Med       Date:  2016-11       Impact factor: 11.136

9.  Relationship between Physical Activity and Physical Fitness in Preschool Children: A Cross-Sectional Study.

Authors:  Hui Fang; Minghui Quan; Tang Zhou; Shunli Sun; Jiayi Zhang; Hanbin Zhang; Zhenbo Cao; Guanggao Zhao; Ru Wang; Peijie Chen
Journal:  Biomed Res Int       Date:  2017-11-21       Impact factor: 3.411

10.  Association between physical activity, sedentary behavior, and fitness with health related quality of life in healthy children and adolescents: A protocol for a systematic review and meta-analysis.

Authors:  Alberto Bermejo-Cantarero; Celia Álvarez-Bueno; Vicente Martinez-Vizcaino; Antonio García-Hermoso; Ana Isabel Torres-Costoso; Mairena Sánchez-López
Journal:  Medicine (Baltimore)       Date:  2017-03       Impact factor: 1.889

Cited by (5 in total):

1.  Study of the Reliability of Field Test Methods for Physical Fitness in Children Aged 2-3 Years.

Authors:  Dandan Ke; Duona Wang; Hui Huang; Xiangying Hu; Jun Sasaki; Hezhong Liu; Xiaofei Wang; Dajiang Lu; Jian Wang; Gengsheng He
Journal:  Int J Environ Res Public Health       Date:  2022-06-20       Impact factor: 4.614

2.  Chronological and Skeletal Age in Relation to Physical Fitness Performance in Preschool Children.

Authors:  Dandan Ke; Dajiang Lu; Guang Cai; Xiaofei Wang; Jing Zhang; Koya Suzuki
Journal:  Front Pediatr       Date:  2021-05-14       Impact factor: 3.418

Review 3.  Field-based physical fitness assessment in preschool children: A scoping review.

Authors:  Dandan Ke; Remili Maimaitijiang; Shaoshuai Shen; Hidetada Kishi; Yusuke Kurokawa; Koya Suzuki
Journal:  Front Pediatr       Date:  2022-08-02       Impact factor: 3.569

4.  Functional Training Focused on Motor Development Enhances Gross Motor, Physical Fitness, and Sensory Integration in 5-6-Year-Old Healthy Chinese Children.

Authors:  Tao Fu; Diruo Zhang; Wei Wang; Hui Geng; Yao Lv; Ruiheng Shen; Te Bu
Journal:  Front Pediatr       Date:  2022-07-11       Impact factor: 3.569

5.  The effects of the home-based exercise during COVID-19 school closure on the physical fitness of preschool children in China.

Authors:  Zhenwen Liang; Cheng Deng; Dan Li; Wai Leung Ambrose Lo; Qiuhua Yu; Zhuoming Chen
Journal:  Front Pediatr       Date:  2022-08-30       Impact factor: 3.569


Beijing Coyote Bioscience Co., Ltd. © 2022-2023.