Domain selection and familywise error rate for functional data: A unified framework

Functional data are smooth, often continuous, random curves, which can be seen as an extreme case of multivariate data with infinite dimensionality. Just as componentwise inference for multivariate data naturally performs feature selection, subsetwise inference for functional data performs domain selection. In this paper, we present a unified testing framework for domain selection on populations of functional data. In detail, $p$-values of hypothesis tests performed on pointwise evaluations of functional data are suitably adjusted to provide control of the familywise error rate (FWER) over a family of subsets of the domain. We show that several state-of-the-art domain selection methods fit within this framework and differ from each other by the choice of the family over which the control of the FWER is provided. In the existing literature, these families are always defined a priori. In this work, we also propose a novel approach, coined thresholdwise testing, in which the family of subsets is instead built in a data-driven fashion. The method seamlessly generalizes to multidimensional domains, in contrast to methods based on a priori defined families. We provide theoretical results with respect to consistency and control of the FWER for the methods within the unified framework. We illustrate the performance of the methods within the unified framework on simulated and real data examples and compare their performance with other existing methods.


INTRODUCTION
Functional data analysis (FDA) is a field of statistics that pertains to the study of datasets in which the sample unit is a smooth curve. Such data arise in many experimental fields, including engineering, biology, medicine, and biomechanics. Examples from the latter two fields (diffusion magnetic resonance imaging data and kinematic data) are addressed in this paper.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2022 The Authors. Biometrics published by Wiley Periodicals LLC on behalf of International Biometric Society.
For functional data, besides estimation, clustering, and prediction, it is of critical importance to design appropriate statistical methodologies for inference such as testing hypotheses on populations of functional data, which is the objective of the present work. For example, suppose that random functions are observed for two populations, and we want to test whether the mean functions $\mu_1$ and $\mu_2$ are the same, testing $H_0: \mu_1(\cdot) = \mu_2(\cdot)$ versus $H_1: \mu_1(\cdot) \neq \mu_2(\cdot)$.
In our examples, we consider the knee kinematics of two groups of patients and look for differences in their performance, and we compare two diffusion models for structural connectivity in the brain. Multiple methods have been devised in the literature to form global tests for this setting and for more general scenarios, both parametrically (e.g., Horváth and Kokoszka, 2012; Staicu et al., 2014) and nonparametrically (e.g., Cardot et al., 2004; Corain et al., 2014). With such methods, we can establish the existence of significant differences between the populations, but their results do not tell us in which part of the domain (time of the movement, or part of the brain) the differences appear. Therefore, if $H_0$ of a global test is rejected, we want to identify the parts of the domain where significant differences occur, performing so-called local inference. In this paper, we focus on local inference for functional data, which we refer to as domain selection. Few attempts have been made in this direction in the literature. A first and natural approach is to discretize the functional domain and perform pointwise inference. For instance, Fan and Zhang (2000) and Reiss et al. (2010) derive pointwise confidence bands for functional data. This, however, only provides a pointwise control of the errors arising in statistical hypothesis testing. As in the multivariate case, testing multiple (or even infinitely many) hypotheses increases the overall probability of making wrong rejections.
Out of the multiple concepts that can be used for controlling this overall probability, the most well known are the familywise error rate (FWER), which is the probability of rejecting at least one true null hypothesis, and false discovery rate (FDR), which quantifies the expected proportion of false discoveries (i.e., the expected ratio between the number of wrong rejections and the total number of rejections). Both measures are extensions to the multivariate setting of the type I error, though with different experimental meanings. While the control of the FWER is related to a deterministic (although unknown) partition of the domain related to where the null hypothesis is true, the control of the FDR is instead related to a random (but observed) partition of the domain related to the rejections of the null hypothesis. Controlling the FWER is stronger than controlling the FDR: if we devise a method for which the FWER is controlled at a specific level for any ground truth of the null hypothesis, then the FDR will also be controlled at that level. Additionally, if the null hypothesis is true on the whole domain, the two measures coincide. Simulation studies in biomechanics and brain imaging research fields comparing different approaches for FWER as well as comparing it to FDR elucidate differences and similarities in detection regions depending on the approach used. For kinematic data, we refer to, for example, Naouma and Pataky (2019) and Pataky et al. (2021), while discussions in the context of brain imaging can be found in, for example, Logan and Rowe (2004).
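The distinction between the two error measures can be made concrete with a small numerical sketch. The code below is only an illustration, not the estimators used in the paper (those are described in Web Appendix D): the function name `fwer_fdr` and the toy rejection patterns are assumptions introduced here. It estimates the FWER as the fraction of experiments with at least one false rejection, and the FDR as the average proportion of false rejections among all rejections, with the convention that the proportion is zero when nothing is rejected.

```python
def fwer_fdr(rejections, null_true):
    """Empirical FWER and FDR over repeated experiments (illustrative sketch).

    rejections: list of experiments, each a list of booleans (rejected or not)
    null_true:  list of booleans, True where the null hypothesis actually holds
    """
    any_false = 0   # number of experiments with at least one false rejection
    fdp_sum = 0.0   # running sum of false discovery proportions
    for rej in rejections:
        false_rej = sum(r and h0 for r, h0 in zip(rej, null_true))
        total_rej = sum(rej)
        any_false += false_rej > 0
        # convention: the false discovery proportion is 0 when nothing is rejected
        fdp_sum += false_rej / total_rej if total_rej else 0.0
    m = len(rejections)
    return any_false / m, fdp_sum / m


# Hypothetical toy example: four hypotheses, the first two are true nulls.
null_true = [True, True, False, False]
experiments = [
    [False, False, True, True],    # no false rejection
    [True, False, True, True],     # one false rejection out of three
    [False, False, False, False],  # no rejection at all
    [True, True, False, False],    # two false rejections out of two
]
fwer, fdr = fwer_fdr(experiments, null_true)
```

In this toy example the estimated FDR does not exceed the estimated FWER, in line with the statement that controlling the FWER is the stronger requirement.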
In our work, we focus on the problem of testing functional data by providing adjusted $p$-values controlling the FWER. Some examples of methods that instead control the FDR in the context of functional data can be found in Perone Pacifico et al. (2004) and Olsen et al. (2021). We base our approach on properly adjusting pointwise $p$-values in order to account for the multiplicity of tests that are jointly performed when analyzing the whole domain. This issue has, in multivariate analysis, given birth to many adjustment procedures (see, e.g., Marcus et al., 1976; Holm, 1979; Holmes et al., 1996; Winkler et al., 2014). However, functional data differ from multivariate data in that they feature unique properties such as smoothness and domain continuity, which can be used to improve upon classic methods for performing domain selection. Vsevolozhskaya et al. (2014) propose a method for domain selection that relies on the availability of a partition of the domain that allows dimensionality reduction. They perform functional tests on the elements of the partition and resort to a closed testing procedure (Marcus et al., 1976) to adjust the resulting $p$-values and achieve strong control of the FWER for the family generated by unions of the elements of the partition. The resulting inference heavily depends on the partition itself. In addition, the coarseness of the partition defines the depth to which local inference is performed, and the approach is of practical relevance only for relatively small predefined partitions. We refer to this method as partition-closed testing (PCT). Another approach, introduced for functional tests in Pini and Vantini (2017) and extended to functional-on-scalar linear models in Abramowicz et al. (2018), is interval-wise testing (IWT). The procedure simultaneously tests a family of hypotheses generated by all intervals of the domain.
This is, however, of practical use only for functional data defined on one-dimensional domains: it is unclear how to define "multidimensional intervals," and the procedure would be computationally overdemanding due to the curse of dimensionality. IWT provides control of the FWER for the family of all intervals: if the null hypothesis is true on more complex subsets of the domain (e.g., a union of intervals), IWT fails to control the FWER. The adjustment made on the pointwise and setwise $p$-values is only one of the possible approaches presented in the literature. Some recent works focus on providing simultaneous confidence bands for functional data: Degras (2017) develops asymptotic confidence bands, Rathnayake and Choudhary (2016) focus on parametric confidence bands, and Crainiceanu et al. (2012) and Park et al. (2017) use bootstrap confidence bands. Confidence sets based on random field theory have also been considered in, for example, Telschow and Schwartzman (2022) and Liebl and Reimherr (2020). It is also worth noting that, besides FWER and FDR, additional performance measures have been introduced (e.g., false discovery exceedance, false cluster rates, and false nondiscovery proportions). We do not discuss them further in this paper, but refer the reader to, for example, Perone Pacifico et al. (2004) for further information.
In our paper, we focus on methods that aim at providing control of the FWER. We start by formalizing the concepts in Section 2 and introduce a general framework for performing local inference in Section 3. The basic principles of the methods are based on standard pointwise inferential procedures and their setwise counterparts for a chosen family of subsets. The framework is related to a wide class of inferential problems (e.g., comparisons of population means, hypothesis tests for coefficients in models), as we utilize general concepts of null and alternative hypotheses. Furthermore, it can be applied either to a parametric or a nonparametric analysis. Using the properties of the underlying tests, we formulate and prove finite sample and asymptotic properties for methods within this framework in Sections 4 and 5 and Web Appendices A and B, respectively. In Section 4, we show how well-established methods from the literature on inference for functional data can be described in the light of our proposed unified framework. In Section 5, we present a novel method with asymptotic control of the FWER. The control is provided for the family generated by domain discretization corresponding to the resolution of the observed functional data. The computational burden of the new method is independent of the dimension and complexity of the data domain. Further, simulation studies designed to exemplify the properties of the described methods and to compare them with alternative methods existing in the literature are presented in Section 6. Real data applications are presented in Section 7, while Section 8 contains conclusions. Additional definitions and results are presented in Web Appendices C-I.

DEFINITIONS AND THE INFERENTIAL PROBLEM
Consider a space of continuous random functions defined on a domain $D$, where $D$ is a compact subset of $\mathbb{R}^d$, $d \geq 1$. Let us consider a general inferential problem based on a sample of $n$ independent functional observations. Without loss of generality, assume that we aim at testing a functional null hypothesis $H_0$ against an alternative hypothesis $H_1$, with a procedure that identifies the parts $\mathcal{D}_0$ and $\mathcal{D}_1$ of the domain where $H_0$ is true and false, respectively, and controls the type I error along the domain. Formally, assume that we observe a random sample of continuous functions $y_i(t)$, $t \in D$, $i = 1, \ldots, n$, possibly with attached functional or scalar covariates. For all $t \in D$, we denote by $H_0^t$ and $H_1^t$ the restrictions of $H_0$ and $H_1$ to point $t$. Assume that we can obtain a test statistic $T(t)$ for testing $H_0^t$ against $H_1^t$ at point $t$, where $H_0^t$ is rejected for large values of $T(t)$. Let $p(t)$ denote the $p$-value of the test at point $t$ based on $T(t)$ and data $\{y_i(t)\}_{i=1}^n$. Depending on the assumptions on the generative process of the functional data and on the sample size, $p(t)$ can be computed with parametric, asymptotic, or nonparametric tests.

Pointwise and setwise test properties
Below, we define some of the properties that are typically required for pointwise tests.
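Definition 2.1. We say that the pointwise test of $H_0^t$ against $H_1^t$ based on the statistic $T(t)$ with $p$-value $p(t)$ is
• valid if for all $\alpha \in (0, 1)$ and any $n \in \mathbb{N}^+$ the probability of rejecting $H_0^t$ at level $\alpha$ when it is true is smaller than or equal to $\alpha$, that is, $t \in \mathcal{D}_0 \Rightarrow \mathbb{P}[p(t) \leq \alpha] \leq \alpha$,
• asymptotically valid if for all $\alpha \in (0, 1)$ the probability of rejecting $H_0^t$ at level $\alpha$ when it is true is asymptotically smaller than or equal to $\alpha$, that is, $t \in \mathcal{D}_0 \Rightarrow \lim_{n \to \infty} \mathbb{P}[p(t) \leq \alpha] \leq \alpha$,
• consistent if for all $\alpha \in (0, 1)$ the probability of rejecting $H_0^t$ at level $\alpha$ when $H_0^t$ is false is asymptotically one, that is, $t \in \mathcal{D}_1 \Rightarrow \lim_{n \to \infty} \mathbb{P}[p(t) \leq \alpha] = 1$.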
Remark 2.1. In Definition 2.1, we specify in general terms $n \to \infty$. However, depending on the test that is performed, some more specific assumptions about the sample size may be required. For example, when performing a test comparing two populations, both sample sizes are required to go to infinity and not only the total sample size $n$.
Note that according to Definition 2.1, we allow for valid tests (with an error smaller than $\alpha$) rather than exact tests (with an error equal to $\alpha$), which is related to the use of permutation tests in our paper. We now introduce the following hypotheses defined on any set $S \subset D$: $H_0^S$ is the hypothesis that $H_0^t$ is true for all $t \in S$, while $H_1^S$ is the alternative that $H_1^t$ is true for some $t \in S$. Assume that tests of $H_0^S$ against $H_1^S$ are performed using the following statistic:
$$T^S = \frac{1}{|S|} \int_S T(t)\, dt, \qquad (1)$$
where the integral is defined in a Lebesgue sense and $|S|$ denotes the Lebesgue measure of $S$. Let $p^S$ be the corresponding $p$-value. We now provide the definitions of validity and consistency for the test on $S$.

Definition 2.2. For any $S \subseteq D$ such that $|S| > 0$, we say that the test of $H_0^S$ against $H_1^S$, based on the statistic $T^S$ in (1) with $p$-value $p^S$, is
• valid if for any $\alpha \in (0, 1)$ and for any $n \in \mathbb{N}^+$, the probability of rejecting $H_0^S$ at level $\alpha$ when it is true is smaller than or equal to $\alpha$, that is, $\mathbb{P}[p^S \leq \alpha] \leq \alpha$ under $H_0^S$,
with asymptotic validity and consistency defined analogously to Definition 2.1.
In the nonparametric permutation test framework, it is straightforward to build valid and consistent tests on sets from the corresponding pointwise tests under mild assumptions. Specifically, following Pesarin and Salmaso (2010, pp. 122-124), we know that if permutation tests are used and the same permutations are used for all points of the set, the (asymptotic) validity of the pointwise tests implies the (asymptotic) validity of the tests on sets. Further, if for all $t \in D$, $T(t)$ is nonnegative and stochastically greater under $H_1^t$ than under $H_0^t$, the consistency of the pointwise tests implies the consistency of the tests on sets.
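To illustrate how pointwise and setwise permutation tests can share the same permutations, the following minimal sketch computes both kinds of $p$-values for a two-sample comparison of means. It is not the authors' implementation: the function name `perm_pvalues`, the toy data, and the number of permutations are assumptions made here, curves are assumed to be observed on a common uniform grid, and the integral in the setwise statistic is approximated by the average of the pointwise statistic over the grid points of the set.

```python
import random


def perm_pvalues(group1, group2, sets, n_perm=999, seed=0):
    """Pointwise and setwise permutation p-values (illustrative sketch).

    group1, group2: lists of curves, each curve a list of values on a common grid
    sets: list of lists of grid indices, the subsets on which tests are performed
    """
    rng = random.Random(seed)
    pooled = group1 + group2
    n1, n_grid = len(group1), len(group1[0])

    def stats(curves):
        # pointwise statistic: squared difference between the two group means
        a, b = curves[:n1], curves[n1:]
        t_point = [
            (sum(c[j] for c in a) / len(a) - sum(c[j] for c in b) / len(b)) ** 2
            for j in range(n_grid)
        ]
        # setwise statistic: grid average of the pointwise statistic over each set
        t_set = [sum(t_point[j] for j in s) / len(s) for s in sets]
        return t_point, t_set

    obs_point, obs_set = stats(pooled)
    count_point = [1] * n_grid    # the observed labelling counts once
    count_set = [1] * len(sets)
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)     # one relabelling, reused for all points and sets
        perm_point, perm_set = stats(shuffled)
        for j in range(n_grid):
            count_point[j] += perm_point[j] >= obs_point[j]
        for k in range(len(sets)):
            count_set[k] += perm_set[k] >= obs_set[k]
    p_point = [c / (n_perm + 1) for c in count_point]
    p_set = [c / (n_perm + 1) for c in count_set]
    return p_point, p_set


# Hypothetical toy data: the second group is shifted on the second half of the grid.
data_rng = random.Random(1)
g1 = [[data_rng.gauss(0, 0.1) for _ in range(10)] for _ in range(8)]
g2 = [[data_rng.gauss(0, 0.1) + (5.0 if j >= 5 else 0.0) for j in range(10)]
      for _ in range(8)]
p_point, p_set = perm_pvalues(
    g1, g2, sets=[list(range(5)), list(range(5, 10))], n_perm=199
)
```

Because each relabelling is applied to whole curves, the pointwise and setwise permutation distributions are built from the same permutations, which is the condition under which validity carries over from points to sets.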

Domain selection
Suppose that we use the pointwise $p$-value $p(t)$ for selecting the parts of the domain responsible for the rejection of $H_0$ by thresholding at level $\alpha \in (0, 1)$, that is, the parts where $p(\cdot) < \alpha$. The probability that this selected region, or part of it, is wrongly selected is not controlled, since $p(t)$ is computed pointwise and cannot guarantee any control of the probability of committing at least one type I error over the whole domain. In multivariate statistical analysis, $p$-values are adjusted to provide global control of the type I error rate. Selection of the variables responsible for the rejection of the null hypothesis is performed by thresholding properly adjusted $p$-values instead of the original unadjusted ones. One type of adjustment strategy is controlling the FWER, that is, the probability of rejecting at least one true null hypothesis. Two classical types of control of the FWER have been defined in the literature: weak control of the FWER holds if the FWER is controlled when all null hypotheses are true, while strong control of the FWER holds if the FWER is controlled for any configuration of true and false null hypotheses. We introduce an analogous concept in FDA. We define strong control of the FWER of a test procedure based on an adjusted $p$-value function $\tilde{p}(t)$, $t \in D$ (cf. Equation (6)), as follows.
Definition 2.3. We say that a test procedure has a strong control of the FWER if for any $n \in \mathbb{N}^+$ its adjusted $p$-value function $\tilde{p}(t)$, $t \in D$, is such that, for all $\alpha \in (0, 1)$,
$$\mathbb{P}[\exists\, t \in \mathrm{cl}(\mathcal{D}_0) : \tilde{p}(t) \leq \alpha] \leq \alpha.$$
Here $\mathrm{cl}(\mathcal{D}_0)$ denotes the closure of the set $\mathcal{D}_0$. In some cases, we cannot provide control over all possible configurations of $\mathcal{D}_0$ and $\mathcal{D}_1$, but only over specific types of them. We therefore define such a type of intermediate control. Consider a family $\mathcal{F}$ of domain subsets, whose elements can be expressed as finite unions of closed compact subregions of $D$.

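Definition 2.4. We say that a test procedure has a control of the FWER restricted to family $\mathcal{F}$ if for all $n \in \mathbb{N}^+$ its adjusted $p$-value function $\tilde{p}(t)$, $t \in D$, is such that, for all $\alpha \in (0, 1)$ and all $S \in \mathcal{F}$,
$$S \subseteq \mathrm{cl}(\mathcal{D}_0) \Rightarrow \mathbb{P}[\exists\, t \in S : \tilde{p}(t) \leq \alpha] \leq \alpha.$$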
When $\mathcal{F}$ is the family of all possible subsets of $D$, the control defined above coincides with the strong one. Finally, analogously to the multivariate framework, if a procedure has a control of the FWER restricted to $\mathcal{F} = \{D\}$, we say that it has a weak control of the FWER. Some procedures may only have an asymptotic control of the FWER, that is, control of the FWER when $n \to \infty$. In the following, we formalize it for the restricted FWER.
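Definition 2.5. We say that a test procedure has an asymptotic control of the FWER restricted to family $\mathcal{F}$ if its adjusted $p$-value function $\tilde{p}(t)$, $t \in D$, is such that, for all $\alpha \in (0, 1)$ and all $S \in \mathcal{F}$,
$$S \subseteq \mathrm{cl}(\mathcal{D}_0) \Rightarrow \lim_{n \to \infty} \mathbb{P}[\exists\, t \in S : \tilde{p}(t) \leq \alpha] \leq \alpha.$$
Finally, we define the consistency of an inferential procedure, assuring that it asymptotically detects the parts of the domain where $H_1$ holds, that is, $\mathcal{D}_1$.
Definition 2.6. We say that the test procedure is consistent if its adjusted $p$-value function $\tilde{p}(t)$, $t \in D$, is such that, for all $\alpha \in (0, 1)$ and all $t \in \mathrm{Int}(\mathcal{D}_1)$,
$$\lim_{n \to \infty} \mathbb{P}[\tilde{p}(t) \leq \alpha] = 1,$$
where $\mathrm{Int}(\mathcal{D}_1)$ denotes the interior of the set $\mathcal{D}_1$.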
Remark 2.2. Since tests on subsets are performed using an integrated pointwise test statistic, deviations from the null hypothesis at only one point, or on a set of null Lebesgue measure, cannot be detected. In particular, the boundary of the set $\mathcal{D}_1$ cannot be detected, since it has null measure. Hence, strong control of the FWER is extended beyond $\mathcal{D}_0$ to the closure of the set $\mathcal{D}_0$, while consistency can be reached only for the interior of $\mathcal{D}_1$.

A UNIFIED FRAMEWORK
In this section, we describe a unified framework for testing local functional hypotheses on $D$, given a set of independent random functions. We present a class of methods that can be used to adjust the pointwise $p$-values $p(t)$ to provide a control of the FWER over specific families. Consider a nonempty (possibly infinite) family $\mathcal{F}$ of Lebesgue-measurable subsets of the domain $D$ of nonnull measure, such that $\cup_{S \in \mathcal{F}} S = D$. The testing procedure that we propose is based on performing tests on the restrictions of $H_0$ and $H_1$ to all subsets of the family and adjusting the $p$-value according to the results of such tests. First, we formally describe the testing procedure for a general $\mathcal{F}$ and provide a characterization of the inferential properties of the methods depending on the choice of $\mathcal{F}$. Then, we describe several methods that can be obtained for some particular choices of $\mathcal{F}$. The unified framework consists of the following steps (presented graphically in Web Appendix I):
1. Computation of $p$-values for all subsets. For all $S \in \mathcal{F}$, compute the $p$-value $p^S$ of the test of $H_0^S$ against $H_1^S$, based on the test statistic in (1).
2. Computation of the adjusted $p$-value function. For all $t \in D$, compute the adjusted $p$-value,
$$\tilde{p}(t; \mathcal{F}) = \sup_{S \in \mathcal{F} :\, t \in S} p^S. \qquad (6)$$
3. Domain selection. Select the subsets of $D$ where $H_0$ is rejected at level $\alpha \in (0, 1)$ as $\{t \in D : \tilde{p}(t; \mathcal{F}) \leq \alpha\}$.
In the following sections, we consider two types of families $\mathcal{F}$: a predefined type, where all subsets belonging to $\mathcal{F}$ are defined a priori, and a data-driven type, where the subsets belonging to the family depend on the data at hand. For clarity, we denote the predefined families by $\mathcal{F}^{\mathrm{pre}}$ and the data-driven ones by $\mathcal{F}^{\mathrm{data}}$.
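The adjustment and selection steps of the framework can be sketched in a few lines for a finite family on a discretized domain. This is only an illustration under stated assumptions: the function names `adjust_pvalues` and `select_domain`, the toy family, and its setwise $p$-values are hypothetical, the setwise $p$-values are taken as given (step 1), and the supremum over sets reduces to a maximum because the family is finite.

```python
def adjust_pvalues(n_grid, family, set_pvalues):
    """Adjusted p-value at each grid point: the largest p-value among the
    sets of the family that contain the point (illustrative sketch).

    The family is assumed to cover the whole grid; a point contained in
    no set would keep the initial value 0.0.
    """
    adjusted = [0.0] * n_grid
    for s, p in zip(family, set_pvalues):
        for j in s:
            adjusted[j] = max(adjusted[j], p)
    return adjusted


def select_domain(adjusted, alpha=0.05):
    """Grid indices where the null hypothesis is rejected at level alpha."""
    return [j for j, p in enumerate(adjusted) if p <= alpha]


# Hypothetical family on a grid of four points, with made-up setwise p-values.
family = [[0, 1, 2, 3], [0, 1], [2, 3], [1, 2]]
set_pvalues = [0.01, 0.30, 0.002, 0.04]
adjusted = adjust_pvalues(4, family, set_pvalues)
selected = select_domain(adjusted)
```

In this toy case the first two grid points inherit the large $p$-value of the set `[0, 1]`, so only the last two points are selected at level 0.05, even though the test on the whole grid is significant.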

PREDEFINED FAMILIES
In this section, we state properties of the test procedure described in Section 3 for predefined families, with proofs given in Web Appendices A and B.
Theorem 4.1. Let $\mathcal{F}^{\mathrm{pre}}$ be a predefined nonempty family of Lebesgue-measurable subsets of the domain $D$. Let $\tilde{p}(t; \mathcal{F}^{\mathrm{pre}})$, $t \in D$, be the adjusted $p$-value function in (6). If the tests of $H_0^S$ against $H_1^S$ are valid (asymptotically valid) for all $S \in \mathcal{F}^{\mathrm{pre}}$, then the test procedure based on $\tilde{p}(t; \mathcal{F}^{\mathrm{pre}})$, $t \in D$, has a control (asymptotic control) of the FWER restricted to the family $\mathcal{F}^{\mathrm{pre}}$.
Theorem 4.2. Let $\mathcal{F}$ be a nonempty family of Lebesgue-measurable subsets of the domain $D$. Assume that the cardinality of the family $\mathcal{F}$ is finite. Further, assume that all $S \in \mathcal{F}$ are either compact sets or finite unions of compact sets. If the tests of $H_0^S$ against $H_1^S$ are consistent for all $S \in \mathcal{F}$, the test procedure based on $\tilde{p}(t; \mathcal{F})$ in (6) is consistent.
Theorem 4.1 states that if the family is fixed, the probability of wrongly detecting a set where the null hypothesis is actually true is bounded by $\alpha$ for every set included in the family $\mathcal{F}^{\mathrm{pre}}$. Theorem 4.2 states the conditions under which the test procedure is consistent. Observe that the latter result is valid for both predefined and data-driven families $\mathcal{F}$.
The remainder of this section discusses test procedures for particular choices of predefined families $\mathcal{F}^{\mathrm{pre}}$, and theoretical properties of the corresponding adjustment procedures. We focus on the case when $D = [a, b]$, leaving the discussion about higher dimensions to Section 7.2.

Global testing
Suppose that the family consists only of the whole domain, $\mathcal{F}^{\mathrm{glob}} := \{D\}$. The corresponding test procedure performs one test over $D$ and assigns its $p$-value to all points of $D$, with $\tilde{p}(t; \mathcal{F}^{\mathrm{glob}}) \equiv p^D$ for all $t \in D$. From Theorem 4.1, it follows that if the test on $D$ is valid, this method has a weak control of the FWER. The consistency of the procedure follows directly from the consistency of the test. However, a global test cannot provide strong control of the FWER. Further, since the adjusted $p$-value function is constant, it cannot be used to select specific parts of the domain responsible for the rejection of the null hypothesis.

Borelwise testing
The Borelwise testing procedure (BWT) is based on the choice $\mathcal{F}^{\mathrm{BWT}} := \mathcal{B}(D)$, where $\mathcal{B}(D)$ denotes all Borel sets of nonzero measure of $D$. Borel subsets of zero measure are excluded since the test statistic (1) is not defined on such sets. The resulting procedure is the continuous extension of the closed testing procedure (see, e.g., Marcus et al., 1976) that has been proposed in multivariate analysis. If all tests are valid, Theorem 4.1 implies that the BWT has a strong control of the FWER. The adjusted $p$-value function for this method is constant, with $\tilde{p}(t; \mathcal{F}^{\mathrm{BWT}}) \geq \max_{t \in D} p(t)$ (Proposition 1 in Web Appendix B). Hence, the BWT is not consistent and cannot be used for domain selection.

Partition-closed testing
Assume that interest lies in performing tests on an a priori selected partition of the original domain. Let $\{S_k\}_{k=1}^K$ for some finite $K \in \mathbb{N}^+$ define the sets of the partition, satisfying $S_k \subseteq D$, $S_k \cap S_{k'} = \emptyset$ for all $k \neq k'$, and $\bigcup_{k=1}^K S_k = D$. Assume that $S_k$ is Lebesgue-measurable for all $k$. Then, the partition-closed testing procedure (PCT; Vsevolozhskaya et al., 2014) is the inferential procedure based on a family containing all possible unions of sets $S_k$, with $\mathcal{F}^{\mathrm{PCT},K} = \{\cup_{k \in \mathcal{K}} S_k\}_{\mathcal{K} \subseteq \{1, \ldots, K\}}$. From Theorem 4.1, it follows that the PCT procedure has a control of the FWER restricted to the family $\mathcal{F}^{\mathrm{PCT},K}$ when the tests are valid. For every finite $K$, the PCT method is consistent by Theorem 4.2 if the tests on subsets are consistent. Since the method is based on performing tests on unions of sets $S_k$, the adjusted $p$-value $\tilde{p}(t; \mathcal{F}^{\mathrm{PCT},K})$ is a stepwise constant function attaining the same value for all points belonging to the same element of the partition. If, for some $k$, we reject the null hypothesis on $S_k$, we only know that $S_k$ presents a statistically significant deviation from the null hypothesis in at least one of its points. With this method, it is not possible to decide which points within this subset are responsible for the rejection of $H_0$. The practical use of the method is highly dependent on the choice of $\{S_k\}_{k=1}^K$. Consider two uniform partitions of the domain $D$, the first of size $K_0$, $K_0 \in \mathbb{N}^+$, and the second of size $K_1 = m K_0$, for an arbitrary $m \in \mathbb{N}^+$, $m > 1$. By definition, the adjusted $p$-value function for the PCT method based on the partition of size $K_1$ cannot be smaller than the one corresponding to size $K_0$. Moreover, if at any $t_0 \in D$ the unadjusted $p$-value function is above the significance level, the corresponding adjusted $p$-value function increases with $K$, and at some point exceeds the significance level on the whole domain, resulting in no domain selection.
Note that if the measure of all elements of the partition goes to zero (as $K \to \infty$), the PCT and BWT methods coincide, and for $K = 1$ the PCT method coincides with global testing.
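The size of the PCT family is what limits the method in practice. As a rough sketch (the function name `pct_family` and the toy partition are assumptions made here, with each partition element represented by a list of grid indices), the family can be enumerated as all nonempty unions of partition elements, of which there are $2^K - 1$.

```python
from itertools import combinations


def pct_family(partition):
    """All nonempty unions of partition elements (illustrative sketch).

    partition: list of disjoint lists of grid indices covering the grid
    """
    family = []
    for r in range(1, len(partition) + 1):
        for combo in combinations(partition, r):
            family.append(sorted(j for block in combo for j in block))
    return family


# Hypothetical partition of a six-point grid into K = 3 elements.
partition = [[0, 1], [2, 3], [4, 5]]
family = pct_family(partition)

# Number of sets to test, 2^K - 1, for a few partition sizes.
sizes = {k: 2 ** k - 1 for k in (4, 10, 20)}
```

With $K = 3$ the family has 7 elements, while $K = 20$ already requires more than a million tests, which is why the approach is of practical relevance only for relatively small partitions.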

Intervalwise testing
IWT (Pini and Vantini, 2017) is based on performing a test on every interval of the (one-dimensional) domain. The method fits under the unified framework with the family $\mathcal{F}^{\mathrm{IWT}} = \{[t_1, t_2] : t_2 > t_1\}_{t_1, t_2 \in D}$. By Theorem 4.1, the test procedure has a control of the FWER restricted to $\mathcal{F}^{\mathrm{IWT}}$ when valid tests are used. The attained intervalwise control of the FWER is in between the weak and the strong control. Further, the pointwise test statistic is a continuous function, and the test statistic (1) is continuous with respect to the limits of integration. This implies that $\tilde{p}(t; \mathcal{F}^{\mathrm{IWT}})$ is continuous on $D$, providing us with a tool for domain selection. Similar methods can be defined by replacing intervals with more complex subsets. An apparently straightforward extension of IWT would be families that also include unions of intervals. However, such a generalization does not lead to a method with the desired properties. Indeed, for a fixed integer $M$, consider the testing procedure based on the family $\mathcal{F}^{\mathrm{IWT},M} = \{\cup_{m=1}^{M} [t_1^m, t_2^m] : t_2^m > t_1^m\}_{t_1^m, t_2^m \in D,\ m = 1, \ldots, M}$, that is, the family of all possible unions of at most $M$ disjoint intervals. It can be shown (see Proposition 2 in Web Appendix B) that, for all $M \geq 2$, the adjusted $p$-value function $\tilde{p}(t; \mathcal{F}^{\mathrm{IWT},M})$ is constant on $D$ and such that $\tilde{p}(t; \mathcal{F}^{\mathrm{IWT},M}) \geq \max_{t \in D} p(t)$, making the method unsuitable for domain selection. Furthermore, for all $M < \infty$, $\tilde{p}(t; \mathcal{F}^{\mathrm{IWT},M})$ is not provided with a finite-sample strong control of the FWER.
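A discretized version of the intervalwise adjustment can be sketched as follows. This is a simplified illustration, not the implementation of Pini and Vantini (2017): the function name `iwt_adjust` and the toy inputs are assumptions, intervals are represented by runs of consecutive grid points, the integral in (1) is replaced by a grid average, and the permuted pointwise statistics are assumed to have been computed beforehand with the same permutations at every grid point.

```python
def iwt_adjust(obs, perm):
    """Intervalwise adjusted p-values on a grid (illustrative sketch).

    obs:  observed pointwise statistic, one value per grid point
    perm: list of permuted pointwise statistics, one list per permutation
    """
    n_grid = len(obs)
    adjusted = [0.0] * n_grid
    for a in range(n_grid):
        for b in range(a, n_grid):
            width = b - a + 1
            # setwise statistic on the interval: grid average of the pointwise one
            t_obs = sum(obs[a:b + 1]) / width
            hits = sum(sum(row[a:b + 1]) / width >= t_obs for row in perm)
            p_interval = (1 + hits) / (1 + len(perm))
            # every point of the interval keeps the largest p-value seen so far
            for j in range(a, b + 1):
                adjusted[j] = max(adjusted[j], p_interval)
    return adjusted


# Hypothetical inputs: a strong effect on the last two grid points.
obs = [0.1, 0.2, 9.0, 8.0]
perm = [[1.0, 0.5, 2.0, 1.0], [0.3, 0.1, 0.4, 3.0], [0.05, 0.3, 0.5, 0.2]]
adjusted = iwt_adjust(obs, perm)
pointwise = [
    (1 + sum(row[j] >= obs[j] for row in perm)) / (1 + len(perm))
    for j in range(len(obs))
]
```

Since single grid points are included as degenerate intervals in this discretization, the adjusted values can never fall below the pointwise $p$-values. The number of intervals grows quadratically with the grid size, which hints at why the extension to multidimensional domains is computationally demanding.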

DATA-DRIVEN FAMILIES
Section 4 shows that in the case of predefined families it is not possible to guarantee both the possibility of performing domain selection and strong control of the FWER. In the following, we show that, with data-driven families, it is possible to identify families that provide an asymptotically strong control of the FWER while allowing for domain selection.

Thresholdwise testing
The thresholdwise testing (TWT) procedure performs tests on a family, $\mathcal{F}^{\mathrm{TWT},K}$, which is constructed based on the unadjusted $p$-value function, and is thus data dependent. The family is constructed in the following way. Analogously to the PCT, consider a partition of the domain $\{S_k\}_{k=1}^K$ and the corresponding family of subsets $\mathcal{F}^{\mathrm{PCT},K}$. We introduce the discretized version of the unadjusted $p$-value function as $p_K(t) = p^{S_{k^*}}$, where $k^*$ is such that $t \in S_{k^*}$, and thus $p_K(t)$, $t \in D$, is piecewise constant. The next step is to determine the family of subsets on which the tests are performed. In the PCT case, the family is $\mathcal{F}^{\mathrm{PCT},K}$ and we would perform $2^K$ tests. For the TWT procedure, we define a much smaller family $\mathcal{F}^{\mathrm{TWT},K}$, which is data dependent. It consists of the sublevel and superlevel sets of the discretized unadjusted $p$-value function. Formally,
$$\mathcal{F}^{\mathrm{TWT},K} = \big\{ \{t \in D : p_K(t) \leq x\},\ \{t \in D : p_K(t) \geq x\} \big\}_{x \in [0, 1]}.$$
From the construction of $p_K(t)$, $t \in D$, it is straightforward to see that $\mathcal{F}^{\mathrm{TWT},K} \subset \mathcal{F}^{\mathrm{PCT},K}$ and that the maximum number of elements in $\mathcal{F}^{\mathrm{TWT},K}$ is $2K$. With such a choice, the adjusted $p$-value function $\tilde{p}(t; \mathcal{F}^{\mathrm{TWT},K})$ as defined in (6) becomes $\tilde{p}(t; \mathcal{F}^{\mathrm{TWT},K}) = \max_{S \in \mathcal{F}^{\mathrm{TWT},K} :\, t \in S} p^S$. Here the supremum in definition (6) is replaced by a maximum since the discretized unadjusted $p$-value function is piecewise constant on a finite partition and hence attains only a finite number of levels.
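The construction of the data-driven family can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `twt_family` and the toy $p$-values are hypothetical, and the use of non-strict inequalities for both the sublevel and the superlevel sets is an assumption. Sets are represented by the indices of the partition elements they contain, and thresholds only need to range over the observed levels of the discretized $p$-value function.

```python
def twt_family(p_disc):
    """Sublevel and superlevel sets of a piecewise-constant p-value function
    (illustrative sketch).

    p_disc: one unadjusted p-value per partition element
    Returns sets of partition-element indices, without duplicates.
    """
    family = []
    for level in sorted(set(p_disc)):
        sublevel = [k for k, p in enumerate(p_disc) if p <= level]
        superlevel = [k for k, p in enumerate(p_disc) if p >= level]
        for s in (sublevel, superlevel):
            if s not in family:
                family.append(s)
    return family


# Hypothetical discretized p-values on a partition with K = 5 elements.
p_disc = [0.40, 0.01, 0.03, 0.40, 0.20]
family = twt_family(p_disc)
```

Here the family contains 7 sets, including the whole domain, compared with the $2^5 - 1 = 31$ nonempty unions that PCT would test on the same partition. Each set of the family would then be tested with the integrated statistic, and the adjusted $p$-value at a point taken as the maximum over the sets containing it.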
For finite $K$, and when the tests are valid, TWT has a weak control of the FWER, since $D \in \mathcal{F}^{\mathrm{TWT},K}$. Naturally, given the data, the TWT procedure with valid tests also provides a finite-sample control of the FWER restricted to $\mathcal{F}^{\mathrm{TWT},K}$. However, by definition, the family is data dependent, as the sets over which we control the error change between samples. The strength of the TWT is that control of the FWER restricted to $\mathcal{F}^{\mathrm{PCT},K}$ is attained asymptotically, for asymptotically valid and consistent tests (see Theorem 5.1). The proof of the theorem is given in Web Appendix A.
Theorem 5.1. Let $\mathcal{F}^{\mathrm{TWT},K}$ be the TWT family, based on the partition $\{S_k\}_{k=1}^K$. Assume that for all $S \in \mathcal{F}^{\mathrm{PCT},K}$, the tests of $H_0^S$ against $H_1^S$ are asymptotically valid and consistent. Then, the test procedure based on the adjusted $p$-value function $\tilde{p}(t; \mathcal{F}^{\mathrm{TWT},K})$ has an asymptotic control of the FWER restricted to $\mathcal{F}^{\mathrm{PCT},K}$.
The conditions of Theorem 4.2 are met if the tests are consistent for all $S \in \mathcal{F}^{\mathrm{TWT},K}$, since for finite $K$ the family is finite, and all subsets in the family are composed of a finite union of compact sets. This implies that the TWT procedure is consistent. The resolution of the domain selection process is related to the coarseness of the partition $\{S_k\}_{k=1}^K$, similarly to PCT. In both cases, the largest subset over which we can provide control is $\mathcal{D}_{0,K}$, which is the biggest set included in $\mathcal{D}_0$ that can be constructed as a union of elements of the partition. For finite $K$, $\mathcal{D}_{0,K}$ is possibly smaller than $\mathcal{D}_0$, so the control provided by TWT is weaker than the asymptotic strong control of the FWER. In practice, however, by refining the partition, the difference can be made arbitrarily small. In general, one would like to increase the value of $K$ in order to have a good approximation of the functional data and of the set $\mathcal{D}_0$ where the FWER is controlled, even though increasing the size of the partition can in principle decrease the power of the method, since a larger number of tests would be involved in the maximization. The effect of changing the partition size is explored in a simulation study described in Web Appendix F. It illustrates that when $K$ is sufficiently large for $\mathcal{D}_{0,K}$ to approximate $\mathcal{D}_0$ well, the method continues to have similar power for larger $K$.
As discussed earlier, increasing $K$ has a significantly negative effect on power and domain selection capability for the PCT method, due to the exponential number of tests performed. This is illustrated by the simulation study in Section 6, which compares the performance of all the methods within the unified framework described in Sections 4 and 5 as the sample size grows. The study confirms the already mentioned pros and cons of the methods and shows how the power (sensitivity) of all methods except BWT increases with the sample size. The TWT procedure is by construction more powerful than PCT, since $\mathcal{F}^{\mathrm{TWT},K} \subset \mathcal{F}^{\mathrm{PCT},K}$ and the number of tests increases linearly, making it suitable for high-resolution domain selection. The computational costs of TWT are not affected by the dimensionality of the domain. In the case of multidimensional domains, one only has to ensure that the partition can approximate the sets $\mathcal{D}_0$ and $\mathcal{D}_1$. This makes TWT naturally suited to deal with functional data defined on multidimensional domains or even on smooth manifolds (cf. Section 7.2). Alternative data-driven families can be constructed using preimages of the unadjusted $p$-value function, corresponding to a suitable family of subsets of the codomain $[0, 1]$. Such families can be shown to share the same asymptotic properties as the TWT method, and are discussed in Web Appendix C.

SIMULATION STUDIES
This simulation study has two aims. First, the performance of the methods within the general framework is compared in a finite sample setting. Second, we compare the performance of the TWT method with some additional methods provided in the literature.

Simulation model
For both simulation studies, the inferential problem at hand is the comparison of the means of two functional populations, and we utilize the same underlying model. We consider equal size samples of two groups: $y_{ij}(t) = \mu_j(t) + \varepsilon_{ij}(t)$, $i = 1, \ldots, n$, $j = 1, 2$, $t \in D = [0, 1]$.
The error functions $\varepsilon_{ij}(t)$ have zero mean and are independent between individuals and populations. We simulate them through the coefficients $c_{ij}$ of a cubic B-spline basis expansion with 81 basis functions $\phi_l(t)$, scaled by a standard deviation function $\sigma(t)$, where $c_{ij} \sim N(0, \Sigma)$. We assume that the basis coefficients are correlated according to a squared exponential covariance function of the scaled index distance $(l_1 - l_2)/80$, $l_1, l_2 = 1, \ldots, 81$.
In all simulations, we use $\mu_1(t) = 0$, while we consider multiple scenarios for $\mu_2(t) = \delta \tilde{\mu}_2(t)$, with varying $\delta$ representing the effect size and $\tilde{\mu}_2(t)$ representing the prototype for the mean. All prototypes are obtained using the same cubic B-spline basis, with coefficients that are sequences of zeroes and ones. In the first simulation study, we consider a division of the domain into two equisized parts, $\mathcal{D}_0$ and $\mathcal{D}_1$, using two scenarios. In the first scenario (1.A), $\mathcal{D}_0$ is an interval, while in the second scenario (1.B), $\mathcal{D}_0$ and $\mathcal{D}_1$ are composed of eight alternating intervals. In the second simulation study, we consider three scenarios (2.A, 2.B, and 2.C), where the domain is divided into two intervals $\mathcal{D}_0$ and $\mathcal{D}_1$, and we vary the proportion of the domain corresponding to $\mathcal{D}_0$. A summary of the parameters and their values for both studies is presented in Table 1. We test the two-sample mean equality hypothesis using permutation tests with the pointwise test statistic $T(t) = (\bar{y}_1(t) - \bar{y}_2(t))^2$. We compare the performance of the methods by estimating FWER, FDR, and sensitivity by their empirical counterparts based on 1000 simulated experiments. For details on the definitions and the estimates used, as well as details of the implementations, we refer to Web Appendix D.

6.1
Simulation study 1: Comparison of the methods within the unified framework
Figure 1 presents the dynamics of the estimated measures for $d = 2$ as a function of $n$, with $\alpha = 0.05$. As expected, the sensitivity of all the methods except BWT increases as $n$ increases. BWT is the only procedure that always controls the FWER. In practice, though, BWT does not detect any significant differences and hence is not of practical use. The IWT and PCT procedures control the FWER only if the underlying partition into $\mathcal{T}_0$ and $\mathcal{T}_1$ can be captured by the corresponding family of subsets, so the control they provide is not strong. In scenario 1.A, since the null hypothesis is true on an interval, IWT provides finite-sample control of the FWER. The interval can also be constructed using a partition defined by the PCT method with 4 or 10 partition elements, but not with 5. In scenario 1.B, none of the PCT partitions separates $\mathcal{T}_0$ from $\mathcal{T}_1$, and therefore no control is provided. TWT is the only method that possibly allows the selection of portions of the domain and provides asymptotically strong control of the FWER. This control is reached here for a reasonably small sample size (i.e., $n \approx 30$), which further supports its possible usefulness in statistical practice. Finally, as expected from theory, the FDR is controlled by all procedures controlling the FWER. Since the FDR is generally lower than the FWER, in a few cases procedures not controlling the FWER control the FDR instead (e.g., IWT in scenario 1.B), even though this is not supported by theory and could be a consequence of the parameter choice. The results presented are inherently dependent on the effect size used in the simulation studies. In Web Appendix E, we present the effect of changing the effect size on the performance of the methods. As expected, increasing the effect size speeds up the convergence of TWT to the asymptotic strong control of the FWER, while lowering it implies that larger sample sizes are required to attain the control.

FIGURE 2 Results for simulation study 2 with effect size $d = 2$ and constant standard deviation function. Examples of $n = 15$ sample functions from both populations (distinguished by color) are presented together with their corresponding mean functions (first column). Effect of increased sample size on the estimated FWER (second column), FDR (third column), and sensitivity (fourth column) for the compared methods in the three scenarios with different portions of $\mathcal{T}_0$ and $\mathcal{T}_1$. Scenarios 2.A, 2.B, and 2.C correspond to 25%, 50%, and 75% of the domain corresponding to $\mathcal{T}_0$, respectively. Line colors correspond to different methods, while line types correspond to different values of the parameter $\gamma$ in the RFT method. The dashed horizontal line corresponds to the nominal level $\alpha = 0.05$. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.
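The performance measures reported for the simulation studies can be estimated from the simulated rejection regions roughly as follows. This is a sketch under assumptions: it uses the usual grid-based definitions (FWER as the share of experiments with at least one rejection where the null hypothesis holds, FDR as the average proportion of false rejections among all rejections, sensitivity as the average proportion of the non-null region that is rejected), which may differ in detail from the estimators defined in Web Appendix D. The function name and the argument layout are hypothetical.

```python
import numpy as np

def error_measures(rejected, null_mask):
    """Empirical FWER, FDR, and sensitivity over simulated experiments.

    rejected:  boolean array (n_experiments, n_grid), True where H0 is rejected.
    null_mask: boolean array (n_grid,), True on the part of the grid where
               the null hypothesis is true.
    """
    rejected = np.asarray(rejected, dtype=bool)
    null_mask = np.asarray(null_mask, dtype=bool)
    false_rej = rejected & null_mask
    true_rej = rejected & ~null_mask
    n_rej = rejected.sum(axis=1)
    # False discovery proportion per experiment, set to 0 when nothing is rejected.
    fdp = false_rej.sum(axis=1) / np.maximum(n_rej, 1)
    return {
        "fwer": false_rej.any(axis=1).mean(),
        "fdr": fdp.mean(),
        "sensitivity": true_rej.sum(axis=1).mean() / (~null_mask).sum(),
    }
```

With an equally spaced grid, counting grid points is proportional to measuring the length of the corresponding part of the domain, which is why plain counts are used here.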

6.2
Simulation study 2: Comparison with alternative methods
In this study, we compare the performance of TWT, as a member of our framework, with some alternative methods presented in the literature. We consider a method introduced in Cox and Lee (2008) aiming at control of the FWER using the permutational distribution of the minimum p-value (p-min). We also consider two methods controlling the FDR: the functional Benjamini-Hochberg (fBH) method introduced in Olsen et al. (2021) and the method proposed in Perone Pacifico et al. (2004) based on random field theory (henceforth denoted RFT). The RFT method includes a parameter $\gamma \in (0, \alpha)$ which, while keeping the FDR control at level $\alpha$, affects the power of the resulting procedure. In the simulation studies, we compare the performance of the method for two distinct values of this parameter ($0.1\alpha$ and $0.9\alpha$). Here we present the effect of varying the size of $\mathcal{T}_0$ and $\mathcal{T}_1$, with the effect size $d = 2$ and constant variance. Figure 2 shows that in this scenario, as $n$ increases, the strong FWER control is attained asymptotically by all methods except fBH, while the FDR is controlled by all methods. The RFT method is sensitive to the choice of the parameter $\gamma$, and even though the FDR is always controlled, the FWER control is not guaranteed for the higher value of $\gamma$ for smaller samples. The p-min method controls the FWER in all cases. However, recent studies (Mrkvička et al., 2022) have shown that for high-dimensional data the power of the method decreases drastically. After reaching FWER control, TWT shows a sensitivity similar to that of p-min. Additional scenarios are considered in Web Appendix E, where we study the effect of variance heterogeneity and effect size. In general, we see the expected effect of the signal-to-noise ratio on all of the methods, and the main conclusions remain unchanged.
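To make the idea behind the p-min method concrete, the sketch below implements a single-step adjustment based on the permutation distribution of the minimum pointwise p-value. It is an illustration under assumptions, not the implementation of Cox and Lee (2008): the function name `min_p_adjust`, the input layout (a matrix of test statistics with the observed values in the first row), and the treatment of ties are choices made for this sketch.

```python
import numpy as np

def min_p_adjust(stats):
    """Single-step minimum p-value adjustment on a grid.

    stats: array (B + 1, n_grid); row 0 holds the observed test statistics,
           rows 1..B the statistics computed on permuted data.
    Returns the unadjusted and the adjusted p-values of the observed data.
    """
    stats = np.asarray(stats, dtype=float)
    n_rows, n_grid = stats.shape
    sorted_stats = np.sort(stats, axis=0)
    pvals = np.empty((n_rows, n_grid))
    for g in range(n_grid):
        # Number of rows with a strictly smaller statistic at grid point g.
        n_smaller = np.searchsorted(sorted_stats[:, g], stats[:, g], side="left")
        # Pointwise p-value of every row: share of rows at least as extreme.
        pvals[:, g] = (n_rows - n_smaller) / n_rows
    # Minimum p-value over the domain, for the observed data and each permutation.
    min_p = pvals.min(axis=1)
    # Adjusted p-value: share of rows whose minimum p-value is at most p(t).
    adjusted = (min_p[:, None] <= pvals[0][None, :]).mean(axis=0)
    return pvals[0], adjusted
```

Because the adjustment compares each pointwise p-value with the minimum over the whole grid, the adjusted values can never fall below the unadjusted ones, and the gap tends to widen as the number of grid points grows.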

7
REAL DATA APPLICATIONS

7.1
Knee kinematic data
Our simulation study of methods within the unified framework is complemented by the analysis of one-dimensional kinematic data, illustrating how the detected regions can differ when different methods are applied.
The results together with a discussion are presented in Web Appendix G.

7.2
Analysis of diffusion magnetic resonance imaging data
In what follows, we compare the regions detected by the methods presented in the second simulation study on diffusion magnetic resonance imaging (MRI) data. A brain image is a complex spatial domain since it is a subset of $\mathbb{R}^3$ with a complex shape. In this application, the complex domain is defined by the voxels (three-dimensional pixels of the imaged brain) that are intersected by the so-called corpus callosum (CC), which is the set of axons connecting the two hemispheres of the brain. The CC axons form a bundle that defines a two-dimensional manifold in $\mathbb{R}^3$ (see Web Figure 10 for an example).
The CC axons are intrinsically an anisotropic environment since axons can be broadly viewed as cylinders. In particular, in this study we focus on fractional anisotropy (FA), an index measuring the degree of anisotropy along brain tracts, which has been widely adopted as a proxy for quantifying axonal damage (Horsfield and Jones, 2002; Assaf and Pasternak, 2008). FA is typically quantified with one of two approaches: the earlier one is a single-tensor model (STM; Basser et al., 1994) consisting of a single anisotropic component, and the more complex one is a multicompartment model (MCM; Panagiotaki et al., 2012) incorporating an additional isotropic component related to free water.
Here we aim to demonstrate that improving upon STM by using MCM does result in a significantly lower population variance of FA. To achieve this goal, we processed diffusion MRI data of 30 healthy subjects from the Human Connectome Project (Van Essen et al., 2013) to obtain a reconstruction of the CC of each subject using both STM and MCM. We chose the CC because its reconstruction is relatively easy. Finally, we defined the common domain of the CC as the set of all the voxels of size 1.25 mm³ that were intersected by the CCs of all the 30 healthy subjects, which provided us with a domain of 950 voxels in three dimensions lying on a two-dimensional manifold. For more details on the STM and MCM models and the method we used for fitting them, see Web Appendix H.
To test the stability of FA, we hypothesize that the population variance should be lower when using the more complex MCM than when using the STM. We therefore perform a paired one-tailed permutation test using the variance ratio as the test statistic. Domain selection is of paramount importance in brain applications where we need spatial localization of the differences. We can achieve domain selection via TWT based on a discretized unadjusted p-value function evaluated on the CC voxels. For completeness, we also included all methods evaluated in the second simulation study, that is, p-min, RFT with the two choices $\gamma = 0.1\alpha$ and $\gamma = 0.9\alpha$, and fBH. All methods were applied to the same discrete evaluation of the data on 950 voxels, and p-values were evaluated using 5000 permutations. Figure 3 reports the regions of the brain where significant differences are observed by the considered methods at $\alpha = 0.05$. First of all, note that the p-min method does not detect any significant difference. This is due to the drastic decrease in the power of this method for high-dimensional data (Mrkvička et al., 2022). Only when the number of permutations is increased to 10,000 does the method start detecting some differences. We would expect to obtain more significant differences when increasing the number of permutations further, at the cost of a significant increase in computational time. TWT instead detects a large region, which is comparable to the one detected by fBH and substantially larger than the one detected by the RFT method with both choices of $\gamma$. This latter result could be related to the lower power of the RFT method with respect to fBH also observed in simulation study 2. Finally, note that even though the regions detected with TWT-adjusted p-values and unadjusted p-values are very similar, TWT performs a substantial adjustment of the p-values, which can be seen in Web Figure 11.
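A paired one-tailed permutation test with the variance ratio as test statistic could be sketched as follows. The details are assumptions of this sketch rather than the authors' implementation: the ratio is taken as the STM variance over the MCM variance, so that large values support a lower variance under MCM; permutations swap the two reconstructions within a subject and apply the same swap to all voxels of that subject; and the observed labelling is counted among the permutations. The function name and argument layout are hypothetical.

```python
import numpy as np

def paired_variance_ratio_pvalues(x_stm, x_mcm, n_perm=5000, seed=0):
    """Unadjusted one-tailed p-values for H1: var(MCM) < var(STM), per voxel.

    x_stm, x_mcm: arrays (n_subjects, n_voxels) with FA values of the same
                  subjects under the two models.
    """
    rng = np.random.default_rng(seed)

    def ratio(a, b):
        # Pointwise ratio of the sample variances across subjects.
        return a.var(axis=0, ddof=1) / b.var(axis=0, ddof=1)

    observed = ratio(x_stm, x_mcm)
    # Start the count at one so that the observed labelling is included.
    exceed = np.ones_like(observed)
    n_subjects = x_stm.shape[0]
    for _ in range(n_perm):
        # Swap the two measurements of a subject with probability 1/2,
        # using the same swap for every voxel of that subject.
        swap = rng.random(n_subjects) < 0.5
        a = np.where(swap[:, None], x_mcm, x_stm)
        b = np.where(swap[:, None], x_stm, x_mcm)
        exceed += ratio(a, b) >= observed
    return exceed / (n_perm + 1)
```

Swapping whole subjects rather than individual voxels keeps the spatial dependence of each FA map intact, which is what any subsequent adjustment over the 950 voxels relies on.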
The TWT approach identifies two symmetric areas (one in each brain hemisphere) where the FA variance cannot be claimed to be significantly lower in the MCM than in the STM. This is very interesting from a neurological perspective because these two areas are precisely the regions where the CC tract crosses two other well-known tracts, namely the superior longitudinal fasciculus and the pyramidal tract. This shows that in these regions the introduction of the free water-related isotropic component is not sufficient to reduce the population variance. Hence, the addition of a second anisotropic component would possibly be needed to model the additional tracts.

FIGURE 3 Voxels where the null hypothesis of equality of the variances of the two populations is rejected (green) and not rejected (gray) according to the different methods. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.
The running time was about 4 min for TWT, 3 min for each of the RFT methods, and 7 min for p-min; the timing was evaluated on a 2.6-GHz quad-core i7 processor with 16 GB of 2133-MHz LPDDR3 RAM and a 512-GB SSD.

8
CONCLUSIONS
In this paper, we introduce a general framework for local inference for functional data, where subsetwise test procedures on the functional data perform domain selection while controlling the FWER. We investigate the properties of the test procedures (methods) within the framework. The test procedures are based on two types of families of subsets of the domain: predefined families, which appear in the existing literature, and data-driven families, which are proposed in this paper. We show that some serious practical limitations of the methods based on the predefined families can be overcome with the data-driven families. The possibility of selecting significant regions in a possibly complex domain, while retaining asymptotic FWER control restricted to a family generated by the predefined data resolution, is presented and illustrated in two application-focused examples.

ACKNOWLEDGMENTS
This work was supported by the Swedish Research Council (grant numbers 2016-02763 and 340-2013-5203). We are grateful to the associate editor and a reviewer for their valuable input. We are also grateful to Charlotte K. Häger for providing the kinematic data in Section 7.1. Data in Section 7.2 were provided by the Human Connectome Project (Van Essen et al., 2013).

DATA AVAILABILITY STATEMENT
Knee kinematic data supporting the findings of this paper (Section 7.1) are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions. MRI data supporting the findings of this paper (Section 7.2) are openly available in the Human Connectome Project, WU-Minn Consortium (principal investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.

SUPPORTING INFORMATION
Web Appendices and Figures referenced in Sections 4-7 are available with this paper at the Biometrics website on Wiley Online Library. R code implementing the proposed TWT method is available on GitHub at https://github.com/astamm/fdatest. R code for reproducing the simulated results is also available at the Biometrics website on Wiley Online Library.