Screening of Annual Rainfall Time-Series Data in Kala Oya Basin: Case Study in Sri Lanka

Hydrometeorological data screening is an essential analysis conducted before such data are used for modelling and designing water development schemes, as inconsistencies can arise during data collection or data entry. Various authors have adopted many methods to screen data; this paper presents a prominent method that has been applied to the annual rainfall time-series of the Kala Oya basin. The adequacy of the number of rainfall gauging stations was checked by statistical analysis, and outliers were checked using the percentage significance level outlier constant for the normal distribution. Trend and absence of persistence were tested by the Mann-Kendall test and the first serial correlation coefficient method, respectively. Homogeneity of the region was checked with two statistical tests based on L-moments of the time-series data: the discordancy measure (Di) and the heterogeneity measure (Hi). The results confirmed that the selected annual time-series data can be considered homogeneous and consistent, with minor deviations.


Introduction
The longer a time-series, the greater the chance that it is non-stationary, non-consistent or non-homogeneous. Hydrological time-series data can contain errors or unconformities, arising from natural or man-made causes, that lead to inconsistencies and non-homogeneities. Data-screening prior to modelling and designing of water development schemes is therefore an essential requirement. Unscreened data may produce false estimates and, hence, flawed designs. Decisions based on such estimates or designs can cause irreversible damage to the environment, wildlife and humans. The importance of data-screening of time-series has been addressed by several researchers. Hosking and Wallis [3] reported that homogeneity measures provide an initial screening of data and indicate sites where the data may merit close examination. The aim of this paper is to elaborate on the complete data-screening procedure and apply it to the Kala Oya basin, Sri Lanka. Seventeen rainfall gauging stations were selected for this study and, for each station, the annual rainfall time-series was taken through the data-screening process. The selected time-series data cover 1985 to 2018. Reliable data support reliable hydrological studies and enhance the quality of the results. The objective of this study is to obtain reliable rainfall time-series for hydrological studies of the Kala Oya basin.

Study Area
The Kala Oya originates in the central mountains of Dambulla at an elevation of about 870 metres above mean sea level. It is the third longest river in Sri Lanka. The river rises in the Central Province, flows through the North Central Province and reaches the sea in the North Western Province at a place called Gangewadiya, which belongs to Wilpattu national wildlife park. The Kala Oya watershed lies in four administrative districts, namely Anuradhapura, Matale, Kurunegala and Puttalam. Location details of the Kala Oya basin are presented in Figure 1.

Methodology
The methodology adopted to screen the data starts with the selection of an adequate number of stations for the study. Exploratory analysis of the time-series data was done by plotting the time-series graphically. High and low outliers of the annual rainfall time-series were calculated using the percentage significance level outlier constant for the normal distribution.

Figure 1 - Kala Oya Basin Detail
The modified Mann-Kendall test was used to detect the short-term trend of the annual time-series. Absence of persistence was examined by the first serial correlation coefficient method. If all the above criteria were satisfied, homogeneity of the region was analysed with two statistical tests based on L-moments of the time-series. The L-moments were estimated from probability weighted moments. If the time-series data of the selected stations satisfy all criteria, the data can be used for basin hydrological studies. If a data series does not satisfy the criteria, the time-series needs to be replaced or corrected using a reliable correction procedure. Once changed, the time-series should undergo the screening procedure again.
This step-by-step approach forms the complete data-screening procedure, which is illustrated by the flow chart in Figure 2.

3.1 Check for Adequacy
The basic data-screening procedure begins with a check for adequacy of the rain gauging stations inside the selected basin. The total number of stations inside the Kala Oya basin is nine. The first adequacy check showed that nine stations are not sufficient for the study. Hence the number of stations was increased by adding a 10 km buffer zone. The selected stations and the buffer zone for the study basin are presented in Figure 3. Subramanya [1] provides a statistical method to obtain the optimal number of stations that should exist for a study area. The statistical test variables can be written as:

$$N = \left(\frac{C_v}{\varepsilon_{ex}}\right)^2$$

$$C_v = \frac{100\,\sigma_{m-1}}{\bar{P}}$$

$$\sigma_{m-1} = \sqrt{\frac{\sum_{i=1}^{m}\left(P_i - \bar{P}\right)^2}{m-1}}, \qquad \bar{P} = \frac{1}{m}\sum_{i=1}^{m} P_i$$

where Cv is the coefficient of variation (in percentage), εex the expected error (in percentage), σm-1 the standard deviation of the annual time-series data, P̄ the mean precipitation, Pi the precipitation at station i, m the number of stations selected and N the optimal number of stations required for the study.
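As an illustration of the adequacy check, the following Python sketch computes Cv and the optimal station count from a list of station rainfall values. It assumes the relation N = (Cv/εex)² attributed to Subramanya [1]; the function names and any sample values are hypothetical and are not taken from the study data.

```python
import math
from statistics import mean, stdev


def station_adequacy(station_rainfall, allowed_error_pct):
    """Return (Cv in percent, optimal station count N).

    station_rainfall: one representative annual rainfall value per station.
    allowed_error_pct: expected error eps_ex, in percent.
    """
    p_bar = mean(station_rainfall)           # mean precipitation over the m stations
    sigma = stdev(station_rainfall)          # sample standard deviation (m - 1 divisor)
    cv = 100.0 * sigma / p_bar               # coefficient of variation, percent
    n_opt = (cv / allowed_error_pct) ** 2    # assumed relation N = (Cv / eps_ex)^2
    return cv, math.ceil(n_opt)


def error_for_station_count(cv_pct, n_stations):
    """Invert the same relation: percentage error implied by n_stations."""
    return cv_pct / math.sqrt(n_stations)
```

With Cv = 26.49 and nine stations, `error_for_station_count` gives about 8.83%, which agrees with the standard error reported for the nine stations inside the basin.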

3.2 Outlier and Unconformity Check
An outlying observation, or "outlier", is one that appears to deviate markedly from other members of the sample in which it occurs [2]. An outlying observation may be due to an extreme climatic event. If so, the value should be retained and used in the studies like any other observation. Alternatively, an outlying observation may be the result of an error in collecting or recording the numeric value. In such cases, it may be desirable to examine the outliers to identify the aberrant values. An aberrant value may eventually be rejected to maintain the reliability of the data series. Outliers of the annual rainfall time-series data were checked by estimating the high and low outlier limits using the percentage significance level outlier constant (K) for the normal distribution. The constant K was selected according to the data length of each time-series. The high and low outlier criteria can be written as:

$$X_H = X_{mean} + K \cdot Std_x$$

$$X_L = X_{mean} - K \cdot Std_x$$

where XH and XL are the high and low outlier limits, Xmean is the mean value of the data series, Stdx is the standard deviation of the data series and K is the constant selected according to data length, as suggested by Grubbs and Beck [2]. When the data length is 33, K was taken as 2.79; it varies with the data length.

3.3 Trend Analysis
The Mann-Kendall (MK) test with tie correction was used to detect monotonic trend in the annual rainfall time-series [4]. For the observed annual rainfall time-series X = x1, x2, x3, ..., xn, the MK statistic S was estimated as follows:

$$S = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \operatorname{sgn}(\theta)$$

$$\operatorname{sgn}(\theta) = \begin{cases} 1 & \text{if } \theta > 0 \\ 0 & \text{if } \theta = 0 \\ -1 & \text{if } \theta < 0 \end{cases}$$

where θ = xj - xi. For n ≥ 8, S is approximately normally distributed with:

$$E(S) = 0, \qquad Var(S) = \frac{n(n-1)(2n+5) - \sum_{p=1}^{q} t_p (t_p - 1)(2 t_p + 5)}{18}$$

The tie correction is the summation term, in which q is the number of tied groups and tp is the number of observations in the p-th tied group. The standardised test statistic is:

$$Z_M = \begin{cases} \dfrac{S-1}{\sqrt{Var(S)}} & \text{if } S > 0 \\ 0 & \text{if } S = 0 \\ \dfrac{S+1}{\sqrt{Var(S)}} & \text{if } S < 0 \end{cases}$$

ZM is standard normally distributed with zero mean and unit variance. If the calculated ZM lies between -1.96 and 1.96, the null hypothesis of no trend is not rejected at the 5% significance level.
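The trend test described above can be sketched in Python as follows. This is a minimal illustration that assumes the standard form of the Mann-Kendall statistic with tie correction and a continuity-corrected Z; the function name is hypothetical and the code is not the implementation used in the study.

```python
import math
from collections import Counter


def mann_kendall(x):
    """Return (S, Var(S), Z) for the Mann-Kendall test with tie correction."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = x[j] - x[i]
            s += (diff > 0) - (diff < 0)     # sgn(x_j - x_i)

    # Tie correction: each group of t equal values contributes t(t-1)(2t+5);
    # untied values (t = 1) contribute zero.
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in Counter(x).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18.0

    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, var_s, z
```

A series would be reported as having no trend at the 5% significance level when the returned Z lies between -1.96 and 1.96.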

3.4 Absence of Persistence Analysis
Time-series of yearly and seasonal totals are usually independent. However, extreme rainfall events may create aberrant values. Such observations should be discarded or, if a reliable correction procedure is available, corrected and retained. Hence it is essential to test the time-series for independence. The serial-correlation coefficient helps to confirm the independence of a time-series. For this study, it is sufficient to compute the lag 1 serial-correlation coefficient, i.e. the correlation between adjacent observations in a time-series. The first serial-correlation coefficients of the annual time-series of the 17 stations were estimated as:

$$r_1 = \frac{\sum_{i=1}^{n-1}\left(x_i - \bar{x}\right)\left(x_{i+1} - \bar{x}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

The estimated correlation coefficient r1 was checked at the 5% significance level, where the upper and lower limits of r1 are defined as:

$$r_{1upper} = \frac{-1 + 1.96\sqrt{n-2}}{n-1}, \qquad r_{1lower} = \frac{-1 - 1.96\sqrt{n-2}}{n-1}$$

Under the null hypothesis of independence, r1 is approximately normally distributed, which gives these limits. If the calculated r1 lies between r1lower and r1upper, the null hypothesis of no persistence is not rejected at the 5% significance level.
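A short Python sketch of the persistence check is given below. It assumes the usual lag-1 serial correlation estimator and Anderson's two-sided 5% limits, (-1 ± 1.96√(n-2))/(n-1); both the choice of limits and the function names are assumptions made for illustration.

```python
import math


def lag1_serial_correlation(x):
    """Lag-1 serial correlation coefficient r1 of a sequence."""
    n = len(x)
    x_bar = sum(x) / n
    num = sum((x[i] - x_bar) * (x[i + 1] - x_bar) for i in range(n - 1))
    den = sum((v - x_bar) ** 2 for v in x)
    return num / den


def anderson_limits(n, z=1.96):
    """Assumed (lower, upper) limits for r1 under independence."""
    half_width = z * math.sqrt(n - 2)
    return (-1 - half_width) / (n - 1), (-1 + half_width) / (n - 1)


def is_independent(x):
    """True when r1 lies within the assumed limits (no persistence detected)."""
    lower, upper = anderson_limits(len(x))
    return lower <= lag1_serial_correlation(x) <= upper
```

The limits depend only on the record length, so a single pair of limits applies to all stations that share the same number of years of data.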

3.5 Homogeneity Check
For a basin hydrological study, all the sites located in the basin must have homogeneous time-series. For this reason, the homogeneity of the time-series needs to be tested. Hosking and Wallis [3], [5] introduced two statistical tests based on L-moments of the time-series data to check the homogeneity of the data series. The first statistic is the discordancy measure (Di) [3], [5], [6], [7]. The discordancy index Di is defined by:

$$D_i = \frac{1}{3}\left(u_i - \bar{u}\right)^T S_D^{-1}\left(u_i - \bar{u}\right)$$

where i is the site number and ui is the vector of the sample L-moment ratios t, t3 and t4 of the i-th site. Then ui is defined as:

$$u_i = \left[t^{(i)},\; t_3^{(i)},\; t_4^{(i)}\right]^T$$

ū is the average L-moment vector of the ns sites (ns being the number of sites) and can be written as:

$$\bar{u} = \frac{1}{n_s}\sum_{i=1}^{n_s} u_i$$

SD is the sample covariance matrix and can be defined by:

$$S_D = \frac{1}{n_s - 1}\sum_{i=1}^{n_s}\left(u_i - \bar{u}\right)\left(u_i - \bar{u}\right)^T$$

For any site, if Di ≥ 3, the site is considered discordant [3], [5], [6], [7]. The second statistic is the heterogeneity measure (Hi) [3], [5], [6], [7]. The heterogeneity measure is defined by three statistics, H1, H2 and H3. Each statistic is estimated as follows:

$$H_k = \frac{V_k - \mu_{V_k}}{\sigma_{V_k}}, \qquad k = 1, 2, 3$$

where, for the first statistic, V1 is the weighted standard deviation of the sample values of t:

$$V_1 = \sqrt{\frac{\sum_{i=1}^{n_s} n_i\left(t^{(i)} - \bar{t}\right)^2}{\sum_{i=1}^{n_s} n_i}}$$

Here ns is the number of sites and ni is the record length of each site. t̄ is the average of the t(i) values, defined as:

$$\bar{t} = \frac{\sum_{i=1}^{n_s} n_i\, t^{(i)}}{\sum_{i=1}^{n_s} n_i}$$

μV and σV are the mean and standard deviation of V obtained from artificially generated data using the four-parameter Kappa distribution. The H1, H2 and H3 statistics indicate the degree of heterogeneity in a group of sites: the region is reasonably homogeneous if Hi < 1, fairly homogeneous if 1 ≤ Hi ≤ 2, and absolutely heterogeneous if Hi > 2 [3], [5], [6], [7].
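The L-moment ratios that feed both homogeneity statistics can be estimated from probability weighted moments. The Python sketch below assumes the unbiased sample estimators b0 to b3 and the weighted dispersion of the L-CV values used in H1. The function names are hypothetical, and the Kappa-distribution simulation needed for μV and σV is deliberately left out.

```python
import math


def sample_l_moment_ratios(series):
    """Return (l1, t, t3, t4): mean, L-CV, L-skewness and L-kurtosis.

    Uses unbiased probability weighted moments; needs at least 4 values.
    """
    x = sorted(series)
    n = len(x)
    b0 = b1 = b2 = b3 = 0.0
    for j, value in enumerate(x):            # j = rank - 1 in the sorted sample
        b0 += value
        b1 += value * j / (n - 1)
        b2 += value * j * (j - 1) / ((n - 1) * (n - 2))
        b3 += value * j * (j - 1) * (j - 2) / ((n - 1) * (n - 2) * (n - 3))
    b0, b1, b2, b3 = b0 / n, b1 / n, b2 / n, b3 / n

    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2 / l1, l3 / l2, l4 / l2


def regional_v(l_cvs, record_lengths):
    """Record-length-weighted standard deviation of the site L-CV values."""
    total = sum(record_lengths)
    t_bar = sum(n * t for n, t in zip(record_lengths, l_cvs)) / total
    weighted_sq = sum(n * (t - t_bar) ** 2 for n, t in zip(record_lengths, l_cvs))
    return math.sqrt(weighted_sq / total)
```

The heterogeneity statistic would then follow as (V - μV)/σV once μV and σV have been obtained from simulated regions.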

4.1 Check for Adequacy
During the check for adequacy of rainfall stations, it was found that when the number of stations is taken as nine (the actual number of stations inside the basin), the standard error estimated using equations 1 and 2 is 8.83% (σm-1 obtained as 336.97 and Cv obtained as 26.49). When εex is 8.83, the optimum number of stations calculated is 17. Accordingly, a 10 km buffer zone from the basin boundary was considered to bring more stations into the study. In addition, another 2 stations close to the 10 km buffer zone boundary were selected to obtain a uniform distribution of stations over the study area and to increase the total number of stations to 17. Equations 1, 2 and 3 were then used again to estimate the optimal number of stations (N): σm-1 was obtained as 52.88 and εex as 4.16, with a standard error of 1.0%.

4.2 Outlier and Unconformity Check
The estimated high and low outlier limits for the 17 stations were plotted with the annual time-series data to detect outliers. The high and low outlier margins estimated for the Mahailuppallama time-series are plotted with the annual time-series data in Figure 4. The plots confirmed that no unconformities or considerable outliers are present in the annual time-series of any of the 17 stations. Table 2 shows the high and low outlier limits estimated for the 17 annual time-series, together with their maximum and minimum values.
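The outlier margins can be illustrated with a small Python sketch. It assumes the limits take the form mean ± K × standard deviation, with the sample standard deviation (n - 1 divisor) and a K value supplied by the user from the Grubbs and Beck [2] table. The function names are hypothetical.

```python
from statistics import mean, stdev


def outlier_limits(series, k):
    """Return (high_limit, low_limit) = mean +/- k * sample standard deviation."""
    x_mean = mean(series)
    std_x = stdev(series)                    # assumption: n - 1 divisor
    return x_mean + k * std_x, x_mean - k * std_x


def flag_outliers(series, k):
    """Return the values that fall outside the high and low outlier limits."""
    high, low = outlier_limits(series, k)
    return [v for v in series if v > high or v < low]
```

An empty list from `flag_outliers` corresponds to the situation reported for the 17 stations, where no annual value falls outside the margins.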

4.3 Trend Analysis
The Mann-Kendall test with tie correction was applied to the annual rainfall time-series of the 17 stations to detect monotonic trend. The test results are given in Table 3. They confirmed that the time-series of 15 stations have no trend, whereas those of two stations (Puttalam and Mahagalkadawala) have a trend. The ZM statistics obtained for Puttalam and Mahagalkadawala are 2.88 and 2.37, respectively. Since these values are positive, the time-series of both stations have an upward trend. However, these values are not far above the 5% critical value of 1.96. At a stricter significance level of 0.5% (two-sided critical value of about 2.81), Mahagalkadawala would also be classified as having no trend, while Puttalam would remain marginally above the limit.
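The dependence of the trend decision on the chosen significance level can be checked numerically. The sketch below assumes a two-sided test on a standard normal ZM and uses only the Python standard library; the function names are hypothetical.

```python
from statistics import NormalDist


def two_sided_critical_value(alpha):
    """Critical |Z| for a two-sided test at significance level alpha (e.g. 0.05)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def has_trend(z_m, alpha=0.05):
    """True when |Z_M| exceeds the critical value, i.e. 'no trend' is rejected."""
    return abs(z_m) > two_sided_critical_value(alpha)


if __name__ == "__main__":
    for alpha in (0.05, 0.01, 0.005):
        crit = two_sided_critical_value(alpha)
        print(f"alpha = {alpha:.3f}: critical |Z| = {crit:.2f}")
```

Calling `has_trend` with a reported ZM and a chosen level shows directly whether the station would be classified as having a trend at that level.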

4.4 Absence of Persistence Analysis
The lag-one serial correlation coefficient was estimated to check the absence of persistence in the time-series data, and the test results for the 17 stations are presented in Table 1.

4.5 Homogeneity Check
Homogeneity of the 17 time-series was checked by estimating the discordancy measure and the heterogeneity measure. The heterogeneity statistics H1, H2 and H3 were estimated for the annual time-series of the 17 rainfall gauging stations using L-moments, and the results are presented in Table 5. Since the H1, H2 and H3 values are all less than 1, the region can be considered reasonably homogeneous.
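For the discordancy measure, a pure-Python sketch is given below. It assumes the form Di = (1/3)(ui - ū)ᵀ S⁻¹ (ui - ū), with S the sample covariance matrix of the site vectors [t, t3, t4] (divisor ns - 1). This form, the helper `solve_linear` and the function names are assumptions for illustration, not the authors' implementation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def discordancy(u):
    """Return D_i for each site; u is a list of [t, t3, t4] vectors."""
    ns, k = len(u), 3
    u_bar = [sum(site[c] for site in u) / ns for c in range(k)]
    dev = [[site[c] - u_bar[c] for c in range(k)] for site in u]
    # Sample covariance matrix of the site vectors (divisor ns - 1).
    cov = [[sum(d[r] * d[c] for d in dev) / (ns - 1) for c in range(k)]
           for r in range(k)]
    result = []
    for d in dev:
        y = solve_linear(cov, d)                       # y = S^-1 (u_i - u_bar)
        result.append(sum(d[r] * y[r] for r in range(k)) / 3.0)
    return result
```

Under this form the Di values of a region always sum to ns - 1, which gives a simple consistency check on any implementation.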

Conclusions
All statistical checks confirmed that the annual time-series data of the 17 stations are statistically homogeneous, with minor deviations, and can be used with confidence for hydrological studies of the Kala Oya basin. The data-screening method presented in this paper can be applied to other river basins of Sri Lanka to check the consistency and homogeneity of hydrometeorological data before they are used for modelling and designing water development schemes.