Key Issues of Data and Data Checking for Hydrological Analyses: Case Study of Rainfall Data in the Attanagalu Oya Basin of Sri Lanka

Inconsistencies and non-homogeneities in hydrological and meteorological time series can be identified with statistical tests that detect trends and change points. Inconsistency, which reflects systematic errors during recording, and non-homogeneity, which arises from either natural or man-made changes to the gauging environment, are both important for adequate time series analysis. A very detailed study also needs to combine statistical tests with physical or historical evidence and justifications from metadata. A case study was carried out for the rainfall data of the Attanagalu Oya basin in the Western Province of Sri Lanka, with a data set consisting of six stations having daily rainfall data for 30 years. According to the Pettitt test, significant changes around 1977 and 1985 were found at Karasnagala and Pasyala. However, the change in rainfall pattern was most significant at Pasyala, which was confirmed by the t-test. Knowledge of metadata was found to be very important for making the necessary corrections to shifts identified through Double Mass Analysis. This paper shows that statistical tests and rational judgements enable suitable corrections, even though most hydrological and meteorological data are commonly found to be either flagged for quality or poorly documented.


Introduction
Water resources development and management is heavily dependent on hydrological and meteorological data. To make sure that the results obtained from these data are reliable for practical applications, such data should be homogeneous and consistent, whether they are used to carry out frequency analyses or to simulate a hydrological system [1]. In hydrologic analysis it is customary to search for long data sets, since such data ensure that the sample taken represents the system performance. However, the longer the time series, the greater the chance that the data series is not stationary, consistent or homogeneous. It is also necessary to identify the spatial representation of the data used in an analysis. In the case of precipitation, the spatial distribution of rain gauges is often non-representative, since gauges are mostly located in valleys where easy access is the main criterion. It has also been identified that in many mountainous catchments the higher elevations receive more precipitation than the valley regions [2]. As such, prior to a responsible hydrological analysis, a suitable spatial and temporal analysis of the data needs to be carried out through an efficient screening procedure.
As many organizations with different objectives perform data collection, it is also necessary to check such observation data series for consistency and homogeneity. It is common to use statistical tests, either parametric or non-parametric, to detect non-homogeneity in a time series. The choice between the two families of tests is based on the expected distribution of the data involved. If the data set is normally distributed, parametric tests are usually selected. If the data set is expected to be non-normally distributed, non-parametric tests are preferred. It has also been identified that some homogeneity tests depend on metadata while others are purely statistical. A single significant test result is considered weak evidence of change. When several tests that are not very similar to one another give significant results, these need to be taken as stronger evidence of change [3]. However, it should be emphasized that application of more than one test to data may

Study Area & Data Availability
The Attanagalu Oya basin, which drains to the western coast of Sri Lanka (between 79° 50' and 80° 7' E and 6° 59' and 7° 17' N), has a catchment area of 727 km². The basin spreads over two provinces, namely the Western and Sabaragamuwa Provinces, and extends through the Gampaha and Kegalle administrative districts. The highest elevation in the basin is about 300 m MSL. Several large streams combine to drain the Attanagalu Oya, namely the Kimbulapitiya Oya, Mapalan Oya, Dee-eli Oya and Uruwal Oya (Figure 1).
There are 18 rainfall gauging stations located either within the basin boundary or just outside it. The rain gauging network maintained by the Department of Meteorology consists of 17 stations, of which 16 do not possess automatic recording facilities but maintain daily records. The remaining one, at Katunayaka in the vicinity of the catchment, is of the recording type. There is also a recording type rain gauge at Karasnagala, maintained by the Irrigation Department. Based on data availability and spatial coverage, daily data of six stations were selected for the study. This study considered data from 1970 to 2001. Station names and details of missing rainfall data during this period are shown in Table 1.

Methodology
The following tests were carried out in this study with the use of the SPELL-Stat software [4]. Outliers were identified with thresholds of the form $y_H = \bar{y} + K_n s_y$ and $y_L = \bar{y} - K_n s_y$, where $y_H$ and $y_L$ are the high and low outlier thresholds in log units, $\bar{y}$ is the mean, $n$ is the sample size, $s_y$ is the standard deviation and $K_n$ is the parameter given in Chow et al. (1988) [6] for sample sizes varying from 10 to 140.
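The outlier screening can be sketched as follows. This is a minimal illustration, not the SPELL-Stat implementation: it assumes the thresholds have the common form y_H = mean + K_n s_y and y_L = mean - K_n s_y applied to base-10 logarithms of the annual totals, and it expects the caller to look up K_n for the sample size in the published table. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def outlier_thresholds(values, k_n):
    """Return (low, high) outlier thresholds in the original units.

    values: positive annual rainfall totals.
    k_n:    tabulated factor for len(values); it must be looked up by the
            caller (assumption: no table is embedded here).
    """
    logs = [math.log10(v) for v in values]  # assumption: base-10 logs
    y_bar = mean(logs)
    s_y = stdev(logs)                       # sample standard deviation
    y_low = y_bar - k_n * s_y
    y_high = y_bar + k_n * s_y
    return 10 ** y_low, 10 ** y_high


def flag_outliers(values, k_n):
    """List the values lying outside the low/high thresholds."""
    low, high = outlier_thresholds(values, k_n)
    return [v for v in values if v < low or v > high]
```

Values returned by `flag_outliers` are candidates for checking against station records before any correction is applied.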
The serial correlation coefficient is used to verify the independence of a time series, which in turn helps to ensure that each data value has an equal probability of occurrence. If a time series is completely random, the population autocorrelation function is zero for all lags other than zero. If successive values are perfectly correlated with each other, its value is unity. Sample serial correlation coefficients of a random series deviate slightly from zero only because of sampling effects.
In case of hydrological analysis, it is usually sufficient to compute the first lag serial correlation coefficient, i.e. the correlation between adjacent observations in a time series [1]. A confidence level of 95% was used for calculations.
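A minimal sketch of the first-lag check is given below. The 95% limits are computed with Anderson's expression, (-1 ± 1.96 √(n-2)) / (n-1), which is an assumption here since the text does not state how the limits were obtained; the function names are illustrative only.

```python
import math


def lag1_serial_correlation(x):
    """Correlation between adjacent observations of the series x."""
    n = len(x)
    x_bar = sum(x) / n
    num = sum((x[i] - x_bar) * (x[i + 1] - x_bar) for i in range(n - 1))
    den = sum((v - x_bar) ** 2 for v in x)
    return num / den


def anderson_limits(n, z=1.96):
    """Lower and upper limits for r1 of a random series (z=1.96 for 95%)."""
    half_width = z * math.sqrt(n - 2)
    return (-1 - half_width) / (n - 1), (-1 + half_width) / (n - 1)


def is_independent(x, z=1.96):
    """True when r1 falls inside the confidence limits."""
    lower, upper = anderson_limits(len(x), z)
    return lower < lag1_serial_correlation(x) < upper
```

A steadily rising series such as 1, 2, ..., 10 gives r1 = 0.7, which lies outside the limits for n = 10 and would therefore be treated as serially correlated.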
Presence of serial correlation may also complicate the detection and evaluation of trends in hydrological time series. When a data set shows a drift towards higher (or lower) values over the period of record, the drift may be an indication of an underlying change or of long-term persistence. The data may depend on some processes which are serially correlated. Several approaches have been suggested for removing the serial correlation from a data set prior to

ENGINEER 3
applying the non-parametric tests. One of the most common approaches is pre-whitening of the time series. The pre-whitening approach involves calculating the serial correlation and removing it if the calculated serial correlation is found to be significant at the 5% level [3].
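Pre-whitening can be illustrated with the sketch below. It assumes the simplest lag-1 form, x'_t = x_t - r1 x_(t-1), and takes the significance limits for r1 as inputs supplied by the caller; both the form and the function names are assumptions made for illustration.

```python
def lag1(x):
    """First-lag serial correlation coefficient of the series x."""
    n = len(x)
    x_bar = sum(x) / n
    num = sum((x[i] - x_bar) * (x[i + 1] - x_bar) for i in range(n - 1))
    den = sum((v - x_bar) ** 2 for v in x)
    return num / den


def prewhiten_if_needed(x, lower, upper):
    """Return the series unchanged if r1 is inside (lower, upper).

    Otherwise remove the lag-1 component; the result is one value shorter
    because the first observation has no predecessor.
    """
    r1 = lag1(x)
    if lower < r1 < upper:
        return list(x)
    return [x[t] - r1 * x[t - 1] for t in range(1, len(x))]
```

The limits passed in would normally be the 5% significance limits for the series length in question.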
It is important to make sure that the magnitude of the data shows no increase or decrease that is correlated with the order in which the data were collected. It is also important that the selected testing periods are of sufficient length for the test to be reliable [1].
A study of rainfall trends in Sri Lanka [7], which applied both the Mann-Kendall rank statistic and the Spearman rank statistic, concluded that both tests have similar power in detecting a trend. In the present work, Spearman's rank correlation method is used to verify the absence of trend at a significance level of 5%.
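A sketch of a Spearman rank trend check is shown below. It assumes the usual form R_sp = 1 - 6 ΣD² / (n(n² - 1)), with D the difference between the rank of each value and its chronological order, and the statistic t = R_sp √((n - 2) / (1 - R_sp²)) with n - 2 degrees of freedom. The two-sided critical value is passed in by the caller from a t-table; the function names are hypothetical.

```python
import math


def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_trend(x):
    """Return (R_sp, t) for the series x taken in chronological order."""
    n = len(x)
    value_ranks = ranks(x)
    d2 = sum((value_ranks[i] - (i + 1)) ** 2 for i in range(n))
    r_sp = 1 - 6 * d2 / (n * (n * n - 1))
    if abs(r_sp) == 1:
        return r_sp, math.copysign(math.inf, r_sp)
    return r_sp, r_sp * math.sqrt((n - 2) / (1 - r_sp ** 2))


def has_trend(x, t_critical):
    """True when |t| reaches the tabulated critical value (n - 2 d.o.f.)."""
    _, t = spearman_trend(x)
    return abs(t) >= t_critical
```

For a 5% two-sided test, `t_critical` is the 97.5% point of Student's t distribution for n - 2 degrees of freedom.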

The Standard Normal Homogeneity Test (SNHT) and the Pettitt test were chosen to identify any sudden shifts in the mean of the data sets, thereby enabling the identification of change points. A critical probability level of 80% was chosen for acceptance of significant change points in the Pettitt test, whereas a critical confidence level of 90% was used in the SNHT [3].
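The Pettitt test can be sketched as below. The sketch assumes the standard formulation, in which U_t sums sgn(x_i - x_j) over all pairs with i before and j after the candidate change point, K is the largest |U_t|, and the significance is approximated by p ≈ 2 exp(-6K² / (n³ + n²)). The SNHT is not shown, and the function name is illustrative.

```python
import math


def pettitt(x):
    """Return (t, K, p): t values precede the most likely change point."""
    n = len(x)

    def sgn(v):
        return (v > 0) - (v < 0)

    u = []
    for t in range(1, n):  # split the series after its t-th value
        u.append(sum(sgn(x[i] - x[j]) for i in range(t) for j in range(t, n)))
    best = max(range(len(u)), key=lambda i: abs(u[i]))
    k = abs(u[best])
    p = min(1.0, 2 * math.exp(-6 * k * k / (n ** 3 + n ** 2)))
    return best + 1, k, p
```

Under this approximation, 1 - p can be read as the probability of a change point, so a series with 1 - p above 0.8 would pass an 80% acceptance level.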
Instability of the variance was tested to identify the existence of non-stationarity in the time series. The ratio of the variances of two split, non-overlapping subsets of the time series was selected as the test statistic. The acceptance region for the test statistic $F_t$ was taken as $F\{v_1, v_2, 2.5\%\} < F_t < F\{v_1, v_2, 97.5\%\}$, where $v_1 = n_1 - 1$ (the number of degrees of freedom for the numerator), $v_2 = n_2 - 1$ (the number of degrees of freedom for the denominator), and $n_1$ and $n_2$ are the numbers of data in each subset [1].
The t-test for stability of the mean was conducted after carrying out the F-test, using the same two non-overlapping subsets of the time series. The test statistic $t_t$ [1] was taken to be bounded as $t\{v, 2.5\%\} < t_t < t\{v, 97.5\%\}$, where $v = (n_1 - 1) + (n_2 - 1)$ is the number of degrees of freedom and $n_1$ and $n_2$ are the numbers of data in each subset.
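The split-sample F-test and t-test can be sketched together as below. The sketch assumes the pooled-variance form of the t statistic, which is consistent with v = (n_1 - 1) + (n_2 - 1) degrees of freedom, and leaves the tabulated 2.5% and 97.5% critical values to the caller. The function names and the split position are illustrative.

```python
import math
from statistics import mean, variance


def split_sample_statistics(x, split):
    """Return (F_t, t_t) for the subsets x[:split] and x[split:]."""
    first, second = x[:split], x[split:]
    n1, n2 = len(first), len(second)
    s1_sq, s2_sq = variance(first), variance(second)  # sample variances
    f_t = s1_sq / s2_sq
    pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    t_t = (mean(first) - mean(second)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return f_t, t_t


def within_limits(statistic, lower, upper):
    """True when the statistic lies inside the tabulated acceptance region."""
    return lower < statistic < upper
```

The variance is taken as stable when `within_limits(f_t, F_2.5, F_97.5)` holds, and the mean as stable when `within_limits(t_t, t_2.5, t_97.5)` holds, with the critical values read from tables for the relevant degrees of freedom.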
In order to assess the applicability of the parametric test procedure, the time series was tested for normality by computing the probability of exceedance based on the Blom equation [8].
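A minimal sketch of the plotting-position step is given below, assuming the Blom formula in its usual form P = (i - 3/8) / (n + 1/4), with i the rank of each value in descending order. The corresponding standard normal variate is obtained from the Python standard library; the function name is hypothetical.

```python
from statistics import NormalDist


def blom_positions(values):
    """Return (value, exceedance probability, standard normal variate).

    Values are ranked in descending order, so rank 1 is the largest value
    and has the smallest probability of exceedance.
    """
    n = len(values)
    ordered = sorted(values, reverse=True)
    rows = []
    for i, v in enumerate(ordered, start=1):
        p_exceed = (i - 0.375) / (n + 0.25)
        z = NormalDist().inv_cdf(1 - p_exceed)  # variate for non-exceedance
        rows.append((v, p_exceed, z))
    return rows
```

Plotting the values against the variates gives an approximately straight line when the data are normally distributed.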
Estimates of the data, $X_{est}$, obtained with standard variates were used to determine the variability of the quantiles with 95% confidence limits.
Homogeneity of the time series was inspected with the method of cumulative residuals. The estimated cumulative residuals and the ellipse that relates to the probability level were plotted against years to find whether the cumulative residuals fall within the ellipse [9].
Double mass analysis was performed using plots of the cumulative values of the station under investigation against the cumulative values of another station, or against the cumulative values of the average of other stations, over the same period of time. To assess the relative consistency of the time series, non-homogeneities were detected by identifying inflection points in the double mass plot. In the case of significant changes, the annual values of the earlier portion of the record were adjusted to be consistent with the latter portion [10].
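The double mass adjustment described above can be sketched as follows. The sketch assumes that the slope of each segment is taken as the ratio of station total to reference total over that segment, and that the earlier portion is scaled by the ratio of the later slope to the earlier slope. The break position is chosen by the analyst from the plot, and the function names are illustrative.

```python
from itertools import accumulate


def double_mass_points(station, reference):
    """Pairs of (cumulative reference, cumulative station) for plotting."""
    return list(zip(accumulate(reference), accumulate(station)))


def adjust_earlier_record(station, reference, break_index):
    """Scale values before break_index to match the slope after it."""
    slope_early = sum(station[:break_index]) / sum(reference[:break_index])
    slope_late = sum(station[break_index:]) / sum(reference[break_index:])
    factor = slope_late / slope_early
    adjusted = [v * factor for v in station[:break_index]]
    return adjusted + list(station[break_index:])
```

After adjustment, replotting `double_mass_points` should show a single slope across the former inflection point.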

Results and Discussion
The present work, conducted for rainfall data sets of six stations, indicated how results vary between different statistical tests which are commonly used for hydrological data testing. Annual rainfall data were plotted in order to find any abrupt changes. During the considered period of 30 years, no abrupt changes or dubious data were apparent at any of the six stations. Results of statistical tests pertaining to each station are shown in Table 2. Missing data were filled with the use of single and multiple regression analysis. The computed best-fit coefficients of determination (Table 3) were considered for data filling. Generation of missing data was carried out relative to a common data period (Table 3), in which the data were assumed to be homogeneous for the computations.
The coefficient of determination from the regression analyses is relatively good for the stations at Pasyala (0.91) and Vincit (0.88), whereas the other stations showed relatively low values (Table 3). It was assumed that the period considered for regression (i.e. 01.05.1981 to 31.03.1982) is homogeneous. Selecting a homogeneous period is entirely dependent on the available metadata. In this study, the homogeneous period considered for regression is less than one year. Therefore, it was felt reasonable to assume that minimal or no changes could have occurred at the station during the selected period. Based on these facts, the above assumption could be treated as realistic.
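Filling gaps by single regression can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the study: it fits an ordinary least-squares line between one neighbouring station and the target station over the days where both have data, reports the coefficient of determination, and clamps estimates at zero because rainfall cannot be negative. Missing values are represented by `None`, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope, intercept and coefficient of determination."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def fill_missing(target, neighbour):
    """Return (filled target series, R^2) using a single neighbour station."""
    pairs = [(nb, tg) for nb, tg in zip(neighbour, target) if tg is not None]
    slope, intercept, r_squared = fit_line(
        [p[0] for p in pairs], [p[1] for p in pairs]
    )
    filled = [
        tg if tg is not None else max(0.0, intercept + slope * nb)
        for nb, tg in zip(neighbour, target)
    ]
    return filled, r_squared
```

When several neighbours are available, the one giving the highest coefficient of determination over the common period would be the natural choice for filling.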

Minimum values of annual rainfall that were lower than the low outlier threshold were corrected to the low outlier threshold. The high outlier thresholds were higher than the maximum values of annual rainfall (Table 4). Tabular comparison of annual rainfall showed that the minimum values should be replaced with the low outlier threshold, except for the minimum value at Pasyala. Annual rainfall comparisons with the values shown in Figure 4 were used to identify any abrupt changes. From this data set, the annual rainfall at Pasyala, which has more issues than the other stations and has change points confirmed by the t-test, was selected to discuss the issues related to rainfall data. Statistical analysis and homogeneity test results for Pasyala annual rainfall are shown in Figures 2 (a-f).
In order to identify the possibility of data use, an alternative option was considered. Since there is no strong evidence that the mean state of rainfall in Sri Lanka has changed significantly over the past decades, it was assumed that climate change had not significantly affected the rainfall of the Attanagalu Oya basin. Accordingly, the application of the Double Mass curve for correction of change points was considered realistic, and it was used for Pasyala in order to correct the change present at 1985. A reduction of the trend could be observed after Double Mass correction. The change at Pasyala in 1977 was insignificant after carrying out the Double Mass correction for 1985 (Table 2, Figure 2).

The homogeneity test shows that the annual rainfall at Henerathgoda, Karasnagala and Vincit is homogeneous at the 85% non-exceedence probability level, whereas for Halgahapitiya, Katunayake and Pasyala it is homogeneous at the 90% non-exceedence probability level. Homogeneity test results for Pasyala before and after the Double Mass correction are shown in Figure 3. The Pasyala rainfall data set remains homogeneous at the 90% non-exceedence probability level after the Double Mass correction, since all residuals were found to be within the 90% probability ellipse (Figure 3). As the t-test results did not confirm the results of the Pettitt test, the rest of the stations were not subjected to correction.
The correlogram shows that the first-lag serial correlation for all data sets fell within the 95% confidence limits. Therefore, all these annual rainfall time series have a satisfactory level of randomness and independence. As a result, pre-whitening of the annual rainfall time series was not necessary prior to performing statistical tests. In order to decide between parametric and non-parametric testing, normality testing was carried out, and it was identified that all stations follow the normal distribution pattern except Katunayake, which exceeds the 95% confidence limits.
It can be observed that a decreasing rainfall pattern prevails in the Attanagalu Oya basin. The change of rainfall pattern around 1977 and 1985 is also common to some stations. The mass curve shows that the change in slope is not significant, which suggests that the data from each station are consistent. As such, the changes around these years are most likely due to man-made changes to the environment, and most probably due to changes of instruments.
Some of the homogeneity tests depend on metadata while the other tests are purely statistical. A single significant test result can be regarded as weak evidence of change. If several tests that are not similar to one another give significant results, this provides stronger evidence of change. Multiple significant results from similar tests do not constitute extra proof of change. However, application of more than one test to the data may make interpretation of results complex. Because of the differences in the assumptions of the tests and the possible influence of changes in the catchment condition, it is usually difficult to compare, and in particular to combine, the results of different tests.
Metadata plays a major role when making firm conclusions with regard to data checking. In Sri Lanka most gauging stations are maintained without proper documentation of metadata, and it is common knowledge that most of the hydrological and meteorological data are poorly documented and quality flagged. This is a big challenge faced by hydrologists when attempting to analyze rainfall data. Where no metadata are available, hydrologists need to consider regional and global changes to rainfall during that period. However, it is difficult to address micro-climatic changes without metadata. At many stations in Sri Lanka it is not difficult to obtain a 30-year-long rainfall data set. Such data sets are bound to contain missing data periods, non-homogeneity and other inconsistencies. Hydrologists need to identify the purpose of the data and perform checks to ensure the reliability of results produced with such data. The present study is an attempt to identify measures that can be taken when data checking is carried out in a Sri Lankan situation.