Original Article 
Corresponding author: Ralitsa Raycheva ( dirdriem@gmail.com ) © 2022 Ralitsa Raycheva, Ani Kevorkyan, Yordanka Stoilova.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Raycheva R, Kevorkyan A, Stoilova Y (2022) Stochastic modelling of scalar time series of varicella incidence for a period of 92 years (19282019). Folia Medica 64(4): 624632. https://doi.org/10.3897/folmed.64.e65957

Introduction: Varicella is an acute, highly contagious disease, characterised by generalised vesicular exanthema caused by the initial infection with varicella zoster virus (VZV) which usually affects children aged 2 to 8 years.
Aim: To analyse the changes of varicella incidence in Bulgaria over the period of 1928–2019.
Materials and methods: The time series analysis is based on the official data for varicella incidence (per 100,000) in Bulgaria for ninetytwo years (1928–2019), obtained from three major sources. We utilized the method to construct a time series model of overall incidence (1928–2019) using time series modeller in SPSS v. 25. We followed all three steps of the standard ARIMA methodology to establish the model – identification, parameter estimation, and diagnostic checking.
Results: Stochastic scalar time series modelling of the varicella incidence from 1928 to 2019 was performed. The stochastic ARIMA (0,1,1) was identified to be the most appropriate model. The decomposition of varicella incidence time series into a stochastic trend and a stationary component was reasoned based on the model defined. In addition, we assessed the importance of the longterm and immediate effect of one shock. The longterm forecast was also under discussion.
Conclusions: The ARIMA model (0,1,1) in our study is an adequate tool for presenting the varicella incidence trend and is suitable to forecast near future disease dynamics with acceptable error tolerance.
ARIMA models, dynamics, epidemiological forecasting, stationary series, trends
Varicella is an acute, highly contagious disease characterised by generalised vesicular exanthema caused by the initial infection with varicella zoster virus (VZV), which usually affects children aged 28 years.^{[1]} Although the infection is usually mild with selflimited evolution, all nonimmune individuals are at risk of infection and complications. Groups at higher risk for severe complications are pregnant women, preterm infants, and immunocompromised patients of all ages. In addition, WHO estimates that the annual global burden of varicella is approximately 140 million cases of which 4.2 million with severe complications and 4200 deaths.^{[2]} The almost universal natural infection, especially in the absence of vaccine prophylaxis^{[3]}, determines the need for accurate data collection for assessment of the actual size of the epidemic process. Robust epidemiological data is essential for setting the priorities while planning the immunization strategies on national level. A limited number of industrialised countries have introduced varicella vaccination into their childhood immunisation programs, which has led to substantial reductions in varicellarelated incidence and mortality.^{[2]}
Varicella is not on the list of reportable infectious diseases and subject of registration in the European Union, thus the overall burden of disease is not well known and a countrylevel estimate is challenging.^{[4]}
Bulgaria is among the countries with national mandatory surveillance system covering the total country population since 1940^{[5]}, but data on incidence dates back to 1928. The analysis of the key epidemiological indicators, including the prevalence of acute infectious diseases (excluding tuberculosis, AIDS and sexually transmitted infections) in Bulgaria, reveals that varicella usually takes the first place in their distribution. In the last eleven years (2008–2019), the proportion ranges from 39.18% (2008 – 22693/57916) with an incidence of 297.02 (per 100,000)^{[6]} to 47.73% (2017 – 25007/52393) with an incidence of 352.12 (per 100,000) ^{[7]}. The notable exception is 2010, when severe measles outbreak^{[8]} displaced varicella from the first place of acute contagious diseases distribution^{[9]}. In Bulgaria, according to the current Ordinance for Registration of Infectious Diseases^{[10]}, varicella cases are classified as possible, probable, and confirmed, predominantly possible.
Ninetytwo years of national surveillance of varicella incidence as one of the major intensive epidemiological indicators enables us to outline the trends and make an epidemiological forecast for varicella infection in this country. It should be noted that the stochastic modelling of different infectious disease parameters, although not often, is still available.^{[11–14]}
In this context, our study is a continuation of the present trends for new experimental data accumulation with help of the modern achievements of statistics in the study of nonstationary time series.
The time series analysis is based on official data for varicella incidence (per population of 100,000) in Bulgaria for ninetytwo years from 1928 to 2019. The information was collected from three major sources: The National Center for Public Health and Analysis (NCPHA), the National Center for Infectious and Parasitic Diseases (NCIPD) and data from the chronicles of Bulgarian epidemiology in the 20th century.^{[15]}
We used the method of time series analysis to construct a time series model of overall incidence (1928–2019) using a Time Series Modeller in SPSS v. 25. We established autoregressive integrated moving average (ARIMA) model for prediction.^{[16]} A pvalue<0.05 was considered statistically significant.
The ARIMA model was based on the yearly incidence varicella rates (per 100,000 population) from 1928 to 2019 in Bulgaria. We followed all three steps of the standard ARIMA methodology to establish the model: identification, parameter estimation, and diagnostic checking. We analysed the graphs for the autocorrelation function (ACF) and partial autocorrelation function (PACF) to identify the possible values of nonseasonal autoregressive order (p), nonseasonal differencing (d), and nonseasonal moving average order (q). Three test procedures were used to determine whether the time series after differencing was stationary or not. In the second step, the ARIMA model parameters was estimated. In the last step, we diagnosed the residual error sequence for whitenoise sequence using the LjungBox Q test. Finally, the SPSS Expert Modeller has identified and estimated the best fitting ARIMA.
We visually inspected Fig.
Fig.
Thus, there is enough evidence for a more indepth analysis of nonstationarity of the surveyed time series. Appropriate stochastic modelling, on the other hand, necessitates a study of the nature of eventual nonstationarity. The contemporary time series analysis elaborates four options: trend stationarity, a unit root (a difference stationary process), structural changes, and long memory. While in the trend stationarity, the original series becomes stationary after proper linear detrending, in the case of a unit root process, this happens after the differentiation of the series. Structural changes in an autoregressive model are of interest as the time series properties of the model, such as stationarity, may be different before and after the change.^{[18]} Structural changes mean stationarity around a linear trend, but with one or several points of structural changes in which the trend parameters undergo a jump. Long memory processes are positioned between stationarity and unit root. These are time series that become stationary after applying the fractional differentiation filter. When the differentiation indicator is a number between 0.5 and 1, the original series is nonstationary, because its dispersion increases over time.
First, we conducted DickeyFuller test to test the null hypothesis that a unit root is present in a time series of varicella incidence.^{[19]} The testing methodology follows Pfaff.^{[20]} The test runs in three steps in its entirety. The procedure starts with the regression of differentiated base series on a linear trend, on the original series at a delayed one unit, and on a final number of endogenous variables. Endogenous variables are actually the differentiated base series at a delay of 1, 2, etc. They are included to provide noncorrelation of the regression residues. In our case, two endogenous variables are sufficient to satisfy the requirement. We tested the null hypotheses that the time series of varicella incidence is stationary after differentiation. We concluded that the time series was stationary after differentiation without trend and drift in datagenerating process (tau criteria=14.44; p<0.1). However, the question remains: whether only one round of differencing is sufficient to obtain stationary time series. To investigate this issue, we repeated the already conducted test procedure, but not with the original, but with the differentiated time series. It turns out that the regression residuals are uncorrelated even with one endogenous variable. Even in the first step of the procedure, the null hypothesis for postdifferentiation stationarity is rejected at significance levels below 1%. According to the following methodology, this result is sufficient to conclude that the one round differentiated time series of varicella incidence is stationary.
Although the DickeyFuller test proves that the time series of varicella incidence is stationary after differentiation; however, as is well known, this test has a relatively small power.^{[20]} Other researchers^{[21]} suggest a different type of test procedure where the null hypothesis is stationarity against the alternative stationarity after differentiation. Such an approach is consistent with the conservative strategy that it is more natural the hypothesis of interest, in this case postdifferentiation, to be the alternative. The test model assumes that the baseline time series is a superposition of trend, random walk, and Gaussian noise. The null hypothesis states that the dispersion of shocks in random walk is zero and therefore the order is trendstationary. The alternative is nonzero dispersion or stationarity after differentiation. The options to consider are two: stationarity around a linear trend and stationarity to a constant average level. The asymptotic distributions of the relevant statistics are nontrivial and their percentiles are tabulated by the authors.^{[21]}
The test procedure requires fixed length of the Bartlett (triangular) window, used to calculate the one time step error dispersion. It turns out that the length of this window is short (= 3) delays. The results of the test itself are as follows: In the case of null hypothesis about stationarity around the constant, the value of the corresponding statistics is 2.21 and this hypothesis can be rejected at a significance level of 0.01%. Thus, the stationarity test we conducted rejected the null hypotheses of stationarity around the trend with a slope or around a constant.
It is possible that a time series will remain stationary around a trend with abrupt structural changes. We tested the time series of varicella incidence for this option by applying the socalled stationarity test after differentiation.^{[20,22]} The null hypothesis is stationarity after differentiation. The alternative is stationarity around a trend with one point of structural change. The options are three: the first – the jump is only in the trend constant; the second – the jump is only in the slope, and the third – the jump is in both the constant and the slope. The point of structural change itself is not predetermined, but assigned in the test procedure, so as to minimally favour the null hypothesis of random walk with drift. Test regressions include the already commented endogenous variables.
The results of this incidence time series test are as follows: in the case of the alternative for structural change only in the trend constant, two endogenous variables are sufficient for the noncorrelation of the residues. The year of the possible structural change is 1985. The value of the relevant statistics is 10.65 and the null hypothesis for postdifferentiation stationarity is rejected at a 1% significance level. The conclusion is that the hypothesis of stationarity after differentiation of the time series of varicella incidence is rejected against the alternatives for stationarity with different types of structural change in the linear trend in the first stage of the analysis.
In the end, we will focus on the fourth option of nonstationarity – time series to be stationary after the filtration with fractional differentiation filter. The test procedure^{[23]} is similar to the rejection of the null hypothesis in the stationarity after differentiation tests, e.g. DickeyFuller, and at the same time resembles the rejection of the null hypothesis in the stationary tests, for example the KwiatkowskiPhillipsSchmidtShin test. Herein, we confirmed this assumption. As it was proven, DickeyFuller’s test rejected the null hypothesis for stationarity after differentiation. Hence, the fractional differentiation process seems unlikely compared with the likelihood of stationarity after differentiation.
In conclusion, the applied test procedures confirm the conclusion that the nonstationarity of varicella incidence time series is stationarity after differentiation type. The differentiation of the time series leads to stationarity, which seems to be not only without a slope, but also without drift.
Oneround differentiation of varicella incidence time series was sufficient to ensure stationarity. This result suggests BoxJenkins modelling with the appropriate integrated autoregressive moving average (ARIMA).^{[24,25]} The procedure starts with a preliminary analysis of the differentiated time series, its autocorrelation function, and partial autocorrelation function. Fig.
The visual examination of the differentiated time series leads to the conclusion that it is stable in terms of variations in dispersion over time. Fig.
Time series (top), autocorrelation function (middle) and partial autocorrelation function (below) of the differentiated time series of varicella incidence.
At this stage of the analysis, a question of crucial importance is how to define the “best fit” among the already outlined set of models. To respond to the challenge, we took advantage of the opportunity offered by IBM’s Statistical Package for Social Sciences (SPSS). IBM SPSS Forecasting Analysis contains a Time Series Module in which Expert Modeller automatically identifies and estimates the best fitting ARIMA or exponential smoothing model for one or more dependent variable series. Although users can specify a custom ARIMA manually, Expert Modeller eliminates a great deal of the trial and error associated with doing so. As a result of the application of Expert Modeller, the ARIMA model (0,1,1) was favoured. The prediction equation for ARIMA (0,1,1) model can be written in a number of mathematically equivalent forms, one of which is the socalled “error correction” form, in which the previous forecast is adjusted in the direction of the error it made:
Ŷ_{t} = Ŷ_{t1} + αe_{t1},
because e_{t1} = Ŷ_{t1} − Ŷ_{t1} by definition, this can be rewritten as:
Ŷ_{t} = Y_{t1} − (1α)e_{t1} = Y_{t1} − θ_{1}e_{t1}
The estimation of the coefficients of the selected model using the maximum likelihood method results in:
Ŷ_{t} = Y_{t1} – 0.404e_{t1}
Stochastic ARIMA models suggest normal distribution and uncorrelated residues with a constant dispersion over time. The normal distribution was checked with the ShapiroWilk test. The value of Wstatistics in the ShapiroWilk test is 0.99 and the corresponding pvalue is 0.483. The hypothesis of normal distribution is not rejected. The final conclusion is that the residues are normally distributed. Fig.
Fig.
As a general rule, the longterm forecasts of stationarity after differentiation processes are prone to failure due to the indefinite accumulation of forecast errors over time. Here we will demonstrate a relatively extended forecast, but with a limited horizon outside the sample. In Fig.
Year  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  
Observed  528.97  315.28  343.88  453.94  353.12  339.90  434.19  
ARIMA (0,1,1)  Forecast  368.32  369.47  370.61  371.76  372.91  374.06  375.21  376.36  377.50  378.65 
95% LCL  250.02  230.58  214.47  200.59  188.36  177.39  167.43  158.31  149.91  142.11  
95% UCL  502.96  531.49  556.68  579.64  600.96  621.02  640.06  658.26  675.75  692.63 
The proper identification procedures we applied indicated that ARIMA (0,1,1) model fits best our original time series of the annual varicella incidence for the period from 1928 to 2019. In trendstationary models, characterised by given common properties, the time series returns to the trend, in particular, at stationarity – to the average.^{[27]} The longterm forecast befits the trend. Unanticipated disturbances, commonly called shocks, naturally fade over time. On the contrary, if the time series is stationary after differentiation, the theory requires distinctly opposite interpretation.^{[27]} The time series does not return permanently to the trend or average. The dispersion of the prediction error grows over time. The shocks have longlasting effect.
Our model is stationary after differentiation without drift. Hence, as long as the observed time series values do not tend to zero, the varicella incidence should not be steadily decreasing. The dispersion growth over time impacts the forecast accuracy. The effect of random shocks on incidence is longlasting, but is smaller than the immediate ψ(1)=0.09.
Varicella incidence for the time period 19282019 demonstrates upward trend with an average of 280.14 (per 100,000). In the subperiod 19281951, the incidence was lower (between 40.40 and 65.60 per 100,000) with an average of 36.20 (per 100,000), most likely related to partial registration of the infection and incidence during that period. In 1952, the indicator for the first time exceeded 100 per 100,000 population, and in 1955 it exceeded 200 per 100,000 population (a rate below which the incidence did not fall throughout the overall subperiod from 1955 to 2019). During the 19552019 subperiod, the lowest registered rate was 230.40 (per 100,000) in 1956 and 1961, and the highest rate recorded was 614.29 (per 100,000) in 1982. Based on this evidence, the logical assumption is that the incidence rate will continue to increase in future decades. In the absence of varicella vaccination as a regular component for epidemic process modelling, we observed typical epidemics cycles at short intervals of 25 years.^{[2]} However, using the ARIMA model (0,1,1), the established trend demonstrates that the varicella incidence registered in Bulgaria increases over a longer interval of time – about 10 years (Fig.
Most of the available articles and reports studying varicella infection in Bulgaria are descriptive with analyses of rates (incidence, mortality, and lethality) and proportions (age distribution of the patients, number of hospitalised patients, etc.) and clinical focus on the severity of illness and complications. To our knowledge, the presented ARIMA model is the first in this country to outline the dynamics of the epidemic process over a period of ninetytwo years. Moreover, the established trend provides robust epidemiological forecast to inform health authorities and support evidenced based decisionmaking for the inclusion of varicella vaccination into the immunisation schedule of the country. The ARIMA model (0,1,1) in our study is an adequate tool for presenting the varicella incidence trend and is suitable to forecast near future disease dynamics, with acceptable error tolerance.